Evaluating the Accuracy and Consistency of AI-Powered Marking for A-Level History Assessments

Abstract

This whitepaper presents an evaluation of the performance of our AI-powered marking platform when applied to A-level History assessments. It focuses on two core dimensions: accuracy (alignment with human markers) and consistency (stability of marks when the same responses are re-marked). Our findings show that the SmartEducator AI system achieves 86.6% alignment with human markers and a 92.3% consistency rate when re-marking the same student responses. Benchmarked against published figures on human marking variability, these results indicate that the platform outperforms human markers on both accuracy and consistency.

1. Introduction

Teacher workload is a pressing issue in education systems worldwide, with marking being one of the most time-consuming and cognitively demanding tasks for teachers, particularly in essay-based subjects like English and History. Advances in machine learning and natural language processing offer new opportunities to automate or assist in assessment marking, with the potential to improve speed, objectivity, and consistency. This whitepaper evaluates the performance of our AI-powered marking tool in a real-world A-level History exam setting. 

 

2. Methodology

2.1 Dataset
We tested the platform on a dataset of 12 extended-answer responses written by multiple students for an A-level History exam. Each response had previously been marked by human examiners. The dataset allowed for analysis across a range of student performance levels and question types (e.g., source-based, evaluative, thematic).
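For concreteness, the sketch below shows one plausible way such a dataset could be represented. The field names and example values are illustrative assumptions on our part, not the platform's actual schema.

```python
from dataclasses import dataclass

@dataclass
class MarkedResponse:
    """One student response paired with its human-awarded mark (illustrative schema)."""
    response_id: str    # hypothetical identifier
    question_type: str  # e.g. "source-based", "evaluative", "thematic"
    answer_text: str    # the student's extended answer
    human_mark: int     # mark previously awarded by a human examiner
    max_mark: int       # maximum mark available for the question

# An invented example record:
example = MarkedResponse("resp-001", "evaluative", "...", human_mark=18, max_mark=25)
```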

2.2 AI Marking Model
The AI marking engine orchestrates several large language models to award marks against the criteria defined in mark schemes and rubrics. It can therefore assess adherence to the mark scheme/rubric, depth of analysis, use of historical evidence, and written communication, in line with national curriculum standards.
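As a rough illustration of what "orchestrating several large language models" could look like (not the platform's actual implementation), the sketch below collects a mark from each model and aggregates by median; `call_model` and the model names are hypothetical placeholders.

```python
import statistics

def call_model(model_name: str, answer: str, mark_scheme: str) -> int:
    """Hypothetical wrapper that prompts one LLM with the mark scheme and
    returns an integer mark; the real prompting pipeline is not public."""
    raise NotImplementedError("placeholder for a real LLM API call")

def mark_answer(answer: str, mark_scheme: str,
                models: tuple = ("model-a", "model-b", "model-c")) -> int:
    """Collect a mark from each model and aggregate by median.

    Median aggregation is one plausible design choice: it damps single
    outlier judgments, which helps repeatability when re-marking.
    """
    marks = [call_model(m, answer, mark_scheme) for m in models]
    return round(statistics.median(marks))
```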

2.3 Evaluation Metrics 

  • Accuracy (Alignment Rate): The percentage of AI-assigned marks that agree with the mark awarded by human examiners. 
  • Consistency (Repeatability Rate): The percentage of times the AI assigns the same mark to the same answer upon re-marking. Both computations are sketched below. 
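Both metrics reduce to simple proportions. A minimal sketch of how they could be computed, assuming paired lists of marks (the function names are ours, not the platform's):

```python
def alignment_rate(ai_marks: list[int], human_marks: list[int]) -> float:
    """Accuracy: proportion of responses where the AI mark agrees with the human mark."""
    matches = sum(a == h for a, h in zip(ai_marks, human_marks))
    return matches / len(ai_marks)

def repeatability_rate(first_pass: list[int], second_pass: list[int]) -> float:
    """Consistency: proportion of responses given the same mark on re-marking."""
    same = sum(x == y for x, y in zip(first_pass, second_pass))
    return same / len(first_pass)

# Invented example: 2 of 3 AI marks agree with the human marks -> ~0.667
print(alignment_rate([18, 12, 20], [18, 13, 20]))
```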

 

3. Results

Metric                                 AI Performance
Accuracy (Alignment to Human Marks)    86.6%
Consistency (Repeatability)            92.3%

 

4. Benchmarking Against Human Markers

Studies providing equivalent statistics for human markers are uncommon; however, a 2018 research paper published by Ofqual, "Marking Consistency Metrics" [1], provides a good summary of the existing research.

Ofqual's paper shows that essay-based subjects are the most difficult to mark. For History, when a definitive mark for a given essay has been agreed by consensus, only 55% of markers will award that mark, implying that accuracy across markers is low. By contrast, our AI-powered system achieved 86.6% accuracy under similar conditions.

Inspection of Figure 7 in Ofqual's paper shows that the probability of a marker awarding the definitive grade (i.e., a mark falling within the grade boundary of the consensus-agreed grade) for History varies between 35% and 70% depending on the marker (excluding outliers), implying low consistency between markers. Our AI-powered system achieves 92.3% consistency when marking History essays.

 

5. Discussion

These findings indicate that our AI-powered platform offers: 

  • Superior accuracy: Alignment roughly 30 percentage points above the human benchmark (86.6% for the AI versus 55% of human markers awarding the definitive mark). 
  • Superior consistency: A repeatability rate of 92.3%, well above the 35% to 70% range reported for human markers in History. 

This level of consistency reduces the likelihood of students receiving different marks due to marker variability and contributes to a fairer assessment process. 

Moreover, AI marking can be delivered at scale and with near-instant turnaround, enabling more timely feedback to students and reducing teacher workload. 

We also recognise that AI marking is best used in partnership with teachers, serving as a decision-support tool rather than a complete replacement. 

 

6. Conclusion

The results of this evaluation suggest that AI-powered marking can play a significant role in alleviating teacher workload while maintaining, and even improving upon, current levels of marking fairness and consistency. This paper demonstrates our commitment to transparency, rigour, and continued collaboration with educators and regulators.

 

References

[1] Ofqual (2018). "Marking Consistency Metrics". https://assets.publishing.service.gov.uk/media/5bfbfd70e5274a0fb775cca3/Marking_consistency_metrics_-_an_update_-_FINAL64492.pdf

