Automated Essay Scoring (AES) systems are often accurate, but their outputs are difficult to explain in terms that teachers and students can understand. This paper presents a two-stage AES pipeline that combines strong score prediction with short natural-language justifications. First, a DeBERTaV3 encoder is fine-tuned for holistic score prediction using a regression-style scorer. The continuous outputs are then converted into the final discrete score bands using a fixed set of validation-derived cut points, keeping the inference stage deterministic. Second, to make outputs interpretable, we use a UnifiedQA (T5-based) model to generate concise justifications as answers to structured, rubric-like questions (e.g., strengths, needed improvements, prompt relevance), conditioned on the essay and prompt. This design keeps the scoring model unchanged while adding an explanation layer that supports qualitative inspection and reporting. Experiments on the ASAP 2.0 AES benchmark, evaluated with Quadratic Weighted Kappa (QWK), show strong agreement with human scores: the system achieves 0.8329 QWK on validation and 0.8151 QWK on the held-out test set, along with low mean absolute error.
This paper presented a two-module AES pipeline that outputs (i) a discrete holistic score on a 1–6 ordinal scale and (ii) a short natural-language justification. The scoring module fine-tunes DeBERTaV3-small as a prompt-conditioned regressor by encoding the prompt and essay as a single sequence. Discrete scores are produced by applying fixed score boundaries learned on validation data and then frozen for evaluation. On the held-out test set (n = 2475), the system achieved QWK = 0.8151, MAE = 0.3786, and Accuracy = 0.6372, and it improved over the baseline transformer scoring results reported in the earlier draft. Error analysis via the confusion matrix showed that most disagreements occur between adjacent score levels, indicating that ordinal structure is largely preserved while ambiguity concentrates near neighboring score boundaries.
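The thresholding and agreement computation summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the cut-point values, the helper names `to_band` and `quadratic_weighted_kappa`, and the toy scores are hypothetical, whereas in the pipeline the boundaries are learned on validation data and then frozen.

```python
from bisect import bisect_right

# Hypothetical cut points for a 1-6 scale (assumption: the real values
# are derived from validation data and are not reported here).
CUT_POINTS = [1.5, 2.5, 3.5, 4.5, 5.5]


def to_band(value, cut_points=CUT_POINTS, lowest=1):
    """Map a continuous regressor output to a discrete score band.

    The band is the lowest score plus the number of cut points that are
    less than or equal to the value, so the mapping is deterministic.
    """
    return lowest + bisect_right(cut_points, value)


def quadratic_weighted_kappa(y_true, y_pred, min_score=1, max_score=6):
    """Compute QWK between two lists of integer scores."""
    n = max_score - min_score + 1
    # Observed confusion matrix: rows are human scores, columns are predictions.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_score][p - min_score] += 1
    total = len(y_true)
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2  # quadratic disagreement penalty
            expected = hist_true[i] * hist_pred[j] / total  # chance agreement
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0.0:
        # Degenerate case: all scores fall in one band (treated here as full agreement).
        return 1.0
    return 1.0 - numerator / denominator


# Toy usage: continuous outputs are banded, then compared with human scores.
raw_outputs = [0.2, 3.4, 3.5, 7.0]
bands = [to_band(v) for v in raw_outputs]  # [1, 3, 4, 6]
human = [1, 3, 3, 6]
kappa = quadratic_weighted_kappa(human, bands)
```

Because the weight grows with the squared distance between bands, adjacent-level disagreements of the kind seen in the confusion matrix are penalized far less than distant ones.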
To support interpretability without altering the scoring path, a separate UnifiedQA-based justification module generated brief prompt-aligned statements (strengths, improvements, and relevance). This separation keeps the numeric scorer stable while providing an inspection and feedback surface that can be reviewed by instructors or used to accompany predictions in user-facing settings.
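One way the rubric-like questions could be turned into inputs for a UnifiedQA-style model is sketched below. The question wording, the `build_justification_inputs` helper, and the way prompt and essay are joined into one context are assumptions made for illustration; the sketch follows the lowercased `question \n context` convention (a literal backslash-n separator) used in the public UnifiedQA examples and only constructs the input strings, leaving model loading and generation out.

```python
# Hypothetical rubric questions (assumption: the exact wording used in
# the justification module is not given in the text).
RUBRIC_QUESTIONS = {
    "strengths": "what is the main strength of this essay?",
    "improvements": "what should the writer improve in this essay?",
    "relevance": "how well does the essay address the prompt?",
}


def build_justification_inputs(prompt, essay, questions=RUBRIC_QUESTIONS):
    """Build one model input string per rubric question.

    Each input has the form 'question \\n context', lowercased, where the
    separator is the two characters backslash and n, and the context
    holds both the writing prompt and the essay.
    """
    context = f"prompt: {prompt} essay: {essay}"
    # Collapse whitespace so real line breaks do not collide with the separator.
    context = " ".join(context.split())
    return {
        name: f"{question} \\n {context}".lower()
        for name, question in questions.items()
    }


# Toy usage with placeholder text.
inputs = build_justification_inputs(
    "Discuss the benefits of recycling.",
    "Recycling helps cities\nreduce waste.",
)
```

Since these strings are built independently of the scorer, the numeric prediction is unaffected by any change to the question set.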
Several directions follow from current AES trends and the observed error structure. First, boundary-sensitive errors motivate learning strategies that explicitly optimize ordinal consistency and robustness near class transitions, including score-aware training objectives and boundary-focused data augmentation. Second, extending the pipeline to cross-prompt and low-resource settings requires prompt-invariant modeling and stronger generalization controls, as emphasized by recent work on scoring-invariance and cross-prompt trait scoring [28]. Third, fairness and stability analyses should be incorporated as standard reporting: prior studies show that AES systems can exhibit subgroup sensitivity and variability that must be audited under realistic educational constraints [19], [20]. Fourth, hybrid systems that combine a discriminative scorer with trait-based or rubric-driven LLM scoring remain a promising direction for improving transparency and aligning model outputs with human grading constructs [18]. Finally, justification quality should be evaluated with human-centered protocols that measure actionability, consistency, and equity of feedback across learner groups and genres, consistent with recent empirical investigations of LLM-based writing feedback [29].
References
[1] B. Beigman Klebanov and N. Madnani, eds., Automated Essay Scoring: A Cross-Disciplinary Perspective. Cham, Switzerland: Springer, 2022.
[2] H. Misgna, B.-W. On, I. Lee, and G. S. Choi, “A survey on deep learning-based automated essay scoring and feedback generation,” Artificial Intelligence Review, vol. 58, art. no. 36, 2025, doi: 10.1007/s10462-024-11017-5.
[3] P. He, X. Liu, J. Gao, and W. Chen, “DeBERTa: Decoding-Enhanced BERT with Disentangled Attention,” arXiv preprint arXiv:2006.03654, 2020.
[4] P. He, X. Liu, J. Gao, and W. Chen, “DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing,” arXiv preprint arXiv:2111.09543, 2021.
[5] R. K. R. Chavva, S. R. Muthyam, M. S. Seelam, and N. Nalliboina, “A Transformer-Based Approach for Enhancing Automated Essay Scoring,” in 2024 1st International Conference on Advanced Computing and Emerging Technologies (ACET), Aug. 2024, pp. 1–6, doi: 10.1109/ACET61898.2024.10730000.
[6] C. R. K. Reddy, A. K. Tulasi, M. Maturi, and A. Nagam, “Context-Aware Automated Essay Scoring with MLM-Pretrained T5 Transformer,” in 2025 6th International Conference on Inventive Research in Computing Applications (ICIRCA), Jun. 2025, pp. 1439–1443, doi: 10.1109/ICIRCA65293.2025.11089875.
[7] A. Pack, A. Barrett, and J. Escalante, “Large language models and automated essay scoring of English language learner writing: Insights into validity and reliability,” Computers and Education: Artificial Intelligence, vol. 6, art. no. 100234, 2024, doi: 10.1016/j.caeai.2024.100234.
[8] N. M. Bui and J. S. Barrot, “ChatGPT as an automated essay scoring tool in the writing classrooms: how it compares with human scoring,” Education and Information Technologies, vol. 30, pp. 2041–2058, 2025, doi: 10.1007/s10639-024-12891-w.
[9] D. Khashabi, S. Min, T. Khot, A. Sabharwal, O. Tafjord, P. Clark, and H. Hajishirzi, “UnifiedQA: Crossing Format Boundaries with a Single QA System,” in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020.
[10] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.
[11] J. C. Li and H. T. Ng, “Automated Essay Scoring: Recent Successes and Future Directions,” in Proceedings of the 33rd International Joint Conference on Artificial Intelligence (IJCAI), 2024.
[12] K. Taghipour and H. T. Ng, “A Neural Approach to Automated Essay Scoring,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
[13] N. Ait Khayi and V. Rus, “Automated Essay Scoring Using Discourse External Knowledge,” in Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI), 2024, pp. 7154–7160, doi: 10.24963/ijcai.2024/791.
[14] H. Do, Y. Kim, and G. Lee, “Autoregressive Score Generation for Multi-trait Essay Scoring,” in Findings of the Association for Computational Linguistics: EACL 2024, 2024, pp. 1659–1666.
[15] S. Li and V. Ng, “Conundrums in Cross-Prompt Automated Essay Scoring: Making Sense of the State of the Art,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024, pp. 7661–7681, doi: 10.18653/v1/2024.acl-long.414.
[16] S. Li and V. Ng, “ICLE++: Modeling Fine-Grained Traits for Holistic Essay Scoring,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2024), 2024, pp. 8465–8486, doi: 10.18653/v1/2024.naacl-long.468.
[17] Y. Wang, R. Hu, and Z. Zhao, “Beyond Agreement: Diagnosing the Rationale Alignment of Automated Essay Scoring Methods based on Linguistically-informed Counterfactuals,” in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 8906–8925, doi: 10.18653/v1/2024.findings-emnlp.520.
[18] S. Lee, Y. Cai, D. Meng, Z. Wang, and Y. Wu, “Unleashing Large Language Models’ Proficiency in Zero-shot Essay Scoring,” in Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 2024, pp. 181–198, doi: 10.18653/v1/2024.findings-emnlp.10.
[19] N.-J. Schaller, Y. Ding, A. Horbach, J. Meyer, and T. Jansen, “Fairness in Automated Essay Scoring: A Comparative Analysis of Algorithms on German Learner Essays from Secondary Education,” in Proc. 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), Mexico City, Mexico, 2024, pp. 210–221. [Online]. Available: https://aclanthology.org/2024.bea-1.18/
[20] F. García-Varela, M. Nussbaum, M. Mendoza, C. Martínez-Troncoso, and Z. Bekerman, “ChatGPT as a Stable and Fair Tool for Automated Essay Scoring,” Education Sciences, vol. 15, no. 8, Art. no. 946, 2025, doi: 10.3390/educsci15080946.
[21] S. A. Crossley, P. Baffour, L. Burleigh, and J. King, “A large-scale corpus for assessing source-based writing quality: ASAP 2.0,” Assessing Writing, vol. 65, Art. no. 100954, Jul. 2025, doi: 10.1016/j.asw.2025.100954.
[22] The Learning Agency Lab, “ASAP 2.0 Dataset,” 2024. [Online]. Available: https://the-learning-agency-lab.com/learning-exchange/asap-2-0-dataset/ (accessed Feb. 06, 2026).
[23] LEAR Lab, “Datasets: The Learning Agency Lab – Automated Essay Scoring 2.0,” [Online]. Available: https://learlab.org/data/ (accessed Feb. 06, 2026).
[24] A. Doewes, N. Kurdhi, and A. Saxena, “Evaluating Quadratic Weighted Kappa as the Standard Performance Metric for Automated Essay Scoring,” in Proc. 16th Int. Conf. Educational Data Mining (EDM), 2023, pp. 103–113. [Online]. Available: https://educationaldatamining.org/EDM2023/proceedings/2023.EDM-long-papers.9/2023.EDM-long-papers.9.pdf
[25] J. Cohen, “Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit,” Psychological Bulletin, vol. 70, no. 4, pp. 213–220, Oct. 1968, doi: 10.1037/h0026256.
[26] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proc. NAACL-HLT, 2019, pp. 4171–4186, doi: 10.18653/v1/N19-1423.
[27] Y. Liu et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” arXiv:1907.11692, 2019.
[28] J. Wang and S. Yu, “Improving Prompt Generalization for Cross-prompt Essay Trait Scoring from the Scoring-invariance Perspective,” in Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, 2025, pp. 2633–2646, doi: 10.18653/v1/2025.findings-emnlp.142.
[29] M. Jovic, S. Papakonstantinidis, and R. Kirkpatrick, “From red ink to algorithms: investigating the use of large language models in academic writing feedback,” Language Testing in Asia, vol. 15, Art. no. 59, 2025, doi: 10.1186/s40468-025-00389-2.