DeBERTaV3-Based Automated Essay Scoring with Unified QA-Generated Natural Language Justifications | IJCSE Volume 10 – Issue 3 | IJCSE-V10I3P9

International Journal of Computer Science Engineering Techniques

ISSN: 2455-135X
Volume 10, Issue 3

Abstract

Automated Essay Scoring (AES) systems are often accurate but difficult to explain in a way that teachers and students can understand. This paper presents a two-stage AES pipeline that combines strong score prediction with short natural-language justifications. First, a DeBERTaV3 encoder is fine-tuned for holistic score prediction using a regression-style scorer. The continuous outputs are then converted into the final discrete score bands using a fixed set of validation-derived cut points, which keeps the inference stage deterministic. Second, to make outputs interpretable, we use a UnifiedQA (T5-based) model to generate concise justifications as answers to structured, rubric-like questions (e.g., strengths, needed improvements, prompt relevance), conditioned on the essay and prompt. This design keeps the scoring model unchanged while adding an explanation layer that supports qualitative inspection and reporting. Experiments on the ASAP 2.0 AES benchmark, evaluated with Quadratic Weighted Kappa (QWK), show strong agreement with human scores: 0.8329 QWK on validation and 0.8151 QWK on the held-out test set, with a test mean absolute error of 0.3786.
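As an illustration of the banding and evaluation steps described in the abstract, the following minimal sketch maps continuous regressor outputs to discrete bands with a frozen list of cut points and computes QWK in plain Python. The cut-point values, the function names, and the toy scores are assumptions made for this example; the validation-derived boundaries used in the paper are not given in this text.

```python
from bisect import bisect_right

# Hypothetical cut points separating the six bands of a 1-6 scale.
# The paper derives its boundaries on validation data; these values
# are placeholders chosen only for illustration.
CUT_POINTS = [1.5, 2.5, 3.5, 4.5, 5.5]


def to_band(raw_score, cut_points=CUT_POINTS, lowest_band=1):
    """Map a continuous score to a discrete band using fixed cut points.

    The band is the lowest band plus the number of cut points that are
    less than or equal to the raw score, so the mapping is deterministic.
    """
    return lowest_band + bisect_right(cut_points, raw_score)


def quadratic_weighted_kappa(y_true, y_pred, min_rating=1, max_rating=6):
    """Quadratic Weighted Kappa between two lists of integer ratings."""
    n = max_rating - min_rating + 1
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_rating][p - min_rating] += 1

    total = len(y_true)
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / total
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        # Both raters used one identical rating for every essay.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    raw_outputs = [0.9, 2.7, 3.49, 3.5, 5.8]   # toy regressor outputs
    human_scores = [1, 3, 3, 4, 6]             # toy human labels
    bands = [to_band(x) for x in raw_outputs]
    print(bands, quadratic_weighted_kappa(human_scores, bands))
```

Because the cut points are frozen before test-time use, the same raw output always yields the same band, which is the determinism property the abstract refers to.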

Keywords

Automated Essay Scoring (AES), DeBERTaV3, UnifiedQA, natural language justification, interpretable scoring, Quadratic Weighted Kappa (QWK)

Conclusion

This paper presented a two-module AES pipeline that outputs (i) a discrete holistic score on a 1–6 ordinal scale and (ii) a short natural-language justification. The scoring module fine-tunes DeBERTaV3-small as a prompt-conditioned regressor by encoding the prompt and essay as a single sequence. Discrete scores are produced by applying fixed score boundaries learned on validation data and then frozen for evaluation. On the held-out test set (n = 2475), the system achieved QWK = 0.8151, MAE = 0.3786, and Accuracy = 0.6372, and it improved over the baseline transformer scoring results reported in the earlier draft. Error analysis via the confusion matrix showed that most disagreements occur between adjacent score levels, indicating that ordinal structure is largely preserved while ambiguity concentrates near neighboring score boundaries. To support interpretability without altering the scoring path, a separate UnifiedQA-based justification module generated brief prompt-aligned statements (strengths, improvements, and relevance). This separation keeps the numeric scorer stable while providing an inspection and feedback surface that can be reviewed by instructors or used to accompany predictions in user-facing settings.

Several directions follow from current AES trends and the observed error structure. First, boundary-sensitive errors motivate learning strategies that explicitly optimize ordinal consistency and robustness near class transitions, including score-aware training objectives and boundary-focused data augmentation. Second, extending the pipeline to cross-prompt and low-resource settings requires prompt-invariant modeling and stronger generalization controls, as emphasized by recent work on scoring-invariance and cross-prompt trait scoring [28]. Third, fairness and stability analyses should be incorporated as standard reporting: prior studies show that AES systems can exhibit subgroup sensitivity and variability that must be audited under realistic educational constraints [19], [20]. Fourth, hybrid systems that combine a discriminative scorer with trait-based or rubric-driven LLM scoring remain a promising direction for improving transparency and aligning model outputs with human grading constructs [18]. Finally, justification quality should be evaluated with human-centered protocols that measure actionability, consistency, and equity of feedback across learner groups and genres, consistent with recent empirical investigations of LLM-based writing feedback [29].
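The confusion-matrix error analysis mentioned in the conclusion can be sketched as follows. This is an illustrative example rather than the authors' evaluation code: the function names and the toy labels are assumptions, and the printed numbers describe only the toy data, not the reported test results.

```python
def confusion_matrix(y_true, y_pred, min_rating=1, max_rating=6):
    """Rows index the human score, columns index the predicted score."""
    n = max_rating - min_rating + 1
    matrix = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        matrix[t - min_rating][p - min_rating] += 1
    return matrix


def summarize_errors(matrix):
    """Compute accuracy, MAE, and the share of errors that are off by one band."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    abs_error = sum(
        abs(i - j) * matrix[i][j] for i in range(n) for j in range(n)
    )
    adjacent = sum(
        matrix[i][j] for i in range(n) for j in range(n) if abs(i - j) == 1
    )
    errors = total - correct
    return {
        "accuracy": correct / total,
        "mae": abs_error / total,
        # Fraction of all misclassifications that land in a neighboring band.
        "adjacent_share_of_errors": adjacent / errors if errors else 0.0,
    }


if __name__ == "__main__":
    # Toy labels chosen for illustration only.
    human = [1, 2, 3, 4, 5, 6, 3, 4]
    predicted = [1, 2, 4, 4, 5, 5, 3, 2]
    print(summarize_errors(confusion_matrix(human, predicted)))
```

A high adjacent share of errors is the pattern the conclusion describes: the ordering of essays is mostly preserved, and disagreement concentrates at the boundaries between neighboring bands.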

References

[1] B. Beigman Klebanov and N. Madnani, eds., Automated Essay Scoring: A Cross-Disciplinary Perspective. Cham, Switzerland: Springer, 2022.
[2] H. Misgna, B.-W. On, I. Lee, and G. S. Choi, “A survey on deep learning-based automated essay scoring and feedback generation,” Artificial Intelligence Review, vol. 58, art. no. 36, 2025, doi: 10.1007/s10462-024-11017-5.
[3] P. He, X. Liu, J. Gao, and W. Chen, “DeBERTa: Decoding-Enhanced BERT with Disentangled Attention,” arXiv preprint arXiv:2006.03654, 2020.
[4] P. He, X. Liu, J. Gao, and W. Chen, “DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing,” arXiv preprint arXiv:2111.09543, 2021.
[5] R. K. R. Chavva, S. R. Muthyam, M. S. Seelam, and N. Nalliboina, “A Transformer-Based Approach for Enhancing Automated Essay Scoring,” in 2024 1st International Conference on Advanced Computing and Emerging Technologies (ACET), Aug. 2024, pp. 1–6, doi: 10.1109/ACET61898.2024.10730000.
[6] C. R. K. Reddy, A. K. Tulasi, M. Maturi, and A. Nagam, “Context-Aware Automated Essay Scoring with MLM-Pretrained T5 Transformer,” in 2025 6th International Conference on Inventive Research in Computing Applications (ICIRCA), Jun. 2025, pp. 1439–1443, doi: 10.1109/ICIRCA65293.2025.11089875.
[7] A. Pack, A. Barrett, and J. Escalante, “Large language models and automated essay scoring of English language learner writing: Insights into validity and reliability,” Computers and Education: Artificial Intelligence, vol. 6, art. no. 100234, 2024, doi: 10.1016/j.caeai.2024.100234.
[8] N. M. Bui and J. S. Barrot, “ChatGPT as an automated essay scoring tool in the writing classrooms: how it compares with human scoring,” Education and Information Technologies, vol. 30, pp. 2041–2058, 2025, doi: 10.1007/s10639-024-12891-w.
[9] D. Khashabi, S. Min, T. Khot, A. Sabharwal, O. Tafjord, P. Clark, and H. Hajishirzi, “UnifiedQA: Crossing Format Boundaries with a Single QA System,” in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020.
[10] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.
[11] J. C. Li and H. T. Ng, “Automated Essay Scoring: Recent Successes and Future Directions,” in Proceedings of the 33rd International Joint Conference on Artificial Intelligence (IJCAI), 2024.
[12] K. Taghipour and H. T. Ng, “A Neural Approach to Automated Essay Scoring,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
[13] N. Ait Khayi and V. Rus, “Automated Essay Scoring Using Discourse External Knowledge,” in Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI), 2024, pp. 7154–7160, doi: 10.24963/ijcai.2024/791.
[14] H. Do, Y. Kim, and G. Lee, “Autoregressive Score Generation for Multi-trait Essay Scoring,” in Findings of the Association for Computational Linguistics: EACL 2024, 2024, pp. 1659–1666.
[15] S. Li and V. Ng, “Conundrums in Cross-Prompt Automated Essay Scoring: Making Sense of the State of the Art,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024, pp. 7661–7681, doi: 10.18653/v1/2024.acl-long.414.
[16] S. Li and V. Ng, “ICLE++: Modeling Fine-Grained Traits for Holistic Essay Scoring,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2024), 2024, pp. 8465–8486, doi: 10.18653/v1/2024.naacl-long.468.
[17] Y. Wang, R. Hu, and Z. Zhao, “Beyond Agreement: Diagnosing the Rationale Alignment of Automated Essay Scoring Methods based on Linguistically-informed Counterfactuals,” in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 8906–8925, doi: 10.18653/v1/2024.findings-emnlp.520.
[18] S. Lee, Y. Cai, D. Meng, Z. Wang, and Y. Wu, “Unleashing Large Language Models’ Proficiency in Zero-shot Essay Scoring,” in Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 2024, pp. 181–198, doi: 10.18653/v1/2024.findings-emnlp.10.
[19] N.-J. Schaller, Y. Ding, A. Horbach, J. Meyer, and T. Jansen, “Fairness in Automated Essay Scoring: A Comparative Analysis of Algorithms on German Learner Essays from Secondary Education,” in Proc. 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), Mexico City, Mexico, 2024, pp. 210–221. [Online]. Available: https://aclanthology.org/2024.bea-1.18/
[20] F. García-Varela, M. Nussbaum, M. Mendoza, C. Martínez-Troncoso, and Z. Bekerman, “ChatGPT as a Stable and Fair Tool for Automated Essay Scoring,” Education Sciences, vol. 15, no. 8, Art. no. 946, 2025, doi: 10.3390/educsci15080946.
[21] S. A. Crossley, P. Baffour, L. Burleigh, and J. King, “A large-scale corpus for assessing source-based writing quality: ASAP 2.0,” Assessing Writing, vol. 65, Art. no. 100954, Jul. 2025, doi: 10.1016/j.asw.2025.100954.
[22] The Learning Agency Lab, “ASAP 2.0 Dataset,” 2024. [Online]. Available: https://the-learning-agency-lab.com/learning-exchange/asap-2-0-dataset/ (accessed Feb. 06, 2026).
[23] LEAR Lab, “Datasets: The Learning Agency Lab – Automated Essay Scoring 2.0,” [Online]. Available: https://learlab.org/data/ (accessed Feb. 06, 2026).
[24] A. Doewes, N. Kurdhi, and A. Saxena, “Evaluating Quadratic Weighted Kappa as the Standard Performance Metric for Automated Essay Scoring,” in Proc. 16th Int. Conf. Educational Data Mining (EDM), 2023, pp. 103–113. [Online]. Available: https://educationaldatamining.org/EDM2023/proceedings/2023.EDM-long-papers.9/2023.EDM-long-papers.9.pdf
[25] J. Cohen, “Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit,” Psychological Bulletin, vol. 70, no. 4, pp. 213–220, Oct. 1968, doi: 10.1037/h0026256.
[26] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proc. NAACL-HLT, 2019, pp. 4171–4186, doi: 10.18653/v1/N19-1423.
[27] Y. Liu et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” arXiv:1907.11692, 2019.
[28] J. Wang and S. Yu, “Improving Prompt Generalization for Cross-prompt Essay Trait Scoring from the Scoring-invariance Perspective,” in Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, 2025, pp. 2633–2646, doi: 10.18653/v1/2025.findings-emnlp.142.
[29] M. Jovic, S. Papakonstantinidis, and R. Kirkpatrick, “From red ink to algorithms: investigating the use of large language models in academic writing feedback,” Language Testing in Asia, vol. 15, Art. no. 59, 2025, doi: 10.1186/s40468-025-00389-2.
© 2025 International Journal of Computer Science Engineering Techniques (IJCSE).