Automated Code Assessment: From Dataset Construction To Distributed LLM-Powered Grading
Mina Atef Yousef Wahba
Automated Code Assessment: From Dataset Construction To Distributed LLM-Powered Grading / Mina Atef Yousef Wahba - 2025 - 93 p. ill. 21 cm.
Supervisor:
Ghada Khoriba
Tamer Arafa
Thesis (M.A.)—Nile University, Egypt, 2025 .
"Includes bibliographical references"
Contents:
LIST OF FIGURES
LIST OF TABLES
1 Introduction
1.1 Background and Motivation
1.1.1 Evolution of Educational Technology
1.1.2 Limitations of Traditional Grading in Programming
1.1.3 Emergence and Potential of LLMs and Big Data
1.2 Research Contributions and Significance
1.2.1 Research Contributions
1.2.2 Significance of the Study
1.3 Thesis Organization
2 Literature Review
2.1 Background and Taxonomy of LLMs
2.1.1 General Purpose LLMs
2.1.2 Domain Specific LLMs
2.1.3 MoE Models
2.2 LLMs in Education Applications
2.3 LLM in Grading and Feedback Systems
2.4 Infrastructure for Scalable LLM Deployment
2.5 Identified Research Gap
3 Methodology
3.1 Dataset Collection and Pre-processing
3.1.1 Real-Time Data Processing
3.1.2 Data Augmentation and Annotation
3.1.3 Annotation of Real Campus Data
3.1.4 Synthetic Data Generation from Programming Books
3.2 BeGrading: Fine-Tuned Model for Automated Programming Grading
4 Results & Discussion
4.1 Experimental Settings
4.1.1 Hardware Environment
4.1.2 Dataset Construction
4.1.3 Models
4.1.4 Performance Evaluation
4.2 Dataset Analysis and Characteristics
4.2.1 Real-Time Processing Performance
4.2.2 Dataset Augmentation and Annotation
4.3 BeGrading: Fine-tuned Model Results
4.3.1 Effect of Dataset Grade Correction on High-Variance Records
4.4 Ablation Study
4.4.1 Overall Performance
4.4.2 Sensitivity to Grading Guidance
4.4.3 Effectiveness of Fine-Tuning
4.5 Chapter Summary
5 Conclusions and Future Work
5.1 Conclusion
5.2 Future Work
A Appendix
Abstract:
Grading programming assignments in educational settings presents several challenges, including computational resource constraints, inconsistent evaluation criteria, subjectivity, and delayed feedback delivery, all of which can hinder student learning progress. Although large language models (LLMs) demonstrate promising capabilities for automated code evaluation, existing approaches face limitations in scalability, processing efficiency, and deployment feasibility in resource-constrained educational environments. This research presents a comprehensive solution that addresses these challenges through an integrated architecture that combines real student-annotated data, synthetic data generation, specialized model fine-tuning, and a distributed computing infrastructure.
We develop a framework for creating high-quality training datasets through both authentic student submissions and a synthetic data generation architecture that produces realistic programming assignments and student responses, while addressing concerns regarding privacy and data insufficiency. Based on this enhanced dataset, we fine-tune BeGrading, a specialized LLM optimized for comprehensive code evaluation across multiple dimensions, including correctness, efficiency, and coding style. BeGrading achieves superior performance with an absolute difference rate of 19% (± 0.95 out of 5) compared to the Codestral-22B reference model, demonstrating effective optimization for educational settings.
To overcome computational limitations, we implement a scalable, distributed infrastructure that utilizes Apache Spark Streaming with GPU-accelerated worker nodes, enabling the real-time processing of high-volume grading tasks. Our experimental evaluation demonstrates significant performance improvements, with inference time reductions of up to 50% using two workers and 70-80% with three workers, while maintaining grading accuracy. The system includes robust failover mechanisms that ensure continuous operation and a reliable distribution of tasks between nodes. This work presents a practical and scalable solution for automated programming assessment that significantly reduces instructor workload while providing students with timely, consistent, and objective feedback. This approach ultimately enhances educational outcomes by improving efficiency and reliability in code evaluation processes.
Keywords
Large language model, grade, spark, cluster, programming education, synthetic data generation, fine-tuning architecture
Text in English, abstracts in English and Arabic
0009-0000-0316-2724 ORCID
Informatics IFM
Dissertation, Academic
610