TY - GEN
T1 - Deep learning in automated essay scoring
AU - Boulanger, David
AU - Kumar, Vivekanandan
N1 - Publisher Copyright:
© Springer International Publishing AG, part of Springer Nature 2018.
PY - 2018
Y1 - 2018
N2 - This paper explores the application of deep learning to automated essay scoring (AES). It uses essay dataset #8 from the Automated Student Assessment Prize competition, hosted on the Kaggle platform, and a state-of-the-art Suite of Automatic Linguistic Analysis Tools (SALAT) to extract 1,463 writing features. A non-linear deep neural network regressor is trained to predict holistic scores on a scale of 10–60. This study shows that deep learning holds the promise of significantly improving the accuracy of AES systems, but that the current dataset, like most essay datasets, falls short of providing enough expertise (hand-graded essays) to exploit that potential. After tuning different sets of hyperparameters, the levels of agreement, as measured by the quadratic weighted kappa metric, obtained on the training, validation, and testing sets are 0.84, 0.63, and 0.58, respectively, while an ensemble (bagging) produced a kappa value of 0.80 on the testing set. Finally, this paper argues that more than 1,000 hand-graded essays per writing construct would be necessary to adequately train predictive models for automated essay scoring, provided that all score categories are equally or fairly represented in the sample dataset.
AB - This paper explores the application of deep learning to automated essay scoring (AES). It uses essay dataset #8 from the Automated Student Assessment Prize competition, hosted on the Kaggle platform, and a state-of-the-art Suite of Automatic Linguistic Analysis Tools (SALAT) to extract 1,463 writing features. A non-linear deep neural network regressor is trained to predict holistic scores on a scale of 10–60. This study shows that deep learning holds the promise of significantly improving the accuracy of AES systems, but that the current dataset, like most essay datasets, falls short of providing enough expertise (hand-graded essays) to exploit that potential. After tuning different sets of hyperparameters, the levels of agreement, as measured by the quadratic weighted kappa metric, obtained on the training, validation, and testing sets are 0.84, 0.63, and 0.58, respectively, while an ensemble (bagging) produced a kappa value of 0.80 on the testing set. Finally, this paper argues that more than 1,000 hand-graded essays per writing construct would be necessary to adequately train predictive models for automated essay scoring, provided that all score categories are equally or fairly represented in the sample dataset.
KW - Automated essay scoring
KW - Deep learning
KW - Writing analytics
UR - http://www.scopus.com/inward/record.url?scp=85048348232&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-91464-0_30
DO - 10.1007/978-3-319-91464-0_30
M3 - Conference contribution
AN - SCOPUS:85048348232
SN - 9783319914633
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 294
EP - 299
BT - Intelligent Tutoring Systems - 14th International Conference, ITS 2018, Proceedings
A2 - Vassileva, Julita
A2 - Nkambou, Roger
A2 - Azevedo, Roger
T2 - 14th International Conference on Intelligent Tutoring Systems, ITS 2018
Y2 - 11 June 2018 through 15 June 2018
ER -