ORIGINAL RESEARCH article

Explainable automated essay scoring: deep learning really has pedagogical value.

Vivekanandan Kumar

  • School of Computing and Information Systems, Faculty of Science and Technology, Athabasca University, Edmonton, AB, Canada

Automated essay scoring (AES) is a compelling topic in Learning Analytics for the primary reason that recent advances in AI make it a good testbed for exploring the artificial supplementation of human creativity. However, a vast swath of research tackles AES only holistically; few have even developed AES models at the rubric level, the very first layer of explanation underlying the prediction of holistic scores. Consequently, the AES black box has remained impenetrable. Although several algorithms from Explainable Artificial Intelligence have recently been published, no research has yet investigated the role that these explanation models can play in: (a) discovering the decision-making process that drives AES, (b) fine-tuning predictive models to improve generalizability and interpretability, and (c) providing personalized, formative, and fine-grained feedback to students during the writing process. Building on previous studies where models were trained to predict both the holistic and rubric scores of essays, using the Automated Student Assessment Prize’s essay datasets, this study focuses on predicting the quality of the writing style of Grade-7 essays and exposes the decision processes that lead to these predictions. In doing so, it evaluates the impact of deep learning (multi-layer perceptron neural networks) on the performance of AES. It was found that the effect of deep learning is best observed when assessing the trustworthiness of explanation models: as more hidden layers were added to the neural network, descriptive accuracy increased by about 10%. This study also shows that faster (up to three orders of magnitude) SHAP implementations are as accurate as the slower, model-agnostic one. It leverages the state of the art in natural language processing, applying feature selection on a pool of 1592 linguistic indices that measure aspects of text cohesion, lexical diversity, lexical sophistication, and syntactic sophistication and complexity. In addition to the list of most globally important features, this study reports (a) a list of features that are important for a specific essay (locally), (b) a range of values for each feature that contribute to higher or lower rubric scores, and (c) a model that makes it possible to quantify the impact of implementing formative feedback.

Automated essay scoring (AES) is a compelling topic in Learning Analytics (LA) for the primary reason that recent advances in AI make it a good testbed for exploring the artificial supplementation of human creativity. However, a vast swath of research tackles AES only holistically; only a few have even developed AES models at the rubric level, the very first layer of explanation underlying the prediction of holistic scores ( Kumar et al., 2017 ; Taghipour, 2017 ; Kumar and Boulanger, 2020 ). None has attempted to explain the whole decision process of AES, from holistic scores to rubric scores and from rubric scores to writing feature modeling. Although several algorithms from XAI (explainable artificial intelligence) ( Adadi and Berrada, 2018 ; Murdoch et al., 2019 ) have recently been published (e.g., LIME, SHAP) ( Ribeiro et al., 2016 ; Lundberg and Lee, 2017 ), no research has yet investigated the role that these explanation models (trained on top of predictive models) can play in: (a) discovering the decision-making process that drives AES, (b) fine-tuning predictive models to improve generalizability and interpretability, and (c) providing teachers and students with personalized, formative, and fine-grained feedback during the writing process.

One of the key anticipated benefits of AES is the elimination of human biases such as rater fatigue, rater expertise, severity/leniency, scale shrinkage, stereotyping, the halo effect, rater drift, perception differences, and inconsistency ( Taghipour, 2017 ). In turn, AES may suffer from its own set of biases (e.g., imperfections in training data, spurious correlations, overrepresented minority groups), which has incited the research community to look for ways to make AES more transparent, accountable, fair, unbiased, and consequently trustworthy while remaining accurate. This required changing the perception that AES is merely a machine learning and feature engineering task ( Madnani et al., 2017 ; Madnani and Cahill, 2018 ). Hence, researchers have advocated that AES should be seen as a shared task requiring several methodological design decisions along the way, such as curriculum alignment, construction of training corpora, a reliable scoring process, and rater performance evaluation, where the goal is to build and deploy fair and unbiased scoring models to be used in large-scale assessments and classroom settings ( Rupp, 2018 ; West-Smith et al., 2018 ; Rupp et al., 2019 ). Unfortunately, although these measures are intended to produce reliable and valid AES systems, they may still fail to build trust among users, keeping the AES black box impenetrable for teachers and students.

It has previously been recognized that divergence of opinion among human and machine graders has been investigated only superficially ( Reinertsen, 2018 ). So far, researchers have conducted qualitative analyses of the characteristics of essays that were rejected by AES systems (and thus required a human to score them) ( Reinertsen, 2018 ). Others strived to justify predicted scores by identifying the essay segments that actually caused those scores. Although these justifications hinted at and quantified the importance of these spatial cues, they did not provide any feedback as to how to improve those suboptimal essay segments ( Mizumoto et al., 2019 ).

Related to this study and the work of Kumar and Boulanger (2020) is Revision Assistant, a commercial AES system developed by Turnitin ( Woods et al., 2017 ; West-Smith et al., 2018 ), which in addition to predicting essays’ holistic scores provides formative, rubric-specific, and sentence-level feedback over multiple drafts of a student’s essay. The implementation of Revision Assistant moved away from the traditional approach to AES, which consists of using a limited set of features engineered by human experts to represent only high-level characteristics of essays. Like this study, it opted instead to include a large number of low-level writing features, demonstrating that expert-designed features are not required to produce interpretable predictions. Revision Assistant’s performance was reported on two essay datasets, one of which was the Automated Student Assessment Prize (ASAP) dataset. However, performance on the ASAP dataset was reported in terms of quadratic weighted kappa and only for holistic scores. Models predicting rubric scores were trained only on the other dataset, which was hosted on and collected through Revision Assistant itself.

In contrast to feature-based approaches like the one adopted by Revision Assistant, other AES systems are implemented using deep neural networks in which features are learned during model training. For example, Taghipour (2017) in his doctoral dissertation leveraged a recurrent neural network to improve accuracy in predicting holistic scores, implement rubric scoring (i.e., organization and argument strength), and distinguish between human-written and computer-generated essays. Interestingly, Taghipour compared the performance of his AES system against other AES systems using the ASAP corpora, but he did not use them to train rubric scoring models, although ASAP provides two corpora with rubric scores (#7 and #8). Finally, research was also undertaken to assess the generalizability of rubric-based models by performing experiments across various datasets. It was found that the predictive power of such rubric-based models was related to how well the underlying feature set covered a rubric’s criteria ( Rahimi et al., 2017 ).

Despite their number, rubrics (e.g., organization, prompt adherence, argument strength, essay length, conventions, word choices, readability, coherence, sentence fluency, style, audience, ideas) are usually investigated in isolation and not as a whole, with the exception of Revision Assistant, which provides feedback simultaneously on the following five rubrics: claim, development, audience, cohesion, and conventions. The literature reveals that rubric-specific automated feedback includes numerical rubric scores as well as recommendations on how to improve essay quality and correct errors ( Taghipour, 2017 ). Again, except for Revision Assistant, which took a holistic approach to AES including holistic and rubric scoring and the provision of rubric-specific feedback at the sentence level, AES has generally not been investigated as a whole or as an end-to-end product. Hence, the AES system used in this study and developed by Kumar and Boulanger (2020) is unique in that it uses both deep learning (a multi-layer perceptron neural network) and a huge pool of linguistic indices (1592), predicts both holistic and rubric scores, explains holistic scores in terms of rubric scores, and reports which linguistic indices are the most important per rubric. This study, however, goes one step further and showcases how to explain the decision process behind the prediction of a rubric score for a specific essay, one of the main AES limitations identified in the literature ( Taghipour, 2017 ) that this research intends to address, at least partially.

Besides providing explanations of predictions both globally and individually, this study not only goes one step further toward the automated provision of formative feedback but also does so in alignment with the explanation model and the predictive model, making it possible to better map feedback to the actual characteristics of an essay. Woods et al. (2017) succeeded in associating sentence-level expert-derived feedback with the strong/weak sentences having the greatest influence on a rubric score, based on the rubric, the essay score, and the sentence characteristics. While Revision Assistant’s feature space consists of counts and binary occurrence indicators of word unigrams, bigrams, and trigrams, character four-grams, and part-of-speech bigrams and trigrams, these are mainly textual and locational indices; by nature they are not descriptive or self-explanatory. This research fills this gap by proposing feedback based on a set of linguistic indices that can encompass several sentences at a time. However, the proposed approach omits locational hints, leaving the merging of the two approaches as the next step to be addressed by the research community.

Although this paper proposes to extend the automated provision of formative feedback through an interpretable machine learning method, it focuses on the feasibility of automating it in the context of AES rather than on evaluating its pedagogical quality (such as the informational and communicational value of feedback messages) or its impact on students’ writing performance, topics that are left for an upcoming study. Having an AES system that is capable of delivering real-time formative feedback sets the stage to investigate (1) when feedback is effective, (2) the types of feedback that are effective, and (3) whether there exist different kinds of behaviors in terms of seeking and using feedback ( Goldin et al., 2017 ). Finally, this paper omits describing the mapping between the AES model’s linguistic indices and a pedagogical language that is easily understandable by students and teachers, which is beyond its scope.

Methodology

This study showcases the application of the PDR framework ( Murdoch et al., 2019 ), which provides three pillars to describe interpretations in the context of the data science life cycle: Predictive accuracy, Descriptive accuracy, and Relevancy to human audience(s). It is important to note that, in a broader sense, the terms “explainable artificial intelligence” and “interpretable machine learning” can be used interchangeably with the following meaning ( Murdoch et al., 2019 ): “the use of machine-learning models for the extraction of relevant knowledge about domain relationships contained in data.” Here, “predictive accuracy” refers to the measurement of a model’s ability to fit data; “descriptive accuracy” is the degree to which the relationships learned by a machine learning model can be objectively captured; and “relevant knowledge” implies that a particular audience gets insights into a chosen domain problem that guide its communication, actions, and discovery ( Murdoch et al., 2019 ).

In the context of this article, formative feedback that assesses students’ writing skills and prescribes remedial writing strategies is the relevant knowledge sought, whose effectiveness on students’ writing performance will be validated in an upcoming study. The current study puts forward the tools and evaluates the feasibility of offering this real-time formative feedback. It also measures the predictive and descriptive accuracies of AES and explanation models, two key components for generating trustworthy interpretations ( Murdoch et al., 2019 ). Naturally, the provision of formative feedback depends on the speed of training and evaluating new explanation models every time a new essay is ingested by the AES system. That is why this paper investigates the potential of various SHAP implementations for speed optimization without compromising predictive and descriptive accuracy. This article will show how the insights generated by the explanation model can serve to debug the predictive model and help enhance the feature selection and/or engineering process ( Murdoch et al., 2019 ), laying the foundation for the provision of actionable and impactful pieces of knowledge to educational audiences, whose relevancy will be judged by the human stakeholders and estimated by the magnitude of resulting changes.

Figure 1 overviews all the elements and steps encompassed by the AES system in this study. The following subsections will address each facet of the overall methodology, from hyperparameter optimization to relevancy to both students and teachers.


Figure 1. A flow chart exhibiting the sequence of activities to develop an end-to-end AES system and how the various elements work together to produce relevant knowledge for the intended stakeholders.

Automated Essay Scoring System, Dataset, and Feature Selection

As previously mentioned, this paper reuses the AES system developed by Kumar and Boulanger (2020) . The AES models were trained using ASAP’s seventh essay corpus. These narrative essays were written by Grade-7 students in the setting of state-wide assessments in the United States and had an average length of 171 words. Students were asked to write a story about patience. Kumar and Boulanger’s work consisted of training a predictive model for each of the four rubrics according to which essays were graded: ideas, organization, style, and conventions. Each essay was scored by two human raters on a 0−3 integer scale. Rubric scores were resolved by adding the rubric scores assigned by the two human raters, producing a resolved rubric score between 0 and 6. This paper is a continuation of Boulanger and Kumar (2018 , 2019 , 2020) and Kumar and Boulanger (2020) , where the objective is to open the AES black box to explain the holistic and rubric scores that it predicts. Essentially, the holistic score ( Boulanger and Kumar, 2018 , 2019 ) is determined and justified through its four rubrics. Rubric scores, in turn, are investigated to highlight the writing features that play an important role within each rubric ( Kumar and Boulanger, 2020 ). Finally, beyond global feature importance, it is indispensable not only to identify which writing indices are important for a particular essay (locally), but also to discover how they contribute to increasing or decreasing the predicted rubric score, and which feature values are more or less desirable ( Boulanger and Kumar, 2020 ). This paper extends these previous works by adding the following links to the AES chain: holistic score, rubric scores, feature importance, explanations, and formative feedback. The objective is to highlight the means for transparent and trustworthy AES while empowering learning analytics practitioners with the tools to debug these models and equipping educational stakeholders with an AI companion that will semi-autonomously generate formative feedback to teachers and students. Specifically, this paper analyzes the AES reasoning underlying its assessment of the “style” rubric, which looks for command of language, including effective and compelling word choice and varied sentence structure, that clearly supports the writer’s purpose and audience.

This research’s approach to AES leverages a feature-based multi-layer perceptron (MLP) deep neural network to predict rubric scores. The AES system is fed 1592 linguistic indices quantitatively measured by the Suite of Automatic Linguistic Analysis Tools (SALAT), which assess aspects of grammar and mechanics, sentiment analysis and cognition, text cohesion, lexical diversity, lexical sophistication, and syntactic sophistication and complexity ( Kumar and Boulanger, 2020 ). The purpose of using such a huge pool of low-level writing features is to let deep learning extract the most important ones; the literature supports this practice since there is evidence that automatically selected features are no less interpretable than engineered ones ( Woods et al., 2017 ). However, to facilitate this process, this study opted for a semi-automatic strategy that combined filter and embedded methods. First, the original ASAP seventh essay dataset consists of a training set of 1567 essays and validation and testing sets totaling 894 essays. While the texts of all 2461 essays are available to the public, only the labels (the rubric scores of two human raters) of the training set have been shared publicly. This paper therefore reused the unlabeled 894 essays of the validation and testing sets for feature selection, a process that must be carried out carefully so that it is not informed by the essays that will train the predictive model. Second, feature data were normalized, and features with variances lower than 0.01 were pruned. Third, for any pair of features with an absolute Pearson correlation coefficient greater than 0.7, the feature that comes last in the datasets’ column ordering was also pruned. After the application of these filter methods, the number of features was reduced from 1592 to 282. Finally, the Lasso and Ridge regression regularization methods (whose combination is also called ElasticNet) were applied during the training of the rubric scoring models: Lasso is responsible for pruning features further, while Ridge regression is entrusted with eliminating multicollinearity among features.
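For concreteness, the filter steps described above can be sketched as follows, assuming the 1592 SALAT indices are available as the columns of a pandas DataFrame; the function name and data layout are illustrative, not the authors’ actual code:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def filter_features(features: pd.DataFrame) -> pd.DataFrame:
    """Apply the two filter steps: variance threshold, then correlation pruning."""
    # Normalize feature values so the 0.01 variance threshold is comparable across indices.
    scaled = pd.DataFrame(MinMaxScaler().fit_transform(features), columns=features.columns)

    # Filter 1: prune features whose variance is below 0.01.
    scaled = scaled.loc[:, scaled.var() > 0.01]

    # Filter 2: for each pair with |Pearson r| > 0.7, drop the feature that comes
    # last in the column ordering.
    corr = scaled.corr().abs()
    cols = list(scaled.columns)
    to_drop = {cols[j] for i in range(len(cols)) for j in range(i + 1, len(cols))
               if corr.iloc[i, j] > 0.7}
    return scaled.drop(columns=sorted(to_drop))

# In this study, these filters reduce the 1592 SALAT indices to 282; Lasso and Ridge
# (ElasticNet) then prune further during model training.
```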

Hyperparameter Optimization and Training

To ensure a fair evaluation of the potential of deep learning, it is of utmost importance to at least minimally describe this study’s exploration of the hyperparameter space, a step that is often missing when the outcomes of AES models’ performance are reported ( Kumar and Boulanger, 2020 ). First, a study should list the hyperparameters it is going to investigate by testing various values of each hyperparameter. For example, Table 1 lists all hyperparameters explored in this study. Note that L1 and L2 are two regularization hyperparameters contributing to feature selection. Second, each study should also report the range of values of each hyperparameter. Finally, the strategy to explore the selected hyperparameter subspace should be clearly defined. For instance, given the availability of high-performance computing resources and the time/cost of training AES models, one might favor performing a grid search (a systematic testing of all combinations of hyperparameters and hyperparameter values within a subspace), a random search (randomly selecting a hyperparameter value from a range of values per hyperparameter), or both, by first applying random search to identify a good starting candidate and then grid search to test all possible combinations in the vicinity of the starting candidate’s subspace. Of particular interest to this study is the neural network itself, that is, how many hidden layers a neural network should have and how many neurons should compose each hidden layer and the network as a whole. These two variables are directly related to the size of the neural network, with the number of hidden layers being a defining trait of deep learning. A vast swath of the literature is silent about the application of interpretable machine learning in AES and even more so about measuring its descriptive accuracy, which together with predictive accuracy constitutes the two components of trustworthiness. Hence, this study pioneers the comprehensive assessment of the impact of deep learning on AES’s predictive and descriptive accuracies.


Table 1. Hyperparameter subspace investigated in this article along with best hyperparameter values per neural network architecture.

Consequently, the 1567 labeled essays were divided into a training set (80%) and a testing set (20%). No validation set was put aside; 5-fold cross-validation was used instead for hyperparameter optimization. Table 1 delineates the hyperparameter subspace, from which 800 different combinations of hyperparameter values were randomly selected out of 86,248,800 possible combinations. Since this research proposes to investigate the potential of deep learning to predict rubric scores, several architectures consisting of 2 to 6 hidden layers and ranging from 9,156 to 119,312 parameters were tested. Table 1 shows the best hyperparameter values per depth of neural network.
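The random search with 5-fold cross-validation described above could be set up as in the following sketch; the builder function, hyperparameter names, and value ranges are placeholders rather than the exact subspace of Table 1:

```python
import random
import numpy as np
from sklearn.model_selection import KFold
from tensorflow import keras

def build_mlp(n_features, hidden_layers, units, dropout, l1, l2, lr):
    """Feature-based MLP regressor for one rubric score (illustrative architecture)."""
    model = keras.Sequential([keras.Input(shape=(n_features,))])
    for _ in range(hidden_layers):
        model.add(keras.layers.Dense(
            units, activation="relu",
            kernel_regularizer=keras.regularizers.l1_l2(l1=l1, l2=l2)))
        model.add(keras.layers.Dropout(dropout))
    model.add(keras.layers.Dense(1))
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr), loss="mse")
    return model

# Illustrative hyperparameter subspace (the paper's actual subspace is given in Table 1).
space = {"hidden_layers": [2, 3, 4, 5, 6], "units": [32, 64, 128, 256],
         "dropout": [0.0, 0.2, 0.4], "l1": [0.0, 1e-4, 1e-3],
         "l2": [0.0, 1e-4, 1e-3], "lr": [1e-4, 1e-3, 1e-2]}

def random_search(X, y, n_trials=800, seed=42):
    """Randomly sample hyperparameter combinations and score each with 5-fold CV."""
    rng, best = random.Random(seed), None
    for _ in range(n_trials):
        params = {k: rng.choice(v) for k, v in space.items()}
        fold_losses = []
        for train_idx, val_idx in KFold(5, shuffle=True, random_state=seed).split(X):
            model = build_mlp(X.shape[1], **params)
            hist = model.fit(X[train_idx], y[train_idx], epochs=200, verbose=0,
                             validation_data=(X[val_idx], y[val_idx]))
            fold_losses.append(min(hist.history["val_loss"]))
        score = np.mean(fold_losses)
        if best is None or score < best[0]:
            best = (score, params)
    return best  # (lowest mean validation MSE, corresponding hyperparameter values)
```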

Again, the essays of the testing set were never used during the training and cross-validation processes. To retrieve the best predictive models during training, the saved model was overwritten every time the validation loss reached a new record low. Training stopped when no new record low had been reached for 100 epochs. Moreover, to avoid reporting the performance of overfit models, each model was trained five times using the same set of best hyperparameter values. Finally, for each resulting predictive model, a corresponding ensemble model (bagging) was also obtained from the five models trained during cross-validation.
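A minimal sketch of this checkpointing, early-stopping, and bagging strategy using Keras callbacks; the file name and builder function are illustrative:

```python
import numpy as np
from tensorflow import keras

def train_once(build_fn, X_train, y_train, X_val, y_val, path="best_model.keras"):
    """Overwrite the saved model each time the validation loss reaches a new record low
    and stop when no new record low occurs within 100 epochs."""
    model = build_fn()
    callbacks = [
        keras.callbacks.ModelCheckpoint(path, monitor="val_loss", save_best_only=True),
        keras.callbacks.EarlyStopping(monitor="val_loss", patience=100),
    ]
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              epochs=10_000, callbacks=callbacks, verbose=0)
    return keras.models.load_model(path)

def bagged_prediction(models, X):
    """Ensemble (bagging) of the five models trained during cross-validation:
    average their rubric-score predictions for each essay."""
    return np.mean([m.predict(X, verbose=0) for m in models], axis=0)
```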

Predictive Models and Predictive Accuracy

Table 2 delineates the performance of the predictive models previously trained by Kumar and Boulanger (2020) on the four scoring rubrics. The first row lists the agreement levels between the resolved and predicted rubric scores measured by the quadratic weighted kappa. The second row is the percentage of accurate predictions; the third row reports the percentage of predictions that are either accurate or off by 1; and the fourth row reports the percentage of predictions that are either accurate or at most off by 2. Holistic scores are predicted merely by adding up all rubric scores. Since the scale of rubric scores is 0−6 for every rubric, the scale of holistic scores is 0−24.


Table 2. Rubric scoring models’ performance on testing set.
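The agreement metrics reported in Table 2 correspond to standard computations; a minimal sketch, assuming integer arrays of resolved (0−6) and predicted rubric scores (variable and function names are illustrative):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def scoring_report(resolved, predicted):
    """Quadratic weighted kappa plus exact/adjacent match rates on the 0-6 resolved scale."""
    resolved, predicted = np.asarray(resolved), np.asarray(predicted)
    diff = np.abs(resolved - predicted)
    return {
        "qwk": cohen_kappa_score(resolved, predicted, weights="quadratic"),
        "exact": np.mean(diff == 0),
        "exact_or_off_by_1": np.mean(diff <= 1),
        "exact_or_off_by_2": np.mean(diff <= 2),
    }

# Holistic scores (0-24) are simply the sum of the four predicted rubric scores, e.g.:
# holistic_pred = ideas_pred + organization_pred + style_pred + conventions_pred
```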

Each of these rubric scoring models might suffer from its own systemic bias, and these biases might partially cancel each other out when rubric scores are added up to derive the holistic score; this study (unlike related works) intends to highlight these biases by exposing the decision-making process underlying the prediction of rubric scores. Although this paper focuses exclusively on the Style rubric, the methodology put forward to analyze the local and global importance of writing indices and their context-specific contributions to predicted rubric scores is applicable to every rubric and makes it possible to control for these biases one rubric at a time. Comparing and contrasting the role that a specific writing index plays within each rubric context deserves its own investigation, which has been partly addressed in the study led by Kumar and Boulanger (2020) . Moreover, this paper underscores the necessity of measuring the predictive accuracy of rubric-based holistic scoring with additional metrics that account for these rubric-specific biases. For example, several combinations of rubric scores produce a holistic score of 16 (e.g., 4-4-4-4 vs. 4-3-4-5 vs. 3-5-2-6). Even though the predicted holistic score might be accurate, the rubric scores could all be inaccurate. Similarity or distance metrics (e.g., Manhattan and Euclidean) should then be used to describe the authenticity of the composition of these holistic scores.
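As a concrete illustration of this composition issue, the following sketch compares two rubric-score vectors from the example above that yield the same holistic score of 16 but differ in composition:

```python
import numpy as np

resolved  = np.array([4, 4, 4, 4])   # ideas, organization, style, conventions
predicted = np.array([3, 5, 2, 6])   # same holistic score (16), different composition

manhattan = np.abs(resolved - predicted).sum()            # 1 + 1 + 2 + 2 = 6
euclidean = np.sqrt(((resolved - predicted) ** 2).sum())  # sqrt(1 + 1 + 4 + 4) ≈ 3.16

print(resolved.sum() == predicted.sum())  # True: identical holistic scores
print(manhattan, euclidean)               # yet clearly different rubric compositions
```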

According to what Kumar and Boulanger (2020) report on the performance of several state-of-the-art AES systems trained on ASAP’s seventh essay dataset, the AES system they developed, and which is reused in this paper, proved competitive while being fully and deeply interpretable, which no other AES system is. They also supply further information about the study setting, essay datasets, rubrics, features, natural language processing (NLP) tools, model training, and evaluation against human performance. Again, this paper showcases the application of explainable artificial intelligence in automated essay scoring by focusing on the decision process of the Rubric #3 (Style) scoring model. Remember that the same methodology is applicable to each rubric.

Explanation Model: SHAP

SHapley Additive exPlanations (SHAP) is a theoretically justified XAI framework that can simultaneously provide both local and global explanations ( Molnar, 2020 ); that is, SHAP is able to explain individual predictions while taking into account the uniqueness of each prediction, and to highlight the global factors influencing the overall performance of a predictive model. SHAP is of keen interest because it unifies all algorithms of the class of additive feature attribution methods, adhering to a set of three properties that are desirable in interpretable machine learning: local accuracy, missingness, and consistency ( Lundberg and Lee, 2017 ). A key advantage of SHAP is that feature contributions are all expressed in terms of the outcome variable (e.g., rubric scores), providing a common scale on which to compare the importance of features against each other. Local accuracy refers to the fact that, no matter the explanation model, the sum of all feature contributions is always equal to the prediction explained by these features. The missingness property implies that the prediction is never explained by unmeasured factors, which are always assigned a contribution of zero. However, the converse is not true; a contribution of zero does not imply an unobserved factor, it can also denote a feature irrelevant to explaining the prediction. The consistency property guarantees that a more important feature will always have a greater magnitude than a less important one, no matter how many other features are included in the explanation model. SHAP proves superior to other additive attribution methods such as LIME (Local Interpretable Model-Agnostic Explanations), Shapley values, and DeepLIFT in that these never comply with all three properties, while SHAP does ( Lundberg and Lee, 2017 ). Moreover, the way SHAP assesses the importance of a feature differs from permutation importance methods (e.g., ELI5), which measure the decrease in model performance (accuracy) as a feature is perturbed, in that it is based on how much a feature contributes to every prediction.

Essentially, a SHAP explanation model (a linear regression) is trained on top of a predictive model, which in this case is a complex ensemble deep learning model. Table 3 presents a small-scale example showing how SHAP values (feature contributions) work. In this example, there are five instances and five features describing each instance (in the context of this paper, an instance is an essay). Predictions are listed in the second-to-last column, and the base value is the mean of all predictions. The base value constitutes the reference point against which predictions are explained; in other words, reasons are given to justify the discrepancy between the individual prediction and the mean prediction (the base value). Notice that the table does not contain the actual feature values; these are SHAP values, which quantify the contribution of each feature to the predicted score. For example, the prediction for Instance 1 is 2.46, while the base value is 3.76. Adding Instance 1’s feature contributions to the base value produces this predicted score.


Table 3. Array of SHAP values: local and global importance of features and feature coverage per instance.

Hence, the generic equation of the explanation model ( Lundberg and Lee, 2017 ) is:

$$ g(x) = \phi_0 + \sum_{i=1}^{j} \phi_i x_i $$

where g(x) is the prediction for an individual instance x, \phi_0 is the base value, \phi_i is the contribution of feature x_i, x_i ∈ {0,1} denotes whether feature x_i is part of the individual explanation, and j is the total number of features. Furthermore, the global importance I_i of a feature i is calculated by adding up the absolute values of its corresponding SHAP values over all instances, where n is the total number of instances and \phi_i^{(k)} is the contribution of feature i for instance k ( Lundberg et al., 2018 ):

$$ I_i = \sum_{k=1}^{n} \left| \phi_i^{(k)} \right| $$

Therefore, it can be seen that Feature 3 is the most globally important feature, while Feature 2 is the least important one. Similarly, Feature 5 is Instance 3’s most important feature at the local level, while Feature 2 is the least locally important. The reader should also note that a feature is not necessarily assigned a contribution; some features are simply not part of the explanation, such as Feature 2 and Feature 3 in Instance 2. These concepts lay the foundation for the explainable AES system presented in this paper. Just imagine that each instance (essay) will instead be summarized by 282 features and that explanations will be provided for all 314 essays of the testing set.
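The mechanics of Table 3 can be reproduced in a few lines; the SHAP values below are made up (they are not Table 3’s actual numbers) but are chosen to match the base value of 3.76 and Instance 1’s prediction of 2.46 quoted above:

```python
import numpy as np

base_value = 3.76                       # mean prediction over the explained set
# Hypothetical SHAP values: rows are instances (essays), columns are features.
shap_values = np.array([
    [-0.50, 0.00, -0.60, -0.10, -0.10],   # Instance 1
    [ 0.20, 0.00,  0.00,  0.30, -0.15],   # Instance 2
])

# Local accuracy: base value + feature contributions = individual prediction.
predictions = base_value + shap_values.sum(axis=1)   # 3.76 - 1.30 = 2.46 for Instance 1

# Global importance of each feature: sum of absolute SHAP values over all instances.
global_importance = np.abs(shap_values).sum(axis=0)
```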

Several implementations of SHAP exist: KernelSHAP, DeepSHAP, GradientSHAP, and TreeSHAP, among others. KernelSHAP is model-agnostic and works for any type of predictive model; however, it is very computing-intensive, which makes it undesirable for practical purposes. DeepSHAP and GradientSHAP are two implementations intended for deep learning that take advantage of the known properties of neural networks (i.e., MLP-NN, CNN, or RNN) to accelerate, by up to three orders of magnitude, the processing time needed to explain predictions ( Chen et al., 2019 ). Finally, TreeSHAP is the most powerful implementation and is intended for tree-based models. TreeSHAP is not only fast; it is also accurate. While the three former implementations estimate SHAP values, TreeSHAP computes them exactly. Moreover, TreeSHAP not only measures the contribution of individual features, it also considers interactions between pairs of features and assigns them SHAP values. Since one of the goals of this paper is to assess the potential of deep learning on the performance of both predictive and explanation models, this research tested the first three implementations. TreeSHAP is recommended for future work since interactions among features are critical information to consider. Moreover, KernelSHAP, DeepSHAP, and GradientSHAP all require access to the whole original dataset to derive the explanation of a new instance, another constraint to which TreeSHAP is not subject.
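As a sketch of how the three tested implementations can be instantiated with the shap library on a Keras regressor (exact behavior may vary across shap versions, and the variable names are illustrative):

```python
import shap

def explain_with_shap(model, X_train, X_test):
    """Instantiate the three SHAP implementations tested in this study (illustrative)."""
    # DeepSHAP and GradientSHAP: deep-learning-specific, much faster than KernelSHAP.
    deep_values = shap.DeepExplainer(model, X_train).shap_values(X_test)
    grad_values = shap.GradientExplainer(model, X_train).shap_values(X_test)

    # KernelSHAP: model-agnostic but computing-intensive; summarizing the training data
    # with 50 k-means centroids speeds it up by roughly an order of magnitude.
    background = shap.kmeans(X_train, 50)
    kernel_explainer = shap.KernelExplainer(model.predict, background)
    # l1_reg="num_features(32)" limits each explanation to 32 features.
    kernel_values = kernel_explainer.shap_values(X_test, nsamples="auto",
                                                 l1_reg="num_features(32)")
    return deep_values, grad_values, kernel_values
```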

Descriptive Accuracy: Trustworthiness of Explanation Models

This paper reuses and adapts the methodology introduced by Ribeiro et al. (2016) . Several explanation models are trained, using different SHAP implementations and configurations, per deep learning predictive model (for each number of hidden layers). The rationale consists of randomly selecting and ignoring 25% of the 282 features feeding the predictive model (i.e., setting them to zero). If this causes the prediction to change beyond a specific threshold (in this study, 0.10 and 0.25 were tested), then the explanation model should also reflect the magnitude of this change when the contributions of these same features are ignored. For example, the original predicted rubric score of an essay might be 5; however, when ignoring the information brought in by a subset of 70 randomly selected features (25% of 282), the prediction may turn into a 4. If the explanation model also predicts a 4 while ignoring the contributions of the same subset of features, then the explanation is considered trustworthy. This makes it possible to compute the precision, recall, and F1-score of each explanation model (from the numbers of true and false positives and true and false negatives). The process is repeated 500 times for every essay to determine the average precision and recall of every explanation model.
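One trial of this perturbation test could look like the following sketch, assuming a function that returns the predicted rubric score for a single 282-feature vector and the essay’s SHAP values; all names are illustrative:

```python
import numpy as np

def trust_trial(predict, x, shap_values, base_value, threshold=0.10, frac=0.25, rng=None):
    """One trial of the descriptive-accuracy test: ignore a random 25% of features and
    check whether the prediction and the explanation move (or stay) together."""
    rng = rng or np.random.default_rng()
    ignored = rng.choice(len(x), size=int(frac * len(x)), replace=False)

    # Effect on the predictive model: zero out the ignored features.
    x_perturbed = x.copy()
    x_perturbed[ignored] = 0.0
    pred_change = abs(predict(x_perturbed) - predict(x)) > threshold

    # Effect on the explanation model: remove the ignored features' contributions.
    original_expl = base_value + shap_values.sum()
    reduced_expl = original_expl - shap_values[ignored].sum()
    expl_change = abs(reduced_expl - original_expl) > threshold

    # True positive: both change; true negative: neither changes.
    return pred_change, expl_change

# Repeating this 500 times per essay and tallying (mis)matches between pred_change and
# expl_change yields the precision, recall, and F1-score of the explanation model.
```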

Judging Relevancy

So far, the consistency of explanations with predictions has been considered. However, consistent explanations do not imply relevant or meaningful explanations. Put another way, explanations only reflect what predictive models have learned during training. How can the black box of these explanations be opened? Looking directly at the numerical SHAP values of each explanation might seem a daunting task, but there exist tools, mainly visualizations (the decision plot, summary plot, and dependence plot), that help make sense of these explanations. Before visualizing these explanations, however, another question needs to be addressed: which explanations or essays should be picked for further scrutiny of the AES system? Given the huge number of essays to examine and the tedious task of understanding the underpinnings of a single explanation, a small subset of essays should be carefully picked to represent concisely the state of correctness of the underlying predictive model. Again, this study applies and adapts the methodology of Ribeiro et al. (2016) . A greedy algorithm selects essays whose predictions are explained by as many features of global importance as possible in order to optimize feature coverage. Ribeiro et al. demonstrated in unrelated settings (i.e., sentiment analysis) that the correctness of a predictive model can be assessed with as few as four or five well-picked explanations.

For example, Table 3 reveals the global importance of five features. The square root of each feature’s global importance is computed and used instead to limit the influence of a small group of very influential features. The feature coverage of Instance 1 is 100% because all features are engaged in the explanation of its prediction. On the other hand, Instance 2 has a feature coverage of 61.5% because only Features 1, 4, and 5 are part of the prediction’s explanation. The feature coverage of an explanation E is calculated by summing the square roots of the global importances of the explanation’s features and dividing by the sum of the square roots of all features’ global importances:

$$ C(E) = \frac{\sum_{i \in E} \sqrt{I_i}}{\sum_{i=1}^{j} \sqrt{I_i}} $$

where I_i is the global importance of feature i and j is the total number of features.

Additionally, it can be seen that Instance 4 does not have any zero feature value although its feature coverage is only 84.6%. The algorithm was constrained to discard from the explanation any feature whose contribution (local importance) was too close to zero. In the case of Table 3 ’s example, any feature whose absolute SHAP value is less than 0.10 is ignored, hence leading to a feature coverage of 84.6% rather than 100%.

In this paper’s study, the actual threshold was 0.01. This constraint was in fact a requirement for the DeepSHAP and GradientSHAP implementations because they only output non-zero SHAP values, contrary to KernelSHAP, which generates explanations with a fixed number of features: a non-zero SHAP value indicates that the feature is part of the explanation, while a zero value excludes the feature from the explanation. Without this parameter, all 282 features would be part of every explanation even though a huge number of them would have only trivial (very close to zero) SHAP values. Now, a much smaller but variable subset of features makes up each explanation. This is one way in which Ribeiro et al.’s SP-LIME algorithm (SP stands for Submodular Pick) has been adapted to this study’s needs. In conclusion, notice how Instance 4 would be selected in preference to Instance 5 to explain Table 3 ’s underlying predictive model. Even though both instances have four features explaining their predictions, Instance 4’s features are more globally important than Instance 5’s, and therefore Instance 4 has greater feature coverage than Instance 5.

Whereas Table 3 ’s example exhibits the feature coverage of one instance at a time, this study computes it for a subset of instances, where the absolute SHAP values are aggregated (summed) per candidate subset. When the sum of absolute SHAP values per feature exceeds the set threshold, the feature is then considered as covered by the selected set of instances. The objective in this study was to optimize the feature coverage while minimizing the number of essays to validate the AES model.
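A sketch of this greedy, coverage-maximizing selection under the aggregation rule just described (the exact SP-LIME adaptation may differ in its details):

```python
import numpy as np

def coverage(abs_shap_subset, weights, threshold=0.01):
    """Coverage of a candidate subset: a feature counts as covered when the sum of its
    absolute SHAP values over the subset's essays exceeds the threshold."""
    covered = abs_shap_subset.sum(axis=0) >= threshold
    return weights[covered].sum() / weights.sum()

def greedy_pick(shap_values, n_pick=4, threshold=0.01):
    """Greedily add the essay that most increases square-root-weighted feature coverage."""
    abs_shap = np.abs(shap_values)                     # (n_essays, n_features)
    weights = np.sqrt(abs_shap.sum(axis=0))            # sqrt of each feature's global importance
    selected = []
    for _ in range(n_pick):
        candidates = [e for e in range(abs_shap.shape[0]) if e not in selected]
        best = max(candidates,
                   key=lambda e: coverage(abs_shap[selected + [e]], weights, threshold))
        selected.append(best)
    return selected, coverage(abs_shap[selected], weights, threshold)

# In this study, four well-picked essays reached a feature coverage of 93.9%.
```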

Research Questions

One of this article’s objectives is to assess the potential of deep learning in automated essay scoring. The literature has often claimed ( Hussein et al., 2019 ) that there are two approaches to AES, feature-based and deep learning, as though these two approaches were mutually exclusive. Yet, the literature also puts forward that feature-based AES models may be more interpretable than deep learning ones ( Amorim et al., 2018 ). This paper embraces the viewpoint that these two approaches can also be complementary, by leveraging the state of the art in NLP and automatic linguistic analysis, harnessing one of the richest pools of linguistic indices put forward in the research community ( Crossley et al., 2016 , 2017 , 2019 ; Kyle, 2016 ; Kyle et al., 2018 ), and applying a thorough feature selection process powered by deep learning. Moreover, the ability of deep learning to model complex non-linear relationships makes it particularly well suited to AES, given that the importance of a writing feature is highly dependent on its context, that is, on its interactions with other writing features. Besides, this study leverages the SHAP interpretation method, which is well suited to interpreting very complex models. Hence, this study elected to work with deep learning models and ensembles to test SHAP’s ability to explain these complex models. Previously, the literature has revealed the difficulty of having models that are both accurate and interpretable ( Ribeiro et al., 2016 ; Murdoch et al., 2019 ), where favoring one comes at the expense of the other. This research shows how XAI now makes it possible to produce models that are both accurate and interpretable in the area of AES. Since ensembles have repeatedly been shown to boost the accuracy of predictive models, they were included as part of the tested deep learning architectures to maximize generalizability and accuracy, while making these predictive models interpretable and exploring whether deep learning can enhance their descriptive accuracy even further.

This study investigates the trustworthiness of explanation models, and more specifically, those explaining deep learning predictive models. For instance, does the depth, defined as the number of hidden layers, of an MLP neural network increase the trustworthiness of its SHAP explanation model? The answer to this question will help determine whether it is possible to have very accurate AES models and competitively interpretable/explainable models at the same time, the cornerstone for the generation of formative feedback. Remember that formative feedback is defined as “any kind of information provided to students about their actual state of learning or performance in order to modify the learner’s thinking or behavior in the direction of the learning standards” and that formative feedback “conveys where the student is, what are the goals to reach, and how to reach the goals” ( Goldin et al., 2017 ). This notion contrasts with summative feedback, which basically is “a justification of the assessment results” ( Hao and Tsikerdekis, 2019 ).

As pointed out in the previous section, multiple SHAP implementations are evaluated in this study. Hence, this paper examines whether the faster DeepSHAP and GradientSHAP implementations are as reliable as the slower KernelSHAP implementation. The answer to this research question will shed light on the feasibility of providing immediate formative feedback, and this multiple times throughout students’ writing processes.

This study also looks at whether a summary of the data produces explanations as trustworthy as those derived from the original data. This question will be of interest to AES researchers and practitioners because a positive answer could significantly decrease the processing time of the computing-intensive and model-agnostic KernelSHAP implementation and further test the potential of customizable explanations.

KernelSHAP makes it possible to specify the total number of features that will shape the explanation of a prediction; for instance, this study experiments with explanations of 16 and 32 features and observes whether there exists a statistically significant difference in the reliability of these explanation models. Knowing this will hint at whether simpler or more complex explanations are more desirable when it comes to optimizing their trustworthiness. If there is no statistically significant difference, then AES practitioners are given further flexibility in the selection of SHAP implementations to find the sweet spot between complexity of explanations and speed of processing. For instance, the KernelSHAP implementation allows the number of factors making up an explanation to be customized, while the faster DeepSHAP and GradientSHAP implementations do not.

Finally, this paper highlights the means to debug and compare the performance of predictive models through their explanations. Once a model is debugged, the process can be reused to fine-tune feature selection and/or feature engineering to improve predictive models and to generate formative feedback for both students and teachers.

The training, validation, and testing sets consist of 1567 essays, each of which has been scored by two human raters, who assigned a score between 0 and 3 per rubric (ideas, organization, style, and conventions). In particular, this article looks at the predictive and descriptive accuracy of AES models on the third rubric, style. Note that although each essay has been scored by two human raters, the literature ( Shermis, 2014 ) is not explicit about whether only two or more than two human raters participated in the scoring of all 1567 essays; given the huge number of essays, it is likely that more than two human raters were involved, so that the amount of noise introduced by the various raters’ biases is unknown, while probably being to some degree balanced between the two groups of raters. Figure 2 shows the confusion matrices of human raters on the Style rubric. The diagonal elements (dark gray) correspond to exact matches, whereas the light gray squares indicate adjacent matches. Figure 2A delineates the number of essays per pair of ratings, and Figure 2B shows the percentages per pair of ratings. The agreement level between each pair of human raters, measured by the quadratic weighted kappa, is 0.54; the percentage of exact matches is 65.3%; the percentage of adjacent matches is 34.4%; and 0.3% of essays are neither exact nor adjacent matches. Figures 2A,B specify the distributions of 0−3 ratings per group of human raters. Figure 2C exhibits the distribution of resolved scores (a resolved score is the sum of the two human ratings). The mean is 3.99 (with a standard deviation of 1.10), and the median and mode are 4. It is important to note that the levels of predictive accuracy reported in this article are measured on the scale of resolved scores (0−6) and that larger scales tend to slightly inflate quadratic weighted kappa values, which must be taken into account when comparing against the level of agreement between human raters. Comparisons of the percentages of exact and adjacent matches must also be made with this scoring scale discrepancy in mind.


Figure 2. Summary of the essay dataset (1567 Grade-7 narrative essays) investigated in this study. (A) Number of essays per pair of human ratings; the diagonal (dark gray squares) lists the numbers of exact matches while the light-gray squares list the numbers of adjacent matches; the bottom row and the rightmost column highlight the distributions of ratings for both groups of human raters. (B) Percentages of essays per pair of human ratings; the diagonal (dark gray squares) lists the percentages of exact matches while the light-gray squares list the percentages of adjacent matches; the bottom row and the rightmost column highlight the distributions (frequencies) of ratings for both groups of human raters. (C) The distribution of resolved rubric scores; a resolved score is the sum of its two constituent human ratings.

Predictive Accuracy and Descriptive Accuracy

Table 4 compiles the performance outcomes of the 10 predictive models evaluated in this study. The reader should remember that the performance of each model was averaged over five iterations and that two models were trained per number of hidden layers, one non-ensemble and one ensemble. Except for the 6-layer models, there is no clear winner among the models. Even the 6-layer models are superior only in terms of exact matches, the primary goal for a reliable AES system, and not in terms of adjacent matches. Nevertheless, on average ensemble models slightly outperform non-ensemble models. Hence, these ensemble models were retained for the next analysis step. Moreover, given that five ensemble models were trained per neural network depth, the most accurate model among the five is selected and displayed in Table 4.


Table 4. Performance of majority classifier and average/maximal performance of trained predictive models.

Next, for each selected ensemble predictive model, several explanation models are trained. Every predictive model is explained by the “Deep,” “Grad,” and “Random” explainers, except for the 6-layer model, for which it was not possible to train a “Deep” explainer, apparently due to a bug in the original SHAP code triggered by a unique condition in either this study’s data or its neural network architecture. Fixing and investigating this issue was beyond the scope of this study. As will be demonstrated, no statistically significant difference exists between the accuracy of these explainers.

The “Random” explainer serves as a baseline model for comparison purposes. Remember that, to evaluate the reliability of explanation models, the concurrent impact of randomly selecting and ignoring a subset of features on the prediction and explanation of rubric scores is analyzed. If the prediction changes significantly and its corresponding explanation changes accordingly (beyond a set threshold), a true positive is recorded; if the prediction remains within the threshold and so does the explanation, a true negative is recorded; in both cases the explanation is deemed trustworthy. The Random explainer simulates random explanations by randomly selecting 32 non-zero features from the original set of 282 features. These random explanations consist only of non-zero features because, according to SHAP’s missingness property, a feature with a zero or missing value never gets assigned any contribution to the prediction. If at least one of these 32 features is also an element of the subset of ignored features, then the explanation is considered untrustworthy, no matter the size of that feature’s contribution.

As for the 2-layer model, six different explanation models are evaluated. Recall that the 2-layer models generated the lowest mean squared error (MSE) during hyperparameter optimization (see Table 1 ); hence, this specific type of architecture was selected to test the reliability of these various explainers. The “Kernel” explainer is the most computing-intensive and took approximately 8 h of processing. It was trained using the full distributions of feature values in the training set and shaped explanations in terms of 32 features; the “Kernel-16” and “Kernel-32” models were trained on a summary (50 k-means centroids) of the training set to accelerate the processing by about one order of magnitude (less than 1 h). The “Kernel-16” explainer derived explanations in terms of 16 features, while the “Kernel-32” explainer explained predictions through 32 features. Table 5 exhibits the descriptive accuracy of these various explanation models according to the 0.10 and 0.25 thresholds; in other words, by ignoring a subset of randomly picked features, it assesses whether or not the prediction and explanation change simultaneously. Note also how each explanation model, no matter the underlying predictive model, outperforms the “Random” model.


Table 5. Precision, recall, and F1 scores of the various explainers tested per type of predictive model.

The first research question addressed in this subsection asks whether there exists a statistically significant difference between the “Kernel” explainer, which generates 32-feature explanations and is trained on the whole training set, and the “Kernel-32” explainer, which also generates 32-feature explanations but is trained on a summary of the training set. To determine this, an independent t-test was conducted using the precision, recall, and F1-score distributions (500 iterations) of both explainers. Table 6 reports the p-values of all the tests for both the 0.10 and 0.25 thresholds. It reveals that there is no statistically significant difference between the two explainers.


Table 6. p-values of independent t-tests comparing whether there exist statistically significant differences between the mean precisions, recalls, and F1-scores of the 2-layer explainers and between those of the 2-layer, 4-layer, and 6-layer models’ Gradient explainers.
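The comparison itself is a standard independent t-test over the 500-iteration metric distributions; a minimal sketch with placeholder data standing in for the explainers’ actual F1-score distributions:

```python
import numpy as np
from scipy import stats

# Placeholder distributions standing in for the 500 per-iteration F1-scores of the
# "Kernel" and "Kernel-32" explainers (the real values come from the perturbation test).
rng = np.random.default_rng(0)
f1_kernel = rng.normal(loc=0.85, scale=0.03, size=500)
f1_kernel_32 = rng.normal(loc=0.85, scale=0.03, size=500)

t_stat, p_value = stats.ttest_ind(f1_kernel, f1_kernel_32)
print(p_value > 0.05)  # True indicates no statistically significant difference at the 5% level
```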

The next research question tests whether there exists a difference in the trustworthiness of explainers shaping 16-feature versus 32-feature explanations. Again, t-tests were conducted to verify this. Table 6 lists the resulting p-values. Again, there is no statistically significant difference in the average precisions, recalls, and F1-scores of the two explainers.

This leads to investigating whether the “Kernel,” “Deep,” and “Grad” explainers are equivalent. Table 6 exhibits the results of the t-tests conducted to verify this and reveals that none of the explainers produces a statistically significantly better performance than the others.

Armed with this evidence, it is now possible to verify whether deeper MLP neural networks produce more trustworthy explanation models. For this purpose, the performance of the “Grad” explainer for each type of predictive model is compared against the others, using the same methodology as previously applied. Table 6 , again, confirms that the explanation model of the 2-layer predictive model is statistically significantly less trustworthy than that of the 4-layer model; the same can be said of the 4-layer versus 6-layer models. The only exceptions are the differences in average precision between the 2-layer and 4-layer models and between the 4-layer and 6-layer models; however, there clearly exists a statistically significant difference in terms of precision (as well as recall and F1-score) between the 2-layer and 6-layer models.

The Best Subset of Essays to Judge AES Relevancy

Table 7 lists the four best essays optimizing feature coverage (93.9%) along with their resolved and predicted scores. Notice how, among the four essays picked by the adapted SP-LIME algorithm, two show strong disagreement between the human and machine graders, two have short and trivial text, and two exhibit perfect agreement between the human and machine graders. Interestingly, each pair of longer and shorter essays exposes both strong agreement and strong disagreement between the human and AI agents, offering an opportunity to debug the model and evaluate its ability to detect the presence or absence of more basic aspects (e.g., a very small number of words, occurrences of sentence fragments) and more advanced aspects (e.g., cohesion between adjacent sentences, variety of sentence structures) of narrative essay writing and to appropriately reward or penalize them.


Table 7. Set of best essays to evaluate the correctness of the 6-layer ensemble AES model.

Local Explanation: The Decision Plot

The decision plot lists writing features by order of importance from top to bottom. The line segments display the contribution (SHAP value) of each feature to the predicted rubric score. Note that an actual decision plot consists of all 282 features and that only its top portion (the 20 most important features) can be displayed (see Figure 3 ). A decision plot is read from bottom to top: the line starts at the base value and ends at the predicted rubric score. Given that the “Grad” explainer is the only explainer common to all predictive models, it was selected to derive all explanations. The decision plots in Figure 3 show the explanations of the four essays in Table 7 ; the dashed line in these plots represents the explanation of the most accurate predictive model, that is, the ensemble model with six hidden layers, which also produced the most trustworthy explanation model. The predicted rubric score of each explanation model is listed in the bottom-right legend. Explanations of the writing features follow in a later subsection.


Figure 3. Comparisons of all models’ explanations of the most representative set of four essays: (A) Essay 228, (B) Essay 68, (C) Essay 219, and (D) Essay 124.

Global Explanation: The Summary Plot

It is advantageous to use SHAP to build explanation models because it provides a single framework to discover the writing features that are important to an individual essay (local) or to a set of essays (global). While the decision plots list features of local importance, Figure 4 ’s summary plot ranks writing features by order of global importance (from top to bottom). All 314 essays of the testing set are represented as dots in the scatterplot of each writing feature. The position of a dot on the horizontal axis corresponds to the importance (SHAP value) of the writing feature for a specific essay, and its color indicates the magnitude of the feature value in relation to the range of all 314 feature values. For example, large or small numbers of words within an essay generally contribute to increasing or decreasing rubric scores by up to 1.5 and 1.0, respectively. Decision plots can also be used to find the most important features for a small subset of essays; Figure 5 demonstrates the new ordering of writing indices when aggregating the feature contributions (summing the absolute SHAP values) of the four essays in Table 7 . Moreover, Figure 5 makes it possible to compare the contributions of a feature to various essays. Note how the orderings in Figures 3 −5 can differ from each other, sharing many features of global importance while also having their own unique features of local importance.


Figure 4. Summary plot listing the 32 most important features globally.


Figure 5. Decision plot delineating the best model’s explanations of Essays 228, 68, 219, and 124 (6-layer ensemble).

Definition of Important Writing Indices

It is beyond the scope of this paper to describe all writing features thoroughly. Nevertheless, the summary and decision plots in Figures 4 , 5 make it possible to identify a subset of features that should be examined in order to validate this study’s predictive model. Supplementary Table 1 combines and describes the 38 features appearing in Figures 4 , 5 .

Dependence Plots

Although the summary plot in Figure 4 is insightful for determining whether small or large feature values are desirable, the dependence plots in Figure 6 prove essential for recommending whether a student should aim at increasing or decreasing the value of a specific writing feature. The dependence plots also reveal whether the student should act directly upon the targeted writing feature or indirectly on other features. The horizontal axis in each of the dependence plots in Figure 6 is the scale of the writing feature, and the vertical axis is the scale of the writing feature’s contributions to the predicted rubric scores. Each dot in a dependence plot represents one of the testing set’s 314 essays, that is, the feature value and SHAP value belonging to that essay. The vertical dispersion of the dots over small intervals of the horizontal axis is indicative of interaction with other features ( Molnar, 2020 ). If the vertical dispersion is widespread (e.g., the [50, 100] horizontal-axis interval in the “word_count” dependence plot), then the contribution of the writing feature is most likely dependent, to some degree, on other writing feature(s).


Figure 6. Dependence plots: the horizontal axes represent feature values while vertical axes represent feature contributions (SHAP values). Each dot represents one of the 314 essays of the testing set and is colored according to the value of the feature with which it interacts most strongly. (A) word_count. (B) hdd42_aw. (C) ncomp_stdev. (D) dobj_per_cl. (E) grammar. (F) SENTENCE_FRAGMENT. (G) Sv_GI. (H) adjacent_overlap_verb_sent.

The contributions of this paper can be summarized as follows: (1) it proposes a means (SHAP) to explain individual predictions of AES systems and provides flexible guidelines for building powerful predictive models using more complex algorithms such as ensembles and deep learning neural networks; (2) it applies a methodology to quantitatively assess the trustworthiness of explanation models; (3) it tests whether faster SHAP implementations impact the descriptive accuracy of explanation models, giving insight into the applicability of SHAP in real pedagogical contexts such as AES; (4) it offers a toolkit to debug AES models, highlights linguistic intricacies, and underscores the means to offer formative feedback to novice writers; and, more importantly, (5) it empowers learning analytics practitioners to make AI pedagogical agents accountable to the human educator, the ultimate problem holder responsible for the decisions and actions of AI (Abbass, 2019). In essence, learning analytics (which encompasses tools such as AES) is characterized as an ethics-bound, semi-autonomous, and trust-enabled human-AI fusion that recurrently measures and proactively advances knowledge boundaries in human learning.

To exemplify this, imagine an AES system that supports instructors in the detection of plagiarism and gaming behaviors and in the marking of writing activities. As previously mentioned, essays are marked according to a grid of scoring rubrics: ideas, organization, style, and conventions. While an abundance of data (e.g., the 1592 writing metrics) can be collected by the AES tool, these data might still be insufficient to automate the scoring process of certain rubrics (e.g., ideas). Nevertheless, some scoring subtasks, such as assessing a student's vocabulary, sentence fluency, and conventions, might still be assigned to AI since the data types available through existing automatic linguistic analysis tools prove sufficient to reliably alleviate the human marker's workload. Interestingly, learning analytics is key to the accountability of AI agents to the human problem holder. As the volume of writing data (through a large student population, high-frequency capture of learning episodes, and a variety of big learning data) accumulates in the system, new AI agents (predictive models) may apply for the job of "automarker." These AI agents can be quite transparent through XAI (Arrieta et al., 2020) explanation models, and a human instructor may assess the suitability of an agent for the job and hire the candidate agent that comes closest to human performance. Explanations derived from these models could serve as formative feedback to the students.

The AI marker can be assigned to assess writing activities that are similar to those previously scored by the human marker(s) from whom it learns. Dissimilar and unseen essays can be automatically assigned to the human marker for reliable scoring, and the AI agent can learn from this manual scoring. To ensure accountability, students should be allowed to appeal the AI agent's marking to the human marker. In addition, the human marker should be empowered to monitor and validate the scoring of select writing rubrics scored by the AI marker. If the human marker does not agree with the machine scores, the writing assignments may be flagged as incorrectly scored and re-assigned to a human marker; these flagged assignments may serve to update the predictive models. Moreover, among the essays assigned to the machine marker, a small subset can be simultaneously assigned to the human marker for continuous quality control, that is, to keep verifying that the agreement level between human and machine markers remains within an acceptable threshold. The human marker should at any time be able to "fire" an AI marker or "hire" an AI marker from a pool of potential machine markers.

This notion of a human-AI fusion has been observed in previous AES systems, where the human marker's workload was significantly alleviated, dropping from scoring several hundred essays to just a few dozen (Dronen et al., 2015; Hellman et al., 2019). As the AES technology matures and as learning analytics tools continue to penetrate the education market, this alliance of semi-autonomous human and AI agents will lead to better evidence-based/informed pedagogy (Nelson and Campbell, 2017). Such a human-AI alliance can also be guided to autonomously self-regulate its own hypothesis-authoring and data-acquisition processes for purposes of measuring and advancing knowledge boundaries in human learning.

Real-Time Formative Pedagogical Feedback

This paper provides evidence that deep learning and SHAP can be used not only to score essays automatically but also to offer explanations in real time. More specifically, the processing time needed to derive the 314 explanations of the testing set's essays was benchmarked for several types of explainers. It was found that the faster DeepSHAP and GradientSHAP implementations, which took only a few seconds of processing, did not produce less accurate explanations than the much slower KernelSHAP. KernelSHAP took approximately 8 h of processing to derive the explanation model of a 2-layer MLP neural network predictive model and 16 h for the 6-layer predictive model.
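A benchmark of this kind can be reproduced along the following lines; this is a sketch under assumptions (a Keras-style model and hypothetical arrays X_background, X_train, and X_test), not the exact script used in this study:

    import time
    import shap

    def benchmark(name, explainer):
        start = time.time()
        sv = explainer.shap_values(X_test)          # the 314 testing-set essays
        print(f"{name}: {time.time() - start:.1f} s")
        return sv

    sv_grad = benchmark("GradientSHAP", shap.GradientExplainer(model, X_background))
    sv_deep = benchmark("DeepSHAP", shap.DeepExplainer(model, X_background))

    # KernelSHAP is model-agnostic and much slower; the background is usually
    # summarized (e.g., with shap.kmeans) to keep its runtime manageable.
    kernel = shap.KernelExplainer(lambda X: model.predict(X).flatten(),
                                  shap.kmeans(X_train, 50))
    sv_kern = benchmark("KernelSHAP", kernel)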

This finding also holds for various configurations of KernelSHAP, where the number of features (16 vs. 32) shaping the explanation (with all other features assigned zero contributions) did not produce a statistically significant difference in the reliability of the explanation models. On average, the models had a precision between 63.9 and 64.1% and a recall between 41.0 and 42.9%. This means that, after perturbation of the predictive and explanation models, on average 64% of the predictions the explanation model identified as changing did in fact change. On the other hand, only about 42% of all predictions that changed were detected by the various 2-layer explainers. An explanation was considered untrustworthy if the sum of its feature contributions, added to the average prediction (base value), was not within 0.1 of the perturbed prediction. Similarly, the average precision and recall of 2-layer explainers for the 0.25 threshold were about 69% and 62%, respectively.
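The additivity check behind this trustworthiness criterion can be expressed in a few lines. The sketch below covers only the thresholding step and assumes sv_test, base_value, and preds (the perturbed model's predictions) as hypothetical arrays; the full procedure additionally compares which predictions the explanation model flags as changed against those that actually changed.

    import numpy as np

    # Reconstruct each prediction from its explanation (additivity property).
    reconstructed = base_value + sv_test.sum(axis=1)
    error = np.abs(reconstructed - preds)

    for threshold in (0.10, 0.25):
        trustworthy = error <= threshold
        print(f"threshold {threshold:.2f}: "
              f"{trustworthy.mean() * 100:.1f}% of explanations are trustworthy")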

Impact of Deep Learning on Descriptive Accuracy of Explanations

By analyzing the performance of the various predictive models in Table 4, no clear conclusion can be reached as to which model should be deemed the most desirable. Although the 6-layer models slightly outperform the other models in terms of accuracy (percentage of exact matches between the resolved [human] and predicted [machine] scores), they are not the best when it comes to the percentages of adjacent (within 1 and 2) matches. Even if the selection of the "best" model is based on the quadratic weighted kappas, the decision remains a nebulous one. Moreover, ensuring that machine learning actually learned something meaningful remains paramount, especially in contexts where the performance of a majority classifier is close to both human and machine performance. For example, a majority classifier would get 46.3% of predictions right (Table 4), while the trained predictive models at best produce accurate predictions between 51.9 and 55.1% of the time.

Since the interpretability of a machine learning model should be prioritized over accuracy (Ribeiro et al., 2016; Murdoch et al., 2019) for reasons of transparency and trust, this paper investigated whether the impact of the depth of an MLP neural network might be more visible when assessing its interpretability, that is, the trustworthiness of its corresponding SHAP explanation model. The data in Tables 1, 5, 6 effectively support the hypothesis that as the depth of the neural network increases, the precision and recall of the corresponding explanation model improve. This observation is particularly interesting because the 4-layer (Grad) explainer, which has hardly more parameters than the 2-layer model, is also more accurate than the 2-layer model, suggesting that the 6-layer explainer is most likely superior to the other explainers not only because of its greater number of parameters but also because of its number of hidden layers. By increasing the number of hidden layers, the precision and recall of an explanation model pass on average from approximately 64 to 73% and from 42 to 52%, respectively, for the 0.10 threshold; and, for the 0.25 threshold, from 69 to 79% and from 62 to 75%, respectively.

These results imply that the descriptive accuracy of an explanation model is evidence of effective machine learning, evidence that may go beyond what the level of agreement between the human and machine graders reveals. Moreover, given that the superiority of a trained predictive model over a majority classifier is not always obvious, the consistency of its associated explanation model demonstrates it better. Note that, theoretically, the SHAP explanation model of a majority classifier should assign a zero contribution to each writing feature, since the average prediction of such a model is simply the most frequent rubric score given by the human raters; hence, the base value is the entire explanation.
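This theoretical point can be verified empirically with a toy constant predictor. The sketch below is an illustration under assumptions (a hypothetical majority score of 3.0 and a numpy X_test), not an analysis performed in this study:

    import numpy as np
    import shap

    majority_score = 3.0                                   # hypothetical mode
    constant_model = lambda X: np.full(len(X), majority_score)

    expl = shap.KernelExplainer(constant_model, shap.kmeans(X_test, 10))
    sv = expl.shap_values(X_test[:5], nsamples=100)

    print(expl.expected_value)     # the base value equals the majority score
    print(np.abs(sv).max())        # ~0: every feature gets a zero contribution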

An interesting fact emerges from Figure 3: all explainers (2-layer to 6-layer) are broadly similar and do not contradict each other. More specifically, they all agree on the direction of the contributions of the most important features; in other words, they unanimously determine whether a feature should increase or decrease the predicted score. However, they differ from each other in the magnitude of the feature contributions.

To conclude, this study highlights the need to train predictive models that take the descriptive accuracy of explanations into account. Just as explanation models build on predictions to derive explanations, explanations should in turn be considered when training predictive models. This would not only help train interpretable models from the outset but also potentially break the status quo that may exist among similar explainers and produce more powerful models. In addition, this research calls for a mechanism (e.g., causal diagrams) that allows teachers to guide the training process of predictive models. Put another way, as LA practitioners debug predictive models, their insights should be encoded in a language that the machine understands and that guides the training process, so that the same errors are not learned again and training time is reduced.

Accountable AES

Now that the superiority of the 6-layer predictive and explanation models has been demonstrated, some aspects of the relevancy of the explanations should be examined more deeply, keeping in mind that an explanation model consistent with its underlying predictive model does not guarantee relevant explanations. Table 7 discloses the set of four essays that maximizes the coverage of the most globally important features, used here to evaluate the correctness of the best AES model. It is quite intriguing that two of the four essays are among the 16 essays with a major disagreement (off by 2) between the resolved and predicted rubric scores (1 vs. 3 and 4 vs. 2). The AES tool clearly overrated Essay 228, while it underrated Essay 219. Naturally, these two essays offer an opportunity to understand what is wrong with the model and ultimately to debug it to improve its accuracy and interpretability.

In particular, Essay 228 raises suspicion about the positive contributions of features such as "Ortho_N," "lemma_mattr," "all_logical," "det_pobj_deps_struct," and "dobj_per_cl." Moreover, notice how the remaining 262 less important features (not visible in the decision plot in Figure 5) have already inflated the rubric score beyond the base value, more than for any other essay. Given the very short length and very low quality of the essay, whose meaning is seriously undermined by spelling and grammatical errors, it is of utmost importance to verify how some of these features are computed. For example, is the average number of orthographic neighbors (Ortho_N) per token computed for meaningless tokens such as "R" and "whe"? Similarly, are these tokens counted as types in the type-token ratio over lemmas (lemma_mattr)? Given the absence of a meaningful grammatical structure conveying a complete idea through well-articulated words, it becomes obvious that the quality of NLP (natural language processing) parsing may become a source of (measurement) bias affecting both the way some writing features are computed and the predicted rubric score. To remedy this, two solutions are proposed: (1) enhancing the dataset with the part-of-speech sequence or the structure of dependency relationships along with associated confidence levels, or (2) augmenting the essay dataset with essays containing various types of nonsensical content to improve the learning of these feature contributions.

Note that all four essays are shorter than the average text length of 171 words. Notice also how "hdd42_aw" and "hdd42_fw" play a significant role in decreasing the predicted scores of Essays 228 and 68. The reader should note that these metrics require a minimum of 42 tokens to compute a non-zero D index, a measure of lexical diversity explained in Supplementary Table 1. Figure 6B also shows how zero "hdd42_aw" values are heavily penalized. This is further evidence of the strong role that the number of words plays in determining these rubric scores, especially for very short essays, where it is one of the few observations that can be reliably recorded.

Two other issues with the best trained AES model were identified. First, in the eyes of the model, the lower the average number of direct objects per clause (dobj_per_cl), as seen in Figure 6D, the better. This appears to contradict one of the requirements of the "Style" rubric, which looks for a variety of sentence structures. Remember that direct objects imply the presence of transitive verbs (action verbs) and that the balanced usage of linking and action verbs, as well as of transitive and intransitive verbs, is key to meeting the requirement of variety of sentence structures. Moreover, note that the writing feature counts the number of direct objects per clause, not per sentence, so only one direct object is possible per clause. A sentence, on the other hand, may contain several clauses, which determines whether it is a simple, compound, or complex sentence. This also means that a sentence may have multiple direct objects and that a high ratio of direct objects per clause is indicative of sentence complexity; too much complexity is also undesirable. Hence, it is fair to conclude that the higher range of feature values has reasonable feature contributions (SHAP values), while the lower range does not capture well the requirements of the rubric; the dependence plot should rather display a positive peak somewhere in the middle. Notice how the poor quality of Essay 228's single sentence prevented the proper detection of its single direct object, "broke my finger," and how the so-called absence of direct objects was one of the reasons the predicted rubric score was wrongfully inflated.

The model's second issue discussed here is the presence of sentence fragments, a type of grammatical error. Essentially, a sentence fragment is a clause that is missing one of three critical components: a subject, a verb, or a complete idea. Figure 6E shows the contribution model of grammatical errors, all types combined, while Figure 6F shows specifically the contribution model of sentence fragments. It is interesting to see how SHAP further penalizes larger numbers of grammatical errors and how it takes into account the length of the essay (red dots represent essays with larger numbers of words; blue dots represent essays with smaller numbers of words). For example, except for essays with no identified grammatical errors, longer essays are less penalized than shorter ones; this is particularly obvious when there are 2−4 grammatical errors. The model increases the predicted rubric score only when there is no grammatical error, and it tolerates longer essays with only one grammatical error, which sounds quite reasonable. On the other hand, the model treats high numbers of sentence fragments, a non-trivial type of grammatical error, as desirable. Even worse, it decreases the rubric score of essays having no sentence fragment. Although grammatical issues are beyond the scope of the "Style" rubric, the model has probably included these features because of their impact on the quality of assessment of vocabulary usage and sentence fluency. The reader should observe how the very poor quality of an essay can even prevent the detection of such fundamental grammatical errors, as in the case of Essay 228, where the AES tool did not find any grammatical error or sentence fragment. Therefore, AES systems should have a way to detect a minimum level of text quality before attempting to score an essay. Note that the objective of this section was not to undertake a thorough debugging of the model, but rather to underscore the effectiveness of SHAP in doing so.

Formative Feedback

Once an AES model is considered reasonably valid, SHAP can be a suitable formalism to empower the machine to provide formative feedback. For instance, the explanation of Essay 124, which was assigned a rubric score of 3 by both human and machine markers, indicates that the top two factors decreasing the predicted rubric score are: (1) the essay length being smaller than average, and (2) the average number of verb lemma types occurring at least once in the next sentence (adjacent_overlap_verb_sent). Figures 6A,H give the overall picture in which the realism of the contributions of these two features can be analyzed. More specifically, Essay 124 is one of very few essays (Figure 6H) that make redundant use of the same verbs across adjacent sentences. Moreover, the essay displays poor sentence fluency, with everything expressed in only two sentences. To understand more accurately the impact of "adjacent_overlap_verb_sent" on the prediction, a few spelling errors were corrected and the text was divided into four sentences instead of two. Revision 1 in Table 8 exhibits the corrections made to the original essay. The decision plot's dashed line in Figure 3D represents the original explanation of Essay 124, while Figure 7A demonstrates the new explanation of the revised essay. It can be seen that the "adjacent_overlap_verb_sent" feature is still the second most important feature in the new explanation of Essay 124, with a feature value of 0.429, still considered very poor according to the dependence plot in Figure 6H.


Table 8. Revisions of Essay 124: improvement of sentence splitting, correction of some spelling errors, and elimination of redundant usage of same verbs (bold for emphasis in Essay 124’s original version; corrections in bold for Revisions 1 and 2).


Figure 7. Explanations of the various versions of Essay 124 and evaluation of feature effect for a range of feature values. (A) Explanation of Essay 124’s first revision. (B) Forecasting the effect of changing the ‘adjacent_overlap_verb_sent’ feature on the rubric score. (C) Explanation of Essay 124’s second revision. (D) Comparison of the explanations of all Essay 124’s versions.

To show how SHAP could be leveraged to offer remedial formative feedback, the revised version of Essay 124 is explained again for eight different values of "adjacent_overlap_verb_sent" (0, 0.143, 0.286, 0.429, 0.571, 0.714, 0.857, 1.0), while keeping the values of all other features constant. This set of eight essays is explained by a newly trained SHAP (Gradient) explainer, producing new SHAP values for each feature and each "revised" essay. Notice how the new model, called the feedback model, allows one to foresee by how much a novice writer can hope to improve his/her score according to the "Style" rubric. If the student employs different verbs in every sentence, the feedback model estimates that the rubric score could be improved from 3.47 up to 3.65 (Figure 7B). The dashed line represents Revision 1, while each of the other lines simulates one of the seven altered essays. Moreover, it is important to note how changing the value of a single feature may influence the contributions that other features have on the predicted score. Again, all explanations look similar in terms of direction, but certain features differ in the magnitude of their contributions. The reader should observe, however, how the targeted feature varies not only in magnitude but also in direction, allowing the student to ponder the relevancy of executing the recommended writing strategy.
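A feedback model of this kind can be sketched as follows, assuming x_rev1 (Revision 1's 282 feature values), the trained model, the Gradient explainer, and feature_names; these are hypothetical names standing in for this study's artifacts.

    import numpy as np

    idx = feature_names.index("adjacent_overlap_verb_sent")
    values = np.linspace(0.0, 1.0, 8)          # 0, 0.143, ..., 0.857, 1.0

    variants = np.tile(x_rev1, (len(values), 1))
    variants[:, idx] = values                  # vary one feature, freeze the rest

    scores = model.predict(variants).flatten()
    sv = explainer.shap_values(variants)
    sv = sv[0] if isinstance(sv, list) else sv

    for v, s in zip(values, scores):
        print(f"adjacent_overlap_verb_sent = {v:.3f} -> predicted score {s:.2f}")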

Thus, upon receiving this feedback, assume that the student sets the goal of improving the effectiveness of his/her verb choice by eliminating any redundant verb, producing Revision 2 in Table 8. The student submits the essay again to the AES system, which gives a new rubric score of 3.98, a significant improvement over the previous 3.47 that allows the student to get a 4 instead of a 3. Figure 7C exhibits the decision plot of Revision 2. To better observe how the various revisions of the student's essay changed over time, their respective explanations have been plotted in the same decision plot (Figure 7D). Notice that this time the ordering of the features has changed to list the features of common importance to all of the essay's versions. The feature ordering in Figures 7A−C complies with the ordering in Figure 3D, the decision plot of the original essay. These figures underscore the importance of tracking the interaction between the various features so that the model properly understands the impact that changing one feature has on the others. TreeSHAP, an implementation for tree-based models, offers this capability, and its potential for improving the quality of feedback provided to novice writers will be tested in a future version of this AES system.

This paper serves as a proof of concept of the applicability of XAI techniques to automated essay scoring, providing learning analytics practitioners and educators with a methodology for "hiring" AI markers and making them accountable to their human counterparts. In addition to debugging predictive models, SHAP explanation models can serve as a formalism within a broader learning analytics platform, where aspects of prescriptive analytics (provision of remedial formative feedback) can be added on top of the more pervasive predictive analytics.

However, the main weakness of the approach put forward in this paper is that it omits many types of spatio-temporal data. In other words, it ignores precious information inherent to the writing process, which may prove essential to inferring the intent of the student, especially in contexts of poor sentence structure and high grammatical inaccuracy. Hence, this paper calls for adapting current NLP technologies to educational purposes, where the quality of writing may be suboptimal, contrary to many idealized scenarios where NLP is used for content analysis, opinion mining, topic modeling, or fact extraction trained on corpora of high-quality texts. By capturing the writing process preceding the submission of an essay to an AES tool, other kinds of explanation models could also be trained to offer feedback not only from a linguistic perspective but also from a behavioral one (e.g., composing vs. revising); that is, the AES system could inform novice writers about suboptimal and optimal writing strategies (e.g., planning a revision phase after bursts of writing).

In addition, associating sections of text with suboptimal writing features, those whose contributions lower the predicted score, would be much more informative. This spatial information would not only point out what is wrong but also where it is wrong, answering more efficiently the question of why an essay is weak. This problem could be approached simply through a multiple-input, mixed-data, feature-based (MLP) neural network architecture fed by both linguistic indices and textual data (n-grams), where the SHAP explanation model would assign feature contributions to both types of features and any potential interaction between them. A more complex approach could address the problem through special types of recurrent neural networks such as Ordered-Neurons LSTMs (long short-term memory), which are well adapted to the parsing of natural language and capture not only the natural sequence of the text but also its hierarchy of constituents (Shen et al., 2018). After all, this paper highlights the fact that the potential of deep learning reaches beyond the training of powerful predictive models and is better seen in the higher trustworthiness of explanation models. This paper also calls for optimizing the training of predictive models by considering the descriptive accuracy of explanations and the human expert's qualitative knowledge (e.g., indicating the direction of feature contributions) during the training process.
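As a rough sketch of the multi-input architecture suggested above (dimensions, layer sizes, and input names are illustrative assumptions, not a tested design):

    from tensorflow.keras import layers, Model

    # Two inputs: the linguistic indices and a bag-of-n-grams vector
    # (e.g., produced by a CountVectorizer); 282 and 5000 are placeholders.
    indices_in = layers.Input(shape=(282,), name="linguistic_indices")
    ngrams_in = layers.Input(shape=(5000,), name="ngram_counts")

    x1 = layers.Dense(64, activation="relu")(indices_in)
    x2 = layers.Dense(64, activation="relu")(ngrams_in)

    merged = layers.concatenate([x1, x2])
    hidden = layers.Dense(32, activation="relu")(merged)
    rubric_score = layers.Dense(1, name="style_score")(hidden)

    model = Model(inputs=[indices_in, ngrams_in], outputs=rubric_score)
    model.compile(optimizer="adam", loss="mse")
    # model.fit([X_indices, X_ngrams], y_style, epochs=50, validation_split=0.1)

A SHAP explainer trained on top of such a model would then attribute contributions to both the linguistic indices and the n-gram features.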

Data Availability Statement

The datasets and code of this study can be found in the Open Science Framework online repositories at https://osf.io/fxvru/ .

Author Contributions

VK architected the concept of an ethics-bound, semi-autonomous, and trust-enabled human-AI fusion that measures and advances knowledge boundaries in human learning, which essentially defines the key traits of learning analytics. DB was responsible for its implementation in the area of explainable automated essay scoring and for the training and validation of the predictive and explanation models. Together they offer an XAI-based proof of concept of a prescriptive model that can offer real-time formative remedial feedback to novice writers. Both authors contributed to the article and approved its publication.

Funding

Research reported in this article was supported by the Academic Research Fund (ARF) publication grant of Athabasca University under award number 24087.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/feduc.2020.572367/full#supplementary-material

  • ^ https://www.kaggle.com/c/asap-aes
  • ^ https://www.linguisticanalysistools.org/

Abbass, H. A. (2019). Social integration of artificial intelligence: functions, automation allocation logic and human-autonomy trust. Cogn. Comput. 11, 159–171. doi: 10.1007/s12559-018-9619-0


Adadi, A., and Berrada, M. (2018). Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138–52160. doi: 10.1109/ACCESS.2018.2870052

Amorim, E., Cançado, M., and Veloso, A. (2018). “Automated essay scoring in the presence of biased ratings,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , New Orleans, LA, 229–237.


Arrieta, A. B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., et al. (2020). Explainable Artificial Intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inform. Fusion 58, 82–115. doi: 10.1016/j.inffus.2019.12.012

Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., et al. (2007). The English lexicon project. Behav. Res. Methods 39, 445–459. doi: 10.3758/BF03193014


Boulanger, D., and Kumar, V. (2018). “Deep learning in automated essay scoring,” in Proceedings of the International Conference of Intelligent Tutoring Systems , eds R. Nkambou, R. Azevedo, and J. Vassileva (Cham: Springer International Publishing), 294–299. doi: 10.1007/978-3-319-91464-0_30

Boulanger, D., and Kumar, V. (2019). “Shedding light on the automated essay scoring process,” in Proceedings of the International Conference on Educational Data Mining , 512–515.

Boulanger, D., and Kumar, V. (2020). “SHAPed automated essay scoring: explaining writing features’ contributions to English writing organization,” in Intelligent Tutoring Systems , eds V. Kumar and C. Troussas (Cham: Springer International Publishing), 68–78. doi: 10.1007/978-3-030-49663-0_10

Chen, H., Lundberg, S., and Lee, S.-I. (2019). Explaining models by propagating Shapley values of local components. arXiv [Preprint]. Available online at: https://arxiv.org/abs/1911.11888 (accessed September 22, 2020).

Crossley, S. A., Bradfield, F., and Bustamante, A. (2019). Using human judgments to examine the validity of automated grammar, syntax, and mechanical errors in writing. J. Writ. Res. 11, 251–270. doi: 10.17239/jowr-2019.11.02.01

Crossley, S. A., Kyle, K., and McNamara, D. S. (2016). The tool for the automatic analysis of text cohesion (TAACO): automatic assessment of local, global, and text cohesion. Behav. Res. Methods 48, 1227–1237. doi: 10.3758/s13428-015-0651-7

Crossley, S. A., Kyle, K., and McNamara, D. S. (2017). Sentiment analysis and social cognition engine (SEANCE): an automatic tool for sentiment, social cognition, and social-order analysis. Behav. Res. Methods 49, 803–821. doi: 10.3758/s13428-016-0743-z

Dronen, N., Foltz, P. W., and Habermehl, K. (2015). “Effective sampling for large-scale automated writing evaluation systems,” in Proceedings of the Second (2015) ACM Conference on Learning @ Scale , 3–10.

Goldin, I., Narciss, S., Foltz, P., and Bauer, M. (2017). New directions in formative feedback in interactive learning environments. Int. J. Artif. Intellig. Educ. 27, 385–392. doi: 10.1007/s40593-016-0135-7

Hao, Q., and Tsikerdekis, M. (2019). “How automated feedback is delivered matters: formative feedback and knowledge transfer,” in Proceedings of the 2019 IEEE Frontiers in Education Conference (FIE) , Covington, KY, 1–6.

Hellman, S., Rosenstein, M., Gorman, A., Murray, W., Becker, L., Baikadi, A., et al. (2019). “Scaling up writing in the curriculum: batch mode active learning for automated essay scoring,” in Proceedings of the Sixth (2019) ACM Conference on Learning @ Scale , (New York, NY: Association for Computing Machinery).

Hussein, M. A., Hassan, H., and Nassef, M. (2019). Automated language essay scoring systems: a literature review. PeerJ Comput. Sci. 5:e208. doi: 10.7717/peerj-cs.208

Kumar, V., and Boulanger, D. (2020). Automated essay scoring and the deep learning black box: how are rubric scores determined? Int. J. Artif. Intellig. Educ. doi: 10.1007/s40593-020-00211-5

Kumar, V., Fraser, S. N., and Boulanger, D. (2017). Discovering the predictive power of five baseline writing competences. J. Writ. Anal. 1, 176–226.

Kyle, K. (2016). Measuring Syntactic Development In L2 Writing: Fine Grained Indices Of Syntactic Complexity And Usage-Based Indices Of Syntactic Sophistication. Dissertation, Georgia State University, Atlanta, GA.

Kyle, K., Crossley, S., and Berger, C. (2018). The tool for the automatic analysis of lexical sophistication (TAALES): version 2.0. Behav. Res. Methods 50, 1030–1046. doi: 10.3758/s13428-017-0924-4

Lundberg, S. M., Erion, G. G., and Lee, S.-I. (2018). Consistent individualized feature attribution for tree ensembles. arXiv [Preprint]. Available online at: https://arxiv.org/abs/1802.03888 (accessed September 22, 2020).

Lundberg, S. M., and Lee, S.-I. (2017). “A unified approach to interpreting model predictions,” in Advances in Neural Information Processing Systems , eds I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, et al. (Red Hook, NY: Curran Associates, Inc), 4765–4774.

Madnani, N., and Cahill, A. (2018). “Automated scoring: beyond natural language processing,” in Proceedings of the 27th International Conference on Computational Linguistics , (Santa Fe: Association for Computational Linguistics), 1099–1109.

Madnani, N., Loukina, A., von Davier, A., Burstein, J., and Cahill, A. (2017). “Building better open-source tools to support fairness in automated scoring,” in Proceedings of the First (ACL) Workshop on Ethics in Natural Language Processing , (Valencia: Association for Computational Linguistics), 41–52.

McCarthy, P. M., and Jarvis, S. (2010). MTLD, vocd-D, and HD-D: a validation study of sophisticated approaches to lexical diversity assessment. Behav. Res. Methods 42, 381–392. doi: 10.3758/brm.42.2.381

Mizumoto, T., Ouchi, H., Isobe, Y., Reisert, P., Nagata, R., Sekine, S., et al. (2019). “Analytic score prediction and justification identification in automated short answer scoring,” in Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications , Florence, 316–325.

Molnar, C. (2020). Interpretable Machine Learning . Abu Dhabi: Lulu

Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., and Yu, B. (2019). Definitions, methods, and applications in interpretable machine learning. Proc. Natl. Acad. Sci. U.S.A. 116, 22071–22080. doi: 10.1073/pnas.1900654116

Nelson, J., and Campbell, C. (2017). Evidence-informed practice in education: meanings and applications. Educ. Res. 59, 127–135. doi: 10.1080/00131881.2017.1314115

Rahimi, Z., Litman, D., Correnti, R., Wang, E., and Matsumura, L. C. (2017). Assessing students’ use of evidence and organization in response-to-text writing: using natural language processing for rubric-based automated scoring. Int. J. Artif. Intellig. Educ. 27, 694–728. doi: 10.1007/s40593-017-0143-2

Reinertsen, N. (2018). Why can’t it mark this one? A qualitative analysis of student writing rejected by an automated essay scoring system. English Austral. 53:52.

Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "Why should I trust you?": explaining the predictions of any classifier. arXiv [Preprint]. Available online at: http://arxiv.org/abs/1602.04938 (accessed September 22, 2020).

Rupp, A. A. (2018). Designing, evaluating, and deploying automated scoring systems with validity in mind: methodological design decisions. Appl. Meas. Educ. 31, 191–214. doi: 10.1080/08957347.2018.1464448

Rupp, A. A., Casabianca, J. M., Krüger, M., Keller, S., and Köller, O. (2019). Automated essay scoring at scale: a case study in Switzerland and Germany. ETS Res. Rep. Ser. 2019, 1–23. doi: 10.1002/ets2.12249

Shen, Y., Tan, S., Sordoni, A., and Courville, A. C. (2018). Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks. CoRR, abs/1810.0. arXiv [Preprint]. Available online at: http://arxiv.org/abs/1810.09536 (accessed September 22, 2020).

Shermis, M. D. (2014). State-of-the-art automated essay scoring: competition, results, and future directions from a United States demonstration. Assess. Writ. 20, 53–76. doi: 10.1016/j.asw.2013.04.001

Taghipour, K. (2017). Robust Trait-Specific Essay Scoring using Neural Networks and Density Estimators. Dissertation, National University of Singapore, Singapore.

West-Smith, P., Butler, S., and Mayfield, E. (2018). “Trustworthy automated essay scoring without explicit construct validity,” in Proceedings of the 2018 AAAI Spring Symposium Series , (New York, NY: ACM).

Woods, B., Adamson, D., Miel, S., and Mayfield, E. (2017). “Formative essay feedback using predictive scoring models,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , (New York, NY: ACM), 2071–2080.

Keywords : explainable artificial intelligence, SHAP, automated essay scoring, deep learning, trust, learning analytics, feedback, rubric

Citation: Kumar V and Boulanger D (2020) Explainable Automated Essay Scoring: Deep Learning Really Has Pedagogical Value. Front. Educ. 5:572367. doi: 10.3389/feduc.2020.572367

Received: 14 June 2020; Accepted: 09 September 2020; Published: 06 October 2020.


Copyright © 2020 Kumar and Boulanger. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: David Boulanger, [email protected]

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.


An automated essay scoring systems: a systematic literature review

Dadi Ramesh

1 School of Computer Science and Artificial Intelligence, SR University, Warangal, TS India

2 Research Scholar, JNTU, Hyderabad, India

Suresh Kumar Sanampudi

3 Department of Information Technology, JNTUH College of Engineering, Nachupally, Kondagattu, Jagtial, TS India


Assessment in the education system plays a significant role in judging student performance. The present evaluation system relies on human assessment. As the student-to-teacher ratio gradually increases, the manual evaluation process becomes complicated; its drawbacks include being time-consuming and lacking reliability, among others. In this connection, online examination systems have evolved as an alternative to pen-and-paper-based methods. Present computer-based evaluation systems work only for multiple-choice questions, but there is no proper evaluation system for grading essays and short answers. Many researchers have worked on automated essay grading and short answer scoring over the last few decades, but assessing an essay by considering all parameters, such as the relevance of the content to the prompt, development of ideas, cohesion, and coherence, remains a big challenge. Few researchers focused on content-based evaluation, while many of them addressed style-based assessment. This paper provides a systematic literature review on automated essay scoring systems. We studied the Artificial Intelligence and Machine Learning techniques used for automatic essay scoring and analyzed the limitations of the current studies and research trends. We observed that essay evaluation is not done based on the relevance of the content and coherence.

Supplementary Information

The online version contains supplementary material available at 10.1007/s10462-021-10068-2.

Introduction

Due to the COVID-19 outbreak, an online educational system has become inevitable. In the present scenario, almost all educational institutions, ranging from schools to colleges, have adopted online education. Assessment plays a significant role in measuring the learning ability of the student. Most automated evaluation is available for multiple-choice questions, but assessing short and essay answers remains a challenge. The education system is shifting to online mode, conducting computer-based exams with automatic evaluation. This is a crucial application in the education domain that uses natural language processing (NLP) and Machine Learning techniques. The evaluation of essays is impossible with simple programming and simple techniques like pattern matching and basic language processing. The problem is that a single question elicits many different responses from students, each with its own explanation, and all of the answers must be evaluated with respect to the question.

Automated essay scoring (AES) is a computer-based assessment system that automatically scores or grades student responses by considering appropriate features. AES research started in 1966 with the Project Essay Grader (PEG) by Ajay et al. (1973). PEG evaluates writing characteristics such as grammar, diction, and construction to grade the essay. A modified version of PEG by Shermis et al. (2001) was released, which focuses on grammar checking with a correlation between human evaluators and the system. Foltz et al. (1999) introduced the Intelligent Essay Assessor (IEA), which evaluates content using latent semantic analysis to produce an overall score. E-rater, proposed by Powers et al. (2002), IntelliMetric by Rudner et al. (2006), and the Bayesian Essay Test Scoring System (BETSY) by Rudner and Liang (2002) use natural language processing (NLP) techniques that focus on style and content to obtain the score of an essay. The vast majority of essay scoring systems in the 1990s followed traditional approaches like pattern matching and statistical methods. Over the last decade, essay grading systems have started using regression-based and natural language processing techniques. AES systems developed since 2014, such as Dong et al. (2017), use deep learning techniques, inducing syntactic and semantic features and producing better results than earlier systems.

Ohio, Utah, and most US states are using AES systems in school education, such as the Utah Compose tool and the Ohio standardized test (an updated version of PEG), evaluating millions of student responses every year. These systems work for both formative and summative assessments and give feedback to students on their essays. Utah provided basic essay evaluation rubrics (six characteristics of essay writing): development of ideas, organization, style, word choice, sentence fluency, and conventions. Educational Testing Service (ETS) has been conducting significant research on AES for more than a decade and designed an algorithm to evaluate essays in different domains, providing an opportunity for test-takers to improve their writing skills. In addition, their current research addresses content-based evaluation.

The evaluation of essays and short answers should consider the relevance of the content to the prompt, development of ideas, cohesion, coherence, and domain knowledge. Proper assessment of the parameters mentioned above determines the accuracy of the evaluation system, but these parameters do not play an equal role in essay scoring and short answer scoring. In short answer evaluation, domain knowledge is required; for example, the meaning of "cell" is different in physics and biology. In essay evaluation, the development of ideas with respect to the prompt must be considered. The system should also assess the completeness of the responses and provide feedback.

Several studies have examined AES systems, from the earliest to the latest. Blood (2011) provided a literature review covering PEG from 1984 to 2010, but it addressed only general aspects of AES systems, such as ethical aspects and system performance; it did not cover the implementation part, was not a comparative study, and did not discuss the actual challenges of AES systems.

Burrows et al. (2015) reviewed AES systems along six dimensions: dataset, NLP techniques, model building, grading models, evaluation, and effectiveness of the model. They did not cover feature extraction techniques or the challenges in feature extraction, covered Machine Learning models only superficially, and did not provide a comparative analysis of AES systems; the level of relevance, cohesion, and coherence was also not covered in their review.

Ke et al. (2019) provided a state of the art of AES systems but covered very few papers, did not list all the challenges, and offered no comparative study of AES models. Hussein et al. (2019), on the other hand, studied two categories of AES systems, four papers using handcrafted features and four papers using neural network approaches; they discussed a few challenges but did not cover feature extraction techniques or the performance of AES models in detail.

Klebanov et al. (2020) reviewed 50 years of AES systems and listed and categorized all essential features that need to be extracted from essays, but they did not provide a comparative analysis of all the work and did not discuss the challenges.

This paper aims to provide a systematic literature review (SLR) on automated essay grading systems. An SLR is an evidence-based systematic review that summarizes existing research, critically evaluates and integrates the findings of all relevant studies, and addresses the research domain's specific research questions. Our research methodology uses the guidelines given by Kitchenham et al. (2009) for conducting the review process; they provide a well-defined approach to identify gaps in current research and to suggest further investigation.

We address our research method, research questions, and the selection process in Sect. 2; the results of the research questions are discussed in Sect. 3; the synthesis of all the research questions is addressed in Sect. 4; and the conclusion and possible future work are discussed in Sect. 5.

Research method

We framed the research questions using the PICOC criteria:

  • Population (P): student essays and answer evaluation systems.
  • Intervention (I): evaluation techniques, datasets, feature extraction methods.
  • Comparison (C): comparison of various approaches and results.
  • Outcomes (O): estimation of the accuracy of AES systems.
  • Context (C): not applicable.

Research questions

To collect and provide research evidence from the available studies in the domain of automated essay grading, we framed the following research questions (RQ):

RQ1: What are the datasets available for research on automated essay grading?

The answer to this question provides a list of the available datasets, their domain, and access to the datasets. It also provides the number of essays and corresponding prompts.

RQ2: What are the features extracted for the assessment of essays?

The answer to this question provides insight into the various features extracted so far and the libraries used to extract them.

RQ3: Which are the evaluation metrics available for measuring the accuracy of algorithms?

The answer identifies the evaluation metrics used for accurately measuring each Machine Learning approach and the most commonly used measurement techniques.

RQ4 What are the Machine Learning techniques used for automatic essay grading, and how are they implemented?

It can provide insights into various Machine Learning techniques like regression models, classification models, and neural networks for implementing essay grading systems. The response to the question can give us different assessment approaches for automated essay grading systems.

RQ5 What are the challenges/limitations in the current research?

The answer to this question identifies the limitations of existing research approaches with respect to cohesion, coherence, completeness, and feedback.

Search process

We conducted an automated search of well-known computer science repositories, namely ACL, ACM, IEEE Xplore, Springer, and Science Direct, for the SLR. We considered papers published from 2010 to 2020, as much of the work during these years focused on advanced technologies like deep learning and natural language processing for automated essay grading systems. The availability of free datasets, such as Kaggle (2012) and the Cambridge Learner Corpus-First Certificate in English exam (CLC-FCE) by Yannakoudakis et al. (2011), also encouraged research in this domain.

Search strings: We used search strings such as "Automated essay grading" OR "Automated essay scoring" OR "short answer scoring systems" OR "essay scoring systems" OR "automatic essay evaluation" and searched on metadata.

Selection criteria

After collecting all relevant documents from the repositories, we prepared selection criteria for the inclusion and exclusion of documents. With inclusion and exclusion criteria, the review becomes more accurate and specific.

Inclusion criteria 1: We work with datasets comprising essays written in English; essays written in other languages were excluded.

Inclusion criteria 2: We included papers implementing AI approaches and excluded traditional methods from the review.

Inclusion criteria 3: The study is on essay scoring systems, so we included only research carried out on text datasets rather than other data types such as images or speech.

Exclusion criteria: We removed papers in the form of review papers, survey papers, and state-of-the-art papers.

Quality assessment

In addition to the inclusion and exclusion criteria, we assessed each paper with quality assessment questions to ensure the article's quality. We included the documents that clearly explained the approach used, the result analysis, and the validation.

The quality checklist questions were framed based on the guidelines from Kitchenham et al. (2009). Each quality assessment question was graded as either 1 or 0, so the final score of a study ranges from 0 to 3. The cut-off score for excluding a study from the review was 2 points; hence, only papers scoring 2 or 3 points were included in the final evaluation. We framed the following quality assessment questions for the final study.

Quality Assessment 1: Internal validity.

Quality Assessment 2: External validity.

Quality Assessment 3: Bias.

Two reviewers reviewed each paper to select the final list of documents. We used the Quadratic Weighted Kappa score to measure the agreement between the two reviewers; the resulting average kappa score was 0.6942, a substantial agreement between the reviewers. The result of the evaluation criteria is shown in Table 1. After quality assessment, the final list of papers for review is shown in Table 2. The complete selection process is shown in Fig. 1, and the number of selected papers per year is shown in Fig. 2.
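For reference, a quadratic weighted kappa of this kind can be computed with scikit-learn; reviewer1 and reviewer2 below are hypothetical lists of the scores the two reviewers assigned to the same papers:

    from sklearn.metrics import cohen_kappa_score

    qwk = cohen_kappa_score(reviewer1, reviewer2, weights="quadratic")
    print(f"Quadratic Weighted Kappa: {qwk:.4f}")   # the review reports ~0.6942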

Table 1. Quality assessment analysis

Number of papers | Quality assessment score
50 | 3
12 | 2
59 | 1
23 | 0

Table 2. Final list of papers

Database | Paper count
ACL | 28
ACM | 5
IEEE Xplore | 19
Springer | 5
Other | 5
Total | 62

Fig. 1. Selection process.

Fig. 2. Year-wise publications.

What are the datasets available for research on automated essay grading?

To work on a problem statement, especially in the Machine Learning and deep learning domains, a considerable amount of data is required to train the models. To answer this question, we listed all the datasets used for training and testing automated essay grading systems. The Cambridge Learner Corpus-First Certificate in English exam (CLC-FCE) by Yannakoudakis et al. (2011) contains 1244 essays and ten prompts. This corpus evaluates whether a student can write relevant English sentences without any grammatical and spelling mistakes. This type of corpus helps to test models built for GRE- and TOEFL-type exams. It gives scores between 1 and 40.

Bailey and Meurers (2008) created a dataset (CREE reading comprehension) for language learners and automated short answer scoring systems; the corpus consists of 566 responses from intermediate students. Mohler and Mihalcea (2009) created a dataset for the computer science domain consisting of 630 responses to data structure assignment questions, with scores ranging from 0 to 5 given by two human raters.

Dzikovska et al. (2012) created the Student Response Analysis (SRA) corpus. It consists of two sub-groups: the BEETLE corpus, with 56 questions and approximately 3000 responses from students in the electrical and electronics domain, and the SCIENTSBANK (SemEval-2013) corpus (Dzikovska et al. 2013a; b), with 10,000 responses to 197 prompts in various science domains. The student responses are labeled as "correct, partially correct incomplete, contradictory, irrelevant, non-domain."

The Kaggle (2012) competition released three corpora as part of the Automated Student Assessment Prize (ASAP) ( https://www.kaggle.com/c/asap-sas/ ), covering essays and short answers. The essay corpus has nearly 17,450 essays and provides up to 3000 essays for each prompt; it has eight prompts that test 7th- to 10th-grade US students, with scores in the [0–3] and [0–60] ranges. The limitations of these corpora are: (1) the score range differs across prompts, and (2) essays are evaluated using statistical features such as named entity extraction and lexical features of words. ASAP++ is one more dataset from Kaggle, with six prompts, each having more than 1000 responses, for a total of 10,696 essays from 8th-grade students. Another corpus contains ten prompts from the science and English domains and a total of 17,207 responses. Two human graders evaluated all these responses.

Correnti et al. (2013) created the Response-to-Text Assessment (RTA) dataset, used to check student writing skills in all directions, such as style, mechanics, and organization. Students in grades 4–8 provided the responses to the RTA. Basu et al. (2013) created the Powergrading dataset with 700 responses to ten different prompts from US immigration exams; it contains only short answers for assessment.

The TOEFL11 corpus (Blanchard et al. 2013) contains 1100 essays evenly distributed over eight prompts. It is used to test the English language skills of candidates taking the TOEFL exam and scores a candidate's language proficiency as low, medium, or high.

For the International Corpus of Learner English (ICLE), Granger et al. (2009) built a corpus of 3663 essays covering different dimensions. It has 12 prompts with 1003 essays that test the organizational skill of essay writing, and 13 prompts, each with 830 essays, that examine thesis clarity and prompt adherence.

For Argument Annotated Essays (AAE), Stab and Gurevych (2014) developed a corpus that contains 102 essays with 101 prompts taken from the essayforum2 site; it tests the persuasiveness of the student essay. The SCIENTSBANK corpus used by Sakaguchi et al. (2015), available on GitHub, contains 9804 answers to 197 questions in 15 science domains. Table 3 lists all datasets related to AES systems.

All types of datasets used in automatic scoring systems

Data set | Language | Total responses | Number of prompts
Cambridge Learner Corpus-First Certificate in English exam (CLC-FCE) | English | 1244 | –
CREE | English | 566 | –
CS | English | 630 | –
SRA | English | 3000 | 56
SCIENTSBANK (SemEval-2013) | English | 10,000 | 197
ASAP-AES | English | 17,450 | 8
ASAP-SAS | English | 17,207 | 10
ASAP++ | English | 10,696 | 6
Power grading | English | 700 | –
TOEFL11 | English | 1100 | 8
International Corpus of Learner English (ICLE) | English | 3663 | –

Features play a major role in neural networks and other supervised Machine Learning approaches. Automatic essay grading systems score student essays based on different types of features, which play a prominent role in training the models. Based on their syntax and semantics, the features are categorized into three groups: (1) statistical features (Contreras et al. 2018; Kumar et al. 2019; Mathias and Bhattacharyya 2018a; b), (2) style-based (syntax) features (Cummins et al. 2016; Darwish and Mohamed 2020; Ke et al. 2019), and (3) content-based features (Dong et al. 2017). A good set of features combined with appropriate models yields better AES systems. The vast majority of researchers use regression models when the features are statistical; for neural network models, researchers use both style-based and content-based features. Table 4 lists the feature sets used for essay grading in existing AES systems.

Types of features

Statistical features | Style-based features | Content-based features
Essay length with respect to the number of words | Sentence structure | Cohesion between sentences in a document
Essay length with respect to sentences | POS | Overlapping (prompt)
Average sentence length | Punctuation | Relevance of information
Average word length | Grammatical | Semantic role of words
N-gram | Logical operators | Correctness
Vocabulary | – | Consistency
– | – | Sentences expressing key concepts

We studied all the feature-extracting NLP libraries used in the papers, as shown in Fig. 3. NLTK is an NLP tool used to retrieve statistical features like POS, word count, sentence count, etc. With NLTK alone, however, we can miss the essay's semantic features. To find semantic features, Word2Vec (Mikolov et al. 2013) and GloVe (Jeffrey Pennington et al. 2014) are the most used libraries to retrieve semantic representations from the essays, and some systems directly train the model on word embeddings to find the score. As observed from Fig. 4, non-content-based feature extraction is more common than content-based feature extraction.
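The following minimal sketch illustrates the two feature families discussed above: statistical features extracted with NLTK and a simple semantic essay vector from a gensim Word2Vec model. It assumes NLTK (with the punkt and averaged_perceptron_tagger data) and gensim are installed; the tiny corpus is for illustration only, since real systems train embeddings on large corpora or use pre-trained GloVe vectors.

```python
# Minimal sketch of the two feature families discussed above:
# statistical features via NLTK and semantic features via gensim's Word2Vec.
# Assumes the NLTK "punkt" and "averaged_perceptron_tagger" data are downloaded.
import nltk
from gensim.models import Word2Vec

essay = ("More and more people use computers, but not everyone agrees "
         "that this benefits society.")

# Statistical features: counts derived from tokenization and POS tagging.
sentences = nltk.sent_tokenize(essay)
tokens = nltk.word_tokenize(essay)
pos_tags = nltk.pos_tag(tokens)
stats = {
    "sentence_count": len(sentences),
    "word_count": len(tokens),
    "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
    "noun_count": sum(1 for _, tag in pos_tags if tag.startswith("NN")),
}

# Semantic features: a tiny Word2Vec model trained on the tokenized sentences
# (real systems would use a large corpus or pre-trained vectors).
tokenized = [nltk.word_tokenize(s.lower()) for s in sentences]
w2v = Word2Vec(tokenized, vector_size=50, window=3, min_count=1, epochs=20)
essay_vector = sum(w2v.wv[t] for t in tokenized[0]) / len(tokenized[0])

print(stats)
print(essay_vector[:5])
```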

Fig. 3 Usage of tools

Fig. 4 Number of papers on content-based features

RQ3: Which evaluation metrics are available for measuring the accuracy of algorithms?

The majority of AES systems use three evaluation metrics: (1) quadratic weighted kappa (QWK), (2) Mean Absolute Error (MAE), and (3) the Pearson Correlation Coefficient (PCC) (Shehab et al. 2016). Quadratic weighted kappa measures the agreement between the human evaluation score and the system evaluation score and produces a value ranging from 0 to 1. The Mean Absolute Error is the average absolute difference between the human-rated score and the system-generated score. The mean square error (MSE) measures the average of the squared errors, i.e., the average squared difference between the human-rated and the system-generated scores; MSE is always non-negative. Pearson's Correlation Coefficient (PCC) measures the correlation between the two scores: "0" means the human-rated and system scores are unrelated, "1" means the two scores increase together, and "−1" indicates a negative relationship between the two scores.
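As a concrete illustration, the sketch below computes the three metrics on a toy pair of human and system scores using scikit-learn and SciPy; the score values are invented for demonstration only.

```python
# Minimal sketch of the three evaluation metrics named above, computed on a
# toy pair of human and system scores.
from sklearn.metrics import cohen_kappa_score, mean_absolute_error
from scipy.stats import pearsonr

human_scores  = [2, 3, 4, 4, 1, 0, 3, 2]
system_scores = [2, 3, 3, 4, 1, 1, 3, 2]

qwk = cohen_kappa_score(human_scores, system_scores, weights="quadratic")
mae = mean_absolute_error(human_scores, system_scores)
pcc, _ = pearsonr(human_scores, system_scores)

print(f"QWK = {qwk:.3f}, MAE = {mae:.3f}, PCC = {pcc:.3f}")
```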

RQ4: What Machine Learning techniques are being used for automatic essay grading, and how are they implemented?

After scrutinizing all the documents, we categorize the techniques used in automated essay grading systems into four groups: (1) regression techniques, (2) classification models, (3) neural networks, and (4) ontology-based approaches.

All the AES systems developed in the last ten years employ supervised learning techniques. Researchers using supervised methods view the AES problem as either a regression or a classification task. The goal of the regression task is to predict the score of an essay, while the classification task is to classify essays as being of low, medium, or high relevance to the question's topic. In the last three years, most AES systems developed have made use of neural networks.

Regression based models

Mohler and Mihalcea (2009) proposed text-to-text semantic similarity to assign scores to student essays. They considered two families of text similarity measures: knowledge-based measures and corpus-based measures, evaluating eight knowledge-based measures in total. The shortest-path similarity is determined by the length of the shortest path between two concepts. Leacock & Chodorow compute similarity based on the length of the shortest path between two concepts using node counting. The Lesk similarity finds the overlap between the corresponding definitions, and the Wu & Palmer algorithm finds similarity based on the depth of the two given concepts in the WordNet taxonomy. Resnik, Lin, Jiang & Conrath, and Hirst & St-Onge compute similarity based on parameters such as concept probability, normalization factors, and lexical chains. Among the corpus-based measures are LSA trained on the BNC, LSA trained on Wikipedia, and ESA on Wikipedia; latent semantic analysis trained on Wikipedia has excellent domain knowledge. Among all similarity measures, LSA trained on Wikipedia achieved the highest correlation with human scores. However, these similarity algorithms do not use deeper NLP concepts. These pre-2010 models are basic concept models that later research extended with neural networks and content-based features.
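To make the knowledge-based measures concrete, the following minimal sketch computes two of them (shortest-path and Wu & Palmer similarity) over WordNet via NLTK. It assumes the NLTK wordnet corpus has been downloaded; the synsets chosen are purely illustrative.

```python
# Minimal sketch of two knowledge-based similarity measures mentioned above,
# computed over WordNet through NLTK (requires the "wordnet" corpus).
from nltk.corpus import wordnet as wn

teacher = wn.synset("teacher.n.01")
student = wn.synset("student.n.01")

print("path similarity:", teacher.path_similarity(student))
print("Wu & Palmer similarity:", teacher.wup_similarity(student))
```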

Adamson et al. (2014) proposed a statistical automatic essay grading system. They retrieved features such as POS, character count, word count, sentence count, misspelled words, and n-gram representations of words to prepare an essay vector. They formed a matrix from these vectors and applied LSA to assign a score to each essay. Being a purely statistical approach, it does not consider the semantics of the essay. The agreement they obtained between the human rater scores and the system is 0.532.

Cummins et al. (2016) proposed a Timed Aggregate Perceptron vector model to rank all the essays and later converted the ranking algorithm into a score predictor. The model was trained with features such as word unigrams, bigrams, POS, essay length, grammatical relations, maximum word length, and sentence length. It is a multi-task learning approach that both ranks essays and predicts their scores. The performance evaluated with QWK is 0.69, a substantial agreement between the human rater and the system.

Sultan et al. (2016) proposed a Ridge regression model for short answer scoring with question demoting. Question demoting is a concept included in the final assessment to discount words repeated from the prompt. The extracted features are text similarity (the similarity between the student response and the reference answer), question demoting (the number of prompt words repeated in the student response), term weights assigned with inverse document frequency, and the sentence length ratio (the number of words in the student response). With these features, the Ridge regression model achieved an accuracy of 0.887.
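A minimal sketch of this style of model is shown below: a Ridge regression fitted on a handful of handcrafted features. The feature values and scores are toy numbers, not taken from any of the cited datasets.

```python
# Minimal sketch of a Ridge regression scorer over handcrafted features,
# in the spirit of the statistical-feature models described above.
import numpy as np
from sklearn.linear_model import Ridge

# Each row: [text similarity to reference, prompt-word overlap, length ratio]
X_train = np.array([
    [0.82, 0.10, 1.05],
    [0.40, 0.35, 0.60],
    [0.91, 0.05, 1.10],
    [0.25, 0.50, 0.45],
])
y_train = np.array([4.5, 2.0, 5.0, 1.0])   # toy human-assigned scores

model = Ridge(alpha=1.0).fit(X_train, y_train)
print(model.predict(np.array([[0.70, 0.15, 0.95]])))
```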

Contreras et al. (2018) proposed an ontology-based text-mining model that scores essays in phases. In phase I, they generated ontologies with OntoGen and used an SVM to find concepts and similarity in the essay. In phase II, from the ontologies they retrieved features such as essay length, word counts, correctness, vocabulary, types of words used, and domain information. After retrieving this statistical data, they used a linear regression model to score the essay. The accuracy is, on average, 0.5.

Darwish and Mohamed (2020) proposed a fusion of fuzzy ontology with LSA. They retrieve two types of features: syntax features and semantic features. For the syntax features, they perform lexical analysis on the tokens and construct a parse tree; if the parse tree is broken, the essay is inconsistent, and a separate grade is assigned based on the syntax features. The semantic features include similarity analysis and spatial data analysis: similarity analysis finds duplicate sentences, and spatial data analysis finds the Euclidean distance between the center and the parts. They then combine the syntax and morphological feature scores into a final score. The accuracy achieved with the multiple linear regression model, based mostly on statistical features, is 0.77.

Süzen Neslihan et al. (2020) proposed a text-mining approach for short answer grading. First, they compare the model answer with the student response by calculating the distance between the two sentences; this comparison determines the completeness of the answer and provides feedback. In this approach, the model vocabulary plays a vital role in grading: based on this vocabulary, a grade is assigned to the student's response along with feedback. The correlation between student answers and model answers is 0.81.

Classification based models

Persing and Ng (2013) used a support vector machine to score the essays. The extracted features are POS, n-grams, and semantic text features used to train the model, and keywords identified in the essay are used to give the final score.

Sakaguchi et al. (2015) proposed two methods: response-based and reference-based. In response-based scoring, the extracted features are response length, an n-gram model, and syntactic elements used to train a support vector regression model. In reference-based scoring, features such as sentence similarity computed with word2vec are used, and the cosine similarity of the sentences gives the score of the response. The scores were first computed individually and then the two feature sets were combined to produce a final score; combining the scores gave a remarkable increase in performance.

Mathias and Bhattacharyya (2018a; b) proposed an automated essay grading dataset with essay attribute scores. Feature selection depends on the essay type; the common attributes are Content, Organization, Word Choice, Sentence Fluency, and Conventions. In this system, each attribute is scored individually, so the strength of each attribute can be identified. They used a random forest classifier to assign scores to the individual attributes. The QWK they obtained is 0.74 for prompt 1 of the ASAP-SAS dataset (https://www.kaggle.com/c/asap-sas/).

Ke et al. (2019) used a support vector machine to find the response score. This method uses features such as agreeability, specificity, clarity, relevance to the prompt, conciseness, eloquence, confidence, direction of development, justification of opinion, and justification of importance. The individual parameter scores are obtained first and later combined into a final response score. The features are also used in a neural network to determine whether a sentence is relevant to the topic.

Salim et al. (2019) proposed an XGBoost Machine Learning classifier to assess the essays. The algorithm is trained on features like word count, POS, parse tree depth, and coherence of the articles with a sentence similarity percentage; cohesion and coherence are both considered during training. They implemented K-fold cross-validation, and the average accuracy across the validations is 68.12.
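The sketch below mirrors this setup at a toy scale: an XGBoost classifier over a few handcrafted features, evaluated with k-fold cross-validation. The feature vectors and labels are invented for illustration and assume the xgboost package is installed.

```python
# Minimal sketch of a gradient-boosted classifier evaluated with K-fold
# cross-validation, analogous to the XGBoost setup described above.
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Toy feature rows: [word count, parse-tree depth, sentence similarity %]
X = np.array([[250, 6, 0.70], [120, 4, 0.40], [310, 7, 0.80],
              [90, 3, 0.30], [200, 5, 0.60], [280, 6, 0.75]])
y = np.array([1, 0, 1, 0, 1, 0])            # 1 = good essay, 0 = weak essay

clf = XGBClassifier(n_estimators=50, max_depth=3)
scores = cross_val_score(clf, X, y, cv=3)   # 3-fold cross-validation
print("mean accuracy:", scores.mean())
```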

Neural network models

Shehab et al. (2016) proposed a neural network method that uses learning vector quantization to train on human-scored essays. After training, the network can assign scores to ungraded essays. First, the essay is spell-checked, and then preprocessing steps like document tokenization, stop-word removal, and stemming are performed before submitting it to the neural network. Finally, the model provides feedback on whether the essay is relevant to the topic. The correlation coefficient between the human rater and the system score is 0.7665.

Kopparapu and De (2016) proposed automatic ranking of essays using structural and semantic features. This approach constructs a super essay from all the responses, and each student essay is then ranked against the super essay. The structural and semantic features derived help to obtain the scores: 15 structural features per paragraph, such as the average number of sentences, the average length of sentences, and the counts of words, nouns, verbs, adjectives, etc., are used to obtain a syntactic score, and a similarity score is used as the semantic feature to calculate the overall score.

Dong and Zhang (2016) proposed a hierarchical CNN model. The first layer uses word embeddings to represent the words. The second layer is a word-level convolution layer with max-pooling to find word vectors, and the next layer is a sentence-level convolution layer with max-pooling to capture the sentence's content and synonyms. A fully connected dense layer produces the output score for an essay. The hierarchical CNN model achieved an average QWK of 0.754.

Taghipour and Ng (2016) proposed one of the first neural approaches for essay scoring, in which convolutional and recurrent neural network layers are combined to score an essay. The network uses a lookup table with a one-hot representation of the essay's word vectors. The final network with an LSTM layer achieved an average QWK of 0.708.
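A minimal Keras sketch of this family of CNN + LSTM scoring networks is given below. The vocabulary size, sequence length, and embedding dimension are placeholder values, and the sigmoid output assumes scores normalized to [0, 1]; it is an architectural sketch, not the authors' exact configuration.

```python
# Minimal sketch of a CNN + LSTM essay-scoring network of the kind described
# above, written with Keras. All sizes are placeholder values.
from tensorflow.keras import layers, models

vocab_size, max_len, embed_dim = 4000, 500, 50

model = models.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(vocab_size, embed_dim),    # word lookup table
    layers.Conv1D(100, 5, activation="relu"),   # local n-gram features
    layers.MaxPooling1D(2),
    layers.LSTM(100),                           # recurrent layer over the sequence
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),      # score scaled to [0, 1]
])
model.compile(optimizer="rmsprop", loss="mse")
model.summary()
```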

Dong et al. (2017) proposed an attention-based scoring system with CNN + LSTM to score an essay. The CNN inputs are character embeddings and word embeddings (obtained with NLTK), and the network has attention pooling layers. The CNN output is a sentence vector, which provides sentence weights. After the CNN there is an LSTM layer with an attention pooling layer, and this final layer produces the final score of the response. The average QWK score is 0.764.

Riordan et al. (2017) proposed a neural network with CNN and LSTM layers. Word embeddings are given as input to the network. An LSTM layer retrieves the window features and delivers them to the aggregation layer, a shallow layer that takes the window of words and feeds successive layers that predict the answer's score. The neural network achieved a QWK of 0.90.

Zhao et al. (2017) proposed a memory-augmented neural network with four layers: an input representation layer, a memory addressing layer, a memory reading layer, and an output layer. The input layer represents all essays in vector form based on essay length. After conversion to word vectors, the memory addressing layer takes a sample of the essay and weighs all the terms. The memory reading layer takes the input from the memory addressing segment and finds the content needed to finalize the score. Finally, the output layer provides the final score of the essay. The accuracy of the essay scores is 0.78, which is far better than an LSTM neural network.

Mathias and Bhattacharyya (2018a; b) proposed deep learning networks using an LSTM with a CNN layer and GloVe pre-trained word embeddings. They retrieved features such as the essay's sentence count, word count per sentence, number of OOVs in each sentence, language model score, and the text's perplexity. The network predicts a goodness score for each essay: the higher the goodness score, the higher the rank, and vice versa.

Nguyen and Dery (2016) proposed neural networks for automated essay grading. In this method, a single-layer bidirectional LSTM accepts word vectors as input. GloVe vectors used in this method resulted in an accuracy of 90%.

Ruseti et al. (2018) proposed a recurrent neural network capable of memorizing the text and generating a summary of an essay. A Bi-GRU network with a max-pooling layer is built over the word embeddings of each document. It scores an essay by comparing it with a summary of the essay produced by another Bi-GRU network. The result is an accuracy of 0.55.

Wang et al. (2018a; b) proposed an automatic scoring system with a Bi-LSTM recurrent neural network model and retrieved features using the word2vec technique. This method generates word embeddings from the essay words using the skip-gram model, and the word embeddings are then used to train the neural network to find the final score. A softmax layer in the LSTM obtains the importance of each word. This method achieved a QWK score of 0.83.

Dasgupta et al. (2018) proposed a technique for essay scoring that augments qualitative textual features. It extracts three types of features associated with a text document: linguistic, cognitive, and psychological. The linguistic features are part of speech (POS), universal dependency relations, structural well-formedness, lexical diversity, sentence cohesion, causality, and informativeness of the text. The psychological features are derived from the Linguistic Inquiry and Word Count (LIWC) tool. They implemented a convolutional recurrent neural network that takes as input word embeddings and sentence vectors retrieved from GloVe word vectors. The second layer is a convolution layer to find local features, and the next layer is a recurrent neural network (LSTM) to capture dependencies across the text. The method achieved an average QWK of 0.764.

Liang et al. (2018) proposed a symmetrical neural network AES model with a Bi-LSTM. They extract features from sample essays and student essays and prepare an embedding layer as input. The embedding layer output is transferred to a convolution layer, on which the LSTM is trained. Here the LSTM model has a self-feature extraction layer, which finds the essay's coherence. The average QWK score of SBLSTMA is 0.801.

Liu et al. (2019) proposed two-stage learning. In the first stage, a score is assigned based on semantic data from the essay. The second-stage scoring is based on handcrafted features like grammar correction, essay length, number of sentences, etc. The average score of the two stages is 0.709.

Pedro Uria Rodriguez et al. (2019) proposed a sequence-to-sequence learning model for automatic essay scoring. They used BERT (Bidirectional Encoder Representations from Transformers), which extracts the semantics of a sentence from both directions, and the XLNet sequence-to-sequence learning model to extract features like the next sentence in an essay. With these pre-trained models, they captured coherence from the essay to give the final score. The average QWK score of the model is 0.755.

Xia et al. (2019) proposed a two-layer bidirectional LSTM neural network for scoring essays. The features are extracted with word2vec to train the LSTM, and the model's average QWK is 0.870.

Kumar et al. (2019) proposed AutoSAS for short answer scoring. It uses pre-trained Word2Vec and Doc2Vec models, trained on the Google News corpus and a Wikipedia dump respectively, to retrieve features. First, they POS-tag every word and identify weighted words in the response. They also compute prompt overlap to observe how relevant the answer is to the topic and define lexical overlaps like noun overlap, argument overlap, and content overlap. The method additionally uses statistical features like word frequency, difficulty, diversity, number of unique words in each response, type-token ratio, sentence statistics, word length, and logical-operator-based features. A random forest model is trained on a dataset of sample responses with their associated scores, and the model retrieves features from both graded and ungraded short answers together with the questions. The accuracy of AutoSAS in QWK is 0.78, and it works on any topic, such as Science, Arts, Biology, and English.
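The sketch below loosely mirrors one part of such a pipeline: Doc2Vec response embeddings feeding a random forest regressor. The answers and scores are toy examples, and a real system would add the handcrafted features described above.

```python
# Minimal sketch of Doc2Vec response embeddings feeding a random forest
# regressor, loosely mirroring the feature-based pipeline described above.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.ensemble import RandomForestRegressor

answers = [
    "photosynthesis converts light energy into chemical energy",
    "plants make food using sunlight water and carbon dioxide",
    "the process is called respiration",
    "i do not know the answer",
]
scores = [5.0, 4.5, 1.0, 0.0]   # toy human scores

docs = [TaggedDocument(words=a.split(), tags=[i]) for i, a in enumerate(answers)]
d2v = Doc2Vec(docs, vector_size=30, min_count=1, epochs=40)

X = [d2v.infer_vector(a.split()) for a in answers]
rf = RandomForestRegressor(n_estimators=100).fit(X, scores)
print(rf.predict([d2v.infer_vector("plants use sunlight to make food".split())]))
```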

Jiaqi Lun et al. (2020) proposed automatic short answer scoring with BERT, in which student responses are compared with a reference answer and assigned scores. Data augmentation is done with a neural network, and, starting from one correct answer in the dataset, the remaining responses are classified as correct or incorrect.
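A minimal sketch of this pairing idea with the Hugging Face transformers library is shown below: the reference answer and the student response are encoded together and a classification head scores the pair. The generic bert-base-uncased checkpoint is untuned, so the printed probabilities are illustrative only; in practice the model is fine-tuned on labeled response pairs first.

```python
# Minimal sketch of BERT-style short answer scoring as a sentence-pair task:
# reference answer and student response are encoded together and classified.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

reference = "Photosynthesis converts light energy into chemical energy."
response = "Plants turn sunlight into chemical energy they can use."

inputs = tokenizer(reference, response, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))   # probabilities for the two labels (untuned)
```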

Zhu and Sun (2020) proposed a multimodal Machine Learning approach for automated essay scoring. First, they compute a grammar score with the spaCy library along with numerical counts, such as the numbers of words and sentences, with the same library. With this input, they train single and Bi-LSTM neural networks to find the final score. For the LSTM model, they prepare sentence vectors with GloVe and word embeddings with NLTK. The Bi-LSTM checks each sentence in both directions to extract semantics from the essay. The average QWK score across the models is 0.70.

Ontology based approach

Mohler et al. ( 2011 ) proposed a graph-based method to find semantic similarity in short answer scoring. For the ranking of answers, they used the support vector regression model. The bag of words is the main feature extracted in the system.

Ramachandran et al. (2015) also proposed a graph-based approach to find lexically based semantics. Identified phrase patterns and text patterns are the features used to train a random forest regression model to score the essays. The QWK of the model is 0.78.

Zupanc et al. (2017) proposed sentence similarity networks to find the essay's score. Ajetunmobi and Daramola (2017) recommended an ontology-based information extraction approach and a domain-based ontology to find the score.

Speech response scoring

Automatic scoring comes in two forms: text-based scoring and speech-based scoring. This paper has discussed text-based scoring and its challenges; we now cover speech scoring and the points it shares with text-based scoring. Evanini and Wang (2013) worked on speech scoring of non-native school students, extracted features with a speech rater, and trained a linear regression model, concluding that accuracy varies based on voice pitch. Loukina et al. (2015) worked on feature selection from speech data and trained an SVM. Malinin et al. (2016) used neural network models to train the data. Loukina et al. (2017) proposed speech- and text-based automatic scoring: they extracted text-based and speech-based features and trained a deep neural network for speech-based scoring, extracting 33 types of features based on acoustic signals. Malinin et al. (2017) and Wu Xixin et al. (2020) worked on deep neural networks for spoken language assessment, incorporating and testing different types of models. Ramanarayanan et al. (2017) worked on feature extraction methods, extracted punctuation, fluency, and stress features, and trained different Machine Learning models for scoring. Knill et al. (2018) worked on automatic speech recognizers and how their errors impact speech assessment.

The state of the art

This section provides an overview of the existing AES systems with a comparative study with respect to the models, features applied, datasets, and evaluation metrics used for building the automated essay grading systems. We divided all 62 papers into two sets; the first set of reviewed papers is presented in Table 5 with a comparative study of the AES systems.

State of the art

System | Approach | Dataset | Features applied | Evaluation metric and results
Mohler and Mihalcea in ( ) | Shortest path similarity, LSA regression model | – | Word vector | Finds the shortest path
Niraj Kumar and Lipika Dey in ( ) | Word-Graph | ASAP Kaggle | Content and style-based features | 63.81% accuracy
Alex Adamson et al. in ( ) | LSA regression model | ASAP Kaggle | Statistical features | QWK 0.532
Nguyen and Dery ( ) | LSTM (single layer bidirectional) | ASAP Kaggle | Statistical features | 90% accuracy
Keisuke Sakaguchi et al. in ( ) | Classification model | ETS (Educational Testing Service) | Statistical, style-based features | QWK 0.69
Ramachandran et al. in ( ) | Regression model | ASAP Kaggle short answer | Statistical and style-based features | QWK 0.77
Sultan et al. in ( ) | Ridge regression model | SciEntBank answers | Statistical features | RMSE 0.887
Dong and Zhang ( ) | CNN neural network | ASAP Kaggle | Statistical features | QWK 0.734
Taghipour and Ng in ( ) | CNN + LSTM neural network | ASAP Kaggle | Lookup table (one-hot representation of word vector) | QWK 0.761
Shehab et al. in ( ) | Learning vector quantization neural network | Mansoura University students' essays | Statistical features | Correlation coefficient 0.7665
Cummins et al. in ( ) | Regression model | ASAP Kaggle | Statistical features, style-based features | QWK 0.69
Kopparapu and De ( ) | Neural network | ASAP Kaggle | Statistical features, style-based | –
Dong et al. in ( ) | CNN + LSTM neural network | ASAP Kaggle | Word embedding, content-based | QWK 0.764
Ajetunmobi and Daramola ( ) | WuPalmer algorithm | – | Statistical features | –
Siyuan Zhao et al. in ( ) | LSTM (memory network) | ASAP Kaggle | Statistical features | QWK 0.78
Mathias and Bhattacharyya ( ) | Random Forest classifier (a classification model) | ASAP Kaggle | Style and content-based features | Classified which feature set is required
Brian Riordan et al. in ( ) | CNN + LSTM neural network | ASAP Kaggle short answer | Word embeddings | QWK 0.90
Tirthankar Dasgupta et al. in ( ) | CNN-bidirectional LSTMs neural network | ASAP Kaggle | Content and psychological features | QWK 0.786
Wu and Shih ( ) | Classification model | SciEntBank answers | unigram_recall, unigram_precision, unigram_F_measure, log_bleu_recall, log_bleu_precision, log_bleu_F_measure, BLEU features | Squared correlation coefficient 59.568
Yucheng Wang et al. in ( ) | Bi-LSTM | ASAP Kaggle | Word embedding sequence | QWK 0.724
Anak Agung Putri Ratna et al. in ( ) | Winnowing algorithm | – | – | 86.86 accuracy
Sharma and Jayagopi ( ) | GloVe, LSTM neural network | ASAP Kaggle | Handwritten essay images | QWK 0.69
Jennifer O. Contreras et al. in ( ) | OntoGen (SVM), linear regression | University of Benghazi data set | Statistical, style-based features | –
Mathias and Bhattacharyya ( ) | GloVe, LSTM neural network | ASAP Kaggle | Statistical features, style features | Predicted goodness score for essay
Stefan Ruseti et al. in ( ) | BiGRU Siamese architecture | Amazon Mechanical Turk online research service, collected summaries | Word embedding | Accuracy 55.2
Zining Wang et al. in ( ) | LSTM (semantic), HAN (hierarchical attention network) neural network | ASAP Kaggle | Word embedding | QWK 0.83
Guoxi Liang et al. ( ) | Bi-LSTM | ASAP Kaggle | Word embedding, coherence of sentence | QWK 0.801
Ke et al. in ( ) | Classification model | ASAP Kaggle | Content based | Pearson's Correlation Coefficient (PC)-0.39, ME-0.921
Tsegaye Misikir Tashu and Horváth in ( ) | Unsupervised learning – Locality Sensitivity Hashing | ASAP Kaggle | Statistical features | Root mean squared error
Kumar and Dey ( ) | Random Forest; CNN, RNN neural network | ASAP Kaggle short answer | Style and content-based features | QWK 0.82
Pedro Uria Rodriguez et al. ( ) | BERT, XLNet | ASAP Kaggle | Error correction, sequence learning | QWK 0.755
Jiawei Liu et al. ( ) | CNN, LSTM, BERT | ASAP Kaggle | Semantic data, handcrafted features like grammar correction, essay length, number of sentences, etc. | QWK 0.709
Darwish and Mohamed ( ) | Multiple Linear Regression | ASAP Kaggle | Style and content-based features | QWK 0.77
Jiaqi Lun et al. ( ) | BERT | SemEval-2013 | Student answer, reference answer | Accuracy 0.8277 (2-way)
Süzen, Neslihan, et al. ( ) | Text mining | Introductory computer science class at the University of North Texas, student assignments | Sentence similarity | Correlation score 0.81
Wilson Zhu and Yu Sun in ( ) | RNN (LSTM, Bi-LSTM) | ASAP Kaggle | Word embedding, grammar count, word count | QWK 0.70
Salim Yafet et al. ( ) | XGBoost machine learning classifier | ASAP Kaggle | Word count, POS, parse tree, coherence, cohesion, type-token ratio | Accuracy 68.12
Andrzej Cader ( ) | Deep neural network | University of Social Sciences in Lodz students' answers | Asynchronous feature | Accuracy 0.99
Tashu TM, Horváth T ( ) | Rule-based algorithm, similarity-based algorithm | ASAP Kaggle | Similarity based | Accuracy 0.68
Masaki Uto (B) and Masashi Okano ( ) | Item Response Theory models (CNN-LSTM, BERT) | ASAP Kaggle | – | QWK 0.749

Comparison of all approaches

In our study, we divided the major AES approaches into three categories: regression models, classification models, and neural network models. The regression models fail to capture cohesion and coherence from the essay because they are trained on BoW (Bag of Words) features. In processing data from input to output, regression models are less complicated than neural networks, but they are unable to find intricate patterns in the essay or to capture sentence connectivity. Even in the neural network approach, if we train the model with BoW features, the model never considers the essay's cohesion and coherence.

First, to train a Machine Learning algorithm on essays, all the essays are converted to vector form. We can form vectors with BoW (TF-IDF) or Word2vec. The BoW and Word2vec vector representations of essays are shown in Table 6. The BoW representation with TF-IDF does not incorporate the essay's semantics; it is just statistical learning from a given vector. A Word2vec vector captures the semantics of the essay, but only in a unidirectional way.

Vector representation of essays

Student 1 response: "I believe that using computers will benefit us in many ways like talking and becoming friends will others through websites like facebook and mysace"
BoW vector: << 0.00000 0.00000 0.165746 0.280633 … 0.00000 0.280633 0.280633 0.280633 >>
Word2vec vector: << 3.9792988e-03 −1.9810481e-03 1.9830784e-03 9.0381579e-04 −2.9438005e-03 2.1778699e-03 4.4950014e-03 2.9508960e-03 −2.2331756e-03 −3.8774475e-03 3.5967759e-03 −4.0194849e-03 −3.0412588e-03 −2.4055617e-03 4.8296354e-03 2.4813593e-03 … −2.7158875e-03 −1.4563646e-03 1.4072991e-03 −5.2228488e-04 −2.3597316e-03 6.2979700e-04 −3.0249553e-03 4.4125126e-04 2.1633594e-03 −4.9487003e-03 9.9755758e-05 −2.4388896e-03 >>

Student 2 response: "More and more people use computers, but not everyone agrees that this benefits society. Those who support advances in technology believe that computers have a positive effect on people"
BoW vector: << 0.26043 0.26043 0.153814 0.000000 … 0.26043 0.000000 0.000000 0.000000 >>
Word2vec vector: << 3.9792988e-03 −1.9810481e-03 1.9830784e-03 9.0381579e-04 −2.9438005e-03 2.1778699e-03 4.4950014e-03 2.9508960e-03 −2.2331756e-03 −3.8774475e-03 3.5967759e-03 −4.0194849e-03 … −2.7158875e-03 −1.4563646e-03 1.4072991e-03 −5.2228488e-04 −2.3597316e-03 6.2979700e-04 −3.0249553e-03 4.4125126e-04 3.7868773e-03 −4.4193151e-03 3.0735810e-03 2.5546195e-03 2.1633594e-03 −4.9487003e-03 9.9755758e-05 −2.4388896e-03 >>

In BoW, the vector contains the frequency of word occurrences in the essay: the vector holds 1 or more depending on how often a word occurs in the essay and 0 when it is absent. So the BoW vector does not maintain any relationship with adjacent words; it treats words in isolation. In word2vec, the vector represents the relationship of words with other words and with the prompt sentences in multiple dimensions. But word2vec prepares vectors in a unidirectional way, not bidirectionally, so it fails to find the correct semantic vector when a word has two meanings and the meaning depends on adjacent words. Table 7 compares the Machine Learning models and feature extraction methods.
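The contrast can be seen directly in code. The minimal sketch below builds a TF-IDF bag-of-words matrix and an averaged Word2Vec vector for two toy responses using scikit-learn and gensim; it is for illustration only and does not reproduce the vectors in Table 6.

```python
# Minimal sketch contrasting the two representations compared above: a TF-IDF
# bag-of-words vector (frequency-based, order lost) versus an averaged
# Word2Vec vector (distributional semantics). Toy corpus of two responses.
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
import numpy as np

responses = [
    "using computers will benefit us in many ways",
    "not everyone agrees that computers benefit society",
]

# Bag of words: one dimension per vocabulary term, word order is lost.
bow = TfidfVectorizer().fit_transform(responses)
print(bow.toarray().round(3))

# Word2Vec: each word gets a dense vector; an essay vector is the mean.
tokens = [r.split() for r in responses]
w2v = Word2Vec(tokens, vector_size=20, min_count=1, epochs=30)
essay_vecs = np.array([np.mean([w2v.wv[t] for t in toks], axis=0) for toks in tokens])
print(essay_vecs[:, :5].round(4))
```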

Comparison of models

Model | BoW | Word2vec
Regression models / classification models | The system implemented with BoW features and regression or classification algorithms will have low cohesion and coherence | The system implemented with Word2vec features and regression or classification algorithms will have low to medium cohesion and coherence
Neural networks (LSTM) | The system implemented with BoW features and neural network models will have low cohesion and coherence | The system implemented with Word2vec features and a neural network model (LSTM) will have medium to high cohesion and coherence

In AES, cohesion and coherence check the content of the essay with respect to the essay prompt; these can be extracted from the essay in vector form. Two more parameters for assessing an essay are completeness and feedback. Completeness checks whether the student's response is sufficient, even if what the student wrote is correct. Table 8 compares all models with respect to these four parameters, and Table 9 compares all approaches on various features like grammar, spelling, organization of the essay, and relevance.

Comparison of all models with respect to cohesion, coherence, completeness, feedback

Authors | Cohesion | Coherence | Completeness | Feedback
Mohler and Mihalcea ( ) | Low | Low | Low | Low
Mohler et al. ( ) | Medium | Low | Medium | Low
Persing and Ng ( ) | Medium | Low | Low | Low
Adamson et al. ( ) | Low | Low | Low | Low
Ramachandran et al. ( ) | Medium | Medium | Low | Low
Sakaguchi et al. ( ) | Medium | Low | Low | Low
Cummins et al. ( ) | Low | Low | Low | Low
Sultan et al. ( ) | Medium | Medium | Low | Low
Shehab et al. ( ) | Low | Low | Low | Low
Kopparapu and De ( ) | Medium | Medium | Low | Low
Dong and Zhang ( ) | Medium | Low | Low | Low
Taghipour and Ng ( ) | Medium | Medium | Low | Low
Zupanc et al. ( ) | Medium | Medium | Low | Low
Dong et al. ( ) | Medium | Medium | Low | Low
Riordan et al. ( ) | Medium | Medium | Medium | Low
Zhao et al. ( ) | Medium | Medium | Low | Low
Contreras et al. ( ) | Medium | Low | Low | Low
Mathias and Bhattacharyya ( ; ) | Medium | Medium | Low | Low
Mathias and Bhattacharyya ( ; ) | Medium | Medium | Low | Low
Nguyen and Dery ( ) | Medium | Medium | Medium | Medium
Ruseti et al. ( ) | Medium | Low | Low | Low
Dasgupta et al. ( ) | Medium | Medium | Low | Low
Liu et al. ( ) | Low | Low | Low | Low
Wang et al. ( ) | Medium | Low | Low | Low
Guoxi Liang et al. ( ) | High | High | Low | Low
Wang et al. ( ) | Medium | Medium | Low | Low
Chen and Li ( ) | Medium | Medium | Low | Low
Li et al. ( ) | Medium | Medium | Low | Low
Alva-Manchego et al. ( ) | Low | Low | Low | Low
Jiawei Liu et al. ( ) | High | High | Medium | Low
Pedro Uria Rodriguez et al. ( ) | Medium | Medium | Medium | Low
Changzhi Cai ( ) | Low | Low | Low | Low
Xia et al. ( ) | Medium | Medium | Low | Low
Chen and Zhou ( ) | Low | Low | Low | Low
Kumar et al. ( ) | Medium | Medium | Medium | Low
Ke et al. ( ) | Medium | Low | Medium | Low
Andrzej Cader ( ) | Low | Low | Low | Low
Jiaqi Lun et al. ( ) | High | High | Low | Low
Wilson Zhu and Yu Sun ( ) | Medium | Medium | Low | Low
Süzen, Neslihan et al. ( ) | Medium | Low | Medium | Low
Salim Yafet et al. ( ) | High | Medium | Low | Low
Darwish and Mohamed ( ) | Medium | Low | Low | Low
Tashu and Horváth ( ) | Medium | Medium | Low | Medium
Tashu ( ) | Medium | Medium | Low | Low
Masaki Uto (B) and Masashi Okano ( ) | Medium | Medium | Medium | Medium
Panitan Muangkammuen and Fumiyo Fukumoto ( ) | Medium | Medium | Medium | Low

Comparison of all approaches on various features

Approaches | Grammar | Style (word choice, sentence structure) | Mechanics (spelling, punctuation, capitalization) | Development | BoW (tf-idf) | Relevance
Mohler and Mihalcea ( ) | No | No | No | No | Yes | No
Mohler et al. ( ) | Yes | No | No | No | Yes | No
Persing and Ng ( ) | Yes | Yes | Yes | No | Yes | Yes
Adamson et al. ( ) | Yes | No | Yes | No | Yes | No
Ramachandran et al. ( ) | Yes | No | Yes | Yes | Yes | Yes
Sakaguchi et al. ( ) | No | No | Yes | Yes | Yes | Yes
Cummins et al. ( ) | Yes | No | Yes | No | Yes | No
Sultan et al. ( ) | No | No | No | No | Yes | Yes
Shehab et al. ( ) | Yes | Yes | Yes | No | Yes | No
Kopparapu and De ( ) | No | No | No | No | Yes | No
Dong and Zhang ( ) | Yes | No | Yes | No | Yes | Yes
Taghipour and Ng ( ) | Yes | No | No | No | Yes | Yes
Zupanc et al. ( ) | No | No | No | No | Yes | No
Dong et al. ( ) | No | No | No | No | No | Yes
Riordan et al. ( ) | No | No | No | No | No | Yes
Zhao et al. ( ) | No | No | No | No | No | Yes
Contreras et al. ( ) | Yes | No | No | No | Yes | Yes
Mathias and Bhattacharyya ( , ) | No | Yes | Yes | No | No | Yes
Mathias and Bhattacharyya ( , ) | Yes | No | Yes | No | Yes | Yes
Nguyen and Dery ( ) | No | No | No | No | Yes | Yes
Ruseti et al. ( ) | No | No | No | Yes | No | Yes
Dasgupta et al. ( ) | Yes | Yes | Yes | Yes | No | Yes
Liu et al. ( ) | Yes | Yes | No | No | Yes | No
Wang et al. ( ) | No | No | No | No | No | Yes
Guoxi Liang et al. ( ) | No | No | No | No | No | Yes
Wang et al. ( ) | No | No | No | No | No | Yes
Chen and Li ( ) | No | No | No | No | No | Yes
Li et al. ( ) | Yes | No | No | No | No | Yes
Alva-Manchego et al. ( ) | Yes | No | No | Yes | No | Yes
Jiawei Liu et al. ( ) | Yes | No | No | Yes | No | Yes
Pedro Uria Rodriguez et al. ( ) | No | No | No | No | Yes | Yes
Changzhi Cai ( ) | No | No | No | No | No | Yes
Xia et al. ( ) | No | No | No | No | No | Yes
Chen and Zhou ( ) | No | No | No | No | No | Yes
Kumar et al. ( ) | Yes | Yes | No | Yes | Yes | Yes
Ke et al. ( ) | No | Yes | No | Yes | Yes | Yes
Andrzej Cader ( ) | No | No | No | No | No | Yes
Jiaqi Lun et al. ( ) | No | No | No | No | No | Yes
Wilson Zhu and Yu Sun ( ) | No | No | No | No | No | Yes
Süzen, Neslihan, et al. ( ) | No | No | No | No | Yes | Yes
Salim Yafet et al. ( ) | Yes | Yes | Yes | No | Yes | Yes
Darwish and Mohamed ( ) | Yes | Yes | No | No | No | Yes

What are the challenges/limitations in the current research?

From our study and the results discussed in the previous sections, many researchers have worked on automated essay scoring systems with numerous techniques. There are statistical methods, classification methods, and neural network approaches to evaluate essays automatically. The main goal of an automated essay grading system is to reduce human effort and improve consistency.

The vast majority of essay scoring systems focus on the efficiency of the algorithm, but there are many other challenges in automated essay grading systems. One should assess an essay on parameters like the relevance of the content to the prompt, development of ideas, cohesion, coherence, and domain knowledge.

No model works on the relevance of content, that is, whether the student's response or explanation is relevant to the given prompt and, if it is relevant, how appropriate it is; and there is little discussion about the cohesion and coherence of the essays. Most research concentrated on extracting features with NLP libraries, training the models, and testing the results, with no explanation of consistency and completeness in the essay evaluation system. Palma and Atkinson (2018), however, explained coherence-based essay evaluation, and Zupanc and Bosnic (2014) also used coherence to evaluate essays; they measured consistency with latent semantic analysis (LSA) to find coherence in essays, where the dictionary meaning of coherence is "the quality of being logical and consistent."

Another limitation is that there is no domain-knowledge-based evaluation of essays using Machine Learning models. For example, the meaning of "cell" differs between biology and physics. Many Machine Learning models extract features with Word2Vec and GloVe; these NLP libraries cannot convert words into appropriate vectors when the words have two or more meanings.

Other challenges that influence automated essay scoring systems

All these approaches worked to improve the QWK score of their models. But QWK does not assess a model in terms of feature extraction or constructed irrelevant answers, and it does not evaluate whether the model is assessing the answer correctly. There are many challenges concerning students' responses to automatic scoring systems: for example, no model has examined how to evaluate constructed irrelevant and adversarial answers. In particular, black-box approaches like deep learning models give students more opportunities to bluff automated scoring systems.

The Machine Learning models that work on statistical features are very vulnerable. Based on Powers et al. (2001) and Bejar Isaac et al. (2014), the E-rater failed on the Constructed Irrelevant Response Strategy (CIRS). From the studies of Bejar et al. (2013) and Higgins and Heilman (2014), it was observed that student responses containing irrelevant content or shell language matching the prompt influence the final score of essays in an automated scoring system.

In deep learning approaches, most models automatically learn the essay's features; some methods work on word-based embeddings and others on character-based embedding features. From the study of Riordan Brain et al. (2019), character-based embedding systems do not prioritize spelling correction, even though spelling influences the final score of the essay. From the study of Horbach and Zesch (2019), various factors influence AES systems, for example dataset size, prompt type, answer length, training set, and human scorers for content-based scoring.

Ding et al. (2020) showed that automated scoring systems are vulnerable when a student response contains many words from the prompt, i.e., prompt vocabulary repeated in the response. Parekh et al. (2020) and Kumar et al. (2020) tested various neural network AES models by iteratively adding important words, deleting unimportant words, shuffling words, and repeating sentences in an essay, and found no change in the final score of the essays. These neural network models fail to recognize common sense in adversarial essays and give students more opportunities to bluff the automated systems.

Beyond NLP and ML techniques for AES, work from Wresch (1993) to Madnani and Cahill (2018) has discussed the complexity of AES systems and the standards that need to be followed, such as assessment rubrics to test subject knowledge, the handling of irrelevant responses, and ethical aspects of an algorithm like measuring the fairness of scoring student responses.

Fairness is an essential factor for automated systems. For example, in AES, fairness can be measured as the agreement between the human score and the machine score. Besides this, from Loukina et al. (2019), the fairness standards include overall score accuracy, overall score differences, and conditional score differences between human and system scores. In addition, scoring responses with attention to whether they are constructed relevant or irrelevant will improve fairness.

Madnani et al. (2017a; b) discussed the fairness of AES systems for constructed responses and presented the RMS open-source tool for detecting biases in the models. With this, one can adapt the fairness standards according to one's own fairness analysis.

From Berzak et al.'s (2018) approach, behavioral factors are a significant challenge in automated scoring systems. They help to determine language proficiency and word characteristics (essential words from the text), to predict critical patterns in the text, to find related sentences in an essay, and to give a more accurate score.

Rupp (2018) discussed design, evaluation, and deployment methodologies for AES systems and provided notable characteristics of AES systems for deployment, such as model performance, evaluation metrics for a model, threshold values, dynamically updated models, and the overall framework.

First, we should check the model's performance on different datasets and parameters before operational deployment. The evaluation metrics selected for AES models are QWK, the correlation coefficient, or sometimes both. Kelley and Preacher (2012) discussed three categories of threshold values: marginal, borderline, and acceptable; the values can vary based on data size, model performance, and the type of model (single scoring or multiple scoring models). Once a model is deployed and evaluates millions of responses, a dynamically updated model based on the prompt and data is needed to keep responses optimal. Finally, there is the framework design of the AES model; here a framework contains the prompts to which test-takers write their responses. One can design two kinds of frameworks: a single scoring model for a single methodology, or multiple scoring models for multiple concepts. When we deploy multiple scoring models, each prompt can be trained separately, or we can provide generalized models for all prompts, in which case accuracy may vary, and this is challenging.

Our systematic literature review on automated essay grading systems first collected 542 papers with selected keywords from various databases. After applying the inclusion and exclusion criteria, we were left with 139 articles; on these selected papers we applied the quality assessment criteria with two reviewers, and finally we selected 62 papers for the final review.

Our observations on automated essay grading systems from 2010 to 2020 are as follows:

  • The implementation techniques of automated essay grading systems are classified into four buckets: (1) regression models, (2) classification models, (3) neural networks, and (4) ontology-based methodologies. Using neural networks, researchers achieve higher accuracy than with other techniques; the state of the art for all methods is provided in Table 5.
  • The majority of regression and classification models for essay scoring used statistical features to find the final score. That is, the systems or models were trained on parameters such as word count, sentence count, etc. Though these parameters are extracted from the essay, the algorithm is not trained directly on the essays; it is trained on numbers obtained from the essay, and if those numbers match, the composition gets a good score, otherwise the rating is lower. In these models, the evaluation process is based entirely on numbers, irrespective of the essay itself, so there is a high chance of missing the coherence and relevance of the essay when we train our algorithms on statistical parameters.
  • In the neural network approach, many models were trained on Bag of Words (BoW) features. The BoW feature misses the relationship between words and the semantic meaning of the sentence. E.g., Sentence 1: "John killed Bob." Sentence 2: "Bob killed John." For these two sentences the BoW is the same: "John," "killed," "Bob."
  • In the Word2Vec library, if we prepare a word vector from an essay in a unidirectional way, the vector has dependencies on other words and finds semantic relationships with them. But if a word has two or more meanings, as in "bank loan" and "river bank," where "bank" has two senses and its adjacent words decide the sentence meaning, Word2Vec does not find the real meaning of the word from the sentence.
  • The features extracted from essays in essay scoring systems are classified into three types: statistical features, style-based features, and content-based features, which are explained in RQ2 and Table 4. Statistical features play a significant role in some systems and a negligible one in others. In the systems of Shehab et al. (2016), Cummins et al. (2016), Dong et al. (2017), Dong and Zhang (2016), and Mathias and Bhattacharyya (2018a; b), the assessment is based entirely on statistical and style-based features; they did not retrieve any content-based features. In other systems that extract content from the essays, statistical features are used only for preprocessing and are not included in the final grading.
  • In AES systems, coherence is a main feature to consider while evaluating essays. The actual meaning of coherence is "to stick together": the logical connection of sentences (local-level coherence) and paragraphs (global-level coherence) in a text. Without coherence, all sentences in a paragraph are independent and meaningless. In an essay, coherence is a significant feature because it conveys everything in a flow along with its meaning, and it is a powerful feature in an AES system for finding the semantics of an essay. With coherence, one can assess whether all sentences are connected in a flow and all paragraphs are related to justify the prompt. Retrieving the coherence level from an essay is a critical task for all researchers in AES.
  • In automatic essay grading systems, assessing essays with respect to content is critical, as that gives the actual score for the student. Most research used statistical features like sentence length, word count, number of sentences, etc. According to the collected results, only 32% of the systems used content-based features for essay scoring. Example papers using content-based assessment are Taghipour and Ng (2016), Persing and Ng (2013), Wang et al. (2018a, 2018b), Zhao et al. (2017), and Kopparapu and De (2016); Kumar et al. (2019), Mathias and Bhattacharyya (2018a; b), and Mohler and Mihalcea (2009) used both content-based and statistical features. The results are shown in Fig. 4. Content-based features are mainly extracted with the word2vec NLP library; word2vec is capable of capturing the context of a word in a document, semantic and syntactic similarity, and relations with other terms, but it captures context in one direction only, either left or right. If a word has multiple meanings, there is a chance of missing the context in the essay. After analyzing all the papers, we found that content-based assessment is a qualitative assessment of essays.
  • On the other hand, Horbach and Zesch (2019), Riordan Brain et al. (2019), Ding et al. (2020), and Kumar et al. (2020) proved that neural network models are vulnerable when a student response contains constructed irrelevant or adversarial answers, and a student can easily bluff an automated scoring system by submitting responses that, for example, repeat sentences or repeat prompt words. From Loukina et al. (2019) and Madnani et al. (2017b), the fairness of an algorithm is an essential factor to consider in AES systems.
  • For speech assessment, the datasets contain audio clips of up to one minute. The feature extraction techniques are entirely different from text assessment, and accuracy varies based on speaking fluency, pitch, female versus male voice, and child versus adult voice. But the training algorithms are the same for text and speech assessment.
  • Once an AES system can evaluate essays and short answers accurately in all directions, there will be massive demand for automated systems in education and related fields. AES systems are now deployed in the GRE and TOEFL exams; beyond these, AES systems can be deployed in massive open online courses like Coursera (“ https://coursera.org/learn//machine-learning//exam ”) and NPTEL ( https://swayam.gov.in/explorer ), which still assess student performance with multiple-choice questions. From another perspective, AES systems can be deployed in information-retrieval systems like Quora, Stack Overflow, etc., to check whether a retrieved response is appropriate to the question and to rank the retrieved answers.

Conclusion and future work

As per our systematic literature review, we studied 62 papers. Significant challenges remain for researchers in implementing automated essay grading systems, and several researchers are working rigorously on building a robust AES system despite the difficulty of the problem. None of the evaluated methods assesses essays on coherence, relevance, completeness, feedback, and domain knowledge together. About 90% of the essay grading systems used the Kaggle ASAP (2012) dataset, which contains general essays from students that do not require any domain knowledge, so there is a need for domain-specific essay datasets for training and testing. Feature extraction relies on the NLTK, Word2Vec, and GloVe NLP libraries, which have many limitations when converting a sentence into vector form. Apart from feature extraction and training Machine Learning models, no system assesses the essay's completeness, no system provides feedback on the student response, and none retrieves coherence vectors from the essay. From another perspective, constructed irrelevant and adversarial student responses still call AES systems into question.

Our proposed research will focus on content-based assessment of essays with domain knowledge and on scoring essays for internal and external consistency. We will also create a new dataset for a single domain. Another area in which we can improve is feature extraction techniques.

This study includes only four digital databases for study selection, so it may miss some relevant studies on the topic. However, we hope that we covered most of the significant studies, as we also manually collected some papers published in relevant journals.



Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Dadi Ramesh, Email: dadiramesh44@gmail.com.

Suresh Kumar Sanampudi, Email: sureshsanampudi@jntuh.ac.in.

  • Adamson, A., Lamb, A., & December, R. M. (2014). Automated Essay Grading.
  • Ajay HB, Tillett PI, Page EB (1973) Analysis of essays by computer (AEC-II) (No. 8-0102). Washington, DC: U.S. Department of Health, Education, and Welfare, Office of Education, National Center for Educational Research and Development
  • Ajetunmobi SA, Daramola O (2017) Ontology-based information extraction for subject-focussed automatic essay evaluation. In: 2017 International Conference on Computing Networking and Informatics (ICCNI) p 1–6. IEEE
  • Alva-Manchego F, et al. (2019) EASSE: Easier Automatic Sentence Simplification Evaluation.” ArXiv abs/1908.04567 (2019): n. pag
  • Bailey S, Meurers D (2008) Diagnosing meaning errors in short answers to reading comprehension questions. In: Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications (Columbus), p 107–115
  • Basu S, Jacobs C, Vanderwende L. Powergrading: a clustering approach to amplify human effort for short answer grading. Trans Assoc Comput Linguist (TACL) 2013;1:391–402. doi: 10.1162/tacl_a_00236.
  • Bejar, I. I., Flor, M., Futagi, Y., & Ramineni, C. (2014). On the vulnerability of automated scoring to construct-irrelevant response strategies (CIRS): An illustration. Assessing Writing, 22, 48-59.
  • Bejar I, et al. (2013) Length of Textual Response as a Construct-Irrelevant Response Strategy: The Case of Shell Language. Research Report. ETS RR-13-07.” ETS Research Report Series (2013): n. pag
  • Berzak Y, et al. (2018) “Assessing Language Proficiency from Eye Movements in Reading.” ArXiv abs/1804.07329 (2018): n. pag
  • Blanchard D, Tetreault J, Higgins D, Cahill A, Chodorow M (2013) TOEFL11: A corpus of non-native English. ETS Research Report Series, 2013(2):i–15, 2013
  • Blood, I. (2011). Automated essay scoring: a literature review. Studies in Applied Linguistics and TESOL, 11(2).
  • Burrows S, Gurevych I, Stein B. The eras and trends of automatic short answer grading. Int J Artif Intell Educ. 2015;25:60–117. doi: 10.1007/s40593-014-0026-8.
  • Cader, A. (2020, July). The Potential for the Use of Deep Neural Networks in e-Learning Student Evaluation with New Data Augmentation Method. In International Conference on Artificial Intelligence in Education (pp. 37–42). Springer, Cham.
  • Cai C (2019) Automatic essay scoring with recurrent neural network. In: Proceedings of the 3rd International Conference on High Performance Compilation, Computing and Communications (2019): n. pag.
  • Chen M, Li X (2018) "Relevance-Based Automated Essay Scoring via Hierarchical Recurrent Model. In: 2018 International Conference on Asian Language Processing (IALP), Bandung, Indonesia, 2018, p 378–383, doi: 10.1109/IALP.2018.8629256
  • Chen Z, Zhou Y (2019) "Research on Automatic Essay Scoring of Composition Based on CNN and OR. In: 2019 2nd International Conference on Artificial Intelligence and Big Data (ICAIBD), Chengdu, China, p 13–18, doi: 10.1109/ICAIBD.2019.8837007
  • Contreras JO, Hilles SM, Abubakar ZB (2018) Automated essay scoring with ontology based on text mining and NLTK tools. In: 2018 International Conference on Smart Computing and Electronic Enterprise (ICSCEE), 1-6
  • Correnti R, Matsumura LC, Hamilton L, Wang E. Assessing students' skills at writing analytically in response to texts. Elem Sch J. 2013;114(2):142–177. doi: 10.1086/671936
  • Cummins, R., Zhang, M., & Briscoe, E. (2016, August). Constrained multi-task learning for automated essay scoring. Association for Computational Linguistics.
  • Darwish SM, Mohamed SK (2020) Automated essay evaluation based on fusion of fuzzy ontology and latent semantic analysis. In: Hassanien A, Azar A, Gaber T, Bhatnagar RF, Tolba M (eds) The International Conference on Advanced Machine Learning Technologies and Applications
  • Dasgupta T, Naskar A, Dey L, Saha R (2018) Augmenting textual qualitative features in deep convolution recurrent neural network for automatic essay scoring. In: Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications p 93–102
  • Ding Y, et al. (2020) "Don’t take “nswvtnvakgxpm” for an answer–The surprising vulnerability of automatic content scoring systems to adversarial input." In: Proceedings of the 28th International Conference on Computational Linguistics
  • Dong F, Zhang Y (2016) Automatic features for essay scoring–an empirical study. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing p 1072–1077
  • Dong F, Zhang Y, Yang J (2017) Attention-based recurrent convolutional neural network for automatic essay scoring. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017) p 153–162
  • Dzikovska M, Nielsen R, Brew C, Leacock C, Giampiccolo D, Bentivogli L, Clark P, Dagan I, Dang HT (2013a) Semeval-2013 task 7: The joint student response analysis and 8th recognizing textual entailment challenge
  • Dzikovska MO, Nielsen R, Brew C, Leacock C, Giampiccolo D, Bentivogli L, Clark P, Dagan I, Trang Dang H (2013b) SemEval-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge. *SEM 2013: The First Joint Conference on Lexical and Computational Semantics
  • Educational Testing Service (2008) CriterionSM online writing evaluation service. Retrieved from http://www.ets.org/s/criterion/pdf/9286_CriterionBrochure.pdf .
  • Evanini, K., & Wang, X. (2013, August). Automated speech scoring for non-native middle school students with multiple task types. In INTERSPEECH (pp. 2435–2439).
  • Foltz PW, Laham D, Landauer TK (1999) The Intelligent Essay Assessor: Applications to Educational Technology. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 1, 2, http://imej.wfu.edu/articles/1999/2/04/index.asp
  • Granger, S., Dagneaux, E., Meunier, F., & Paquot, M. (Eds.). (2009). International corpus of learner English. Louvain-la-Neuve: Presses universitaires de Louvain.
  • Higgins D, Heilman M. Managing what we can measure: quantifying the susceptibility of automated scoring systems to gaming behavior. Educ Meas Issues Pract. 2014;33:36–46. doi: 10.1111/emip.12036
  • Horbach A, Zesch T. The influence of variance in learner answers on automatic content scoring. Front Educ. 2019;4:28. doi: 10.3389/feduc.2019.00028
  • Hussein, M. A., Hassan, H., & Nassef, M. (2019). Automated language essay scoring systems: A literature review. PeerJ Computer Science, 5, e208.
  • Ke Z, Ng V (2019) “Automated essay scoring: a survey of the state of the art.” IJCAI
  • Ke, Z., Inamdar, H., Lin, H., & Ng, V. (2019, July). Give me more feedback II: Annotating thesis strength and related attributes in student essays. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3994-4004).
  • Kelley K, Preacher KJ. On effect size. Psychol Methods. 2012;17(2):137–152. doi: 10.1037/a0028086
  • Kitchenham B, Brereton OP, Budgen D, Turner M, Bailey J, Linkman S. Systematic literature reviews in software engineering–a systematic literature review. Inf Softw Technol. 2009;51(1):7–15. doi: 10.1016/j.infsof.2008.09.009
  • Klebanov, B. B., & Madnani, N. (2020, July). Automated evaluation of writing–50 years and counting. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7796–7810).
  • Knill K, Gales M, Kyriakopoulos K, et al. (4 more authors) (2018) Impact of ASR performance on free speaking language assessment. In: Interspeech 2018.02–06 Sep 2018, Hyderabad, India. International Speech Communication Association (ISCA)
  • Kopparapu SK, De A (2016) Automatic ranking of essays using structural and semantic features. In: 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), p 519–523
  • Kumar, Y., Aggarwal, S., Mahata, D., Shah, R. R., Kumaraguru, P., & Zimmermann, R. (2019, July). Get it scored using autosas—an automated system for scoring short answers. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, No. 01, pp. 9662–9669).
  • Kumar Y, et al. (2020) “Calling out bluff: attacking the robustness of automatic scoring systems with simple adversarial testing.” ArXiv abs/2007.06796
  • Li X, Chen M, Nie J, Liu Z, Feng Z, Cai Y (2018) Coherence-Based Automated Essay Scoring Using Self-attention. In: Sun M, Liu T, Wang X, Liu Z, Liu Y (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. CCL 2018, NLP-NABD 2018. Lecture Notes in Computer Science, vol 11221. Springer, Cham. 10.1007/978-3-030-01716-3_32
  • Liang G, On B, Jeong D, Kim H, Choi G. Automated essay scoring: a siamese bidirectional LSTM neural network architecture. Symmetry. 2018;10:682. doi: 10.3390/sym10120682
  • Liua, H., Yeb, Y., & Wu, M. (2018, April). Ensemble Learning on Scoring Student Essay. In 2018 International Conference on Management and Education, Humanities and Social Sciences (MEHSS 2018). Atlantis Press.
  • Liu J, Xu Y, Zhao L (2019) Automated Essay Scoring based on Two-Stage Learning. ArXiv, abs/1901.07744
  • Loukina A, et al. (2015) Feature selection for automated speech scoring.” BEA@NAACL-HLT
  • Loukina A, et al. (2017) “Speech- and Text-driven Features for Automated Scoring of English-Speaking Tasks.” SCNLP@EMNLP 2017
  • Loukina A, et al. (2019) The many dimensions of algorithmic fairness in educational applications. BEA@ACL
  • Lun J, Zhu J, Tang Y, Yang M (2020) Multiple data augmentation strategies for improving performance on automatic short answer scoring. In: Proceedings of the AAAI Conference on Artificial Intelligence, 34(09): 13389-13396
  • Madnani, N., & Cahill, A. (2018, August). Automated scoring: Beyond natural language processing. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 1099–1109).
  • Madnani N, et al. (2017b) “Building better open-source tools to support fairness in automated scoring.” EthNLP@EACL
  • Malinin A, et al. (2016) “Off-topic response detection for spontaneous spoken english assessment.” ACL
  • Malinin A, et al. (2017) “Incorporating uncertainty into deep learning for spoken language assessment.” ACL
  • Mathias S, Bhattacharyya P (2018a) Thank “Goodness”! A Way to Measure Style in Student Essays. In: Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications p 35–41
  • Mathias S, Bhattacharyya P (2018b) ASAP++: Enriching the ASAP automated essay grading dataset with essay attribute scores. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
  • Mikolov T, et al. (2013) “Efficient Estimation of Word Representations in Vector Space.” ICLR
  • Mohler M, Mihalcea R (2009) Text-to-text semantic similarity for automatic short answer grading. In: Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009) p 567–575
  • Mohler M, Bunescu R, Mihalcea R (2011) Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In: Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies p 752–762
  • Muangkammuen P, Fukumoto F (2020) Multi-task Learning for Automated Essay Scoring with Sentiment Analysis. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop p 116–123
  • Nguyen, H., & Dery, L. (2016). Neural networks for automated essay grading. CS224d Stanford Reports, 1–11.
  • Palma D, Atkinson J. Coherence-based automatic essay assessment. IEEE Intell Syst. 2018;33(5):26–36. doi: 10.1109/MIS.2018.2877278
  • Parekh S, et al (2020) My Teacher Thinks the World Is Flat! Interpreting Automatic Essay Scoring Mechanism.” ArXiv abs/2012.13872 (2020): n. pag
  • Pennington, J., Socher, R., & Manning, C. D. (2014, October). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543).
  • Persing I, Ng V (2013) Modeling thesis clarity in student essays. In:Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) p 260–269
  • Powers DE, Burstein JC, Chodorow M, Fowles ME, Kukich K. Stumping E-Rater: challenging the validity of automated essay scoring. ETS Res Rep Ser. 2001;2001(1):i–44
  • Powers DE, Burstein JC, Chodorow M, Fowles ME, Kukich K. Stumping e-rater: challenging the validity of automated essay scoring. Comput Hum Behav. 2002;18(2):103–134. doi: 10.1016/S0747-5632(01)00052-8
  • Ramachandran L, Cheng J, Foltz P (2015) Identifying patterns for short answer scoring using graph-based lexico-semantic text matching. In: Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications p 97–106
  • Ramanarayanan V, et al. (2017) “Human and Automated Scoring of Fluency, Pronunciation and Intonation During Human-Machine Spoken Dialog Interactions.” INTERSPEECH
  • Riordan B, Horbach A, Cahill A, Zesch T, Lee C (2017) Investigating neural architectures for short answer scoring. In: Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications p 159–168
  • Riordan B, Flor M, Pugh R (2019) "How to account for misspellings: Quantifying the benefit of character representations in neural content scoring models."In: Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications
  • Rodriguez P, Jafari A, Ormerod CM (2019) Language models and Automated Essay Scoring. ArXiv, abs/1909.09482
  • Rudner, L. M., & Liang, T. (2002). Automated essay scoring using Bayes' theorem. The Journal of Technology, Learning and Assessment, 1(2).
  • Rudner, L. M., Garcia, V., & Welch, C. (2006). An evaluation of IntelliMetric™ essay scoring system. The Journal of Technology, Learning and Assessment, 4(4).
  • Rupp A. Designing, evaluating, and deploying automated scoring systems with validity in mind: methodological design decisions. Appl Meas Educ. 2018;31:191–214. doi: 10.1080/08957347.2018.1464448
  • Ruseti S, Dascalu M, Johnson AM, McNamara DS, Balyan R, McCarthy KS, Trausan-Matu S (2018) Scoring summaries using recurrent neural networks. In: International Conference on Intelligent Tutoring Systems p 191–201. Springer, Cham
  • Sakaguchi K, Heilman M, Madnani N (2015) Effective feature integration for automated short answer scoring. In: Proceedings of the 2015 conference of the North American Chapter of the association for computational linguistics: Human language technologies p 1049–1054
  • Salim, Y., Stevanus, V., Barlian, E., Sari, A. C., & Suhartono, D. (2019, December). Automated English Digital Essay Grader Using Machine Learning. In 2019 IEEE International Conference on Engineering, Technology and Education (TALE) (pp. 1–6). IEEE.
  • Shehab A, Elhoseny M, Hassanien AE (2016) A hybrid scheme for Automated Essay Grading based on LVQ and NLP techniques. In: 12th International Computer Engineering Conference (ICENCO), Cairo, 2016, p 65-70
  • Shermis MD, Mzumara HR, Olson J, Harrington S. On-line grading of student essays: PEG goes on the World Wide Web. Assess Eval High Educ. 2001;26(3):247–259. doi: 10.1080/02602930120052404
  • Stab C, Gurevych I (2014) Identifying argumentative discourse structures in persuasive essays. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) p 46–56
  • Sultan MA, Salazar C, Sumner T (2016) Fast and easy short answer grading with high accuracy. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies p 1070–1075
  • Süzen, N., Gorban, A. N., Levesley, J., & Mirkes, E. M. (2020). Automatic short answer grading and feedback using text mining methods. Procedia Computer Science, 169, 726–743.
  • Taghipour K, Ng HT (2016) A neural approach to automated essay scoring. In: Proceedings of the 2016 conference on empirical methods in natural language processing p 1882–1891
  • Tashu TM (2020) "Off-Topic Essay Detection Using C-BGRU Siamese. In: 2020 IEEE 14th International Conference on Semantic Computing (ICSC), San Diego, CA, USA, p 221–225, doi: 10.1109/ICSC.2020.00046
  • Tashu TM, Horváth T (2019) A layered approach to automatic essay evaluation using word-embedding. In: McLaren B, Reilly R, Zvacek S, Uhomoibhi J (eds) Computer Supported Education. CSEDU 2018. Communications in Computer and Information Science, vol 1022. Springer, Cham
  • Tashu TM, Horváth T (2020) Semantic-Based Feedback Recommendation for Automatic Essay Evaluation. In: Bi Y, Bhatia R, Kapoor S (eds) Intelligent Systems and Applications. IntelliSys 2019. Advances in Intelligent Systems and Computing, vol 1038. Springer, Cham
  • Uto M, Okano M (2020) Robust Neural Automated Essay Scoring Using Item Response Theory. In: Bittencourt I, Cukurova M, Muldner K, Luckin R, Millán E (eds) Artificial Intelligence in Education. AIED 2020. Lecture Notes in Computer Science, vol 12163. Springer, Cham
  • Wang Z, Liu J, Dong R (2018a) Intelligent Auto-grading System. In: 2018 5th IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS) p 430–435. IEEE.
  • Wang Y, et al. (2018b) “Automatic Essay Scoring Incorporating Rating Schema via Reinforcement Learning.” EMNLP
  • Zhu W, Sun Y (2020) Automated essay scoring system using multi-model machine learning. In: Wyld DC, et al. (eds) MLNLP, BDIOT, ITCCMA, CSITY, DTMN, AIFZ, SIGPRO 2020
  • Wresch W. The Imminence of Grading Essays by Computer-25 Years Later. Comput Compos. 1993;10:45–58. doi: 10.1016/S8755-4615(05)80058-1
  • Wu, X., Knill, K., Gales, M., & Malinin, A. (2020). Ensemble approaches for uncertainty in spoken language assessment.
  • Xia L, Liu J, Zhang Z (2019) Automatic Essay Scoring Model Based on Two-Layer Bi-directional Long-Short Term Memory Network. In: Proceedings of the 2019 3rd International Conference on Computer Science and Artificial Intelligence p 133–137
  • Yannakoudakis H, Briscoe T, Medlock B (2011) A new dataset and method for automatically grading ESOL texts. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies p 180–189
  • Zhao S, Zhang Y, Xiong X, Botelho A, Heffernan N (2017) A memory-augmented neural model for automated grading. In: Proceedings of the Fourth (2017) ACM Conference on Learning@ Scale p 189–192
  • Zupanc K, Bosnic Z (2014) Automated essay evaluation augmented with semantic coherence measures. In: 2014 IEEE International Conference on Data Mining p 1133–1138. IEEE.
  • Zupanc K, Savić M, Bosnić Z, Ivanović M (2017) Evaluating coherence of essays using sentence-similarity networks. In: Proceedings of the 18th International Conference on Computer Systems and Technologies p 65–72
  • Dzikovska, M. O., Nielsen, R., & Brew, C. (2012, June). Towards effective tutorial feedback for explanation questions: A dataset and baselines. In  Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies  (pp. 200-210).
  • Kumar, N., & Dey, L. (2013, November). Automatic Quality Assessment of documents with application to essay grading. In 2013 12th Mexican International Conference on Artificial Intelligence (pp. 216–222). IEEE.
  • Wu, S. H., & Shih, W. F. (2018, July). A short answer grading system in chinese by support vector approach. In Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications (pp. 125-129).
  • Agung Putri Ratna, A., Lalita Luhurkinanti, D., Ibrahim I., Husna D., Dewi Purnamasari P. (2018). Automatic Essay Grading System for Japanese Language Examination Using Winnowing Algorithm, 2018 International Seminar on Application for Technology of Information and Communication, 2018, pp. 565–569. 10.1109/ISEMANTIC.2018.8549789.
  • Sharma A., & Jayagopi D. B. (2018). Automated Grading of Handwritten Essays 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), 2018, pp 279–284. 10.1109/ICFHR-2018.2018.00056
  • Shermis MD, Burstein J, Zechner K, et al. (2010) Automated Essay Scoring: Writing Assessment and Instruction. doi: 10.1016/B978-0-08-044894-7.00233-5

Automated Essay Scoring (Papers With Code)

26 papers with code • 1 benchmark • 1 dataset

Automated Essay Scoring is the task of assigning a score to an essay, usually in the context of assessing the language ability of a language learner. The quality of an essay is affected by four primary dimensions: topic relevance, organization and coherence, word usage and sentence complexity, and grammar and mechanics.

Source: A Joint Model for Multimodal Document Quality Assessment

Benchmark (best model listed): Tran-BERT-MS-ML-R

Most implemented papers

  • Automated Essay Scoring Based on Two-Stage Learning. Current state-of-the-art feature-engineered and end-to-end AES methods are proven to be unable to detect adversarial samples, e.g., essays composed of permuted sentences and prompt-irrelevant essays.
  • A Neural Approach to Automated Essay Scoring (nusnlp/nea • EMNLP 2016).
  • SkipFlow: Incorporating Neural Coherence Features for End-to-End Automatic Text Scoring. Proposes a SkipFlow mechanism that models relationships between snapshots of the hidden representations of a long short-term memory (LSTM) network as it reads.
  • Neural Automated Essay Scoring and Coherence Modeling for Adversarially Crafted Input (Youmna-H/Coherence_AES • NAACL 2018). Demonstrates that current state-of-the-art approaches to AES are not well suited to capturing adversarially crafted input of grammatical but incoherent sequences of sentences.
  • Co-Attention Based Neural Network for Source-Dependent Essay Scoring. Presents an investigation of using a co-attention based neural network for source-dependent essay scoring.
  • Language Models and Automated Essay Scoring. Presents a new comparative study on automatic essay scoring (AES).
  • Evaluation Toolkit for Robustness Testing of Automatic Essay Scoring Systems (midas-research/calling-out-bluff • 14 Jul 2020). Notes that the use of automated scoring is increasing further due to COVID-19 and the associated automation of education and testing.
  • Prompt Agnostic Essay Scorer: A Domain Generalization Approach to Cross-prompt Automated Essay Scoring. Cross-prompt AES requires the system to use non-target-prompt essays to award scores to a target-prompt essay.
  • Many Hands Make Light Work: Using Essay Traits to Automatically Score Essays. Conducts ablation tests for each essay trait to find out which traits work best for different types of essays.
  • EXPATS: A Toolkit for Explainable Automated Text Scoring (octanove/expats • 7 Apr 2021). Automated text scoring (ATS) tasks, such as automated essay scoring and readability assessment, are important educational applications of natural language processing.


Title: Automated Essay Scoring Using Efficient Transformer-Based Language Models

Abstract: Automated Essay Scoring (AES) is a cross-disciplinary effort involving Education, Linguistics, and Natural Language Processing (NLP). The efficacy of an NLP model in AES tests its ability to evaluate long-term dependencies and extrapolate meaning even when text is poorly written. Large pretrained transformer-based language models have dominated the current state-of-the-art in many NLP tasks; however, the computational requirements of these models make them expensive to deploy in practice. The goal of this paper is to challenge the paradigm in NLP that bigger is better when it comes to AES. To do this, we evaluate the performance of several fine-tuned pretrained NLP models with a modest number of parameters on an AES dataset. By ensembling our models, we achieve excellent results with fewer parameters than most pretrained transformer-based models.
Comments: 11 pages, 1 figure, 3 tables
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Linking essay-writing tests using many-facet models and neural automated essay scoring

  • Original Manuscript
  • Open access
  • Published: 20 August 2024


  • Masaki Uto, ORCID: orcid.org/0000-0002-9330-5158
  • Kota Aramaki


For essay-writing tests, challenges arise when scores assigned to essays are influenced by the characteristics of raters, such as rater severity and consistency. Item response theory (IRT) models incorporating rater parameters have been developed to tackle this issue, exemplified by the many-facet Rasch models. These IRT models enable the estimation of examinees’ abilities while accounting for the impact of rater characteristics, thereby enhancing the accuracy of ability measurement. However, difficulties can arise when different groups of examinees are evaluated by different sets of raters. In such cases, test linking is essential for unifying the scale of model parameters estimated for individual examinee–rater groups. Traditional test-linking methods typically require administrators to design groups in which either examinees or raters are partially shared. However, this is often impractical in real-world testing scenarios. To address this, we introduce a novel method for linking the parameters of IRT models with rater parameters that uses neural automated essay scoring technology. Our experimental results indicate that our method successfully accomplishes test linking with accuracy comparable to that of linear linking using few common examinees.


Introduction

The growing demand for assessing higher-order skills, such as logical reasoning and expressive capabilities, has led to increased interest in essay-writing assessments (Abosalem, 2016 ; Bernardin et al., 2016 ; Liu et al., 2014 ; Rosen & Tager, 2014 ; Schendel & Tolmie, 2017 ). In these assessments, human raters assess the written responses of examinees to specific writing tasks. However, a major limitation of these assessments is the strong influence that rater characteristics, including severity and consistency, have on the accuracy of ability measurement (Bernardin et al., 2016 ; Eckes, 2005 , 2023 ; Kassim, 2011 ; Myford & Wolfe, 2003 ). Several item response theory (IRT) models that incorporate parameters representing rater characteristics have been proposed to mitigate this issue (Eckes, 2023 ; Myford & Wolfe, 2003 ; Uto & Ueno, 2018 ).

The most prominent among them are many-facet Rasch models (MFRMs) (Linacre, 1989 ), and various extensions of MFRMs have been proposed to date (Patz & Junker, 1999 ; Patz et al., 2002 ; Uto & Ueno, 2018 , 2020 ). These IRT models have the advantage of being able to estimate examinee ability while accounting for rater effects, making them more accurate than simple scoring methods based on point totals or averages.

However, difficulties can arise when essays from different groups of examinees are evaluated by different sets of raters, a scenario often encountered in real-world testing. For instance, in academic settings such as university admissions, individual departments may use different pools of raters to assess essays from specific applicant pools. Similarly, in the context of large-scale standardized tests, different sets of raters may be allocated to various test dates or locations. Thus, when applying IRT models with rater parameters to account for such real-world testing cases while also ensuring that ability estimates are comparable across groups of examinees and raters, test linking becomes essential for unifying the scale of model parameters estimated for each group.

Conventional test-linking methods generally require some overlap of examinees or raters across the groups being linked (Eckes, 2023 ; Engelhard, 1997 ; Ilhan, 2016 ; Linacre, 2014 ; Uto, 2021a ). For example, linear linking based on common examinees, a popular linking method, estimates the IRT parameters for shared examinees using data from each group. These estimates are then used to build a linear regression model, which adjusts the parameter scales across groups. However, the design of such overlapping groups can often be impractical in real-world testing environments.

To facilitate test linking in these challenging environments, we introduce a novel method that leverages neural automated essay scoring (AES) technology. Specifically, we employ a cutting-edge deep neural AES method (Uto & Okano, 2021 ) that can predict IRT-based abilities from examinees’ essays. The central concept of our linking method is to construct an AES model using the ability estimates of examinees in a reference group, along with their essays, and then to apply this model to predict the abilities of examinees in other groups. An important point is that the AES model is trained to predict examinee abilities on the scale established by the reference group. This implies that the trained AES model can predict the abilities of examinees in other groups on the ability scale established by the reference group. Therefore, we use the predicted abilities to calculate the linking coefficients required for linear linking and to perform a test linking. In this study, we conducted experiments based on real-world data to demonstrate that our method successfully accomplishes test linking with accuracy comparable to that of linear linking using few common examinees.

It should be noted that previous studies have attempted to employ AES technologies for test linking (Almond, 2014 ; Olgar, 2015 ), but their focus has primarily been on linking tests with varied writing tasks or a mixture of essay tasks and objective items, while overlooking the influence of rater characteristics. This differs from the specific scenarios and goals that our study aims to address. To the best of our knowledge, this is the first study that employs AES technologies to link IRT models incorporating rater parameters for writing assessments without the need for common examinees and raters.

Setting and data

In this study, we assume scenarios in which two groups of examinees respond to the same writing task and their written essays are assessed by two distinct sets of raters following the same scoring rubric. We refer to one group as the reference group , which serves as the basis for the scale, and the other as the focal group , whose scale we aim to align with that of the reference group.

Let \(u^{\text{ref}}_{jr}\) be the score assigned by rater \(r \in \mathcal{R}^{\text{ref}}\) to the essay of examinee \(j \in \mathcal{J}^{\text{ref}}\), where \(\mathcal{R}^{\text{ref}}\) and \(\mathcal{J}^{\text{ref}}\) denote the sets of raters and examinees in the reference group, respectively. Then, a collection of scores for the reference group can be defined as

\[ \textbf{U}^{\text{ref}} = \left\{ u^{\text{ref}}_{jr} \in \mathcal{K} \cup \{-1\} \;\big|\; j \in \mathcal{J}^{\text{ref}},\; r \in \mathcal{R}^{\text{ref}} \right\}, \tag{1} \]

where \(\mathcal{K} = \{1,\ldots,K\}\) represents the rating categories, and \(-1\) indicates missing data.

Similarly, a collection of scores for the focal group can be defined as

\[ \textbf{U}^{\text{foc}} = \left\{ u^{\text{foc}}_{jr} \in \mathcal{K} \cup \{-1\} \;\big|\; j \in \mathcal{J}^{\text{foc}},\; r \in \mathcal{R}^{\text{foc}} \right\}, \tag{2} \]

where \(u^{\text{foc}}_{jr}\) indicates the score assigned by rater \(r \in \mathcal{R}^{\text{foc}}\) to the essay of examinee \(j \in \mathcal{J}^{\text{foc}}\), and \(\mathcal{R}^{\text{foc}}\) and \(\mathcal{J}^{\text{foc}}\) represent the sets of raters and examinees in the focal group, respectively.

The primary objective of this study is to apply IRT models with rater parameters to the two sets of data, \(\textbf{U}^{\text {ref}}\) and \(\textbf{U}^{\text {foc}}\) , and to establish IRT parameter linking without shared examinees and raters: \(\mathcal {J}^{\text {ref}} \cap \mathcal {J}^{\text {foc}} = \emptyset \) and \(\mathcal {R}^{\text {ref}} \cap \mathcal {R}^{\text {foc}} = \emptyset \) . More specifically, we seek to align the scale derived from \(\textbf{U}^{\text {foc}}\) with that of \(\textbf{U}^{\text {ref}}\) .
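As a concrete illustration of the data layout just described, the following minimal numpy sketch stores a reference-group score matrix with \(-1\) as the missing-data code. The array sizes, names, and scores are invented for illustration; they are not from the paper.

```python
import numpy as np

# Illustrative sizes (not from the paper): 4 examinees, 3 raters, K = 5 categories.
J_REF, R_REF, K = 4, 3, 5

# U_ref[j, r] holds the category (1..K) that rater r gave to examinee j,
# or -1 when rater r did not score that essay (the missing-data code above).
U_ref = np.full((J_REF, R_REF), -1, dtype=int)
U_ref[0, 0] = 4   # rater 0 gave examinee 0 a score of 4
U_ref[0, 2] = 3
U_ref[1, 1] = 5

observed = U_ref != -1   # mask of (examinee, rater) pairs that were actually scored
print(observed.sum(), "observed scores out of", U_ref.size, "cells")
```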

Item response theory

IRT (Lord, 1980 ), a test theory grounded in mathematical models, has recently gained widespread use in various testing situations due to the growing prevalence of computer-based testing. In objective testing contexts, IRT makes use of latent variable models, commonly referred to as IRT models. Traditional IRT models, such as the Rasch model and the two-parameter logistic model, give the probability of an examinee’s response to a test item as a probabilistic function influenced by both the examinee’s latent ability and the item’s characteristic parameters, such as difficulty and discrimination. These IRT parameters can be estimated from a dataset consisting of examinees’ responses to test items.

However, traditional IRT models are not directly applicable to essay-writing test data, where the examinees’ responses to test items are assessed by multiple human raters. Extended IRT models with rater parameters have been proposed to address this issue (Eckes, 2023 ; Jin and Wang, 2018 ; Linacre, 1989 ; Shin et al., 2019 ; Uto, 2023 ; Wilson & Hoskens, 2001 ).

Many-facet Rasch models and their extensions

The MFRM (Linacre, 1989) is the most commonly used IRT model that incorporates rater parameters. Although several variants of the MFRM exist (Eckes, 2023; Myford & Wolfe, 2004), the most representative model defines the probability that the essay of examinee j for a given test item (either a writing task or prompt) i receives a score of k from rater r as

\[ P_{ijrk} = \frac{\exp \sum_{m=1}^{k} \left[ D\left(\theta_j - \beta_i - \beta_r - d_m\right) \right]}{\sum_{l=1}^{K} \exp \sum_{m=1}^{l} \left[ D\left(\theta_j - \beta_i - \beta_r - d_m\right) \right]}, \tag{3} \]

where \(\theta_j\) is the latent ability of examinee j, \(\beta_{i}\) represents the difficulty of item i, \(\beta_{r}\) represents the severity of rater r, and \(d_{m}\) is a step parameter denoting the difficulty of transitioning between scores \(m-1\) and m. \(D = 1.7\) is a scaling constant used to minimize the difference between the normal and logistic distribution functions. For model identification, \(\sum_{i} \beta_{i} = 0\), \(d_1 = 0\), \(\sum_{m = 2}^{K} d_{m} = 0\), and a normal distribution for the ability \(\theta_j\) are assumed.

Another popular MFRM is one in which \(d_{m}\) is replaced with \(d_{rm}\) , a rater-specific step parameter denoting the severity of rater r when transitioning from score  \(m-1\) to m . This model is often used to investigate variations in rating scale criteria among raters caused by differences in the central tendency, extreme response tendency, and range restriction among raters (Eckes, 2023 ; Myford & Wolfe, 2004 ; Qiu et al., 2022 ; Uto, 2021a ).

A recent extension of the MFRM is the generalized many-facet model (GMFM) (Uto & Ueno, 2020), which incorporates parameters denoting rater consistency and item discrimination. GMFM defines the probability \(P_{ijrk}\) as

\[ P_{ijrk} = \frac{\exp \sum_{m=1}^{k} \left[ D\alpha_i \alpha_r \left(\theta_j - \beta_i - \beta_r - d_{rm}\right) \right]}{\sum_{l=1}^{K} \exp \sum_{m=1}^{l} \left[ D\alpha_i \alpha_r \left(\theta_j - \beta_i - \beta_r - d_{rm}\right) \right]}, \tag{4} \]

where \(\alpha_i\) indicates the discrimination power of item i, and \(\alpha_r\) indicates the consistency of rater r. For model identification, \(\prod_{i} \alpha_{i} = 1\), \(\sum_{i} \beta_{i} = 0\), \(d_{r1} = 0\), \(\sum_{m = 2}^{K} d_{rm} = 0\), and a normal distribution for the ability \(\theta_j\) are assumed.

In this study, we seek to apply the aforementioned IRT models to data involving a single test item, as detailed in the Setting and data section. When there is only one test item, the item parameters in the above equations become superfluous and can be omitted. Consequently, the equations for these models can be simplified as follows.

MFRM:

\[ P_{jrk} = \frac{\exp \sum_{m=1}^{k} \left[ D\left(\theta_j - \beta_r - d_m\right) \right]}{\sum_{l=1}^{K} \exp \sum_{m=1}^{l} \left[ D\left(\theta_j - \beta_r - d_m\right) \right]} \tag{5} \]

MFRM with rater-specific step parameters (referred to as MFRM with RSS in the subsequent sections):

\[ P_{jrk} = \frac{\exp \sum_{m=1}^{k} \left[ D\left(\theta_j - \beta_r - d_{rm}\right) \right]}{\sum_{l=1}^{K} \exp \sum_{m=1}^{l} \left[ D\left(\theta_j - \beta_r - d_{rm}\right) \right]} \tag{6} \]

GMFM:

\[ P_{jrk} = \frac{\exp \sum_{m=1}^{k} \left[ D\alpha_r\left(\theta_j - \beta_r - d_{rm}\right) \right]}{\sum_{l=1}^{K} \exp \sum_{m=1}^{l} \left[ D\alpha_r\left(\theta_j - \beta_r - d_{rm}\right) \right]} \tag{7} \]

Note that the GMFM can simultaneously capture the following typical characteristics of raters, whereas the MFRM and MFRM with RSS can only consider a subset of these characteristics.

Severity : This refers to the tendency of some raters to systematically assign higher or lower scores compared with other raters regardless of the actual performance of the examinee. This tendency is quantified by the parameter \(\beta _r\) .

Consistency : This is the extent to which raters maintain their scoring criteria consistently over time and across different examinees. Consistent raters exhibit stable scoring patterns, which make their evaluations more reliable and predictable. In contrast, inconsistent raters show varying scoring tendencies. This characteristic is represented by the parameter \(\alpha _r\) .

Range Restriction : This describes the limited variability in scores assigned by a rater. Central tendency and extreme response tendency are special cases of range restriction. This characteristic is represented by the parameter \(d_{rm}\) .

For details on how these characteristics are represented in the GMFM, see the article (Uto & Ueno, 2020 ).

Based on the above, it is evident that both the MFRM and MFRM with RSS are special cases of the GMFM. Specifically, the GMFM with constant rater consistency corresponds to the MFRM with RSS. Moreover, the MFRM with RSS that assumes no differences in the range restriction characteristic among raters aligns with the MFRM.
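To make the nesting of the three models concrete, here is a minimal numpy sketch of the single-item category probabilities (Eqs. 5-7 as reconstructed above). The rater parameter values are invented, and the function is a didactic sketch rather than the authors' implementation.

```python
import numpy as np

def gmfm_probs(theta, alpha_r, beta_r, d_r, D=1.7):
    """Category probabilities P(score = k | theta) for one rater under the
    single-item GMFM (Eq. 7). d_r = (d_r1, ..., d_rK) with d_r1 = 0.
    Setting alpha_r = 1 recovers the MFRM with RSS (Eq. 6); additionally
    sharing d_r across raters recovers the plain MFRM (Eq. 5)."""
    exponents = np.cumsum(D * alpha_r * (theta - beta_r - np.asarray(d_r)))
    p = np.exp(exponents - exponents.max())   # numerically stabilised softmax
    return p / p.sum()

# Illustrative (made-up) rater: slightly severe (beta_r = 0.3), consistent (alpha_r = 1.2).
d_r = np.array([0.0, -1.0, -0.2, 0.4, 0.8])   # satisfies d_r1 = 0 and sum_{m>=2} d_rm = 0
print(gmfm_probs(theta=0.5, alpha_r=1.2, beta_r=0.3, d_r=d_r))
```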

When the aforementioned IRT models are applied to datasets from multiple groups composed of different examinees and raters, such as \(\textbf{U}^{\text{ref}}\) and \(\textbf{U}^{\text{foc}}\), the scales of the estimated parameters generally differ among them. This discrepancy arises because IRT permits arbitrary scaling of parameters for each independent dataset. An exception occurs when it is feasible to assume that the distributions of examinee abilities and rater parameters are equal across tests (Linacre, 2014). However, real-world testing conditions may not always satisfy this assumption. Therefore, if the aim is to compare parameter estimates between different groups, test linking is generally required to unify the scale of model parameters estimated from each individual group's dataset.

One widely used approach for test linking is linear linking . In the context of the essay-writing test considered in this study, implementing linear linking necessitates designing two groups so that there is some overlap in examinees between them. With this design, IRT parameters for the shared examinees are estimated individually for each group. These estimates are then used to construct a linear regression model for aligning the parameter scales across groups, thereby rendering them comparable. We now introduce the mean and sigma method  (Kolen & Brennan, 2014 ; Marco, 1977 ), a popular method for linear linking, and illustrate the procedures for parameter linking specifically for the GMFM, as defined in Eq.  7 , because both the MFRM and the MFRM with RSS can be regarded as special cases of the GMFM, as explained earlier.

To elucidate this, let us assume that the datasets corresponding to the reference and focal groups, denoted as \(\textbf{U}^{\text{ref}}\) and \(\textbf{U}^{\text{foc}}\), contain overlapping sets of examinees. Furthermore, let us assume that \(\hat{\varvec{\theta}}^{\text{foc}}\), \(\hat{\varvec{\alpha}}^{\text{foc}}\), \(\hat{\varvec{\beta}}^{\text{foc}}\), and \(\hat{\varvec{d}}^{\text{foc}}\) are the GMFM parameters estimated from \(\textbf{U}^{\text{foc}}\). The mean and sigma method aims to transform these parameters linearly so that their scale aligns with those estimated from \(\textbf{U}^{\text{ref}}\). This transformation is guided by the equations

\[ \tilde{\theta}^{\text{foc}}_j = A\hat{\theta}^{\text{foc}}_j + K, \qquad \tilde{\alpha}^{\text{foc}}_r = \hat{\alpha}^{\text{foc}}_r / A, \qquad \tilde{\beta}^{\text{foc}}_r = A\hat{\beta}^{\text{foc}}_r + K, \qquad \tilde{d}^{\text{foc}}_{rm} = A\hat{d}^{\text{foc}}_{rm}, \tag{8} \]

where \(\tilde{\varvec{\theta}}^{\text{foc}}\), \(\tilde{\varvec{\alpha}}^{\text{foc}}\), \(\tilde{\varvec{\beta}}^{\text{foc}}\), and \(\tilde{\varvec{d}}^{\text{foc}}\) represent the scale-transformed parameters for the focal group. The linking coefficients are defined as

\[ A = \frac{\sigma^{\text{ref}}}{\sigma^{\text{foc}}}, \qquad K = \mu^{\text{ref}} - A\,\mu^{\text{foc}}, \]

where \(\mu^{\text{ref}}\) and \(\sigma^{\text{ref}}\) represent the mean and standard deviation (SD) of the common examinees' ability values estimated from \(\textbf{U}^{\text{ref}}\), and \(\mu^{\text{foc}}\) and \(\sigma^{\text{foc}}\) represent those values obtained from \(\textbf{U}^{\text{foc}}\).

This linear linking method is applicable when there are common examinees across different groups. However, as discussed in the introduction, arranging for multiple groups with partially overlapping examinees (and/or raters) can often be impractical in real-world testing environments. To address this limitation, we aim to facilitate test linking without the need for common examinees and raters by leveraging AES technology.
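The following sketch shows how the mean and sigma method could be computed from the common examinees' ability estimates. The array values are invented, and the transformation applied to \(\alpha\) and \(d\) follows the reconstruction of Eq. 8 above rather than any published code.

```python
import numpy as np

def mean_sigma_coefficients(theta_common_ref, theta_common_foc):
    """Linking coefficients from the common examinees' ability estimates
    in the reference and focal groups (mean and sigma method)."""
    A = np.std(theta_common_ref) / np.std(theta_common_foc)
    K = np.mean(theta_common_ref) - A * np.mean(theta_common_foc)
    return A, K

def apply_linking(A, K, theta, alpha, beta, d):
    """Rescale focal-group GMFM parameters onto the reference scale
    (assumed form: location/scale for theta and beta, inverse scale for
    alpha, scale only for the step parameters)."""
    return A * theta + K, alpha / A, A * beta + K, A * d

# Illustrative estimates for five common examinees in each group.
A, K = mean_sigma_coefficients(np.array([0.8, 0.1, -0.3, 1.2, 0.4]),
                               np.array([1.5, 0.4, -0.2, 2.1, 0.9]))
print(round(A, 3), round(K, 3))
```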

Automated essay scoring models

Many AES methods have been developed over recent decades and can be broadly categorized into either feature-engineering or automatic feature extraction approaches (Hussein et al., 2019 ; Ke & Ng, 2019 ). The feature-engineering approach predicts essay scores using either a regression or classification model that employs manually designed features, such as essay length and the number of spelling errors (Amorim et al., 2018 ; Dascalu et al., 2017 ; Nguyen & Litman, 2018 ; Shermis & Burstein, 2002 ). The advantages of this approach include greater interpretability and explainability. However, it generally requires considerable effort in developing effective features to achieve high scoring accuracy for various datasets. Automatic feature extraction approaches based on deep neural networks (DNNs) have recently attracted attention as a means of eliminating the need for feature engineering. Many DNN-based AES models have been proposed in the last decade and have achieved state-of-the-art accuracy (Alikaniotis et al., 2016 ; Dasgupta et al., 2018 ; Farag et al., 2018 ; Jin et al., 2018 ; Mesgar & Strube, 2018 ; Mim et al., 2019 ; Nadeem et al., 2019 ; Ridley et al., 2021 ; Taghipour & Ng, 2016 ; Uto, 2021b ; Wang et al., 2018 ). In the next section, we introduce the most widely used DNN-based AES model, which utilizes Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019 ).

BERT-based AES model

BERT, a pre-trained language model developed by Google’s AI language team, achieved state-of-the-art performance in various natural language processing (NLP) tasks in 2019 (Devlin et al., 2019 ). Since then, it has frequently been applied to AES (Rodriguez et al., 2019 ) and automated short-answer grading (Liu et al., 2019 ; Lun et al., 2020 ; Sung et al., 2019 ) and has demonstrated high accuracy.

BERT is structured as a multilayer bidirectional transformer network, where the transformer is a neural network architecture designed to handle ordered sequences of data using an attention mechanism. See Ref. (Vaswani et al., 2017 ) for details of transformers.

BERT undergoes training in two distinct phases, pretraining and fine-tuning . The pretraining phase utilizes massive volumes of unlabeled text data and is conducted through two unsupervised learning tasks, specifically, masked language modeling and next-sentence prediction . Masked language modeling predicts the identities of words that have been masked out of the input text, while next-sentence prediction predicts whether two given sentences are adjacent.

Fine-tuning is required to adapt a pre-trained BERT model for a specific NLP task, including AES. This entails retraining the BERT model using a task-specific supervised dataset after initializing the model parameters with pre-trained values and augmenting with task-specific output layers. For AES applications, the addition of a special token, [CLS] , at the beginning of each input is required. Then, BERT condenses the entire input text into a fixed-length real-valued hidden vector referred to as the distributed text representation , which corresponds to the output of the special token [CLS]  (Devlin et al., 2019 ). AES scores can thus be derived by feeding the distributed text representation into a linear layer with sigmoid activation , as depicted in Fig.  1 . More formally, let \( \varvec{h} \) be the distributed text representation. The linear layer with sigmoid activation is defined as \(\sigma (\varvec{W}\varvec{h}+\text{ b})\) , where \(\varvec{W}\) is a weight matrix and \(\text{ b }\) is a bias, both learned during the fine-tuning process. The sigmoid function \(\sigma ()\) maps its input to a value between 0 and 1. Therefore, the model is trained to minimize an error loss function between the predicted scores and the gold-standard scores, which are normalized to the [0, 1] range. Moreover, score prediction using the trained model is performed by linearly rescaling the predicted scores back to the original score range.

Figure 1. BERT-based AES model architecture. \(w_{jt}\) is the t-th word in the essay of examinee j, \(n_j\) is the number of words in the essay, and \(\hat{y}_{j}\) represents the predicted score from the model.
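A minimal PyTorch/Hugging Face sketch of the architecture in Figure 1 might look as follows. The model name, maximum length, and example essay are placeholders, and this is an illustrative reimplementation under the description above, not the authors' code.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class BertForEssayScoring(nn.Module):
    """[CLS] representation -> linear layer -> sigmoid, as in Figure 1."""
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                  # distributed text representation
        return torch.sigmoid(self.head(cls)).squeeze(-1)   # score in [0, 1]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["An example essay ..."], truncation=True, padding=True,
                  max_length=512, return_tensors="pt")
model = BertForEssayScoring()
pred = model(batch["input_ids"], batch["attention_mask"])
# Fine-tuning would minimize, e.g., nn.MSELoss()(pred, targets) with targets in [0, 1].
```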

Problems with AES model training

As mentioned above, BERT-based and other DNN-based AES models must be trained or fine-tuned using a large dataset of essays that have been graded by human raters. Typically, the mean-squared error (MSE) between the predicted and the gold-standard scores serves as the loss function for model training. Specifically, let \(y_{j}\) be the normalized gold-standard score for the j-th examinee's essay, and let \(\hat{y}_{j}\) be the predicted score from the model. The MSE loss function is then defined as

\[ \mathcal{L} = \frac{1}{J} \sum_{j=1}^{J} \left( y_{j} - \hat{y}_{j} \right)^2, \]

where J denotes the number of examinees, which is equivalent to the number of essays, in the training dataset.

Here, note that a large-scale training dataset is often created by assigning a few raters from a pool of potential raters to each essay to reduce the scoring burden and to increase scoring reliability. In such cases, the gold-standard score for each essay is commonly determined by averaging the scores given by multiple raters assigned to that essay. However, as discussed in earlier sections, these straightforward average scores are highly sensitive to rater characteristics. When training data includes rater bias effects, an AES model trained on that data can show decreased performance as a result of inheriting these biases (Amorim et al., 2018 ; Huang et al., 2019 ; Li et al., 2020 ; Wind et al., 2018 ). An AES method that uses IRT has been proposed to address this issue (Uto & Okano, 2021 ).

AES method using IRT

The main idea behind the AES method using IRT (Uto & Okano, 2021 ) is to train an AES model using the ability value \(\theta _j\) estimated by IRT models with rater parameters, such as MFRM and its extensions, from the data given by multiple raters for each essay, instead of a simple average score. Specifically, AES model training in this method occurs in two steps, as outlined in Fig.  2 .

1. Estimate the IRT-based abilities \(\varvec{\theta }\) from a score dataset, which includes scores given to essays by multiple raters.

2. Train an AES model given the ability estimates as the gold-standard scores. Specifically, the MSE loss function for training is defined as

\[ \mathcal{L} = \frac{1}{J} \sum_{j=1}^{J} \left( \theta_{j} - \hat{\theta}_{j} \right)^2, \]

where \(\hat{\theta}_j\) represents the AES's predicted ability of the j-th examinee, and \(\theta_{j}\) is the gold-standard ability for the examinee obtained from Step 1. Note that the gold-standard scores are rescaled into the range [0, 1] by applying a linear transformation from the logit range \([-3, 3]\) to [0, 1]. See the original paper (Uto & Okano, 2021) for details.

Figure 2. Architecture of a BERT-based AES model that uses IRT.

A trained AES model based on this method will not reflect bias effects because IRT-based abilities \(\varvec{\theta }\) are estimated while removing rater bias effects.

In the prediction phase, the score for an essay from examinee \(j^{\prime }\) is calculated in two steps.

1. Predict the IRT-based ability \(\theta _{j^{\prime }}\) for the examinee using the trained AES model, and then linearly rescale it to the logit range \([-3, 3]\).

2. Calculate the expected score \(\mathbb {E}_{r,k}\left[ P_{j^{\prime }rk}\right]\), which corresponds to an unbiased original-scaled score, given \(\theta _{j'}\) and the rater parameters. This is used as the predicted essay score in this method.

This method originally aimed to train an AES model while mitigating the impact of varying rater characteristics present in the training data. A key feature, however, is its ability to predict an examinee’s IRT-based ability from their essay texts. Our linking approach leverages this feature to enable test linking without requiring common examinees and raters.
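The training and prediction steps above can be sketched as follows. The rescaling bounds \([-3, 3]\) come from the text; the treatment of the expectation \(\mathbb{E}_{r,k}\) as an average over raters of the expected category, and the single-item GMFM form, are assumptions made for this sketch.

```python
import numpy as np

def theta_to_target(theta, lo=-3.0, hi=3.0):
    """Map an IRT ability from the logit range [-3, 3] to [0, 1] so that it
    can serve as the gold-standard training target (training Step 2)."""
    return (np.clip(theta, lo, hi) - lo) / (hi - lo)

def target_to_theta(y, lo=-3.0, hi=3.0):
    """Inverse map: AES output in [0, 1] back to the logit range (prediction Step 1)."""
    return lo + np.asarray(y) * (hi - lo)

def expected_score(theta, rater_params, D=1.7):
    """Prediction Step 2 (one reading of E_{r,k}[P_jrk]): average over raters of the
    expected rating category under the single-item GMFM, given fixed rater parameters."""
    expectations = []
    for alpha_r, beta_r, d_r in rater_params:
        exponents = np.cumsum(D * alpha_r * (theta - beta_r - np.asarray(d_r)))
        p = np.exp(exponents - exponents.max())
        p /= p.sum()
        expectations.append((np.arange(1, len(p) + 1) * p).sum())
    return float(np.mean(expectations))
```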

Figure 3. Outline of our proposed method, steps 1 and 2.

Figure 4. Outline of our proposed method, steps 3–6.

Proposed method

The core idea behind our method is to develop an AES model that predicts examinee ability using score and essay data from the reference group, and then to use this model to predict the abilities of examinees in the focal group. These predictions are then used to estimate the linking coefficients for a linear linking. An outline of our method is illustrated in Figs.  3 and 4 . The detailed steps involved in the procedure are as follows.

1. Estimate the IRT model parameters from the reference group's data \(\textbf{U}^{\text {ref}}\) to obtain \(\hat{\varvec{\theta }}^{\text {ref}}\), the ability estimates of the examinees in the reference group.

2. Use the ability estimates \(\hat{\varvec{\theta }}^{\text {ref}}\) and the essays written by the examinees in the reference group to train the AES model that predicts examinee ability.

3. Use the trained AES model to predict the abilities of examinees in the focal group by inputting their essays. We designate these AES-predicted abilities as \(\hat{\varvec{\theta }}^{\text {foc}}_{\text {pred}}\) from here on. An important point to note is that the AES model is trained to predict ability values on the parameter scale aligned with the reference group's data, meaning that the predicted abilities for examinees in the focal group follow the same scale.

4. Estimate the IRT model parameters from the focal group's data \(\textbf{U}^{\text {foc}}\).

5. Calculate the linking coefficients A and K using the AES-predicted abilities \(\hat{\varvec{\theta }}^{\text {foc}}_{\text {pred}}\) and the IRT-based ability estimates \(\hat{\varvec{\theta }}^{\text {foc}}\) for examinees in the focal group as follows:

\[ A = \frac{\sigma^{\text{foc}}_{\text{pred}}}{\sigma^{\text{foc}}}, \qquad K = \mu^{\text{foc}}_{\text{pred}} - A\,\mu^{\text{foc}}, \]

where \(\mu^{\text{foc}}_{\text{pred}}\) and \(\sigma^{\text{foc}}_{\text{pred}}\) represent the mean and the SD of the AES-predicted abilities \(\hat{\varvec{\theta }}^{\text {foc}}_{\text {pred}}\), respectively. Furthermore, \(\mu^{\text{foc}}\) and \(\sigma^{\text{foc}}\) represent the corresponding values for the IRT-based ability estimates \(\hat{\varvec{\theta }}^{\text {foc}}\).

6. Apply linear linking based on the mean and sigma method given in Eq. 8, using the above linking coefficients and the parameter estimates for the focal group obtained in Step 4. This procedure yields parameter estimates for the focal group that are aligned with the scale of the parameters of the reference group.

As described in Step 3, the AES model used in our method is trained to predict examinee abilities on the scale derived from the reference data \(\textbf{U}^{\text {ref}}\) . Therefore, the abilities predicted by the trained AES model for the examinees in the focal group, denoted as \(\hat{\varvec{\theta }}^{\text {foc}}_{\text {pred}}\) , also follow the ability scale derived from the reference data. Consequently, by using the AES-predicted abilities, we can infer the differences in the ability distribution between the reference and focal groups. This enables us to estimate the linking coefficients, which then allows us to perform linear linking based on the mean and sigma method. Thus, our method allows for test linking without the need for common examinees and raters.

It is important to note that the current AES model for predicting examinees’ abilities does not necessarily offer sufficient prediction accuracy for individual ability estimates. This implies that their direct use in mid- to high-stakes assessments could be problematic. Therefore, we focus solely on the mean and SD values of the ability distribution based on predicted abilities, rather than using individual predicted ability values. Our underlying assumption is that these AES models can provide valuable insights into differences in the ability distribution across various groups, even though the individual predictions might be somewhat inaccurate, thereby substantiating their utility for test linking.
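Steps 3-6 can be sketched as follows, assuming the AES-predicted abilities from Step 3 and the focal group's IRT estimates from Step 4 are already available as arrays. The function name, the hypothetical `aes_model.predict_theta` call mentioned in the comment, and the transformation applied to \(\alpha\) and \(d\) mirror the mean and sigma reconstruction above and are not taken from the authors' code.

```python
import numpy as np

def link_focal_to_reference(theta_pred_foc, theta_hat_foc,
                            alpha_hat_foc, beta_hat_foc, d_hat_foc):
    """Steps 5-6 of the proposed method: compute A and K from the AES-predicted
    abilities (already on the reference scale) and the focal group's own IRT
    estimates, then rescale the focal-group parameters (mean and sigma method)."""
    A = np.std(theta_pred_foc) / np.std(theta_hat_foc)
    K = np.mean(theta_pred_foc) - A * np.mean(theta_hat_foc)
    return (A * theta_hat_foc + K,   # examinee abilities
            alpha_hat_foc / A,       # rater consistencies
            A * beta_hat_foc + K,    # rater severities
            A * d_hat_foc,           # rater-specific step parameters
            A, K)

# theta_pred_foc would come from Step 3, e.g. aes_model.predict_theta(focal_essays)
# (hypothetical API); the remaining arrays come from Step 4's IRT estimation.
```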

Experiments

In this section, we provide an overview of the experiments we conducted using actual data to evaluate the effectiveness of our method.

Actual data

We used the dataset previously collected in Uto and Okano ( 2021 ). It consists of essays written in English by 1805 students from grades 7 to 10 along with scores from 38 raters for these essays. The essays originally came from the ASAP (Automated Student Assessment Prize) dataset, which is a well-known benchmark dataset for AES studies. The raters were native English speakers recruited from Amazon Mechanical Turk (AMT), a popular crowdsourcing platform. To alleviate the scoring burden, only a few raters were assigned to each essay, rather than having all raters evaluate every essay. Rater assignment was conducted based on a systematic links design  (Shin et al., 2019 ; Uto, 2021a ; Wind & Jones, 2019 ) to achieve IRT-scale linking. Consequently, each rater evaluated approximately 195 essays, and each essay was graded by four raters on average. The raters were asked to grade the essays using a holistic rubric with five rating categories, which is identical to the one used in the original ASAP dataset. The raters were provided no training before the scoring process began. The average Pearson correlation between the scores from AMT raters and the ground-truth scores included in the original ASAP dataset was 0.70 with an SD of 0.09. The minimum and maximum correlations were 0.37 and 0.81, respectively. Furthermore, we also calculated the intraclass correlation coefficient (ICC) between the scores from each AMT rater and the ground-truth scores. The average ICC was 0.60 with an SD of 0.15, and the minimum and maximum ICCs were 0.29 and 0.79, respectively. The calculation of the correlation coefficients and ICC for each AMT rater excluded essays that the AMT rater did not assess. Furthermore, because the ground-truth scores were given as the total scores from two raters, we divided them by two in order to align the score scale with the AMT raters’ scores.

For further analysis, we also evaluated the ICC among the AMT raters as their interrater reliability. In this analysis, missing value imputation was required because all essays were evaluated by a subset of AMT raters. Thus, we first applied multiple imputation with predictive mean matching to the AMT raters’ score dataset. In this process, we generated five imputed datasets. For each imputed dataset, we calculated the ICC among all AMT raters. Finally, we aggregated the ICC values from each imputed dataset to calculate the mean ICC and its SD. The results revealed a mean ICC of 0.43 with an SD of 0.01.

These results suggest that the reliability of raters is not necessarily high. This variability in scoring behavior among raters underscores the importance of applying IRT models with rater parameters. For further details of the dataset see Uto and Okano ( 2021 ).
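The per-rater agreement figures reported above can be computed along the following lines. The helper below is a sketch with invented scores; as described in the text, essays a rater did not assess are excluded from that rater's correlation.

```python
import numpy as np

def per_rater_pearson(rater_scores, ground_truth, missing=-1):
    """Pearson correlation between one AMT rater's scores and the (halved)
    ground-truth scores, skipping essays the rater did not assess."""
    rater_scores = np.asarray(rater_scores, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    mask = rater_scores != missing
    return float(np.corrcoef(rater_scores[mask], ground_truth[mask])[0, 1])

# Example with made-up scores for six essays; -1 marks essays this rater skipped.
print(per_rater_pearson([4, -1, 3, 5, -1, 2], [3.5, 4.0, 3.0, 4.5, 2.5, 2.0]))
```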

Experimental procedures

Using this dataset, we conducted the following experiment for three IRT models with rater parameters, MFRM, MFRM with RSS, and GMFM, defined by Eqs.  5 , 6 , and 7 , respectively.

1. We estimated the IRT parameters from the dataset using the No-U-Turn sampler-based Markov chain Monte Carlo (MCMC) algorithm, given the prior distributions \(\theta _j, \beta _r, d_m, d_{rm} \sim N(0, 1)\) and \(\alpha _r \sim LN(0, 0.5)\), following the previous work (Uto & Ueno, 2020). Here, \(N(\cdot , \cdot )\) and \(LN(\cdot , \cdot )\) indicate normal and log-normal distributions with the given mean and SD values, respectively. The expected a posteriori (EAP) estimates were used as the point estimates (a simplified, grid-based sketch of EAP estimation is given after this list).

We then separated the dataset randomly into two groups, the reference group and the focal group, ensuring no overlap of examinees or raters between them. In this separation, we selected examinees and raters for each group so as to create distinct distributions of examinee abilities and rater severities. The separation patterns we tested are listed in Table 1. For example, Condition 1 in Table 1 means that the reference group comprised randomly selected high-ability examinees and low-severity raters, while the focal group comprised low-ability examinees and high-severity raters. Condition 2 provided a similar separation but with a narrower variance of rater severity in the focal group. Details of the group creation procedures can be found in Appendix A.

Using the obtained data for the reference and focal groups, we conducted test linking using our method, the details of which are given in the Proposed method section. In this step, the IRT parameter estimations were carried out using the same MCMC algorithm as in Step 1.

We calculated the Root Mean Squared Error (RMSE) between the IRT parameters for the focal group, which were linked using our proposed method, and their gold-standard parameters. In this context, the gold-standard parameters were obtained by transforming the scale of the parameters estimated from the entire dataset in Step 1 so that it aligned with that of the reference group. Specifically, we estimated the IRT parameters using data from the reference group and collected those estimated from the entire dataset in Step 1. Then, using the examinees in the reference group as common examinees, we applied linear linking based on the mean and sigma method to adjust the scale of the parameters estimated from the entire dataset to match that of the reference group.

For comparison, we also calculated the RMSE between the focal group's IRT parameters obtained without applying the proposed linking and their gold-standard parameters. This serves as a worst-case baseline against which the results of the proposed method are compared. Additionally, we examined other baselines that use linear linking based on common examinees. For these baselines, we randomly selected five or ten examinees from the reference group who had been assigned scores by at least two of the focal group's raters in the entire dataset. The scores given to these selected examinees by the focal group's raters were then merged with the focal group's data, so that the added examinees served as common examinees between the reference and focal groups. Using these data, we performed linear linking based on common examinees. Specifically, we estimated the IRT parameters from the data of the focal group with common examinees and applied linear linking based on the mean and sigma method, using the ability estimates of the common examinees, to align its scale with that of the reference group. Finally, we calculated the RMSE between the linked parameter estimates for the examinees and raters belonging only to the original focal group and their gold-standard parameters. Note that this common-examinee approach operates under more advantageous conditions than the proposed linking method because it can utilize larger samples for estimating the parameters of raters in the focal group.

We repeated Steps 2–5 ten times for each data separation condition and calculated the average RMSE for four cases: one in which our proposed linking method was applied, one without linking, and two others where linear linkings using five and ten common examinees were applied.
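To make the linking operations above concrete, the following Python sketch illustrates linear linking via the mean and sigma method and its application to the focal group's parameters. It is a sketch only: the parameter transformation rules assume a model kernel of the form \(\alpha _r(\theta _j - \beta _r - d)\), so that abilities and severities are mapped by \(Ax + K\), step parameters are multiplied by \(A\), and consistencies are divided by \(A\); the function and variable names are illustrative rather than taken from our implementation.

```python
# Minimal sketch of linear linking via the mean and sigma method.
import numpy as np

def mean_sigma_coefficients(anchor_on_target_scale, same_values_on_source_scale):
    """Return (A, K) such that A * source + K matches the target scale."""
    A = np.std(anchor_on_target_scale) / np.std(same_values_on_source_scale)
    K = np.mean(anchor_on_target_scale) - A * np.mean(same_values_on_source_scale)
    return A, K

def apply_linking(params, A, K):
    """Transform focal-group parameter estimates onto the reference scale."""
    return {
        "theta": A * params["theta"] + K,   # examinee abilities
        "beta": A * params["beta"] + K,     # rater severities
        "alpha": params["alpha"] / A,       # rater consistencies
        "d": A * params["d"],               # (rater-specific) step parameters
    }

def rmse(linked, gold):
    return float(np.sqrt(np.mean((np.asarray(linked) - np.asarray(gold)) ** 2)))

# Proposed method (illustrative usage): the anchor values are the AES
# predictions of the focal examinees' abilities (already on the reference
# scale), and the source values are the abilities estimated from the focal
# group's own data.
# A, K = mean_sigma_coefficients(aes_predicted_theta_focal, theta_focal_estimates)
# linked_focal = apply_linking(focal_params, A, K)
```

The same mean_sigma_coefficients function also covers the common-examinee baselines, with the common examinees' ability estimates on the reference scale serving as the anchor instead of the AES predictions.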

The parameter estimation program utilized in Steps 1, 4, and 5 was implemented using RStan (Stan Development Team, 2018). The EAP estimates were calculated as the means of the parameter samples obtained from iterations 2,000 to 5,000 of three independent chains. The AES model was developed in Python, leveraging the PyTorch library (Footnote 2). For the AES model training in Step 3, we randomly selected \(90\%\) of the data from the reference group to serve as the training set, with the remaining \(10\%\) designated as the development set. We limited the maximum number of steps for training the AES model to 800 and set the maximum number of epochs to 800 divided by the number of mini-batches. Additionally, we employed early stopping based on the performance on the development set. The AdamW optimization algorithm was used, and the mini-batch size was set to 8.
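As a rough illustration of the AES training setup described above, the following PyTorch sketch trains a BERT-based regression model to predict IRT ability values with AdamW, a mini-batch size of 8, a step limit, and early stopping on the development set. The encoder checkpoint (bert-base-uncased), learning rate, maximum sequence length, and patience value are illustrative assumptions that are not specified in the text.

```python
# Sketch of a BERT-based AES regression model predicting IRT abilities.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModel, AutoTokenizer

class BertRegressor(nn.Module):
    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]      # [CLS] representation
        return self.head(cls).squeeze(-1)      # predicted ability value

def make_loader(texts, abilities, tokenizer, batch_size=8, shuffle=True):
    enc = tokenizer(list(texts), padding=True, truncation=True,
                    max_length=512, return_tensors="pt")
    ds = TensorDataset(enc["input_ids"], enc["attention_mask"],
                       torch.tensor(abilities, dtype=torch.float))
    return DataLoader(ds, batch_size=batch_size, shuffle=shuffle)

def train(train_texts, train_theta, dev_texts, dev_theta,
          max_steps=800, patience=5, lr=2e-5):
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model, loss_fn = BertRegressor(), nn.MSELoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    train_loader = make_loader(train_texts, train_theta, tokenizer)
    dev_loader = make_loader(dev_texts, dev_theta, tokenizer, shuffle=False)
    dev_target = torch.tensor(dev_theta, dtype=torch.float)
    best_rmse, bad_epochs, step = float("inf"), 0, 0
    while step < max_steps and bad_epochs < patience:
        model.train()
        for ids, mask, theta in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(ids, mask), theta)
            loss.backward()
            optimizer.step()
            step += 1
            if step >= max_steps:
                break
        # Early stopping: evaluate RMSE on the development set each epoch.
        model.eval()
        with torch.no_grad():
            preds = torch.cat([model(i, m) for i, m, _ in dev_loader])
        dev_rmse = torch.sqrt(((preds - dev_target) ** 2).mean()).item()
        if dev_rmse < best_rmse:
            best_rmse, bad_epochs = dev_rmse, 0
        else:
            bad_epochs += 1
    return model
```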

MCMC statistics and model fitting

Before delving into the results of the aforementioned experiments, we provide some statistics related to the MCMC-based parameter estimation. Specifically, we computed the Gelman–Rubin statistic \(\hat{R}\)  (Gelman et al., 2013 ; Gelman & Rubin, 1992 ), a well-established diagnostic index for convergence, as well as the effective sample size (ESS) and the number of divergent transitions for each IRT model during the parameter estimation phase in Step 1. Across all models, the \(\hat{R}\) statistics were below 1.1 for all parameters, indicating convergence of the MCMC runs. Furthermore, as shown in the first row of Table  2 , our ESS values for all parameters in all models exceeded the criterion of 400, which is considered sufficiently large according to Zitzmann and Hecht ( 2019 ). We also observed no divergent transitions in any of the cases. These results support the validity of the MCMC-based parameter estimation.
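A minimal sketch of how such diagnostics can be collected is shown below, assuming the MCMC draws have been loaded into an ArviZ InferenceData object (for example, converted from a Stan fit); it also shows how EAP point estimates are obtained as posterior means. The thresholds mirror those used above (\(\hat{R} < 1.1\), ESS > 400), and the parameter names in the second function are placeholders.

```python
# Convergence checks and EAP extraction for an ArviZ InferenceData object.
import arviz as az

def check_convergence(idata, rhat_threshold=1.1, ess_threshold=400):
    rhat = az.rhat(idata)   # Gelman-Rubin statistic per parameter
    ess = az.ess(idata)     # effective sample size per parameter
    max_rhat = max(float(rhat[v].max()) for v in rhat.data_vars)
    min_ess = min(float(ess[v].min()) for v in ess.data_vars)
    n_divergent = int(idata.sample_stats["diverging"].values.sum())
    return {
        "max_rhat": max_rhat,
        "rhat_ok": max_rhat < rhat_threshold,
        "min_ess": min_ess,
        "ess_ok": min_ess > ess_threshold,
        "n_divergent": n_divergent,
    }

def eap_estimates(idata, var_names=("theta", "beta", "alpha", "d")):
    # EAP point estimates are simply posterior means of the retained draws.
    return {v: idata.posterior[v].mean(dim=("chain", "draw")).values
            for v in var_names}
```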

Furthermore, we evaluated the model–data fit for each IRT model during the parameter estimation step in Step 1. To assess this fit, we employed the posterior predictive p value (PPP-value) (Gelman et al., 2013), a commonly used metric for evaluating model–data fit in Bayesian frameworks (Nering & Ostini, 2010; van der Linden, 2016). Specifically, we calculated the PPP-value using an averaged standardized residual, a conventional metric for IRT model fit in non-Bayesian settings, as the discrepancy function, similar to the approach in Nering and Ostini (2010), Tran (2020), and Uto and Okano (2021). A well-fitted model yields a PPP-value close to 0.5, whereas poorly fitted models exhibit extremely low or high values, such as those below 0.05 or above 0.95. Additionally, we calculated two information criteria, the widely applicable information criterion (WAIC) (Watanabe, 2010) and the widely applicable Bayesian information criterion (WBIC) (Watanabe, 2013). The model that minimizes these criteria is considered optimal.
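The PPP-value computation can be sketched as follows. The discrepancy here is an averaged absolute standardized residual, which is one common variant and may differ in detail from the exact discrepancy used in the cited references; category_probs is a hypothetical helper that returns the model's category probabilities for a given examinee, rater, and posterior draw.

```python
# Sketch of a posterior predictive p-value with an averaged standardized
# residual as the discrepancy function.
import numpy as np

def ppp_value(observed, posterior_draws, category_probs, categories, seed=0):
    """observed: iterable of (examinee, rater, score) triples.
    posterior_draws: iterable of posterior parameter draws.
    category_probs(draw, j, r): hypothetical helper returning the K category
    probabilities for examinee j and rater r under one posterior draw."""
    rng = np.random.default_rng(seed)
    k = np.asarray(categories, dtype=float)
    exceed, n_draws = 0, 0
    for draw in posterior_draws:
        d_obs, d_rep = [], []
        for j, r, u in observed:
            p = np.asarray(category_probs(draw, j, r))
            mean = float(p @ k)
            sd = float(np.sqrt(p @ (k - mean) ** 2))
            u_rep = rng.choice(k, p=p)          # replicated score under the model
            d_obs.append(abs(u - mean) / sd)
            d_rep.append(abs(u_rep - mean) / sd)
        exceed += np.mean(d_rep) >= np.mean(d_obs)
        n_draws += 1
    return exceed / n_draws                     # values near 0.5 indicate good fit
```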

The last three rows of Table 2 show the results. We can see that the PPP-value for GMFM is close to 0.5, indicating a good fit to the data. In contrast, the other models exhibit high values, suggesting a poor fit. Furthermore, among the three IRT models evaluated, GMFM exhibits the lowest WAIC and WBIC values. These findings suggest that GMFM offers the best fit to the data, corroborating previous work that investigated the same dataset using IRT models (Uto & Okano, 2021). We provide further discussion of the model fit in the Analysis of rater characteristics section given later.

Given these results, the following sections focus on the results for GMFM. Note that we also include the results for MFRM and MFRM with RSS in Appendix B; see also the Open Practices Statement.

Effectiveness of our proposed linking method

The results of the aforementioned experiments for GMFM are shown in Table  3 . In the table, the Unlinked row represents the average RMSE between the focal group’s IRT parameters without applying our linking method and their gold-standard parameters. Similarly, the Linked by proposed method row represents the average RMSE between the focal group’s IRT parameters after applying our linking method and their gold-standard parameters. The rows labeled Linked by five/ten common examinees represent the results for linear linking using common examinees.

A comparison of the results from the unlinked condition and the proposed method reveals that the proposed method improved the RMSEs for the ability and rater severity parameters, namely, \(\theta _j\) and \(\beta _r\) , which we intentionally varied between the reference and focal groups. The degree of improvement is notably substantial when the distributional differences between the reference and focal groups are large, as is the case in Conditions 1–5. On the other hand, for Conditions 6–8, where the distributional differences are relatively minor, the improvements are also smaller in comparison. This is because the RMSEs for the unlinked parameters are already lower in these conditions than in Conditions 1–5. Nonetheless, it is worth emphasizing that the RMSEs after employing our linking method are exceptionally low in Conditions 6–8.

Furthermore, the table indicates that the RMSEs for the step parameters and rater consistency parameters, namely, \(d_{rm}\) and \(\alpha _r\) , also improved in many cases, while the impact of applying our linking method is relatively small for these parameters compared with the ability and rater severity parameters. This is because we did not intentionally vary their distribution between the reference and focal groups, and thus their distribution differences were smaller than those for the ability and rater severity parameters, as shown in the next section.

Comparing the results from the proposed method and linear linking using five common examinees, we observe that the proposed method generally exhibits lower RMSE values for the ability \(\theta _j\) and the rater severity parameters \(\beta _r\), except for Conditions 2–3. Furthermore, when comparing the proposed method with linear linking using ten common examinees, it achieves superior performance in Conditions 4–8 and slightly lower performance in Conditions 1–3 for \(\theta _j\) and \(\beta _r\), although the differences are smaller overall than those observed in the comparison with five common examinees. The reasons why the proposed method tends to show lower performance for Conditions 1–3 are as follows.

1. The proposed method utilizes fewer samples to estimate the rater parameters compared with the linear linking method using common examinees.

2. In situations where distributional differences between the reference and focal groups are relatively large, as in Conditions 1–3, constructing an accurate AES model for the focal group becomes challenging due to the limited overlap in the ability value range. We elaborate on this point in the next section.

Furthermore, in terms of the rater consistency parameter \(\alpha _r\) and the step parameter \(d_{rm}\) , the proposed method typically shows lower RMSE values compared with linear linking using common examinees. We attribute this to the fact that the performance of the linking method using common examinees is highly dependent on the choice of common examinees, which can sometimes result in significant errors in these parameters. This issue is also further discussed in the next section.

These results suggest that our method can perform linking with comparable accuracy to linear linking using few common examinees, even in the absence of common examinees and raters. Additionally, as reported in Tables  15 and 16 in Appendix  B , both MFRM and MFRM with RSS also exhibit a similar tendency, further validating the effectiveness of our approach regardless of the IRT models employed.

Detailed analysis

Analysis of parameter scale transformation using the proposed method

In this section, we detail how our method transforms the parameter scale. To demonstrate this, we first summarize the mean and SD values of the gold-standard parameters for both the reference and focal groups in Table  4 . The values in the table are averages calculated from ten repetitions of the experimental procedures. The table shows that the mean and SD values of both examinee ability and rater severity vary significantly between the reference and focal groups following our intended settings, as outlined in Table  1 . Additionally, the mean and SD values for the rater consistency parameter \(\alpha _r\) and the rater-specific step parameters \(d_{rm}\) also differ slightly between the groups, although we did not intentionally alter them.

Second, the averaged values of the means and SDs of the parameters, estimated solely from either the reference or the focal group’s data over ten repetitions, are presented in Table  5 . The table reveals that the estimated parameters for both groups align with a normal distribution centered at nearly zero, despite the actual ability distributions differing between the groups. This phenomenon arises because IRT permits arbitrary scaling of parameters for each independent dataset, as mentioned in the Linking section. This leads to differences in the parameter scale for the focal group compared with their gold-standard values, thereby highlighting the need for parameter linking.

Next, the first two rows of Table  6 display the mean and SD values of the ability estimates for the focal group’s examinees, as predicted by the BERT-based AES model. In the table, the RMSE row indicates the RMSE between the AES-predicted ability values and the gold-standard ability values for the focal groups. The Linking Coefficients row presents the linking coefficients calculated based on the AES-predicted abilities. As with the abovementioned tables, these values are also averages over ten experimental repetitions. According to the table, for Conditions 6–8, where the distributional differences between the groups are relatively minor, both the mean and SD estimates align closely with those of the gold-standard parameters. In contrast, for Conditions 1–5, where the distributional differences are more pronounced, the mean and SD estimates tend to deviate from the gold-standard values, highlighting the challenges of parameter linking under such conditions.

In addition, as indicated in the RMSE row, the AES-predicted abilities may lack accuracy under specific conditions, such as Conditions 1, 2, and 3. This inaccuracy could arise because the AES model, trained on the reference group’s data, could not cover the ability range of the focal group due to significant differences in the ability distribution between the groups. Note that even in cases where the mean and SD estimates are relatively inaccurate, these values are closer to the gold-standard ones than those estimated solely from the focal group’s data. This leads to meaningful linking coefficients, which transform the focal group’s parameters toward the scale of their gold-standard values.

Finally, Table 7 displays the averaged values of the means and SDs of the focal group's parameters obtained through our linking method over ten repetitions. Note that the mean and SD values of the ability estimates are the same as those reported in Table 6 because the proposed method is designed to align them. The table indicates that the differences in the mean and SD values between the proposed method and the gold-standard condition, shown in Table 4, tend to be smaller than those between the unlinked condition, shown in Table 5, and the gold standard. To verify this point more precisely, Table 8 shows the average absolute differences in the mean and SD values of the parameters for the focal groups between the proposed method and the gold-standard condition, as well as those between the unlinked condition and the gold standard. These values were calculated by averaging the absolute differences obtained from each of the ten repetitions, unlike the simple absolute differences in the values reported in Tables 4 and 7. The table shows that the proposed linking method tends to yield smaller differences, especially for \(\theta _j\) and \(\beta _r\), than the unlinked condition. Furthermore, this tendency is prominent for Conditions 6–8, in which the distributional differences between the focal and reference groups are relatively small. These trends are consistent with the cases for which our method showed high linking performance, as detailed in the previous section.

In summary, the above analyses suggest that although the AES model’s predictions may not always be perfectly accurate, they can offer valuable insights into scale differences between the reference and focal groups, thereby facilitating successful IRT parameter linking without common examinees and raters.

We now present the distributions of examinee ability and rater severity for the focal group, comparing their gold-standard values with those before and after the application of the linking method. Figures 5, 6, 7, 8, 9, 10, 11, and 12 are illustrative examples for the eight data-splitting conditions. The gray bars depict the distributions of the gold-standard parameters, the blue bars represent those of the parameters estimated from the focal group's data, the red bars signify those of the parameters obtained using our linking method, and the green bars indicate the ability distribution as predicted by the BERT-based AES. The upper part of each figure presents results for examinee ability \(\theta _j\) and the lower part presents those for rater severity \(\beta _r\).

The blue bars in these figures reveal that the parameters estimated from the focal group’s data exhibit distributions with different locations and/or scales compared with their gold-standard values. Meanwhile, the red bars reveal that the distributions of the parameters obtained through our linking method tend to align closely with those of the gold-standard parameters. This is attributed to the fact that the ability distributions for the focal group given by the BERT-based AES model, as depicted by the green bars, were informative for performing linear linking.

Analysis of the linking method based on common examinees

For a detailed analysis of the linking method based on common examinees, Table  9 reports the averaged values of means and SDs of the focal groups’ parameter estimates obtained by the linking method based on five and ten common examinees for each condition. Furthermore, Table  10 shows the average absolute differences between these values and those from the gold standard condition. Table  10 shows that an increase in the number of common examinees tends to lower the average absolute differences, which is a reasonable trend. Furthermore, comparing the results with those of the proposed method reported in Table  8 , the proposed method tends to achieve smaller absolute differences in conditions 4–8 for \(\theta _j\) and \(\beta _r\) , which is consistent with the tendency of the linking performance discussed in the “Effectiveness of our proposed linking method” section.

Note that although the mean and SD values in Table  9 are close to those of the gold-standard parameters shown in Table  4 , this does not imply that linear linking based on five or ten common examinees achieves high linking accuracy for each repetition. To explain this, Table  11 shows the means of the gold-standard ability values for the focal group and their estimates obtained from the proposed method and the linking method based on ten common examinees, for each of ten repetitions under condition 8. This table also shows the absolute differences between the estimated ability means and the corresponding gold-standard means.

Fig. 5 Example of ability and rater severity distributions for the focal group under data-splitting condition 1

Fig. 6 Example of ability and rater severity distributions for the focal group under data-splitting condition 2

Fig. 7 Example of ability and rater severity distributions for the focal group under data-splitting condition 3

Fig. 8 Example of ability and rater severity distributions for the focal group under data-splitting condition 4

Fig. 9 Example of ability and rater severity distributions for the focal group under data-splitting condition 5

Fig. 10 Example of ability and rater severity distributions for the focal group under data-splitting condition 6

Fig. 11 Example of ability and rater severity distributions for the focal group under data-splitting condition 7

Fig. 12 Example of ability and rater severity distributions for the focal group under data-splitting condition 8

The table shows that the results of the proposed method are relatively stable, consistently revealing low absolute differences for every repetition. In contrast, the results of linear linking based on ten common examinees vary significantly across repetitions, resulting in large absolute differences for some repetitions. These results yield a smaller average absolute difference for the proposed method compared with linear linking based on ten common examinees. However, in terms of the absolute difference in the averaged ability means, linear linking based on ten common examinees shows a smaller difference ( \(|0.38-0.33| = 0.05\) ) compared with the proposed method ( \(|0.38-0.46| = 0.08\) ). This occurs because the results of linear linking based on ten common examinees for ten repetitions fluctuate around the ten-repetition average of the gold standard, thereby canceling out the positive and negative differences. However, this does not imply that linear linking based on ten common examinees achieves high linking accuracy for each repetition. Thus, it is reasonable to interpret the average of the absolute differences calculated for each of the ten repetitions, as reported in Tables  8 and  10 .

This greater variability in performance of the linking method based on common examinees also relates to the tendency of the proposed method to show lower RMSE values for the rater consistency parameter \(\alpha _r\) and the step parameters \(d_{rm}\) compared with linking based on common examinees, as mentioned in the Effectiveness of our proposed linking method section. In that section, we mentioned that this is due to the fact that linear linking based on common examinees is highly dependent on the selection of common examinees, which can sometimes lead to significant errors in these parameters.

To confirm this point, Table  12 displays the SD of RMSEs calculated from ten repetitions of the experimental procedures for both the proposed method and linear linking using ten common examinees. The table indicates that the linking method using common examinees tends to exhibit larger SD values overall, suggesting that this linking method sometimes becomes inaccurate, as we also exemplified in Table  11 . This variability also implies that the estimation of the linking coefficient can be unstable.

Furthermore, the tendency toward larger SD values in the common-examinee approach is particularly pronounced for the step parameters at the extreme categories, namely, \(d_{r2}\) and \(d_{r5}\). We consider that this stems from the instability of the linking coefficients and from the fact that the step parameters for the extreme categories tend to have large absolute values (see Table 13 for detailed estimates). Linear linking multiplies the step parameters by the linking coefficient A, and applying an inappropriate linking coefficient to parameters with large absolute values has a more substantial impact than applying it to smaller values. We conclude that this is why the RMSEs of the step parameters in the common-examinee approach deteriorated compared with those in the proposed method. The same reasoning applies to the rater consistency parameter, given that it is distributed over positive values with a mean greater than one. See Table 13 for details.

Prerequisites of the proposed method

As demonstrated thus far, the proposed method can perform IRT parameter linking without the need for common examinees and raters. As outlined in the Introduction section, certain testing scenarios may encounter challenges or incur significant costs in assembling common examinees or raters. Our method provides a viable solution in these situations. However, it does come with specific prerequisites and inherent costs.

The prerequisites of our proposed method are as follows.

1. The same essay-writing task is offered to both the reference and focal groups, and the essays written for it are scored by different groups of raters using the same rubric.

2. Raters function consistently across the reference and focal groups, so that the established scales can be adjusted through linear transformations. This implies that there are no systematic differences in scoring that are correlated with the groups but unrelated to the measured construct, such as differential rater functioning (Leckie & Baird, 2011; Myford & Wolfe, 2009; Uto, 2023; Wind & Guo, 2019).

3. The ability ranges of the reference and focal groups have some overlap, because the ability prediction accuracy of the AES decreases as the difference in the ability distributions between the groups increases, as discussed in the Detailed analysis section. This is a limitation of the approach that future studies will need to overcome.

4. The reference group consists of a sufficient number of examinees for training AES models using their essays as training data.

Related to the fourth point, we conducted an additional experiment to investigate the number of samples required to train AES models. In this experiment, we assessed the ability prediction accuracy of the BERT-based AES model used in this study while varying the number of training samples. The detailed experimental procedure is outlined below; a schematic code sketch follows the list.

1. Estimate the ability of all 1805 examinees from the entire dataset based on the GMFM.

2. Randomly split the examinees into 80% (1444) and 20% (361) groups. The 20% subset, consisting of examinees' essays and their ability estimates, was used as test data to evaluate the ability prediction accuracy of the AES model trained through the following steps.

3. The 80% subset was further divided into 80% (1155) and 20% (289) groups. Here, the essays and ability estimates of the 80% subset were used as the training data, while those of the 20% served as development data for selecting the optimal epoch.

4. Train the BERT-based AES model using the training data and select the optimal epoch that minimizes the RMSE between the predicted and gold-standard ability values for the development set.

5. Use the trained AES model at the optimal epoch to evaluate the RMSE between the predicted and gold-standard ability values for the test data.

6. Randomly sample 50, 100, 200, 300, 500, 750, and 1000 examinees from the training data created in Step 3.

7. Train the AES model using each sampled set as training data, and select the optimal epoch using the same development data as before.

8. Use the trained AES model to evaluate the RMSE for the same test data as before.

9. Repeat Steps 2–8 five times and calculate the average RMSE for the test data.
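The following Python sketch mirrors this procedure at a high level. The helpers train_aes_model and predict_abilities are hypothetical stand-ins for the BERT-based training and prediction steps sketched earlier; essays is assumed to be a list of essay texts and thetas a NumPy array of GMFM-based ability estimates.

```python
# Schematic sketch of the sample-size experiment (Steps 2-9 above).
import numpy as np

def sample_size_experiment(essays, thetas, train_aes_model, predict_abilities,
                           sizes=(50, 100, 200, 300, 500, 750, 1000),
                           n_repetitions=5, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    results = {n: [] for n in sizes}
    n_total = len(essays)
    for _ in range(n_repetitions):
        idx = rng.permutation(n_total)
        test, rest = idx[: int(0.2 * n_total)], idx[int(0.2 * n_total):]
        dev, train = rest[: int(0.2 * len(rest))], rest[int(0.2 * len(rest)):]
        for n in sizes:
            sub = rng.choice(train, size=n, replace=False)
            model = train_aes_model([essays[i] for i in sub], thetas[sub],
                                    [essays[i] for i in dev], thetas[dev])
            preds = predict_abilities(model, [essays[i] for i in test])
            results[n].append(float(np.sqrt(np.mean((preds - thetas[test]) ** 2))))
    # Average RMSE and SD over repetitions for each training-set size.
    return {n: (float(np.mean(v)), float(np.std(v, ddof=1))) for n, v in results.items()}
```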

Fig. 13 Relationship between the number of training samples and the ability prediction accuracy of AES

Fig. 14 Item response curves of four representative raters found in experiments using actual data

Figure 13 displays the results. The horizontal axis represents the number of training samples, and the vertical axis shows the RMSE values. Each plot illustrates the average RMSE, with error bars indicating the SD ranges. The results demonstrate that larger sample sizes enhance the accuracy of the AES model. Furthermore, while the RMSE decreases sharply as the sample size increases in the small-sample range, the improvements tend to plateau beyond 500 samples. This suggests that, for this dataset, approximately 500 samples would be sufficient to train the AES model with reasonable accuracy. However, note that the required number of samples may vary depending on the essay task. A detailed analysis of the relationship between the required number of samples and the characteristics of essay-writing tasks is planned for future work.

An inherent cost associated with the proposed method is the computational expense required to construct the BERT-based AES model. Specifically, a computer with a reasonably powerful GPU is necessary to train the AES model efficiently. In this study, for example, we utilized an NVIDIA Tesla T4 GPU on Google Colaboratory. To elaborate on the computational expense, we calculated the computation times and costs for the above experiment under the condition where 1155 training samples were used. Under this condition, training the AES model with 1155 samples, including evaluating the RMSE for the development set of 289 essays in each epoch, took approximately 10 min in total. Moreover, it took about 10 s to predict the abilities of 361 examinees from their essays using the trained model. The computational units consumed on Google Colaboratory for both training and inference amounted to 0.44, which corresponds to approximately $0.044. These costs and times are far smaller than those required for human scoring.

Analysis of rater characteristics

The MCMC statistics and model fitting section demonstrated that the GMFM provides a better fit to the actual data compared with the MFRM and MFRM with RSS. To explain this, Table  13 shows the rater parameters estimated by the GMFM using the entire dataset. Additionally, Fig.  14 illustrates the item response curves (IRCs) for raters 3, 16, 31, and 34, where the horizontal axis represents the ability \(\theta _j\) , and the vertical axis depicts the response probability for each category.

The table and figure reveal that the raters exhibit diverse and distinctive characteristics in terms of severity, consistency, and range restriction. For instance, Rater 3 demonstrates nearly average values for all parameters, indicating standard rating characteristics. In contrast, Rater 16 exhibits a pronounced extreme response tendency, as evidenced by a higher \(d_{r2}\) and a lower \(d_{r5}\) value. Additionally, Rater 31 is characterized by low severity, generally preferring the higher scores (four and five). Rater 34 exhibits a low consistency value \(\alpha _r\), which results in minimal variation in response probabilities among categories. This indicates that the rater is likely to assign different ratings to essays of similar quality.
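For readers who want to visualize such rater characteristics themselves, the sketch below draws category response curves like those in Fig. 14. It assumes that Eq. 7 (the GMFM) takes a generalized partial-credit form whose kernel is \(\alpha _r(\theta _j - \beta _r - d_m - d_{rm})\) with the first step term fixed at zero; the exact specification should be taken from Eq. 7, and the parameter values in the usage comment are made up rather than taken from Table 13.

```python
# Illustrative sketch of item response curves under a GPCM-style rater model.
import numpy as np
import matplotlib.pyplot as plt

def gmfm_category_probs(theta, alpha_r, beta_r, d_m, d_rm):
    """Return P(k | theta) for k = 1..K, with d_m[0] = d_rm[0] = 0 by convention."""
    steps = alpha_r * (theta - beta_r - np.asarray(d_m) - np.asarray(d_rm))
    numerators = np.exp(np.cumsum(steps))
    return numerators / numerators.sum()

def plot_irc(alpha_r, beta_r, d_m, d_rm, label=""):
    thetas = np.linspace(-4, 4, 200)
    probs = np.array([gmfm_category_probs(t, alpha_r, beta_r, d_m, d_rm)
                      for t in thetas])
    for k in range(probs.shape[1]):
        plt.plot(thetas, probs[:, k], label=f"category {k + 1}")
    plt.xlabel(r"$\theta$")
    plt.ylabel("response probability")
    plt.title(label)
    plt.legend()
    plt.show()

# Example with made-up parameter values (not taken from Table 13):
# plot_irc(alpha_r=1.0, beta_r=0.0, d_m=[0, -1.5, -0.5, 0.5, 1.5],
#          d_rm=[0, 0, 0, 0, 0], label="illustrative rater")
```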

As detailed in the Item Response Theory section, the GMFM can capture these variations in rater severity, consistency, and range restriction simultaneously, whereas the MFRM and MFRM with RSS can each consider only a subset of these characteristics. We can infer that this capability, along with the large variety of rater characteristics, contributed to the superior model fit of the GMFM compared with the other models.

It is important to note that the proposed method is useful for facilitating linking not only under the GMFM but also under the MFRM and MFRM with RSS, even though their model fits were relatively worse, as mentioned earlier and shown in Appendix B.

Effect of using crowd workers as raters

As detailed in the Actual data section, we used scores given by untrained, non-expert crowd workers instead of expert raters. A concern with using crowd workers as raters without adequate training is the potential for greater variability in rating characteristics compared with expert raters. This variability is evidenced by the diverse correlations between the raters' scores and the ground-truth scores reported in the Actual data section, as well as by the large variety of rater parameters discussed above. These observations suggest the importance of the following two strategies for ensuring reliable essay scoring when employing crowd workers as raters.

1. Assigning a larger number of raters to each essay than would typically be used with expert raters.

2. Estimating standardized essay scores while accounting for differences in rater characteristics, for example through the use of IRT models that incorporate rater parameters, as was done in this study.

Conclusion

In this study, we propose a novel IRT-based linking method for essay-writing tests that uses AES technology to enable parameter linking based on IRT models with rater parameters across multiple groups in which neither examinees nor raters are shared. Specifically, we use a deep neural AES method capable of predicting IRT-based examinee abilities from their essays. The core concept of our approach involves developing an AES model to predict examinee abilities using data from a reference group. This AES model is then applied to predict the abilities of examinees in the focal group, and these predictions are used to estimate the linking coefficients required for linear linking. Experimental results with real data demonstrate that our method successfully accomplishes test linking with accuracy comparable to that of linear linking using a few common examinees.

In our experiments, we compared the linking performance of the proposed method with linear linking based on the mean and sigma method using only five or ten common examinees. However, such a small number of common examinees is generally insufficient for accurate linear linking and thus leads to unstable estimation of linking coefficients, as discussed in the “Analysis of the linking method based on common examinees” section. Although this study concluded that our method could perform linking with accuracy comparable to that of linear linking using few common examinees, further detailed evaluations of our method involving comparisons with various conventional linking methods using different numbers of common examinees and raters will be the target of future work.

Additionally, our experimental results suggest that although the AES model may not provide sufficient predictive accuracy for individual examinee abilities, it does tend to yield reasonable mean and SD values for the ability distribution of focal groups. This lends credence to our assumption stated in the Proposed method section that AES models incorporating IRT can offer valuable insights into differences in ability distribution across various groups, thereby validating their utility for test linking. This result also supports the use of the mean and sigma method for linking. While concurrent calibration, another common linking method, requires highly accurate individual AES-predicted abilities to serve as anchor values, linear linking through the mean and sigma method necessitates only the mean and SD of the ability distribution. Given that the AES model can provide accurate estimates for these statistics, successful linking can be achieved, as shown in our experiments.

A limitation of this study is that our method is designed for test situations where a single essay writing item is administered to multiple groups, each comprising different examinees and raters. Consequently, the method is not directly applicable for linking multiple tests that offer different items. Developing an extension of our approach to accommodate such test situations is one direction for future research. Another involves evaluating the effectiveness of our method using other datasets. To the best of our knowledge, there are no open datasets that include examinee essays along with scores from multiple assigned raters. Therefore, we plan to develop additional datasets and to conduct further evaluations. Further investigation of the impact of the AES model’s accuracy on linking performance is also warranted.

Availability of data and materials

The data and materials from our experiments are available at https://github.com/AI-Behaviormetrics/LinkingIRTbyAES.git . This includes all experimental results and a sample dataset.

Code availability

The source code for our linking method, developed in R and Python, is available in the same GitHub repository.

Footnotes

1. The original paper referred to this model as the generalized MFRM. However, in this paper, we refer to it as GMFM because it does not strictly belong to the family of Rasch models.

2. https://pytorch.org/

Abosalem, Y. (2016). Assessment techniques and students’ higher-order thinking skills. International Journal of Secondary Education, 4 (1), 1–11. https://doi.org/10.11648/j.ijsedu.20160401.11

Alikaniotis, D., Yannakoudakis, H., & Rei, M. (2016). Automatic text scoring using neural networks. Proceedings of the annual meeting of the association for computational linguistics (pp. 715–725).

Almond, R. G. (2014). Using automated essay scores as an anchor when equating constructed response writing tests. International Journal of Testing, 14 (1), 73–91. https://doi.org/10.1080/15305058.2013.816309

Amorim, E., Cançado, M., & Veloso, A. (2018). Automated essay scoring in the presence of biased ratings. Proceedings of the annual conference of the north american chapter of the association for computational linguistics (pp. 229–237).

Bernardin, H. J., Thomason, S., Buckley, M. R., & Kane, J. S. (2016). Rater rating-level bias and accuracy in performance appraisals: The impact of rater personality, performance management competence, and rater accountability. Human Resource Management, 55 (2), 321–340. https://doi.org/10.1002/hrm.21678

Dascalu, M., Westera, W., Ruseti, S., Trausan-Matu, S., & Kurvers, H. (2017). ReaderBench learns Dutch: Building a comprehensive automated essay scoring system for Dutch language. Proceedings of the international conference on artificial intelligence in education (pp. 52–63).

Dasgupta, T., Naskar, A., Dey, L., & Saha, R. (2018). Augmenting textual qualitative features in deep convolution recurrent neural network for automatic essay scoring. Proceedings of the workshop on natural language processing techniques for educational applications (pp. 93–102).

Devlin, J., Chang, M. -W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the annual conference of the north American chapter of the association for computational linguistics: Human language technologies (pp. 4171–4186).

Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2 (3), 197–221. https://doi.org/10.1207/s15434311laq0203_2

Eckes, T. (2023). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments . Peter Lang Pub. Inc.

Engelhard, G. (1997). Constructing rater and task banks for performance assessments. Journal of Outcome Measurement, 1 (1), 19–33.

Farag, Y., Yannakoudakis, H., & Briscoe, T. (2018). Neural automated essay scoring and coherence modeling for adversarially crafted input. Proceedings of the annual conference of the north American chapter of the association for computational linguistics (pp. 263–271).

Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., & Rubin, D. (2013). Bayesian data analysis (3rd ed.). Taylor & Francis.

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7 (4), 457–472. https://doi.org/10.1214/ss/1177011136

Huang, J., Qu, L., Jia, R., & Zhao, B. (2019). O2U-Net: A simple noisy label detection approach for deep neural networks. Proceedings of the IEEE international conference on computer vision .

Hussein, M. A., Hassan, H. A., & Nassef, M. (2019). Automated language essay scoring systems: A literature review. PeerJ Computer Science, 5 , e208. https://doi.org/10.7717/peerj-cs.208

Ilhan, M. (2016). A comparison of the results of many-facet Rasch analyses based on crossed and judge pair designs. Educational Sciences: Theory and Practice, 16 (2), 579–601. https://doi.org/10.12738/estp.2016.2.0390

Jin, C., He, B., Hui, K., & Sun, L. (2018). TDNN: A two-stage deep neural network for prompt-independent automated essay scoring. Proceedings of the annual meeting of the association for computational linguistics (pp. 1088–1097).

Jin, K. Y., & Wang, W. C. (2018). A new facets model for rater’s centrality/extremity response style. Journal of Educational Measurement, 55 (4), 543–563. https://doi.org/10.1111/jedm.12191

Kassim, N. L. A. (2011). Judging behaviour and rater errors: An application of the many-facet Rasch model. GEMA Online Journal of Language Studies, 11 (3), 179–197.

Ke, Z., & Ng, V. (2019). Automated essay scoring: A survey of the state of the art. Proceedings of the international joint conference on artificial intelligence (pp. 6300–6308).

Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking . New York: Springer.

Leckie, G., & Baird, J. A. (2011). Rater effects on essay scoring: A multilevel analysis of severity drift, central tendency, and rater experience. Journal of Educational Measurement, 48 (4), 399–418. https://doi.org/10.1111/j.1745-3984.2011.00152.x

Li, S., Ge, S., Hua, Y., Zhang, C., Wen, H., Liu, T., & Wang, W. (2020). Coupled-view deep classifier learning from multiple noisy annotators. Proceedings of the association for the advancement of artificial intelligence (vol. 34, pp. 4667–4674).

Linacre, J. M. (1989). Many-faceted Rasch measurement . MESA Press.

Linacre, J. M. (2014). A user’s guide to FACETS Rasch-model computer programs .

Liu, O. L., Frankel, L., & Roohr, K. C. (2014). Assessing critical thinking in higher education: Current state and directions for next-generation assessment. ETS Research Report Series, 2014 (1), 1–23. https://doi.org/10.1002/ets2.12009

Liu, T., Ding, W., Wang, Z., Tang, J., Huang, G. Y., & Liu, Z. (2019). Automatic short answer grading via multiway attention networks. Proceedings of the international conference on artificial intelligence in education (pp. 169–173).

Lord, F. (1980). Applications of item response theory to practical testing problems . Routledge.

Lun, J., Zhu, J., Tang, Y., & Yang, M. (2020). Multiple data augmentation strategies for improving performance on automatic short answer scoring. Proceedings of the association for the advancement of artificial intelligence (vol. 34, pp. 13389–13396).

Marco, G. L. (1977). Item characteristic curve solutions to three intractable testing problems. Journal of Educational Measurement, 14 (2), 139–160.

Mesgar, M., & Strube, M. (2018). A neural local coherence model for text quality assessment. Proceedings of the conference on empirical methods in natural language processing (pp. 4328–4339).

Mim, F. S., Inoue, N., Reisert, P., Ouchi, H., & Inui, K. (2019). Unsupervised learning of discourse-aware text representation for essay scoring. Proceedings of the annual meeting of the association for computational linguistics: Student research workshop (pp. 378–385).

Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4 (4), 386–422.

Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5 (2), 189–227.

Myford, C. M., & Wolfe, E. W. (2009). Monitoring rater performance over time: A framework for detecting differential accuracy and differential scale category use. Journal of Educational Measurement, 46 (4), 371–389. https://doi.org/10.1111/j.1745-3984.2009.00088.x

Nadeem, F., Nguyen, H., Liu, Y., & Ostendorf, M. (2019). Automated essay scoring with discourse-aware neural models. Proceedings of the workshop on innovative use of NLP for building educational applications (pp. 484–493).

Nering, M. L., & Ostini, R. (2010). Handbook of polytomous item response theory models . Evanston, IL, USA: Routledge.

Nguyen, H. V., & Litman, D. J. (2018). Argument mining for improving the automated scoring of persuasive essays. Proceedings of the association for the advancement of artificial intelligence (Vol. 32).

Olgar, S. (2015). The integration of automated essay scoring systems into the equating process for mixed-format tests [Doctoral dissertation, The Florida State University].

Patz, R. J., & Junker, B. (1999). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24 (4), 342–366. https://doi.org/10.3102/10769986024004342

Patz, R. J., Junker, B. W., Johnson, M. S., & Mariano, L. T. (2002). The hierarchical rater model for rated test items and its application to large-scale educational assessment data. Journal of Educational and Behavioral Statistics, 27 (4), 341–384. https://doi.org/10.3102/10769986027004341

Qiu, X. L., Chiu, M. M., Wang, W. C., & Chen, P. H. (2022). A new item response theory model for rater centrality using a hierarchical rater model approach. Behavior Research Methods, 54 , 1854–1868. https://doi.org/10.3758/s13428-021-01699-y

Ridley, R., He, L., Dai, X. Y., Huang, S., & Chen, J. (2021). Automated cross-prompt scoring of essay traits. Proceedings of the association for the advancement of artificial intelligence (vol. 35, pp. 13745–13753).

Rodriguez, P. U., Jafari, A., & Ormerod, C. M. (2019). Language models and automated essay scoring. https://doi.org/10.48550/arXiv.1909.09482 . arXiv:1909.09482

Rosen, Y., & Tager, M. (2014). Making student thinking visible through a concept map in computer-based assessment of critical thinking. Journal of Educational Computing Research, 50 (2), 249–270. https://doi.org/10.2190/EC.50.2.f

Schendel, R., & Tolmie, A. (2017). Beyond translation: adapting a performance-task-based assessment of critical thinking ability for use in Rwanda. Assessment & Evaluation in Higher Education, 42 (5), 673–689. https://doi.org/10.1080/02602938.2016.1177484

Shermis, M. D., & Burstein, J. C. (2002). Automated essay scoring: A cross-disciplinary perspective . Routledge.

Shin, H. J., Rabe-Hesketh, S., & Wilson, M. (2019). Trifactor models for Multiple-Ratings data. Multivariate Behavioral Research, 54 (3), 360–381. https://doi.org/10.1080/00273171.2018.1530091

Stan Development Team. (2018). RStan: the R interface to stan . R package version 2.17.3.

Sung, C., Dhamecha, T. I., & Mukhi, N. (2019). Improving short answer grading using transformer-based pre-training. Proceedings of the international conference on artificial intelligence in education (pp. 469–481).

Taghipour, K., & Ng, H. T. (2016). A neural approach to automated essay scoring. Proceedings of the conference on empirical methods in natural language processing (pp. 1882–1891).

Tran, T. D. (2020). Bayesian analysis of multivariate longitudinal data using latent structures with applications to medical data. (Doctoral dissertation, KU Leuven).

Uto, M. (2021a). Accuracy of performance-test linking based on a many-facet Rasch model. Behavior Research Methods, 53 , 1440–1454. https://doi.org/10.3758/s13428-020-01498-x

Uto, M. (2021b). A review of deep-neural automated essay scoring models. Behaviormetrika, 48 , 459–484. https://doi.org/10.1007/s41237-021-00142-y

Uto, M. (2023). A Bayesian many-facet Rasch model with Markov modeling for rater severity drift. Behavior Research Methods, 55 , 3910–3928. https://doi.org/10.3758/s13428-022-01997-z

Uto, M., & Okano, M. (2021). Learning automated essay scoring models using item-response-theory-based scores to decrease effects of rater biases. IEEE Transactions on Learning Technologies, 14 (6), 763–776. https://doi.org/10.1109/TLT.2022.3145352

Uto, M., & Ueno, M. (2018). Empirical comparison of item response theory models with rater's parameters. Heliyon, 4(5), e00622. https://doi.org/10.1016/j.heliyon.2018.e00622

Uto, M., & Ueno, M. (2020). A generalized many-facet Rasch model and its Bayesian estimation using Hamiltonian Monte Carlo. Behaviormetrika, 47, 469–496. https://doi.org/10.1007/s41237-020-00115-7

van der Linden, W. J. (2016). Handbook of item response theory, volume two: Statistical tools . Boca Raton, FL, USA: CRC Press.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems (pp. 5998–6008).

Wang, Y., Wei, Z., Zhou, Y., & Huang, X. (2018). Automatic essay scoring incorporating rating schema via reinforcement learning. Proceedings of the conference on empirical methods in natural language processing (pp. 791–797).

Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11 , 3571–3594. https://doi.org/10.48550/arXiv.1004.2316

Watanabe, S. (2013). A widely applicable Bayesian information criterion. Journal of Machine Learning Research, 14 (1), 867–897. https://doi.org/10.48550/arXiv.1208.6338

Wilson, M., & Hoskens, M. (2001). The rater bundle model. Journal of Educational and Behavioral Statistics, 26 (3), 283–306. https://doi.org/10.3102/10769986026003283

Wind, S. A., & Guo, W. (2019). Exploring the combined effects of rater misfit and differential rater functioning in performance assessments. Educational and Psychological Measurement, 79 (5), 962–987. https://doi.org/10.1177/0013164419834613

Wind, S. A., & Jones, E. (2019). The effects of incomplete rating designs in combination with rater effects. Journal of Educational Measurement, 56 (1), 76–100. https://doi.org/10.1111/jedm.12201

Wind, S. A., Wolfe, E. W., Engelhard, G., Jr., Foltz, P., & Rosenstein, M. (2018). The influence of rater effects in training sets on the psychometric quality of automated scoring for writing assessments. International Journal of Testing, 18(1), 27–49. https://doi.org/10.1080/15305058.2017.1361426

Zitzmann, S., & Hecht, M. (2019). Going beyond convergence in Bayesian estimation: Why precision matters too and how to assess it. Structural Equation Modeling: A Multidisciplinary Journal, 26 (4), 646–661. https://doi.org/10.1080/10705511.2018.1545232

Funding

This work was supported by Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Numbers 19H05663, 21H00898, and 23K17585.

Author information

Authors and affiliations

The University of Electro-Communications, Tokyo, Japan

Masaki Uto & Kota Aramaki

Corresponding author

Correspondence to Masaki Uto.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest.

Ethics approval

Not applicable

Consent to participate

Consent for publication

All authors agreed to publish the article.

Open Practices Statement

All results presented from our experiments for all models, including MFRM, MFRM with RSS, and GMFM, as well as the results for each repetition, are available for download at https://github.com/AI-Behaviormetrics/LinkingIRTbyAES.git . This repository also includes programs for performing our linking method, along with a sample dataset. These programs were developed using R and Python, along with RStan and PyTorch. Please refer to the README file for information on program usage and data format details.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Data splitting procedures

In this appendix, we explain the detailed procedures used to construct the reference group and the focal group while aiming to ensure distinct distributions of examinee abilities and rater severities, as outlined in experimental Procedure 2 in the Experimental procedures section.

Let \(\mu ^{\text {all}}_\theta \) and \(\sigma ^{\text {all}}_\theta \) be the mean and SD of the examinees’ abilities estimated from the entire dataset in Procedure 1 of the Experimental procedures section. Similarly, let \(\mu ^{\text {all}}_\beta \) and \(\sigma ^{\text {all}}_\beta \) be the mean and SD of the rater severity parameter estimated from the entire dataset. Using these values, we set target mean and SD values of abilities and severities for both the reference and focal groups. Specifically, let \(\acute{\mu }^{\text {ref}}_{\theta }\) and \(\acute{\sigma }^{\text {ref}}_{\theta }\) denote the target mean and SD for the abilities of examinees in the reference group, and \(\acute{\mu }^{\text {ref}}_{\beta }\) and \(\acute{\sigma }^{\text {ref}}_{\beta }\) be those for the rater severities in the reference group. Similarly, let \(\acute{\mu }^{\text {foc}}_{\theta }\) , \(\acute{\sigma }^{\text {foc}}_{\theta }\) , \(\acute{\mu }^{\text {foc}}_{\beta }\) , and \(\acute{\sigma }^{\text {foc}}_{\beta }\) represent the target mean and SD for the examinee abilities and rater severities in the focal group. Each of the eight conditions in Table 1 uses these target values, as summarized in Table  14 .

Given these target means and SDs, we constructed the reference and focal groups for each condition through the following procedure; a schematic code sketch is provided after the list.

1. Prepare the entire set of examinees and raters along with their ability and severity estimates. Specifically, let \(\hat{\varvec{\theta }}\) and \(\hat{\varvec{\beta }}\) be the collections of ability and severity estimates, respectively.

2. Randomly sample a value from the normal distribution \(N(\acute{\mu }^{\text {ref}}_\theta , \acute{\sigma }^{\text {ref}}_\theta )\), and choose an examinee with \(\hat{\theta }_j \in \hat{\varvec{\theta }}\) nearest to the sampled value. Add the examinee to the reference group, and remove it from the remaining pool of examinee candidates \(\hat{\varvec{\theta }}\).

3. Similarly, randomly sample a value from \(N(\acute{\mu }^{\text {ref}}_\beta ,\acute{\sigma }^{\text {ref}}_\beta )\), and choose a rater with \(\hat{\beta }_j \in \hat{\varvec{\beta }}\) nearest to the sampled value. Then, add the rater to the reference group, and remove it from the remaining pool of rater candidates \(\hat{\varvec{\beta }}\).

4. Repeat Steps 2 and 3 for the focal group, using \(N(\acute{\mu }^{\text {foc}}_\theta , \acute{\sigma }^{\text {foc}}_\theta )\) and \(N(\acute{\mu }^{\text {foc}}_\beta ,\acute{\sigma }^{\text {foc}}_\beta )\) as the sampling distributions.

5. Continue to repeat Steps 2, 3, and 4 until the pools \(\hat{\varvec{\theta }}\) and \(\hat{\varvec{\beta }}\) are empty.

6. Given the examinees and raters in each group, create the data for the reference group \(\textbf{U}^{\text {ref}}\) and the focal group \(\textbf{U}^{\text {foc}}\).

7. Remove examinees from each group, as well as their data, if they have received scores from only one rater, thereby ensuring that each examinee is graded by at least two raters.
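The following Python sketch summarizes Steps 1–5 of this procedure. It is schematic: theta_hat and beta_hat are assumed to be dictionaries mapping examinee and rater IDs to their estimates from the entire dataset, targets holds the target means and SDs from Table 14, and the data-construction and filtering steps (Steps 6 and 7) are omitted.

```python
# Schematic sketch of the group-construction sampling procedure (Steps 1-5).
import numpy as np

def build_groups(theta_hat, beta_hat, targets, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    theta_pool, beta_pool = dict(theta_hat), dict(beta_hat)
    groups = {"ref": {"examinees": [], "raters": []},
              "foc": {"examinees": [], "raters": []}}

    def draw_nearest(pool, mu, sigma):
        # Sample a target value and pick the remaining candidate closest to it.
        target = rng.normal(mu, sigma)
        key = min(pool, key=lambda k: abs(pool[k] - target))
        pool.pop(key)
        return key

    while theta_pool or beta_pool:
        for g in ("ref", "foc"):
            if theta_pool:
                groups[g]["examinees"].append(
                    draw_nearest(theta_pool, targets[g]["mu_theta"], targets[g]["sd_theta"]))
            if beta_pool:
                groups[g]["raters"].append(
                    draw_nearest(beta_pool, targets[g]["mu_beta"], targets[g]["sd_beta"]))
    return groups
```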

Appendix B: Experimental results for MFRM and MFRM with RSS

The experiments discussed in the main text focus on the results obtained from GMFM, as this model demonstrated the best fit to the dataset. However, it is important to note that our linking method is not restricted to GMFM and can also be applied to other models, including MFRM and MFRM with RSS. Experiments involving these models were carried out in the manner described in the Experimental procedures section, and the results are shown in Tables  15 and 16 . These tables reveal trends similar to those observed for GMFM, validating the effectiveness of our linking method under the MFRM and MFRM with RSS as well.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Uto, M., Aramaki, K. Linking essay-writing tests using many-facet models and neural automated essay scoring. Behav Res (2024). https://doi.org/10.3758/s13428-024-02485-2

Accepted: 26 July 2024

Published: 20 August 2024

DOI: https://doi.org/10.3758/s13428-024-02485-2


Keywords

  • Writing assessment
  • Many-facet Rasch models
  • IRT linking
  • Automated essay scoring
  • Educational measurement
  • Open access
  • Published: 25 August 2024

Optimizing a national examination for medical undergraduates via modern automated test assembly approaches

  • Lingling Xu 1 ,
  • Zhehan Jiang 1 ,
  • Fen Cai 1 ,
  • Jinying Ouyang 1 ,
  • Hanyu Liu 1 &
  • Ting Cai 1  

BMC Medical Education, volume 24, Article number: 919 (2024)


Background

Automated test assembly (ATA) represents a modern methodology that employs data-science optimization on computer platforms to automatically create test forms, thereby significantly improving the efficiency and accuracy of test assembly procedures. In the realm of medical education, large-scale high-stakes assessments often necessitate lengthy tests, leading to elevated costs in various dimensions (such as examinee fatigue and expenses associated with item development). This study aims to enhance the design of medical education assessments by leveraging modern ATA approaches.

Methods

To achieve this objective, a four-step process employing psychometric methodologies was used to calibrate and analyze the item pool of the Standardized Competence Test for Clinical Medicine Undergraduates (SCTCMU), a nationwide summative test comprising 300 multiple-choice questions (MCQs) in China. Subsequently, two modern ATA approaches were employed to determine the optimal item combination, accounting for both the statistical and content requirements specified in the test blueprint. The quality of the test forms assembled with these approaches was then evaluated in detail.

Results

An exploration of the psychometric properties of the SCTCMU, as a foundational step, revealed commendable quality in the item properties. Furthermore, the evaluation of the test forms assembled using modern ATA approaches indicated that the optimal test length could be ascertained within the predefined measurement precision. Specifically, this investigation demonstrates that the application of modern ATA approaches can substantially reduce the length of the assembled test form while maintaining the statistical and content standards specified in the test blueprint.

Conclusions

This study harnessed modern ATA approaches to facilitate the automatic construction of test forms, thereby significantly enhancing the efficiency and precision of test assembly procedures. The utilization of modern ATA approaches offers medical educators a valuable tool to enhance the efficiency and cost-effectiveness of medical education assessment.


Introduction

Assessments serve a wide array of vital functions: they exert a significant influence on learning, offer feedback on the efficacy of educational and training programs, and consequently contribute to patient safety [ 1 ]. Prior to the mid-twentieth century, medical education, licensure, and certification assessments primarily consisted of essays or oral evaluations [ 1 , 2 ]. During that era, assessment-derived evaluations often proved to be subjective, arbitrary, and lacked reproducibility [ 1 ]. Subsequently, multiple-choice question (MCQ) examinations emerged in large-scale high-stakes assessments [ 3 ]. Meanwhile, several alternative assessment methods, such as performance assessments, were also devised owing to the growing complexity of medical education assessments [ 4 ]. However, large-scale high-stakes assessments using primarily MCQs remain prevalent in medical education and are designed to objectively measure student performance in complex areas of medical knowledge. This approach is heavily utilized in medical licensure and certification examinations, such as the United States Medical Licensing Examination and the National Board of Medical Examiners subject examination.

To ensure that large-scale high-stakes assessments effectively demonstrate a student’s grasp of clinical knowledge, tests should exhibit objectivity, reproducibility (reliability), and validity for the intended purpose. Furthermore, they should garner acceptance from both examinees and examiners, integrate a learning-enhancing component, and maintain cost-efficiency [ 5 ]. From an educational measurement theory standpoint, an excessive number of items in an examination not only escalates item development costs but also results in unnecessary item exposure within the item bank [ 6 , 7 ]. Additionally, this practice may increase respondent burden, diminishing response quality and examinees’ willingness to complete the examination [ 8 ]. Such an approach not only elevates the risk of test fatigue and performance impairment but also amplifies the time and effort invested by test takers, thereby further increasing examination costs. Compared with other disciplines, existing medical education qualification and certification examinations show that large-scale high-stakes assessments often include a substantial number of test items [ 9 , 10 ]. Therefore, it is crucial to explore novel methodologies for optimizing test items in the field, ensuring precise measurement of target skills and attributes and reducing expenses associated with test development and administration, without imposing excessive burdens on examinees.

As an indispensable component of educational assessment, measurement theories offer frameworks and approaches to the entire process of a test; the mainstream ones include classic test theory (CTT) [ 11 , 12 , 13 ], generalizability theory (G-theory) [ 14 , 15 , 16 ], and item response theory (IRT) [ 17 , 18 ]. The extensive utilization of IRT in the past decades underscores its significance in the development and analysis of large-scale high-stakes assessments in medical education. The statistical models of IRT have been used to analyze item quality, cut-off score reliability, dimensionality, and examinees’ scores of many large-scale high-stakes assessments [ 10 , 19 , 20 , 21 ].

In tandem with the IRT’s advancements, the arduous process of manually selecting items for test form generation has been revolutionized with the introduction of automated test assembly (ATA). The conventional test assembly method heavily relies on the experiential insights of test developers, a process that varies among developers and is likely to result in unreliable and suboptimal test assembly decisions. In contrast, the modern ATA approach employs computer algorithms to automate the construction of test form, greatly enhancing efficiency and accuracy [ 22 , 23 ]. Modern ATA approaches aim to align the difficulty of test items with examinees’ ability levels under the IRT framework [ 24 ]. This minimizes the disparity in measurement precision between assembled test form and the desired test, while satisfying the specifications outlined in the test blueprint. Realizing these benefits, the research committee of the Standardized Competence Test for Clinical Medicine Undergraduates (SCTCMU) in China would investigate the use of modern ATA approaches to enhance the design.

Developed and administered jointly by the National Medical Examination Center and the National Center for Health Professions Education Development in China, the SCTCMU functions as a nationwide assessment for evaluating students’ learning outcomes in basic and clinical medical sciences at the end of the fourth year of their five-year undergraduate studies in clinical medicine [ 10 ]. Examinees whose scores meet the pass score are admitted to the clinical placement stage to further improve their clinical skills. The SCTCMU consists of two components: clinical skills assessed through objective structured clinical examinations (OSCEs) and medical knowledge assessed through computerized MCQs, which are the target of the present paper. The medical knowledge part comprises 300 dichotomously scored MCQs, each of which contains four distractors and one key answer.

The sheer magnitude of China’s population, the largest in the world, contributes to elevated costs associated with test administration. For example, more than 25,000 examinees registered for the Spring 2022 SCTCMU administration. Moreover, applying modern ATA approaches to optimize test items in the medical education field can amplify their utility, alleviating testing burdens for a greater absolute number of examinees. Fueled by these pragmatic necessities, modern ATA approaches were utilized to enhance the design of the SCTCMU by estimating the optimal test length, all the while adhering to the requisites of the test blueprint and the prescribed measurement precision.

This paper aims to apply modern ATA approaches to the SCTCMU to reduce the test length of assembled test form, while simultaneously upholding the requisite statistical and content standards that are outlined in the test blueprint. To achieve the objective, four fundamental issues were addressed in order to satisfy the cycle of a typical ATA research agenda: (a) investigating psychometric properties and deriving calibrated item parameters from an authentic SCTCMU item pool, (b) constructing test form using modern ATA approaches while adhering to a predetermined level of measurement precision, (c) evaluating the quality of the assembled test form in order to determine the optimal test length within the designated content and statistical requirements, and (d) providing practical insights and guidance for the design and quality assurance of medical education test form generated using ATA approaches.

Overall workflow

Figure  1 visually outlines the three pivotal components involved in this optimization process: item pool preparation, test assembly, and quality evaluation of the assembled test form. The item pool for this study comprises items extracted from the Spring 2022 SCTCMU. During the item pool preparation phase, a four-step process employing psychometric methodologies was employed to evaluate the item parameters of the Spring 2022 SCTCMU. Subsequently, various modern ATA approaches were used to identify the most suitable item combination, ensuring conformity with predefined constraints like alignment with the test blueprint, while achieving the desired measurement precision for pass-fail determinations. Ultimately, the attributes of the assembled test form, achieved through modern ATA approaches, were evaluated based on several criteria encompassing test length, reliability, validity, measurement precision, and adherence to the non-statistical aspects of the blueprint. These evaluation criteria will be elaborated upon in the subsequent sections.

Figure 1. The overall framework for utilizing modern ATA approaches to optimize medical education assessments

Item pool preparation

To conduct the psychometric evaluation of the SCTCMU during item pool preparation, a four-step flow covering the essential tasks was adopted, as seen in Fig. 2. Even without statistical or mathematical background, each step can be regarded as a prerequisite for its successor, and rules of thumb (i.e., thresholds) for all necessary statistics are also provided in the figure. Step 1 is self-evident: obtaining the target dataset is the foundation for further analysis. In the second step, the unidimensionality assumption should be satisfied so that IRT models can be constructed. As in any statistical modeling procedure, Step 3 requires researchers to check the appropriateness of the model, identifying which candidate model performs best on model-fit statistics and evaluating the magnitudes of the fit indices. After confirming the use of IRT, the final step is extracting the item parameter information yielded by the selected model. Details about each point listed in the steps are described below.

Step 1: data collection

The raw responses of 25,154 examinees to the 300 MCQs in the Spring 2022 SCTCMU administration were used as the source data. The scope of content is based on the specifications in the National Chinese Medical Qualification Examination Standards, and a summary of the content areas and their corresponding proportions for medical knowledge in each subject is shown in Tab. S1 of Supplementary Material 1.

Step 2: test model hypothesis

Unidimensionality is a critical assumption for conducting IRT analysis [ 25 ]. Factor-analytic methods are frequently used to assess whether a one-dimensional construct underlies the examination. Unidimensionality is confirmed when the ratio of the first eigenvalue to the second eigenvalue is equal to or greater than 4, and when the first factor accounts for more than 20% of the total variance [ 26 ].
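As a rough illustration of these two rules of thumb, the sketch below computes the first-to-second eigenvalue ratio and the proportion of variance explained by the first factor from an item response matrix. The use of Pearson inter-item correlations (rather than tetrachoric correlations), the simulated `responses` matrix, and the function name are simplifying assumptions made for brevity.

```python
import numpy as np

def unidimensionality_check(responses, ratio_cut=4.0, variance_cut=0.20):
    """Eigenvalue-based screening of the unidimensionality assumption.

    responses: (n_examinees, n_items) matrix of 0/1 item scores.
    Returns the first/second eigenvalue ratio, the proportion of variance
    explained by the first factor, and whether both rules of thumb are met.
    """
    corr = np.corrcoef(responses, rowvar=False)        # inter-item correlations
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]  # eigenvalues, descending
    ratio = eigvals[0] / eigvals[1]
    explained = eigvals[0] / eigvals.sum()
    return ratio, explained, (ratio >= ratio_cut) and (explained > variance_cut)

# Illustrative use with simulated unidimensional data standing in for the item pool:
rng = np.random.default_rng(0)
ability = rng.normal(size=(1000, 1))
difficulty = rng.normal(size=40)
responses = (rng.random((1000, 40)) < 1 / (1 + np.exp(-1.7 * (ability - difficulty)))).astype(int)
ratio, explained, ok = unidimensionality_check(responses)
print(f"eigenvalue ratio = {ratio:.2f}, variance explained = {explained:.1%}, pass = {ok}")
```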

Step 3: model-data fit and model selection

In the framework of IRT, an appropriate model is required for parameter estimation. In this study, three commonly used dichotomous IRT models were considered: the two-parameter logistic model (2PLM) [ 27 ], the three-parameter logistic model (3PLM) [ 28 ], and the Rasch model [ 29 ]. The optimal IRT model was selected for further analysis based on the following test-level model-fit indices: -2 log-likelihood (-2LL) [ 30 ], the Akaike information criterion (AIC) [ 31 ], the Bayesian information criterion (BIC) [ 32 ], and the G² statistic [ 33 ]. Lower values of these indices indicate better model fit. Additionally, the goodness-of-fit test (G² statistic) was not statistically significant, suggesting that the selected model provided a good fit to the data.
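For clarity, the short sketch below shows how these test-level indices are defined in terms of a fitted model's maximized log-likelihood; dedicated IRT packages report them directly, and the log-likelihood values used in the example are invented purely for illustration.

```python
import numpy as np

def fit_indices(log_likelihood, n_params, n_obs):
    """Test-level fit indices used for IRT model comparison.

    log_likelihood: maximized log-likelihood of the fitted model.
    n_params: number of estimated parameters (e.g., 3 per item for the 3PLM).
    n_obs: number of examinees.
    Lower values of -2LL, AIC, and BIC indicate better fit.
    """
    neg2ll = -2.0 * log_likelihood
    return {"-2LL": neg2ll,
            "AIC": neg2ll + 2.0 * n_params,
            "BIC": neg2ll + n_params * np.log(n_obs)}

# Comparing hypothetical log-likelihoods for the Rasch model, 2PLM, and 3PLM
# fitted to 300 items and 25,154 examinees (log-likelihoods are made up):
for name, ll, k in [("Rasch", -2_105_000.0, 300),
                    ("2PLM", -2_050_000.0, 600),
                    ("3PLM", -2_030_000.0, 900)]:
    print(name, {key: round(val, 1) for key, val in fit_indices(ll, k, 25_154).items()})
```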

Step 4: item parameter analysis

Item parameters, such as item discrimination and item difficulty, are essential in constructing test form. These parameters can be evaluated using specific cutoff values. Items with discrimination values below 0.8 should be excluded [ 25 , 34 ]. Baker [ 35 ] further categorizes the discriminative ability of an item as none (0), very low (0.01 to 0.34), low (0.35 to 0.64), moderate (0.65 to 1.34), high (1.35 to 1.69), very high (above 1.70), and perfect (+ infinite) based on its discrimination value. Furthermore, according to Steinberg and Thissen [ 36 ], the item difficulty parameter should fall within the range of -3 to 3.
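As a sketch of how these screening rules could be applied to the calibrated item parameters, the code below flags items using the discrimination and difficulty cutoffs quoted above and attaches Baker's verbal category to each discrimination value; the parameter arrays and function names are illustrative assumptions.

```python
import numpy as np

def screen_items(a, b, a_min=0.8, b_range=(-3.0, 3.0)):
    """Flag items that meet the discrimination and difficulty criteria.

    a: discrimination estimates; b: difficulty estimates (same length).
    Items with a < a_min or b outside b_range are flagged for exclusion/review.
    """
    a, b = np.asarray(a), np.asarray(b)
    return (a >= a_min) & (b >= b_range[0]) & (b <= b_range[1])

def baker_label(a):
    """Baker's (2001) verbal category for a discrimination value."""
    bands = [(0.0, "none"), (0.01, "very low"), (0.35, "low"),
             (0.65, "moderate"), (1.35, "high"), (1.70, "very high")]
    label = "none"
    for lower, name in bands:
        if a >= lower:
            label = name
    return label

# Illustrative use with made-up parameter estimates:
a_est = [0.42, 0.95, 1.48, 2.10]
b_est = [-1.2, 0.3, 2.5, 3.4]
print(screen_items(a_est, b_est))        # [False  True  True False]
print([baker_label(a) for a in a_est])   # ['low', 'moderate', 'high', 'very high']
```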

Figure 2. The steps for using psychometric methods in item pool preparation

Test assembly using modern ATA approaches

General framework of test assembly

The objective of test assembly is to minimize the difference in measurement precision between the assembled test form and the target test while meeting the test blueprint requirements, termed non-statistical constraints. When using modern ATA approaches for test assembly, one needs to specify the target test information function (TIF) values (the “targets”), which determine the ability estimation precision (i.e., score precision), and the “constraints,” which encode the qualitative, non-statistical test specifications. In this study, drawing on previous studies [ 37 , 38 , 39 ], we set the target TIF value at the cut score (i.e., the ability value corresponding to an expected test score of 180 on the test characteristic curve) to 25 during the test assembly process. Additionally, the SCTCMU content specialists provided test blueprints that outline the required test sections, test majors, and content representation in terms of topics.
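A minimal sketch of how such a target could be derived under the calibrated 3PLM: locate the ability value whose expected test score (test characteristic curve) equals 180, then evaluate the item and test information there. The made-up parameter arrays and the use of SciPy's brentq root finder are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import brentq

def p_3pl(theta, a, b, c):
    """3PLM probability of a correct response for each item at ability theta."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information of each 3PLM item at ability theta."""
    p = p_3pl(theta, a, b, c)
    return a**2 * ((p - c) / (1.0 - c))**2 * (1.0 - p) / p

def cut_score_theta(a, b, c, expected_score=180.0):
    """Ability value whose expected test score (TCC) equals the cut score."""
    tcc = lambda theta: p_3pl(theta, a, b, c).sum() - expected_score
    return brentq(tcc, -6.0, 6.0)

# Illustrative 300-item pool with made-up 3PLM parameters:
rng = np.random.default_rng(1)
a = rng.uniform(0.8, 2.0, 300)
b = rng.normal(0.0, 1.0, 300)
c = rng.uniform(0.05, 0.25, 300)

theta_cut = cut_score_theta(a, b, c)
tif_at_cut = item_information(theta_cut, a, b, c).sum()
print(f"theta at cut score: {theta_cut:.3f}, pool TIF there: {tif_at_cut:.1f}")
```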

ATA algorithms

Modern ATA algorithms can be classified into heuristic algorithms (e.g., the normalized weighted absolute deviation heuristic, the maximum priority index (MPI), and the weighted deviations model), mixed-integer programming (MIP) algorithms, and machine-learning algorithms (e.g., Bayesian optimization, deep learning, and simulated annealing). The reasons why the MPI and MIP algorithms were selected for this study are threefold. First, extensive literature has demonstrated the widespread application of MIP and MPI algorithms in multiple fields of educational and psychological research [ 40 , 41 , 42 ]. Second, unlike pure machine-learning approaches, MPI and MIP algorithms can effectively handle complex IRT models and accommodate both statistical and non-statistical constraints, allowing item selection to proceed in accordance with IRT principles. Third, numerous studies have successfully employed these algorithms in practical settings, such as the large-scale Chinese Proficiency Tests [ 43 ] and the Comprehensive Osteopathic Medical Licensing Examination of the United States [ 44 ]. The evidence for the maturity, suitability, and generalizability of these algorithms establishes a solid foundation for optimizing the design of medical education assessments. A review of how these two approaches are utilized in test assembly is provided below.

MPI algorithm

The MPI algorithm is used to control both the statistical and non-statistical constraints [ 45 ]. The test assembly processing flow of the MPI algorithm is shown in Fig. 3. Suppose that we need to select the item from the item pool that most closely matches the target TIF value at the cut score while satisfying K constraints. The priority index for a candidate item i in the pool is then given by

\(\text{PI}_{i} = I_{i} \prod_{k=1}^{K} \left( w_{k} f_{k} \right)^{q_{ik}}, \qquad (1)\)

where \(I_{i}\) represents the Fisher information of item i evaluated at the ability value of the cut score, \(f_{k}\) measures the scaled “quota left” of constraint k, and \(q_{ik}\) indicates whether item i is related to constraint k (\(q_{ik}=1\) if constraint k is relevant to item i, otherwise \(q_{ik}=0\)). Each constraint k is associated with a weight \(w_{k}\); generally, more important constraints are assigned larger weights.

Suppose constraint k is a content constraint that requires the test to have \(B_{k}\) items from a certain content area, and so far \(b_{k}\) items have been selected. The resulting scaled ‘quota left’ is calculated as \(f_{k}=\frac{B_{k}-b_{k}}{B_{k}}\). Using Eq. ( 1 ), we can compute the priority index (PI) for every available item in the pool, and the item with the largest PI, instead of the largest Fisher information, will be chosen for the test assembly. The process continues until the selected items satisfy the target measurement accuracy at the cut score.
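The following is a minimal sketch of this greedy priority-index loop at a single ability point (the cut score). It simplifies the general formulation of Eq. (1) to one content facet per item, and all data, weights, and function names are invented for illustration.

```python
import numpy as np

def mpi_select(info, content, quotas, weights, target_tif):
    """Greedy MPI item selection at a single ability point (the cut score).

    info:     Fisher information of each pool item at the cut score.
    content:  content-area label of each item (values are keys of `quotas`).
    quotas:   required number of items B_k for each content area k.
    weights:  weight w_k for each content area k.
    target_tif: stop once the selected items' information reaches this value.
    """
    info = np.asarray(info, dtype=float)
    remaining = set(range(len(info)))
    counts = {k: 0 for k in quotas}          # b_k: items already selected per area
    selected, tif = [], 0.0

    while remaining and tif < target_tif:
        best_item, best_pi = None, -np.inf
        for i in remaining:
            k = content[i]
            f_k = (quotas[k] - counts[k]) / quotas[k]   # scaled quota left
            pi = info[i] * (weights[k] * f_k)           # PI_i = I_i * prod_k (w_k f_k)^q_ik
            if pi > best_pi:
                best_item, best_pi = i, pi
        selected.append(best_item)
        counts[content[best_item]] += 1
        tif += info[best_item]
        remaining.remove(best_item)
    return selected, tif

# Illustrative use: 200 pool items, 4 content areas, target TIF of 25 at the cut score.
rng = np.random.default_rng(2)
info = rng.uniform(0.05, 0.6, 200)
content = rng.integers(0, 4, 200)
quotas = {k: 30 for k in range(4)}
weights = {k: 1.0 for k in range(4)}
items, tif = mpi_select(info, content, quotas, weights, target_tif=25.0)
print(len(items), round(tif, 2))
```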

MIP algorithm

Compared to the MPI algorithm, the MIP algorithm can strictly satisfy many non-statistical constraints (e.g., content specifications and item exposure rates) when building the test form [ 46 ]. The test assembly processing flow of the MIP algorithm is also shown in Fig. 3. An MIP model consists of an objective function and multiple constraints, both of which are defined as linear formulations. The goal of the MIP algorithm is to select available items from the item pool so that the difference between the test information based on the selected items and the target information at the anchored ability point (i.e., the cut score), \(\theta_{q}\), is minimized, while also meeting the non-statistical constraints. Specifically, items are selected for test assembly by minimizing

\(\left| \sum_{i=1}^{N} I_{i}(\theta_{q})\, x_{i} - T(\theta_{q}) \right|\)

subject to the non-statistical constraints and \(x_{i}\in \left\{0,1\right\},\; i=1,\dots,N\), where \(x_{i}\) indicates whether item i is selected, \(I_{i}(\theta_{q})\) is the information of item i at the cut score, and \(T(\theta_{q})\) is the target information value.
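A compact sketch of this formulation using the open-source PuLP modeling library (the original study does not name a specific solver, so PuLP and its bundled CBC solver are assumptions here); the absolute deviation from the target information is minimized through an auxiliary variable, with simple content quotas standing in for the blueprint constraints.

```python
import numpy as np
import pulp

rng = np.random.default_rng(3)
n_items = 200
info = rng.uniform(0.05, 0.6, n_items)   # item information at the cut score
content = rng.integers(0, 4, n_items)    # one content area per item
quotas = {k: 20 for k in range(4)}       # required items B_k per content area
target_tif = 25.0                        # target TIF value T(theta_q)

prob = pulp.LpProblem("ATA_MIP", pulp.LpMinimize)
x = [pulp.LpVariable(f"x_{i}", cat="Binary") for i in range(n_items)]
y = pulp.LpVariable("abs_deviation", lowBound=0)

# Objective: minimize |sum_i I_i(theta_q) x_i - T(theta_q)| via the auxiliary y.
prob += y
tif_expr = pulp.lpSum(float(info[i]) * x[i] for i in range(n_items))
prob += tif_expr - target_tif <= y
prob += target_tif - tif_expr <= y

# Non-statistical (content) constraints: exactly B_k items from each content area.
for k, b_k in quotas.items():
    prob += pulp.lpSum(x[i] for i in range(n_items) if content[i] == k) == b_k

prob.solve(pulp.PULP_CBC_CMD(msg=False))
selected = [i for i in range(n_items) if x[i].value() > 0.5]
print(len(selected), round(pulp.value(tif_expr), 2))
```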

Figure 3. The two approaches utilized in test assembly

Evaluation criteria

Drawing on previous studies of modern ATA approaches [ 43 , 46 , 47 ], the assembled test forms in this study were evaluated using the following measures. Firstly, the test length of the assembled test form was recorded. Secondly, Cronbach’s alpha was used to evaluate reliability, and criterion-related validity and content validity were computed. Thirdly, the value of the TIF at the cut score was recorded to assess measurement accuracy. Fourthly, the non-statistical constraint violation rate ( \(\bar{V}\) ) was used to evaluate how well the assembled test form satisfied the non-statistical constraints from the test blueprint [ 48 ]. A smaller \(\bar{V}\) indicates that fewer constraints from the test blueprint were violated in the constructed test form.
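Two of these criteria are easy to sketch in code: Cronbach's alpha and a simple version of the constraint violation rate, computed here as the share of blueprint quotas not met by the assembled form (the exact definition of \(\bar{V}\) in the cited work may differ). The simulated data and function names are illustrative.

```python
import numpy as np

def cronbach_alpha(responses):
    """Cronbach's alpha for an (examinees x items) score matrix."""
    responses = np.asarray(responses, dtype=float)
    n_items = responses.shape[1]
    item_var_sum = responses.var(axis=0, ddof=1).sum()
    total_var = responses.sum(axis=1).var(ddof=1)
    return n_items / (n_items - 1) * (1.0 - item_var_sum / total_var)

def violation_rate(selected_content, quotas):
    """Share of blueprint (content) quotas violated by the assembled form."""
    counts = {k: list(selected_content).count(k) for k in quotas}
    violated = sum(1 for k, b_k in quotas.items() if counts[k] != b_k)
    return violated / len(quotas)

# Illustrative use with simulated responses and an assembled form's content labels:
rng = np.random.default_rng(4)
ability = rng.normal(size=(500, 1))
difficulty = rng.normal(size=60)
responses = (rng.random((500, 60)) < 1 / (1 + np.exp(-1.7 * (ability - difficulty)))).astype(int)
print(round(cronbach_alpha(responses), 3))                     # reliability of the simulated form
print(violation_rate([0] * 20 + [1] * 20 + [2] * 18 + [3] * 22,
                     {0: 20, 1: 20, 2: 20, 3: 20}))            # 2 of 4 quotas violated -> 0.5
```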

Item pool analysis

In this study, the ratio between the first and second eigenvalues was 6.562 (i.e., greater than 4), and the first factor explained 23.203% of the total variance (i.e., higher than 20%). Therefore, the items in the current item pool were considered unidimensional and suitable for the next phase of analysis. To find a suitable IRT model for the dataset, the three commonly used dichotomous IRT models were fitted in R: the Rasch model, the 2PLM, and the 3PLM. As seen in Table 1, the 3PLM had the best fit among these models; therefore, it was chosen for the subsequent analysis. Furthermore, the item parameter estimates of the 3PLM indicated that most items had moderate to high discrimination power. The difficulty parameters were approximately normally distributed and sufficiently broad to cover a wide range of student abilities. The estimated guessing parameters were negligible.

Evaluation of the quality of the assembled test forms using modern ATA approaches

Test length, reliability, and validity

As shown in Table 2, the test length of the test assembled using the MPI algorithm was lower than that of the test assembled using the MIP algorithm. The Cronbach’s alpha coefficient for the test assembled using the MPI algorithm was 0.946, while that for the MIP algorithm was 0.937. Moreover, the tests assembled using the MPI and MIP algorithms exhibited strong criterion-related validity, as their total scores had significant correlations of 0.953 and 0.971 ( p  < .001) with the total SCTCMU score. Our analysis confirmed that the assembled tests had good content validity, as detailed in the following points: (1) the content coverage of the assembled test remains consistent with that of the item bank, both containing four content areas that meet the requirements of the test blueprints (see Table 2); (2) by comparing the content distribution of the item bank over the past three years, we identified a consistent pattern (see Tab. S2 in Supplementary Material 1), indicating that the content distribution of the item bank used in this study is reasonable. Our study also found that the content distribution of the assembled test was similar to that of the item bank (see Table 2). These findings collectively suggest that tests assembled using modern ATA approaches demonstrate robust reliability and validity.

Measurement accuracy and non-statistical constraints

As shown in Table 3, the actual TIF value at the cut score of the assembled test form closely approached the maximum TIF value, indicating a high level of measurement precision near the cut score. The test form generated using the MIP algorithm fully meets the predefined requirements for non-statistical constraints from the test blueprint. Additionally, the fulfillment of non-statistical constraints in the test form generated using the MIP algorithm is superior to that of the form generated using the MPI algorithm.

The present study focuses on optimizing the design of medical education assessments through the application of modern ATA approaches. A four-step process, employing psychometric methodologies, was utilized to calibrate and analyze the item pool of the SCTCMU. The evaluation demonstrated the commendable quality of the item properties. Subsequently, two modern ATA approaches (i.e., the MPI and MIP algorithms) were employed to determine the optimal item combination, accounting for both the statistical and content requirements specified in the test blueprint. The quality of the assembled test forms was then carefully evaluated. Results from this investigation indicated that the optimal test length could be ascertained within the predefined measurement precision. Overall, the utilization of these modern ATA approaches significantly reduced the length of the assembled test forms while satisfying both the statistical and content prerequisites outlined in the test blueprint.

Another significant finding of this study is that the test length of the form assembled using the MPI algorithm was lower than that of the form assembled using the MIP algorithm. One potential explanation for this disparity is the weight assigned to non-statistical constraints within the MPI algorithm. Future research could explore the impact of adjusting the weight of non-statistical constraints within the MPI algorithm to ascertain the optimal number of test items required for assembly. Additionally, the measurement accuracy outcomes from the MPI algorithm are generally similar to those from the MIP algorithm. While the MIP algorithm guarantees the satisfaction of all non-statistical constraints, the MPI algorithm may lead to minor non-statistical constraint violations. The flawless non-statistical constraint adherence of the MIP algorithm is due to the global optimization objective within MIP models and the substantial number of items in each non-statistical constraint category within the item bank. Nevertheless, the MIP algorithm may encounter infeasibility problems if the item bank has a limited number of items in each category. These results offer valuable insights into selecting a specific modern ATA approach for optimizing the design of medical education assessments. While MIP algorithms are recommended when non-statistical constraint satisfaction is prioritized, caution must be exercised regarding potential infeasibility concerns. On the other hand, MPI algorithms are suitable for those seeking a more balanced and cost-efficient test assembly, given their heuristic nature and ease of implementation.

Leveraging modern psychometric methods rooted in IRT has proven beneficial for the calibration and analysis of item pools in medical education assessments. IRT, by providing the standard error of measurement at various ability levels, facilitates the construction of large-scale high-stakes assessments that optimize measurement precision at the pass-fail threshold. Furthermore, metric calibration enables the establishment of item banks, ensuring continuity and equity in examination standards. However, certain considerations merit attention in practice. Different IRT models exist for varying data types and testing scenarios. For instance, the Rasch model, 2PLM, and 3PLM described herein are applicable to dichotomously scored items, while the many-facet Rasch model addresses cases where the same examinee is scored by multiple judges, as in an OSCE. The range of item parameters used for evaluating item pool quality, such as the discrimination and difficulty parameters, serves an informational purpose and should be interpreted within the context of the assessment’s objectives. For instance, certain items assessing critical content mastery may be retained in criterion-referenced assessments even if their discrimination parameters are relatively modest.

In the field of medical education, creating the items needed for a large-scale high-stakes assessment requires extensive MCQ development. Item development is an expensive process: the cost of developing a single item often ranges from US$1500 to US$2000 [ 6 ]. Given this estimate, it is easy to see how the costs of item development quickly escalate. Moreover, administering a multitude of items leads to greater respondent burden, which can reduce the quality of an examinee’s responses and/or their willingness to take the test altogether [ 8 ]. To address these issues, this study harnessed IRT-based ATA approaches to facilitate the automatic construction of test forms, thereby significantly enhancing the efficiency and precision of test assembly procedures. The utilization of modern ATA approaches offers medical educators a valuable tool and has important implications for medical education assessment. Firstly, for test administration, our approach can effectively reduce the number of required test items, resulting in significant cost savings in test development. Secondly, for examinees, the shorter duration helps prevent mistakes due to fatigue, and it also implies a lower error tolerance and a higher demand on their overall ability. Thirdly, for educational policy and practice in other regions or disciplines, these changes could inspire policymakers in other districts to adapt their assessment systems for better student experiences and outcomes. Additionally, it may motivate educators to reassess testing methods, favoring more effective, targeted assessments aligned with learning outcomes.

Despite the promising results, the present investigation could be further enhanced in several ways. First, the item bank was consistent with the assumed, and subsequently verified, structure of unidimensionality under a series of criteria in accordance with the requirements of the proposed approach. However, if multidimensional feedback is demanded, future work should consider using multidimensional item response theory (MIRT), such as bifactor IRT [ 49 ], to accommodate the need for subdomain assessment. To reiterate, dimensionality is both theoretical and empirical, meaning that even if subdomains are assumed to exist, they should be evidenced by data and modeling results. That said, the general flow of fitting MIRT models should be conducted and validated prior to multidimensional ATA. Second, the utilization of modern ATA approaches involves complex computing processes that require considerable computing power. Each assembly iteration is costly, and elements such as the sample size, the number of items, and the complexity of the models and constraints dramatically increase the computational costs. In the present study, parallel computation on a high-end workstation was utilized to reduce the computing time; even so, it took substantial time to obtain the results. Future studies should explore further strategies for boosting speed to deliver timely results to the test/exam organizer. Third, blueprints are not always static; test/exam organizers may revise and update their blueprints from time to time, meaning the constraints should be adjusted accordingly. While this study evaluated the adaptability of the MIP and MPI algorithms to such constraints, further research could explore the effects of the number of constraints in test blueprints on test assembly. Lastly, the number of items correlates highly with the session time, which should be reduced in line with the shortened test length. Future studies could match the timing information from the log files of the examination administration server and incorporate it to determine the session time.

In the present study, we investigated the utilization of modern ATA approaches for optimizing the design of medical education assessments. This investigation demonstrates that the application of IRT-based ATA approaches can substantially reduce the test length of assembled test form, while simultaneously adhering to the requisite statistical and content standards outlined in the test blueprint. Our findings suggest that the utilization of modern ATA approaches offers medical educators a valuable tool to enhance the efficiency and cost-effectiveness of medical education assessments, such as reducing respondent burden and saving costs associated with test development.

Data availability

The datasets analyzed during the current study are not publicly available but are available from the corresponding author on reasonable request. Requests to access these datasets should be directed to ZJ, [email protected].

Abbreviations

ATA: Automated Test Assembly

SCTCMU: Standardized Competence Test for Clinical Medicine Undergraduates

MCQ: Multiple-Choice Question

CTT: Classic Test Theory

G-theory: Generalizability Theory

IRT: Item Response Theory

OSCE: Objective Structured Clinical Examination

2PLM: Two-Parameter Logistic Model

3PLM: Three-Parameter Logistic Model

-2LL: -2 Log-Likelihood

AIC: Akaike Information Criterion

BIC: Bayesian Information Criterion

TIF: Test Information Function

MPI: Maximum Priority Index

MIP: Mixed-Integer Programming

RMSD: Root Mean Square Deviation

\(\bar{V}\): Non-Statistical Constraint Violation Rate

MIRT: Multidimensional Item Response Theory

Norcini J, Anderson B, Bollela V, Burch V, Costa MJ, Duvivier R, Galbraith R, Hays R, Kent A, Perrott V, Roberts T. Criteria for good assessment: consensus statement and recommendations from the Ottawa 2010 conference. Med Teach. 2011;33(3):206–14. https://doi.org/10.3109/0142159X.2011.551559


Newble D. Techniques for measuring clinical competence: objective structured clinical examinations. Med Educ. 2004;38(2):199–203. https://doi.org/10.1111/j.1365-2923.2004.01755.x

Norcini J, Burch V. Workplace-based assessment as an educational tool: AMEE Guide 31. Med Teach. 2007;29(9):855–71. https://doi.org/10.1080/01421590701775453

Howley LD. Performance assessment in medical education: where we’ve been and where we’re going. Eval Health Prof. 2004;27(3):285–303. https://doi.org/10.1177/0163278704267044

Van Der Vleuten CP. The assessment of professional competence: developments, research and practical implications. Adv Health Sci Educ Theory Pract. 1996;1(1):41–67. https://doi.org/10.1007/BF00596229

Gierl MJ, Lai H, Turner SR. Using automatic item generation to create multiple-choice test items. Med Educ. 2012;46(8):757–65. https://doi.org/10.1111/j.1365-2923.2012.04289.x

Xing D, Hambleton RK. Impact of test design, item quality, and item bank size on the psychometric properties of computer-based credentialing examinations. Educ Psychol Meas. 2004;64(1):5–21. https://doi.org/10.1177/0013164403258393

Finkelman MD, Smits N, Kim W, Riley B. Curtailment and stochastic curtailment to shorten the CES-D. Appl Psychol Meas. 2012;36(8):632–58. https://doi.org/10.1177/0146621612451647

Guttormsen S, Beyeler C, Bonvin R, Feller S, Schirlo C, Schnabel K, Schurter T, Berendonk C. The new licencing examination for human medicine: from concept to implementation. Swiss Med Wkly. 2013;143:w13897. https://doi.org/10.4414/smw.2013.13897

Han Y, Jiang Z, Ouyang J, Xu L, Cai T. Psychometric evaluation of a national exam for clinical undergraduates. Front Med (Lausanne). 2022;9:1037897. https://doi.org/10.3389/fmed.2022.1037897

Lord FM, Novick MR. Statistical theories of mental test scores. Reading, MA: Addison-Wesley; 1968.


Feldt LS, Brennan RL. Reliability. In: Linn RL, editor. Educational Measurement. 3rd ed. New York: American Council on Education and MacMillan; 1989. pp. 105–46.

Haertel EH. Reliability. In: Brennan RL, editor. Educational measurement. 4th ed. Westport, CT: American Council on Education/Praeger; 2006. pp. 65–110.

Cronbach LJ, Gleser GC, Nanda H, Rajaratnam N. The dependability of behavioral measurements: theory of generalizability for scores and profiles. New York: Wiley; 1972.

Brennan RL. Elements of generalizability theory (rev. ed.). Iowa City. IA: ACT, Inc; 1992.

Brennan RL. Generalizability theory. New York: Springer-; 2001.


Embretson SE, Reise SP. Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum; 2000.

Lord FM. Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum; 1980.

Downing SM. Validity: on meaningful interpretation of assessment data. Med Educ. 2003;37(9):830–7. https://doi.org/10.1046/j.1365-2923.2003.01594.x

Hissbach JC, Klusmann D, Hampe W. Dimensionality and predictive validity of the HAM-Nat, a test of natural sciences for medical school admission. BMC Med Educ. 2011;11:83. https://doi.org/10.1186/1472-6920-11-83

Lahner FM, Schauber S, Lörwald AC, Kropf R, Guttormsen S, Fischer MR, Huwendiek S. Measurement precision at the cut score in medical multiple choice exams: theory matters. Perspect Med Educ. 2020;9(4):220–8. https://doi.org/10.1007/s40037-020-00586-0

Swanson L, Stocking ML. A model and heuristic for solving very large item selection problems. Appl Psychol Meas. 1993;17(2):151–66.

van der Linden WJ. Linear models for optimal test design. New York: Springer; 2005.

Luo X. Automated Test Assembly with mixed-integer programming: the effects of modeling approaches and solvers. J Educ Meas. 2020;57(4):547–65. https://doi.org/10.1111/jedm.12262

Tan Q, Cai Y, Li Q, Zhang Y, Tu D. Development and validation of an Item Bank for Depression Screening in the Chinese Population using computer adaptive testing: a Simulation Study. Front Psychol. 2018;9:1225. https://doi.org/10.3389/fpsyg.2018.01225

Flens G, Smits N, Carlier I, van Hemert AM, de Beurs E. Simulating computer adaptive testing with the Mood and anxiety Symptom Questionnaire. Psychol Assess. 2016;28(8):953–62. https://doi.org/10.1037/pas0000240

Birnbaum A. On the estimation of mental ability. Ser Rep. 1958;15:7755–7723.

Birnbaum AL. Some latent trait models and their use in inferring an examinee’s ability. Statistical theories of mental test scores;1968.

Rasch G. Probabilistic models for some intelligence and attainment tests. The Danish Institute of Educational Research. Copenhagen: Chicago: The University of Chicago Press; 1960.

Spiegelhalter DJ, Best NG, Carlin BP, Van der Linde A. Bayesian deviance, the effective number of parameters, and the comparison of arbitrarily complex models. Sci Rep. 1998;98:009.

Akaike H. A new look at the statistical model identification. IEEE Trans Automat Contr. 1974;19:716–23.

Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6:461–4.

Lu Y. Accessing fit of item response theory models (Unpublished doctoral dissertation), Massachusetts Amherst University. 2006.

Cho S, Drasgow F, Cao M. An investigation of emotional intelligence measures using item response theory. Psychol Assess. 2015;27(4):1241–52. https://doi.org/10.1037/pas0000132

Baker FB. The basics of Item Response Theory. Clearinghouse on Assessment and evaluation. College Park, MD: University of Maryland; 2001.

Steinberg L, Thissen D. Uses of item response theory and the testlet concept in the measurement of psychopathology. Psychol Methods. 1996;1(1):81.

Hambleton RK, Lam W. Redesign of MCAS tests based on a consideration of information functions (Revised Version); 2009.

Qi S, Zhou J, Zhang Q. Application of information function technique to analyzing the criterion-reference test. Sty Psychol Behav. 2003;(1):6.

Young JW, Morgan R, Rybinski P, Steinberg J, Wang Y. Assessing the test information function and differential item functioning for the TOEFL Junior ® Standard Test. ETS Res Rep Ser. 2013;2013(1):i–27.

Li J, van der Linden WJ. A comparison of constraint programming and mixed-integer programming for Automated Test‐Form Generation. J Educ Meas. 2018;55(4):435–56. https://doi.org/10.1111/jedm.12187

Al-Yakoob SM, Sherali HD. Mathematical models and algorithms for a high school timetabling problem. Comput Oper Res. 2015;61:56–68. https://doi.org/10.1016/j.cor.2015.02.011

Chang HH. Psychometrics behind computerized adaptive testing. Psychometrika. 2015;80(1):1–20. https://doi.org/10.1007/s11336-014-9401-5

Wang S, Zheng Y, Zheng C, Su YH, Li P. An Automated Test Assembly Design for a large-scale Chinese proficiency test. Appl Psychol Meas. 2016;40(3):233–7. https://doi.org/10.1177/0146621616628503

Shao C, Liu S, Yang H, Tsai TH. Automated test assembly using SAS operations research software in a medical licensing examination. Appl Psychol Meas. 2020;44(3):219–33. https://doi.org/10.1177/0146621619847169

Cheng Y, Chang HH. The maximum priority index method for severely constrained item selection in computerized adaptive testing. Br J Math Stat Psychol. 2009;62(Pt 2):369–83. https://doi.org/10.1348/000711008X304376

Luecht R, Brumfield T, Breithaupt K. A testlet assembly design for adaptive multistage tests. Appl Meas Educ. 2006;19(3):189–202. https://doi.org/10.1207/s15324818ame1903_2

Luecht RM. Computer-assisted test assembly using optimization heuristics. Appl Psychol Meas. 1998;22(3):224–36.

Xu L, Wang S, Cai Y, Tu D. The automated test assembly and routing rule for multistage adaptive testing with multidimensional item response theory. J Educ Meas. 2021;58(4):538–63.

Gibbons RD, Alegria M, Markle S, Fuentes L, Zhang L, Carmona R, Collazos F, Wang Y, Baca-García E. Development of a computerized adaptive substance use disorder scale for screening and measurement: the CAT-SUD. Addiction. 2020;115(7):1382–94. https://doi.org/10.1111/add.14938


Acknowledgements

The authors would like to thank the editor and the anonymous reviewers for their suggestions, and are very grateful to all the individual participants who were involved in this study.

This work was supported by National Natural Science Foundation of China for Young Scholars under Grant 72104006 and 72304019, Peking University Health Science Center under Grant BMU2021YJ010, National Medical Examination Center of China for the project Examination Standards and Content Designs of National Medical Licensing Examination, China Postdoctoral Science Foundation under Grant 2023M740082, and Peking University Health Science Center Medical Education Research Funding Project 2023YB24.

Author information

Authors and affiliations.

Peking University, Beijing, China

Lingling Xu, Zhehan Jiang, Fen Cai, Jinying Ouyang, Hanyu Liu & Ting Cai


Contributions

ZJ and LX developed the study concept and drafted the manuscript. ZJ conducted the literature review and discussion. LX and FC performed the data analysis. FC and HL implemented the data interpretation. JO and TC were involved in drafting and revising the manuscript. All authors contributed to the article and approved the submitted version.

Corresponding authors

Correspondence to Zhehan Jiang or Fen Cai .

Ethics declarations

Ethics approval and consent to participate.

The studies involving human participants were reviewed and approved by Biomedical Ethics Committee of Peking University (IRB00001052-22070). All methods were carried out in accordance with relevant guidelines and regulations. Informed consent was obtained from all subjects.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions.

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .


About this article

Cite this article.

Xu, L., Jiang, Z., Cai, F. et al. Optimizing a national examination for medical undergraduates via modern automated test assembly approaches. BMC Med Educ 24 , 919 (2024). https://doi.org/10.1186/s12909-024-05905-1


Received : 18 September 2023

Accepted : 14 August 2024

Published : 25 August 2024

DOI : https://doi.org/10.1186/s12909-024-05905-1


  • Medical education assessment
  • Item response theory
  • Automated test assembly
  • Optimization
  • Modern psychometric method

