Research Article

Automated Scoring of Short-Answer Questions: A Progress Report


ABSTRACT

Multiple-choice questions have become ubiquitous in educational measurement because the format allows for efficient and accurate scoring. Nonetheless, there remains continued interest in constructed-response formats. This interest has driven efforts to develop computer-based scoring procedures that can accurately and efficiently score these items. Early procedures were typically based on surface features of the responses or simple matching procedures, but recent developments in natural language processing have allowed for much more sophisticated approaches. This paper reports on a state-of-the-art methodology for scoring short-answer questions supported by a large language model. Responses were collected in the context of a high-stakes test for medical students. More than 35,000 responses were collected across 71 studied items. Aggregated across all responses, the proportion of agreement with human scores ranged from .97 to .99 (depending on specifics such as training sample size). In addition to reporting detailed results, the paper discusses practical issues that require consideration when adopting this type of scoring system.

Multiple-choice questions (MCQs) were introduced by Frederick Kelly (Citation1916) more than a century ago as part of the Kansas State Silent Reading Tests. Kelly’s immediate concern in introducing the format was to eliminate the errors that were produced when teachers scored tests. In addition to objective scoring, the approach also improved the efficiency of scoring because it reduced scoring to a simple clerical activity that could be performed using a template. This efficiency ushered in the era of large-scale testing. For example, multiple-choice items made it possible for the United States Army to test more than 1,700,000 recruits during World War I and for school systems to test millions of students in the decades following that war. Since that time, the introduction of the optical scanner and more recently computer administration of tests have further increased the efficiency of MCQs. The accuracy and efficiency of MCQs have led to the format dominating testing for more than half a century.

In contrast, constructed-response items cannot be scored as efficiently – and, in many cases, as objectively – as selected-response items. Test administrators who elect to use constructed-response items rather than MCQs only do so when the perceived benefits of these items outweigh the limitations of the format. The choice to use constructed-response items rather than MCQs is a choice to use a format that typically will be more expensive to score, more error-prone, and may require greater delays in score reporting. Despite these limitations, there has been a consistent interest in both complex constructed responses such as essays and short constructed-response items (typically referred to as short-answer questions or SAQs). This choice is most often motivated by the belief that these items measure aspects of the relevant construct that cannot be assessed by MCQs and by concerns that the use of MCQs may lead to changes in curriculum that may be detrimental to education – changing both what teachers teach and what students study (e.g., Fiske, Citation1990; Frederiksen & Collins, Citation1989; Guthrie, Citation1984; Nickerson, Citation1989; Sam et al., Citation2016).

Although some authors have been adamant about the limitations of MCQs, others have suggested that “the response format has less influence on what is being measured than we may be inclined to think” (e.g., Schuwirth & van der Vleuten, Citation2004, p. 977). Clearly, there are strongly held beliefs about the strengths and limitations of these formats, but empirical evidence to support these beliefs is more difficult to find. The cognitive process used to respond to an item may have as much to do with the skill of the item writer and the ability of the examinee as it has to do with the item format. There is one aspect of a response to an MCQ that necessarily differs from a response to a comparable SAQ: the response to the MCQ requires recognition of the correct answer, whereas the response to the SAQ requires recall. In some cases, this difference may be viewed as critical to the construct being assessed. In addition to this distinction between the formats, which exists by definition, there have been empirical studies that compare the performance of the two formats. Mee et al. (Citation2023) provides a comparison of the two formats that is particularly relevant: they used the same 71 items (as well as a subset of the test-taker responses) that are used in the present study. In Mee et al., each test taker responded to two randomly assigned study items in MCQ format and two different randomly assigned items in SAQ format. This allowed item performance to be compared across formats using randomly equivalent groups. The results showed that the MCQs were easier than the SAQs: the average p-value for the MCQ items was 0.80, whereas for the SAQ items it was 0.65. The two formats were similar in terms of discrimination: mean discrimination was 0.18 for the MCQs and 0.19 for the SAQs. These results are consistent with those reported by previous researchers (e.g., Norman et al., Citation1987; Sam et al., Citation2018; Schuwirth et al., Citation1996). Because the items in the Mee et al. study were administered by computer, it was also possible to accurately record the time required to respond to the two formats: on average, the SAQs took 22 seconds (27%) longer. (See Mee et al. for a more detailed review of the available literature comparing these formats.)

The types of performance differences between items in the MCQ and SAQ format described in the previous paragraph appear to be well established. Evidence of clear differences in the cognitive processes required to respond to the two formats is more elusive. Until more research is completed comparing the two item formats, the choice to use SAQs will, to some extent, continue to be based on beliefs about the strengths and limitations of the two formats. Nonetheless, there continues to be interest in the use of SAQs. That interest is most clearly evident in the fact that these items are used as part of several large-scale assessments. For example, SAQs are used on three of the largest exams administered to school children: the Programme for International Student Assessment (PISA; Yamamoto et al., Citation2017), the Trends in International Mathematics and Science Study (TIMSS; International Association for the Evaluation of Educational Achievement (IEA), Citation2013), and the National Assessment of Educational Progress (NAEP; U.S. Department of Education, Citation2022). SAQs are also used in assessments for physician licensure in North America, including the COMLEX-USA Level 3 examination (National Board of Osteopathic Medical Examiners, Citation2023) and the Medical Council of Canada Qualifying Examination, Part I (Medical Council of Canada, Citation2023).

The continued interest in SAQs has led to longstanding efforts to develop effective and efficient automated scoring procedures for these items. This paper adds to that body of research by reporting on a study designed to evaluate the accuracy of an automated system designed for scoring SAQs used in medical education. This paper is far from the first to report on automated scoring procedures for this item type and it will not be the last. Instead, it is meant to be a progress report that provides an example of the current state-of-the-art for scoring these items. Although the results we report were collected in the context of a high-stakes test of internal medicine designed for medical students, the general approach would likely have applicability across the educational spectrum including assessments such as TIMSS, PISA, and NAEP.

In addition to describing the scoring system and reporting the results of this study, we discuss the implications of the results for applications in large-scale testing. Since no system for scoring constructed responses will be as objectively accurate as scoring systems for MCQs, we consider practical strategies for implementation. In the next section we provide a brief overview of previous research on automated scoring.

1. Literature Review

1.1. Automated scoring of free-text responses

Much of the early work on automated scoring of constructed responses focused on essays. Ellis Page’s work in this area dates back more than a half-century (Page, Citation1966, Citation1967). Page scored essays based on surface features that correlated with human ratings of those essays. The approach used surrogates (e.g., word count, use of punctuation) rather than more direct measures of the quality of the writing. This general approach has remained a part of some subsequent systems for automated scoring of free text, but has obvious limitations when scoring is intended to reflect the content rather than the form of the writing.

Recognizing that the use of constructed-response items was hampered by cost and the difficulties in producing reliable scores, researchers for Educational Testing Service (in collaboration with others) examined applications of what they referred to as expert systems in the 1990s. These systems were designed to score constructed responses to computer programming and mathematics problems. Braun et al. (Citation1990) showed that their best-performing system could analyze 93–95% of responses (produced by a sample of students who took the Advanced Placement Computer Science Examination) and produce correlations with human scores that were in the .83–.88 range for some problems. For other problems included in the study, the correlations were markedly lower. Similarly, Bennett and Sebrechts (Citation1996) examined the potential for an expert system to support qualitative feedback to test takers responding to math problems on the Graduate Record Examination. On average, the proportion of agreement between the expert system and human judges was .91 when determining whether the answer was correct or incorrect. The system was much less useful for identifying the specific errors required to support feedback, although this may have been driven significantly by lower agreement between the judges on these errors.

In a review of approaches to automated scoring, Martinez and Bennett (Citation1992) note that “Perhaps the most difficult constructed response to score is one involving natural language” (p. 164). During the following decades, considerable attention was given to that problem. As we have already noted, Page’s system was based on specific quantifiable characteristics of the response. Although subsequent systems incorporated this approach, two important alternative approaches emerged as more sophisticated methods for scoring text responses. The first of these produced a measure that represented the similarity between a new response and responses that had already received human scores. This general approach was initially referred to as latent semantic analysis (e.g., Landauer et al., Citation1998).

To implement latent semantic analysis, a large corpus of relevant text is identified. The corpus is used to create a multidimensional semantic space in which every word, sentence, paragraph, or document in the corpus can be represented by a vector of numbers. The corpus will typically include hundreds of thousands of paragraphs of text, and the semantic space will have hundreds of dimensions. Previously scored essays can then be located in this semantic space and the associated vectors for new essays can then be compared to those of previously scored essays by using a similarity metric such as the cosine of the angle between the vector representing the new essay and that representing any previously scored essay. The score for the previously scored essay that represents the closest match, or the average score for some set of similar essays, then can be assigned to the new essay. This general approach was shown to be useful for scoring a range of written responses (Burstein et al., Citation2001; Landauer et al., Citation2000; Streeter et al., Citation2011; Swygert et al., Citation2003). The one application in which latent semantic analysis has been less effective is in scoring short written responses where the answer key defines highly specific concepts (LaVoie et al., Citation2020; Willis, Citation2015).
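To make the mechanics of this approach concrete, the sketch below builds a small LSA-style semantic space with truncated SVD and assigns a new response the score of its nearest previously scored neighbor by cosine similarity. It is only an illustration under assumed inputs (the toy responses, scores, and two-dimensional space are invented), not the procedure used in any of the systems cited above.

```python
# Minimal LSA-style scoring sketch; corpus, scores, and dimensionality are placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

scored_responses = [
    "the heart pumps blood through the body",
    "the liver filters toxins from the blood",
    "blood is circulated by the heart muscle",
    "the kidneys remove waste from the blood",
]
human_scores = [1, 0, 1, 0]  # hypothetical human scores (1 = correct, 0 = incorrect)

vectorizer = TfidfVectorizer()
term_doc = vectorizer.fit_transform(scored_responses)        # term-document matrix
svd = TruncatedSVD(n_components=2, random_state=0)           # low-rank "semantic space"
scored_vectors = svd.fit_transform(term_doc)

new_response = "blood is pumped by the heart"
new_vector = svd.transform(vectorizer.transform([new_response]))

similarities = cosine_similarity(new_vector, scored_vectors)[0]
assigned_score = human_scores[int(np.argmax(similarities))]  # score of the closest match
print(assigned_score)
```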

For scoring short responses in which content is critical, a second approach has been developed in which the string of words in the response is explicitly matched to concepts delineated in a key. Numerous studies have reported on this general approach (Cook et al., Citation2012; Jani et al., Citation2020; Liu et al., Citation2014; Sam et al., Citation2016; Sarker et al., Citation2019; Sukkarieh & Blackmore, Citation2009; Willis, Citation2015; Yamamoto et al., Citation2017). These studies represent a range of variations on the same basic theme. The simplest systems use automation to increase the efficiency of human judges. The approach uses a key (sometimes called a dictionary) and each time the human judges designate a response as correct or incorrect, the response is added to the key. If a subsequent test taker enters a response that matches any existing response in the key, the system scores that response without showing it to a human judge. When the total number of responses is substantially greater than the number of unique responses, this general approach can be highly efficient (Willis, Citation2015; Yamamoto et al., Citation2017). For example, Sam et al. (Citation2016) used this approach to score 15 short answer questions presented to 266 medical students and reported that each question was scored for all students in less than two minutes.

Other approaches that match to a key or dictionary rely on fuzzy matching using natural language processing techniques. For example, INCITE (Harik et al., Citation2023; Sarker et al., Citation2019) is a system that matches examinee responses to dictionaries developed by content experts using two types of fuzzy matching. The first of these is based on the Levenshtein Ratio. The Levenshtein Ratio is defined as the sum of the lengths of the two strings being evaluated for a potential match, minus the distance between the strings, divided by the sum of the lengths:

$$\text{Levenshtein Ratio} = \frac{\text{Length} - \text{Distance}}{\text{Length}}.$$

In this context, the “distance” between two strings equals the number of deletions, insertions, or substitutions required to transform one string to the other. This approach can identify matches when the response has been misspelled. Without fuzzy matching, each acceptable misspelling must be entered into the dictionary separately. The second form of fuzzy matching used in the INCITE system is referred to as bag of words. With this approach, matching is based on the number of words in one sample that also appear in the second sample without regard to word order. As Harik et al. note, the INCITE system was developed to score patient notes produced as part of a clinical skills assessment for medical licensure. Scoring was based on identifying key concepts in the note. The system was sufficiently effective that operational implementation was scheduled to begin within weeks of when the examination was suspended because of the COVID-19 outbreak (Harik et al., Citation2023). (Although these patient notes were much longer than the response to an SAQ, the individual concepts were similar in length and complexity to SAQ responses).
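The sketch below implements the two forms of fuzzy matching described above as we understand them from the published descriptions; it is not the INCITE code, and the normalization used for the bag-of-words score is our own assumption.

```python
# Illustrative fuzzy matching: Levenshtein Ratio and a bag-of-words overlap.

def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def levenshtein_ratio(a: str, b: str) -> float:
    """(sum of lengths - distance) / sum of lengths, as defined in the text."""
    total = len(a) + len(b)
    return (total - levenshtein_distance(a, b)) / total

def bag_of_words_overlap(response: str, key: str) -> float:
    """Share of the key's words that appear in the response, ignoring word order.
    (The normalization is an assumption made for this sketch.)"""
    response_words, key_words = set(response.lower().split()), set(key.lower().split())
    return len(response_words & key_words) / max(len(key_words), 1)

print(levenshtein_ratio("myocardial infraction", "myocardial infarction"))  # approx. 0.95
print(bag_of_words_overlap("infarction of the myocardium", "myocardial infarction"))
```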

So far, we have presented three conceptually different approaches for scoring text: 1) scoring based on quantifiable aspects of the response, 2) similarity-based procedures in which responses are represented as a vector of numbers, and 3) matching to a keyed response. It is not uncommon, however, for a system to combine aspects of more than one of these approaches. For example, the e-rater system – which was developed at Educational Testing Service and deployed in combination with human raters for scoring the writing sample component of the Graduate Management Admissions Test (Burstein et al., Citation2001) – scores quantifiable aspects of a response and also includes a module that employs a similarity-based procedure. Similarly, c-rater – a system also developed by Educational Testing Service for scoring short content-based responses – uses both matching and similarity-based procedures (Leacock & Chodorow, Citation2003).

1.2. Recent advances in NLP technology

The success of the approaches described in the previous section notwithstanding, recent advances in artificial intelligence and natural language processing have provided new tools with the potential to substantially increase the accuracy of automated scoring. With these approaches, sometimes described as large language models, words, phrases, or sentences are represented by vectors in a multidimensional space, referred to as embeddings. These models are much more complex than those used previously (for example, those used with latent semantic analysis); they are trained on massive amounts of text and may have billions of parameters. Unlike one-hot encodings, where each word is represented by a binary vector with a single “1” indicating its presence in a specific position and “0”s elsewhere (e.g., [0,0,0,1,0,…,0,0,0]), embeddings are dense vectors that encode information about the context and meaning of words based on their distributional patterns in the training data (e.g., [0.2,−0.5,0.1,…,0.3,−0.4]). This means that similar words tend to have similar vector representations, allowing models to generalize better across different contexts and tasks. Among other things, the increased complexity allows the systems to interpret words within a context so that a word with multiple meanings (e.g., “river bank” and “savings bank”) is interpreted correctly. For an introduction to these models for a measurement audience, we refer the reader to Yaneva et al. (Citation2023).

BERT (Bidirectional Encoder Representations from Transformers; Devlin et al., Citation2018) is one of the most widely used of these models. The advances represented by BERT raised questions regarding whether similarity-based methods utilizing this type of powerful neural architecture might outperform matching procedures for scoring SAQs. Bexte et al. (Citation2022) examined the performance of BERT for scoring short responses using a publicly available data set. The average quadratic weighted kappa across prompts was .75, suggesting the approach is promising. However, BERT has several disadvantages for scoring tasks. First, although BERT routinely outputs embeddings, it produces one embedding per token, which means the tensors for two given sentences (or phrases) will not necessarily be the same size. This requires the token embeddings to be combined in some way (e.g., mean pooling) so that each response is expressed by a vector of the same size. The results of this pooling are often unsatisfactory (Reimers & Gurevych, Citation2019).
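To illustrate the pooling step, the sketch below (assuming the Hugging Face transformers library and the generic bert-base-uncased checkpoint, neither of which is prescribed by the studies discussed here) averages BERT's token-level embeddings into one fixed-size vector per response.

```python
# Token-level BERT embeddings pooled into one fixed-size vector per response.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint
model = AutoModel.from_pretrained("bert-base-uncased")

responses = ["acute myocardial infarction", "heart attack"]
batch = tokenizer(responses, padding=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state        # (batch, tokens, 768)

# Mean pooling: average the token vectors, ignoring padding positions.
mask = batch["attention_mask"].unsqueeze(-1)                    # (batch, tokens, 1)
response_embeddings = (token_embeddings * mask).sum(1) / mask.sum(1)
print(response_embeddings.shape)                                # torch.Size([2, 768])
```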

One alternative is to use BERT to directly predict whether or not a given pair of responses belongs to the same class (e.g., correct or incorrect). These predictions can then be used to quantify the similarity of responses. This computation is typically done by inputting two sentences (or, in this case, responses) into the BERT model and deriving a joint embedding for the pair. The output is passed to a simple regression function to derive the final label. This label-prediction approach generally outperforms the embedding-based approach described above, but it does so at a cost. First, the approach is computationally inefficient because each new response must be processed jointly with every response it is compared against, which can substantially increase the computational time required. Second, no independent response embeddings are computed. Without embeddings that correspond to individual responses, some potentially useful analyses, such as a principal components analysis, cannot be done.
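The pair-classification approach can be sketched as follows; the checkpoint and the classification head here are untrained placeholders, so the example only shows the structure of the computation (both responses are encoded jointly, so no stand-alone embedding is produced).

```python
# Sketch of pair classification: BERT reads both responses at once and predicts
# whether they belong to the same class. Checkpoint and head are untrained placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

new_response = "acute myocardial infarction"
scored_response = "heart attack"

# Every new response must be re-paired with every already-scored response,
# which is the source of the inefficiency noted above.
pair = tokenizer(new_response, scored_response, return_tensors="pt")
with torch.no_grad():
    logits = model(**pair).logits                   # shape (1, 2)
same_class_probability = torch.softmax(logits, dim=-1)[0, 1].item()
```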

To address these shortcomings, Reimers and Gurevych (Citation2019) introduced Sentence-BERT (S-BERT) as a more efficient embeddings-based solution. S-BERT produces fixed-sized vectors for input sentences (in this case, responses), which can then be efficiently compared using similarity measures such as cosine similarity or Euclidean distance. Sentence embeddings produced by S-BERT have been shown to perform well in tasks related to semantic textual similarity, thanks to the use of a Siamese network architecture. Bexte et al. (Citation2022)Footnote1 evaluated a fine-tuned version of S-BERT, and the results closely approximated those produced with the full BERT model – the average quadratic weighted kappa dropped by only .01.
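With S-BERT, each response maps directly to a fixed-size embedding, and similarity can be computed in a single pass. A minimal sketch, assuming the sentence-transformers library and an arbitrary pretrained checkpoint, is shown below.

```python
# Fixed-size sentence embeddings from S-BERT compared by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # example checkpoint, not the one studied here

scored_responses = ["acute myocardial infarction", "pulmonary embolism"]
new_response = ["heart attack"]

scored_emb = model.encode(scored_responses, convert_to_tensor=True)
new_emb = model.encode(new_response, convert_to_tensor=True)

similarities = util.cos_sim(new_emb, scored_emb)  # (1, 2) matrix of cosine similarities
print(similarities)
```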

More recently, Suen et al. (Citation2023) extended the work of Bexte et al. (Citation2022) by incorporating a comparative learning procedure into a system using S-BERT to improve the similarity matching. They provided a direct comparison of this new system known as ACTA (Analysis of Clinical Text for Assessment) and the INCITE system that uses direct and fuzzy matching and found that ACTA substantially outperformed INCITE. The level of agreement with human judges ranged from .88 to .90 for INCITE and from .97 to .99 for ACTA. This makes a strong case that with current technology, similarity-based procedures may be the preferred approach for scoring short content-based responses. (A more complete description of this system is included in the method section.)

In the present study we apply the ACTA system to score responses collected as part of a high-stakes assessment of internal medicine administered to medical students. The motivation for the data collection was to evaluate the potential to use SAQs in that context, but again, the results reflect on the potential for the currently available NLP models to support the use of SAQs across the spectrum of educational measurement.

2. Method

In this section we describe the test material, examinee sample, scoring procedures, and the measures used to evaluate the performance of the automated-scoring system.

2.1. The Data Set

As we noted, the data for this study were collected as part of operational administration of the NBME Internal Medicine Subject Exam. This examination is usually administered at the end of the test taker’s internal medicine clerkship. Conditions for administration are standardized, and the score from the test typically contributes to the examinee’s clerkship grade or pass/fail decision.

To evaluate the automated scoring system, we used 71 SAQs. Each of these items had been used previously in MCQ format on the United States Medical Licensing Examination (USMLE). These items, which were part of a larger group of items selected for subsequent use on the NBME Internal Medicine Subject Exam, had been identified as appropriate for conversion to the SAQ format. The MCQs were converted to SAQs by removing the option list and appropriately adjusting the lead-in. For example, “Which of the following is the most likely diagnosis?” would be changed to “What is the most likely diagnosis?” (The item stems were unchanged.) Each examinee testing during the data collection period was presented with two of these study items. This resulted in an average of 526 (SD = 20) responses per item.

2.2. Human Scoring

Each of the examinee responses was scored by a group of content experts using the following procedure:

  • Three physicians reviewed and scored a sample of responses for each item. Based on this activity they developed a scoring rubric for each item.

  • Two nurse practitioners and two physician assistants then used the rubrics to independently score the remaining responses. (Each unique response was presented once to each of the four raters.) If all four raters agreed that a response was correct (or incorrect), that judgment was used as the final score. If there was disagreement, the response was presented to the physicians, and they made the final decision.

These human scores provide the standard used both to train and evaluate the automated scoring system. (For additional information about the item set and human scoring, see Mee et al., Citation2023.)

2.3. Automated Scoring

Before the ACTA system can be used to score responses, it must be trained using the human scores just described. Then, each new response i can be compared to each of the already-scored responses in the system’s training set. The response j from the training set that is most similar to response i is identified and response i is assigned the same score as response j.

Identifying the response j from the training set that is most similar to a new response i is straightforward once each response is represented by an embedding – the system simply computes the cosine similarity between each i-j pair and selects the pair with the highest similarity index. However, the success of this process will depend on the quality of the embeddings. For this reason, embeddings were optimized for this particular task by further training – or fine-tuning – S-BERT using the responses in the training set, which have already been scored by content experts. By incorporating these expert scores into the system at the time of training, embeddings are encoded not only with the semantic relationships between responses but also with information on their correctness or incorrectness for a given item.

To fine-tune the model, each expert-scored response for a given item is paired with every other expert-scored response and each pair is assigned a label of 1 if both responses share the same score (i.e., both are correct or both are incorrect) and 0 otherwise. Then, the pairs are passed into S-BERT and the model is trained to minimize a contrastive loss function. This contrastive loss is minimized when responses that share the same label are pushed closer together and responses that have different labels are pushed further apart (Hadsell et al., Citation2006). Here, “closer” and “further” are measured by the cosine distance between the embeddings for each pair of responses. Cosine distance is defined as the complement of cosine similarity:

$$\text{similarity}(e_1, e_2) = \frac{e_1^{T} e_2}{\lVert e_1 \rVert \, \lVert e_2 \rVert}$$
$$\text{distance}(e_1, e_2) = 1 - \text{similarity}(e_1, e_2),$$

where e1 and e2 are the embeddings for a given pair of responses in the training set. The contrastive loss is then defined as the squared distance for pairs of responses with the same label plus, for pairs with different labels, the squared amount (if any) by which the distance falls short of a margin:

$$L(e_1, e_2, Y) = Y \cdot \text{distance}(e_1, e_2)^2 + (1 - Y) \cdot \max\big(0,\, m - \text{distance}(e_1, e_2)\big)^2,$$

where Y is the label (1, if both responses have the same score; 0, if they differ) and m is the margin – a hyperparameter that defines the smallest distance between responses with different scores that is not penalized by the loss function. In other words, the contrastive loss function minimizes the distance between responses with the same score while trying to ensure that responses with different scores are a minimum distance, m, apart.
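A minimal sketch of this fine-tuning step, using the sentence-transformers library's contrastive loss over labeled response pairs, is shown below. The checkpoint, margin, and training settings are placeholders, and the snippet is not the ACTA implementation.

```python
# Fine-tuning S-BERT with a contrastive loss over labeled response pairs (illustrative).
from itertools import combinations
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical expert-scored responses for one item (1 = correct, 0 = incorrect).
responses = [("acute myocardial infarction", 1), ("heart attack", 1), ("stable angina", 0)]

# Pair every response with every other; label 1 if the scores agree, 0 otherwise.
pairs = [InputExample(texts=[text_a, text_b], label=float(score_a == score_b))
         for (text_a, score_a), (text_b, score_b) in combinations(responses, 2)]

model = SentenceTransformer("all-MiniLM-L6-v2")          # example checkpoint
loader = DataLoader(pairs, shuffle=True, batch_size=16)
loss = losses.ContrastiveLoss(model=model, margin=0.5)   # margin plays the role of m above

model.fit(train_objectives=[(loader, loss)], epochs=1, show_progress_bar=False)
```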

Once the model has been fine-tuned using the procedure just described, the scoring process outlined above can be applied: a new response i is passed into the system to generate its embedding, which is then compared to the embedding of each response in the training set for a given item. The new response is assigned the score (i.e., correct or incorrect) of the most similar known response j, as measured by cosine similarity, provided that this similarity exceeds a defined threshold. (The importance of this threshold is discussed below.)
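The scoring step can then be sketched as a nearest-neighbor lookup with a threshold; here `model`, `train_texts`, and `train_scores` are assumed to come from the fine-tuning stage, and the threshold value is simply one of the conditions examined below.

```python
# Nearest-neighbor scoring with a similarity threshold (illustrative sketch).
from sentence_transformers import util

train_emb = model.encode(train_texts, convert_to_tensor=True)  # embeddings of scored responses

def score_response(response, threshold=0.95):
    emb = model.encode([response], convert_to_tensor=True)
    sims = util.cos_sim(emb, train_emb)[0]       # similarity to every scored response
    best = int(sims.argmax())
    if sims[best] < threshold:
        return None                              # unmatched: route to human scoring
    return train_scores[best]                    # score of the most similar response j
```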

2.4. Evaluation of the Automated Scoring System

There are several variables that must be considered in evaluating the usefulness of an automated scoring system. The importance of each of these variables will be affected by the specifics of the assessment context. Some reflect on the validity of score interpretations and others on issues of practicality. The accuracy of the scores is clearly central, but in the context of a system like the one evaluated in this study, a threshold is used to determine if a new response will be matched to a previously scored response. With higher thresholds, the classifications will become increasingly accurate, but this may occur at the expense of leaving more responses unmatched (and, therefore, unscored by the automated system). If these unmatched responses require manual scoring by humans, the efficiency of an automated scoring system can be defined as the proportion of responses that do not require human scoring – i.e., the proportion that can be matched and scored automatically by the scoring system.

In general, the more responses that are human scored prior to training the model (i.e., the larger the training set), the fewer unmatched responses and the greater the accuracy. In this way, the training set represents the amount of information that contributes to accuracy and efficiency whereas the threshold represents the balance of information – whatever its amount – between the competing goals of accuracy and efficiency. Therefore, the size of training set and the threshold level are two related but distinct design considerations. In this study, both are varied as study conditions.

The amount of training data was manipulated by varying the proportion of available responses used to train the system. We trained a model for each of the 71 items multiple times using 20%, 40%, 60%, or 80% of the responses. (We use percentages instead of specific numbers of responses because the number of responses varied across items.) For each of these training conditions, the results were cross validated so that all responses were used to evaluate the scoring system at least once and all responses were used for training at least once. For example, in the case of the 80% condition, the first 20% of the responses would be held out for evaluation and training would be carried out on the remaining 80%. The process would then be repeated with the second 20% of responses being held out; again, training would be carried out using the remaining 80%. This process was repeated until each of the responses had been included in the evaluation sample once and – for the 80% condition – in the training set four times. With respect to the thresholds, these were varied as a study condition and included values of .98, .95, .90, .85, .80, .75, and .65. A value of .95 is typically considered reasonably conservative.
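The rotation described above can be sketched as an ordinary k-fold split; for the 80% training condition this corresponds to five folds, with each 20% block held out once for evaluation. The snippet below is an illustration of that design rather than the code used in the study.

```python
# Five-fold rotation for the 80% training condition (illustrative).
from sklearn.model_selection import KFold

responses = [f"response_{i}" for i in range(500)]       # placeholder responses for one item

for train_idx, eval_idx in KFold(n_splits=5, shuffle=False).split(responses):
    train_set = [responses[i] for i in train_idx]       # ~80% used to fine-tune the model
    eval_set = [responses[i] for i in eval_idx]         # held-out 20% scored and evaluated
    # fine-tune and score here; for the 20% condition the two roles would be reversed
```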

The scoring system was evaluated by assessing its accuracy and efficiency for each condition. Accuracy was measured by the level of agreement between the human scores and the machine scores for each response. To this end, we present two measures: proportion agreement and F1 scores. The former provides an intuitive measure of the extent to which the two scores are equivalent. It was calculated using the following formula:

$$\text{Proportion Agreement} = \frac{\text{True Positives} + \text{True Negatives}}{\text{True Positives} + \text{True Negatives} + \text{False Positives} + \text{False Negatives}}.$$

The latter measure is a commonly used index of accuracy in machine learning (Han et al., Citation2012). It represents the harmonic mean of the precision and recall:

$$F_1 = \frac{2}{\frac{1}{\text{Recall}} + \frac{1}{\text{Precision}}}.$$

The value can also be expressed as

$$F_1 = \frac{\text{True Positives}}{\text{True Positives} + 0.5\,(\text{False Positives} + \text{False Negatives})}.$$

Both measures take on a value of one when there are no false negatives (correct answers not identified as correct) and no false positives (incorrect answers wrongly identified as correct), and a value of zero when no answers are scored correctly. Efficiency was measured by the proportion of unmatched responses for each condition. This was calculated as

$$\text{Proportion Unscored} = \frac{\text{Number of Unscored Responses}}{\text{Number of Unscored Responses} + \text{Number of Scored Responses}}.$$
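For clarity, the three measures can be computed from paired human and machine scores as in the sketch below, where `None` marks a response the system left unmatched; the values shown are invented.

```python
# Computing proportion agreement, F1, and proportion unscored (illustrative values).
from sklearn.metrics import accuracy_score, f1_score

human = [1, 0, 1, 1, 0, 1]
machine = [1, 0, 1, None, 0, 0]      # None = unmatched (unscored) response

matched = [(h, m) for h, m in zip(human, machine) if m is not None]
human_matched, machine_matched = zip(*matched)

proportion_agreement = accuracy_score(human_matched, machine_matched)
f1 = f1_score(human_matched, machine_matched)
proportion_unscored = machine.count(None) / len(machine)
```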

All three of these measures were calculated using all responses for each threshold condition crossed with each of the four training set conditions.

In what follows, we provide aggregate results for the accuracy measures as well as the proportion of unscored responses as a function of the training set size (percentage) and the threshold. This provides a high-level assessment of the performance of the scoring system. We then report the F1 scores and the proportion of unmatched responses at the item level. This provides a sense of how the performance of the scoring system varies across items.

We then compare item difficulty (as measured by the proportion of responses scored as correct) for scores based on the automated system with those based on human judgments. This provides a first look at the extent to which the two score scales can be viewed as interchangeable. Finally, to better understand the unmatched responses, we report the proportion of unmatched responses that were judged to be correct by human judges. This shows the extent to which unmatched responses can be viewed as missing at random.

3. Results

We begin by describing the results for what might be viewed as a typical case, in which the threshold is set at a common – if somewhat conservative – .95 and the training set is moderate (around 200 responses; 40%). With these conditions, the proportion agreement calculated based on all responses to all 71 items and the equivalent F1 score are both .98, and the proportion of unmatched responses is between .02 and .03. Using a smaller training set (approximately 100 responses; 20%) increases the proportion of unmatched responses to a value closer to .07. Using a threshold of .65 (with a moderate training set) would eliminate unmatched responses but decrease the mean proportion agreement and F1 scores to .97. In what follows, we provide more detailed comparisons across the studied conditions.

Figure 1 presents the proportion agreement as a function of the threshold and training set size. Figure 2 presents the analogous results for F1 scores. The results indicate that for both measures we see a decrease in accuracy as the threshold is reduced. Increasing the size of the training set increases accuracy. Equal increases in the size of the training set lead to progressively smaller increases in accuracy. Figures 1 and 2 also suggest that there is an interaction between the threshold and the training set size. The change in accuracy as a function of training set size decreases as the threshold increases. Another way of describing the same phenomenon is to say that with a high threshold the accuracy is relatively high even with small training sets.

Figure 1. Proportion agreement as a function of threshold and training set size aggregated across the 71 items.

Figure 2. F1 score as a function of threshold and training set size aggregated across the 71 items.

Figure 3 presents the proportion of unmatched responses as a function of the threshold and the size of the training set. Again, the pattern is straightforward. As the threshold is decreased, the percentage of unmatched responses decreases. As the training set is increased, the percentage of unmatched responses decreases; however, there is relatively little change after the training set reaches 40% of the total sample (i.e., approximately 200 responses). Note that although the size of the training set had relatively little impact on accuracy with a high threshold, the impact of the size of the training set on the proportion of unmatched responses is greatest when the threshold is high (thereby increasing the number of unmatched responses that could potentially benefit from a richer training set).

Figure 3. Proportion unmatched as a function of threshold and training set size aggregated across the 71 items.

In Figures 1–3, we have presented aggregate results across all responses for the 71 studied items. These results reflect on the overall performance of the system; however, examining results at the item level may also be useful, particularly if the accuracy or proportion of unmatched responses is substantially different across items. Figure 4 presents the F1 scores for each of the 71 items using a moderate training set (approximately 200 responses; 40%) and a threshold of .95. The items have been ordered from the highest to the lowest F1 score. (An analogous graph for proportion agreement is available from the first author but is not presented here to save space.) The figure shows that the level of agreement is reasonably stable across items. Note that to make the differences in F1 scores more apparent, the plotted range was limited to .85–1.0; however, item 71 had an F1 score of zero, which falls outside of this range. The poor performance of the scoring system on this item is likely the result of the fact that the item was extremely difficult. As a result, the training set included only four correct responses.Footnote2

Figure 4. F1 scores for each of the 71 items.

Figure 5 presents the percentage of unmatched responses for each of the items, again based on a moderate training set and a threshold of .95. The ordering is the same as that used in Figure 4 to allow for comparison. The results show a range in the proportion of responses that are unmatched across items but do not show any extreme outliers. The relationship between accuracy and the proportion of responses that are unmatched is weak (the correlation between the two values is −.10).

Figure 5. Proportion of responses unmatched for each of the 71 items.

The results that have already been presented make it clear that there is a close correspondence between the human scores and the scores from the automated system. Nonetheless, there are differences, and it is important to consider whether scores differ systematically or randomly. To begin to answer that question, Figure 6 provides a comparison of the proportion of responses scored as correct for each of the items based on human scores with the proportion based on the automated scoring system. Again, the results are for a moderate training set and a threshold of .95. There is little evidence of a systematic difference in proportion correct based on the source of the scores. The mean proportion correct for the human scores shown in Figure 6 is .635 (SD = .236); for the scores produced by the automated system, the mean is .640 (SD = .238).

Figure 6. Scatter plot of the proportion correct values for the 71 items based on human scorers and the automated scoring system.

Table 1 provides a more focused look at the responses that were unmatched by the automated system. Specifically, the table shows the proportion of those responses that were scored as correct by human judges. The results suggest that unmatched responses are more likely to be incorrect than correct. For the majority of the items that had unmatched responses, all of those responses were judged to be incorrect. Across all matched responses, the proportion of responses the human scorers judged to be correct was .698. Across all unmatched responses, that proportion was .365.

Table 1. Number of items with proportions of unscored responses that were scored as correct falling into various ranges.

4. Discussion

The results reported in Figures 1, 2, and 4 show that reasonably high levels of agreement between human judgments and automated scoring systems can be achieved using the ACTA system. This level of agreement certainly supports operational use of such systems in some large-scale testing applications. This level of agreement may also be approaching the maximum possible, given that the automated scores are modeled on less-than-perfect human scores. Achieving the highest levels of agreement reported in this study comes at the cost of larger training sets, higher thresholds, or both. As Figure 3 shows, raising the threshold also increases the number of unmatched responses. These results make it clear that optimizing an automated scoring system for a specific application will require a number of value judgments. These will relate to tolerance for error, the trade-off between error and the cost of the increased involvement of human judges, and the importance of immediate feedback (which will be impossible if human scoring is a part of the ongoing scoring procedure).

In a setting such as a self-assessment designed to predict the test taker’s performance on a subsequent high-stakes test, the expense and delay in score reporting associated with using human judges may significantly outweigh the loss in accuracy associated with lowering the threshold used for matching responses. Based on the results reported in this study, it is possible to have a proportion agreement in excess of .97 with no unmatched responses – allowing for efficient and timely score reporting. By contrast, it is possible to produce a proportion agreement that approaches .99 by raising the threshold and using larger training sets. These steps will increase the expense and may delay score reporting, but this trade-off may be preferred for high-stakes tests.

In addition to evaluating the overall performance of the automated scoring system, we also examined the variability in performance across items. Figure 4 shows that in addition to one item for which the automated scoring system was essentially useless, there exists a range in performance across items. For practical purposes, that single outlier can be ignored. The item was so difficult that it would not be included on operational test forms regardless of the performance of the automated scoring system. (It is worth noting that the poor performance of that item modestly lowered the aggregate accuracy measures presented in Figures 1 and 2.) That one problematic item aside, the range of F1 scores, although relatively narrow, is still wide enough to suggest that when scoring accuracy is the primary concern, but delay in score reporting is unacceptable, test administrators may need to specify a level of accuracy required for an item to be eligible for use in test construction.

In general, when a high level of accuracy is needed, a hybrid system will still be required. Using such a system requires that the scores produced by the automated system and those produced by human judges are interchangeable in the sense that they are on the same scale. This question warrants more study, but Figure 6 and the associated results suggest that the item difficulties are essentially unchanged across scoring systems and that any differences that do exist are random rather than systematic. This result supports the use of a hybrid system. It also supports the interchangeability of scores during a transition from human scoring to an automated or hybrid scoring system.

In situations in which scoring accuracy is at a premium, it is likely that some responses will be unmatched. It may be tempting to think that scores could be reported for test takers with unmatched responses by deleting those items from scoring for those specific test takers (i.e., treating them as not administered or not reached). Unfortunately, Table 1 strongly suggests that unmatched responses cannot be treated as missing at random. Although the proportion of unmatched responses that humans judged to be correct varies across items, for more than half of the items that had unmatched responses, all of those responses were incorrect. Dropping unmatched responses from scoring would bias the scores upwards because unmatched responses are much more likely to be incorrect than are responses selected at random. Further, although Table 1 suggests that unmatched responses are likely to be incorrect, simply treating all unmatched responses as incorrect is also inappropriate because a small but nontrivial proportion of those responses are correct.

Because MCQs can be scored with near perfect accuracy, a test administrator considering whether to incorporate SAQs in a test that had previously comprised only MCQs will need to decide whether the modest level of error associated with an automated scoring system can be tolerated. That may well be based on a willingness to reduce the reliability of scoring to better assess the construct of interest. Our findings suggest that such a trade-off is worth considering but provide no way to directly quantify that trade-off. If, alternatively, the test developer is considering whether to replace human scoring of SAQs with an automated system, the comparison is more straightforward. Human scoring will carry with it a certain level of error. When a modest number of responses are to be scored – such as the sample used for training and evaluation in this study – exceptional effort can be made to produce highly accurate scores such as scoring each response multiple times (four times in this study) and referring questionable responses to a group of expert judges. In operational settings, simpler, more efficient scoring procedures are the norm and these human scoring systems will include error. For example, Leacock and Chodorow (Citation2003) reported an average percent agreement between human raters of 90.8% in a study using NAEP items. In most operational settings a direct comparison will allow for quantifying the impact that adoption of an automated system would have on scoring accuracy.Footnote3

With the accuracy of automated systems approaching (or exceeding) the accuracy of human judges, attention must be given to the practicality of these systems. It is clear that once a system like the one described in this paper is trained, ongoing scoring of test-taker responses will have a low per score cost relative to the expense of having each response reviewed by human judges. This will be true even if a modest number of responses are unmatched by the system and must be individually scored by humans. This long-term efficiency comes at the cost of significant up-front work to create a scoring model for each item. In the case of the present study a significant cost was associated with pretesting the items to collect the responses that were used to train the system. Additional cost was associated with human scoring of each unique response from this sample. Of course, if pretesting on a similar scale is a routine part of test development for a given assessment, this will not be an additional expense. Similarly, if the alternative to automated scoring is human scoring for all responses, the cost of developing a training set may not represent an additional expense. Either way, there will also be a one-time expense associated with developing the system. Regardless of the specific costs, this type of system will only be of practical value for large-scale testing, where up-front costs can be recovered with routine savings on operational scoring.

Given that the performance of the automated system used in this research demonstrates the feasibility of scoring SAQs with either a fully automated or hybrid system, some cautions are appropriate. ACTA performed well in scoring a large sample of actual test-taker responses. There are, however, aspects of the system that have not been evaluated. Based on research on previous automated text scoring systems, it will be important to ask whether there are credible strategies for gaming this system. It is likely impossible to be truly exhaustive in evaluating potential strategies, but it seems important at a minimum to challenge the system with strategies such as responding with words from the item stem or entering more than one response. (Research in this area is currently underway.)

A second area that requires more research is the potential for bias in the system. There has been considerable concern that bias in the training sets used for large language models could lead to bias in subsequent applications of those models. Again, a more complete study of bias will require larger data sets providing significant representation of the groups that may be impacted by bias. Nonetheless, preliminary analysis of the potential for bias is warranted. These issues are beyond the scope of the current study, but they are critical steps required prior to adopting this type of scoring system.

As we noted previously, this paper is intended as a progress report on automated scoring of SAQs. We believe that we have documented a level of progress – based on the introduction of large language models – that makes it practical to consider the use of automated scoring across a range of large-scale testing applications. The specific context for our study was a set of items appropriate for third-year medical students, but we see no reason to assume that the approach used in this research would be less useful for other applications such as K-12 assessment. Nonetheless, we note that the results reported in this study should be considered preliminary. This paper reports results from a single sample of 71 items. Although a subsequent unpublished study shows similar results for an independent sample of items and test takers, to this point evaluation of the system has been based exclusively on test material designed for use in medical education. The usefulness of large language models for scoring other test content has yet to be empirically demonstrated.

Disclosure Statement

No potential conflict of interest was reported by the author(s).

Notes

1 We refer the reader to the Bexte et al. (Citation2022) paper for a detailed description of how the S-BERT model was originally applied in the context of SAQ scoring.

2 A broader comparison of the F1 scores to difficulty (proportion correct) for the same items shows a generally moderate correlation of .33. However, at the extremes, item difficulty does appear to be associated with F1 scores. If we consider the 10% of the items with the lowest F1 scores, five of those seven items are in common with the 10% of the items that are most difficult. (Note that the item difficulty is defined as the proportion correct based on the human scoring. This eliminates the possibility that the relatively poor performance of ACTA resulted in additional responses being scored as incorrect.)

3 The human ratings used in this study were collected using a design that was intended to maximize the accuracy of the final judgment. That design does not lend itself to estimating the level of accuracy that would be expected under operational scoring conditions. Nonetheless, four judges independently applied the scoring rubrics to a substantial, but nonrandom, subset of responses for each item. They did, however, evaluate only unique responses, so extrapolating the pairwise agreement for these judges requires the assumption that they would be perfectly internally consistent when evaluating the same response on different occasions. With these important caveats in mind, we note that the level of agreement between pairs would have been .95 for unique responses and .98 across all responses.

References

  • Bennett, R. E., & Sebrechts, M. M. (1996). The accuracy of expert-system diagnoses of mathematical problem solutions. Applied Measurement in Education, 9(2), 133–150. https://doi.org/10.1207/s15324818ame0902_3
  • Bexte, M., Horbach, A., & Zesch, T. (2022). Similarity-based content scoring-how to make S-BERT keep up with BERT. In Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022) (pp. 118–123). Seattle, WA.
  • Braun, H. I., Bennett, R. E., Frye, D., & Soloway, E. (1990). Scoring constructed responses using expert systems. Journal of Educational Measurement, 27(2), 93–108. https://doi.org/10.1111/j.1745-3984.1990.tb00736.x
  • Burstein, J., Leacock, C., & Swartz, R. (2001). Automated evaluation of essays and short answers. https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=4d7766c71bc1d03dc5b53511cac4ca947b017034
  • Cook, R., Baldwin, S., & Clauser, B. (2012, April). An NLP-based approach to automated scoring of the USMLE® Step 2 CS patient note. Paper presented at the annual meeting of the National Council on Measurement in Education, Vancouver, BC.
  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv. https://doi.org/10.48550/arXiv.1810.04805
  • Fiske, E. B. (1990, January 31). But is the child learning? Schools trying new tests. The New York Times, Al, B6.
  • Frederiksen, J. R., & Collins, A. (1989). A systems approach to educational testing. Educational Researcher, 18(9), 27–32. https://doi.org/10.2307/1176716
  • Guthrie, J. T. (1984). Testing higher level skills. Journal of Reading, 28(2), 188–190. http://www.jstor.org/stable/40029442
  • Hadsell, R., Chopra, S., & LeCun, Y. (2006, June). Dimensionality reduction by learning an invariant mapping. 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), New York, NY (Vol. 2, pp. 1735–1742). IEEE.
  • Han, J. W., Kamber, M., & Pei, J. (2012). Data mining concepts and techniques (3rd ed.). Morgan Kaufmann Publishers.
  • Harik, P., Mee, J., Runyon, C., & Clauser, B. E. (2023). Assessment of clinical skills: A case study in constructing an nlp-based scoring system for patient notes. In V. Yaneva & M. von Davier (Eds.), Advancing natural language processing in educational assessment (pp. 58–73). NCME Educational Measurement and Assessment Book Series. Taylor & Francis.
  • International Association for the Evaluation of Educational Achievement (IEA). (2013). Released science items for TIMSS USA, grade 8. Third International Mathematics and Science Study. https://nces.ed.gov/timss/pdf/TIMSS2011_G8_Science.pdf
  • Jani, K. H., Jones, K. A., Jones, G. W., Amiel, J. B., Elhadad, N., & Elhadad, N. (2020). Machine learning to extract communication and history-taking skills in OSCE transcripts. Medical Education, 54(12), 1159–1170. https://doi.org/10.1111/medu.14347
  • Kelly, F. J. (1916). The Kansas silent reading tests. Journal of Educational Psychology, 7(2), 63–80. https://doi.org/10.1037/h0073542
  • Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25(2–3), 259–284. https://doi.org/10.1080/01638539809545028
  • Landauer, T. K., Laham, D., & Foltz, P. W. (2000). The intelligent essay assessor. IEEE Intelligent Systems, 15(5), 27–31.
  • LaVoie, N., Parker, J., Legree, P. J., Ardison, S., & Kilcullen, R. N. (2020). Using latent semantic analysis to score short answer constructed responses: Automated scoring of the consequences test. Educational and Psychological Measurement, 80(2), 399–414. https://doi.org/10.1177/0013164419860575
  • Leacock, C., & Chodorow, M. (2003). C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 37(4), 389–405. https://doi.org/10.1023/A:1025779619903
  • Liu, O. L., Brew, C., Blackmore, J., Gerard, L., Madhok, J., & Linn, M. C. (2014). Automated scoring of constructed-response science items: Prospects and obstacles. Educational Measurement Issues & Practice, 33(2), 19–28. https://doi.org/10.1111/emip.12028
  • Martinez, M. E., & Bennett, R. E. (1992). A review of automatically scorable constructed-response item types for large-scale assessment. Applied Measurement in Education, 5(2), 151. https://doi.org/10.1207/s15324818ame0502_5
  • Medical Council of Canada. (2023). Medical Council of Canada qualifying examination, part I. https://mcc.ca/examinations/mccqe-part-i/
  • Mee, J., Pandian, R., Wolczynski, J., Morales, A., Paniagua, M., Harik, P., Baldwin, P., & Clauser, B. E. (2023). An experimental comparison of multiple-choice and short-answer questions on a high-stakes test for medical students. Advances in Health Sciences Education, 29(3), 783–801. https://doi.org/10.1007/s10459-023-10266-3
  • National Board of Osteopathic Medical Examiners. (2023). Examination format. https://www.nbome.org/assessments/comlex-usa/comlex-usa-level-3/exam-format/
  • Nickerson, R. S. (1989). New directions in educational assessment. Educational Researcher, 18(9), 3–7. https://doi.org/10.2307/1176712
  • Norman, G. R., Smith, E. K. M., Powles, A. C. P., Rooney, P. J., Henry, N. L., & Dodd, P. E. (1987). Factors underlying performance on written tests of knowledge. Medical Education, 21(4), 297–304. https://doi.org/10.1111/j.1365-2923.1987.tb00367.x
  • Page, E. B. (1966). Grading essays by computer: Progress report. Notes from the 1966 Invitational Conference on Testing Problems, Educational Testing Service, Princeton, NJ.
  • Page, E. B. (1967). The imminence of grading essays by computer. Phi Delta Kappan, 47(5), 238–243.
  • Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.
  • Sam, A. H., Field, S. M., Collares, C. F., van der Vleuten, C. P., Wass, V. J., Melville, C., Harris, J., & Meeran, K. (2018). Very-short-answer questions: Reliability, discrimination and acceptability. Medical Education, 52(4), 447–455. https://doi.org/10.1111/medu.13504
  • Sam, A. H., Hameed, S., Harris, J., & Meeran, K. (2016). Validity of very short answer versus single best answer questions for undergraduate assessment. BMC Medical Education, 16(1), 1–4. https://doi.org/10.1186/s12909-016-0793-z
  • Sarker, A. D., Klein, A. Z., Mee, J., Harik, P., & Gonzalez-Hernandez, G. (2019). An interpretable natural language processing system for written medical examination assessment. Journal of Biomedical Informatics, 98, 103268. https://doi.org/10.1016/j.jbi.2019.103268
  • Schuwirth, L. W. T., van der Vleuten, C. P. M., & Donkers, H. H. L. M. (1996). A closer look at cueing effects in multiple choice questions. Medical Education, 30(1), 44–49. https://doi.org/10.1111/j.1365-2923.1996.tb00716.x
  • Schuwirth, L. W., & van der Vleuten, C. P. (2004). Different written assessment methods: What can be said about their strengths and weaknesses? Medical Education, 38(9), 974–979.
  • Streeter, L., Bernstein, J., Foltz, P., & DeLand, D. (2011). Pearson’s automated scoring of writing, speaking, and mathematics ( White Paper). http://kt.pearsonassessments.com/download/PearsonAutomatedScoring-WritingSpeakingMath-051911.pdf
  • Suen, K. Y., Yaneva, V., Ha, L. A., Mee, J., Zhou, Y., & Harik, P. (2023). ACTA: Short-answer grading in high-stakes medical exams. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), Toronto, Canada.
  • Sukkarieh, J. Z., & Blackmore, J. (2009). C-rater: Automatic content scoring for short constructed responses. Proceedings of the Twenty-Second International FLAIRS Conference, Association for the Advancement of Artificial Intelligence, Sanibel Island, Florida (pp. 290–295).
  • Swygert, K., Margolis, M., King, A., Siftar, T., Clyman, S., Hawkins, R., & Clauser, B. (2003). Evaluation of an automated procedure for scoring patient notes as part of a clinical skills examination. Academic Medicine, 78(10), S75–S77. https://doi.org/10.1097/00001888-200310001-00024
  • U.S. Department of Education. (2022). NAEP report card: Reading, sample questions. Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP). https://www.nationsreportcard.gov/reading/sample-questions/?grade=4
  • Willis, A. (2015). Using NLP to support scalable assessment of short free text responses. Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, Denver, Colorado (pp. 243–253). Association for Computational Linguistics.
  • Yamamoto, K., He, Q., Shin, H. J., & von Davier, M. (2017). Developing a machine-supported coding system for constructed-response items in PISA. ETS RR–17-47. Educational Testing Service.
  • Yaneva, V., Baldwin, P., Ha, L. A., & Runyon, C. (2023). Extracting linguistic signal from item text and its application to modeling item characteristics. In Yaneva & von Davier (Eds.), Advancing natural language processing in educational assessment (pp. 167–182). Routledge.