Comparison of Scoring Rationales Between Large Language Models and Human Raters
2509.23412v1
cs.CL, cs.LG
2025-10-01
Авторы:
Haowei Hua, Hong Jiao, Dan Song
Abstract
Advances in automated scoring are closely aligned with advances in
machine-learning and natural-language-processing techniques. With recent
progress in large language models (LLMs), the use of ChatGPT, Gemini, Claude,
and other generative-AI chatbots for automated scoring has been explored. Given
their strong reasoning capabilities, LLMs can also produce rationales to
support the scores they assign. Thus, evaluating the rationales provided by
both human and LLM raters can help improve the understanding of the reasoning
that each type of rater applies when assigning a score. This study investigates
the rationales of human and LLM raters to identify potential causes of scoring
inconsistency. Using essays from a large-scale test, the scoring accuracy of
GPT-4o, Gemini, and other LLMs is examined based on quadratic weighted kappa
and normalized mutual information. Cosine similarity is used to evaluate the
similarity of the rationales provided. In addition, clustering patterns in
rationales are explored using principal component analysis based on the
embeddings of the rationales. The findings of this study provide insights into
the accuracy and ``thinking'' of LLMs in automated scoring, helping to improve
the understanding of the rationales behind both human scoring and LLM-based
automated scoring.
Ссылки и действия
Дополнительные ресурсы: