Can Small and Reasoning Large Language Models Score Journal Articles for Research Quality and Do Averaging and Few-shot Help?
2510.22389v1
cs.DL, cs.AI
2025-10-29
Authors:
Mike Thelwall, Ehsan Mohammadi
Abstract
Assessing published academic journal articles is a common task for
evaluations of departments and individuals. Whilst it is sometimes supported by
citation data, Large Language Models (LLMs) may give more useful indications of
article quality. Evidence of this capability exists for two of the largest LLM
families, ChatGPT and Gemini, and the medium sized LLM Gemma3 27b, but it is
unclear whether smaller LLMs and reasoning models have similar abilities. This
is important because larger models may be slow and impractical in some
situations, and reasoning models may perform differently. Four relevant
questions are addressed with Gemma3 variants, Llama4 Scout, Qwen3, Magistral
Small and DeepSeek R1, on a dataset of 2,780 medical, health and life science
papers in 6 fields, with two different gold standards, one novel. The results
suggest that smaller (open weights) and reasoning LLMs have similar performance
to ChatGPT 4o-mini and Gemini 2.0 Flash, but that 1b parameters may often, and
4b sometimes, be too few. Moreover, averaging scores from multiple identical
queries seems to be a universally successful strategy, and few-shot prompts
(four examples) tended to help but the evidence was equivocal. Reasoning models
did not have a clear advantage. Overall, the results show, for the first time,
that smaller LLMs with more than 4b parameters, including reasoning models, have a substantial
capability to score journal articles for research quality, especially if score
averaging is used.
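
The score-averaging strategy mentioned in the abstract can be illustrated with a minimal sketch. It assumes each article has already been scored several times with the same prompt. The function name `average_scores`, the article IDs, and the score values are hypothetical and used only for illustration; they are not taken from the paper.

```python
from statistics import mean


def average_scores(repeated_scores):
    """Map each article ID to the mean of its repeated LLM scores.

    `repeated_scores` is a dict of article ID -> list of numeric scores,
    one score per identical query. Articles with no scores are skipped.
    """
    return {
        article_id: mean(scores)
        for article_id, scores in repeated_scores.items()
        if scores
    }


# Hypothetical scores from five identical queries per article
# (illustrative values only, not data from the paper).
repeated_scores = {
    "article_a": [3, 3, 4, 3, 3],
    "article_b": [2, 3, 2, 2, 3],
}

averaged = average_scores(repeated_scores)
for article_id, score in averaged.items():
    print(f"{article_id}: {score:.2f}")
```

The averaged value is fractional even when each individual score is an integer, so it can separate articles more finely than any single query. This may be one reason averaging helps, although the abstract does not state the mechanism.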