Can Small and Reasoning Large Language Models Score Journal Articles for Research Quality and Do Averaging and Few-shot Help?

2510.22389v1 cs.DL, cs.AI 2025-10-29

Authors:

Mike Thelwall, Ehsan Mohammadi

Abstract

Assessing published academic journal articles is a common task for evaluations of departments and individuals. Whilst it is sometimes supported by citation data, Large Language Models (LLMs) may give more useful indications of article quality. Evidence of this capability exists for two of the largest LLM families, ChatGPT and Gemini, and the medium sized LLM Gemma3 27b, but it is unclear whether smaller LLMs and reasoning models have similar abilities. This is important because larger models may be slow and impractical in some situations, and reasoning models may perform differently. Four relevant questions are addressed with Gemma3 variants, Llama4 Scout, Qwen3, Magistral Small and DeepSeek R1, on a dataset of 2,780 medical, health and life science papers in 6 fields, with two different gold standards, one novel. The results suggest that smaller (open weights) and reasoning LLMs have similar performance to ChatGPT 4o-mini and Gemini 2.0 Flash, but that 1b parameters may often, and 4b sometimes, be too few. Moreover, averaging scores from multiple identical queries seems to be a universally successful strategy, and few-shot prompts (four examples) tended to help but the evidence was equivocal. Reasoning models did not have a clear advantage. Overall, the results show, for the first time, that smaller LLMs >4b, including reasoning models, have a substantial capability to score journal articles for research quality, especially if score averaging is used.
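The score-averaging strategy mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `average_score` and `make_stub_scorer`, the default of five repeats, and the example prompt text are all assumptions for illustration, and the stub returns fixed numbers in place of a real LLM call.

```python
from statistics import mean
from typing import Callable, Sequence


def average_score(score_once: Callable[[str], float], prompt: str, repeats: int = 5) -> float:
    """Send the same prompt `repeats` times and return the mean of the scores.

    `score_once` is a hypothetical callable that submits one query and
    returns one numeric quality score.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    scores = [score_once(prompt) for _ in range(repeats)]
    return mean(scores)


def make_stub_scorer(replies: Sequence[float]) -> Callable[[str], float]:
    """Hypothetical stand-in for an LLM: cycles through fixed replies."""
    state = {"calls": 0}

    def score_once(prompt: str) -> float:
        value = replies[state["calls"] % len(replies)]
        state["calls"] += 1
        return value

    return score_once


if __name__ == "__main__":
    # Five identical queries with varying (made-up) replies.
    scorer = make_stub_scorer([3.0, 2.0, 3.0, 4.0, 3.0])
    print(average_score(scorer, "Score this article for research quality.", repeats=5))
```

In practice `score_once` would wrap a call to whichever model is being tested; averaging smooths out the variation between individual replies to the same prompt.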


