CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field
2511.03441v2
cs.CL, cs.AI
2025-11-07
Авторы:
Doria Bonzi, Alexandre Guiggi, Frédéric Béchet, Carlos Ramisch, Benoit Favre
Abstract
Critical appraisal of scientific literature is an essential skill in the
biomedical field. While large language models (LLMs) can offer promising
support in this task, their reliability remains limited, particularly for
critical reasoning in specialized domains. We introduce CareMedEval, an
original dataset designed to evaluate LLMs on biomedical critical appraisal and
reasoning tasks. Derived from authentic exams taken by French medical students,
the dataset contains 534 questions based on 37 scientific articles. Unlike
existing benchmarks, CareMedEval explicitly evaluates critical reading and
reasoning grounded in scientific papers. Benchmarking state-of-the-art
generalist and biomedical-specialized LLMs under various context conditions
reveals the difficulty of the task: open and commercial models fail to exceed
an Exact Match Rate of 0.5 even though generating intermediate reasoning tokens
considerably improves the results. Yet, models remain challenged especially on
questions about study limitations and statistical analysis. CareMedEval
provides a challenging benchmark for grounded reasoning, exposing current LLM
limitations and paving the way for future development of automated support for
critical appraisal.
Ссылки и действия
Дополнительные ресурсы: