MedRECT: A Medical Reasoning Benchmark for Error Correction in Clinical Texts
2511.00421v1
cs.CL, cs.AI, cs.LG
2025-11-06
Авторы:
Naoto Iwase, Hiroki Okuyama, Junichiro Iwasawa
Abstract
Large language models (LLMs) show increasing promise in medical applications,
but their ability to detect and correct errors in clinical texts -- a
prerequisite for safe deployment -- remains under-evaluated, particularly
beyond English. We introduce MedRECT, a cross-lingual benchmark
(Japanese/English) that formulates medical error handling as three subtasks:
error detection, error localization (sentence extraction), and error
correction. MedRECT is built with a scalable, automated pipeline from the
Japanese Medical Licensing Examinations (JMLE) and a curated English
counterpart, yielding MedRECT-ja (663 texts) and MedRECT-en (458 texts) with
comparable error/no-error balance. We evaluate 9 contemporary LLMs spanning
proprietary, open-weight, and reasoning families. Key findings: (i) reasoning
models substantially outperform standard architectures, with up to 13.5%
relative improvement in error detection and 51.0% in sentence extraction; (ii)
cross-lingual evaluation reveals 5-10% performance gaps from English to
Japanese, with smaller disparities for reasoning models; (iii) targeted LoRA
fine-tuning yields asymmetric improvements in error correction performance
(Japanese: +0.078, English: +0.168) while preserving reasoning capabilities;
and (iv) our fine-tuned model exceeds human expert performance on structured
medical error correction tasks. To our knowledge, MedRECT is the first
comprehensive cross-lingual benchmark for medical error correction, providing a
reproducible framework and resources for developing safer medical LLMs across
languages.
Ссылки и действия
Дополнительные ресурсы: