MedCLM: Learning to Localize and Reason via a CoT-Curriculum in Medical Vision-Language Models
2510.04477v1
cs.CV, cs.AI, cs.CL, cs.LG
2025-10-08
Авторы:
Soo Yong Kim, Suin Cho, Vincent-Daniel Yun, Gyeongyeon Hwang
Abstract
Bridging clinical diagnostic reasoning with AI remains a central challenge in
medical imaging. We introduce MedCLM, an automated pipeline that converts
detection datasets into large-scale medical visual question answering (VQA)
data with Chain-of-Thought (CoT) reasoning by linking lesion boxes to organ
segmentation and structured rationales. These contextual signals enable medical
vision-language models to generate question-answer pairs with step-by-step
reasoning. To utilize this data effectively, we propose an Integrated
CoT-Curriculum Strategy composed of an Easy stage with explicit lesion boxes
for visual grounding, a Medium stage that encourages implicit localization, and
a Hard stage for weakly supervised reasoning. Experimental results demonstrate
that MedCLM attains state-of-the-art performance on several medical VQA
benchmarks, providing a scalable framework for developing clinically aligned
medical vision-language models.