Comparative Evaluation of Generative AI Models for Chest Radiograph Report Generation in the Emergency Department

2512.00271v1 eess.IV, cs.AI, cs.LG 2025-12-02
Authors:

Woo Hyeon Lim, Ji Young Lee, Jong Hyuk Lee, Saehoon Kim, Hyungjin Kim

Abstract

Purpose: To benchmark open-source or commercial medical image-specific vision-language models (VLMs) against real-world radiologist-written reports.

Methods: This retrospective study included adult patients who presented to the emergency department between January 2022 and April 2025 and underwent same-day chest radiography (CXR) and CT for febrile or respiratory symptoms. Reports from five VLMs (AIRead, Lingshu, MAIRA-2, MedGemma, and MedVersa) and radiologist-written reports were presented in random order and blindly evaluated by three thoracic radiologists using four criteria: RADPEER, clinical acceptability, hallucination, and language clarity. Comparative performance was assessed using generalized linear mixed models, with radiologist-written reports treated as the reference. Finding-level analyses were also performed with CT as the reference.

Results: A total of 478 patients (median age, 67 years [interquartile range, 50-78]; 282 men [59.0%]) were included. AIRead demonstrated the lowest RADPEER 3b rate (5.3% [76/1434] vs. radiologists 13.9% [200/1434]; P<.001), whereas other VLMs showed higher disagreement rates (16.8-43.0%; P<.05). Clinical acceptability was highest with AIRead (84.5% [1212/1434] vs. radiologists 74.3% [1065/1434]; P<.001), while other VLMs performed worse (41.1-71.4%; P<.05). Hallucinations were rare with AIRead and comparable to radiologists (0.3% [4/1425] vs. 0.1% [1/1425]; P=.21), but frequent with other models (5.4-17.4%; P<.05). Language clarity was higher with AIRead (82.9% [1189/1434]), Lingshu (88.0% [1262/1434]), and MedVersa (88.4% [1268/1434]) than with radiologists (78.1% [1120/1434]; P<.05). Sensitivity for the common findings varied substantially across VLMs: AIRead, 15.5-86.7%; Lingshu, 2.4-86.7%; MAIRA-2, 6.0-72.0%; MedGemma, 4.8-76.7%; and MedVersa, 20.2-69.3%.

Conclusion: Medical VLMs for CXR report generation exhibited variable performance in report quality and diagnostic measures.
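The percentages in the Results can be checked directly against the bracketed counts. The following is a minimal sketch of that arithmetic, not part of the study's analysis: the dictionary name `REPORTED` and the helper `percent` are illustrative, and the figures are copied from the abstract. It also checks that the denominator 1434 equals 478 patients times three readers. The abstract does not explain the smaller denominator of 1425 used for hallucination, so the sketch takes it as given.

```python
# Counts copied from the abstract: (numerator, denominator, reported percent).
# The names below are illustrative and do not come from the paper.
REPORTED = {
    "men among patients": (282, 478, 59.0),
    "AIRead RADPEER 3b": (76, 1434, 5.3),
    "radiologist RADPEER 3b": (200, 1434, 13.9),
    "AIRead clinical acceptability": (1212, 1434, 84.5),
    "radiologist clinical acceptability": (1065, 1434, 74.3),
    "AIRead hallucination": (4, 1425, 0.3),
    "radiologist hallucination": (1, 1425, 0.1),
    "AIRead language clarity": (1189, 1434, 82.9),
    "Lingshu language clarity": (1262, 1434, 88.0),
    "MedVersa language clarity": (1268, 1434, 88.4),
    "radiologist language clarity": (1120, 1434, 78.1),
}

PATIENTS = 478
READERS = 3  # three thoracic radiologists rated each report


def percent(numerator: int, denominator: int) -> float:
    """Return the proportion as a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Collect every entry whose recomputed percentage differs from the reported one.
mismatches = {
    name: percent(num, den)
    for name, (num, den, reported) in REPORTED.items()
    if percent(num, den) != reported
}

ratings_per_report_type = PATIENTS * READERS

if __name__ == "__main__":
    print("ratings per report type:", ratings_per_report_type)
    print("mismatches:", mismatches)
```

An empty `mismatches` dictionary means every reported percentage agrees with its counts at one decimal place.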
