MIMIC-SR-ICD11: A Dataset for Narrative-Based Diagnosis
2511.05485v1
cs.CL, I.2.7; I.5.1
2025-11-11
Авторы:
Yuexin Wu, Shiqi Wang, Vasile Rus
Abstract
Disease diagnosis is a central pillar of modern healthcare, enabling early
detection and timely intervention for acute conditions while guiding lifestyle
adjustments and medication regimens to prevent or slow chronic disease.
Self-reports preserve clinically salient signals that templated electronic
health record (EHR) documentation often attenuates or omits, especially subtle
but consequential details. To operationalize this shift, we introduce
MIMIC-SR-ICD11, a large English diagnostic dataset built from EHR discharge
notes and natively aligned to WHO ICD-11 terminology. We further present
LL-Rank, a likelihood-based re-ranking framework that computes a
length-normalized joint likelihood of each label given the clinical report
context and subtracts the corresponding report-free prior likelihood for that
label. Across seven model backbones, LL-Rank consistently outperforms a strong
generation-plus-mapping baseline (GenMap). Ablation experiments show that
LL-Rank's gains primarily stem from its PMI-based scoring, which isolates
semantic compatibility from label frequency bias.
Ссылки и действия
Дополнительные ресурсы: