ECG-LLM -- training and evaluation of domain-specific large language models for electrocardiography
2510.18339v1
cs.CL, cs.LG
2025-10-23
Авторы:
Lara Ahrens, Wilhelm Haverkamp, Nils Strodthoff
Abstract
Domain-adapted open-weight large language models (LLMs) offer promising
healthcare applications, from queryable knowledge bases to multimodal
assistants, with the crucial advantage of local deployment for privacy
preservation. However, optimal adaptation strategies, evaluation methodologies,
and performance relative to general-purpose LLMs remain poorly characterized.
We investigated these questions in electrocardiography, an important area of
cardiovascular medicine, by finetuning open-weight models on domain-specific
literature and implementing a multi-layered evaluation framework comparing
finetuned models, retrieval-augmented generation (RAG), and Claude Sonnet 3.7
as a representative general-purpose model. Finetuned Llama 3.1 70B achieved
superior performance on multiple-choice evaluations and automatic text metrics,
ranking second to Claude 3.7 in LLM-as-a-judge assessments. Human expert
evaluation favored Claude 3.7 and RAG approaches for complex queries. Finetuned
models significantly outperformed their base counterparts across nearly all
evaluation modes. Our findings reveal substantial performance heterogeneity
across evaluation methodologies, underscoring assessment complexity.
Nevertheless, domain-specific adaptation through finetuning and RAG achieves
competitive performance with proprietary models, supporting the viability of
privacy-preserving, locally deployable clinical solutions.
Ссылки и действия
Дополнительные ресурсы: