A Multimodal LLM Approach for Visual Question Answering on Multiparametric 3D Brain MRI
2509.25889v2
cs.CV, cs.CL
2025-10-02
Authors:
Arvind Murari Vepa, Yannan Yu, Jingru Gan, Anthony Cuturrufo, Weikai Li, Wei Wang, Fabien Scalzo, Yizhou Sun
Abstract
We introduce mpLLM, a prompt-conditioned hierarchical mixture-of-experts
(MoE) architecture for visual question answering over multi-parametric 3D brain
MRI (mpMRI). mpLLM routes across modality-level and token-level projection
experts to fuse multiple interrelated 3D modalities, enabling efficient
training without image-report pretraining. To address limited image-text paired
supervision, mpLLM integrates a synthetic visual question answering (VQA)
protocol that generates medically relevant question-answer pairs from segmentation annotations,
and we collaborate with medical experts for clinical validation. mpLLM
outperforms strong medical VLM baselines by 5.3% on average across multiple
mpMRI datasets. Our study features three main contributions: (1) the first
clinically validated VQA dataset for 3D brain mpMRI, (2) a novel multimodal LLM
that handles multiple interrelated 3D modalities, and (3) strong empirical
results that demonstrate the medical utility of our methodology. Ablations
highlight the importance of modality-level and token-level experts and
prompt-conditioned routing.
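
The abstract does not give the layer definitions, so the following is only a minimal sketch of what a prompt-conditioned, two-level mixture of projection experts could look like, not the authors' implementation. Every name, shape, and design choice here is an assumption made for illustration: the class `HierarchicalMoEProjector`, the soft (dense) routing, the use of one linear projection per modality, and the random weights. It shows a prompt embedding producing weights over modality-level experts, followed by per-token weights over a second set of token-level experts.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class HierarchicalMoEProjector:
    """Hypothetical two-level MoE projector (illustrative, untrained).

    Level 1: one linear projection expert per MRI modality, mixed with
             weights computed from the prompt embedding.
    Level 2: a pool of token-level experts, mixed per token with weights
             computed from the prompt embedding and the fused token.
    """

    def __init__(self, n_modalities, d_vision, d_prompt, d_llm,
                 n_token_experts, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.02
        # Modality-level experts: (M, d_vision, d_llm)
        self.W_mod = rng.normal(0.0, scale, (n_modalities, d_vision, d_llm))
        # Token-level experts: (E, d_llm, d_llm)
        self.W_tok = rng.normal(0.0, scale, (n_token_experts, d_llm, d_llm))
        # Routers: both take the prompt embedding as input
        self.R_mod = rng.normal(0.0, scale, (d_prompt, n_modalities))
        self.R_tok = rng.normal(0.0, scale, (d_prompt + d_llm, n_token_experts))

    def __call__(self, vision_tokens, prompt_emb):
        # vision_tokens: (M, T, d_vision), one token sequence per modality
        # prompt_emb:    (d_prompt,), a pooled embedding of the question
        _, n_tokens, _ = vision_tokens.shape

        # Level 1: prompt-conditioned weights over modalities, shape (M,)
        mod_w = softmax(prompt_emb @ self.R_mod)
        projected = np.einsum("mtd,mde->mte", vision_tokens, self.W_mod)
        fused = np.einsum("m,mte->te", mod_w, projected)  # (T, d_llm)

        # Level 2: per-token weights over token experts, shape (T, E)
        prompt_tiled = np.broadcast_to(prompt_emb, (n_tokens, prompt_emb.shape[0]))
        router_in = np.concatenate([prompt_tiled, fused], axis=1)
        tok_w = softmax(router_in @ self.R_tok, axis=-1)
        expert_out = np.einsum("td,edf->tef", fused, self.W_tok)  # (T, E, d_llm)
        out = np.einsum("te,tef->tf", tok_w, expert_out)  # (T, d_llm)
        return out, mod_w, tok_w
```

With, say, four input modalities and eight visual tokens each, the call returns one fused token sequence in the language model's embedding width, plus the two sets of routing weights for inspection. A trained model would learn the router and expert weights, and might use sparse top-k routing instead of the dense mixing assumed here.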