T3: Test-Time Model Merging in VLMs for Zero-Shot Medical Imaging Analysis
2510.27265v1
cs.CV, cs.LG
2025-11-04
Authors:
Raza Imam, Hu Wang, Dwarikanath Mahapatra, Mohammad Yaqub
Abstract
In medical imaging, vision-language models face a critical duality:
pretrained networks offer broad robustness but lack subtle, modality-specific
characteristics, while fine-tuned expert models achieve high in-distribution
accuracy yet falter under modality shift. Existing model-merging techniques,
designed for natural-image benchmarks, are simple and efficient but fail to
deliver consistent gains across diverse medical modalities; their static
interpolation limits reliability in varied clinical tasks. To address this, we
introduce Test-Time Task adaptive merging (T^3), a backpropagation-free
framework that computes per-sample interpolation coefficients via the
Jensen-Shannon divergence between the two models' output distributions. T^3
dynamically preserves local precision when models agree and defers to
generalist robustness under drift. To overcome the inference costs of
sample-wise merging, we further propose a batch-wise extension, T^3_B, that
computes a merging coefficient across a batch of samples, dramatically reducing
the computational bottleneck. Recognizing the lack of a standardized
medical-merging benchmark, we present a rigorous cross-evaluation protocol
spanning in-domain, base-to-novel, and corruption settings across four modalities.
Empirically, T^3 sets a new state of the art in Top-1 accuracy and error
reduction, outperforming strong baselines while maintaining efficiency, paving
the way for adaptive medical vision-language model (MVLM) deployment in clinical settings. Our code is
available at https://github.com/Razaimam45/TCube.
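The merging idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that the coefficient is derived from the Jensen-Shannon divergence between the two models' output distributions, so the specific mapping used here (lambda = 1 - JS / ln 2), the averaging used for the batch-wise variant, and all function names are assumptions made for illustration.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence (natural log), bounded in [0, ln 2].
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log(p / m), axis=-1)
    kl_qm = np.sum(q * np.log(q / m), axis=-1)
    return 0.5 * (kl_pm + kl_qm)

def merge_coefficient(logits_pre, logits_ft, batchwise=False):
    # ASSUMPTION: weight on the fine-tuned expert falls linearly with
    # normalized JS divergence (agreement -> 1, full disagreement -> 0).
    js = js_divergence(softmax(logits_pre), softmax(logits_ft))
    lam = np.clip(1.0 - js / np.log(2.0), 0.0, 1.0)
    if batchwise:
        # ASSUMPTION: the batch-wise variant averages per-sample values.
        return float(np.mean(lam))
    return lam

def interpolate(theta_pre, theta_ft, lam):
    # Linear interpolation of two parameter dictionaries with scalar lam.
    return {k: (1.0 - lam) * theta_pre[k] + lam * theta_ft[k] for k in theta_pre}
```

Under these assumptions, a sample on which the two models agree receives a coefficient near 1 (favoring the fine-tuned expert), while a sample on which they disagree strongly receives a coefficient near 0 (deferring to the pretrained generalist). No gradients are computed at any point, which matches the backpropagation-free description.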