Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning
2510.25164v1
eess.IV, cs.AI, cs.CV
2025-10-31
Authors:
Yogesh Thakku Suresh, Vishwajeet Shivaji Hogale, Luca-Alexandru Zamfira, Anandavardhana Hegde
Abstract
We present a transformer-based multimodal framework for generating clinically
relevant captions for MRI scans. Our system combines a DeiT-Small vision
transformer as the image encoder, MediCareBERT for caption embedding, and a
custom LSTM-based decoder. The architecture is designed to semantically align
image and text embeddings, using a hybrid cosine-MSE loss and contrastive
inference via vector similarity. We benchmark our method on the MultiCaRe
dataset, comparing performance on filtered brain-only MRIs with performance on
general MRI images, and evaluate it against state-of-the-art medical image
captioning methods including BLIP, R2GenGPT, and recent transformer-based
approaches. Results show that
focusing on domain-specific data improves caption accuracy and semantic
alignment. Our work proposes a scalable, interpretable solution for automated
medical image reporting.
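
The abstract names a hybrid cosine-MSE loss and inference via vector similarity but does not give their exact form. The following is a minimal sketch of one plausible reading, not the authors' implementation: the loss is assumed to be a weighted sum of cosine distance and mean squared error between an image embedding and a caption embedding, and inference is assumed to pick the candidate caption whose embedding is most similar to the image embedding. The function names and the weight `alpha` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def hybrid_loss(image_emb, text_emb, alpha=0.5):
    """Weighted sum of cosine distance and MSE.

    alpha is an assumed mixing weight; the abstract does not state
    how the two terms are combined.
    """
    cosine_term = 1.0 - cosine_similarity(image_emb, text_emb)
    return alpha * cosine_term + (1.0 - alpha) * mse(image_emb, text_emb)


def retrieve_caption(image_emb, caption_embs):
    """Return the index of the caption embedding closest to the image
    embedding by cosine similarity (assumed inference rule)."""
    scores = [cosine_similarity(image_emb, c) for c in caption_embs]
    return max(range(len(scores)), key=scores.__getitem__)
```

Under these assumptions the loss is zero only when the two embeddings coincide: the cosine term penalizes differences in direction, while the MSE term also penalizes differences in magnitude.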