Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning

2510.25164v1 eess.IV, cs.AI, cs.CV 2025-10-31

Authors:

Yogesh Thakku Suresh, Vishwajeet Shivaji Hogale, Luca-Alexandru Zamfira, Anandavardhana Hegde

Abstract

We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. Our system combines a DeiT-Small vision transformer as an image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder. The architecture is designed to semantically align image and text embeddings, using a hybrid cosine-MSE loss and contrastive inference via vector similarity. We benchmark our method on the MultiCaRe dataset, comparing performance on filtered brain-only MRIs versus general MRI images, and against state-of-the-art medical image captioning methods including BLIP, R2GenGPT, and recent transformer-based approaches. Results show that focusing on domain-specific data improves caption accuracy and semantic alignment. Our work proposes a scalable, interpretable solution for automated medical image reporting.
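
The abstract names a hybrid cosine-MSE loss and inference via vector similarity but gives no formulas. The following is a minimal sketch of one plausible reading, not the authors' implementation. The weighting parameter `alpha`, the function names, the toy vectors, and the placeholder captions are all assumptions made for illustration; plain Python lists stand in for the encoder outputs.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def hybrid_cosine_mse_loss(image_emb, text_emb, alpha=0.5):
    """Weighted sum of cosine distance and mean squared error.

    alpha is a hypothetical weighting; the abstract does not say how
    the two terms are combined.
    """
    cos_term = 1.0 - cosine_similarity(image_emb, text_emb)
    mse_term = sum((a - b) ** 2 for a, b in zip(image_emb, text_emb)) / len(image_emb)
    return alpha * cos_term + (1.0 - alpha) * mse_term

def retrieve_caption(image_emb, caption_bank):
    """Return the caption whose embedding is most cosine-similar to image_emb."""
    best = max(caption_bank, key=lambda item: cosine_similarity(image_emb, item[1]))
    return best[0]

# Toy data: placeholder captions with made-up 3-dimensional embeddings.
bank = [
    ("caption A (placeholder)", [1.0, 0.0, 0.0]),
    ("caption B (placeholder)", [0.0, 1.0, 0.0]),
]
query = [0.9, 0.1, 0.0]

best_caption = retrieve_caption(query, bank)
loss_identical = hybrid_cosine_mse_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
loss_orthogonal = hybrid_cosine_mse_loss([1.0, 0.0], [0.0, 1.0])
```

In this sketch the cosine term penalizes differences in direction and the MSE term penalizes differences in magnitude, so identical embeddings give a loss near zero while orthogonal ones give a clearly positive loss. A real system would compute the same quantities on batched tensors in a deep learning framework.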

