GastroViT: A Vision Transformer Based Ensemble Learning Approach for Gastrointestinal Disease Classification with Grad CAM & SHAP Visualization
2509.26502v1
eess.IV, cs.CV
2025-10-02
Authors:
Sumaiya Tabassum, Md. Faysal Ahamed, Hafsa Binte Kibria, Md. Nahiduzzaman, Julfikar Haider, Muhammad E. H. Chowdhury, Mohammad Tariqul Islam
Abstract
The human gastrointestinal (GI) tract can exhibit a wide variety of mucosal
abnormalities, ranging from mild irritations to life-threatening
illnesses. Prompt identification of gastrointestinal disorders greatly
contributes to arresting the progression of the illness and improving
therapeutic outcomes. This paper presents an ensemble of pre-trained vision
transformers (ViTs) for accurately classifying endoscopic images of the GI
tract to categorize gastrointestinal problems and illnesses. ViTs are
attention-based neural networks built on the transformer architecture that
have achieved state-of-the-art (SOTA) performance across various visual
tasks. The proposed
model was evaluated on the publicly available HyperKvasir dataset with 10,662
images of 23 different GI diseases.
An ensemble method is proposed that combines the predictions of two
pre-trained models, MobileViT_XS and MobileViT_V2_200, which achieved
accuracies of 90.57% and 90.48%, respectively. The ensemble model, GastroViT,
outperforms all the individual models, with an average precision, recall, F1
score, and accuracy of 69%, 63%, 64%, and 91.98%, respectively, in the first
test, which involves 23 classes. These results are obtained with only 20
million (M) parameters, without data augmentation, and despite the highly
imbalanced dataset. In the second test, with 16 classes, the scores are even
higher, with average precision, recall, F1 score, and accuracy of 87%, 86%,
87%, and 92.70%, respectively. Additionally, the incorporation of explainable
AI (XAI) methods such as Grad-CAM (Gradient-weighted Class Activation Mapping)
and SHAP (SHapley Additive exPlanations) enhances model interpretability,
providing valuable insights for reliable GI diagnosis in real-world settings.
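
The abstract does not state how the two models' predictions are combined. As a minimal sketch, assuming simple soft voting (averaging class probabilities), the Python example below shows how an ensemble can correct errors that each model makes alone, and why accuracy can sit well above macro-averaged recall on an imbalanced dataset. The function names (soft_vote, accuracy, macro_recall), the equal weighting, and the toy three-class data are all hypothetical illustrations, not taken from the paper.

```python
import numpy as np


def soft_vote(probs_a, probs_b, weight_a=0.5):
    """Weighted average of two (N, C) class-probability arrays.

    Assumption: equal weights by default; the paper's actual
    combination rule is not given in the abstract.
    """
    probs_a = np.asarray(probs_a, dtype=float)
    probs_b = np.asarray(probs_b, dtype=float)
    if probs_a.shape != probs_b.shape:
        raise ValueError("probability arrays must have the same shape")
    return weight_a * probs_a + (1.0 - weight_a) * probs_b


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def macro_recall(y_true, y_pred, num_classes):
    """Unweighted mean of per-class recall over classes present in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            recalls.append(np.mean(y_pred[mask] == c))
    return float(np.mean(recalls))


# Toy, made-up data: 8 samples of a majority class and 1 each of two rare classes.
y_true = np.array([0] * 8 + [1, 2])

# Hypothetical softmax outputs. Model A misses the class-1 sample,
# model B misses the class-2 sample.
probs_a = np.array([[0.8, 0.1, 0.1]] * 8 + [[0.6, 0.3, 0.1], [0.2, 0.1, 0.7]])
probs_b = np.array([[0.7, 0.2, 0.1]] * 8 + [[0.2, 0.7, 0.1], [0.5, 0.1, 0.4]])

pred_a = probs_a.argmax(axis=1)
pred_b = probs_b.argmax(axis=1)
ensemble_pred = soft_vote(probs_a, probs_b).argmax(axis=1)

for name, pred in [("A", pred_a), ("B", pred_b), ("ensemble", ensemble_pred)]:
    print(name, accuracy(y_true, pred), macro_recall(y_true, pred, num_classes=3))
```

In this toy setup each single model misclassifies one rare-class sample, so its accuracy stays high while its macro recall drops sharply. The same effect is a plausible reason for the gap between the 91.98% accuracy and the 63% average recall reported for the 23-class test.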