IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs
2511.04727v1
cs.CV, cs.AI, cs.LG
2025-11-11
Authors:
Ali Faraz, Akash, Shaharukh Khan, Raja Kolla, Akshat Patidar, Suranjan Goswami, Abhinav Ravi, Chandra Khatri, Shubham Agarwal
Abstract
Vision-language models (VLMs) have demonstrated impressive generalization
across multimodal tasks, yet most evaluation benchmarks remain Western-centric,
leaving open questions about their performance in culturally diverse and
multilingual settings. To address this gap, we introduce IndicVisionBench, the
first large-scale benchmark centered on the Indian subcontinent. Covering
English and 10 Indian languages, our benchmark spans 3 multimodal tasks,
including Optical Character Recognition (OCR), Multimodal Machine Translation
(MMT), and Visual Question Answering (VQA), covering 6 question types.
Our final benchmark consists of a total of ~5K images and 37K+ QA pairs across
13 culturally grounded topics. In addition, we release a paired parallel corpus
of annotations across 10 Indic languages, creating a unique resource for
analyzing cultural and linguistic biases in VLMs. We evaluate a broad spectrum
of 8 models, from proprietary closed-source systems to open-weights medium and
large-scale models. Our experiments reveal substantial performance gaps,
underscoring the limitations of current VLMs in culturally diverse contexts. By
centering cultural diversity and multilinguality, IndicVisionBench establishes
a reproducible evaluation framework that paves the way for more inclusive
multimodal research.