What do vision-language models see in the context? Investigating multimodal in-context learning
2510.24331v1
cs.LG, cs.CV
2025-10-30
Авторы:
Gabriel O. dos Santos, Esther Colombini, Sandra Avila
Abstract
In-context learning (ICL) enables Large Language Models (LLMs) to learn tasks
from demonstration examples without parameter updates. Although it has been
extensively studied in LLMs, its effectiveness in Vision-Language Models (VLMs)
remains underexplored. In this work, we present a systematic study of ICL in
VLMs, evaluating seven models spanning four architectures on three image
captioning benchmarks. We analyze how prompt design, architectural choices, and
training strategies influence multimodal ICL. To our knowledge, we are the
first to analyze how attention patterns in VLMs vary with an increasing number
of in-context demonstrations. Our results reveal that training on imag-text
interleaved data enhances ICL performance but does not imply effective
integration of visual and textual information from demonstration examples. In
contrast, instruction tuning improves instruction-following but can reduce
reliance on in-context demonstrations, suggesting a trade-off between
instruction alignment and in-context adaptation. Attention analyses further
show that current VLMs primarily focus on textual cues and fail to leverage
visual information, suggesting a limited capacity for multimodal integration.
These findings highlight key limitations in the ICL abilities of current VLMs
and provide insights for enhancing their ability to learn from multimodal
in-context examples.
Ссылки и действия
Дополнительные ресурсы: