To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models
2510.08510v1
cs.CV, cs.AI, cs.CL
2025-10-11
Authors:
Jiayun Luo, Wan-Cyuan Fan, Lyuyang Wang, Xiangteng He, Tanzila Rahman, Purang Abolmaesumi, Leonid Sigal
Abstract
Large Vision Language Models (LVLMs) have recently emerged as powerful
architectures capable of understanding and reasoning over both visual and
textual information. These models typically rely on two key components: a
Vision Transformer (ViT) and a Large Language Model (LLM). The ViT encodes visual
content into a sequence of image tokens and serves as the perceptual front-end
-- the eyes of the model. In contrast, the LLM interprets these tokens to
perform high-level reasoning, generates responses, and functions as the
cognitive core -- the brain of the model. However, it remains unclear which
visual tokens contribute most significantly to understanding and reasoning, and
how effectively these signals are propagated from the ViT to the LLM. Most
existing work has focused on attention sinks within the LLM, that is,
low-semantic tokens that receive disproportionately high attention. We instead
shift the focus to the vision encoder and identify a class of high-norm visual
tokens from the ViT, referred to as ViT attention sinks, which have rarely been
studied despite their importance for LVLMs. Our findings show that these
ViT sinks encapsulate high-level semantic concepts from images, allowing the
LLM to perform more effective understanding and reasoning. Despite their
importance, these sink tokens are often overlooked in existing LVLM
architectures. To explore their contribution, we present both qualitative and
quantitative analyses of the information embedded in these sink tokens. We also
propose both training-free and training-based approaches to better leverage how,
and to what extent, this information is interpreted by the LLM. By explicitly
utilizing these tokens, we demonstrate substantial improvements across a range
of LVLMs and visual reasoning tasks, highlighting the untapped potential of ViT
attention sinks in enhancing visual reasoning.
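
As a rough illustration of what "high-norm visual tokens" could mean in practice, the sketch below flags tokens whose L2 norm is far above the median norm of the sequence. The abstract does not state the selection rule, so the function name `find_high_norm_tokens`, the `ratio` cutoff, and the toy token values are all assumptions made for illustration and are not the authors' method.

```python
import math
from statistics import median


def find_high_norm_tokens(tokens, ratio=3.0):
    """Return indices of tokens whose L2 norm exceeds ratio * median norm.

    `ratio` is a hypothetical cutoff chosen only for this illustration.
    """
    norms = [math.sqrt(sum(x * x for x in tok)) for tok in tokens]
    cutoff = ratio * median(norms)
    return [i for i, n in enumerate(norms) if n > cutoff]


# Toy stand-in for ViT output tokens: four ordinary tokens and one
# token with an unusually large norm (index 2).
tokens = [
    [0.5, -0.2, 0.1],
    [0.3, 0.4, -0.1],
    [12.0, -9.0, 8.0],
    [-0.2, 0.1, 0.6],
    [0.4, 0.0, -0.3],
]
sink_indices = find_high_norm_tokens(tokens)
print(sink_indices)
```

With real model outputs, the same idea would apply to the per-token embeddings produced by the vision encoder, and the cutoff would need to be chosen empirically.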