CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning
2511.02360v1
cs.CV, cs.CL
2025-11-06
Авторы:
Jizheng Ma, Xiaofei Zhou, Yanlong Song, Han Yan
Abstract
In human cognition, there exist numerous thought processes that are tacit and
beyond verbal expression, enabling us to understand and interact with the world
in multiple ways. However, contemporary Vision-Language Models (VLMs) remain
constrained to reasoning within the discrete and rigid space of linguistic
tokens, thereby bottlenecking the rich, high-dimensional nature of visual
perception. To bridge this gap, we propose CoCoVa (Chain of Continuous
Vision-Language Thought), a novel framework for vision-language model that
leverages continuous cross-modal reasoning for diverse vision-language tasks.
The core of CoCoVa is an iterative reasoning cycle, where a novel Latent
Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a
chain of latent thought vectors through cross-modal fusion. To focus this
process, a token selection mechanism dynamically identifies salient visual
regions, mimicking attentional focus. To ensure these latent thoughts remain
grounded, we train the model with a multi-task objective that combines
contrastive learning and diffusion-based reconstruction, enforcing alignment
between latent representations and both visual and textual modalities.
Evaluations show CoCoVa improves accuracy and token efficiency over strong
baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B
models on almost all benchmarks. When scaled to 7B LLM backbones, it remains
competitive with state-of-the-art models. Qualitative analysis validates that
learned latent space captures interpretable and structured reasoning patterns,
highlighting the potential of CoCoVa to bridge the representational gap between
discrete language processing and the continuous nature of visual understanding.
Ссылки и действия
Дополнительные ресурсы: