Vision Language Models Map Logos to Text via Semantic Entanglement in the Visual Projector
2510.12287v1
cs.CV, cs.CL
2025-10-16
Авторы:
Sifan Li, Hongkai Chen, Yujun Cai, Qingwen Ye, Liyang Chen, Junsong Yuan, Yiwei Wang
Abstract
Vision Language Models (VLMs) have achieved impressive progress in multimodal
reasoning; yet, they remain vulnerable to hallucinations, where outputs are not
grounded in visual evidence. In this paper, we investigate a previously
overlooked setting: logo hallucination, where models generate brand names or
textual content despite logos containing no visible words. Using curated splits
of pure symbols, hybrids, and text-bearing logos, as well as the challenging
Hard-60 subset, we systematically measure hallucination across leading VLMs. We
further probe robustness through nine structured perturbations and show that
hallucinations persist even under strong distortions, with occlusion exposing
the sharpest weaknesses. Embedding-level analysis with open-weight LLaVA
demonstrates that hallucination is tied to a small subset of projector
dimensions, and targeted ablation substantially reduces errors while preserving
OCR accuracy. Together, these findings reveal that VLMs often rely on symbolic
priors rather than genuine glyph perception, particularly for iconic circular
logos, and that projector subspaces play a decisive role in this failure mode.
Our work contributes both a novel diagnostic lens and actionable mitigation
insights, highlighting projector disentanglement and OCR-guided decoding as
promising directions for building more trustworthy multimodal systems.
Ссылки и действия
Дополнительные ресурсы: