Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
2510.20812v1
cs.CV, cs.AI, cs.CL
2025-10-25
Авторы:
Yuhan Liu, Lianhui Qin, Shengjie Wang
Abstract
Large Vision-Language Models (VLMs) have achieved remarkable progress in
multimodal understanding, yet they struggle when reasoning over
information-intensive images that densely interleave textual annotations with
fine-grained graphical elements. The main challenges lie in precisely
localizing critical cues in dense layouts and multi-hop reasoning to integrate
dispersed evidence. We propose Speculative Verdict (SV), a training-free
framework inspired by speculative decoding that combines multiple lightweight
draft experts with a large verdict model. In the draft stage, small VLMs act as
draft experts to generate reasoning paths that provide diverse localization
candidates; in the verdict stage, a strong VLM synthesizes these paths to
produce the final answer, minimizing computational cost while recovering
correct answers. To further improve efficiency and accuracy, SV introduces a
consensus expert selection mechanism that forwards only high-agreement
reasoning paths to the verdict. Empirically, SV achieves consistent gains on
challenging information-intensive and high-resolution visual question answering
benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K.
By synthesizing correct insights from multiple partially accurate reasoning
paths, SV achieves both error correction and cost-efficiency compared to large
proprietary models or training pipelines. Code is available at
https://github.com/Tinaliu0123/speculative-verdict
Ссылки и действия
Дополнительные ресурсы: