VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation
2510.09733v1
cs.CL, cs.CV
2025-10-15
Авторы:
Yubo Sun, Chunyi Peng, Yukun Yan, Shi Yu, Zhenghao Liu, Chi Chen, Zhiyuan Liu, Maosong Sun
Abstract
Visual retrieval-augmented generation (VRAG) augments vision-language models
(VLMs) with external visual knowledge to ground reasoning and reduce
hallucinations. Yet current VRAG systems often fail to reliably perceive and
integrate evidence across multiple images, leading to weak grounding and
erroneous conclusions. In this paper, we propose EVisRAG, an end-to-end
framework that learns to reason with evidence-guided multi-image to address
this issue. The model first observes retrieved images and records per-image
evidence, then derives the final answer from the aggregated evidence. To train
EVisRAG effectively, we introduce Reward-Scoped Group Relative Policy
Optimization (RS-GRPO), which binds fine-grained rewards to scope-specific
tokens to jointly optimize visual perception and reasoning abilities of VLMs.
Experimental results on multiple visual question answering benchmarks
demonstrate that EVisRAG delivers substantial end-to-end gains over backbone
VLM with 27\% improvements on average. Further analysis shows that, powered by
RS-GRPO, EVisRAG improves answer accuracy by precisely perceiving and
localizing question-relevant evidence across multiple images and deriving the
final answer from that evidence, much like a real detective.
Ссылки и действия
Дополнительные ресурсы: