Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation
2510.22694v1
cs.CV, cs.CL, cs.IR
2025-10-29
Авторы:
Shu Zhao, Tianyi Shen, Nilesh Ahuja, Omesh Tickoo, Vijaykrishnan Narayanan
Abstract
Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a promising
method to generate factual and up-to-date responses of Multimodal Large
Language Models (MLLMs) by incorporating non-parametric knowledge from external
knowledge bases. However, existing MRAG approaches suffer from static retrieval
strategies, inflexible modality selection, and suboptimal utilization of
retrieved information, leading to three critical challenges: determining when
to retrieve, what modality to incorporate, and how to utilize retrieved
information effectively. To address these challenges, we introduce Windsock, a
query-dependent module making decisions on retrieval necessity and modality
selection, effectively reducing computational overhead and improving response
quality. Additionally, we propose Dynamic Noise-Resistance (DANCE) Instruction
Tuning, an adaptive training strategy that enhances MLLMs' ability to utilize
retrieved information while maintaining robustness against noise. Moreover, we
adopt a self-assessment approach leveraging knowledge within MLLMs to convert
question-answering datasets to MRAG training datasets. Extensive experiments
demonstrate that our proposed method significantly improves the generation
quality by 17.07% while reducing 8.95% retrieval times.
Ссылки и действия
Дополнительные ресурсы: