Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation

arXiv: 2510.17354v1 | Categories: cs.CL, cs.AI, cs.IR, cs.LG | Date: 2025-10-22

Authors:

Chenghao Zhang, Guanting Dong, Xinyu Yang, Zhicheng Dou

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that better reflect real-world information needs. Building on this high-quality dataset, we adopt a two-stage training framework for Nyx: we first perform pre-training on NyxQA along with a variety of open-source retrieval datasets, followed by supervised fine-tuning using feedback from downstream vision-language models (VLMs) to align retrieval outputs with generative preferences. Experimental results demonstrate that Nyx not only performs competitively on standard text-only RAG benchmarks, but also excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.
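
To make the idea of a unified mixed-modal retriever more concrete, here is a minimal sketch of the retrieval step only. It assumes every query and document, whether text, image, or both, has already been mapped into one shared embedding space, and that candidates are ranked by cosine similarity. The abstract does not state the scoring function or the encoder, so the `cosine` and `retrieve` helpers, the document IDs, and the three-dimensional vectors are all hypothetical and chosen purely for illustration; this is not the Nyx implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_vec, corpus, k=2):
    """Return the IDs of the k documents most similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical embeddings: in a real system these would come from a
# mixed-modal encoder; the numbers below are made up for illustration.
corpus = {
    "text_only_doc": [0.9, 0.1, 0.0],
    "image_only_doc": [0.0, 0.2, 0.9],
    "text_plus_image_doc": [0.6, 0.1, 0.6],
}

# A hypothetical mixed-modal query (text plus image) in the same space.
query = [0.7, 0.0, 0.7]

top = retrieve(query, corpus, k=2)
print(top)
```

Because all modalities share one vector space in this sketch, a single ranking covers text-only, image-only, and interleaved documents, which is the property the URAG setting relies on. The retrieved items would then be passed to a vision-language model as context for generation.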
