Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures
2510.20193v1
cs.IR, cs.CL, cs.CV, cs.LG
2025-10-25
Авторы:
Rahul Raja, Arpita Vats
Abstract
Question Answering (QA) systems have traditionally relied on structured text
data, but the rapid growth of multimedia content (images, audio, video, and
structured metadata) has introduced new challenges and opportunities for
retrieval-augmented QA. In this survey, we review recent advancements in QA
systems that integrate multimedia retrieval pipelines, focusing on
architectures that align vision, language, and audio modalities with user
queries. We categorize approaches based on retrieval methods, fusion
techniques, and answer generation strategies, and analyze benchmark datasets,
evaluation protocols, and performance tradeoffs. Furthermore, we highlight key
challenges such as cross-modal alignment, latency-accuracy tradeoffs, and
semantic grounding, and outline open problems and future research directions
for building more robust and context-aware QA systems leveraging multimedia
data.