Beyond Isolated Facts: Synthesizing Narrative and Grounded Supervision for VideoQA
2509.24445v1
cs.CV, cs.CL
2025-10-01
Authors:
Jianxin Liang, Tan Yue, Yuxuan Wang, Yueqian Wang, Zhihan Yin, Huishuai Zhang, Dongyan Zhao
Abstract
The performance of Video Question Answering (VideoQA) models is fundamentally
constrained by the nature of their supervision, which typically consists of
isolated, factual question-answer pairs. This "bag-of-facts" approach fails to
capture the underlying narrative and causal structure of events, limiting
models to a shallow understanding of video content. To move beyond this
paradigm, we introduce a framework to synthesize richer supervisory signals. We
propose two complementary strategies: Question-Based Paraphrasing (QBP), which
synthesizes the diverse inquiries (what, how, why) from a video's existing set
of question-answer pairs into a holistic narrative paragraph that reconstructs
the video's event structure; and Question-Based Captioning (QBC), which
generates fine-grained visual rationales, grounding the answer to each question
in specific, relevant evidence. Leveraging powerful generative models, we use
this synthetic data to train VideoQA models under a unified next-token
prediction objective. Extensive experiments on STAR and NExT-QA validate our
approach, demonstrating significant accuracy gains and establishing new
state-of-the-art results, such as improving a 3B model to 72.5% on STAR
(+4.9%) and a 7B model to 80.8% on NExT-QA. Beyond accuracy, our analysis
reveals that both QBP and QBC substantially enhance cross-dataset
generalization, with QBP additionally accelerating model convergence by over
2.5x. These results demonstrate that shifting data synthesis from isolated
facts to narrative coherence and grounded rationales yields a more accurate,
efficient, and generalizable training paradigm.
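
To illustrate how the two strategies might turn a video's existing question-answer pairs into text-only supervision, here is a minimal sketch. It is not the authors' implementation: the prompt wording, the `QAPair` type, and the helpers `build_qbp_prompt`, `build_qbc_prompt`, and `build_training_samples` are all hypothetical, and the call to the generative model is left out. Mixing the original QA pairs with the synthetic targets is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QAPair:
    question: str
    answer: str


def build_qbp_prompt(qa_pairs: List[QAPair]) -> str:
    """Hypothetical QBP prompt: merge all QA pairs of one video into one narrative."""
    facts = "\n".join(
        f"{i}. Q: {p.question} A: {p.answer}" for i, p in enumerate(qa_pairs, 1)
    )
    return (
        "Rewrite the following question-answer pairs about one video as a single "
        "coherent paragraph describing the events:\n" + facts
    )


def build_qbc_prompt(pair: QAPair) -> str:
    """Hypothetical QBC prompt: ask for the visual evidence behind one answer."""
    return (
        "Describe the visual evidence in the video that supports this answer.\n"
        f"Q: {pair.question}\nA: {pair.answer}"
    )


def build_training_samples(
    qa_pairs: List[QAPair], narrative: str, rationales: List[str]
) -> List[Dict[str, str]]:
    """Collect (prompt, target) text pairs for next-token prediction training.

    `narrative` and `rationales` stand in for generator outputs (assumed inputs).
    """
    samples = [{"prompt": p.question, "target": p.answer} for p in qa_pairs]
    samples.append(
        {"prompt": "Describe what happens in the video.", "target": narrative}
    )
    for pair, rationale in zip(qa_pairs, rationales):
        samples.append(
            {
                "prompt": f"What visual evidence answers: {pair.question}",
                "target": rationale,
            }
        )
    return samples


qa = [
    QAPair("What did the person pick up?", "A cup."),
    QAPair("Why did the person open the fridge?", "To get milk."),
]
qbp_prompt = build_qbp_prompt(qa)
qbc_prompts = [build_qbc_prompt(p) for p in qa]
samples = build_training_samples(
    qa,
    narrative="The person picks up a cup, then opens the fridge to get milk.",
    rationales=["A hand lifts a cup.", "The fridge door opens; milk is taken out."],
)
```

In this sketch every sample is a plain prompt and target string, so a single language-modeling loss can cover the original answers, the narrative paragraph, and the per-question rationales.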