NeMo: Needle in a Montage for Video-Language Understanding
2509.24563v1
cs.CV, cs.CL
2025-10-01
Authors:
Zi-Yuan Hu, Shuo Liang, Duo Zheng, Yanyang Li, Yeyao Tao, Shijia Huang, Wei Feng, Jia Qin, Jianguang Yu, Jing Huang, Meng Fang, Yin Li, Liwei Wang
Abstract
Recent advances in video large language models (VideoLLMs) call for new
evaluation protocols and benchmarks for complex temporal reasoning in
video-language understanding. Inspired by the needle-in-a-haystack test widely
used to evaluate LLMs, we introduce a novel task, Needle in a Montage (NeMo), designed
to assess VideoLLMs' critical reasoning capabilities, including long-context
recall and temporal grounding. To generate video question answering data for
our task, we develop a scalable automated data generation pipeline that
facilitates high-quality data synthesis. Built upon the proposed pipeline, we
present NeMoBench, a video-language benchmark centered on our task.
Specifically, our full set of NeMoBench features 31,378 automatically generated
question-answer (QA) pairs from 13,486 videos with various durations ranging
from seconds to hours. Experiments demonstrate that our pipeline can reliably
and automatically generate high-quality evaluation data, enabling NeMoBench to
be continuously updated with the latest videos. We evaluate 20 state-of-the-art
models on our benchmark, providing extensive results and key insights into
their capabilities and limitations. Our project page is available at:
https://lavi-lab.github.io/NeMoBench.