FOCUS: Efficient Keyframe Selection for Long Video Understanding
2510.27280v1
cs.CV, cs.AI, cs.LG
2025-11-04
Авторы:
Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Zhenheng Yang, Yang You
Abstract
Multimodal large language models (MLLMs) represent images and video frames as
visual tokens. Scaling from single images to hour-long videos, however,
inflates the token budget far beyond practical limits. Popular pipelines
therefore either uniformly subsample or apply keyframe selection with
retrieval-style scoring using smaller vision-language models. However, these
keyframe selection methods still rely on pre-filtering before selection to
reduce the inference cost and can miss the most informative moments.
We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a
training-free, model-agnostic keyframe selection module that selects
query-relevant frames under a strict token budget. FOCUS formulates keyframe
selection as a combinatorial pure-exploration (CPE) problem in multi-armed
bandits: it treats short temporal clips as arms, and uses empirical means and
Bernstein confidence radius to identify informative regions while preserving
exploration of uncertain areas. The resulting two-stage
exploration-exploitation procedure reduces from a sequential policy with
theoretical guarantees, first identifying high-value temporal regions, then
selecting top-scoring frames within each region On two long-video
question-answering benchmarks, FOCUS delivers substantial accuracy improvements
while processing less than 2% of video frames. For videos longer than 20
minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating
its effectiveness as a keyframe selection method and providing a simple and
general solution for scalable long-video understanding with MLLMs.
Ссылки и действия
Дополнительные ресурсы: