Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding
2510.20244v1
cs.CV, cs.LG
2025-10-25
Авторы:
Minseok Kang, Minhyeok Lee, Minjung Kim, Donghyeong Kim, Sangyoun Lee
Abstract
Video Temporal Grounding (VTG) aims to localize temporal segments in long,
untrimmed videos that align with a given natural language query. This task
typically comprises two subtasks: Moment Retrieval (MR) and Highlight Detection
(HD). While recent advances have been progressed by powerful pretrained
vision-language models such as CLIP and InternVideo2, existing approaches
commonly treat all text tokens uniformly during crossmodal attention,
disregarding their distinct semantic roles. To validate the limitations of this
approach, we conduct controlled experiments demonstrating that VTG models
overly rely on [EOS]-driven global semantics while failing to effectively
utilize word-level signals, which limits their ability to achieve fine-grained
temporal alignment. Motivated by this limitation, we propose DualGround, a
dual-branch architecture that explicitly separates global and local semantics
by routing the [EOS] token through a sentence-level path and clustering word
tokens into phrase-level units for localized grounding. Our method introduces
(1) tokenrole- aware cross modal interaction strategies that align video
features with sentence-level and phrase-level semantics in a structurally
disentangled manner, and (2) a joint modeling framework that not only improves
global sentence-level alignment but also enhances finegrained temporal
grounding by leveraging structured phrase-aware context. This design allows the
model to capture both coarse and localized semantics, enabling more expressive
and context-aware video grounding. DualGround achieves state-of-the-art
performance on both Moment Retrieval and Highlight Detection tasks across
QVHighlights and Charades- STA benchmarks, demonstrating the effectiveness of
disentangled semantic modeling in video-language alignment.
Ссылки и действия
Дополнительные ресурсы: