VideoNSA: Native Sparse Attention Scales Video Understanding
2510.02295v1
cs.CV, cs.AI, cs.LG
2025-10-04
Авторы:
Enxin Song, Wenhao Chai, Shusheng Yang, Ethan Armand, Xiaojun Shan, Haiyang Xu, Jianwen Xie, Zhuowen Tu
Abstract
Video understanding in multimodal language models remains limited by context
length: models often miss key transition frames and struggle to maintain
coherence across long time scales. To address this, we adapt Native Sparse
Attention (NSA) to video-language models. Our method, VideoNSA, adapts
Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We
employ a hardware-aware hybrid approach to attention, preserving dense
attention for text, while employing NSA for video. Compared to
token-compression and training-free sparse baselines, VideoNSA achieves
improved performance on long-video understanding, temporal reasoning, and
spatial benchmarks. Further ablation analysis reveals four key findings: (1)
reliable scaling to 128K tokens; (2) an optimal global-local attention
allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4)
the learnable combined sparse attention help induce dynamic attention sinks.
Ссылки и действия
Дополнительные ресурсы: