Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models
2510.17196v1
cs.CL, cs.AI, cs.LG
2025-10-22
Авторы:
Jiaqi Leng, Xiang Hu, Junxiong Wang, Jianguo Li, Wei Wu, Yucheng Lu
Abstract
Effectively processing long contexts is a critical challenge for language
models. While standard Transformers are limited by quadratic complexity and
poor length extrapolation, alternative architectures like sliding window
attention and state space models sacrifice the ability to effectively utilize
the full context due to their fixed-size memory. Chunk-based sparse attention
has emerged as a promising paradigm for extreme length generalization, yet the
key architectural principles underpinning its success are not yet fully
understood. In this work, we present a systematic dissection of these models to
identify the core components driving their performance. Through a unified
framework and comprehensive ablation studies, we demonstrate that a combination
of three design principles is critical: (1) an expressive, non-linear Chunk
Encoder with a dedicated CLS token to produce representations for retrieval;
(2) a Bypassing Residual Path to stably integrate retrieved global information
without it being overridden by the local residual stream; and (3) enforced
selection sparsity during pre-training to bridge the train-test distribution
gap. We provide a theoretical motivation for intra-chunk information processing
and landmark generation. By combining these principles, we establish a new
state-of-the-art for training-free length extrapolation, successfully
generalizing models trained on a 4K context to 32 million tokens on RULER and
BABILong. Our findings provide a clear and empirically-grounded set of design
principles for developing future, highly-capable long-context language models.
Ссылки и действия
Дополнительные ресурсы: