Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction
2510.20787v1
cs.CL, cs.LG
2025-10-25
Authors:
Mutian He, Philip N. Garner
Abstract
Linear-attention models that compress the entire input sequence into a
fixed-size recurrent state offer an efficient alternative to Transformers, but
their finite memory induces forgetfulness that harms retrieval-intensive tasks.
To mitigate the issue, we explore a series of hybrid models that restore direct
access to past tokens. We interleave token mixers with intermediate time and
space complexity between linear and full attention, including sparse attention
with token eviction and query-aware native sparse attention. In particular,
we propose a novel learnable token eviction approach. Combined with
sliding-window attention, an end-to-end trainable lightweight CNN aggregates
information from both past and future adjacent tokens to adaptively retain a
limited set of critical KV-pairs per head, maintaining linear attention's
constant time and space complexity. Efficient Triton kernels for the sparse
attention mechanisms are provided. Empirical evaluations on retrieval-intensive
benchmarks support the effectiveness of our approaches.
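To make the idea of sliding-window attention combined with a bounded set of retained KV-pairs more concrete, here is a minimal NumPy sketch for a single head. It is an illustration under stated assumptions, not the authors' implementation: the function names `sparse_mask` and `masked_attention`, the `window` and `budget` parameters, and the use of precomputed per-token `scores` are all hypothetical. In the paper, retention decisions come from a trained lightweight CNN that also looks at adjacent future tokens; here the scores are simply passed in, and the dense mask is for readability only (the paper's Triton kernels are not reproduced).

```python
import numpy as np

def sparse_mask(scores, window, budget):
    """Boolean mask[t, s]: may query t attend to key s?

    scores: (T,) hypothetical per-token retention scores (assumed given;
            the paper would obtain them from a small trained CNN per head).
    window: sliding-window size; the latest `window` keys are always visible.
    budget: maximum number of older keys kept outside the window.
    """
    T = len(scores)
    mask = np.zeros((T, T), dtype=bool)
    for t in range(T):
        start = max(0, t - window + 1)
        mask[t, start:t + 1] = True          # causal sliding window
        old = np.arange(start)               # keys that have left the window
        if len(old) > 0:
            # Keep only the `budget` highest-scoring old keys; the rest are evicted.
            keep = old[np.argsort(scores[old])[::-1][:budget]]
            mask[t, keep] = True
    return mask

def masked_attention(q, k, v, mask):
    """Softmax attention restricted to the positions allowed by `mask`."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = np.where(mask, logits, -np.inf)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Example usage with random inputs (illustrative values only).
rng = np.random.default_rng(0)
T, d = 8, 4
q, k, v = rng.normal(size=(3, T, d))
mask = sparse_mask(rng.normal(size=T), window=2, budget=2)
out = masked_attention(q, k, v, mask)        # shape (T, d)
```

In this sketch each query sees at most `window + budget` keys regardless of sequence length, which is the property that keeps per-step cost and cache size constant. Because the scores are fixed and distinct, a key that drops out of the top `budget` never returns, so the mask behaves like a true eviction policy.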