DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning
2510.09883v1
cs.CL, cs.LG
2025-10-15
Авторы:
Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavarm
Abstract
Large reasoning models (LRMs) achieve state-of-the-art performance on
challenging benchmarks by generating long chains of intermediate steps, but
their inference cost is dominated by decoding, where each new token must attend
to the entire growing sequence. Existing sparse attention methods reduce
computation by pruning the key-value (KV) cache, yet they suffer from severe
accuracy degradation on reasoning tasks due to cumulative selection errors and
the dynamic importance of tokens over long derivations. We present
\textbf{DELTA}, a training-free sparse attention mechanism that achieves
computational efficiency without sacrificing model accuracy. DELTA partitions
transformer layers into three groups: initial layers that use full attention, a
small set of \emph{selection layers} that identify salient tokens via
aggregated head-level attention scores, and subsequent \emph{sparse-attention
layers} that attend only to the selected subset. This design preserves the full
KV cache in GPU memory for accuracy, while avoiding expensive full-attention
computation over many layers. On reasoning benchmarks such as AIME and
GPQA-Diamond, DELTA matches or surpasses full attention in accuracy, while
reducing the number of attended tokens by up to $5\times$ and delivering
$1.5\times$ end-to-end speedup. Our results show that selective reuse of
intermediate attention maps offers a robust path toward efficient long-context
reasoning.
Ссылки и действия
Дополнительные ресурсы: