Attention Is All You Need for KV Cache in Diffusion LLMs
2510.14973v1
cs.CL, cs.AI, cs.LG
2025-10-18
Авторы:
Quan Nguyen-Tri, Mukul Ranjan, Zhiqiang Shen
Abstract
This work studies how to adaptively recompute key-value (KV) caches for
diffusion large language models (DLMs) to maximize prediction accuracy while
minimizing decoding latency. Prior methods' decoders recompute QKV for all
tokens at every denoising step and layer, despite KV states changing little
across most steps, especially in shallow layers, leading to substantial
redundancy. We make three observations: (1) distant ${\bf MASK}$ tokens
primarily act as a length-bias and can be cached block-wise beyond the active
prediction window; (2) KV dynamics increase with depth, suggesting that
selective refresh starting from deeper layers is sufficient; and (3) the
most-attended token exhibits the smallest KV drift, providing a conservative
lower bound on cache change for other tokens. Building on these, we propose
${\bf Elastic-Cache}$, a training-free, architecture-agnostic strategy that
jointly decides ${when}$ to refresh (via an attention-aware drift test on the
most-attended token) and ${where}$ to refresh (via a depth-aware schedule that
recomputes from a chosen layer onward while reusing shallow-layer caches and
off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs
adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant
computation and accelerating decoding with negligible loss in generation
quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across
mathematical reasoning and code generation tasks demonstrate consistent
speedups: $8.7\times$ on GSM8K (256 tokens), $45.1\times$ on longer sequences,
and $4.8\times$ on HumanEval, while consistently maintaining higher accuracy
than the baseline. Our method achieves significantly higher throughput
($6.8\times$ on GSM8K) than existing confidence-based approaches while
preserving generation quality, enabling practical deployment of diffusion LLMs.
Ссылки и действия
Дополнительные ресурсы: