Paying Attention to Hybrid Attention: Untangling the Issues with Conversion Methods
2510.05901v1
cs.LG, cs.AI, cs.CL
2025-10-09
Authors:
Martin Benfeghoul, Teresa Delgado, Adnan Oomerjee, Haitham Bou Ammar, Jun Wang, Zafeirios Fountas
Abstract
Transformers' quadratic computational complexity limits their scalability
despite remarkable performance. While linear attention reduces this to linear
complexity, pre-training such models from scratch remains, in most cases,
prohibitively expensive. Recent post-training linearisation methods convert
pre-trained Transformers to linear models efficiently, often using hybrid
approaches that combine linear attention with sliding-window softmax attention
(SWA). We identify a critical flaw: existing hybrid methods inadvertently bypass
the linear component, relying almost entirely on SWA. Component-level diagnostics
reveal this previously undetected behaviour stems from overlooked evaluation
practices on common-sense benchmarks. We propose three solutions to ensure
balanced component usage: (i) inference-time hybridisation of linear-only
conversions with sliding-window softmax; (ii) HedgeCATs, combining
attention-weight transfer with targeted LoRA fine-tuning; and (iii) Scheduled
Sliding-window Dropout (SSD), which stochastically suppresses the softmax
branch during training to prevent component collapse. Our methods maintain
computational efficiency while recovering most of the base model's performance and
ensuring genuine linear attention adoption, restoring the validity of
performance attributions in hybrid conversions.
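
To make the idea behind Scheduled Sliding-window Dropout (SSD) more concrete, the following is a minimal sketch, not the authors' implementation. The abstract only states that the softmax branch is stochastically suppressed during training. Everything else here is an assumption made for illustration: the linear ramp of the drop probability, the default values of `p_start` and `p_end`, the additive combination of the two branch outputs, and the function names `ssd_drop_probability` and `hybrid_output`.

```python
import random


def ssd_drop_probability(step, total_steps, p_start=0.0, p_end=0.9):
    """Hypothetical linear schedule for the chance of dropping the SWA branch.

    The abstract does not specify the schedule; a linear ramp from
    p_start to p_end is assumed here purely for illustration.
    """
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


def hybrid_output(linear_out, swa_out, drop_prob, training, rng):
    """Combine the outputs of the two attention branches for one token.

    During training, the sliding-window softmax (SWA) output is suppressed
    with probability drop_prob, so the linear branch must carry the signal
    on its own. At inference time both branches are always used.
    The additive combination is an assumption of this sketch.
    """
    drop_swa = training and rng.random() < drop_prob
    if drop_swa:
        return list(linear_out)
    return [lin + swa for lin, swa in zip(linear_out, swa_out)]


if __name__ == "__main__":
    rng = random.Random(0)
    linear_out = [0.2, -0.1, 0.4]  # toy linear-attention output
    swa_out = [1.0, 0.5, -0.3]     # toy sliding-window softmax output
    for step in (0, 500, 1000):
        p = ssd_drop_probability(step, total_steps=1000)
        out = hybrid_output(linear_out, swa_out, p, training=True, rng=rng)
        print(step, round(p, 2), out)
```

Dropping the entire SWA branch for a forward pass, rather than masking individual weights, mirrors the stated goal of preventing component collapse: whenever the softmax branch is absent, the loss can only be reduced through the linear attention path.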