Sparse Attention Post-Training for Mechanistic Interpretability

arXiv:2512.05865v1 | cs.LG, cs.AI | 2025-12-08

Authors:

Florent Draye, Anson Lei, Ingmar Posner, Bernhard Schölkopf

Abstract

We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 1B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.3 \%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them. These results demonstrate that transformer attention can be made orders of magnitude sparser, suggesting that much of its computation is redundant and that sparsity may serve as a guiding principle for more structured and interpretable models.
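
To make the "fraction of edges" figure concrete, here is a minimal sketch of one way attention connectivity could be measured. It is an illustration under stated assumptions, not the authors' procedure: the abstract does not define an edge or a cutoff, so the rule used here (a query-key pair counts as an edge when its attention weight exceeds `threshold`) and the helper names `causal_softmax` and `edge_density` are hypothetical.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax over a square score matrix with a causal mask."""
    out = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]              # token i may attend to positions 0..i
        m = max(visible)                    # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        z = sum(exps)
        out.append([e / z for e in exps] + [0.0] * (len(row) - i - 1))
    return out

def edge_density(attn, threshold=0.01):
    """Fraction of causally allowed (query, key) pairs whose weight exceeds threshold.

    ASSUMPTION: thresholding is an illustrative definition of an "edge",
    not one taken from the paper.
    """
    n = len(attn)
    possible = n * (n + 1) // 2             # lower triangle including the diagonal
    active = sum(1 for i in range(n) for j in range(i + 1) if attn[i][j] > threshold)
    return active / possible

# Dense head: uniform scores, so every allowed pair carries noticeable weight.
dense = causal_softmax([[0.0] * 4 for _ in range(4)])
# Sparse head: each token attends almost entirely to itself.
sparse = causal_softmax([[20.0 if i == j else 0.0 for j in range(4)] for i in range(4)])

print(edge_density(dense))   # 1.0
print(edge_density(sparse))  # 0.4
```

On this 4-token toy input the sparse head keeps 4 of the 10 allowed pairs. The abstract's figure of about 0.3% refers to models of up to 1B parameters and would depend on the edge definition the paper actually uses.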

Related articles

Approximation of Box Decomposition Algorithm for Fast Hypervolume-Based Multi-Ob...

NEAT: Neighborhood-Guided, Efficient, Autoregressive Set Transformer for 3D Mole...

Neural Coherence : Find higher performance to out-of-distribution tasks from few...

Impugan: Learning Conditional Generative Models for Robust Data Imputation

MaxShapley: Towards Incentive-compatible Generative Search with Fair Context Att...
