AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features
2510.00404v2
cs.LG, cs.AI, cs.CL
2025-10-04
Authors:
Xudong Zhu, Mohammad Mahdi Khalili, Zhihui Zhu
Abstract
Sparse autoencoders (SAEs) have emerged as powerful techniques for
interpretability of large language models (LLMs), aiming to decompose hidden
states into meaningful semantic features. While several SAE variants have been
proposed, there remains no principled framework to derive SAEs from the
original dictionary learning formulation. In this work, we introduce such a
framework by unrolling the proximal gradient method for sparse coding. We show
that a single-step update naturally recovers common SAE variants, including
ReLU, JumpReLU, and TopK. Through this lens, we reveal a fundamental limitation
of existing SAEs: their sparsity-inducing regularizers enforce non-negativity,
preventing a single feature from representing bidirectional concepts (e.g.,
male vs. female). This structural constraint fragments semantic axes into
separate, redundant features, limiting representational completeness. To
address this issue, we propose AbsTopK SAE, a new variant derived from the
$\ell_0$ sparsity constraint that applies hard thresholding over the
largest-magnitude activations. By preserving both positive and negative
activations, AbsTopK uncovers richer, bidirectional conceptual representations.
Comprehensive experiments across four LLMs and seven probing and steering tasks
show that AbsTopK improves reconstruction fidelity, enhances interpretability,
and enables single features to encode contrasting concepts. Remarkably, AbsTopK
matches or even surpasses the Difference-in-Mean method, a supervised approach
that requires labeled data for each concept and has been shown in prior work to
outperform SAEs.
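The contrast between the two sparsification rules can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `topk` and `abs_topk`, the plain-list representation of pre-activations, and the choice to clamp the conventional TopK output at zero (to reflect the non-negativity constraint described above) are all assumptions made for illustration.

```python
def topk(z, k):
    """Non-negative TopK (assumed form): keep the k largest values, clamp at zero."""
    # Indices of the k largest entries by signed value.
    keep = set(sorted(range(len(z)), key=lambda i: z[i], reverse=True)[:k])
    return [max(z[i], 0.0) if i in keep else 0.0 for i in range(len(z))]


def abs_topk(z, k):
    """AbsTopK (sketch): keep the k largest-magnitude values, sign preserved."""
    # Indices of the k largest entries by absolute value.
    keep = set(sorted(range(len(z)), key=lambda i: abs(z[i]), reverse=True)[:k])
    return [z[i] if i in keep else 0.0 for i in range(len(z))]


# Hypothetical pre-activations for five dictionary features.
z = [0.9, -1.5, 0.2, -0.1, 0.4]
print(topk(z, 2))      # [0.9, 0.0, 0.0, 0.0, 0.4]
print(abs_topk(z, 2))  # [0.9, -1.5, 0.0, 0.0, 0.0]
```

In this toy input the strongest response is the negative one (-1.5). The non-negative rule discards it and keeps a weaker positive activation instead, whereas the magnitude-based rule retains it, so a single feature can fire in either direction.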