Time-Aware Feature Selection: Adaptive Temporal Masking for Stable Sparse Autoencoder Training
2510.08855v1
cs.LG, cs.AI, cs.CL
2025-10-14
Authors:
T. Ed Li, Junyu Ren
Abstract
Understanding the internal representations of large language models is
crucial for ensuring their reliability and safety, with sparse autoencoders
(SAEs) emerging as a promising interpretability approach. However, current SAE
training methods suffer from feature absorption, in which features (or neurons)
are absorbed into one another to minimize the $L_1$ penalty, making it difficult
to identify and analyze model behaviors consistently. We introduce Adaptive
Temporal Masking (ATM), a novel training approach that dynamically adjusts
feature selection by tracking activation magnitudes, frequencies, and
reconstruction contributions to compute importance scores that evolve over
time. ATM applies a probabilistic masking mechanism based on statistical
thresholding of these importance scores, creating a more natural feature
selection process. Through extensive experiments on the Gemma-2-2b model, we
demonstrate that ATM achieves substantially lower absorption scores than
existing methods such as TopK and JumpReLU SAEs while maintaining excellent
reconstruction quality. These results establish ATM as a principled solution
for learning stable, interpretable features in neural networks, providing a
foundation for more reliable model analysis.
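
The abstract names the ingredients of ATM (running statistics of activation magnitude, firing frequency, and reconstruction contribution, combined into an importance score that drives a probabilistic mask) but does not give the formulas. The following is only a minimal sketch of how such a mechanism could look, not the authors' implementation. Every concrete choice here is an assumption: the exponential moving average, the product used to combine the three statistics, the mean-plus-k-standard-deviations threshold, the sigmoid keep probability, and all names and defaults (`TemporalMaskSketch`, `decay`, `k_sigma`, `temperature`, `decoder_norms`).

```python
import math
import random
from statistics import pstdev


class TemporalMaskSketch:
    """Hypothetical illustration of time-aware feature masking (not the paper's code)."""

    def __init__(self, n_features, decay=0.99, k_sigma=1.0, temperature=0.1, seed=0):
        self.decay = decay              # assumed EMA decay for all running statistics
        self.k_sigma = k_sigma          # assumed threshold offset in standard deviations
        self.temperature = temperature  # assumed softness of the probabilistic mask
        self._rng = random.Random(seed)
        self.magnitude = [0.0] * n_features
        self.frequency = [0.0] * n_features
        self.contribution = [0.0] * n_features

    def _ema(self, old, new):
        # Exponential moving average, so the statistics evolve over training steps.
        return [self.decay * o + (1.0 - self.decay) * n for o, n in zip(old, new)]

    def update(self, acts, decoder_norms):
        """acts: list of batch rows, each a list of per-feature activations.
        decoder_norms: per-feature decoder vector norms (a proxy for how much
        each feature contributes to the reconstruction; assumed, not from the paper)."""
        batch = len(acts)
        columns = list(zip(*acts))
        mean_abs = [sum(abs(a) for a in col) / batch for col in columns]
        fire_rate = [sum(1 for a in col if a > 0) / batch for col in columns]
        contrib = [m * w for m, w in zip(mean_abs, decoder_norms)]
        self.magnitude = self._ema(self.magnitude, mean_abs)
        self.frequency = self._ema(self.frequency, fire_rate)
        self.contribution = self._ema(self.contribution, contrib)

    def importance(self):
        # Assumed combination rule: product of the three running statistics.
        return [m * f * c for m, f, c in
                zip(self.magnitude, self.frequency, self.contribution)]

    def keep_probability(self):
        scores = self.importance()
        sigma = pstdev(scores)
        threshold = sum(scores) / len(scores) + self.k_sigma * sigma
        scale = self.temperature * sigma + 1e-12
        probs = []
        for s in scores:
            # Clamp the logit so math.exp cannot overflow.
            z = max(-50.0, min(50.0, (s - threshold) / scale))
            probs.append(1.0 / (1.0 + math.exp(-z)))
        return probs

    def sample_mask(self):
        # Bernoulli draw per feature: 1.0 keeps the feature, 0.0 masks it.
        return [1.0 if self._rng.random() < p else 0.0
                for p in self.keep_probability()]
```

In a training loop, one would call `update` on each batch of SAE activations and multiply the activations by `sample_mask()` before decoding. Features whose running importance sits well above the statistical threshold are kept with probability near one, and the rest are masked most of the time.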