Circuit Insights: Towards Interpretability Beyond Activations
2510.14936v1
cs.LG, cs.AI, cs.CL
2025-10-18
Authors:
Elena Golimblevskaia, Aakriti Jain, Bruno Puri, Ammar Ibrahim, Wojciech Samek, Sebastian Lapuschkin
Abstract
The fields of explainable AI and mechanistic interpretability aim to uncover
the internal structure of neural networks, with circuit discovery as a central
tool for understanding model computations. Existing approaches, however, rely
on manual inspection and remain limited to toy tasks. Automated
interpretability offers scalability by analyzing isolated features and their
activations, but it often misses interactions between features and depends
strongly on external LLMs and dataset quality. Transcoders have recently made
it possible to separate feature attributions into input-dependent and
input-invariant components, providing a foundation for more systematic circuit
analysis. Building on this, we propose WeightLens and CircuitLens, two
complementary methods that go beyond activation-based analysis. WeightLens
interprets features directly from their learned weights, removing the need for
explainer models or datasets while matching or exceeding the performance of
existing methods on context-independent features. CircuitLens captures how
feature activations arise from interactions between components, revealing
circuit-level dynamics that activation-only approaches cannot identify.
Together, these methods increase interpretability robustness and enhance
scalable mechanistic analysis of circuits while maintaining efficiency and
quality.
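
The split into input-dependent and input-invariant components mentioned above can be illustrated with a minimal sketch. It assumes the usual transcoder formulation, in which the attribution from an upstream feature to a downstream feature factors into the upstream feature's activation (which depends on the input) times a dot product between the upstream decoder vector and the downstream encoder vector (which depends only on weights). The function names and toy values are hypothetical and are not taken from the paper's code.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attribution(upstream_activation, decoder_vec, encoder_vec):
    """Return (total attribution, input-invariant weight term).

    Assumed factorization (hypothetical helper, not the paper's API):
        total = upstream_activation * (decoder_vec . encoder_vec)
    """
    input_invariant = dot(decoder_vec, encoder_vec)  # fixed by learned weights
    total = upstream_activation * input_invariant    # scales with the input
    return total, input_invariant


# Toy vectors, chosen only for illustration.
decoder_vec = [0.5, -1.0, 2.0]   # upstream feature's decoder direction
encoder_vec = [1.0, 0.0, 0.5]    # downstream feature's encoder direction

for act in (0.0, 2.0):
    total, weight_term = attribution(act, decoder_vec, encoder_vec)
    print(f"activation={act}: weight term={weight_term}, attribution={total}")
```

The weight term is the same for every input, which is what makes a weights-only reading of a feature possible in principle; only the activation factor changes from one input to the next.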