Atlas-Alignment: Making Interpretability Transferable Across Language Models
2510.27413v1
cs.LG, cs.AI, cs.CL
2025-11-04
Authors:
Bruno Puri, Jim Berend, Sebastian Lapuschkin, Wojciech Samek
Abstract
Interpretability is crucial for building safe, reliable, and controllable
language models, yet existing interpretability pipelines remain costly and
difficult to scale. Interpreting a new model typically requires costly training
of model-specific sparse autoencoders, manual or semi-automated labeling of SAE
components, and their subsequent validation. We introduce Atlas-Alignment, a
framework for transferring interpretability across language models by aligning
unknown latent spaces to a Concept Atlas - a labeled, human-interpretable
latent space - using only shared inputs and lightweight representational
alignment techniques. Once aligned, this enables two key capabilities in
previously opaque models: (1) semantic feature search and retrieval, and (2)
steering generation along human-interpretable atlas concepts. Through
quantitative and qualitative evaluations, we show that simple representational
alignment methods enable robust semantic retrieval and steerable generation
without the need for labeled concept data. Atlas-Alignment thus amortizes the
cost of explainable AI and mechanistic interpretability: by investing in one
high-quality Concept Atlas, we can make many new models transparent and
controllable at minimal marginal cost.
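
As a rough illustration of aligning an unknown latent space to a labeled atlas using only shared inputs, the sketch below matches each unit of an unknown model to the atlas concept whose activations correlate most strongly with it over the same inputs, then supports a simple concept search. The abstract does not say which alignment technique is used, so the correlation matching here is an assumed stand-in rather than the authors' method. The function names (`pearson`, `align_to_atlas`, `search`) and the toy activations are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length activation vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant unit carries no alignment signal
    return cov / (sx * sy)

def align_to_atlas(unknown_acts, atlas_acts, atlas_labels):
    """Map each unknown-model unit to its most correlated atlas concept.

    unknown_acts: one row per shared input, each row a list of unit activations.
    atlas_acts:   one row per shared input, each row a list of atlas activations.
    Assumption: correlation matching stands in for the alignment step.
    """
    unknown_cols = list(zip(*unknown_acts))  # per-unit activation profiles
    atlas_cols = list(zip(*atlas_acts))      # per-concept activation profiles
    alignment = {}
    for unit, ucol in enumerate(unknown_cols):
        scores = [pearson(ucol, acol) for acol in atlas_cols]
        best = max(range(len(scores)), key=scores.__getitem__)
        alignment[unit] = (atlas_labels[best], scores[best])
    return alignment

def search(alignment, concept):
    """Return (unit, score) pairs aligned to `concept`, best match first."""
    hits = [(u, s) for u, (label, s) in alignment.items() if label == concept]
    return sorted(hits, key=lambda hit: -hit[1])

# Toy data (hypothetical): 4 shared inputs, 2 atlas concepts, 3 unknown units.
atlas_labels = ["animals", "numbers"]
atlas_acts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]
unknown_acts = [[0.1, 2.0, 0.5], [0.2, 1.8, 0.4], [1.9, 0.1, 0.6], [1.7, 0.3, 0.5]]

alignment = align_to_atlas(unknown_acts, atlas_acts, atlas_labels)
for unit, (label, score) in alignment.items():
    print(f"unit {unit} -> {label} (r = {score:.2f})")
```

Note that no concept labels are needed for the unknown model: the labels come entirely from the atlas side, which is the property the abstract emphasizes. Steering along an atlas concept would need a map back into the unknown model's activation space, which this sketch does not attempt.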