Mechanistic Interpretability as Statistical Estimation: A Variance Analysis of EAP-IG
2510.00845v2
cs.LG, cs.AI, cs.CL
2025-10-04
Authors:
Maxime Méloux, François Portet, Maxime Peyrard
Abstract
The development of trustworthy artificial intelligence requires moving beyond
black-box performance metrics toward an understanding of models' internal
computations. Mechanistic Interpretability (MI) aims to meet this need by
identifying the algorithmic mechanisms underlying model behaviors. Yet, the
scientific rigor of MI critically depends on the reliability of its findings.
In this work, we argue that interpretability methods, such as circuit
discovery, should be viewed as statistical estimators, subject to questions of
variance and robustness. To illustrate this statistical framing, we present a
systematic stability analysis of a state-of-the-art circuit discovery method:
EAP-IG. We evaluate its variance and robustness through a comprehensive suite
of controlled perturbations, including input resampling, prompt paraphrasing,
hyperparameter variation, and injected noise within the causal analysis itself.
Across a diverse set of models and tasks, our results demonstrate that EAP-IG
exhibits high structural variance and sensitivity to hyperparameters, which
calls the stability of its findings into question. Based on these results, we offer a
set of best-practice recommendations for the field, advocating for the routine
reporting of stability metrics to promote a more rigorous and statistically
grounded science of interpretability.
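
To make the statistical framing concrete, the following is a minimal sketch of one way to estimate the structural variance of a circuit discovery method under input resampling. It is not the authors' implementation: `toy_discover_circuit` is a hypothetical stand-in for a real method such as EAP-IG, the synthetic dataset is invented for illustration, and the choice of mean pairwise Jaccard similarity as the stability metric is an assumption rather than something stated in the abstract.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard index between two edge sets (defined as 1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def toy_discover_circuit(examples, n_edges=20, top_k=5):
    """Hypothetical stand-in for a circuit discovery method (NOT EAP-IG).

    Scores each edge by its mean value over the examples and keeps the
    top_k highest-scoring edges as the "circuit".
    """
    scores = [sum(ex[e] for ex in examples) / len(examples) for e in range(n_edges)]
    ranked = sorted(range(n_edges), key=lambda e: scores[e], reverse=True)
    return frozenset(ranked[:top_k])


def make_toy_dataset(n_examples=50, n_edges=20, noise=1.0, seed=0):
    """Synthetic data: edge e has true importance e / n_edges plus Gaussian noise."""
    rng = random.Random(seed)
    return [
        [e / n_edges + rng.gauss(0.0, noise) for e in range(n_edges)]
        for _ in range(n_examples)
    ]


def bootstrap_stability(dataset, discover, n_resamples=20, seed=0):
    """Mean pairwise Jaccard similarity of circuits found on bootstrap resamples.

    Values near 1.0 mean the discovered circuit barely changes across
    resamples; values near 0.0 mean it is highly unstable.
    """
    rng = random.Random(seed)
    circuits = []
    for _ in range(n_resamples):
        # Sample the dataset with replacement, keeping its original size.
        resample = [rng.choice(dataset) for _ in dataset]
        circuits.append(discover(resample))
    sims = [jaccard(a, b) for a, b in combinations(circuits, 2)]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    for noise in (0.0, 1.0, 5.0):
        data = make_toy_dataset(noise=noise)
        score = bootstrap_stability(data, toy_discover_circuit)
        print(f"noise={noise}: mean pairwise Jaccard = {score:.3f}")
```

In practice, `discover` would wrap an actual circuit discovery run on a model, and the same loop could be repeated over paraphrased prompts or hyperparameter settings instead of bootstrap resamples.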