BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods
2510.06811v1
cs.CL, cs.LG
2025-10-10
Авторы:
Philipp Mondorf, Mingyang Wang, Sebastian Gerstner, Ahmad Dawar Hakimi, Yihong Liu, Leonor Veloso, Shijia Zhou, Hinrich Schütze, Barbara Plank
Abstract
The Circuit Localization track of the Mechanistic Interpretability Benchmark
(MIB) evaluates methods for localizing circuits within large language models
(LLMs), i.e., subnetworks responsible for specific task behaviors. In this
work, we investigate whether ensembling two or more circuit localization
methods can improve performance. We explore two variants: parallel and
sequential ensembling. In parallel ensembling, we combine attribution scores
assigned to each edge by different methods-e.g., by averaging or taking the
minimum or maximum value. In the sequential ensemble, we use edge attribution
scores obtained via EAP-IG as a warm start for a more expensive but more
precise circuit identification method, namely edge pruning. We observe that
both approaches yield notable gains on the benchmark metrics, leading to a more
precise circuit identification approach. Finally, we find that taking a
parallel ensemble over various methods, including the sequential ensemble,
achieves the best results. We evaluate our approach in the BlackboxNLP 2025 MIB
Shared Task, comparing ensemble scores to official baselines across multiple
model-task combinations.
Ссылки и действия
Дополнительные ресурсы: