From Black-box to Causal-box: Towards Building More Interpretable Models
2510.21998v1
cs.LG, cs.AI, stat.ML
2025-10-29
Authors:
Inwoo Hwang, Yushu Pan, Elias Bareinboim
Abstract
Understanding the predictions made by deep learning models remains a central
challenge, especially in high-stakes applications. A promising approach is to
equip models with the ability to answer counterfactual questions, that is,
hypothetical "what if?" scenarios that go beyond the observed data and
provide insight into a model's reasoning. In this work, we introduce the notion
of causal interpretability, which formalizes when counterfactual queries can be
evaluated from a specific class of models and observational data. We analyze
two common model classes, black-box and concept-based predictors, and show
that neither is causally interpretable in general. To address this gap, we
develop a framework for building models that are causally interpretable by
design. Specifically, we derive a complete graphical criterion that determines
whether a given model architecture supports a given counterfactual query. This
leads to a fundamental tradeoff between causal interpretability and predictive
accuracy, which we characterize by identifying the unique maximal set of
features that yields an interpretable model with maximal predictive
expressiveness. Experiments corroborate the theoretical findings.