What Does It Take to Build a Performant Selective Classifier?
2510.20242v1
cs.LG, cs.AI, stat.ML
2025-10-25
Авторы:
Stephan Rabanser, Nicolas Papernot
Abstract
Selective classifiers improve model reliability by abstaining on inputs the
model deems uncertain. However, few practical approaches achieve the
gold-standard performance of a perfect-ordering oracle that accepts examples
exactly in order of correctness. Our work formalizes this shortfall as the
selective-classification gap and present the first finite-sample decomposition
of this gap to five distinct sources of looseness: Bayes noise, approximation
error, ranking error, statistical noise, and implementation- or shift-induced
slack. Crucially, our analysis reveals that monotone post-hoc calibration --
often believed to strengthen selective classifiers -- has limited impact on
closing this gap, since it rarely alters the model's underlying score ranking.
Bridging the gap therefore requires scoring mechanisms that can effectively
reorder predictions rather than merely rescale them. We validate our
decomposition on synthetic two-moons data and on real-world vision and language
benchmarks, isolating each error component through controlled experiments. Our
results confirm that (i) Bayes noise and limited model capacity can account for
substantial gaps, (ii) only richer, feature-aware calibrators meaningfully
improve score ordering, and (iii) data shift introduces a separate slack that
demands distributionally robust training. Together, our decomposition yields a
quantitative error budget as well as actionable design guidelines that
practitioners can use to build selective classifiers which approximate ideal
oracle behavior more closely.
Ссылки и действия
Дополнительные ресурсы: