The Benchmarking Epistemology: Construct Validity for Evaluating Machine Learning Models
2510.23191v1
cs.LG, stat.ML
2025-10-29
Авторы:
Timo Freiesleben, Sebastian Zezulka
Abstract
Predictive benchmarking, the evaluation of machine learning models based on
predictive performance and competitive ranking, is a central epistemic practice
in machine learning research and an increasingly prominent method for
scientific inquiry. Yet, benchmark scores alone provide at best measurements of
model performance relative to an evaluation dataset and a concrete learning
problem. Drawing substantial scientific inferences from the results, say about
theoretical tasks like image classification, requires additional assumptions
about the theoretical structure of the learning problems, evaluation functions,
and data distributions. We make these assumptions explicit by developing
conditions of construct validity inspired by psychological measurement theory.
We examine these assumptions in practice through three case studies, each
exemplifying a typical intended inference: measuring engineering progress in
computer vision with ImageNet; evaluating policy-relevant weather predictions
with WeatherBench; and examining limitations of the predictability of life
events with the Fragile Families Challenge. Our framework clarifies the
conditions under which benchmark scores can support diverse scientific claims,
bringing predictive benchmarking into perspective as an epistemological
practice and a key site of conceptual and theoretical reasoning in machine
learning.
Ссылки и действия
Дополнительные ресурсы: