Why Less is More (Sometimes): A Theory of Data Curation
2511.03492v1
cs.LG, stat.ML
2025-11-07
Авторы:
Elvis Dohmatob, Mohammad Pezeshki, Reyhane Askari-Hemmat
Abstract
This paper introduces a theoretical framework to resolve a central paradox in
modern machine learning: When is it better to use less data? This question has
become critical as classical scaling laws suggesting ``more is more'' (Sun et
al., 2025) are challenged by methods like LIMO (``less is more'') and s1 (Ye et
al., 2025; Muenighoff et al., 2025), which achieve superior performance with
small, aggressively curated datasets. Here, we study data curation strategies
where an imperfect oracle selects the training examples according to their
difficulty and correctness. Our results provide exact scaling law curves for
test error under both label-agnostic and label-aware curation rules, revealing
when and why keeping only a subset of data can improve generalization. In
contrast to classical scaling laws, we show that under certain conditions,
small curated datasets can outperform full datasets, and we provide analytical
conditions for this by deriving precise phase transition curves tied to data
size and quality. We validate these theoretical claims with empirical results
on ImageNet, confirming our predictions about when curation improves accuracy
and can even mitigate model collapse. Furthermore, our framework provides a
principled explanation for the contradictory curation strategies recently
observed in LLM mathematical reasoning.
Ссылки и действия
Дополнительные ресурсы: