The Coverage Principle: How Pre-training Enables Post-Training
2510.15020v1
stat.ML, cs.AI, cs.CL, cs.LG, math.ST, stat.TH
2025-10-21
Authors:
Fan Chen, Audrey Huang, Noah Golowich, Sadhika Malladi, Adam Block, Jordan T. Ash, Akshay Krishnamurthy, Dylan J. Foster
Abstract
Language models demonstrate remarkable abilities when pre-trained on large
text corpora and fine-tuned for specific tasks, but how and why pre-training
shapes the success of the final model remains poorly understood. Notably,
although pre-training success is often quantified by cross-entropy loss,
cross-entropy can be a poor predictor of downstream performance. Instead, we
provide a theoretical perspective on this relationship through the lens of
coverage, which quantifies the probability mass the pre-trained model
places on high-quality responses and which is necessary and sufficient for
post-training and test-time scaling methods such as Best-of-N to succeed. Our
main results develop an understanding of the coverage principle, a
phenomenon whereby next-token prediction implicitly optimizes toward a model
with good coverage. In particular, we uncover a mechanism that explains the
power of coverage in predicting downstream performance: coverage
generalizes faster than cross-entropy, avoiding spurious dependence on
problem-dependent parameters such as the sequence length. We also study
practical algorithmic interventions with provable benefits for improving
coverage, including (i) model/checkpoint selection procedures, (ii) gradient
normalization schemes, and (iii) test-time decoding strategies.
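
To make the link between coverage and Best-of-N concrete, the following is a minimal sketch and not the paper's formal definition. It assumes, for illustration only, that the model places probability mass p on high-quality responses, that the N samples are drawn independently, and that a perfect selector picks a high-quality response whenever one is present. The function names best_of_n_success and samples_needed are hypothetical.

```python
import math


def best_of_n_success(p: float, n: int) -> float:
    """Chance that at least one of n independent samples is high-quality.

    Assumption: p is the probability mass the model places on
    high-quality responses, and the selector is perfect.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    # All n samples miss with probability (1 - p)^n.
    return 1.0 - (1.0 - p) ** n


def samples_needed(p: float, target: float) -> int:
    """Smallest n with best_of_n_success(p, n) >= target, for 0 < p, target < 1."""
    if not (0.0 < p < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p and target must lie strictly between 0 and 1")
    # Solve 1 - (1 - p)^n >= target for n.
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


if __name__ == "__main__":
    for p in (0.001, 0.01, 0.1):
        print(p, samples_needed(p, 0.95), best_of_n_success(p, 100))
```

Under these assumptions the number of samples required grows roughly like 1/p as the mass p shrinks, which is one way to see why a model with poor coverage cannot be rescued by sampling more at test time.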