Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs
2509.25380v1
cs.LG, cs.AI, cs.CL
2025-10-02
Авторы:
Shane Bergsma, Nolan Dey, Joel Hestness
Abstract
Data curriculums have become central to successful LLM training, yet
principles governing optimal data placement remain unclear. We introduce the
*training re-evaluation curve (TREC)*, a diagnostic that retrospectively
evaluates training batches *using the final model weights*. The TREC
characterizes how well a trained model retains training data as a function of
*when* the data was encountered during training. Analyzing TRECs for models
from 111M to 3.9B parameters, we show that placing high-quality data at low
points on the TREC significantly improves performance. Importantly, while a
TREC is initially observable only after training, we demonstrate it can be
*predicted in advance* from AdamW's implicit EMA coefficients, enabling
proactive curriculum design. By predicting TRECs for published training
recipes, we explain prior ablations and reveal suboptimal data placements. We
also align high-quality data with TREC minima in order to improve continual
pre-training of a 3.9B-parameter LLM trained on 900B tokens.
Ссылки и действия
Дополнительные ресурсы: