Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead
2510.01624v1
cs.LG, cs.AI, cs.CL
2025-10-04
Авторы:
Feiyang Kang, Michael Kuchnik, Karthik Padthe, Marin Vlastelica, Ruoxi Jia, Carole-Jean Wu, Newsha Ardalani
Abstract
In post-training for reasoning Large Language Models (LLMs), the current
state of practice trains LLMs in two independent stages: Supervised Fine-Tuning
(SFT) and Reinforcement Learning with Verifiable Rewards (RLVR, shortened as
``RL'' below). In this work, we challenge whether high SFT scores translate to
improved performance after RL. We provide extensive counter-examples where this
is not true. We find high SFT scores can be biased toward simpler or more
homogeneous data and are not reliably predictive of subsequent RL gains or
scaled-up post-training effectiveness. In some cases, RL training on models
with improved SFT performance could lead to substantially worse outcome
compared to RL on the base model without SFT. We study alternative metrics and
identify generalization loss on held-out reasoning examples and Pass@large k
performance to provide strong proxies for the RL outcome. We trained hundreds
of models up to 12B-parameter with SFT and RLVR via GRPO and ran extensive
evaluations on 7 math benchmarks with up to 256 repetitions, spending $>$1M GPU
hours. Experiments include models from Llama3, Mistral-Nemo, Qwen3 and multiple
state-of-the-art SFT/RL datasets. Compared to directly predicting from pre-RL
performance, prediction based on generalization loss and Pass@large k achieves
substantial higher precision, improving $R^2$ coefficient and Spearman's rank
correlation coefficient by up to 0.5 (2x). This provides strong utility for
broad use cases. For example, in most experiments, we find SFT training on
unique examples for a one epoch underperforms training on half examples for two
epochs, either after SFT or SFT-then-RL; With the same SFT budget, training
only on short examples may lead to better SFT performance, though, it often
leads to worse outcome after RL compared to training on examples with varying
lengths. Evaluation tool will be open-sourced.
Ссылки и действия
Дополнительные ресурсы: