Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting
2510.18874v1
cs.LG, cs.CL
2025-10-23
Авторы:
Howard Chen, Noam Razin, Karthik Narasimhan, Danqi Chen
Abstract
Adapting language models (LMs) to new tasks via post-training carries the
risk of degrading existing capabilities -- a phenomenon classically known as
catastrophic forgetting. In this paper, toward identifying guidelines for
mitigating this phenomenon, we systematically compare the forgetting patterns
of two widely adopted post-training methods: supervised fine-tuning (SFT) and
reinforcement learning (RL). Our experiments reveal a consistent trend across
LM families (Llama, Qwen) and tasks (instruction following, general knowledge,
and arithmetic reasoning): RL leads to less forgetting than SFT while achieving
comparable or higher target task performance. To investigate the cause for this
difference, we consider a simplified setting in which the LM is modeled as a
mixture of two distributions, one corresponding to prior knowledge and the
other to the target task. We identify that the mode-seeking nature of RL, which
stems from its use of on-policy data, enables keeping prior knowledge intact
when learning the target task. We then verify this insight by demonstrating
that the use on-policy data underlies the robustness of RL to forgetting in
practical settings, as opposed to other algorithmic choices such as the KL
regularization or advantage estimation. Lastly, as a practical implication, our
results highlight the potential of mitigating forgetting using approximately
on-policy data, which can be substantially more efficient to obtain than fully
on-policy data.
Ссылки и действия
Дополнительные ресурсы: