Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only
2510.21090v1
cs.CL, cs.AI, cs.LG
2025-10-28
Authors:
Qingru Zhang, Liang Qiu, Ilgee Hong, Zhenghao Xu, Tianyi Liu, Shiyang Li, Rongzhi Zhang, Zheng Li, Lihong Li, Bing Yin, Chao Zhang, Jianshu Chen, Haoming Jiang, Tuo Zhao
Abstract
Supervised fine-tuning (SFT) has emerged as a crucial method for aligning
large language models (LLMs) with human-annotated demonstrations. However, SFT,
being an off-policy approach similar to behavior cloning, often struggles with
overfitting and poor out-of-domain generalization, especially in limited-data
scenarios. To address these limitations, we propose Self-Rewarding PPO, a novel
fine-tuning method that leverages on-policy techniques to enhance
generalization performance. Our approach combines the strengths of SFT and
proximal policy optimization (PPO) to achieve more effective alignment from
demonstration data. At its core is a reward function designed as the log policy
ratio between the SFT model and the pretrained base model. This function serves
as an implicit reward signal, using the pretrained policy as a baseline and the
SFT policy as a target. By doing so, it enables on-policy fine-tuning without
relying on human preference annotations. The integration of this self-rewarding
mechanism with PPO addresses key limitations of SFT, improving generalization,
data efficiency, and robustness. Our empirical evaluation across a range of
natural language processing tasks demonstrates that Self-Rewarding PPO
consistently outperforms traditional SFT methods. The results highlight the
effectiveness of our approach in aligning LLMs using demonstration data,
particularly in scenarios where high-quality annotated data is scarce.
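
The reward described in the abstract, the log policy ratio between the SFT model and the pretrained base model, can be illustrated with a small sketch. The abstract does not say whether the ratio is taken per token or per sequence, or whether it is scaled, so the version below assumes an unscaled sequence-level reward, r(x, y) = log pi_SFT(y | x) - log pi_base(y | x), computed from per-token log-probabilities. The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
import math


def sequence_log_prob(token_log_probs):
    """Sum per-token log-probabilities to obtain log pi(y | x)."""
    return sum(token_log_probs)


def log_ratio_reward(sft_token_log_probs, base_token_log_probs):
    """Implicit reward: log pi_SFT(y | x) - log pi_base(y | x).

    Assumption: both lists score the same response y, token by token,
    under the SFT policy and the pretrained base policy respectively.
    """
    if len(sft_token_log_probs) != len(base_token_log_probs):
        raise ValueError("both models must score the same tokens")
    return (sequence_log_prob(sft_token_log_probs)
            - sequence_log_prob(base_token_log_probs))


# Hypothetical per-token probabilities for one two-token response.
sft_lp = [math.log(0.5), math.log(0.4)]     # SFT policy
base_lp = [math.log(0.25), math.log(0.1)]   # pretrained base policy

# Positive when the SFT policy finds the response more likely than the base.
reward = log_ratio_reward(sft_lp, base_lp)
```

In a PPO loop, a scalar of this kind would stand in for the output of a learned reward model: responses sampled from the current policy are scored by the two frozen models, so no human preference annotations are needed.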