PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models
2509.25774v1
cs.CV, cs.AI, cs.LG
2025-10-02
Authors:
Jeongjae Lee, Jong Chul Ye
Abstract
While reinforcement learning has advanced the alignment of text-to-image
(T2I) models, state-of-the-art policy gradient methods are still hampered by
training instability and high variance, hindering convergence speed and
compromising image quality. Our analysis identifies a key cause of this
instability: disproportionate credit assignment, in which the mathematical
structure of the generative sampler produces volatile and non-proportional
feedback across timesteps. To address this, we introduce Proportionate Credit
Policy Optimization (PCPO), a framework that enforces proportional credit
assignment through a stable objective reformulation and a principled
reweighting of timesteps. This correction stabilizes the training process,
leading to significantly accelerated convergence and superior image quality.
The improvement in quality is a direct result of mitigating model collapse, a
common failure mode in recursive training. PCPO substantially outperforms
existing policy gradient baselines, including the state-of-the-art
DanceGRPO, on all fronts.