Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference
2510.21184v1
cs.LG, cs.AI, cs.CL, stat.ML
2025-10-28
Авторы:
Stephen Zhao, Aidan Li, Rob Brekelmans, Roger Grosse
Abstract
Reinforcement learning (RL) has become a predominant technique to align
language models (LMs) with human preferences or promote outputs which are
deemed to be desirable by a given reward function. Standard RL approaches
optimize average reward, while methods explicitly focused on reducing the
probability of undesired outputs typically come at a cost to average-case
performance. To improve this tradeoff, we introduce RePULSe, a new training
method that augments the standard RL loss with an additional loss that uses
learned proposals to guide sampling low-reward outputs, and then reduces those
outputs' probability. We run experiments demonstrating that RePULSe produces a
better tradeoff of expected reward versus the probability of undesired outputs
and is more adversarially robust, compared to standard RL alignment approaches
and alternatives.