Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference

2510.21184v1 cs.LG, cs.AI, cs.CL, stat.ML 2025-10-28

Authors:

Stephen Zhao, Aidan Li, Rob Brekelmans, Roger Grosse

Abstract

Reinforcement learning (RL) has become a predominant technique for aligning language models (LMs) with human preferences or promoting outputs that a given reward function deems desirable. Standard RL approaches optimize average reward, while methods explicitly focused on reducing the probability of undesired outputs typically come at a cost to average-case performance. To improve this tradeoff, we introduce RePULSe, a new training method that augments the standard RL loss with an additional loss that uses learned proposals to guide the sampling of low-reward outputs, and then reduces the probability of those outputs. We run experiments demonstrating that RePULSe produces a better tradeoff of expected reward versus the probability of undesired outputs and is more adversarially robust, compared to standard RL alignment approaches and alternatives.
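The idea of adding a second loss that pushes down the probability of low-reward outputs can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a policy over three discrete outputs instead of a language model, exact expectations instead of sampling, and a closed-form stand-in proposal q(x) proportional to p(x) * exp(-beta * r(x)) where the abstract describes a learned proposal. The names `train`, `alpha`, `beta`, and the reward values are all hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(rewards, alpha, beta=1.0, lr=0.1, steps=50):
    """Toy policy over len(rewards) discrete outputs, updated with exact gradients.

    alpha weights the hypothetical extra loss; alpha=0 is plain expected-reward ascent.
    """
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        p = softmax(logits)
        expected_reward = sum(pi * ri for pi, ri in zip(p, rewards))
        # Stand-in proposal (assumption): tilt the policy toward low reward,
        # q(x) proportional to p(x) * exp(-beta * r(x)). The paper learns its proposal.
        w = [pi * math.exp(-beta * ri) for pi, ri in zip(p, rewards)]
        z = sum(w)
        q = [wi / z for wi in w]
        for i in range(len(logits)):
            rl_grad = p[i] * (rewards[i] - expected_reward)  # d E_p[r] / d logit_i
            push_down = q[i] - p[i]  # d E_q[log p] / d logit_i, with q held fixed
            # Ascend expected reward, descend log-probability under the proposal.
            logits[i] += lr * (rl_grad - alpha * push_down)
    return softmax(logits)

rewards = [1.0, 0.9, -5.0]  # hypothetical rewards; index 2 is the undesired output
p_rl = train(rewards, alpha=0.0)
p_aug = train(rewards, alpha=1.0)
print(f"P(undesired), RL only:   {p_rl[2]:.4f}")
print(f"P(undesired), augmented: {p_aug[2]:.4f}")
```

In this toy setting the extra term keeps pushing on the undesired output even after its probability under the policy is small, because the tilted proposal still concentrates mass there. Whether that carries over to real language models, and at what cost to average reward, is the question the paper's experiments address.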
