Adaptive Margin RLHF via Preference over Preferences
2509.22851v1
cs.LG, cs.AI, cs.CL
2025-10-01
Authors:
Yaswanth Chittepu, Prasann Singhal, Greg Durrett, Scott Niekum
Abstract
Margin-based optimization is fundamental to improving generalization and
robustness in classification tasks. In the context of reward model learning
from preferences within Reinforcement Learning from Human Feedback (RLHF),
existing methods typically rely on no margins, fixed margins, or margins that
are simplistic functions of preference ratings. However, such formulations
often fail to account for the varying strengths of different preferences (for
example, some preferences reflect a much larger gap between responses than
others), or they rely on noisy margin information derived from ratings. We argue that
modeling the strength of preferences can lead to better generalization and more
faithful alignment. Furthermore, many existing methods that use adaptive
margins assume access to accurate preference scores, which can be difficult for
humans to provide reliably. We propose an approach that leverages preferences
over preferences, that is, annotations indicating which of two preferences
reflects a stronger distinction. We use this ordinal signal to infer adaptive
margins on a per-datapoint basis. We introduce an extension to Direct
Preference Optimization (DPO), DPO-PoP, that incorporates adaptive margins from
preference-over-preference supervision, enabling improved discriminative and
generative performance. Empirically, our method outperforms vanilla DPO, DPO
with fixed margins, and DPO with ground-truth margins on the UltraFeedback
dataset. Additionally, we show that there is a tradeoff between discriminative
and generative performance: improving test classification accuracy,
particularly by correctly labeling weaker preferences at the expense of
stronger ones, can lead to a decline in generative quality. To navigate this
tradeoff, we propose two sampling strategies to gather
preference-over-preference labels: one favoring discriminative performance and
one favoring generative performance.
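
To make the role of a per-datapoint margin concrete, the following is a minimal sketch of a generic margin-augmented DPO loss for a single preference pair. It is not the paper's DPO-PoP objective, which the abstract does not spell out. The function names, the `beta` default, and the way the margin is subtracted inside the sigmoid are assumptions chosen for illustration, and how margins are inferred from preference-over-preference labels is deliberately left out.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_margin_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    margin: float = 0.0,  # hypothetical per-example margin; 0.0 recovers vanilla DPO
    beta: float = 0.1,    # assumed temperature on the implicit reward
) -> float:
    """Illustrative DPO loss with an additive margin for one preference pair."""
    # Implicit reward gap between the chosen and rejected responses.
    reward_gap = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # The pair is only "satisfied" once the gap exceeds the margin.
    return -log_sigmoid(reward_gap - margin)
```

With `margin=0.0` this reduces to the standard DPO loss. A larger margin raises the loss for the same reward gap, so pairs judged to express a stronger preference push the policy to separate the two responses further.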