Offline Preference Optimization via Maximum Marginal Likelihood Estimation
2510.22881v1
cs.LG, cs.CL
2025-10-29
Authors:
Saeed Najafi, Alona Fyshe
Abstract
Aligning Large Language Models (LLMs) with human preferences is crucial, but
standard methods like Reinforcement Learning from Human Feedback (RLHF) are
often complex and unstable. In this work, we propose a new, simpler approach
that recasts alignment through the lens of Maximum Marginal Likelihood (MML)
estimation. Our MML-based Preference Optimization (MMPO) maximizes the
marginal log-likelihood of a preferred text output, using the preference pair
as samples for approximation, and forgoes the need for both an explicit reward
model and entropy maximization. We theoretically demonstrate that MMPO
implicitly performs preference optimization, producing a weighted gradient that
naturally up-weights chosen responses over rejected ones. Across models ranging
from 135M to 8B parameters, we empirically show that MMPO: 1) is more stable
with respect to the hyperparameter $\beta$ than alternative baselines,
and 2) achieves competitive or superior preference alignment while better
preserving the base model's general language capabilities. Through a series of
ablation experiments, we show that this improved performance is indeed
attributable to MMPO's implicit preference optimization within the gradient
updates.
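
To make the "weighted gradient" idea concrete, here is a minimal sketch of a general property of log-marginal objectives, not the paper's actual MMPO loss: when an objective has the form log of a sum over samples, its gradient with respect to each sample's log term is a softmax weight, so samples with larger terms receive larger updates. The function names and the two numeric values standing in for a (chosen, rejected) pair are hypothetical; the abstract does not specify how MMPO defines these terms.

```python
import math


def log_marginal(log_terms):
    """Stable log(sum_i exp(log_terms[i])), a sample-based log-marginal."""
    m = max(log_terms)
    return m + math.log(sum(math.exp(v - m) for v in log_terms))


def gradient_weights(log_terms):
    """Derivative of log_marginal w.r.t. each log term (a softmax)."""
    lm = log_marginal(log_terms)
    return [math.exp(v - lm) for v in log_terms]


# Hypothetical per-sample log terms: index 0 = chosen, index 1 = rejected.
# These values are illustrative assumptions, not taken from the paper.
log_terms = [-1.0, -3.0]
weights = gradient_weights(log_terms)

for name, w in zip(("chosen", "rejected"), weights):
    print(f"{name}: gradient weight = {w:.4f}")
```

In this toy setting the weights are self-normalized (they sum to one), and the sample with the larger log term dominates the update. Whether MMPO's weights take exactly this form is an assumption here.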