Mix- and MoE-DPO: A Variational Inference Approach to Direct Preference Optimization
2510.08256v1
cs.LG, cs.AI, cs.CL
2025-10-11
Authors:
Jason Bohne, Pawel Polak, David Rosenberg, Brian Bloniarz, Gary Kazantsev
Abstract
Direct Preference Optimization (DPO) has recently emerged as a simple and
effective alternative to reinforcement learning from human feedback (RLHF) for
aligning large language models (LLMs) with user preferences. However, existing
DPO formulations rely on a single monolithic model, which limits their
expressivity in multi-task settings and their adaptability to heterogeneous or
diverse preference distributions. In this work, we propose Mix- and MoE-DPO, a
framework that extends DPO with both soft mixture models and mixture-of-experts
(MoE) architectures, using a stochastic variational inference approach. Our
method introduces a latent-variable model over expert assignments and optimizes
a variational evidence lower bound (ELBO), enabling stable and efficient
learning of specialized expert policies from preference data. Mix- and MoE-DPO
provides three key advantages over standard DPO: (i) generalization via
universal function approximation through mixtures; (ii) reward and policy
specialization through expert components tailored to distinct preference modes;
and (iii) contextual alignment through input-dependent soft gating that enables
user-specific mixture policies. Our framework supports both shared base
architectures with expert-specific policy heads and fully independent expert
models, allowing flexible trade-offs between parameter efficiency and
specialization. We validate our approach on a variety of model sizes and
multi-preference datasets, demonstrating that Mix- and MoE-DPO offers a
powerful and scalable method for preference-based LLM alignment.
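
The latent-variable view described in the abstract can be illustrated with a small, generic sketch. This is not the authors' implementation: it assumes a textbook mixture model in which each of K experts scores one preference pair with the standard DPO likelihood, sigmoid(beta * margin_k), where margin_k is expert k's log-ratio margin against the reference policy, and an input-dependent gate supplies the prior over expert assignments. All names (`expert_log_liks`, `posterior`, `elbo`, the example margins and gate logits) are hypothetical. The sketch shows that the ELBO lower-bounds the mixture log-likelihood and becomes tight when the variational distribution equals the exact posterior over experts.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def softmax(logits):
    # Convert logits to probabilities that sum to 1.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def expert_log_liks(margins, beta=0.1):
    # margins[k] is assumed to be expert k's DPO margin for one pair:
    # [log pi_k(y_w|x) - log pi_ref(y_w|x)] - [log pi_k(y_l|x) - log pi_ref(y_l|x)]
    return [log_sigmoid(beta * m) for m in margins]

def log_marginal(log_liks, gate):
    # log sum_k gate_k * p_k: preference log-likelihood under the mixture.
    return logsumexp([ll + math.log(g) for ll, g in zip(log_liks, gate)])

def posterior(log_liks, gate):
    # Exact posterior over expert assignments, proportional to gate_k * p_k.
    return softmax([ll + math.log(g) for ll, g in zip(log_liks, gate)])

def elbo(log_liks, gate, q):
    # sum_k q_k * (log p_k + log gate_k - log q_k); terms with q_k = 0 contribute 0.
    return sum(qk * (ll + math.log(g) - math.log(qk))
               for ll, g, qk in zip(log_liks, gate, q) if qk > 0)

# Hypothetical numbers for a single preference pair and K = 3 experts.
margins = [4.0, -2.0, 0.5]
gate = softmax([0.2, 0.1, -0.3])   # stand-in for an input-dependent gating network
log_liks = expert_log_liks(margins, beta=0.5)
q_star = posterior(log_liks, gate)

print("log marginal:      ", log_marginal(log_liks, gate))
print("ELBO (posterior q):", elbo(log_liks, gate, q_star))
print("ELBO (uniform q):  ", elbo(log_liks, gate, [1 / 3] * 3))
```

In a training loop one would typically alternate between computing `q_star` for each pair and taking gradient steps on the expert and gate parameters to increase the ELBO; how the paper schedules these steps is not stated in the abstract.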