The Sign Estimator: LLM Alignment in the Face of Choice Heterogeneity
2510.23965v2
cs.AI, cs.LG, stat.ML
2025-10-30
Авторы:
Ali Aouad, Aymane El Gadarri, Vivek F. Farias
Abstract
Traditional LLM alignment methods are vulnerable to heterogeneity in human
preferences. Fitting a na\"ive probabilistic model to pairwise comparison data
(say over prompt-completion pairs) yields an inconsistent estimate of the
population-average utility -a canonical measure of social welfare. We propose a
new method, dubbed the sign estimator, that provides a simple, provably
consistent, and efficient estimator by replacing cross-entropy with binary
classification loss in the aggregation step. This simple modification recovers
consistent ordinal alignment under mild assumptions and achieves the first
polynomial finite-sample error bounds in this setting. In realistic simulations
of LLM alignment using digital twins, the sign estimator substantially reduces
preference distortion over a panel of simulated personas, cutting (angular)
estimation error by nearly 35% and decreasing disagreement with true population
preferences from 12% to 8% compared to standard RLHF. Our method also compares
favorably to panel data heuristics that explicitly model user heterogeneity and
require tracking individual-level preference data-all while maintaining the
implementation simplicity of existing LLM alignment pipelines.
Ссылки и действия
Дополнительные ресурсы: