The Sign Estimator: LLM Alignment in the Face of Choice Heterogeneity

2510.23965v2 cs.AI, cs.LG, stat.ML 2025-10-30

Authors:

Ali Aouad, Aymane El Gadarri, Vivek F. Farias

Abstract

Traditional LLM alignment methods are vulnerable to heterogeneity in human preferences. Fitting a naive probabilistic model to pairwise comparison data (say, over prompt-completion pairs) yields an inconsistent estimate of the population-average utility, a canonical measure of social welfare. We propose a new method, dubbed the sign estimator, that provides a simple, provably consistent, and efficient estimator by replacing cross-entropy with a binary classification loss in the aggregation step. This simple modification recovers consistent ordinal alignment under mild assumptions and achieves the first polynomial finite-sample error bounds in this setting. In realistic simulations of LLM alignment using digital twins, the sign estimator substantially reduces preference distortion over a panel of simulated personas, cutting (angular) estimation error by nearly 35% and decreasing disagreement with true population preferences from 12% to 8% compared to standard RLHF. Our method also compares favorably to panel data heuristics that explicitly model user heterogeneity and require tracking individual-level preference data, all while maintaining the implementation simplicity of existing LLM alignment pipelines.
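To make the contrast between the two losses concrete, here is a minimal sketch, not the authors' implementation. It assumes a linear utility u(x) = theta · x and encodes each pairwise comparison as a feature difference x_a - x_b with label 1 when a was chosen. The function names (`cross_entropy_loss`, `classification_loss`, `angular_error`) and the toy data are hypothetical; the abstract does not specify the paper's exact aggregation step.

```python
import numpy as np

def cross_entropy_loss(theta, X, y):
    """Mean logistic loss over comparisons (assumed linear utility).

    X: rows are feature differences x_a - x_b; y: 1 if a was chosen, else 0.
    """
    s = 2.0 * y - 1.0                      # map {0, 1} labels to {-1, +1}
    z = X @ theta                          # utility gap for each comparison
    return float(np.mean(np.logaddexp(0.0, -s * z)))  # stable log(1 + exp(-s*z))

def classification_loss(theta, X, y):
    """Mean 0-1 loss: share of comparisons whose predicted sign disagrees."""
    s = 2.0 * y - 1.0
    return float(np.mean(np.sign(X @ theta) != s))

def angular_error(theta_hat, theta_true):
    """Angle in radians between an estimated and a reference direction."""
    cos = theta_hat @ theta_true / (
        np.linalg.norm(theta_hat) * np.linalg.norm(theta_true)
    )
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

# Toy data (hypothetical): three comparisons in a 2-d feature space.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]])
y = np.array([1, 0, 0])
theta = np.array([1.0, -1.0])

ce = cross_entropy_loss(theta, X, y)
zero_one = classification_loss(theta, X, y)
```

The 0-1 loss depends only on the sign of each predicted utility gap, so rescaling `theta` leaves it unchanged while the cross-entropy value moves. A classification loss therefore identifies at most a direction, which is consistent with the abstract reporting an angular estimation error.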
