Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging
2510.17426v1
cs.CL, cs.AI, cs.LG
2025-10-22
Authors:
Tiancheng Hu, Benjamin Minixhofer, Nigel Collier
Abstract
The "alignment tax" of post-training is typically framed as a drop in task
accuracy. We show it also involves a severe loss of calibration, leaving models
overconfident and less reliable, and their outputs less diverse. We show that this
trade-off can be navigated effectively via a simple post-hoc intervention:
interpolating between a model's weights before and after alignment. Crucially,
this is not a strict trade-off. We find that the process consistently reveals
Pareto-optimal interpolations: models that improve accuracy beyond both
parents while substantially recovering the calibration lost during alignment.
Our work demonstrates that simple model merging provides a computationally
efficient method for mitigating the full scope of the alignment tax, yielding
models that are more capable and more reliable.
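
As an illustration of the intervention described above, the following is a minimal sketch of weight interpolation between a pre-alignment and a post-alignment checkpoint. It assumes plain linear interpolation, merged = (1 - lam) * base + lam * aligned, applied parameter by parameter; the abstract does not specify the exact merging scheme, so this may differ from the authors' procedure. The names `interpolate_weights`, `sweep`, and `lam`, and the representation of a checkpoint as a dict of flat float lists, are hypothetical choices made for this example.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical checkpoint representation: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def interpolate_weights(base: Weights, aligned: Weights, lam: float) -> Weights:
    """Return (1 - lam) * base + lam * aligned for every parameter.

    lam = 0.0 reproduces the pre-alignment weights,
    lam = 1.0 reproduces the post-alignment weights.
    """
    if base.keys() != aligned.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged: Weights = {}
    for name in base:
        b, a = base[name], aligned[name]
        if len(b) != len(a):
            raise ValueError(f"size mismatch for parameter {name!r}")
        merged[name] = [(1.0 - lam) * x + lam * y for x, y in zip(b, a)]
    return merged


def sweep(base: Weights, aligned: Weights, steps: int = 5) -> Iterator[Tuple[float, Weights]]:
    """Yield (lam, merged weights) for evenly spaced lam in [0, 1]."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    for i in range(steps):
        lam = i / (steps - 1)
        yield lam, interpolate_weights(base, aligned, lam)
```

In practice, each merged checkpoint from such a sweep would be evaluated for both accuracy and calibration, and the interpolation coefficient chosen from the resulting frontier. The evaluation step is omitted here because the abstract does not describe it.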