The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives
2510.06096v2
cs.LG, cs.CL
2025-10-09
Authors:
Matthieu Bou, Nyal Patel, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo
Abstract
The objectives that Large Language Models (LLMs) implicitly optimize remain
dangerously opaque, making trustworthy alignment and auditing a grand
challenge. While Inverse Reinforcement Learning (IRL) can infer reward
functions from behaviour, existing approaches either produce a single,
overconfident reward estimate or fail to address the fundamental ambiguity of
the task (non-identifiability). This paper introduces a principled auditing
framework that re-frames reward inference from a simple estimation task to a
comprehensive process for verification. Our framework leverages Bayesian IRL
not only to recover a distribution over objectives but also to enable three critical
audit capabilities: (i) Quantifying and systematically reducing
non-identifiability by demonstrating posterior contraction over sequential
rounds of evidence; (ii) Providing actionable, uncertainty-aware diagnostics
that expose spurious shortcuts and identify out-of-distribution prompts where
the inferred objective cannot be trusted; and (iii) Validating policy-level
utility by showing that the refined, low-uncertainty reward can be used
directly in RLHF to achieve training dynamics and toxicity reductions
comparable to the ground-truth alignment process. Empirically, our framework
successfully audits a detoxified LLM, yielding a well-calibrated and
interpretable objective that strengthens alignment guarantees. Overall, this
work provides a practical toolkit for auditors, safety teams, and regulators to
verify what LLMs are truly trying to achieve, moving us toward more trustworthy
and accountable AI.
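
To make the idea of posterior contraction in capability (i) concrete, here is a minimal toy sketch. It is not the authors' implementation. It assumes a single scalar reward weight, a Bradley-Terry-style preference likelihood, a flat prior on a fixed grid, and synthetic data. Every name in it (`run_rounds`, `posterior_std`, `true_w`, `batch_size`) is hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def posterior_std(grid, log_post):
    """Standard deviation of a discrete posterior given unnormalised log weights."""
    m = max(log_post)  # subtract the max for numerical stability
    weights = [math.exp(lp - m) for lp in log_post]
    total = sum(weights)
    probs = [w / total for w in weights]
    mean = sum(p * g for p, g in zip(probs, grid))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, grid))
    return math.sqrt(var)


def run_rounds(true_w=1.5, n_rounds=5, batch_size=40, seed=0):
    """Update a grid posterior over a scalar reward weight, one evidence round at a time.

    Assumption (toy model): each observation is a feature difference x between
    two responses and a binary preference y with P(y = 1) = sigmoid(w * x).
    Returns the posterior standard deviation before any data and after each round.
    """
    rng = random.Random(seed)
    grid = [-4.0 + 8.0 * i / 200 for i in range(201)]  # candidate weights in [-4, 4]
    log_post = [0.0] * len(grid)                       # flat prior over the grid
    stds = [posterior_std(grid, log_post)]
    for _ in range(n_rounds):
        for _ in range(batch_size):
            x = rng.uniform(-2.0, 2.0)
            y = 1 if rng.random() < sigmoid(true_w * x) else 0
            # Add this observation's log-likelihood to every candidate weight.
            for i, w in enumerate(grid):
                p = sigmoid(w * x)
                log_post[i] += math.log(p) if y else math.log(1.0 - p)
        stds.append(posterior_std(grid, log_post))
    return stds


if __name__ == "__main__":
    for round_idx, s in enumerate(run_rounds()):
        print(f"round {round_idx}: posterior std = {s:.3f}")
```

In this toy setting the printed standard deviation is a crude proxy for remaining non-identifiability: it starts at the spread of the flat prior and narrows as rounds of evidence accumulate. A real audit would involve a much higher-dimensional reward model and approximate inference rather than an exhaustive grid.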