The Illusion of Certainty: Uncertainty quantification for LLMs fails under ambiguity
2511.04418v1
cs.LG, cs.CL
2025-11-08
Авторы:
Tim Tomov, Dominik Fuchsgruber, Tom Wollschläger, Stephan Günnemann
Abstract
Accurate uncertainty quantification (UQ) in Large Language Models (LLMs) is
critical for trustworthy deployment. While real-world language is inherently
ambiguous, reflecting aleatoric uncertainty, existing UQ methods are typically
benchmarked against tasks with no ambiguity. In this work, we demonstrate that
while current uncertainty estimators perform well under the restrictive
assumption of no ambiguity, they degrade to close-to-random performance on
ambiguous data. To this end, we introduce MAQA* and AmbigQA*, the first
ambiguous question-answering (QA) datasets equipped with ground-truth answer
distributions estimated from factual co-occurrence. We find this performance
deterioration to be consistent across different estimation paradigms: using the
predictive distribution itself, internal representations throughout the model,
and an ensemble of models. We show that this phenomenon can be theoretically
explained, revealing that predictive-distribution and ensemble-based estimators
are fundamentally limited under ambiguity. Overall, our study reveals a key
shortcoming of current UQ methods for LLMs and motivates a rethinking of
current modeling paradigms.
Ссылки и действия
Дополнительные ресурсы: