Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models
2510.06107v2
cs.CL, cs.AI, cs.CE
2025-10-10
Authors:
Gagan Bhatia, Somayajulu G Sripada, Kevin Allan, Jacobo Azcona
Abstract
Large Language Models (LLMs) are prone to hallucination, the generation of
plausible yet factually incorrect statements. This work investigates the
intrinsic, architectural origins of this failure mode through three primary
contributions. First, to enable the reliable tracing of internal semantic
failures, we propose Distributional Semantics Tracing (DST), a unified
framework that integrates established interpretability techniques to produce a
causal map of a model's reasoning, treating meaning as a function of context
(distributional semantics). Second, we pinpoint the layer at which a
hallucination becomes inevitable, identifying a specific commitment layer where
the model's internal representations irreversibly diverge from factuality. Third,
we identify the underlying mechanism for these failures. We observe a conflict
between distinct computational pathways, which we interpret using the lens of
dual-process theory: a fast, heuristic associative pathway (akin to System 1)
and a slow, deliberate, contextual pathway (akin to System 2), leading to
predictable failure modes such as Reasoning Shortcut Hijacks. Our framework's
ability to quantify the coherence of the contextual pathway reveals a strong
negative correlation ($\rho = -0.863$) with hallucination rates, implying that
these failures are predictable consequences of internal semantic weakness. The
result is a mechanistic account of how, when, and why hallucinations occur
within the Transformer architecture.
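
As an illustration of the kind of statistic reported above, the following minimal sketch computes a rank correlation between a coherence score and a hallucination rate. The abstract does not say which correlation coefficient $\rho$ denotes; Spearman's rank correlation is assumed here. The function names (`ranks`, `pearson`, `spearman`) and all data values are invented for illustration and are not taken from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy, made-up values (NOT results from the paper): one entry per
# hypothetical model or prompt category.
coherence = [0.91, 0.78, 0.66, 0.52, 0.40]
halluc_rate = [0.05, 0.12, 0.10, 0.30, 0.41]

rho = spearman(coherence, halluc_rate)
print(f"Spearman rho = {rho:.3f}")  # approximately -0.900 for this toy data
```

A strongly negative value, as in this toy case, means that items with higher contextual-pathway coherence tend to have lower hallucination rates, which is the direction of the relationship the abstract reports.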