Epidemiology of Large Language Models: A Benchmark for Observational Distribution Knowledge
2511.03070v1
cs.AI, cs.LG, stat.ML
2025-11-07
Авторы:
Drago Plecko, Patrik Okanovic, Torsten Hoefler, Elias Bareinboim
Abstract
Artificial intelligence (AI) systems hold great promise for advancing various
scientific disciplines, and are increasingly used in real-world applications.
Despite their remarkable progress, further capabilities are expected in order
to achieve more general types of intelligence. A critical distinction in this
context is between factual knowledge, which can be evaluated against true or
false answers (e.g., "what is the capital of England?"), and probabilistic
knowledge, reflecting probabilistic properties of the real world (e.g., "what
is the sex of a computer science graduate in the US?"). In this paper, our goal
is to build a benchmark for understanding the capabilities of LLMs in terms of
knowledge of probability distributions describing the real world. Given that
LLMs are trained on vast amounts of text, it may be plausible that they
internalize aspects of these distributions. Indeed, LLMs are touted as powerful
universal approximators of real-world distributions. At the same time,
classical results in statistics, known as curse of dimensionality, highlight
fundamental challenges in learning distributions in high dimensions,
challenging the notion of universal distributional learning. In this work, we
develop the first benchmark to directly test this hypothesis, evaluating
whether LLMs have access to empirical distributions describing real-world
populations across domains such as economics, health, education, and social
behavior. Our results demonstrate that LLMs perform poorly overall, and do not
seem to internalize real-world statistics naturally. When interpreted in the
context of Pearl's Causal Hierarchy (PCH), our benchmark demonstrates that
language models do not contain knowledge on observational distributions (Layer
1 of PCH), and thus the Causal Hierarchy Theorem implies that interventional
(Layer 2) and counterfactual (Layer 3) knowledge of these models is also
limited.
Ссылки и действия
Дополнительные ресурсы: