Epidemiology of Large Language Models: A Benchmark for Observational Distribution Knowledge

2511.03070v1 cs.AI, cs.LG, stat.ML 2025-11-07

Authors:

Drago Plecko, Patrik Okanovic, Torsten Hoefler, Elias Bareinboim

Abstract

Artificial intelligence (AI) systems hold great promise for advancing various scientific disciplines, and are increasingly used in real-world applications. Despite their remarkable progress, further capabilities are expected in order to achieve more general types of intelligence. A critical distinction in this context is between factual knowledge, which can be evaluated against true or false answers (e.g., "what is the capital of England?"), and probabilistic knowledge, reflecting probabilistic properties of the real world (e.g., "what is the sex of a computer science graduate in the US?"). In this paper, our goal is to build a benchmark for understanding the capabilities of LLMs in terms of knowledge of probability distributions describing the real world. Given that LLMs are trained on vast amounts of text, it may be plausible that they internalize aspects of these distributions. Indeed, LLMs are touted as powerful universal approximators of real-world distributions. At the same time, classical results in statistics, known as the curse of dimensionality, highlight fundamental difficulties in learning distributions in high dimensions, calling into question the notion of universal distributional learning. In this work, we develop the first benchmark to directly test this hypothesis, evaluating whether LLMs have access to empirical distributions describing real-world populations across domains such as economics, health, education, and social behavior. Our results demonstrate that LLMs perform poorly overall, and do not seem to internalize real-world statistics naturally. When interpreted in the context of Pearl's Causal Hierarchy (PCH), our benchmark demonstrates that language models do not contain knowledge of observational distributions (Layer 1 of PCH), and thus the Causal Hierarchy Theorem implies that interventional (Layer 2) and counterfactual (Layer 3) knowledge of these models is also limited.
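
To make the idea of "knowledge of a probability distribution" concrete, the following is a minimal sketch of how a model's stated distribution over a categorical variable could be scored against an empirical one. It assumes total variation distance as the score; the abstract does not say which metric the benchmark uses, so this choice, the function name `total_variation`, and all numbers below are illustrative assumptions and not taken from the paper.

```python
from typing import Dict


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two discrete distributions.

    Both arguments map category labels to probabilities. Categories missing
    from one distribution are treated as having probability 0 there.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)


# Placeholder values invented purely for illustration (not real statistics
# and not results from the benchmark).
empirical = {"category_a": 0.25, "category_b": 0.75}     # e.g. from survey data
model_answer = {"category_a": 0.5, "category_b": 0.5}    # e.g. elicited from an LLM

score = total_variation(empirical, model_answer)
print(score)  # 0 means identical distributions, 1 means disjoint support
```

A score near 0 would indicate that the model's stated probabilities match the population data closely, while larger values indicate a gap between the two.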
