Facts in Stats: Impacts of Pretraining Diversity on Language Model Generalization
2510.16096v1
cs.CL, cs.LG
2025-10-22
Авторы:
Tina Behnia, Puneesh Deora, Christos Thrampoulidis
Abstract
Language models are pretrained on sequences that blend statistical
regularities (making text fluent) with factual associations between specific
tokens (knowledge of facts). While recent work suggests that the variability of
their interaction, such as paraphrases of factual associations, critically
determines generalization ability, we lack a systematic analysis of these
impacts. This paper introduces a flexible synthetic testbed that combines a
statistical stream of generic tokens with an abstract factual stream of
source-target token pairs, enabling fine-grained control over their
interaction. The design enables the independent control of diversity nature by
manipulating stream composition (contextual structure) and the diversity level
by varying which statistical streams each fact appears in. Through controlled
experiments, we find that while higher contextual diversity delays
in-distribution (ID) factual accuracy, its impact on out-of-distribution (OOD)
factual generalization depends critically on contextual structure. In some
cases, OOD performance follows the same trend as ID, but in others, diversity
becomes essential for non-trivial factual recall. Even when low diversity
prohibits factual recall, optimal diversity levels depend on training duration.
Beyond factual recall failures, we identify structures where statistical
generalization fails independently, and others where both capabilities degrade.
This shows how the interplay between contextual design and diversity level
impacts different generalization aspects. Further, through a series of
controlled interventions on the model components, we trace the OOD failures to
distinct optimization bottlenecks, highlighting the importance of the embedding
and unembedding layers. Our synthetic framework allows us to isolate effects
that would be confounded in large-scale studies, offering a controlled testbed
for future investigations.
Ссылки и действия
Дополнительные ресурсы: