Readability $\ne$ Learnability: Rethinking the Role of Simplicity in Training Small Language Models
2510.13915v1
cs.CL, cs.AI, cs.LG
2025-10-18
Авторы:
Ivan Lee, Taylor Berg-Kirkpatrick
Abstract
Recent studies suggest that very small language models (SLMs) can generate
surprisingly coherent text when trained on simplified, child-directed corpora
such as TinyStories. These findings have been interpreted as evidence that
readability -- characterized by accessible vocabulary, familiar narrative
structure, and simple syntax -- plays a key role in enabling such capabilities
to emerge. In this paper, we challenge that interpretation. We construct
synthetic datasets with matched structure but varied readability, and find that
readability alone does not predict coherence or learning efficiency in SLMs.
Models trained on complex, adult-level text perform comparably to those trained
on simplified language, and even exhibit faster development of coherence during
training. Instead, we show that statistical simplicity, as measured by n-gram
diversity, is a stronger predictor of learnability. Our findings caution
against the growing trend of anthropomorphizing language model training --
drawing parallels to human cognitive development without empirical basis -- and
argue for more precise reasoning about what properties actually support
capability emergence in small models.
Ссылки и действия
Дополнительные ресурсы: