Unraveling Syntax: How Language Models Learn Context-Free Grammars
2510.02524v1
cs.CL, cs.FL, cs.LG
2025-10-07
Авторы:
Laura Ying Schulz, Daniel Mitropolsky, Tomaso Poggio
Abstract
We introduce a new framework for understanding how language models acquire
syntax. While large models achieve impressive results, little is known about
their learning dynamics. Our approach starts with the observation that most
domains of interest, such as natural language syntax, coding languages,
arithmetic problems, are captured by probabilistic context-free grammars
(PCFGs). We study the learning dynamics of small models trained on synthetic
languages generated from PCFGs, enabling precise control over grammar
complexity, recursion depth, and subgrammar structure. We prove several
general, recursive formulae for the training loss and Kullback-Leibler
divergence over the subgrammar structure of a PCFG. Empirically, we find that
unlike children, who first master simple substructures before progressing to
more complex constructions, transformers reduce loss across all subgrammars in
parallel. We further show that subgrammar pretraining can improve the final
loss for smaller models, and that pretrained models develop internal
representations more aligned with the grammar's substructure. Finally, we
demonstrate that models struggle with deeper recursive structures (a limitation
even of large language models), revealing fundamental challenges in how neural
networks represent hierarchical syntax. Overall, our work initiates the study
of the learning dynamics of transformers on PCFGs as a versatile testbed for
probing learning in language models, opening a research direction with many
open questions.
Ссылки и действия
Дополнительные ресурсы: