Language Generation and Identification From Partial Enumeration: Tight Density Bounds and Topological Characterizations
2511.05295v1
cs.DS, cs.CL, cs.DM, cs.LG
2025-11-11
Авторы:
Jon Kleinberg, Fan Wei
Abstract
The success of large language models (LLMs) has motivated formal theories of
language generation and learning. We study the framework of \emph{language
generation in the limit}, where an adversary enumerates strings from an unknown
language $K$ drawn from a countable class, and an algorithm must generate
unseen strings from $K$. Prior work showed that generation is always possible,
and that some algorithms achieve positive lower density, revealing a
\emph{validity--breadth} trade-off between correctness and coverage. We resolve
a main open question in this line, proving a tight bound of $1/2$ on the best
achievable lower density. We then strengthen the model to allow \emph{partial
enumeration}, where the adversary reveals only an infinite subset $C \subseteq
K$. We show that generation in the limit remains achievable, and if $C$ has
lower density $\alpha$ in $K$, the algorithm's output achieves density at least
$\alpha/2$, matching the upper bound. This generalizes the $1/2$ bound to the
partial-information setting, where the generator must recover within a factor
$1/2$ of the revealed subset's density. We further revisit the classical
Gold--Angluin model of \emph{language identification} under partial
enumeration. We characterize when identification in the limit is possible --
when hypotheses $M_t$ eventually satisfy $C \subseteq M \subseteq K$ -- and in
the process give a new topological formulation of Angluin's characterization,
showing that her condition is precisely equivalent to an appropriate
topological space having the $T_D$ separation property.