A Characterization of List Language Identification in the Limit
2511.04103v1
cs.CL, cs.AI, cs.DS, cs.LG
2025-11-08
Авторы:
Moses Charikar, Chirag Pabbaraju, Ambuj Tewari
Abstract
We study the problem of language identification in the limit, where given a
sequence of examples from a target language, the goal of the learner is to
output a sequence of guesses for the target language such that all the guesses
beyond some finite time are correct. Classical results of Gold showed that
language identification in the limit is impossible for essentially any
interesting collection of languages. Later, Angluin gave a precise
characterization of language collections for which this task is possible.
Motivated by recent positive results for the related problem of language
generation, we revisit the classic language identification problem in the
setting where the learner is given the additional power of producing a list of
$k$ guesses at each time step. The goal is to ensure that beyond some finite
time, one of the guesses is correct at each time step.
We give an exact characterization of collections of languages that can be
$k$-list identified in the limit, based on a recursive version of Angluin's
characterization (for language identification with a list of size $1$). This
further leads to a conceptually appealing characterization: A language
collection can be $k$-list identified in the limit if and only if the
collection can be decomposed into $k$ collections of languages, each of which
can be identified in the limit (with a list of size $1$). We also use our
characterization to establish rates for list identification in the statistical
setting where the input is drawn as an i.i.d. stream from a distribution
supported on some language in the collection. Our results show that if a
collection is $k$-list identifiable in the limit, then the collection can be
$k$-list identified at an exponential rate, and this is best possible. On the
other hand, if a collection is not $k$-list identifiable in the limit, then it
cannot be $k$-list identified at any rate that goes to zero.