Morpheme Induction for Emergent Language
2510.03439v1
cs.CL, I.2.7; I.6.m
2025-10-08
Авторы:
Brendon Boldt, David Mortensen
Abstract
We introduce CSAR, an algorithm for inducing morphemes from emergent language
corpora of parallel utterances and meanings. It is a greedy algorithm that (1)
weights morphemes based on mutual information between forms and meanings, (2)
selects the highest-weighted pair, (3) removes it from the corpus, and (4)
repeats the process to induce further morphemes (i.e., Count, Select, Ablate,
Repeat). The effectiveness of CSAR is first validated on procedurally generated
datasets and compared against baselines for related tasks. Second, we validate
CSAR's performance on human language data to show that the algorithm makes
reasonable predictions in adjacent domains. Finally, we analyze a handful of
emergent languages, quantifying linguistic characteristics like degree of
synonymy and polysemy.
Ссылки и действия
Дополнительные ресурсы: