Lossless Vocabulary Reduction for Auto-Regressive Language Models
2510.08102v1
cs.CL, cs.AI, cs.LG, stat.ML
2025-10-11
Авторы:
Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Shin'ya Yamaguchi, Tomoya Ohba, Tamao Sakao, Susumu Takeuchi
Abstract
Tokenization -- the process of decomposing a given text into a sequence of
subwords called tokens -- is one of the key components in the development of
language models. Particularly, auto-regressive language models generate texts
token by token, i.e., by predicting the next-token distribution given the
previous ones, and thus tokenization directly affects their efficiency in text
generation. Since each language model has their own vocabulary as a set of
possible tokens, they struggle to cooperate with each other at the level of
next-token distributions such as model ensemble. In this paper, we establish a
theoretical framework of lossless vocabulary reduction, which efficiently
converts a given auto-regressive language model into the one with an
arbitrarily small vocabulary without any loss in accuracy. As an application,
we demonstrate that language models with different tokenization can cooperate
with each other efficiently through their maximal common vocabulary.