Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency
2510.12389v1
cs.CL, cs.AI, I.2.7; I.2.1; H.3.3; F.2.2
2025-10-16
Авторы:
Hailay Kidu Teklehaymanot, Wolfgang Nejdl
Abstract
Tokenization disparities pose a significant barrier to achieving equitable
access to artificial intelligence across linguistically diverse populations.
This study conducts a large-scale cross-linguistic evaluation of tokenization
efficiency in over 200 languages to systematically quantify computational
inequities in large language models (LLMs). Using a standardized experimental
framework, we applied consistent preprocessing and normalization protocols,
followed by uniform tokenization through the tiktoken library across all
language samples. Comprehensive tokenization statistics were collected using
established evaluation metrics, including Tokens Per Sentence (TPS) and
Relative Tokenization Cost (RTC), benchmarked against English baselines. Our
cross-linguistic analysis reveals substantial and systematic disparities:
Latin-script languages consistently exhibit higher tokenization efficiency,
while non-Latin and morphologically complex languages incur significantly
greater token inflation, often 3-5 times higher RTC ratios. These
inefficiencies translate into increased computational costs and reduced
effective context utilization for underrepresented languages. Overall, the
findings highlight structural inequities in current AI systems, where speakers
of low-resource and non-Latin languages face disproportionate computational
disadvantages. Future research should prioritize the development of
linguistically informed tokenization strategies and adaptive vocabulary
construction methods that incorporate typological diversity, ensuring more
inclusive and computationally equitable multilingual AI systems.