Morphologically-Informed Tokenizers for Languages with Non-Concatenative Morphology: A case study of Yoloxóchitl Mixtec ASR

2512.06169v1 cs.CL 2025-12-09

Authors:

Chris Crawford

Abstract

This paper investigates the impact of using morphologically-informed tokenizers to aid and streamline the interlinear gloss annotation of an audio corpus of Yoloxóchitl Mixtec (YM) using a combination of ASR and text-based sequence-to-sequence tools, with the goal of improving efficiency while reducing the workload of a human annotator. We present two novel tokenization schemes that separate words in a nonlinear manner, preserving information about tonal morphology as much as possible. One of these approaches, a Segment-and-Melody tokenizer, simply extracts the tones without predicting segmentation. The other, a Sequence-of-Processes tokenizer, predicts segmentation for the words, which could allow an end-to-end ASR system to produce segmented and unsegmented transcriptions in a single pass. We find that these novel tokenizers are competitive with BPE and Unigram models: the Segment-and-Melody model achieves a lower word error rate than the traditional tokenizers, though it does not match their character error rate. In addition, we analyze tokenizers on morphological and information-theoretic metrics to find predictive correlations with downstream performance. Our results suggest that nonlinear tokenizers designed specifically for the non-concatenative morphology of a language are competitive with conventional BPE and Unigram models for ASR. Further research will be necessary to determine the applicability of these tokenizers in downstream processing tasks.
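
As a rough illustration of what "extracting the tones without predicting segmentation" could look like, the sketch below separates a word into a toneless segment string and a tone melody. It is not the paper's implementation. It assumes tones are written as digits attached to the segments they follow, and the example word, the function names, and the `<T…>` token format are all hypothetical.

```python
import re

# Assumption: tones are written as runs of digits inside the word,
# e.g. "ka3ba4". The example word is made up for illustration.
TONE_RUN = re.compile(r"\d+")


def split_segments_and_melody(word: str) -> tuple[str, list[str]]:
    """Return the toneless segment string and the ordered list of tone runs."""
    melody = TONE_RUN.findall(word)    # tone runs, in order of appearance
    segments = TONE_RUN.sub("", word)  # the word with all tone digits removed
    return segments, melody


def tokenize(word: str) -> list[str]:
    """Emit the segment string first, then one hypothetical token per tone run."""
    segments, melody = split_segments_and_melody(word)
    return [segments] + [f"<T{tone}>" for tone in melody]


if __name__ == "__main__":
    print(tokenize("ka3ba4"))
```

This toy version discards the position at which each tone attaches, so the original word cannot be rebuilt from its output; a practical tokenizer would need to keep that alignment.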
