Arabic Little STT: Arabic Children Speech Recognition Dataset
2510.23319v1
cs.CL, cs.AI, cs.HC, cs.LG, cs.SD
2025-10-29
Authors:
Mouhand Alkadri, Dania Desouki, Khloud Al Jallad
Abstract
The performance of Artificial Intelligence (AI) systems fundamentally depends
on high-quality training data. However, low-resource languages like Arabic
suffer from severe data scarcity. Moreover, the absence of child-specific
speech corpora is a significant gap. To address it, we present Arabic Little
STT, a dataset of Levantine Arabic child speech recorded in classrooms,
containing 355 utterances from 288 children (ages 6-13). We further conduct a
systematic
assessment of Whisper, a state-of-the-art automatic speech recognition (ASR)
model, on this dataset and compare its performance with adult Arabic
benchmarks. Our evaluation across eight Whisper variants reveals that even the
best-performing model (Large_v3) struggles, achieving a word error rate (WER)
of 0.66 on child speech, in stark contrast to its sub-0.20 WER on
adult datasets. These results align with other research on English speech.
Our findings highlight the critical need for dedicated child speech benchmarks
and inclusive training data in ASR development. We emphasize that such data
must be governed by strict ethical and privacy frameworks to protect sensitive
child information. We hope that this study provides an initial step for future
work on equitable speech technologies for Arabic-speaking children, and that
our publicly available dataset enriches the representation of children in
ASR datasets.
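
For readers unfamiliar with the metric, the word error rate reported above is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not state which text normalization was applied before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared verbatim;
    no text normalization is applied here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of 0.66 means roughly two word-level errors for every three reference words. Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.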