CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese
2511.04139v1
cs.CL, cs.SD
2025-11-08
Authors:
Dazhong Chen, Yi-Cheng Lin, Yuchen Huang, Ziwei Gong, Di Jiang, Zeying Xie, Yi R. Fung
Abstract
Automatic speech recognition (ASR) is critical for language accessibility,
yet low-resource Cantonese remains challenging due to limited annotated data,
six lexical tones, tone sandhi, and accent variation. Existing ASR models, such
as Whisper, often suffer from high word error rates. Large audio-language
models (LALMs), in contrast, can leverage broader contextual reasoning but
still require explicit tonal and prosodic acoustic cues. We introduce CantoASR,
a collaborative ASR-LALM error correction framework that integrates forced
alignment for acoustic feature extraction, a LoRA-finetuned Whisper for
improved tone discrimination, and an instruction-tuned Qwen-Audio for
prosody-aware correction. Evaluations on spontaneous Cantonese data show
substantial CER reductions relative to Whisper-Large-V3. These findings suggest that
integrating acoustic cues with LALM reasoning provides a scalable strategy for
low-resource tonal and dialectal ASR.
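
For readers unfamiliar with the metric, the character error rate (CER) reported above is conventionally the character-level edit distance between hypothesis and reference, divided by the reference length. The following is a minimal sketch of that standard definition, not the authors' evaluation script: the function names `edit_distance` and `cer` are hypothetical, and the choice to ignore whitespace is an assumption, since the abstract does not describe the text normalization used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete cost 1)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length.

    Assumption: whitespace is dropped before scoring; the paper's own
    normalization is not stated in the abstract.
    """
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

Under this definition, a single wrong character in a six-character reference gives a CER of 1/6. Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.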