CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese

2511.04139v1 cs.CL, cs.SD 2025-11-08

Authors:

Dazhong Chen, Yi-Cheng Lin, Yuchen Huang, Ziwei Gong, Di Jiang, Zeying Xie, Yi R. Fung

Abstract

Automatic speech recognition (ASR) is critical for language accessibility, yet low-resource Cantonese remains challenging due to limited annotated data, six lexical tones, tone sandhi, and accent variation. Existing ASR models, such as Whisper, often suffer from high word error rates. Large audio-language models (LALMs), in contrast, can leverage broader contextual reasoning but still require explicit tonal and prosodic acoustic cues. We introduce CantoASR, a collaborative ASR-LALM error correction framework that integrates forced alignment for acoustic feature extraction, a LoRA-finetuned Whisper for improved tone discrimination, and an instruction-tuned Qwen-Audio for prosody-aware correction. Evaluations on spontaneous Cantonese data show substantial CER gains over Whisper-Large-V3. These findings suggest that integrating acoustic cues with LALM reasoning provides a scalable strategy for low-resource tonal and dialectal ASR.
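For readers unfamiliar with the metric named in the abstract, character error rate (CER) is conventionally the character-level edit distance between reference and hypothesis, divided by the reference length. The sketch below illustrates that standard definition only; it is not the authors' evaluation code. The function names `edit_distance` and `cer` are hypothetical, and stripping spaces before scoring is an assumption.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    # Assumption: spaces are ignored, since Cantonese text is usually unsegmented.
    ref = ref.replace(" ", "")
    hyp = hyp.replace(" ", "")
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of seven reference characters.
    print(cer("我哋今日去飲茶", "我地今日去飲茶"))
```

Because the denominator is the reference length, CER can exceed 1.0 when the hypothesis contains many insertions.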

Related articles

Dialect Identification Using Resource-Efficient Fine-Tuning Approaches

A new kid on the block: Distributional semantics predicts the word-specific tone...

CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokk...

POTSA: A Cross-Lingual Speech Alignment Framework for Low Resource Speech-to-Tex...

MLMA: Towards Multilingual ASR With Mamba-based Architectures
