Accent-Invariant Automatic Speech Recognition via Saliency-Driven Spectrogram Masking
2510.09528v1
cs.CL, cs.SD, eess.AS
2025-10-14
Authors:
Mohammad Hossein Sameti, Sepehr Harfi Moridani, Ali Zarean, Hossein Sameti
Abstract
Pre-trained transformer-based models have significantly advanced automatic
speech recognition (ASR), yet they remain sensitive to accent and dialectal
variations, resulting in elevated word error rates (WER) in linguistically
diverse languages such as English and Persian. To address this challenge, we
propose an accent-invariant ASR framework that integrates accent and dialect
classification into the recognition pipeline. Our approach involves training a
spectrogram-based classifier to capture accent-specific cues, masking the
regions most influential to its predictions, and using the masked spectrograms
for data augmentation. This enhances the robustness of ASR models against
accent variability. We evaluate the method using both English and Persian
speech. For Persian, we introduce a newly collected dataset spanning multiple
regional accents, establishing the first systematic benchmark for accent
variation in Persian ASR. This benchmark fills a critical gap in multilingual
speech research and provides a foundation for future studies on low-resource,
linguistically diverse languages. Experimental results with the Whisper model
demonstrate that our masking and augmentation strategy yields substantial WER
reductions in both English and Persian settings, confirming the effectiveness
of the approach. This research advances the development of multilingual ASR
systems that are resilient to accent and dialect diversity. Code and dataset
are publicly available at: https://github.com/MH-Sameti/Accent_invariant_ASR
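
The masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a saliency map of the same shape as the spectrogram has already been produced by the accent classifier (for example from input gradients), and it replaces the highest-saliency time-frequency bins with a constant. The function name `mask_salient_regions`, the `mask_fraction` and `fill_value` parameters, and the toy values are hypothetical and chosen only for illustration. Plain Python lists keep the example dependency-free; a real pipeline would more likely operate on array or tensor types.

```python
def mask_salient_regions(spectrogram, saliency, mask_fraction=0.1, fill_value=0.0):
    """Return a copy of `spectrogram` with its most salient bins masked.

    spectrogram, saliency: lists of rows (frequency x time) with equal shapes.
    mask_fraction: share of bins to mask (hypothetical parameter, not from the paper).
    fill_value: value written into masked bins (assumed here to be 0.0).
    """
    if not 0.0 <= mask_fraction <= 1.0:
        raise ValueError("mask_fraction must lie in [0, 1]")
    if len(spectrogram) != len(saliency) or any(
        len(s_row) != len(a_row) for s_row, a_row in zip(spectrogram, saliency)
    ):
        raise ValueError("spectrogram and saliency must have the same shape")

    # Rank every (row, column) position by saliency, highest first.
    positions = [
        (r, c) for r, row in enumerate(spectrogram) for c in range(len(row))
    ]
    positions.sort(key=lambda rc: saliency[rc[0]][rc[1]], reverse=True)

    # Mask the top fraction of bins on a copy, leaving the input untouched.
    n_masked = int(mask_fraction * len(positions))
    masked = [list(row) for row in spectrogram]
    for r, c in positions[:n_masked]:
        masked[r][c] = fill_value
    return masked


# Toy 2 x 4 "spectrogram" and a made-up saliency map.
spec = [[1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0]]
sal = [[0.1, 0.9, 0.2, 0.3],
       [0.8, 0.05, 0.4, 0.0]]

augmented = mask_salient_regions(spec, sal, mask_fraction=0.25)
print(augmented)
```

In an augmentation setup of the kind the abstract outlines, masked copies like `augmented` would be added alongside the original spectrograms when fine-tuning the ASR model, so that the model sees inputs with the accent-discriminative regions suppressed.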