Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment
2510.16387v1
cs.CL, cs.AI, cs.SD, eess.AS
2025-10-22
Авторы:
Fu-An Chao, Bi-Cheng Yan, Berlin Chen
Abstract
In this paper, we explore the untapped potential of Whisper, a
well-established automatic speech recognition (ASR) foundation model, in the
context of L2 spoken language assessment (SLA). Unlike prior studies that
extrinsically analyze transcriptions produced by Whisper, our approach goes a
step further to probe its latent capabilities by extracting acoustic and
linguistic features from hidden representations. With only a lightweight
classifier being trained on top of Whisper's intermediate and final outputs,
our method achieves strong performance on the GEPT picture-description dataset,
outperforming existing cutting-edge baselines, including a multimodal approach.
Furthermore, by incorporating image and text-prompt information as auxiliary
relevance cues, we demonstrate additional performance gains. Finally, we
conduct an in-depth analysis of Whisper's embeddings, which reveals that, even
without task-specific fine-tuning, the model intrinsically encodes both ordinal
proficiency patterns and semantic aspects of speech, highlighting its potential
as a powerful foundation for SLA and other spoken language understanding tasks.