SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level

2510.26190v1 cs.SD, cs.CL, eess.AS 2025-11-01

Authors:

Hitomi Jin Ling Tee, Chaoren Wang, Zijie Zhang, Zhizheng Wu

Abstract

The evaluation of intelligibility for TTS has reached a bottleneck, as existing assessments heavily rely on word-by-word accuracy metrics such as WER, which fail to capture the complexity of real-world speech or reflect human comprehension needs. To address this, we propose Spoken-Passage Multiple-Choice Question Answering, a novel subjective approach evaluating the accuracy of key information in synthesized speech, and release SP-MCQA-Eval, an 8.76-hour news-style benchmark dataset for SP-MCQA evaluation. Our experiments reveal that low WER does not necessarily guarantee high key-information accuracy, exposing a gap between traditional metrics and practical intelligibility. SP-MCQA shows that even state-of-the-art (SOTA) models still lack robust text normalization and phonetic accuracy. This work underscores the urgent need for high-level, more life-like evaluation criteria now that many systems already excel at WER yet may fall short on real-world intelligibility.
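To make the gap between WER and key-information accuracy concrete, here is a minimal Python sketch. It computes WER as word-level edit distance divided by reference length, then applies a crude key-term check. The key-term check is a hypothetical stand-in chosen for illustration only: SP-MCQA as described above is a subjective evaluation based on multiple-choice questions answered by listeners, not an automatic string match. The function names, the example sentence, and the key-term list are all assumptions, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def key_terms_preserved(hypothesis: str, key_terms: list[str]) -> float:
    """Hypothetical proxy: fraction of key terms found in the transcript."""
    words = set(hypothesis.lower().split())
    return sum(term.lower() in words for term in key_terms) / len(key_terms)


# Illustrative sentence pair (invented for this sketch).
reference = ("the company reported a profit of fifteen million dollars "
             "in the third quarter")
transcript = ("the company reported a profit of fifty million dollars "
              "in the third quarter")

wer = word_error_rate(reference, transcript)
kept = key_terms_preserved(transcript, ["fifteen"])
print(f"WER: {wer:.3f}, key terms preserved: {kept:.0%}")
```

In this invented pair a single substituted word keeps WER low (1 error in 13 words), yet the one figure a listener would need is wrong, which is the kind of mismatch the abstract describes.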

Related articles

emg2speech: synthesizing speech from electromyography using self-supervised spee...

STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence

ISA-Bench: Benchmarking Instruction Sensitivity for Large Audio Language Models

XLSR-Kanformer: A KAN-Intergrated model for Synthetic Speech Detection

Sci-Phi: A Large Language Model Spatial Audio Descriptor
