SpeechJudge: Towards Human-Level Judgment for Speech Naturalness

2511.07931v1 cs.SD, cs.AI, cs.CL 2025-11-15
Authors:

Xueyao Zhang, Chaoren Wang, Huan Liao, Ziniu Li, Yuancheng Wang, Li Wang, Dongya Jia, Yuanzhe Chen, Xiulin Li, Zhuo Chen, Zhizheng Wu

Abstract

Aligning large generative models with human feedback is a critical challenge. In speech synthesis, the challenge is particularly pronounced because no large-scale human preference dataset exists, which hinders the development of models that truly align with human perception. To address this, we introduce SpeechJudge, a comprehensive suite comprising a dataset, a benchmark, and a reward model centered on naturalness, one of the most fundamental subjective metrics for speech synthesis. First, we present SpeechJudge-Data, a large-scale human feedback corpus of 99K speech pairs. The dataset is constructed using a diverse set of advanced zero-shot text-to-speech (TTS) models covering a range of speech styles and multiple languages, with human annotations for both intelligibility and naturalness preference. From this, we establish SpeechJudge-Eval, a challenging benchmark for speech naturalness judgment. Our evaluation reveals that existing metrics and AudioLLMs struggle with this task; the leading model, Gemini-2.5-Flash, achieves less than 70% agreement with human judgment, leaving substantial room for improvement. To bridge this gap, we develop SpeechJudge-GRM, a generative reward model (GRM) based on Qwen2.5-Omni-7B. It is trained on SpeechJudge-Data via a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales, followed by Reinforcement Learning (RL) with GRPO on challenging cases. On the SpeechJudge-Eval benchmark, the proposed SpeechJudge-GRM demonstrates superior performance, achieving 77.2% accuracy (and 79.4% after inference-time scaling @10) compared to a classic Bradley-Terry reward model (72.7%). Furthermore, SpeechJudge-GRM can also be employed as a reward function during the post-training of speech generation models to facilitate their alignment with human preferences.
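To make the quantities in the abstract concrete, the following is a minimal Python sketch of three ideas it mentions: the Bradley-Terry pairwise loss used by a classic scalar reward model, majority voting over several sampled judgments, and agreement with human labels. It is not the authors' implementation. The function names, the "A"/"B" label encoding, the tie-breaking rule, and the reading of "inference-time scaling @10" as a majority vote over 10 sampled judgments are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Sequence


def bt_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    Equals -log(sigmoid(score_chosen - score_rejected)), computed in a
    numerically stable softplus form.
    """
    d = score_chosen - score_rejected
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))


def bt_preference_prob(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that sample A is preferred over sample B."""
    return math.exp(-bt_loss(score_a, score_b))


def majority_vote(votes: Sequence[str]) -> str:
    """Aggregate sampled judgments ("A" or "B") into a single decision.

    Assumption: ties are broken in favour of "A"; the source does not
    specify a tie-breaking rule.
    """
    counts = Counter(votes)
    return "A" if counts["A"] >= counts["B"] else "B"


def agreement(predictions: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of pairs where the predicted preference matches the human one."""
    if len(predictions) != len(human_labels) or not human_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == h for p, h in zip(predictions, human_labels))
    return matches / len(human_labels)
```

Under these assumptions, a generative judge would be sampled several times per speech pair, the votes passed to `majority_vote`, and the resulting decisions compared against human labels with `agreement`. A Bradley-Terry reward model would instead score each sample once and prefer the higher score.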

