From Scores to Preferences: Redefining MOS Benchmarking for Speech Quality Reward Modeling
2510.00743v1
cs.SD, cs.AI, cs.CL, eess.AS
2025-10-04
Authors:
Yifei Cao, Changhao Jiang, Jiabao Zhuang, Jiajun Sun, Ming Zhang, Zhiheng Xi, Hui Li, Shihan Dou, Yuran Wang, Yunke Zhang, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract
Assessing the perceptual quality of synthetic speech is crucial for guiding
the development and refinement of speech generation models. However, it has
traditionally relied on human subjective ratings such as the Mean Opinion Score
(MOS), which depend on manual annotations and often suffer from inconsistent
rating standards and poor reproducibility. To address these limitations, we
introduce MOS-RMBench, a unified benchmark that reformulates diverse MOS
datasets into a preference-comparison setting, enabling rigorous evaluation
across different datasets. Building on MOS-RMBench, we systematically construct
and evaluate three paradigms for reward modeling: scalar reward models,
semi-scalar reward models, and generative reward models (GRMs). Our experiments
reveal three key findings: (1) scalar models achieve the strongest overall
performance, consistently exceeding 74% accuracy; (2) most models perform
considerably worse on synthetic speech than on human speech; and (3) all models
struggle on pairs with very small MOS differences. To improve performance on
these challenging pairs, we propose a MOS-aware GRM that incorporates a
MOS-difference-based reward function, enabling the model to adaptively scale
rewards according to the difficulty of each sample pair. Experimental results
show that the MOS-aware GRM significantly improves fine-grained quality
discrimination and narrows the gap with scalar models on the most challenging
cases. We hope this work will establish both a benchmark and a methodological
framework to foster more rigorous and scalable research in automatic speech
quality assessment.
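
To make the preference-comparison setting concrete, the following is a minimal sketch of how MOS-labelled utterances could be turned into preference pairs and how a scoring model could be evaluated by pairwise accuracy. It is an illustration under stated assumptions, not the benchmark's actual construction: the function names, the `min_gap` tie threshold, the choice to pair every utterance with every other, and the toy ratings and model scores are all hypothetical.

```python
from itertools import combinations


def build_preference_pairs(items, min_gap=0.0):
    """Turn (utterance_id, mos) records into preference pairs.

    Returns a list of (preferred_id, rejected_id, mos_gap) tuples.
    Pairs whose MOS gap is <= min_gap are treated as ties and skipped.
    Both the all-pairs strategy and min_gap are assumptions of this sketch.
    """
    pairs = []
    for (id_a, mos_a), (id_b, mos_b) in combinations(items, 2):
        gap = abs(mos_a - mos_b)
        if gap <= min_gap:
            continue  # no clear preference between the two utterances
        if mos_a > mos_b:
            pairs.append((id_a, id_b, gap))
        else:
            pairs.append((id_b, id_a, gap))
    return pairs


def pairwise_accuracy(pairs, score):
    """Fraction of pairs where the model scores the preferred item higher."""
    if not pairs:
        return 0.0
    correct = sum(
        1 for preferred, rejected, _ in pairs
        if score[preferred] > score[rejected]
    )
    return correct / len(pairs)


# Toy data (hypothetical): three utterances with MOS labels.
ratings = [("utt_a", 4.5), ("utt_b", 3.0), ("utt_c", 3.1)]
pairs = build_preference_pairs(ratings)

# Hypothetical scalar outputs from some reward model.
model_scores = {"utt_a": 0.9, "utt_b": 0.4, "utt_c": 0.3}
accuracy = pairwise_accuracy(pairs, model_scores)
print(f"{len(pairs)} pairs, accuracy = {accuracy:.3f}")
```

In this toy example the model orders both large-gap pairs correctly but misorders the pair whose MOS values differ by only about 0.1, which mirrors the kind of small-difference case the abstract identifies as hardest. Keeping the gap alongside each pair also makes it straightforward to report accuracy separately by difficulty.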