PodEval: A Multimodal Evaluation Framework for Podcast Audio Generation
2510.00485v1
cs.SD, cs.AI, eess.AS
2025-10-05
Authors:
Yujia Xiao, Liumeng Xue, Lei He, Xinyi Chen, Aemon Yat Fei Chiu, Wenjie Tian, Shaofei Zhang, Qiuqiang Kong, Xinfa Zhu, Wei Xue, Tan Lee
Abstract
Recently, an increasing number of multimodal (text and audio) benchmarks have
emerged, primarily focusing on evaluating models' understanding capability.
However, exploration into assessing generative capabilities remains limited,
especially for open-ended long-form content generation. The main challenges
are the lack of reference answers, the absence of unified evaluation metrics, and
the difficulty of controlling human judgments. In this work, we take podcast-like audio
generation as a starting point and propose PodEval, a comprehensive and
well-designed open-source evaluation framework. In this framework: 1) We
construct a real-world podcast dataset spanning diverse topics, serving as a
reference for human-level creative quality. 2) We introduce a multimodal
evaluation strategy and decompose the complex task into three dimensions: text,
speech and audio, with different evaluation emphasis on "Content" and "Format".
3) For each modality, we design corresponding evaluation methods, involving
both objective metrics and subjective listening tests. We leverage
representative podcast generation systems (including open-source, closed-source,
and human-made) in our experiments. The results offer in-depth analysis and
insights into podcast generation, demonstrating the effectiveness of PodEval in
evaluating open-ended long-form audio. This project is open-source to
facilitate public use: https://github.com/yujxx/PodEval.