YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases

2512.04793v1 cs.SD, cs.AI 2025-12-06

Авторы:

Gongyu Chen, Xiaoyu Zhang, Zhenqiang Weng, Junjie Zheng, Da Shen, Chaofan Ding, Wei-Qiang Zhang, Zihao Chen

Abstract

Singing voice conversion (SVC) aims to render the target singer's timbre while preserving melody and lyrics. However, existing zero-shot SVC systems remain fragile in real songs due to harmony interference, F0 errors, and the lack of inductive biases for singing. We propose YingMusic-SVC, a robust zero-shot framework that unifies continuous pre-training, robust supervised fine-tuning, and Flow-GRPO reinforcement learning. Our model introduces a singing-trained RVC timbre shifter for timbre-content disentanglement, an F0-aware timbre adaptor for dynamic vocal expression, and an energy-balanced rectified flow matching loss to enhance high-frequency fidelity. Experiments on a graded multi-track benchmark show that YingMusic-SVC achieves consistent improvements over strong open-source baselines in timbre similarity, intelligibility, and perceptual naturalness, especially under accompanied and harmony-contaminated conditions, demonstrating its effectiveness for real-world SVC deployment.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases

Авторы:

Abstract

Ссылки и действия

Связанные статьи

Large Speech Model Enabled Semantic Communication

YingMusic-Singer: Zero-shot Singing Voice Synthesis and Editing with Annotation-...

Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio ...

State Space Models for Bioacoustics: A comparative Evaluation with Transformers

Probabilistic Fusion and Calibration of Neural Speaker Diarization Models

Навигация