Generating Separated Singing Vocals Using a Diffusion Model Conditioned on Music Mixtures

2511.21342v1 cs.SD, cs.AI 2025-11-27

Авторы:

Genís Plaja-Roglans, Yun-Ning Hung, Xavier Serra, Igor Pereira

Abstract

Separating the individual elements in a musical mixture is an essential process for music analysis and practice. While this is generally addressed using neural networks optimized to mask or transform the time-frequency representation of a mixture to extract the target sources, the flexibility and generalization capabilities of generative diffusion models are giving rise to a novel class of solutions for this complicated task. In this work, we explore singing voice separation from real music recordings using a diffusion model which is trained to generate the solo vocals conditioned on the corresponding mixture. Our approach improves upon prior generative systems and achieves competitive objective scores against non-generative baselines when trained with supplementary data. The iterative nature of diffusion sampling enables the user to control the quality-efficiency trade-off, and also refine the output when needed. We present an ablation study of the sampling algorithm, highlighting the effects of the user-configurable parameters.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

Generating Separated Singing Vocals Using a Diffusion Model Conditioned on Music Mixtures

Авторы:

Abstract

Ссылки и действия

Связанные статьи

Large Speech Model Enabled Semantic Communication

YingMusic-Singer: Zero-shot Singing Voice Synthesis and Editing with Annotation-...

YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GR...

Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio ...

State Space Models for Bioacoustics: A comparative Evaluation with Transformers

Навигация