MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation
2510.09065v1
cs.SD, cs.CV, cs.LG, eess.AS
2025-10-14
Authors:
Akira Takahashi, Shusuke Takahashi, Yuki Mitsufuji
Abstract
We introduce MMAudioSep, a generative model for video/text-queried sound
separation that is founded on a pretrained video-to-audio model. By leveraging
knowledge about the relationship between video/text and audio learned through a
pretrained audio generative model, we can train the model more efficiently,
i.e., the model does not need to be trained from scratch. We evaluate the
performance of MMAudioSep by comparing it to existing separation models,
including models based on both deterministic and generative approaches, and
find it is superior to the baseline models. Furthermore, we demonstrate that
even after acquiring sound separation functionality via fine-tuning, the
model retains its original video-to-audio generation ability. This
highlights the potential of foundational sound generation models to be adopted
for sound-related downstream tasks. Our code is available at
https://github.com/sony/mmaudiosep.