Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation

2510.24103v1 cs.SD, cs.AI, cs.MM, eess.AS 2025-10-30
Authors:

Kang Zhang, Trung X. Pham, Suyeon Lee, Axi Niu, Arda Senocak, Joon Son Chung

Abstract

We present MGAudio, a novel flow-based framework for open-domain video-to-audio generation, which introduces model-guided dual-role alignment as a central design principle. Unlike prior approaches that rely on classifier-based or classifier-free guidance, MGAudio enables the generative model to guide itself through a dedicated training objective designed for video-conditioned audio generation. The framework integrates three main components: (1) a scalable flow-based Transformer model, (2) a dual-role alignment mechanism where the audio-visual encoder serves both as a conditioning module and as a feature aligner to improve generation quality, and (3) a model-guided objective that enhances cross-modal coherence and audio realism. MGAudio achieves state-of-the-art performance on VGGSound, reducing FAD to 0.40, substantially surpassing the best classifier-free guidance baselines, and consistently outperforms existing methods across FD, IS, and alignment metrics. It also generalizes well to the challenging UnAV-100 benchmark. These results highlight model-guided dual-role alignment as a powerful and scalable paradigm for conditional video-to-audio generation. Code is available at: https://github.com/pantheon5100/mgaudio

Related articles

AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models

Context: The research area concerned with developing role-playing in large language models (LLMs) is an important...

2025-10-01

Disentangling Score Content and Performance Style for Joint Piano Rendering and ...

Context: The study of musical processes in the field of music information retrieval (MIR) is a key...

2025-10-01

Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach

Context: The research area of Image-to-Music (I2M) generation...

2025-09-30

Emotion-Aware Speech Generation with Character-Specific Voices for Comics

Context: Modern comics, in addition to text and pictures, often include storylines and characters with specific...

2025-09-22

SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering

Summary: Music recordings, especially those made in non-professional settings, often have defects such as excess...

2025-08-06