FoleyGRAM: Video-to-Audio Generation with GRAM-Aligned Multimodal Encoders

2510.05829v1 cs.SD, cs.CV, cs.LG, cs.MM, eess.AS 2025-10-09

Authors:

Riccardo Fosco Gramaccioni, Christian Marinoni, Eleonora Grassucci, Giordano Cicchetti, Aurelio Uncini, Danilo Comminiello

Abstract

In this work, we present FoleyGRAM, a novel approach to video-to-audio generation that emphasizes semantic conditioning through the use of aligned multimodal encoders. Building on prior advancements in video-to-audio generation, FoleyGRAM leverages the Gramian Representation Alignment Measure (GRAM) to align embeddings across video, text, and audio modalities, enabling precise semantic control over the audio generation process. The core of FoleyGRAM is a diffusion-based audio synthesis model conditioned on GRAM-aligned embeddings and waveform envelopes, ensuring both semantic richness and temporal alignment with the corresponding input video. We evaluate FoleyGRAM on the Greatest Hits dataset, a standard benchmark for video-to-audio models. Our experiments demonstrate that aligning multimodal encoders using GRAM enhances the system's ability to semantically align generated audio with video content, advancing the state of the art in video-to-audio synthesis.
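As a rough illustration of what a Gramian-based alignment measure computes, the sketch below scores three embeddings by the volume of the parallelotope they span, taken as the square root of the determinant of their Gram matrix. Nearly parallel (well-aligned) embeddings give a volume close to 0, and mutually orthogonal ones give a volume of 1. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `gram_volume`, the restriction to exactly three modalities, and the toy 3-dimensional vectors are all hypothetical, and the abstract does not specify how the measure is used during training.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def gram_volume(video, text, audio):
    """Volume spanned by three unit-normalized embeddings (hypothetical helper).

    Builds the 3x3 Gram matrix of pairwise inner products and returns
    sqrt(det(G)). Smaller values mean the three vectors are closer to
    being collinear, i.e. more tightly aligned.
    """
    vecs = [normalize(v) for v in (video, text, audio)]
    g = [[dot(a, b) for b in vecs] for a in vecs]
    # Explicit 3x3 determinant by cofactor expansion along the first row.
    det = (
        g[0][0] * (g[1][1] * g[2][2] - g[1][2] * g[2][1])
        - g[0][1] * (g[1][0] * g[2][2] - g[1][2] * g[2][0])
        + g[0][2] * (g[1][0] * g[2][1] - g[1][1] * g[2][0])
    )
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(det, 0.0))


# Toy embeddings (made up for illustration only).
aligned = gram_volume([1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.95, 0.0, 0.05])
orthogonal = gram_volume([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])
print(aligned < orthogonal)
```

In a real system the inputs would be high-dimensional encoder outputs for the video, text, and audio of the same clip, and a quantity of this kind could serve as a similarity score in place of pairwise cosine similarity.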

Related articles

StereoSync: Spatially-Aware Stereo Audio Generation from Video