FoleyGRAM: Video-to-Audio Generation with GRAM-Aligned Multimodal Encoders
2510.05829v1
cs.SD, cs.CV, cs.LG, cs.MM, eess.AS
2025-10-09
Authors:
Riccardo Fosco Gramaccioni, Christian Marinoni, Eleonora Grassucci, Giordano Cicchetti, Aurelio Uncini, Danilo Comminiello
Abstract
In this work, we present FoleyGRAM, a novel approach to video-to-audio
generation that emphasizes semantic conditioning through the use of aligned
multimodal encoders. Building on prior advancements in video-to-audio
generation, FoleyGRAM leverages the Gramian Representation Alignment Measure
(GRAM) to align embeddings across video, text, and audio modalities, enabling
precise semantic control over the audio generation process. The core of
FoleyGRAM is a diffusion-based audio synthesis model conditioned on
GRAM-aligned embeddings and waveform envelopes, ensuring both semantic richness
and temporal alignment with the corresponding input video. We evaluate
FoleyGRAM on the Greatest Hits dataset, a standard benchmark for video-to-audio
models. Our experiments demonstrate that aligning multimodal encoders using
GRAM enhances the system's ability to semantically align generated audio with
video content, advancing the state of the art in video-to-audio synthesis.
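
To make the alignment measure more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that GRAM scores a set of modality embeddings by the volume of the parallelotope they span, computed as the square root of the determinant of their Gram matrix, so that a smaller volume indicates closer alignment. The function name `gram_volume` and the toy vectors are hypothetical and serve only as illustration.

```python
import numpy as np


def gram_volume(*embeddings):
    """Volume of the parallelotope spanned by L2-normalized embeddings.

    Assumption: a smaller volume means the modality embeddings
    (e.g. video, text, audio) are more closely aligned.
    """
    # Stack the k modality vectors as rows of a (k, d) matrix.
    A = np.stack([np.asarray(e, dtype=np.float64) for e in embeddings])
    # Normalize each row to unit length so only directions matter.
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    # Gram matrix of pairwise inner products, shape (k, k).
    G = A @ A.T
    # Clamp tiny negative determinants caused by floating-point error.
    det = max(np.linalg.det(G), 0.0)
    return float(np.sqrt(det))


# Toy example with hypothetical 4-dimensional embeddings.
video = np.array([1.0, 0.2, 0.0, 0.1])
text = np.array([0.9, 0.3, 0.1, 0.0])
audio = np.array([0.0, 0.1, 1.0, 0.4])

aligned = gram_volume(video, text)            # nearly parallel vectors
misaligned = gram_volume(video, text, audio)  # audio points elsewhere
```

For two unit vectors the volume reduces to the absolute sine of the angle between them, so it is 0 for parallel vectors and 1 for orthogonal ones. In a training setting this quantity would typically be computed in a differentiable framework and used inside a contrastive objective; that part is not shown here.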