Controllable Audio-Visual Viewpoint Generation from 360° Spatial Information
2510.06060v1
cs.MM, cs.AI, cs.CV
2025-10-09
Authors:
Christian Marinoni, Riccardo Fosco Gramaccioni, Eleonora Grassucci, Danilo Comminiello
Abstract
The generation of sounding videos has seen significant advancements with the
advent of diffusion models. However, existing methods often lack the
fine-grained control needed to generate viewpoint-specific content from larger,
immersive 360-degree environments. This limitation restricts the creation of
audio-visual experiences that are aware of off-camera events. To the best of
our knowledge, this is the first work to introduce a framework for controllable
audio-visual generation, addressing this unexplored gap. Specifically, we
propose a diffusion model conditioned on a set of signals
derived from the full 360-degree space: a panoramic saliency map to identify
regions of interest, a bounding-box-aware signed distance map to define the
target viewpoint, and a descriptive caption of the entire scene. By integrating
these controls, our model generates spatially aware viewpoint video and audio
that are coherently influenced by the broader, unseen environmental context,
providing the strong controllability that is essential for realistic and
immersive audio-visual generation. We present audio-visual examples that
demonstrate the effectiveness of our framework.
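
To make the notion of a bounding-box signed distance map concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the abstract does not say how the map is computed. The function name `box_signed_distance_map`, the pixel-coordinate box format, and the sign convention (negative inside the box, positive outside) are all assumptions made for illustration. The sketch also ignores the horizontal wrap-around of an equirectangular panorama.

```python
import numpy as np


def box_signed_distance_map(height, width, box):
    """Signed Euclidean distance (in pixels) from every pixel to a box.

    Hypothetical helper for illustration only.
    box = (x_min, y_min, x_max, y_max) in pixel coordinates.
    Assumed convention: negative inside the box, zero on its border,
    positive outside.
    """
    x_min, y_min, x_max, y_max = box

    # Pixel coordinate grids: ys varies along rows, xs along columns.
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)

    # Box center and half-extents.
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    hx, hy = (x_max - x_min) / 2.0, (y_max - y_min) / 2.0

    # Per-axis distance to the box faces (negative when inside along that axis).
    dx = np.abs(xs - cx) - hx
    dy = np.abs(ys - cy) - hy

    # Outside: Euclidean distance to the nearest point on the box.
    outside = np.hypot(np.maximum(dx, 0.0), np.maximum(dy, 0.0))
    # Inside: negative distance to the nearest face.
    inside = np.minimum(np.maximum(dx, dy), 0.0)

    return outside + inside


if __name__ == "__main__":
    # A small 8x16 "panorama" with a viewpoint box spanning x in [4, 8], y in [2, 6].
    sdm = box_signed_distance_map(8, 16, (4, 2, 8, 6))
    print(sdm.shape)
    print(sdm[4, 6], sdm[4, 8], sdm[4, 10])
```

A map of this kind encodes both where the target viewpoint lies and how far every other panorama location is from it. That is one plausible reason to prefer it over a binary mask as a conditioning signal, though the abstract does not give the authors' rationale.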