Learning to Generate Object Interactions with Physics-Guided Video Diffusion
2510.02284v1
cs.CV, cs.AI, cs.LG
2025-10-04
Авторы:
David Romero, Ariana Bermudez, Hao Li, Fabio Pizzati, Ivan Laptev
Abstract
Recent models for video generation have achieved remarkable progress and are
now deployed in film, social media production, and advertising. Beyond their
creative potential, such models also hold promise as world simulators for
robotics and embodied decision making. Despite strong advances, however,
current approaches still struggle to generate physically plausible object
interactions and lack physics-grounded control mechanisms. To address this
limitation, we introduce KineMask, an approach for physics-guided video
generation that enables realistic rigid body control, interactions, and
effects. Given a single image and a specified object velocity, our method
generates videos with inferred motions and future object interactions. We
propose a two-stage training strategy that gradually removes future motion
supervision via object masks. Using this strategy we train video diffusion
models (VDMs) on synthetic scenes of simple interactions and demonstrate
significant improvements of object interactions in real scenes. Furthermore,
KineMask integrates low-level motion control with high-level textual
conditioning via predictive scene descriptions, leading to effective support
for synthesis of complex dynamical phenomena. Extensive experiments show that
KineMask achieves strong improvements over recent models of comparable size.
Ablation studies further highlight the complementary roles of low- and
high-level conditioning in VDMs. Our code, model, and data will be made
publicly available.
Ссылки и действия
Дополнительные ресурсы: