Unified Multimodal Diffusion Forcing for Forceful Manipulation
2511.04812v1
cs.RO, cs.AI, cs.LG
2025-11-11
Авторы:
Zixuan Huang, Huaidian Hou, Dmitry Berenson
Abstract
Given a dataset of expert trajectories, standard imitation learning
approaches typically learn a direct mapping from observations (e.g., RGB
images) to actions. However, such methods often overlook the rich interplay
between different modalities, i.e., sensory inputs, actions, and rewards, which
is crucial for modeling robot behavior and understanding task outcomes. In this
work, we propose Multimodal Diffusion Forcing, a unified framework for learning
from multimodal robot trajectories that extends beyond action generation.
Rather than modeling a fixed distribution, MDF applies random partial masking
and trains a diffusion model to reconstruct the trajectory. This training
objective encourages the model to learn temporal and cross-modal dependencies,
such as predicting the effects of actions on force signals or inferring states
from partial observations. We evaluate MDF on contact-rich, forceful
manipulation tasks in simulated and real-world environments. Our results show
that MDF not only delivers versatile functionalities, but also achieves strong
performance, and robustness under noisy observations. More visualizations can
be found on our website https://unified-df.github.io
Ссылки и действия
Дополнительные ресурсы: