OmniVLA: Physically-Grounded Multimodal VLA with Unified Multi-Sensor Perception for Robotic Manipulation
2511.01210v2
cs.CV, cs.RO
2025-11-07
Authors:
Heyu Guo, Shanmu Wang, Ruichun Ma, Shiqi Jiang, Yasaman Ghasempour, Omid Abari, Baining Guo, Lili Qiu
Abstract
Vision-language-action (VLA) models have shown strong generalization for
robotic action prediction through large-scale vision-language pretraining.
However, most existing models rely solely on RGB cameras, limiting their
perception and, consequently, manipulation capabilities. We present OmniVLA, an
omni-modality VLA model that integrates novel sensing modalities for
physically-grounded spatial intelligence beyond RGB perception. The core of our
approach is the sensor-masked image, a unified representation that overlays
spatially grounded and physically meaningful masks onto the RGB images, derived
from sensors including an infrared camera, a mmWave radar, and a microphone
array. This image-native unification keeps sensor input close to RGB statistics
to facilitate training, provides a uniform interface across sensor hardware,
and enables data-efficient learning with lightweight per-sensor projectors.
Built on this, we present a multisensory vision-language-action model
architecture and train the model based on an RGB-pretrained VLA backbone. We
evaluate OmniVLA on challenging real-world tasks where sensor-modality
perception guides the robotic manipulation. OmniVLA achieves an average task
success rate of 84%, significantly outperforming the RGB-only and
raw-sensor-input baseline models by 59% and 28%, respectively, while also showing
higher learning efficiency and stronger generalization capability.
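
To make the idea of a sensor-masked image concrete, the following is a minimal sketch of overlaying a sensor-derived mask onto an RGB frame. The abstract does not specify how the overlay is computed, so the alpha-blending scheme, the function name `overlay_sensor_mask`, and the parameters `color` and `alpha` are all assumptions made for illustration. The sketch also assumes the mask has already been projected into the camera's image plane.

```python
import numpy as np


def overlay_sensor_mask(rgb, mask, color=(255, 0, 0), alpha=0.5):
    """Blend a sensor-derived mask onto an RGB image (illustrative only).

    rgb:   (H, W, 3) uint8 camera image.
    mask:  (H, W) float array in [0, 1], assumed to be already aligned
           with the camera frame (e.g. a thermal or radar intensity map).
    color: hypothetical tint used to mark the masked region.
    alpha: hypothetical maximum blend strength where mask == 1.
    """
    if mask.shape != rgb.shape[:2]:
        raise ValueError("mask must have the same height and width as rgb")

    rgb_f = rgb.astype(np.float32)
    # Per-pixel blend weight, broadcast over the three color channels.
    weight = alpha * np.clip(mask, 0.0, 1.0)[..., None]
    tint = np.asarray(color, dtype=np.float32)

    # Standard alpha blending: unmasked pixels keep their RGB values.
    blended = (1.0 - weight) * rgb_f + weight * tint
    return np.rint(blended).astype(np.uint8)
```

Because the output is an ordinary `uint8` image of the same shape as the input, it could in principle be passed to an RGB-pretrained vision encoder without changing its input interface, which is the property the abstract attributes to the image-native representation.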