PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model
2511.01571v1
cs.CV, cs.RO
2025-11-06
Авторы:
Wenqi Liang, Gan Sun, Yao He, Jiahua Dong, Suyan Dai, Ivan Laptev, Salman Khan, Yang Cong
Abstract
Vision-Language-Action models (VLAs) are emerging as powerful tools for
learning generalizable visuomotor control policies. However, current VLAs are
mostly trained on large-scale image-text-action data and remain limited in two
key ways: (i) they struggle with pixel-level scene understanding, and (ii) they
rely heavily on textual prompts, which reduces their flexibility in real-world
settings. To address these challenges, we introduce PixelVLA, the first VLA
model designed to support both pixel-level reasoning and multimodal prompting
with text and visual inputs. Our approach is built on a new visuomotor
instruction tuning framework that integrates a multiscale pixel-aware encoder
with a visual prompting encoder. To train PixelVLA effectively, we further
propose a two-stage automated annotation pipeline that generates Pixel-160K, a
large-scale dataset with pixel-level annotations derived from existing robot
data. Experiments on three standard VLA benchmarks and two VLA model variants
show that PixelVLA improves manipulation success rates by 10.1%-17.8% over
OpenVLA, while requiring only 1.5% of its pretraining cost. These results
demonstrate that PixelVLA can be integrated into existing VLAs to enable more
accurate, efficient, and versatile robot control in complex environments. The
dataset and code will be released as open source.
Ссылки и действия
Дополнительные ресурсы: