Do What You Say: Steering Vision-Language-Action Models via Runtime Reasoning-Action Alignment Verification
2510.16281v1
cs.RO, cs.AI, cs.LG
2025-10-22
Авторы:
Yilin Wu, Anqi Li, Tucker Hermans, Fabio Ramos, Andrea Bajcsy, Claudia P'erez-D'Arpino
Abstract
Reasoning Vision Language Action (VLA) models improve robotic
instruction-following by generating step-by-step textual plans before low-level
actions, an approach inspired by Chain-of-Thought (CoT) reasoning in language
models. Yet even with a correct textual plan, the generated actions can still
miss the intended outcomes in the plan, especially in out-of-distribution (OOD)
scenarios. We formalize this phenomenon as a lack of embodied CoT faithfulness,
and introduce a training-free, runtime policy steering method for
reasoning-action alignment. Given a reasoning VLA's intermediate textual plan,
our framework samples multiple candidate action sequences from the same model,
predicts their outcomes via simulation, and uses a pre-trained Vision-Language
Model (VLM) to select the sequence whose outcome best aligns with the VLA's own
textual plan. Only executing action sequences that align with the textual
reasoning turns our base VLA's natural action diversity from a source of error
into a strength, boosting robustness to semantic and visual OOD perturbations
and enabling novel behavior composition without costly re-training. We also
contribute a reasoning-annotated extension of LIBERO-100, environment
variations tailored for OOD evaluation, and demonstrate up to 15% performance
gain over prior work on behavior composition tasks and scales with compute and
data diversity. Project Website at:
https://yilin-wu98.github.io/steering-reasoning-vla/
Ссылки и действия
Дополнительные ресурсы: