Learning Affordances at Inference-Time for Vision-Language-Action Models
2510.19752v1
cs.RO, cs.AI, 68T40, I.2.9; I.2.8
2025-10-24
Авторы:
Ameesh Shah, William Chen, Adwait Godbole, Federico Mora, Sanjit A. Seshia, Sergey Levine
Abstract
Solving complex real-world control tasks often takes multiple tries: if we
fail at first, we reflect on what went wrong, and change our strategy
accordingly to avoid making the same mistake. In robotics,
Vision-Language-Action models (VLAs) offer a promising path towards solving
complex control tasks, but lack the ability to contextually and dynamically
readjust behavior when they fail to accomplish a task. In this work, we
introduce Learning from Inference-Time Execution (LITEN), which connects a VLA
low-level policy to a high-level VLM that conditions on past experiences by
including them in-context, allowing it to learn the affordances and
capabilities of the low-level VLA. Our approach iterates between a reasoning
phase that generates and executes plans for the low-level VLA, and an
assessment phase that reflects on the resulting execution and draws useful
conclusions to be included in future reasoning contexts. Unlike similar
approaches to self-refinement in non-robotics domains, LITEN must reflect on
unstructured real-world robot trajectories (e.g., raw videos), which requires
structured guiderails during assessment. Our experimental results demonstrate
LITEN is able to effectively learn from past experience to generate plans that
use high-affordance instructions to accomplish long-horizon tasks.