INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models
2510.01389v1
cs.RO, cs.AI, cs.LG
2025-10-04
Авторы:
Ulas Berk Karli, Ziyao Shangguan, Tesca FItzgerald
Abstract
Recent Vision-Language-Action (VLA) models show strong generalization
capabilities, yet they lack introspective mechanisms for anticipating failures
and requesting help from a human supervisor. We present \textbf{INSIGHT}, a
learning framework for leveraging token-level uncertainty signals to predict
when a VLA should request help. Using $\pi_0$-FAST as the underlying model, we
extract per-token \emph{entropy}, \emph{log-probability}, and Dirichlet-based
estimates of \emph{aleatoric and epistemic uncertainty}, and train compact
transformer classifiers to map these sequences to help triggers. We explore
supervision regimes for strong or weak supervision, and extensively compare
them across in-distribution and out-of-distribution tasks. Our results show a
trade-off: strong labels enable models to capture fine-grained uncertainty
dynamics for reliable help detection, while weak labels, though noisier, still
support competitive introspection when training and evaluation are aligned,
offering a scalable path when dense annotation is impractical. Crucially, we
find that modeling the temporal evolution of token-level uncertainty signals
with transformers provides far greater predictive power than static
sequence-level scores. This study provides the first systematic evaluation of
uncertainty-based introspection in VLAs, opening future avenues for active
learning and for real-time error mitigation through selective human
intervention.
Ссылки и действия
Дополнительные ресурсы: