AutoFocus-IL: VLM-based Saliency Maps for Data-Efficient Visual Imitation Learning without Extra Human Annotations

2511.18617v1 cs.RO, cs.CV 2025-11-26

Авторы:

Litian Gong, Fatemeh Bahrani, Yutai Zhou, Amin Banayeeanzade, Jiachen Li, Erdem Biyik

Abstract

AutoFocus-IL is a simple yet effective method to improve data efficiency and generalization in visual imitation learning by guiding policies to attend to task-relevant features rather than distractors and spurious correlations. Although saliency regularization has emerged as a promising way to achieve this, existing approaches typically require costly supervision such as human gaze data or manual saliency annotations. In contrast, AutoFocus-IL leverages vision-language models (VLMs) to automatically identify and track key objects in demonstrations, generating temporal saliency maps that highlight causal visual signals while suppressing distractors. These maps are then used to regularize behavior cloning policies, yielding stronger alignment between visual attention and task-relevant cues. Experiments in both the CARLA simulator and real-robot manipulation tasks demonstrate that AutoFocus-IL not only outperforms standard behavior cloning but also surpasses state-of-the-art baselines that assume privileged access to human supervision, such as gaze data. Code, datasets, and trained policy videos are available at https://AutoFocus-IL.github.io/.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

AutoFocus-IL: VLM-based Saliency Maps for Data-Efficient Visual Imitation Learning without Extra Human Annotations

Авторы:

Abstract

Ссылки и действия

Связанные статьи

From Generated Human Videos to Physically Plausible Robot Trajectories

Sign Language Recognition using Bidirectional Reservoir Computing

FOM-Nav: Frontier-Object Maps for Object Goal Navigation

Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer

Estimation of Kinematic Motion from Dashcam Footage

Навигация