Gaze on the Prize: Shaping Visual Attention with Return-Guided Contrastive Learning
2510.08442v1
cs.CV, cs.AI, cs.RO
2025-10-11
Авторы:
Andrew Lee, Ian Chuang, Dechen Gao, Kai Fukazawa, Iman Soltani
Abstract
Visual Reinforcement Learning (RL) agents must learn to act based on
high-dimensional image data where only a small fraction of the pixels is
task-relevant. This forces agents to waste exploration and computational
resources on irrelevant features, leading to sample-inefficient and unstable
learning. To address this, inspired by human visual foveation, we introduce
Gaze on the Prize. This framework augments visual RL with a learnable foveal
attention mechanism (Gaze), guided by a self-supervised signal derived from the
agent's experience pursuing higher returns (the Prize). Our key insight is that
return differences reveal what matters most: If two similar representations
produce different outcomes, their distinguishing features are likely
task-relevant, and the gaze should focus on them accordingly. This is realized
through return-guided contrastive learning that trains the attention to
distinguish between the features relevant to success and failure. We group
similar visual representations into positives and negatives based on their
return differences and use the resulting labels to construct contrastive
triplets. These triplets provide the training signal that teaches the attention
mechanism to produce distinguishable representations for states associated with
different outcomes. Our method achieves up to 2.4x improvement in sample
efficiency and can solve tasks that the baseline fails to learn, demonstrated
across a suite of manipulation tasks from the ManiSkill3 benchmark, all without
modifying the underlying algorithm or hyperparameters.
Ссылки и действия
Дополнительные ресурсы: