🎁👀 Gaze on the Prize:
Shaping Visual Attention
with Return-Guided Contrastive Learning

1Department of Computer Science, University of California, Davis 2Department of Electrical Engineering and Computer Sciences, University of California, Berkeley 3Department of Mechanical and Aerospace Engineering, University of California, Davis

Side-by-side Comparison of Attention Methods and Contrastive Learning





Gaze on the Prize

A Versatile Framework for Learning Human-like Gaze

Gaze on the Prize is a framework that shapes visual attention toward task-relevant features in RL through return-guided contrastive learning. By contrasting visually similar states that lead to different outcomes, it guides attention toward the features that matter for task success.



Gaze on the Prize architecture

a) A CNN backbone encodes observations into feature maps. Instead of passing them directly to the RL algorithm (baseline), our method refines them with a gaze module that predicts Gaussian attention weights parameterized by μx, μy, σx, σy, σxy. Multiplying the features by these weights (⊙) yields a human-like, foveated representation for the RL algorithm. b) During training, we store CNN features and returns in a buffer. Triplet mining groups visually similar features that yield different returns. c) Attention is applied to each triplet, and a contrastive loss on cosine distances (anchor z_a, positive z⁺, negative z⁻) guides the module to adjust its attention to better distinguish features by reward.
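Below is a minimal PyTorch sketch of parts (a) and (c): a gaze head that predicts the five Gaussian parameters and reweights the CNN feature map, plus a triplet loss on cosine distances over return-mined triplets. The class and function names (GazeModule, return_triplet_loss), the covariance parameterization, and the margin value are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeModule(nn.Module):
    """Illustrative gaze module: predicts (mu_x, mu_y, sigma_x, sigma_y, sigma_xy)
    from a feature map and returns the foveated (Gaussian-reweighted) features."""
    def __init__(self, channels: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 64), nn.ReLU(),
            nn.Linear(64, 5),  # mu_x, mu_y, sigma_x, sigma_y, sigma_xy
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feats.shape
        p = self.head(feats)
        mu = torch.tanh(p[:, :2])                               # center in [-1, 1]
        sx = F.softplus(p[:, 2]) + 1e-3                         # variance along x (assumed)
        sy = F.softplus(p[:, 3]) + 1e-3                         # variance along y (assumed)
        sxy = torch.tanh(p[:, 4]) * 0.99 * torch.sqrt(sx * sy)  # keep covariance positive definite
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=feats.device),
            torch.linspace(-1, 1, w, device=feats.device),
            indexing="ij",
        )
        dx = xs[None] - mu[:, 0, None, None]
        dy = ys[None] - mu[:, 1, None, None]
        det = (sx * sy - sxy ** 2)[:, None, None]
        # Mahalanobis distance under the 2x2 covariance [[sx, sxy], [sxy, sy]]
        q = (sy[:, None, None] * dx ** 2
             - 2 * sxy[:, None, None] * dx * dy
             + sx[:, None, None] * dy ** 2) / det
        attn = torch.exp(-0.5 * q)
        attn = attn / attn.amax(dim=(1, 2), keepdim=True).clamp_min(1e-8)
        return feats * attn[:, None]                            # foveated representation (⊙)

def return_triplet_loss(z_a, z_pos, z_neg, margin: float = 0.5):
    """Triplet loss on cosine distances between attended features: the anchor is
    pulled toward the return-consistent positive and pushed from the negative."""
    dist = lambda a, b: 1.0 - F.cosine_similarity(a.flatten(1), b.flatten(1))
    return F.relu(dist(z_a, z_pos) - dist(z_a, z_neg) + margin).mean()

In this reading of (b) and (c), the triplets are attended features of buffered states grouped by feature similarity and split by return, and the auxiliary loss is added alongside the usual RL objective.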



Q. How do different attention mechanisms affect visual RL performance?

RQ1 figure. Patch attention lacks structural constraints, so its attention can latch onto misleading features or shift too rapidly during training, destabilizing learning. Our foveal attention appears to provide the regularization needed to prevent these failure modes while remaining flexible enough to focus on task-relevant regions.



Q. Does return-guided contrastive learning enhance attention?

RQ2 figure. The impact of contrastive learning is strongest on challenging tasks: on PushT it yields a 1.48x improvement in sample efficiency to reach 50% success, and on LiftPegUpright only the contrastive variant reaches 50% success within a reasonable number of steps. PokeCube shows the largest gain, with 2.4x better sample efficiency than the baseline.



Q. Does contrastive learning improve robustness to visual clutter?

RQ3 figure. Under visual clutter, the performance gap widens. On PushTClutter, foveal attention without contrastive learning fails to solve the task and even underperforms the baseline, whereas adding contrastive learning provides the guidance needed to pick out critical cues in the cluttered scene. On PokeCubeClutter, foveal attention with contrastive learning reaches a robust ~90% success rate, while without it the success rate plateaus below 80%.



Q. Is our approach applicable across different RL algorithms?

RQ4 figure. We evaluate our method with off-policy SAC (Soft Actor-Critic) on five ManiSkill3 tasks and observe improvements over the baseline, either faster convergence or higher final success rates. The trend mirrors our PPO results, demonstrating that the approach is not tied to a single RL algorithm and can be applied to other RL methods without heavy modification, as sketched below.
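As a hedged illustration of that drop-in property, the following sketch wraps an existing CNN encoder with the hypothetical GazeModule from the architecture sketch above, so that any actor-critic algorithm (PPO, SAC, ...) only ever sees the refined feature map. The wrapper name and constructor arguments are assumptions for illustration.

import torch.nn as nn

class GazeEncoder(nn.Module):
    """Wraps an existing CNN encoder with the gaze module; the RL algorithm on top
    consumes the foveated features and otherwise needs no changes."""
    def __init__(self, cnn: nn.Module, channels: int):
        super().__init__()
        self.cnn = cnn                    # unchanged visual backbone
        self.gaze = GazeModule(channels)  # illustrative module from the sketch above

    def forward(self, obs):
        feats = self.cnn(obs)             # (B, C, H, W) feature map
        return self.gaze(feats)           # refined features for actor and critic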



BibTeX

@misc{lee2025gazeprizeshapingvisual,
      title={Gaze on the Prize: Shaping Visual Attention with Return-Guided Contrastive Learning}, 
      author={Andrew Lee and Ian Chuang and Dechen Gao and Kai Fukazawa and Iman Soltani},
      year={2025},
      eprint={2510.08442},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.08442}, 
}