Audio-Guided Visual Perception for Audio-Visual Navigation
2510.11760v1
cs.SD, cs.AI, cs.CV, cs.MM
2025-10-16
Authors:
Yi Wang, Yinfeng Yu, Fuchun Sun, Liejun Wang, Wendong Zheng
Abstract
Audio-Visual Embodied Navigation (AVN) aims to enable agents to autonomously
navigate to sound sources in unknown 3D environments using auditory cues. While
current AVN methods excel on in-distribution sound sources, they exhibit poor
cross-source generalization: navigation success rates plummet and search paths
become excessively long when agents encounter unheard sounds or unseen
environments. This limitation stems from the lack of explicit alignment
mechanisms between auditory signals and corresponding visual regions. Policies
tend to memorize spurious "acoustic fingerprint-scenario" correlations
during training, leading to blind exploration when exposed to novel sound
sources. To address this, we propose the AGVP framework, which transforms sound
from policy-memorable acoustic fingerprint cues into spatial guidance. The
framework first extracts global auditory context via audio self-attention, then
uses this context as queries to guide visual feature attention, highlighting
sound-source-related regions at the feature level. Subsequent temporal modeling
and policy optimization are then performed. This design, centered on
interpretable cross-modal alignment and region reweighting, reduces dependency
on specific acoustic fingerprints. Experimental results demonstrate that AGVP
improves both navigation efficiency and robustness while achieving superior
cross-scenario generalization on previously unheard sounds.
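
The two attention stages described above (audio self-attention producing a global context, then that context acting as the query over visual features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the single-head formulation without learned projections, the mean pooling of audio tokens, and the tensor shapes are all assumptions made for illustration, and the temporal modeling and policy optimization stages are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def audio_self_attention(audio_tokens):
    """Hypothetical helper: single-head self-attention over audio tokens.

    audio_tokens has shape (T, d). The attended tokens are mean-pooled
    into one global audio context vector of shape (d,). Learned
    query/key/value projections are left out (assumption).
    """
    d = audio_tokens.shape[-1]
    scores = audio_tokens @ audio_tokens.T / np.sqrt(d)   # (T, T)
    attended = softmax(scores, axis=-1) @ audio_tokens    # (T, d)
    return attended.mean(axis=0)                          # (d,)


def audio_guided_visual_attention(audio_context, visual_features):
    """Hypothetical helper: the audio context queries visual region features.

    visual_features has shape (N, d), one row per spatial region.
    Returns the attention weights over regions, shape (N,), and the
    reweighted region features, shape (N, d).
    """
    d = visual_features.shape[-1]
    scores = visual_features @ audio_context / np.sqrt(d)  # (N,)
    weights = softmax(scores)                              # sums to 1
    reweighted = visual_features * weights[:, None]        # (N, d)
    return weights, reweighted


# Toy inputs: 8 audio tokens and a 7x7 grid of visual regions, 16-dim each.
rng = np.random.default_rng(0)
audio_tokens = rng.normal(size=(8, 16))
visual_features = rng.normal(size=(49, 16))

context = audio_self_attention(audio_tokens)
weights, reweighted = audio_guided_visual_attention(context, visual_features)
```

In this sketch, `weights` plays the role of a region-level map: regions whose features align with the audio context receive larger weights, which is one simple way to realize the feature-level highlighting of sound-source-related regions that the abstract describes.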