Spatial Blind Spot: Auditory Motion Perception Deficits in Audio LLMs
2511.13273v1
cs.SD, cs.AI
2025-11-19
Авторы:
Zhe Sun, Yujun Cai, Jiayu Yao, Yiwei Wang
Abstract
Large Audio-Language Models (LALMs) have recently shown impressive progress in speech recognition, audio captioning, and auditory question answering. Yet, whether these models can perceive spatial dynamics, particularly the motion of sound sources, remains unclear. In this work, we uncover a systematic motion perception deficit in current ALLMs. To investigate this issue, we introduce AMPBench, the first benchmark explicitly designed to evaluate auditory motion understanding. AMPBench introduces a controlled question-answering benchmark designed to evaluate whether Audio-Language Models (LALMs) can infer the direction and trajectory of moving sound sources from binaural audio. Comprehensive quantitative and qualitative analyses reveal that current models struggle to reliably recognize motion cues or distinguish directional patterns. The average accuracy remains below 50%, underscoring a fundamental limitation in auditory spatial reasoning. Our study highlights a fundamental gap between human and model auditory spatial reasoning, providing both a diagnostic tool and new insight for enhancing spatial cognition in future Audio-Language Models.
Ссылки и действия
Дополнительные ресурсы: