OWL: Geometry-Aware Spatial Reasoning for Audio Large Language Models
2509.26140v1
cs.SD, cs.AI
2025-10-02
Authors:
Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam
Abstract
Spatial reasoning is fundamental to auditory perception, yet current audio
large language models (ALLMs) largely rely on unstructured binaural cues and
single-step inference. This limits both perceptual accuracy in direction and
distance estimation and the capacity for interpretable reasoning. Recent work
such as BAT demonstrates spatial QA with binaural audio, but its reliance on
coarse categorical labels (left, right, up, down) and the absence of explicit
geometric supervision constrain resolution and robustness. We introduce the
$\textbf{Spatial-Acoustic Geometry Encoder (SAGE)}$, a geometry-aware audio
encoder that aligns binaural acoustic features with 3D spatial structure using
panoramic depth images and room-impulse responses at training time, while
requiring only audio at inference. Building on this representation, we present
$\textbf{OWL}$, an ALLM that integrates $\textbf{SAGE}$ with a spatially
grounded chain-of-thought to reason over direction-of-arrival (DoA) and
distance estimates. Through curriculum learning from perceptual QA to
multi-step reasoning, $\textbf{OWL}$ supports o'clock-level azimuth and DoA
estimation. To enable large-scale training and evaluation, we construct and
release $\textbf{BiDepth}$, a dataset of over one million QA pairs combining
binaural audio with panoramic depth images and room impulse responses across
both in-room and out-of-room scenarios. Across two benchmark datasets, our new
$\textbf{BiDepth}$ and the public SpatialSoundQA, $\textbf{OWL}$ reduces mean
DoA error by $\textbf{11}^{\circ}$ through $\textbf{SAGE}$ and improves
spatial reasoning QA accuracy by up to $\textbf{25}$\% over BAT.
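
To make "o'clock-level azimuth" and "mean DoA error" concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not define the clock convention, so the code assumes 0 degrees is straight ahead (12 o'clock), azimuth increases clockwise, and each hour covers 30 degrees. The function names `azimuth_to_oclock` and `angular_error_deg` are hypothetical.

```python
def azimuth_to_oclock(azimuth_deg: float) -> int:
    """Map an azimuth in degrees to the nearest clock position (1..12).

    Assumed convention (not stated in the abstract): 0 deg is straight
    ahead (12 o'clock), angles increase clockwise, one hour = 30 deg.
    """
    # Wrap into [0, 360), then pick the nearest 30-degree bin.
    hour = round((azimuth_deg % 360.0) / 30.0) % 12
    # Bin 0 corresponds to 12 o'clock rather than "0 o'clock".
    return 12 if hour == 0 else hour


def angular_error_deg(pred_deg: float, true_deg: float) -> float:
    """Smallest absolute difference between two azimuths, in [0, 180]."""
    diff = abs(pred_deg - true_deg) % 360.0
    # Going the other way around the circle may be shorter.
    return min(diff, 360.0 - diff)
```

Under these assumptions, a source directly to the listener's right (90 degrees) maps to 3 o'clock, and a prediction of 350 degrees against a true azimuth of 10 degrees counts as a 20-degree error, not 340. Averaging `angular_error_deg` over a test set gives one plausible reading of a mean DoA error for azimuth.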