Sci-Phi: A Large Language Model Spatial Audio Descriptor
2510.05542v1
cs.SD, cs.CL, eess.AS
2025-10-09
Авторы:
Xilin Jiang, Hannes Gamper, Sebastian Braun
Abstract
Acoustic scene perception involves describing the type of sounds, their
timing, their direction and distance, as well as their loudness and
reverberation. While audio language models excel in sound recognition,
single-channel input fundamentally limits spatial understanding. This work
presents Sci-Phi, a spatial audio large language model with dual spatial and
spectral encoders that estimates a complete parameter set for all sound sources
and the surrounding environment. Learning from over 4,000 hours of synthetic
first-order Ambisonics recordings including metadata, Sci-Phi enumerates and
describes up to four directional sound sources in one pass, alongside
non-directional background sounds and room characteristics. We evaluate the
model with a permutation-invariant protocol and 15 metrics covering content,
location, timing, loudness, and reverberation, and analyze its robustness
across source counts, signal-to-noise ratios, reverberation levels, and
challenging mixtures of acoustically, spatially, or temporally similar sources.
Notably, Sci-Phi generalizes to real room impulse responses with only minor
performance degradation. Overall, this work establishes the first audio LLM
capable of full spatial-scene description, with strong potential for real-world
deployment. Demo: https://sci-phi-audio.github.io/demo
Ссылки и действия
Дополнительные ресурсы: