Sci-Phi: A Large Language Model Spatial Audio Descriptor

2510.05542v1 cs.SD, cs.CL, eess.AS 2025-10-09

Authors:

Xilin Jiang, Hannes Gamper, Sebastian Braun

Abstract

Acoustic scene perception involves describing the type of sounds, their timing, their direction and distance, as well as their loudness and reverberation. While audio language models excel in sound recognition, single-channel input fundamentally limits spatial understanding. This work presents Sci-Phi, a spatial audio large language model with dual spatial and spectral encoders that estimates a complete parameter set for all sound sources and the surrounding environment. Learning from over 4,000 hours of synthetic first-order Ambisonics recordings including metadata, Sci-Phi enumerates and describes up to four directional sound sources in one pass, alongside non-directional background sounds and room characteristics. We evaluate the model with a permutation-invariant protocol and 15 metrics covering content, location, timing, loudness, and reverberation, and analyze its robustness across source counts, signal-to-noise ratios, reverberation levels, and challenging mixtures of acoustically, spatially, or temporally similar sources. Notably, Sci-Phi generalizes to real room impulse responses with only minor performance degradation. Overall, this work establishes the first audio LLM capable of full spatial-scene description, with strong potential for real-world deployment. Demo: https://sci-phi-audio.github.io/demo
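The abstract mentions a permutation-invariant evaluation protocol but does not spell out how predicted sources are paired with reference sources. The sketch below is one plausible illustration, not the authors' method: it assumes each source is summarized by an (azimuth, elevation) direction in degrees, uses angular error as the matching cost, and tries every ordering, which is cheap for at most four sources. The function names `angular_error_deg` and `best_assignment` are hypothetical.

```python
from itertools import permutations
import math


def angular_error_deg(a, b):
    """Great-circle angle in degrees between two (azimuth, elevation) pairs in degrees."""
    az1, el1 = map(math.radians, a)
    az2, el2 = map(math.radians, b)
    cos_angle = (math.sin(el1) * math.sin(el2)
                 + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2))
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def best_assignment(predicted, reference):
    """Return (mean_error_deg, perm), where perm[i] is the index of the
    prediction matched to reference source i under the lowest total error.

    Assumption: both lists have the same length; handling of missed or
    extra sources is left out of this sketch.
    """
    if len(predicted) != len(reference):
        raise ValueError("this sketch assumes equal source counts")
    if not reference:
        return (0.0, ())
    best = (float("inf"), ())
    for perm in permutations(range(len(predicted))):
        total = sum(angular_error_deg(predicted[p], reference[i])
                    for i, p in enumerate(perm))
        best = min(best, (total / len(reference), perm))
    return best


# Example: the model lists the same three sources in a different order.
reference = [(0.0, 0.0), (90.0, 0.0), (180.0, 0.0)]
predicted = [(90.0, 0.0), (180.0, 0.0), (0.0, 0.0)]
mean_error, perm = best_assignment(predicted, reference)
```

Because the score is taken over the best pairing, the order in which the model enumerates sources does not affect the result. For larger source counts, an assignment solver such as the Hungarian algorithm would replace the brute-force loop.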

Related papers

SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level

emg2speech: synthesizing speech from electromyography using self-supervised spee...

STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence

ISA-Bench: Benchmarking Instruction Sensitivity for Large Audio Language Models

XLSR-Kanformer: A KAN-Intergrated model for Synthetic Speech Detection
