Do Joint Language-Audio Embeddings Encode Perceptual Timbre Semantics?
2510.14249v1
cs.SD, cs.AI, eess.AS
2025-10-18
Authors:
Qixin Deng, Bryan Pardo, Thrasyvoulos N Pappas
Abstract
Understanding and modeling the relationship between language and sound is
critical for applications such as music information retrieval, text-guided music
generation, and audio captioning. Central to these tasks is the use of joint
language-audio embedding spaces, which map textual descriptions and auditory
content into a shared embedding space. While multimodal embedding models such
as MS-CLAP, LAION-CLAP, and MuQ-MuLan have shown strong performance in aligning
language and audio, their correspondence to human perception of timbre, a
multifaceted attribute encompassing qualities such as brightness, roughness,
and warmth, remains underexplored. In this paper, we evaluate these three
joint language-audio embedding models on their ability to capture perceptual
dimensions of timbre. Our findings show that LAION-CLAP consistently provides
the most reliable alignment with human-perceived timbre semantics across both
instrumental sounds and audio effects.