Do Joint Language-Audio Embeddings Encode Perceptual Timbre Semantics?

2510.14249v1 cs.SD, cs.AI, eess.AS 2025-10-18

Authors:

Qixin Deng, Bryan Pardo, Thrasyvoulos N Pappas

Abstract

Understanding and modeling the relationship between language and sound is critical for applications such as music information retrieval, text-guided music generation, and audio captioning. Central to these tasks are joint language-audio embedding spaces, which map textual descriptions and auditory content into a shared space. While multimodal embedding models such as MS-CLAP, LAION-CLAP, and MuQ-MuLan have shown strong performance in aligning language and audio, their correspondence to human perception of timbre, a multifaceted attribute encompassing qualities such as brightness, roughness, and warmth, remains underexplored. In this paper, we evaluate these three joint language-audio embedding models on their ability to capture perceptual dimensions of timbre. Our findings show that LAION-CLAP consistently provides the most reliable alignment with human-perceived timbre semantics across both instrumental sounds and audio effects.



