Unmute the Patch Tokens: Rethinking Probing in Multi-Label Audio Classification

2509.24901v1 cs.SD, cs.LG 2025-10-01

Авторы:

Lukas Rauch, René Heinrich, Houtan Ghaffari, Lukas Miklautz, Ilyass Moummad, Bernhard Sick, Christoph Scholz

Abstract

Although probing frozen models has become a standard evaluation paradigm, self-supervised learning in audio defaults to fine-tuning. A key reason is that global pooling creates an information bottleneck causing linear probes to misrepresent the embedding quality: The $\texttt{cls}$-token discards crucial token information about dispersed, localized events in multi-label audio. This weakness is rooted in the mismatch between the pretraining objective (operating globally) and the downstream task (localized events). Across a comprehensive benchmark of 13 datasets and 6 spectrogram-based encoders, we first investigate the global pooling bottleneck. We then introduce binarized prototypical probes: a lightweight and simple pooling method that learns prototypes to perform class-wise information aggregation. Despite its simplicity, our method notably outperforms linear and attentive probing. Our work establishes probing as a competitive and efficient paradigm for evaluating audio SSL models, challenging the reliance on costly fine-tuning.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

Unmute the Patch Tokens: Rethinking Probing in Multi-Label Audio Classification

Авторы:

Abstract

Ссылки и действия

Связанные статьи

Contract-Driven QoE Auditing for Speech and Singing Services: From MOS Regressio...

Generative Multi-modal Feedback for Singing Voice Synthesis Evaluation

Differentiable Attenuation Filters for Feedback Delay Networks

DHAuDS: A Dynamic and Heterogeneous Audio Benchmark for Test-Time Adaptation

Count The Notes: Histogram-Based Supervision for Automatic Music Transcription

Навигация