Unmute the Patch Tokens: Rethinking Probing in Multi-Label Audio Classification
2509.24901v1
cs.SD, cs.LG
2025-10-01
Authors:
Lukas Rauch, René Heinrich, Houtan Ghaffari, Lukas Miklautz, Ilyass Moummad, Bernhard Sick, Christoph Scholz
Abstract
Although probing frozen models has become a standard evaluation paradigm,
self-supervised learning in audio defaults to fine-tuning. A key reason is that
global pooling creates an information bottleneck, causing linear probes to
misrepresent the embedding quality: the $\texttt{cls}$-token discards crucial
token information about dispersed, localized events in multi-label audio. This
weakness is rooted in the mismatch between the pretraining objective (operating
globally) and the downstream task (localized events). Across a comprehensive
benchmark of 13 datasets and 6 spectrogram-based encoders, we first investigate
the global pooling bottleneck. We then introduce binarized prototypical probes:
a lightweight and simple pooling method that learns prototypes to perform
class-wise information aggregation. Despite its simplicity, our method notably
outperforms linear and attentive probing. Our work establishes probing as a
competitive and efficient paradigm for evaluating audio SSL models, challenging
the reliance on costly fine-tuning.
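
The pooling bottleneck described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the token vectors, the class names, and the helpers `mean_pool`, `cosine`, and `prototype_scores` are hypothetical, the prototypes are fixed by hand rather than learned, and the binarization step is left out because the abstract does not specify it. The sketch only contrasts scoring a single globally pooled vector with scoring each class against its best-matching patch token.

```python
import math

def mean_pool(tokens):
    """Global average pooling: collapse all patch tokens into one vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def prototype_scores(tokens, prototypes):
    """Class-wise pooling: each class keeps its best match over all patch tokens."""
    return {
        label: max(cosine(token, proto) for token in tokens)
        for label, proto in prototypes.items()
    }

# Toy clip (assumed data): 6 background tokens plus one short "bird" event
# and one short "frog" event, each occupying a single patch token.
background = [1.0, 0.0, 0.0, 0.0]
tokens = [background] * 6 + [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

# Hand-picked prototypes (assumed); in a real probe these would be learned.
prototypes = {
    "bird": [0.0, 1.0, 0.0, 0.0],
    "frog": [0.0, 0.0, 1.0, 0.0],
    "insect": [0.0, 0.0, 0.0, 1.0],  # class that is absent from the clip
}

# Global route: pool first, then score the single pooled vector.
pooled = mean_pool(tokens)
global_scores = {label: cosine(pooled, proto) for label, proto in prototypes.items()}

# Local route: score every patch token, then aggregate per class.
local_scores = prototype_scores(tokens, prototypes)

for label in prototypes:
    print(label, round(global_scores[label], 3), round(local_scores[label], 3))
```

In this toy setup the two present classes score about 0.16 after mean pooling, because the background tokens dominate the average, but score 1.0 under class-wise pooling. The absent class scores 0.0 under both routes.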