Representation-Based Data Quality Audits for Audio
2509.26291v1
cs.SD, cs.AI, cs.LG
2025-10-02
Authors:
Alvaro Gonzalez-Jimenez, Fabian Gröger, Linda Wermelinger, Andrin Bürli, Iason Kastanis, Simone Lionetti, Marc Pouly
Abstract
Data quality issues such as off-topic samples, near duplicates, and label
errors often limit the performance of audio-based systems. This paper addresses
these issues by adapting SelfClean, a representation-to-rank data auditing
framework, from the image to the audio domain. This approach leverages
self-supervised audio representations to identify common data quality issues,
creating ranked review lists that surface distinct issues within a single,
unified process. The method is benchmarked on ESC-50, GTZAN, and a
proprietary industrial dataset, using both synthetic and naturally occurring
corruptions. The results demonstrate that this framework achieves
state-of-the-art ranking performance, often outperforming issue-specific
baselines and enabling significant annotation savings by efficiently guiding
human review.
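
To make the representation-to-rank idea concrete, here is a minimal sketch that turns precomputed embeddings into three ranked review lists. It is not the authors' implementation: the function names, the k-nearest-neighbour isolation score, and the nearest-neighbour distance ratio are simplified stand-ins chosen for illustration, and the toy vectors stand in for the output of a self-supervised audio encoder. The label-error score assumes at least two classes and at least two samples per class.

```python
import numpy as np

def cosine_distance_matrix(emb):
    """Pairwise cosine distances between row vectors."""
    emb = np.asarray(emb, dtype=float)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return np.clip(1.0 - unit @ unit.T, 0.0, 2.0)

def rank_near_duplicates(dist):
    """Index pairs sorted from closest to farthest."""
    i, j = np.triu_indices(dist.shape[0], k=1)
    order = np.argsort(dist[i, j])
    return [(int(i[t]), int(j[t])) for t in order]

def rank_off_topic(dist, k=3):
    """Samples sorted by mean distance to their k nearest neighbours,
    most isolated first (an assumed, simplified isolation score)."""
    d = dist.copy()
    np.fill_diagonal(d, np.inf)  # ignore self-distance
    knn = np.sort(d, axis=1)[:, :k]
    return [int(t) for t in np.argsort(-knn.mean(axis=1))]

def rank_label_errors(dist, labels):
    """Samples sorted by how much closer they sit to another class
    than to their own class, most suspicious first."""
    labels = np.asarray(labels)
    d = dist.copy()
    np.fill_diagonal(d, np.inf)
    same = labels[:, None] == labels[None, :]
    nearest_same = np.where(same, d, np.inf).min(axis=1)
    nearest_other = np.where(~same, d, np.inf).min(axis=1)
    ratio = nearest_same / (nearest_same + nearest_other)
    return [int(t) for t in np.argsort(-ratio)]

# Hypothetical toy embeddings: two clusters plus three planted issues.
embeddings = np.array([
    [1.00, 0.00, 0.000],  # 0: class 0
    [0.90, 0.10, 0.000],  # 1: class 0
    [0.00, 1.00, 0.000],  # 2: class 1
    [0.10, 0.90, 0.000],  # 3: class 1
    [1.00, 0.00, 0.001],  # 4: near duplicate of sample 0
    [0.00, 0.00, 1.000],  # 5: off-topic sample
    [0.05, 1.00, 0.000],  # 6: labelled class 0 but sits in class 1
])
labels = [0, 0, 1, 1, 0, 0, 0]

dist = cosine_distance_matrix(embeddings)
print("near-duplicate pairs:", rank_near_duplicates(dist)[:3])
print("off-topic candidates:", rank_off_topic(dist)[:3])
print("label-error candidates:", rank_label_errors(dist, labels)[:3])
```

A reviewer would inspect each list from the top and stop once flagged items no longer look problematic, which is where the annotation savings described in the abstract would come from.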