TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics

2509.26329v1 eess.AS, cs.CL, cs.LG, cs.SD 2025-10-02
Authors:

Yi-Cheng Lin, Yu-Hua Chen, Jia-Kai Dong, Yueh-Hsuan Huang, Szu-Chi Chen, Yu-Chen Chen, Chih-Yao Chen, Yu-Jung Lin, Yu-Ling Chen, Zih-Yu Chen, I-Ning Tsai, Hsiu-Hsuan Wang, Ho-Lam Chung, Ke-Han Lu, Hung-yi Lee

Abstract

Large audio-language models are advancing rapidly, yet most evaluations emphasize speech or globally sourced sounds, overlooking culturally distinctive cues. This gap raises a critical question: can current models generalize to localized, non-semantic audio that communities instantly recognize but outsiders do not? To address this, we present TAU (Taiwan Audio Understanding), a benchmark of everyday Taiwanese "soundmarks." TAU is built through a pipeline combining curated sources, human editing, and LLM-assisted question generation, producing 702 clips and 1,794 multiple-choice items that cannot be solved by transcripts alone. Experiments show that state-of-the-art LALMs, including Gemini 2.5 and Qwen2-Audio, perform far below local humans. TAU demonstrates the need for localized benchmarks to reveal cultural blind spots, guide more equitable multimodal evaluation, and ensure models serve communities beyond the global mainstream.
