ISA-Bench: Benchmarking Instruction Sensitivity for Large Audio Language Models

2510.23558v1 cs.SD, cs.CL, eess.AS 2025-10-29

Authors:

Bohan Li, Wenbin Huang, Yuhang Qiu, Yiwei Guo, Hankun Wang, Zhihan Li, Jing Peng, Ziyang Ma, Xie Chen, Kai Yu

Abstract

Large Audio Language Models (LALMs), which couple acoustic perception with large language models (LLMs) to extract and understand diverse information from audio, have attracted intense interest from both academic and industrial communities. However, existing LALMs are highly sensitive to how instructions are phrased, affecting both (i) instruction-following rates and (ii) task performance. Yet, no existing benchmarks offer a systematic and comprehensive evaluation of this sensitivity. We introduce ISA-Bench, a dynamic benchmark evaluating instruction sensitivity for LALMs along three axes: instruction description, output format, and task composition. We assess recent open-source and proprietary LALMs using ISA-Bench, profiling both compliance and accuracy under controlled instruction variations. Experimental results reveal that even state-of-the-art LALMs suffer significant instruction sensitivity, leading to degraded performance on fundamental audio understanding tasks. To mitigate this issue, we fine-tune Qwen2-Audio on a specifically constructed complex instruction-variant dataset, achieving a marked improvement in instruction-following performance. However, this also induces nontrivial catastrophic forgetting: the model loses some previously mastered task capabilities when exposed to new instruction styles. Our benchmark provides a standardized basis for assessing and improving instruction sensitivity in LALMs, underscoring the need for instruction-robust audio understanding in real-world pipelines.


