ISA-Bench: Benchmarking Instruction Sensitivity for Large Audio Language Models
2510.23558v1
cs.SD, cs.CL, eess.AS
2025-10-29
Авторы:
Bohan Li, Wenbin Huang, Yuhang Qiu, Yiwei Guo, Hankun Wang, Zhihan Li, Jing Peng, Ziyang Ma, Xie Chen, Kai Yu
Abstract
Large Audio Language Models (LALMs), which couple acoustic perception with
large language models (LLMs) to extract and understand diverse information from
audio, have attracted intense interest from both academic and industrial
communities. However, existing LALMs are highly sensitive to how instructions
are phrased, affecting both (i) instruction-following rates and (ii) task
performance. Yet, no existing benchmarks offer a systematic and comprehensive
evaluation of this sensitivity. We introduce ISA-Bench, a dynamic benchmark
evaluating instruction sensitivity for LALMs along three axes: instruction
description, output format, and task composition. We assess recent open-source
and proprietary LALMs using ISA-Bench, profiling both compliance and accuracy
under controlled instruction variations. Experimental results reveal that even
state-of-the-art LALMs suffer significant instruction sensitivity, leading to
degraded performance on fundamental audio understanding tasks. To mitigate this
issue, we fine-tune Qwen2-Audio on a specifically constructed complex
instruction-variant dataset, achieving a marked improvement in
instruction-following performance. However, this also induces nontrivial
catastrophic forgetting: the model loses some previously mastered task
capabilities when exposed to new instruction styles. Our benchmark provides a
standardized basis for assessing and improving instruction sensitivity in
LALMs, underscoring the need for instruction-robust audio understanding in
real-world pipelines.
Ссылки и действия
Дополнительные ресурсы: