An Evaluation of Interleaved Instruction Tuning on Semantic Reasoning Performance in an Audio MLLM
2511.02234v1
cs.MM, cs.CL, cs.SD
2025-11-06
Авторы:
Jiawei Liu, Enis Berk Çoban, Zarina Schevchenko, Hao Tang, Zhigang Zhu, Michael I Mandel, Johanna Devaney
Abstract
Standard training for Multi-modal Large Language Models (MLLMs) involves
concatenating non-textual information, like vision or audio, with a text
prompt. This approach may not encourage deep integration of modalities,
limiting the model's ability to leverage the core language model's reasoning
capabilities. This work examined the impact of interleaved instruction tuning
in an audio MLLM, where audio tokens are interleaved within the prompt. Using
the Listen, Think, and Understand (LTU) model as a testbed, we conduct an
experiment using the Synonym and Hypernym Audio Reasoning Dataset (SHARD), our
newly created reasoning benchmark for audio-based semantic reasoning focusing
on synonym and hypernym recognition. Our findings show that while even
zero-shot interleaved prompting improves performance on our reasoning tasks, a
small amount of fine-tuning using interleaved training prompts improves the
results further, however, at the expense of the MLLM's audio labeling ability.
Ссылки и действия
Дополнительные ресурсы: