Extending Audio Context for Long-Form Understanding in Large Audio-Language Models
2510.15231v1
cs.CL, cs.AI, cs.SD, eess.AS
2025-10-21
Авторы:
Yuatyong Chaichana, Pittawat Taveekitworachai, Warit Sirichotedumrong, Potsawee Manakul, Kunat Pipatanakul
Abstract
Large Audio-Language Models (LALMs) are often constrained by short audio
context windows, even when their text backbones support long contexts, limiting
long-form audio understanding. Prior work has introduced context-extension
methods (e.g. YaRN) on unimodal LLMs, yet their application to LALMs remains
unexplored. First, building on RoPE-based context extension, we introduce
Partial YaRN, a training-free, audio-only extension method that modifies only
audio token positions, leaving text positions intact to preserve the base LLM's
text capabilities. Second, we propose Virtual Longform Audio Training (VLAT), a
training strategy that extends Partial YaRN into a training-time positional
augmentation. VLAT simulates diverse audio lengths during training, enabling
generalization to inputs far longer than those seen in training and improving
robustness for long-context audio understanding. Our experiments on SALMONN and
Qwen2-Audio show that Partial YaRN outperforms the original models across wide
range of settings, and VLAT training strategy provides substantial improvement,
achieving strong performance on long audio of unseen lengths.