Audio-Maestro: Enhancing Large Audio-Language Models with Tool-Augmented Reasoning
2510.11454v1
cs.SD, cs.AI
2025-10-15
Authors:
Kuan-Yi Lee, Tsung-En Lin, Hung-Yi Lee
Abstract
Recent large multimodal models (LMMs) have shown strong
capabilities in audio understanding. However, most systems rely solely on
end-to-end reasoning, limiting interpretability and accuracy for tasks that
require structured knowledge or specialized signal analysis. In this work, we
present Audio-Maestro, a tool-augmented audio reasoning framework that
enables audio-language models to autonomously call external tools and integrate
their timestamped outputs into the reasoning process. This design allows the
model to analyze, transform, and interpret audio signals through specialized
tools rather than relying solely on end-to-end inference. Experiments show that
Audio-Maestro consistently improves general audio reasoning performance:
Gemini-2.5-flash's average accuracy on MMAU-Test rises from 67.4% to 72.1%,
DeSTA-2.5 from 58.3% to 62.8%, and GPT-4o from 60.8% to 63.9%. To our
knowledge, Audio-Maestro is the first framework to integrate structured tool
output into the reasoning process of large audio-language models.
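
To illustrate the idea of integrating timestamped tool outputs into the reasoning process, the following is a minimal sketch and not the authors' implementation. Every name in it (ToolSegment, fake_event_detector, TOOLS, build_augmented_prompt) is hypothetical, the tool is a stub returning fixed data, and the prompt layout is an assumption; the abstract does not specify which tools are used or how their outputs are formatted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToolSegment:
    """One timestamped result returned by a (hypothetical) tool."""
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    label: str    # what the tool reports for this span


def fake_event_detector(audio_path: str) -> List[ToolSegment]:
    # Stub standing in for a real signal-analysis tool; returns fixed data.
    return [
        ToolSegment(1.5, 4.0, "speech"),
        ToolSegment(0.0, 1.5, "dog bark"),
    ]


# Hypothetical registry of tools the model is allowed to request by name.
TOOLS: Dict[str, Callable[[str], List[ToolSegment]]] = {
    "event_detector": fake_event_detector,
}


def format_segments(tool_name: str, segments: List[ToolSegment]) -> str:
    """Render one tool's output as time-ordered, timestamped text lines."""
    lines = [f"[{tool_name}]"]
    for seg in sorted(segments, key=lambda s: s.start):
        lines.append(f"{seg.start:.2f}-{seg.end:.2f}s: {seg.label}")
    return "\n".join(lines)


def build_augmented_prompt(
    question: str, audio_path: str, requested_tools: List[str]
) -> str:
    """Run the requested tools and prepend their outputs to the question.

    Unknown tool names are skipped rather than raising an error.
    """
    blocks = []
    for name in requested_tools:
        tool = TOOLS.get(name)
        if tool is None:
            continue
        blocks.append(format_segments(name, tool(audio_path)))
    return "\n\n".join(blocks + [f"Question: {question}"])


if __name__ == "__main__":
    print(build_augmented_prompt(
        "What happens first?", "clip.wav", ["event_detector"]
    ))
```

In this sketch the model's tool request is reduced to a list of tool names, and the resulting text would be passed back to the audio-language model together with the audio. Keeping the timestamps in the text is what would let the model refer to specific spans of the signal when it answers.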