MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
2510.08567v1
cs.CV, cs.AI, cs.CL
2025-10-11
Авторы:
Tajamul Ashraf, Umair Nawaz, Abdelrahman M. Shaker, Rao Anwer, Philip Torr, Fahad Shahbaz Khan, Salman Khan
Abstract
Vision language models (VLMs) are increasingly deployed as controllers with
access to external tools for complex reasoning and decision-making, yet their
effectiveness remains limited by the scarcity of high-quality multimodal
trajectories and the cost of manual annotation. We address this challenge with
a vision-centric agent tuning framework that automatically synthesizes
multimodal trajectories, generates step-wise preference pairs, and trains a VLM
controller for robust tool-use reasoning. Our pipeline first constructs
M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified
trajectories, enabling imitation-based trajectory tuning. Building on this, we
develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool
reasoning. To achieve finer alignment, we further introduce Pref-X, a set of
11K automatically generated preference pairs, and optimize MATRIX on it via
step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA,
MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating
scalable and effective multimodal tool use. Our data and code is avaliable at
https://github.com/mbzuai-oryx/MATRIX.
Ссылки и действия
Дополнительные ресурсы: