Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling

2508.03404v1 cs.CV, cs.AI 2025-08-06

Authors:

Xinlei Yu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Ruolin Shen, Jiangning Zhang, Xiaobin Hu, Yanwei Fu, Shuicheng Yan

Summary

**Summary:** Existing vision-language models (VLMs) are constrained by their parameter scale, have limited self-correction ability, and lose effectiveness on long visual contexts and complex reasoning, which leads to unsatisfactory results on document-based tasks. To address this, the authors propose MACT, a Multi-Agent Collaboration framework with test-time scaling, designed for visual document understanding and visual question answering (VQA). MACT consists of four small agents with clearly defined roles: planning, execution, judgment, and answer. Its distinguishing feature is the judgment agent, which only verifies correctness and sends the task back to earlier agents for revision; the authors report that this outperforms conventional correction strategies. In addition, mixed reward modeling balances agent-specific abilities against global collaboration, and agent-wise hybrid test-time scaling tailors the scaling strategy to each agent's function. In the experiments, MACT outperforms existing models on tasks with long visual contexts and complex reasoning while using fewer parameters. The three MACT variants hold the top three positions in average scores and lead on 13 of the 15 benchmarks.

Abstract

Existing vision-language models (VLMs), whether generalists or specialists, remain constrained by their parameter scale, lack robust self-correction capabilities, and underperform in tasks involving long visual contexts and complex reasoning, resulting in suboptimal performance on document-based tasks. To address this, we propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling, tailored for visual document understanding and visual question answering (VQA). It comprises four distinct small-scale agents, i.e., planning, execution, judgment, and answer agents, with clearly defined roles and effective collaboration. Notably, the judgment agent exclusively verifies correctness and redirects to prior agents for revisions, outperforming conventional correction strategies. To further expand the capability boundaries of the framework, we propose mixed reward modeling that balances agent-specific abilities and global collaboration, as well as agent-wise hybrid test-time scaling, which customizes different scaling strategies for each agent based on their functions. Evaluated on benchmarks spanning both document-based and non-document-based settings, our MACT shows superior performance with a smaller parameter scale without sacrificing the ability of general and mathematical tasks. Especially, it stands out in benchmarks involving long visual contexts and complicated reasoning. The three variants of MACT consistently hold the top three positions in average scores, leading in 13 of the 15 benchmarks. Code will be available at: https://github.com/YU-deep/MACT.git.
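The control flow described in the abstract (plan, execute, judge, answer, with the judgment agent only verifying and redirecting) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `Verdict` schema, the `run_agents` function, the callable signatures, and the `max_revisions` cap are all assumptions introduced here, and the sketch does not cover mixed reward modeling or test-time scaling.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    """Hypothetical judgment output: says whether the work is correct and,
    if not, which earlier agent should revise it. It never fixes anything itself."""
    correct: bool
    redirect_to: Optional[str] = None  # "planning" or "execution"


def run_agents(
    question: str,
    plan: Callable[[str], List[str]],
    execute: Callable[[str, List[str]], List[str]],
    judge: Callable[[str, List[str], List[str]], Verdict],
    answer: Callable[[str, List[str]], str],
    max_revisions: int = 2,  # assumed cap, not taken from the paper
) -> str:
    steps = plan(question)
    results = execute(question, steps)
    for _ in range(max_revisions):
        verdict = judge(question, steps, results)
        if verdict.correct:
            break
        if verdict.redirect_to == "planning":
            steps = plan(question)  # re-plan before re-running
        results = execute(question, steps)  # re-run the current plan
    # If the revision budget runs out, the last results are answered as-is.
    return answer(question, results)


if __name__ == "__main__":
    attempts = {"execute": 0}

    def toy_plan(q: str) -> List[str]:
        return ["locate the table", "read the total"]

    def toy_execute(q: str, steps: List[str]) -> List[str]:
        # The first attempt is deliberately wrong so the judge has to redirect.
        attempts["execute"] += 1
        return ["total=41"] if attempts["execute"] == 1 else ["total=42"]

    def toy_judge(q: str, steps: List[str], results: List[str]) -> Verdict:
        ok = results == ["total=42"]
        return Verdict(ok, None if ok else "execution")

    def toy_answer(q: str, results: List[str]) -> str:
        return results[-1].split("=")[1]

    print(run_agents("What is the total?", toy_plan, toy_execute, toy_judge, toy_answer))
```

In the toy run, the judge rejects the first execution result and redirects to the execution agent, and the second attempt passes. Keeping the judge free of any correction logic mirrors the abstract's statement that it "exclusively verifies correctness and redirects to prior agents for revisions".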


