Composition-Grounded Instruction Synthesis for Visual Reasoning
2510.15040v1
cs.CV, cs.CL, cs.LG
2025-10-21
Авторы:
Xinyi Gu, Jiayuan Mao, Zhang-Wei Hong, Zhuoran Yu, Pengyuan Li, Dhiraj Joshi, Rogerio Feris, Zexue He
Abstract
Pretrained multi-modal large language models (MLLMs) demonstrate strong
performance on diverse multimodal tasks, but remain limited in reasoning
capabilities for domains where annotations are difficult to collect. In this
work, we focus on artificial image domains such as charts, rendered documents,
and webpages, which are abundant in practice yet lack large-scale human
annotated reasoning datasets. We introduce COGS (COmposition-Grounded
instruction Synthesis), a data-efficient framework for equipping MLLMs with
advanced reasoning abilities from a small set of seed questions. The key idea
is to decompose each seed question into primitive perception and reasoning
factors, which can then be systematically recomposed with new images to
generate large collections of synthetic question-answer pairs. Each generated
question is paired with subquestions and intermediate answers, enabling
reinforcement learning with factor-level process rewards. Experiments on chart
reasoning show that COGS substantially improves performance on unseen
questions, with the largest gains on reasoning-heavy and compositional
questions. Moreover, training with a factor-level mixture of different seed
data yields better transfer across multiple datasets, suggesting that COGS
induces generalizable capabilities rather than dataset-specific overfitting. We
further demonstrate that the framework extends beyond charts to other domains
such as webpages.
Ссылки и действия
Дополнительные ресурсы: