Generative Universal Verifier as Multimodal Meta-Reasoner
2510.13804v1
cs.CV, cs.AI, cs.CL
2025-10-17
Authors:
Xinchen Zhang, Xiaoying Zhang, Youbin Wu, Yanbin Cao, Renrui Zhang, Ruihang Chu, Ling Yang, Yujiu Yang
Abstract
We introduce Generative Universal Verifier, a novel concept and plugin
designed for next-generation multimodal reasoning in vision-language models and
unified multimodal models, providing the fundamental capability of reflection
and refinement on visual outcomes during the reasoning and generation process.
This work makes three main contributions: (1) We build ViVerBench, a
comprehensive benchmark spanning 16 categories of critical tasks for evaluating
visual outcomes in multimodal reasoning. Results show that existing VLMs
consistently underperform across these tasks, underscoring a substantial gap
from human-level capability in reliable visual verification. (2) We design two
automated pipelines to construct large-scale visual verification data and train
OmniVerifier-7B, the first omni-capable generative verifier trained for
universal visual verification, which achieves notable gains on ViVerBench (+8.3).
Through training, we identify three atomic capabilities in visual verification
and demonstrate how they generalize and interact synergistically. (3) We
propose OmniVerifier-TTS, a sequential test-time scaling paradigm that
leverages the universal verifier to bridge image generation and editing within
unified models, enhancing the upper bound of generative ability through
iterative fine-grained optimization. Beyond generation, we extend the universal
verifier to broader world-modeling interleaved reasoning scenarios.
Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench (+3.7)
and GenEval++ (+4.3), outperforming existing parallel test-time scaling methods,
such as Best-of-N. By endowing multimodal reasoning with reliable visual
verification, OmniVerifier advances both reliable reflection during generation
and scalable test-time refinement, marking a step toward more trustworthy and
controllable next-generation reasoning systems.
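
The contrast the abstract draws between sequential test-time scaling and parallel Best-of-N can be illustrated with a toy sketch. This is not the paper's implementation: the abstract gives no interfaces, so `toy_verify`, `toy_edit`, `sequential_tts`, and `best_of_n` are hypothetical names, and an "image" is modeled here simply as the set of prompt requirements it satisfies. The only idea taken from the abstract is the loop shape: verify, then refine based on the verifier's feedback, repeated under a step budget.

```python
from typing import Iterable, Optional, Tuple

# Assumption: an "image" is represented as the set of prompt requirements it satisfies.
Image = frozenset
Verdict = Tuple[bool, Optional[str]]  # (passes, one unmet requirement as feedback)


def toy_verify(required: frozenset, image: Image) -> Verdict:
    """Hypothetical stand-in for a generative verifier: a judgment plus feedback."""
    missing = sorted(required - image)
    return (not missing, missing[0] if missing else None)


def toy_edit(image: Image, feedback: str) -> Image:
    """Hypothetical stand-in for an editing step that fixes the reported problem."""
    return image | {feedback}


def sequential_tts(required: frozenset, draft: Image, max_steps: int) -> Tuple[Image, int]:
    """Sequential refinement: verify, edit on feedback, repeat within a budget."""
    image = draft
    for step in range(max_steps):
        ok, feedback = toy_verify(required, image)
        if ok:
            return image, step  # number of edits that were needed
        image = toy_edit(image, feedback)
    return image, max_steps  # budget exhausted; last edit is left unverified


def best_of_n(required: frozenset, candidates: Iterable[Image]) -> Image:
    """Parallel baseline: score independent candidates and keep the best one."""
    return max(candidates, key=lambda img: len(required & img))


required = frozenset({"red cube", "blue sphere", "cube left of sphere"})
draft = frozenset({"red cube"})
refined, edits_used = sequential_tts(required, draft, max_steps=4)
picked = best_of_n(required, [draft, frozenset({"red cube", "blue sphere"})])
```

In this toy setting the sequential loop reaches an image that satisfies every requirement, because each edit builds on the previous result, while Best-of-N can only return the best of the candidates it was given. Real generators and verifiers are noisy, so this shows the control flow only and says nothing about the reported benchmark gains.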