Unleashing Perception-Time Scaling to Multimodal Reasoning Models
2510.08964v1
cs.CV, cs.CL
2025-10-14
Авторы:
Yifan Li, Zhenghao Chen, Ziheng Wu, Kun Zhou, Ruipu Luo, Can Zhang, Zhentao He, Yufei Zhan, Wayne Xin Zhao, Minghui Qiu
Abstract
Recent advances in inference-time scaling, particularly those leveraging
reinforcement learning with verifiable rewards, have substantially enhanced the
reasoning capabilities of Large Vision-Language Models (LVLMs). Inspired by
this success, similar strategies have been applied to multimodal reasoning, yet
their impact on visual perception remains unclear. To investigate this gap, we
introduce DisTANCE, a perception-centric benchmark for visual estimation tasks.
Evaluation results show that LVLMs exhibit limited estimation precision, and
inference-time scaling offers only marginal gains. We attribute this to the
fast perception paradigm of current LVLMs, where visual understanding is
treated as a one-shot output without modeling the underlying perceptual
process. To address this, we propose Perception-Time Scaling (PTS), a novel
paradigm that encourages token-rich perception and decomposes complex
perception problems into intermediate tractable sub-problems, thereby enabling
perception to align with and benefit from inference-time scaling. Combined with
reinforcement learning techniques, PTS significantly improves perception
accuracy, raising high-precision performance on DisTANCE from 8.0% to 64.7%,
and generalizes well to out-of-domain tasks. Surprisingly, even though PTS data
are purely synthetic, combining them with math reasoning data yields consistent
gains in both reasoning and real-world perception benchmarks. Further analysis
reveals that PTS introduces more perception-related tokens and increases the
model's attention to image tokens. Our code and data will be publicly released.
Ссылки и действия
Дополнительные ресурсы: