Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR
2509.23808v2
cs.LG, cs.CL
2025-10-02
Авторы:
Fanding Huang, Guanbo Huang, Xiao Fan, Yi He, Xiao Liang, Xiao Chen, Qinting Jiang, Faisal Nadeem Khan, Jingyan Jiang, Zhi Wang
Abstract
A prevailing view in Reinforcement Learning for Verifiable Rewards (RLVR)
interprets recent progress through the lens of an exploration-exploitation
trade-off, a perspective largely shaped by token-level metrics. We re-examine
this perspective, proposing that this perceived trade-off may not be a
fundamental constraint but rather an artifact of the measurement level. To
investigate this, we shift the analysis to the semantically rich hidden-state
space, adopting Effective Rank (ER) to quantify exploration and proposing its
novel first- and second-order derivatives, named Effective Rank Velocity (ERV)
and Effective Rank Acceleration (ERA), to capture exploitation dynamics. Our
analysis reveals that at the hidden-state level, exploration and exploitation
could be decoupled (Sec. 4). This finding reveals an opportunity to enhance
both capacities simultaneously. This insight motivates our method,
Velocity-Exploiting Rank-Learning (VERL), the first to operationalize the
principle of synergistic exploration-exploitation enhancement by directly
shaping the RL advantage function. The key innovation is leveraging the
theoretically stable ERA as a predictive meta-controller to create a
synergistic, dual-channel incentive structure. Instead of forcing a trade-off,
VERL prospectively amplifies rewards for exploration to preempt overconfidence
and reinforces exploitative gains to consolidate reasoning. Experiments across
diverse LLMs and reasoning benchmarks show consistent gains, including up to
21.4% absolute accuracy improvement on the challenging Gaokao 2024 dataset.
Ссылки и действия
Дополнительные ресурсы: