The Reasoning Boundary Paradox: How Reinforcement Learning Constrains Language Models
2510.02230v1
cs.AI, cs.CL, cs.CV
2025-10-04
Авторы:
Phuc Minh Nguyen, Chinh D. La, Duy M. H. Nguyen, Nitesh V. Chawla, Binh T. Nguyen, Khoa D. Doan
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key
method for improving Large Language Models' reasoning capabilities, yet recent
evidence suggests it may paradoxically shrink the reasoning boundary rather
than expand it. This paper investigates the shrinkage issue of RLVR by
analyzing its learning dynamics and reveals two critical phenomena that explain
this failure. First, we expose negative interference in RLVR, where learning to
solve certain training problems actively reduces the likelihood of correct
solutions for others, leading to the decline of Pass@$k$ performance, or the
probability of generating a correct solution within $k$ attempts. Second, we
uncover the winner-take-all phenomenon: RLVR disproportionately reinforces
problems with high likelihood, correct solutions, under the base model, while
suppressing other initially low-likelihood ones. Through extensive theoretical
and empirical analysis on multiple mathematical reasoning benchmarks, we show
that this effect arises from the inherent on-policy sampling in standard RL
objectives, causing the model to converge toward narrow solution strategies.
Based on these insights, we propose a simple yet effective data curation
algorithm that focuses RLVR learning on low-likelihood problems, achieving
notable improvement in Pass@$k$ performance. Our code is available at
https://github.com/mail-research/SELF-llm-interference.
Ссылки и действия
Дополнительные ресурсы: