Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning
2510.10959v1
cs.LG, cs.AI, cs.CL, stat.ML
2025-10-15
Authors:
Xiaoyun Zhang, Xiaojian Yuan, Di Huang, Wang You, Chen Hu, Jingqing Ruan, Kejiang Chen, Xing Hu
Abstract
Reasoning ability has become a defining capability of Large Language Models
(LLMs), with Reinforcement Learning with Verifiable Rewards (RLVR) emerging as
a key paradigm to enhance it. However, RLVR training often suffers from policy
entropy collapse, where the policy becomes overly deterministic, hindering
exploration and limiting reasoning performance. While entropy regularization is
a common remedy, its effectiveness is highly sensitive to the fixed
coefficient, making it unstable across tasks and models. In this work, we
revisit entropy regularization in RLVR and argue that its potential has been
largely underestimated. Our analysis shows that (i) tasks of varying difficulty
demand distinct exploration intensities, and (ii) balanced exploration may
require the policy entropy to be maintained within a moderate range below its
initial level. Therefore, we propose Adaptive Entropy Regularization (AER), a
framework that dynamically balances exploration and exploitation via three
components: difficulty-aware coefficient allocation, initial-anchored target
entropy, and dynamic global coefficient adjustment. Experiments on multiple
mathematical reasoning benchmarks show that AER consistently outperforms
baselines, improving both reasoning accuracy and exploration capability.
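
The abstract names the three AER components but does not give their update rules, so the following is only a minimal sketch of how such a scheme might look, not the authors' implementation. Everything specific here is an assumption made for illustration: the class `AdaptiveEntropyCoef`, the function `difficulty_scaled_coefs`, the `target_ratio`, `step_size`, and `max_coef` parameters, the additive update, and the use of per-prompt pass rate as the difficulty signal.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AdaptiveEntropyCoef:
    """Hypothetical global-coefficient controller (not the paper's exact rule)."""

    initial_entropy: float     # policy entropy measured before training starts
    target_ratio: float = 0.5  # assumed: target entropy = ratio * initial entropy
    coef: float = 0.0          # current global entropy coefficient
    step_size: float = 0.001   # assumed additive adjustment per update
    max_coef: float = 0.1      # assumed upper bound on the coefficient

    @property
    def target_entropy(self) -> float:
        # Target is anchored to the initial entropy and sits below it.
        return self.target_ratio * self.initial_entropy

    def update(self, current_entropy: float) -> float:
        # Raise the coefficient when entropy falls below the target,
        # lower it when entropy rises above the target.
        if current_entropy < self.target_entropy:
            self.coef += self.step_size
        elif current_entropy > self.target_entropy:
            self.coef -= self.step_size
        # Keep the coefficient inside [0, max_coef].
        self.coef = min(max(self.coef, 0.0), self.max_coef)
        return self.coef


def difficulty_scaled_coefs(global_coef: float, pass_rates: List[float]) -> List[float]:
    """Assumed allocation: prompts with a lower pass rate get a larger coefficient."""
    return [global_coef * (1.0 - p) for p in pass_rates]
```

Under these assumptions, a training loop would call `update` once per step with the measured policy entropy, then pass the returned value to `difficulty_scaled_coefs` to obtain one entropy coefficient per prompt in the batch.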