More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration
2510.02227v1
cs.CL, cs.AI, cs.LG
2025-10-04
Авторы:
Xiaoyang Yuan, Yujuan Ding, Yi Bin, Wenqi Shao, Jinyu Cai, Jingkuan Song, Yang Yang, Hengtao Shen
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is a promising paradigm
for enhancing the reasoning ability in Large Language Models (LLMs). However,
prevailing methods primarily rely on self-exploration or a single off-policy
teacher to elicit long chain-of-thought (LongCoT) reasoning, which may
introduce intrinsic model biases and restrict exploration, ultimately limiting
reasoning diversity and performance. Drawing inspiration from multi-teacher
strategies in knowledge distillation, we introduce Adaptive Multi-Guidance
Policy Optimization (AMPO), a novel framework that adaptively leverages
guidance from multiple proficient teacher models, but only when the on-policy
model fails to generate correct solutions. This "guidance-on-demand" approach
expands exploration while preserving the value of self-discovery. Moreover,
AMPO incorporates a comprehension-based selection mechanism, prompting the
student to learn from the reasoning paths that it is most likely to comprehend,
thus balancing broad exploration with effective exploitation. Extensive
experiments show AMPO substantially outperforms a strong baseline (GRPO), with
a 4.3% improvement on mathematical reasoning tasks and 12.2% on
out-of-distribution tasks, while significantly boosting Pass@k performance and
enabling more diverse exploration. Notably, using four peer-sized teachers, our
method achieves comparable results to approaches that leverage a single, more
powerful teacher (e.g., DeepSeek-R1) with more data. These results demonstrate
a more efficient and scalable path to superior reasoning and generalizability.
Our code is available at https://github.com/SII-Enigma/AMPO.
Ссылки и действия
Дополнительные ресурсы: