CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment

2508.02298v1 cs.LG, cs.AI, cs.CL 2025-08-09

Authors:

Guofu Xie, Yunsheng Shi, Hongtao Tian, Ting Yao, Xiao Zhang

Summary

**Summary** The paper proposes CAPO (Credit Assignment Policy Optimization), a method that improves Reinforcement Learning with Verifiable Rewards (RLVR) for large language models (LLMs). The problem is that conventional RLVR methods assign the same reward to every token of a response, which makes it hard to attribute success or failure to individual tokens. CAPO uses an off-the-shelf, general-purpose LLM to generate a step-by-step critique of the response, which makes it possible to assign verifiable rewards at the token level. To improve accuracy, it applies a voting mechanism over several generated critiques. Experiments show that CAPO outperforms supervised and other RL-based fine-tuning methods on mathematical and out-of-domain benchmarks, supporting its effectiveness in improving the accuracy and efficiency of LLM training.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback, helping to mitigate reward hacking. However, current RLVR methods typically treat whole responses as single actions, assigning the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure, and often results in suboptimal policies and inefficient learning. Methods like PPO provide credit assignment through value estimation, but often yield inaccurate and unverifiable signals due to limited sampling. On the other hand, methods using Process Reward Models can provide step-by-step judgments for each reasoning step, but they require high-quality process supervision labels and are time-consuming when applied in online reinforcement learning (RL). To overcome these limitations, we introduce a simple but efficient method Credit Assignment Policy Optimization (CAPO). Given a reasoning response rollout from the policy model, CAPO directly leverages an off-the-shelf, general-purpose LLM as a Generative Process Reward Model (LLM-as-GenPRM) to generate all step-wise critique by one pass, thereby providing verifiable token-level rewards to refine the tokens that were originally assigned identical rule-based rewards. This enables more fine-grained credit assignment in an effective way. Furthermore, to enhance the accuracy and robustness of CAPO, we employ voting mechanisms that scale with the number of generated critiques. Extensive experiments using different backbones like Llama and Qwen models and in different sizes show that CAPO consistently outperforms supervised learning-based and RL-based fine-tuning methods across six challenging mathematical benchmarks and three out-of-domain benchmarks.
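
The credit-assignment idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the majority-vote threshold, the fixed `penalty` value, and the representation of a critique as a list of flagged step indices are all assumptions made for illustration. The sketch takes a response split into reasoning steps, several critiques from a hypothetical LLM-as-GenPRM, votes on which steps are wrong, and lowers the reward only for tokens in those steps.

```python
from collections import Counter


def vote_wrong_steps(critiques, num_steps, threshold=0.5):
    """Return the step indices flagged as wrong by more than `threshold` of critiques.

    `critiques` is a list of lists; each inner list holds the step indices
    that one generated critique marked as incorrect (assumed format).
    """
    counts = Counter(
        i for critique in critiques for i in set(critique) if 0 <= i < num_steps
    )
    return {i for i, n in counts.items() if n / len(critiques) > threshold}


def token_rewards(step_lengths, outcome_reward, critiques, penalty=1.0):
    """Expand a single rule-based outcome reward into per-token rewards.

    Every token starts with the same outcome reward; tokens that belong to a
    step voted wrong receive `outcome_reward - penalty` (hypothetical shaping).
    """
    wrong = vote_wrong_steps(critiques, len(step_lengths))
    rewards = []
    for idx, length in enumerate(step_lengths):
        step_reward = outcome_reward - penalty if idx in wrong else outcome_reward
        rewards.extend([step_reward] * length)
    return rewards


# Three steps of 3, 2, and 4 tokens; the final answer was wrong (reward 0.0).
# Step 1 is flagged by all three critiques, step 2 by only one.
rewards = token_rewards([3, 2, 4], 0.0, [[1], [1, 2], [1]], penalty=1.0)
print(rewards)
```

In this toy run only the tokens of step 1 are penalized, since step 2 does not reach a majority. Raising the number of critiques makes the vote less sensitive to a single noisy critique, which is the intuition behind the voting mechanism mentioned in the abstract.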


Related papers

CARL: Critical Action Focused Reinforcement Learning for Multi-Step Agent

Multi-LLM Collaboration for Medication Recommendation

Network of Theseus (like the ship)

SPARK: Stepwise Process-Aware Rewards for Reference-Free Reinforcement Learning

Mode-Conditioning Unlocks Superior Test-Time Scaling
