Verifiable Accuracy and Abstention Rewards in Curriculum RL to Alleviate Lost-in-Conversation
2510.18731v1
cs.CL, cs.AI, cs.LG
2025-10-23
Авторы:
Ming Li
Abstract
Large Language Models demonstrate strong capabilities in single-turn
instruction following but suffer from Lost-in-Conversation (LiC), a degradation
in performance as information is revealed progressively in multi-turn settings.
Motivated by the current progress on Reinforcement Learning with Verifiable
Rewards (RLVR), we propose Curriculum Reinforcement Learning with Verifiable
Accuracy and Abstention Rewards (RLAAR), a framework that encourages models not
only to generate correct answers, but also to judge the solvability of
questions in the multi-turn conversation setting. Our approach employs a
competence-gated curriculum that incrementally increases dialogue difficulty
(in terms of instruction shards), stabilizing training while promoting
reliability. Using multi-turn, on-policy rollouts and a mixed-reward system,
RLAAR teaches models to balance problem-solving with informed abstention,
reducing premature answering behaviors that cause LiC. Evaluated on LiC
benchmarks, RLAAR significantly mitigates LiC performance decay (62.6% to
75.1%) and improves calibrated abstention rates (33.5% to 73.4%). Together,
these results provide a practical recipe for building multi-turn reliable and
trustworthy LLMs.
Ссылки и действия
Дополнительные ресурсы: