RAVR: Reference-Answer-guided Variational Reasoning for Large Language Models
2510.25206v1
cs.AI, cs.CL, cs.LG, I.2.7
2025-10-31
Авторы:
Tianqianjin Lin, Xi Zhao, Xingyao Zhang, Rujiao Long, Yi Xu, Zhuoren Jiang, Wenbo Su, Bo Zheng
Abstract
Reinforcement learning (RL) can refine the reasoning abilities of large
language models (LLMs), but critically depends on a key prerequisite: the LLM
can already generate high-utility reasoning paths with non-negligible
probability. For tasks beyond the LLM's current competence, such reasoning path
can be hard to sample, and learning risks reinforcing familiar but suboptimal
reasoning. We are motivated by the insight from cognitive science that Why is
this the answer is often an easier question than What is the answer, as it
avoids the heavy cognitive load of open-ended exploration, opting instead for
explanatory reconstruction-systematically retracing the reasoning that links a
question to its answer. We show that LLMs can similarly leverage answers to
derive high-quality reasoning paths. We formalize this phenomenon and prove
that conditioning on answer provably increases the expected utility of sampled
reasoning paths, thereby transforming intractable problems into learnable ones.
Building on this insight, we introduce RAVR (Reference-Answer-guided
Variational Reasoning), an end-to-end framework that uses answer-conditioned
reasoning as a variational surrogate for question-only reasoning. Experiments
in both general and math domains demonstrate consistent improvements over
strong baselines. We further analyze the reasoning behavior and find that RAVR
reduces hesitation, strengthens conclusion consolidation, and promotes
problem-specific strategies in reasoning.