From Faithfulness to Correctness: Generative Reward Models that Think Critically
2509.25409v1
cs.CL, cs.AI, cs.LG
2025-10-02
Authors:
Qiyao Ma, Yunsheng Shi, Hongtao Tian, Chao Wang, Weiming Chang, Ting Yao
Abstract
Through reinforcement learning with verifiable rewards (RLVR), large language
models have achieved substantial progress in domains with easily verifiable
outcomes, such as mathematics and coding. However, when applied to more complex
tasks like open-domain question answering, RLVR faces significant challenges
due to the difficulty of verifying correctness. The nuanced and ambiguous
nature of real-world knowledge makes it difficult to reliably evaluate
correctness in these settings, necessitating further abilities that extend
beyond mere logical consistency to encompass an understanding and assessment of
both external and internal knowledge. Recent work has primarily focused on
improving faithfulness, defined as semantic alignment with supporting
documents, which can cause models to rely excessively on external sources and
diminish their capacity for critical assessment. To address this, we propose
the Thinking-supervised Reward Model (TRM), which incorporates sentence-level
thinking supervision to endow reward models with critical thinking abilities.
Given a query, answer, and supporting documents, TRM first assesses the
faithfulness of each answer sentence to the supporting documents, and then
applies a reasoning step to evaluate sentence-level correctness. By structuring
reward modeling as a sequence of faithfulness, reasoning, and correctness
evaluations, TRM encourages models to critically assess and leverage both
external and internal knowledge. Experiments on reward signals demonstrate that
TRM substantially improves the identification of incorrect sentences, and
incorporating TRM into policy optimization leads to significant gains in both
answer correctness and usefulness.
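
The faithfulness, reasoning, and correctness sequence described in the abstract can be sketched as a small data flow. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `SentenceJudgment`, `judge_answer`, and the three `toy_*` callables are hypothetical, and the toy callables use string matching where a real system would call a trained reward model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SentenceJudgment:
    """Hypothetical record of the three per-sentence evaluation stages."""
    sentence: str
    faithful: bool   # is the sentence supported by the supporting documents?
    reasoning: str   # rationale linking faithfulness to correctness
    correct: bool    # final sentence-level correctness label


def judge_answer(
    query: str,
    answer_sentences: List[str],
    documents: List[str],
    assess_faithfulness: Callable[[str, List[str]], bool],
    reason: Callable[[str, str, bool], str],
    assess_correctness: Callable[[str, str], bool],
) -> List[SentenceJudgment]:
    """Run faithfulness -> reasoning -> correctness for each answer sentence."""
    judgments = []
    for sentence in answer_sentences:
        faithful = assess_faithfulness(sentence, documents)
        reasoning = reason(query, sentence, faithful)
        correct = assess_correctness(sentence, reasoning)
        judgments.append(SentenceJudgment(sentence, faithful, reasoning, correct))
    return judgments


# Toy stand-ins (assumptions for illustration only, not a real reward model).
TOY_KNOWN_FALSE = {"the eiffel tower was completed in 1899."}


def toy_faithfulness(sentence: str, documents: List[str]) -> bool:
    # Faithful here means the sentence appears verbatim in some document.
    return any(sentence.lower() in doc.lower() for doc in documents)


def toy_reason(query: str, sentence: str, faithful: bool) -> str:
    # "Internal knowledge" is mocked as a fixed set of known-false statements.
    if sentence.lower() in TOY_KNOWN_FALSE:
        return "contradicts internal knowledge"
    return "supported by documents" if faithful else "unsupported, no contradiction found"


def toy_correctness(sentence: str, reasoning: str) -> bool:
    # A sentence is judged incorrect only when the reasoning found a contradiction.
    return reasoning != "contradicts internal knowledge"
```

With a supporting document that contains an error, the toy pipeline separates the two notions: a sentence copied from the faulty document is faithful but judged incorrect, while a sentence absent from the documents is unfaithful yet judged correct.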