Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations
2510.17733v1
cs.CL, cs.LG
2025-10-22
Авторы:
Tong Chen, Akari Asai, Luke Zettlemoyer, Hannaneh Hajishirzi, Faeze Brahman
Abstract
Language models often generate factually incorrect information unsupported by
their training data, a phenomenon known as extrinsic hallucination. Existing
mitigation approaches often degrade performance on open-ended generation and
downstream tasks, limiting their practical utility. We propose an online
reinforcement learning method using a novel binary retrieval-augmented reward
(RAR) to address this tradeoff. Unlike continuous reward schemes, our approach
assigns a reward of one only when the model's output is entirely factually
correct, and zero otherwise. We evaluate our method on Qwen3 reasoning models
across diverse tasks. For open-ended generation, binary RAR achieves a 39.3%
reduction in hallucination rates, substantially outperforming both supervised
training and continuous-reward RL baselines. In short-form question answering,
the model learns calibrated abstention, strategically outputting "I don't know"
when faced with insufficient parametric knowledge. This yields 44.4% and 21.7%
fewer incorrect answers on PopQA and GPQA, respectively. Crucially, these
factuality gains come without performance degradation on instruction following,
math, or code, whereas continuous-reward RL, despite improving factuality,
induces quality regressions.
Ссылки и действия
Дополнительные ресурсы: