One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning
2510.26167v1
cs.AI, cs.CL
2025-11-01
Авторы:
Renhao Li, Jianhong Tu, Yang Su, Hamid Alinejad-Rokny, Derek F. Wong, Junyang Lin, Min Yang
Abstract
Reward models (RMs) play a critical role in aligning large language models
(LLMs) with human preferences. Yet in the domain of tool learning, the lack of
RMs specifically designed for function-calling tasks has limited progress
toward more capable agentic AI. We introduce ToolRM, a family of lightweight
generative RMs tailored for general tool-use scenarios. To build these models,
we propose a novel pipeline that constructs pairwise preference data using
rule-based scoring and multidimensional sampling. This yields
ToolPref-Pairwise-30K, a diverse, balanced, and challenging dataset of critique
tasks that supports reinforcement learning with verifiable feedback. To
evaluate tool-use RMs, we also introduce TRBench$_{BFCL}$, a benchmark built on
the agentic evaluation suite BFCL. Trained on our constructed data, models from
the Qwen3-4B/8B series achieve up to 14.28% higher accuracy, substantially
outperforming frontier models such as Claude 4 and OpenAI o3 in pairwise reward
judgments. Beyond training objectives, ToolRM generalizes to broader critique
tasks, including Best-of-N sampling and self-correction. Experiments on
ACEBench highlight its effectiveness and efficiency, enabling inference-time
scaling and reducing output token usage by over 66%. We release data and model
checkpoints to facilitate future research.
Ссылки и действия
Дополнительные ресурсы: