Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters
2510.25860v1
cs.AI, cs.CL, cs.HC
2025-11-01
Авторы:
Xingjian Zhang, Tianhong Gao, Suliang Jin, Tianhao Wang, Teng Ye, Eytan Adar, Qiaozhu Mei
Abstract
Large language models (LLMs) are increasingly used as raters for evaluation
tasks. However, their reliability is often limited for subjective tasks, when
human judgments involve subtle reasoning beyond annotation labels. Thinking
traces, the reasoning behind a judgment, are highly informative but challenging
to collect and curate. We present a human-LLM collaborative framework to infer
thinking traces from label-only annotations. The proposed framework uses a
simple and effective rejection sampling method to reconstruct these traces at
scale. These inferred thinking traces are applied to two complementary tasks:
(1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation
guidelines for proprietary LLM raters. Across multiple datasets, our methods
lead to significantly improved LLM-human agreement. Additionally, the refined
annotation guidelines increase agreement among different LLM models. These
results suggest that LLMs can serve as practical proxies for otherwise
unrevealed human thinking traces, enabling label-only corpora to be extended
into thinking-trace-augmented resources that enhance the reliability of LLM
raters.
Ссылки и действия
Дополнительные ресурсы: