Generate, Evaluate, Iterate: Synthetic Data for Human-in-the-Loop Refinement of LLM Judges
2511.04478v1
cs.HC, cs.AI
2025-11-08
Authors:
Hyo Jin Do, Zahra Ashktorab, Jasmina Gajcin, Erik Miehling, Martín Santillán Cooper, Qian Pan, Elizabeth M. Daly, Werner Geyer
Abstract
The LLM-as-a-judge paradigm enables flexible, user-defined evaluation, but
its effectiveness is often limited by the scarcity of diverse, representative
data for refining criteria. We present a tool that integrates synthetic data
generation into the LLM-as-a-judge workflow, empowering users to create
tailored and challenging test cases with configurable domains, personas,
lengths, and desired outcomes, including borderline cases. The tool also
supports AI-assisted inline editing of existing test cases. To enhance
transparency and interpretability, it reveals the prompts and explanations
behind each generation. In a user study (N=24), 83% of participants preferred
the tool over manually creating or selecting test cases, as it allowed them to
rapidly generate diverse synthetic data without additional workload. The
generated synthetic data proved as effective as hand-crafted data for both
refining evaluation criteria and aligning with human preferences. These
findings highlight synthetic data as a promising alternative, particularly in
contexts where efficiency and scalability are critical.