AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment
2510.22593v1
cs.CL, cs.AI, I.2.7; I.2.11; H.3.4; D.2.8
2025-10-29
Authors:
Dario Loi, Elena Maria Muià, Federico Siciliano, Giovanni Trappolini, Vincenzo Crisà, Peter Kruger, Fabrizio Silvestri
Abstract
We present AutoBench, a fully automated and self-sustaining framework for
evaluating Large Language Models (LLMs) through reciprocal peer assessment.
This paper provides a rigorous scientific validation of the AutoBench
methodology, originally developed as an open-source project by eZecute S.R.L.
Unlike static benchmarks that suffer from test-set contamination and limited
adaptability, AutoBench dynamically generates novel evaluation tasks while
models alternately serve as question generators, contestants, and judges across
diverse domains. An iterative weighting mechanism amplifies the influence of
consistently reliable evaluators, aggregating peer judgments into
consensus-based rankings that reflect collective model agreement. Our
experiments demonstrate strong correlations with established benchmarks
including MMLU-Pro and GPQA (78% and 63%, respectively), validating this
peer-driven evaluation paradigm. The multi-judge design significantly
outperforms single-judge baselines, confirming that distributed evaluation
produces more robust and human-consistent assessments. AutoBench offers a
scalable, contamination-resistant alternative to static benchmarks for the
continuous evaluation of evolving language models.
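
The abstract does not spell out the update rule behind the iterative weighting mechanism, so the following is only a minimal sketch of one plausible scheme, not the AutoBench implementation. It assumes a hypothetical score matrix in which scores[j][m] is the grade judge j gave model m, and it assumes a judge's reliability is the inverse of its mean absolute deviation from the current weighted consensus. The function name, the matrix layout, and the parameters n_iters and eps are all illustrative choices.

```python
import numpy as np


def consensus_ranking(scores, n_iters=20, eps=1e-6):
    """Aggregate peer grades into consensus scores with iteratively reweighted judges.

    scores[j, m] is the grade judge j assigned to model m (assumed layout).
    Returns (consensus, weights): one consensus score per model and one
    normalized weight per judge.
    """
    scores = np.asarray(scores, dtype=float)
    n_judges = scores.shape[0]

    # Start with every judge equally trusted.
    weights = np.full(n_judges, 1.0 / n_judges)

    for _ in range(n_iters):
        # Weighted average grade per model under the current judge weights.
        consensus = weights @ scores

        # Assumed reliability measure: judges that stay close to the
        # consensus (small mean absolute deviation) receive more weight.
        deviation = np.abs(scores - consensus).mean(axis=1)
        new_weights = 1.0 / (deviation + eps)
        weights = new_weights / new_weights.sum()

    # Final consensus under the converged weights.
    consensus = weights @ scores
    return consensus, weights


if __name__ == "__main__":
    # Two judges agree; the third disagrees and should lose influence.
    grades = [
        [4.0, 3.0, 2.0],
        [4.0, 3.0, 2.0],
        [1.0, 5.0, 3.0],
    ]
    consensus, weights = consensus_ranking(grades)
    ranking = np.argsort(-consensus)  # model indices, best first
    print("judge weights:", weights)
    print("model ranking:", ranking)
```

In this toy input the dissenting third judge is progressively down-weighted, so the final ranking follows the two judges that agree. A rule of this kind rewards agreement with the majority, which is one way to read "consensus-based rankings that reflect collective model agreement"; other reliability measures would fit the same loop.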