ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation
2510.08569v1
cs.CL, cs.AI, cs.LG
2025-10-11
Authors:
Qin Liu, Jacob Dineen, Yuxi Huang, Sheng Zhang, Hoifung Poon, Ben Zhou, Muhao Chen
Abstract
Benchmarks are central to measuring the capabilities of large language models
and guiding model development, yet widespread data leakage from pretraining
corpora undermines their validity. Models can match memorized content rather
than demonstrate true generalization, which inflates scores, distorts
cross-model comparisons, and misrepresents progress. We introduce ArenaBencher,
a model-agnostic framework for automatic benchmark evolution that updates test
cases while preserving comparability. Given an existing benchmark and a diverse
pool of models to be evaluated, ArenaBencher infers the core ability of each
test case, generates candidate question-answer pairs that preserve the original
objective, verifies correctness and intent with an LLM as a judge, and
aggregates feedback from multiple models to select candidates that expose
shared weaknesses. The process runs iteratively with in-context demonstrations
that steer generation toward more challenging and diagnostic cases. We apply
ArenaBencher to math problem solving, commonsense reasoning, and safety domains
and show that it produces verified, diverse, and fair updates that uncover new
failure modes, increase difficulty while preserving test objective alignment,
and improve model separability. The framework provides a scalable path to
continuously evolve benchmarks in step with the rapid progress of foundation
models.
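The iterative loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every name here (`Candidate`, `failure_rate`, `evolve_test_case`, and the `infer_ability`, `generate`, and `judge` callables) is hypothetical, the "aggregated feedback" is assumed to be a simple fraction of models that answer incorrectly, and the toy stubs at the bottom stand in for real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    question: str
    answer: str


Model = Callable[[str], str]  # maps a question to the model's answer


def failure_rate(case: Candidate, models: List[Model]) -> float:
    """Assumed aggregation: fraction of models in the pool that answer incorrectly."""
    wrong = sum(1 for m in models if m(case.question).strip() != case.answer)
    return wrong / len(models)


def evolve_test_case(
    seed: Candidate,
    infer_ability: Callable[[Candidate], str],
    generate: Callable[[str, Candidate, List[Candidate], int], List[Candidate]],
    judge: Callable[[str, Candidate, Candidate], bool],
    models: List[Model],
    rounds: int = 2,
    per_round: int = 3,
) -> Tuple[Candidate, float]:
    ability = infer_ability(seed)            # 1. infer the core ability being tested
    demos: List[Candidate] = []              # in-context demonstrations for later rounds
    best, best_score = seed, failure_rate(seed, models)
    for _ in range(rounds):
        candidates = generate(ability, seed, demos, per_round)          # 2. generate
        verified = [c for c in candidates if judge(ability, seed, c)]   # 3. verify
        if not verified:
            continue
        scored = sorted(
            ((failure_rate(c, models), c) for c in verified),           # 4. aggregate
            key=lambda pair: pair[0],
            reverse=True,
        )
        if scored[0][0] > best_score:
            best_score, best = scored[0]
        demos = [c for _, c in scored[:2]]   # 5. hardest cases steer the next round
    return best, best_score


# --- Toy stand-ins for LLM calls (addition problems only) ---
def _sum(question: str) -> int:
    a, b = question.split("+")
    return int(a) + int(b)


def model_a(q: str) -> str:  # always correct
    return str(_sum(q))


def model_b(q: str) -> str:  # only handles single-digit operands
    a, b = q.split("+")
    return str(_sum(q)) if int(a) < 10 and int(b) < 10 else "0"


def model_c(q: str) -> str:  # only handles sums below 20
    return str(_sum(q)) if _sum(q) < 20 else "0"


def toy_generate(ability, seed, demos, n):
    pool = [Candidate("7+8", "15"), Candidate("12+29", "41"), Candidate("1+1", "3")]
    return pool[:n]


def toy_judge(ability, seed, cand):  # rejects candidates whose reference answer is wrong
    return str(_sum(cand.question)) == cand.answer


best, score = evolve_test_case(
    Candidate("2+3", "5"),
    infer_ability=lambda case: "integer addition",
    generate=toy_generate,
    judge=toy_judge,
    models=[model_a, model_b, model_c],
)
```

In this toy run the mislabeled candidate is filtered out by the judge, and the surviving candidate that the most models fail on replaces the seed. Scoring by pooled failures rather than against a single model is what keeps the selection from targeting one model's quirks.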