Rethinking LLM Evaluation: Can We Evaluate LLMs with 200x Less Data?
2510.10457v1
cs.CL, cs.LG
2025-10-15
Авторы:
Shaobo Wang, Cong Wang, Wenjie Fu, Yue Min, Mingquan Feng, Isabel Guan, Xuming Hu, Conghui He, Cunxiang Wang, Kexin Yang, Xingzhang Ren, Fei Huang, Dayiheng Liu, Linfeng Zhang
Abstract
As the demand for comprehensive evaluations of diverse model capabilities
steadily increases, benchmark suites have correspondingly grown significantly
in scale. Despite notable advances in redundancy reduction and subset-level
performance prediction, a systematic framework that effectively integrates
these methods to ensure both prediction accuracy and ranking consistency is
still largely elusive. In this paper, we first perform a sample-level analysis
of benchmark redundancy and identify several highly similar samples that can be
eliminated. Besides, we frame benchmark compression as an optimization problem
with the aim of score reconstruction. Building on these, we then propose
EssenceBench, a coarse-to-fine framework utilizing an iterative Genetic
Algorithm (GA), which takes the advantages of fitness-based subset search and
attribution-based sample search. Compared to previous methods, our approach
yields superior compression results with lower reconstruction error and
markedly higher efficiency. In particular, on the HellaSwag benchmark (10K
samples), our method preserves the ranking of all models shifting within 5%
using 25x fewer samples, and achieves 95% ranking preservation shifting within
5% using only 200x fewer samples.
Ссылки и действия
Дополнительные ресурсы: