Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation
2510.24358v1
cs.SE, cs.CL
2025-10-30
Авторы:
Lingyue Fu, Bolun Zhang, Hao Guan, Yaoming Zhu, Lin Qiu, Weiwen Liu, Xuezhi Cao, Xunliang Cai, Weinan Zhang, Yong Yu
Abstract
Recent advances in code agents have enabled automated software development at
the project level, supported by large language models (LLMs) and widely adopted
tools. However, existing benchmarks for code agent evaluation face two major
limitations: high annotation cost and expertise requirements, and rigid
evaluation metrics that rely primarily on unit tests. To address these
challenges, we propose an agent-driven benchmark construction pipeline that
leverages human supervision to efficiently generate diverse and challenging
project-level tasks. Based on this approach, we introduce PRDBench, a novel
benchmark comprising 50 real-world Python projects across 20 domains, each with
structured Product Requirement Document (PRD) requirements, comprehensive
evaluation criteria, and reference implementations. PRDBench features rich data
sources, high task complexity, and flexible metrics. We further employ an
Agent-as-a-Judge paradigm to score agent outputs, enabling the evaluation of
various test types beyond unit tests. Extensive experiments on PRDBench
demonstrate its effectiveness in assessing the capabilities of both code agents
and evaluation agents, providing a scalable and robust framework for annotation
and evaluation.
Ссылки и действия
Дополнительные ресурсы: