E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task
2510.14509v1
cs.SE, cs.AI, cs.CL
2025-10-18
Authors:
Jingyao Liu, Chen Huang, Zhizhao Guan, Wenqiang Lei, Yang Deng
Abstract
E2EDev comprises (i) a fine-grained set of user requirements, (ii) multiple
Behavior-Driven Development (BDD) test scenarios with corresponding Python step
implementations for each requirement, and (iii) a fully automated testing
pipeline built on the Behave
framework. To ensure its quality while reducing the annotation effort, E2EDev
leverages our proposed Human-in-the-Loop Multi-Agent Annotation Framework
(HITL-MAA). By evaluating various end-to-end software development (E2ESD)
frameworks and LLM backbones with E2EDev, we find that they persistently
struggle to solve these tasks, underscoring the critical need for more
effective and cost-efficient E2ESD solutions. Our codebase and benchmark are
publicly available at
https://github.com/SCUNLP/E2EDev.
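The three-part structure described above (a requirement, several BDD scenarios per requirement, and step implementations executed by an automated runner) can be pictured with a minimal, standard-library-only sketch. This is not the E2EDev pipeline and does not use the Behave API: the `step` decorator, the `Requirement` and `Scenario` classes, the counter example, and the rule that a requirement passes only when all of its scenarios pass are hypothetical and introduced purely for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical step registry. Behave ships its own step decorators;
# this stand-in only imitates the pattern with the standard library.
STEPS = []


def step(pattern):
    """Register a step function under a regex pattern (illustrative only)."""
    def decorator(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return decorator


@dataclass
class Scenario:
    name: str
    steps: list  # Gherkin-style lines such as "Given the counter is 0"


@dataclass
class Requirement:
    text: str
    scenarios: list = field(default_factory=list)


@step(r"the counter is (\d+)")
def set_counter(context, value):
    context["counter"] = int(value)


@step(r"the user clicks increment")
def click_increment(context):
    context["counter"] += 1


@step(r"the counter shows (\d+)")
def check_counter(context, expected):
    assert context["counter"] == int(expected)


def run_scenario(scenario):
    """Return True if every step matches a registered function and passes."""
    context = {}
    for line in scenario.steps:
        # Drop the leading Given/When/Then keyword before matching.
        body = line.split(" ", 1)[1]
        for pattern, func in STEPS:
            match = pattern.fullmatch(body)
            if match:
                try:
                    func(context, *match.groups())
                except AssertionError:
                    return False  # a step's check failed
                break
        else:
            return False  # no registered step matches this line
    return True


def requirement_passes(requirement):
    """Assumed rule: a requirement is met only if all its scenarios pass."""
    return all(run_scenario(s) for s in requirement.scenarios)


req = Requirement(
    text="Clicking the increment button increases the counter by one.",
    scenarios=[
        Scenario("increment from zero", [
            "Given the counter is 0",
            "When the user clicks increment",
            "Then the counter shows 1",
        ]),
        Scenario("increment from five", [
            "Given the counter is 5",
            "When the user clicks increment",
            "Then the counter shows 6",
        ]),
    ],
)
```

Calling `requirement_passes(req)` runs both scenarios against the registered steps. In a real Behave project the scenarios would live in `.feature` files and the step functions in a `steps/` directory, with Behave itself doing the matching and reporting.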