E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task

2510.14509v1 cs.SE, cs.AI, cs.CL 2025-10-18

Authors:

Jingyao Liu, Chen Huang, Zhizhao Guan, Wenqiang Lei, Yang Deng

Abstract

E2EDev comprises (i) a fine-grained set of user requirements, (ii) multiple BDD test scenarios with corresponding Python step implementations for each requirement, and (iii) a fully automated testing pipeline built on the Behave framework. To ensure its quality while reducing the annotation effort, E2EDev leverages our proposed Human-in-the-Loop Multi-Agent Annotation Framework (HITL-MAA). By evaluating various E2ESD frameworks and LLM backbones with E2EDev, our analysis reveals a persistent struggle to effectively solve these tasks, underscoring the critical need for more effective and cost-efficient E2ESD solutions. Our codebase and benchmark are publicly available at https://github.com/SCUNLP/E2EDev.
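
The three-part structure described in the abstract (requirements, several test scenarios per requirement, an automated runner) can be pictured with a minimal sketch. This is not the benchmark's code: the class names `Scenario` and `Requirement`, the `evaluate` function, the toy `Counter` application, and the rule that a requirement passes only when all of its scenarios pass are assumptions made for illustration. In E2EDev itself the scenarios are executed through Behave; plain Python callables stand in for the step implementations here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Scenario:
    """Hypothetical stand-in for one BDD scenario and its step implementations."""
    name: str
    run: Callable[[], bool]  # returns True when every step of the scenario holds


@dataclass
class Requirement:
    """Hypothetical container: one user requirement with its test scenarios."""
    req_id: str
    description: str
    scenarios: List[Scenario] = field(default_factory=list)


def run_scenario(scenario: Scenario) -> bool:
    """Run one scenario; an exception raised by a step counts as a failure."""
    try:
        return bool(scenario.run())
    except Exception:
        return False


def evaluate(requirements: List[Requirement]) -> Dict[str, bool]:
    """Assumed aggregation rule: a requirement passes only if it has at least
    one scenario and all of its scenarios pass."""
    return {
        req.req_id: bool(req.scenarios)
        and all(run_scenario(s) for s in req.scenarios)
        for req in requirements
    }


class Counter:
    """Toy application under test, with a deliberate bug in reset()."""

    def __init__(self) -> None:
        self.value = 0

    def increment(self) -> None:
        self.value += 1

    def reset(self) -> None:
        self.value = 1  # bug: should set the value to 0


def check_increment() -> bool:
    counter = Counter()
    counter.increment()
    return counter.value == 1


def check_reset() -> bool:
    counter = Counter()
    counter.increment()
    counter.reset()
    return counter.value == 0


requirements = [
    Requirement("R1", "Clicking '+' increases the counter by one",
                [Scenario("increment once", check_increment)]),
    Requirement("R2", "Clicking 'reset' sets the counter to zero",
                [Scenario("reset after increment", check_reset)]),
]

results = evaluate(requirements)
print(results)
```

In this toy run, `R1` passes and `R2` fails because of the planted bug in `reset()`. It illustrates how per-requirement scenarios can localize which requirement a generated application misses.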


Related articles

Process-Centric Analysis of Agentic Software Systems

Progressive Code Integration for Abstractive Bug Report Summarization

SecureReviewer: Enhancing Large Language Models for Secure Code Review through S...

Process-Level Trajectory Evaluation for Environment Configuration in Software En...

Does Model Size Matter? A Comparison of Small and Large Language Models for Requ...
