Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents
2510.25694v1
cs.SE, cs.AI, cs.CL
2025-10-31
Авторы:
Jiayi Kuang, Yinghui Li, Xin Zhang, Yangning Li, Di Yin, Xing Sun, Ying Shen, Philip S. Yu
Abstract
Large language model-based agents show promise for software engineering, but
environment configuration remains a bottleneck due to heavy manual effort and
scarce large-scale, high-quality datasets. Existing benchmarks assess only
end-to-end build/test success, obscuring where and why agents succeed or fail.
We introduce the Environment Configuration Diagnosis Benchmark, Enconda-bench,
which provides process-level trajectory assessment of fine-grained agent
capabilities during environment setup-planning, perception-driven error
diagnosis, feedback-driven repair, and action to execute final environment
configuration. Our task instances are automatically constructed by injecting
realistic README errors and are validated in Docker for scalable, high-quality
evaluation. Enconda-bench combines process-level analysis with end-to-end
executability to enable capability assessments beyond aggregate success rates.
Evaluations across state-of-the-art LLMs and agent frameworks show that while
agents can localize errors, they struggle to translate feedback into effective
corrections, limiting end-to-end performance. To our knowledge, Enconda-bench
is the first framework to provide process-level internal capability assessment
for environment configuration, offering actionable insights for improving
software engineering agents.
Ссылки и действия
Дополнительные ресурсы: