SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models
2511.05459v2
cs.SE, cs.AI
2025-11-11
Авторы:
Jingxuan Xu, Ken Deng, Weihao Li, Songwei Yu, Huaixi Tang, Haoyang Huang, Zhiyi Lai, Zizheng Zhan, Yanan Wu, Chenchen Zhang, Kepeng Lei, Yifan Yao, Xinping Lei, Wenqiang Zhu, Zongxian Feng, Han Li, Junqi Xiong, Dailin Li, Zuchen Gao, Kun Wu, Wen Xiang, Ziqi Zhan, Yuanxing Zhang, Wuxuan Gong, Ziyuan Gao, Guanxiang Wang, Yirong Xue, Xiaojiang Zhang, Jinghui Wang, Huiming Wang, Wenhao Zhuang, Zhaoxiang Zhang, Yuqun Zhang, Haotian Zhang, Bin Chen, Jiaheng Liu
Abstract
Evaluating large language models (LLMs) for software engineering has been
limited by narrow task coverage, language bias, and insufficient alignment with
real-world developer workflows. Existing benchmarks often focus on algorithmic
problems or Python-centric bug fixing, leaving critical dimensions of software
engineering underexplored. To address these gaps, we introduce SWE-Compass1, a
comprehensive benchmark that unifies heterogeneous code-related evaluations
into a structured and production-aligned framework. SWE-Compass spans 8 task
types, 8 programming scenarios, and 10 programming languages, with 2000
high-quality instances curated from authentic GitHub pull requests and refined
through systematic filtering and validation. We benchmark ten state-of-the-art
LLMs under two agentic frameworks, SWE-Agent and Claude Code, revealing a clear
hierarchy of difficulty across task types, languages, and scenarios. Moreover,
by aligning evaluation with real-world developer practices, SWE-Compass
provides a rigorous and reproducible foundation for diagnosing and advancing
agentic coding capabilities in large language models.
Ссылки и действия
Дополнительные ресурсы: