TREAT: A Code LLMs Trustworthiness / Reliability Evaluation and Testing Framework
2510.17163v1
cs.SE, cs.AI
2025-10-22
Авторы:
Shuzheng Gao, Eric John Li, Man Ho Lam, Jingyu Xiao, Yuxuan Wan, Chaozheng Wang, Ng Man Tik, Michael R. Lyu
Abstract
Large foundation models are fundamentally transforming the software
engineering landscape, demonstrating exceptional capabilities across diverse
tasks such as code generation, debugging, and testing. Despite this rapid
progress, a significant gap remains in how to comprehensively evaluate these
models' trustworthiness in real-world software engineering scenarios. Existing
benchmarks suffer from limited task scope and fail to incorporate critical
evaluation aspects such as the robustness and reliability of models. To bridge
this gap, we present an evaluation framework called TREAT (Code LLMs
Trustworthiness / Reliability Evaluation And Testing) that provides a holistic
assessment of model performance in code intelligence tasks. Our evaluation
framework addresses key limitations in existing approaches with four main
improvements: (1) Multi-Task Holistic Evaluation that spans diverse software
engineering activities rather than limited coding tasks; (2) Multi-Language and
Multi-Modality Assessment that extends beyond traditional single-language,
text-only benchmarks to include multi-modality coding tasks; (3) Robustness
Assessment that evaluates model reliability under semantically-preserving code
transformations; and (4) Rigorous Evaluation Methodology that enhances the
trustworthiness of evaluation results through diverse evaluation prompts and
adaptive solution extraction. Based on this evaluation framework, we assess 26
state-of-the-art models and uncover both their strengths and limitations,
yielding several key insights:(1) Current models show substantial performance
variation across programming tasks; (2) Multi-modal language models demonstrate
specific performance limitations in UI code generation and edit;
Ссылки и действия
Дополнительные ресурсы: