Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation

2510.26130v1 cs.SE, cs.AI, cs.LG 2025-11-01

Authors:

Musfiqur Rahman, SayedHassan Khatoonabadi, Emad Shihab

Abstract

Large language models (LLMs) have advanced code generation at the function level, yet their ability to produce correct class-level implementations in authentic software projects remains poorly understood. This work introduces a novel benchmark derived from open-source repositories, comprising real-world classes divided into seen and unseen partitions to evaluate generalization under practical conditions. The evaluation examines multiple LLMs under varied input specifications, retrieval-augmented configurations, and documentation completeness levels. Results reveal a stark performance disparity: LLMs achieve 84% to 89% correctness on established synthetic benchmarks but only 25% to 34% on real-world class tasks, with negligible differences between familiar and novel codebases. Comprehensive docstrings yield modest gains of 1% to 3% in functional accuracy, though statistical significance is rare. Retrieval-augmented generation proves most effective with partial documentation, improving correctness by 4% to 7% by supplying concrete implementation patterns absent from specifications. Error profiling identifies AttributeError, TypeError, and AssertionError as dominant failure modes (84% of cases), with synthetic tests overemphasizing assertion issues and real-world scenarios highlighting type and attribute mismatches. Retrieval augmentation reduces logical flaws but can introduce dependency conflicts. The benchmark and analysis expose critical limitations in current LLM capabilities for class-level engineering, offering actionable insights for enhancing context modelling, documentation strategies, and retrieval integration in production code assistance tools.
