Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation
2510.26130v1
cs.SE, cs.AI, cs.LG
2025-11-01
Авторы:
Musfiqur Rahman, SayedHassan Khatoonabadi, Emad Shihab
Abstract
Large language models (LLMs) have advanced code generation at the function
level, yet their ability to produce correct class-level implementations in
authentic software projects remains poorly understood. This work introduces a
novel benchmark derived from open-source repositories, comprising real-world
classes divided into seen and unseen partitions to evaluate generalization
under practical conditions. The evaluation examines multiple LLMs under varied
input specifications, retrieval-augmented configurations, and documentation
completeness levels.
Results reveal a stark performance disparity: LLMs achieve 84% to 89%
correctness on established synthetic benchmarks but only 25% to 34% on
real-world class tasks, with negligible differences between familiar and novel
codebases. Comprehensive docstrings yield modest gains of 1% to 3% in
functional accuracy, though statistical significance is rare.
Retrieval-augmented generation proves most effective with partial
documentation, improving correctness by 4% to 7% by supplying concrete
implementation patterns absent from specifications. Error profiling identifies
AttributeError, TypeError, and AssertionError as dominant failure modes (84% of
cases), with synthetic tests overemphasizing assertion issues and real-world
scenarios highlighting type and attribute mismatches. Retrieval augmentation
reduces logical flaws but can introduce dependency conflicts.
The benchmark and analysis expose critical limitations in current LLM
capabilities for class-level engineering, offering actionable insights for
enhancing context modelling, documentation strategies, and retrieval
integration in production code assistance tools.
Ссылки и действия
Дополнительные ресурсы: