Investigating The Smells of LLM Generated Code
2510.03029v1
cs.SE, cs.AI
2025-10-07
Авторы:
Debalina Ghosh Paul, Hong Zhu, Ian Bayley
Abstract
Context: Large Language Models (LLMs) are increasingly being used to generate
program code. Much research has been reported on the functional correctness of
generated code, but there is far less on code quality.
Objectives: In this study, we propose a scenario-based method of evaluating
the quality of LLM-generated code to identify the weakest scenarios in which
the quality of LLM generated code should be improved.
Methods: The method measures code smells, an important indicator of code
quality, and compares them with a baseline formed from reference solutions of
professionally written code. The test dataset is divided into various subsets
according to the topics of the code and complexity of the coding tasks to
represent different scenarios of using LLMs for code generation. We will also
present an automated test system for this purpose and report experiments with
the Java programs generated in response to prompts given to four
state-of-the-art LLMs: Gemini Pro, ChatGPT, Codex, and Falcon.
Results: We find that LLM-generated code has a higher incidence of code
smells compared to reference solutions. Falcon performed the least badly, with
a smell increase of 42.28%, followed by Gemini Pro (62.07%), ChatGPT (65.05%)
and finally Codex (84.97%). The average smell increase across all LLMs was
63.34%, comprising 73.35% for implementation smells and 21.42% for design
smells. We also found that the increase in code smells is greater for more
complex coding tasks and for more advanced topics, such as those involving
object-orientated concepts.
Conclusion: In terms of code smells, LLM's performances on various coding
task complexities and topics are highly correlated to the quality of human
written code in the corresponding scenarios. However, the quality of LLM
generated code is noticeably poorer than human written code.
Ссылки и действия
Дополнительные ресурсы: