Reflections on the Reproducibility of Commercial LLM Performance in Empirical Software Engineering Studies
2510.25506v1
cs.SE, cs.AI
2025-10-31
Authors:
Florian Angermeir, Maximilian Amougou, Mark Kreitz, Andreas Bauer, Matthias Linhuber, Davide Fucci, Fabiola Moyón C., Daniel Mendez, Tony Gorschek
Abstract
Large Language Models have gained remarkable interest in industry and
academia. The increasing interest in LLMs in academia is also reflected in the
number of publications on this topic in recent years. For instance, 78 of
the roughly 425 publications at ICSE 2024 alone performed experiments with LLMs.
Conducting empirical studies with LLMs remains challenging and raises questions
on how to achieve reproducible results, for both other researchers and
practitioners. One important step towards excelling in empirical research on
LLMs and their application is to first understand to what extent current
research results are actually reproducible and what factors may impede
reproducibility. This investigation is within the scope of our work. We
contribute an analysis of the reproducibility of LLM-centric studies, provide
insights into the factors impeding reproducibility, and discuss suggestions on
how to improve the current state. In particular, we studied the 86 articles
describing LLM-centric studies, published at ICSE 2024 and ASE 2024. Of the 86
articles, 18 provided research artefacts and used OpenAI models. We attempted
to replicate those 18 studies. Of the 18 studies, only five were fit for
reproduction. We were unable to fully reproduce the results of any of these
five studies. Two studies seemed to be partially reproducible, and three studies did
not seem to be reproducible. Our results highlight not only the need for
stricter research artefact evaluations but also for more robust study designs
to ensure the reproducible value of future publications.