Is the Cure Still Worse Than the Disease? Test Overfitting by LLMs in Automated Program Repair

2511.16858v1 cs.SE, cs.LG 2025-11-25
Authors:

Toufique Ahmed, Jatin Ganhotra, Avraham Shinnar, Martin Hirzel

Abstract

Automated program repair has been shown to be susceptible to generating repaired code that passes the seen tests but fails on a hold-out set of hidden tests. This problem, dubbed test overfitting, was identified and studied before the rise of large language models. We experimentally study how much of a problem test overfitting still is today, using repository-level SWE-bench tasks.
