Is the Cure Still Worse Than the Disease? Test Overfitting by LLMs in Automated Program Repair

2511.16858v1 cs.SE, cs.LG 2025-11-25

Authors:

Toufique Ahmed, Jatin Ganhotra, Avraham Shinnar, Martin Hirzel

Abstract

Automated program repair has been shown to be susceptible to generating repaired code that passes the seen tests but fails on a hold-out set of hidden tests. This problem, dubbed test overfitting, was identified and studied before the rise of large language models. We experimentally study how much test overfitting is still a problem today, using repository-level SWE-bench tasks.
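
To make the notion of test overfitting concrete, the following is a minimal toy sketch in Python and not the paper's experimental setup. The functions `overfitted_abs`, `correct_abs`, `pass_rate`, and `is_test_overfitted`, as well as the test data, are hypothetical names and values introduced only for illustration: a "repair" that special-cases the seen test input passes every seen test but fails on hidden ones.

```python
from typing import Callable, List, Tuple

# A test case is an (input, expected output) pair. Illustrative assumption only.
TestCase = Tuple[int, int]


def overfitted_abs(x: int) -> int:
    """Hypothetical patch that memorizes the one negative input in the seen tests."""
    if x == -3:
        return 3
    return x  # wrong for every other negative input


def correct_abs(x: int) -> int:
    """Hypothetical patch that generalizes beyond the seen tests."""
    return -x if x < 0 else x


def pass_rate(patch: Callable[[int], int], tests: List[TestCase]) -> float:
    """Fraction of tests for which the patch returns the expected value."""
    passed = sum(1 for arg, expected in tests if patch(arg) == expected)
    return passed / len(tests)


def is_test_overfitted(
    patch: Callable[[int], int],
    seen: List[TestCase],
    hidden: List[TestCase],
) -> bool:
    """A patch overfits if it passes all seen tests but not all hidden tests."""
    return pass_rate(patch, seen) == 1.0 and pass_rate(patch, hidden) < 1.0


# Tests visible during repair versus tests held out for evaluation.
seen_tests: List[TestCase] = [(-3, 3), (5, 5)]
hidden_tests: List[TestCase] = [(-7, 7), (0, 0), (-1, 1)]

if __name__ == "__main__":
    for patch in (overfitted_abs, correct_abs):
        print(
            patch.__name__,
            "seen:", pass_rate(patch, seen_tests),
            "hidden:", pass_rate(patch, hidden_tests),
            "overfitted:", is_test_overfitted(patch, seen_tests, hidden_tests),
        )
```

In this sketch the memorizing patch is flagged as overfitted while the general one is not. Real repository-level tasks replace the toy input/output pairs with full test suites.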


