Diff-XYZ: A Benchmark for Evaluating Diff Understanding

arXiv:2510.12487v1 [cs.SE, cs.LG], 2025-10-16

Authors:

Evgeniy Glukhov, Michele Conti, Egor Bogomolov, Yaroslav Golubev, Alexander Bezzubov

Abstract

Reliable handling of code diffs is central to agents that edit and refactor repositories at scale. We introduce Diff-XYZ, a compact benchmark for code-diff understanding with three supervised tasks: apply (old code $+$ diff $\rightarrow$ new code), anti-apply (new code $-$ diff $\rightarrow$ old code), and diff generation (new code $-$ old code $\rightarrow$ diff). Instances in the benchmark are triples $\langle \textit{old code}, \textit{new code}, \textit{diff} \rangle$ drawn from real commits in CommitPackFT, paired with automatic metrics and a clear evaluation protocol. We use the benchmark to conduct a focused empirical study of the unified diff format and to run a cross-format comparison of different diff representations. Our findings reveal that the best format depends on the use case and the model size. For example, representing diffs in search-replace format works well for larger models in the diff generation scenario, yet is not well suited to diff analysis or to smaller models. The Diff-XYZ benchmark is a reusable foundation for assessing and improving diff handling in LLMs, and it can aid the future development of diff formats and of models that edit code. The dataset is published on HuggingFace Hub: https://huggingface.co/datasets/JetBrains-Research/diff-xyz.
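To make the three tasks concrete, the following is a minimal sketch of what each one computes, written deterministically with Python's standard `difflib` for the unified diff format. It is an illustration only, not the benchmark's evaluation code: in Diff-XYZ the tasks are performed by LLMs, and the function names `generate_diff` and `patch` are hypothetical. The sketch assumes a single-file diff and code represented as a list of lines without line terminators.

```python
import difflib
import re

# Matches a unified-diff hunk header such as "@@ -1,4 +1,5 @@".
# A missing count (e.g. "@@ -3 +3 @@") means a count of 1.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def generate_diff(old_lines, new_lines):
    """Diff generation: (old code, new code) -> unified diff lines."""
    return list(difflib.unified_diff(
        old_lines, new_lines, fromfile="old", tofile="new", lineterm=""))


def patch(source_lines, diff_lines, reverse=False):
    """Apply (reverse=False): old code + diff -> new code.
    Anti-apply (reverse=True): new code - diff -> old code."""
    # In reverse mode the roles of '+' and '-' lines are swapped.
    add, drop = ("-", "+") if reverse else ("+", "-")
    result, pos, in_hunk = [], 0, False
    for line in diff_lines:
        header = HUNK_RE.match(line)
        if header:
            # Pick the range that describes the side we are reading from.
            start, count = header.group(3, 4) if reverse else header.group(1, 2)
            start = int(start)
            count = 1 if count is None else int(count)
            # For an empty range the start already names the preceding line.
            hunk_start = start - 1 if count > 0 else start
            result.extend(source_lines[pos:hunk_start])  # untouched lines
            pos = hunk_start
            in_hunk = True
        elif not in_hunk:
            continue  # skip the '---' / '+++' file header lines
        elif line.startswith(add):
            result.append(line[1:])
        elif line.startswith(drop):
            pos += 1  # this source line is not carried over
        elif line.startswith(" "):
            result.append(source_lines[pos])  # context line
            pos += 1
    result.extend(source_lines[pos:])  # lines after the last hunk
    return result


old_code = ["def add(a, b):", "    return a + b", "", "print(add(1, 2))"]
new_code = ["def add(a, b):", '    """Return the sum."""',
            "    return a + b", "", "print(add(2, 3))"]

diff = generate_diff(old_code, new_code)
recovered_new = patch(old_code, diff)                 # apply
recovered_old = patch(new_code, diff, reverse=True)   # anti-apply
```

Under these assumptions the three values form a consistent triple: applying `diff` to `old_code` reproduces `new_code`, and anti-applying it to `new_code` reproduces `old_code`. The sketch does no validation, so a diff that does not match its source is not detected.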

