Diff-XYZ: A Benchmark for Evaluating Diff Understanding
2510.12487v1
cs.SE, cs.LG
2025-10-16
Авторы:
Evgeniy Glukhov, Michele Conti, Egor Bogomolov, Yaroslav Golubev, Alexander Bezzubov
Abstract
Reliable handling of code diffs is central to agents that edit and refactor
repositories at scale. We introduce Diff-XYZ, a compact benchmark for code-diff
understanding with three supervised tasks: apply (old code $+$ diff
$\rightarrow$ new code), anti-apply (new code $-$ diff $\rightarrow$ old code),
and diff generation (new code $-$ old code $\rightarrow$ diff). Instances in
the benchmark are triples $\langle \textit{old code}, \textit{new code},
\textit{diff} \rangle$ drawn from real commits in CommitPackFT, paired with
automatic metrics and a clear evaluation protocol. We use the benchmark to do a
focused empirical study of the unified diff format and run a cross-format
comparison of different diff representations. Our findings reveal that
different formats should be used depending on the use case and model size. For
example, representing diffs in search-replace format is good for larger models
in the diff generation scenario, yet not suited well for diff analysis and
smaller models. The Diff-XYZ benchmark is a reusable foundation for assessing
and improving diff handling in LLMs that can aid future development of diff
formats and models editing code. The dataset is published on HuggingFace Hub:
https://huggingface.co/datasets/JetBrains-Research/diff-xyz.
Ссылки и действия
Дополнительные ресурсы: