Verification Limits Code LLM Training
2509.20837v1
cs.SE, cs.AI, cs.CL
2025-09-27
Authors:
Srishti Gureja, Elena Tommasone, Jingyi He, Sara Hooker, Matthias Gallé, Marzieh Fadaee
Summary
#### Context
Modern large language models (LLMs) for code generation increasingly depend on synthetic data, where both the problem solutions and the tests that verify them are generated by models. While this approach enables scalable data creation, it introduces a previously unexplored bottleneck: the **verification ceiling**. The ceiling arises because the quality and diversity of the training data are constrained by the capabilities of the synthetic verifiers, which limits how far models can generalize and improve. This study systematically investigates how verification design and strategies affect model performance, with the aim of understanding and overcoming this limitation.
#### Method
The methodology analyzes the interplay between verification strategies and model training. The researchers examine three questions:
1. **What we verify**: The impact of test complexity and test quantity is analyzed. Richer test suites improve model capabilities, while quantity alone yields diminishing returns.
2. **How we verify**: Relaxed pass thresholds and LLM-based soft verification are explored as alternatives to a rigid 100% pass criterion. These approaches recover valuable training data, leading to performance improvements.
3. **Why verification remains necessary**: Controlled comparisons between formally correct and incorrect solutions, alongside human evaluation, show that retaining diverse correct solutions per problem yields consistent generalization gains.
The study provides a nuanced understanding of the limitations and potential recalibration of verification processes.
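The relaxed pass threshold from item 2 can be sketched as a simple filter over candidate solutions. This is a minimal illustration, not the authors' pipeline: the `(input, expected_output)` test format, the names `pass_rate` and `keep_sample`, and the 0.75 threshold are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical test format: (input, expected_output) pairs.
TestCase = Tuple[int, int]


def pass_rate(solution: Callable[[int], int], tests: List[TestCase]) -> float:
    """Return the fraction of tests the candidate passes.

    A raised exception counts as a failed test.
    """
    if not tests:
        return 0.0
    passed = 0
    for arg, expected in tests:
        try:
            if solution(arg) == expected:
                passed += 1
        except Exception:
            pass  # treat crashes as failures
    return passed / len(tests)


def keep_sample(
    solution: Callable[[int], int],
    tests: List[TestCase],
    threshold: float = 1.0,
) -> bool:
    """Keep the candidate if its pass rate reaches the threshold.

    threshold=1.0 is the rigid "all tests must pass" criterion;
    lower values give a relaxed criterion.
    """
    return pass_rate(solution, tests) >= threshold


# Synthetic test suite for "absolute value". The last test is deliberately
# wrong, standing in for a faulty model-generated test.
tests: List[TestCase] = [(3, 3), (-5, 5), (0, 0), (-2, -2)]

strict = keep_sample(abs, tests, threshold=1.0)    # correct solution is rejected
relaxed = keep_sample(abs, tests, threshold=0.75)  # correct solution is kept
```

Here a correct solution fails the strict criterion only because one synthetic test is wrong, which is the kind of training sample a relaxed threshold can recover. The same relaxation also admits genuinely buggy solutions, so its value depends on how strong and diverse the tests are.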
#### Results
Experiments show that richer test suites improve code generation capabilities by about 3 pass@1 points on average, whereas increasing test quantity alone yields diminishing returns. Relaxed pass thresholds and LLM-based soft verification recover valuable training data that a rigid 100% pass criterion would discard, yielding a 2-4 point improvement in pass@1. This benefit is contingent on the strength and diversity of the test cases. The findings indicate that verification should be recalibrated rather than discarded.
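For readers unfamiliar with the metric, pass@1 figures like those above are commonly computed with the unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated per problem and c is the number that pass. The summary does not state how the authors computed pass@1, so the sketch below is an assumption about the usual convention; the function names are hypothetical.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass all tests, k: budget.
    For k=1 this reduces to c / n.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k draws include a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, reported in percentage points.

    results holds one (n, c) pair per problem.
    """
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Two hypothetical problems with 10 samples each: 3 and 5 passing samples.
score = benchmark_pass_at_k([(10, 3), (10, 5)], k=1)
```

Under this convention, a "+3 pass@1" gain means the benchmark average rises by three percentage points.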
#### Significance
The recalibrated verification process offers significant potential across various domains, including software development, education, and AI-driven code generation. By overcoming the verification ceiling, this approach can unlock stronger and more generalizable LLMs for code. The findings highlight the importance of balancing test diversity and complexity to improve model performance.
#### Conclusions
This work highlights the critical role of verification in LLMs for code generation and identifies key areas for improvement. By combining calibrated verification with diverse and challenging problem-solution pairs, the study outlines a path to break the verification ceiling, paving the way for the next generation of stronger and more versatile code generation models. Future research will focus on further refining verification strategies and exploring their application in real-world scenarios.
Abstract
Large language models for code generation increasingly rely on synthetic
data, where both problem solutions and verification tests are generated by
models. While this enables scalable data creation, it introduces a previously
unexplored bottleneck: the verification ceiling, in which the quality and
diversity of training data are fundamentally constrained by the capabilities of
synthetic verifiers. In this work, we systematically study how verification
design and strategies influence model performance. We investigate (i) what we
verify by analyzing the impact of test complexity and quantity: richer test
suites improve code generation capabilities (on average +3 pass@1), while
quantity alone yields diminishing returns, (ii) how we verify by exploring
relaxed pass thresholds: rigid 100% pass criteria can be overly restrictive. By
allowing for relaxed thresholds or incorporating LLM-based soft verification,
we can recover valuable training data, leading to a 2-4 point improvement in
pass@1 performance. However, this benefit is contingent upon the strength and
diversity of the test cases used, and (iii) why verification remains necessary
through controlled comparisons of formally correct versus incorrect solutions
and human evaluation: retaining diverse correct solutions per problem yields
consistent generalization gains. Our results show that verification as
currently practiced is too rigid, filtering out valuable diversity. But it
cannot be discarded, only recalibrated. By combining calibrated verification
with diverse, challenging problem-solution pairs, we outline a path to break
the verification ceiling and unlock stronger code generation models.