Verification Limits Code LLM Training

2509.20837v1 cs.SE, cs.AI, cs.CL 2025-09-27
Authors:

Srishti Gureja, Elena Tommasone, Jingyi He, Sara Hooker, Matthias Gallé, Marzieh Fadaee

Summary

#### Context

Modern large language models (LLMs) for code generation increasingly depend on synthetic data, where both the solutions to problems and the tests that verify them are generated by models. While this approach enables scalable data creation, it introduces a previously unexplored bottleneck: the **verification ceiling**. This ceiling arises when the quality and diversity of training data are constrained by the capabilities of the synthetic verifiers, which limits how far models can improve and generalize. The study systematically investigates how verification design and strategies affect model performance, with the aim of understanding and overcoming this limitation.

#### Method

The methodology analyzes the interplay between verification strategies and model training. The researchers address three questions:

1. **What we verify**: the effect of test complexity and test quantity. Richer test suites enhance model capabilities, while adding more tests alone yields diminishing returns.
2. **How we verify**: relaxed pass thresholds and LLM-based soft verification, as alternatives to a rigid 100% pass criterion. These approaches recover valuable training data that strict filtering would discard.
3. **Why verification remains necessary**: controlled comparisons between formally correct and incorrect solutions, alongside human evaluation, show that retaining diverse correct solutions per problem yields consistent generalization gains.

#### Results

Experiments show that richer test suites improve code generation, by about 3 pass@1 points on average, whereas increasing test quantity alone leads to diminishing returns. Relaxed pass thresholds and LLM-based soft verification recover valuable training data and yield a 2-4 point improvement in pass@1. This benefit depends on the strength and diversity of the test cases. The findings underscore the need to recalibrate verification rather than discard it.

#### Significance

Recalibrated verification is relevant wherever code LLMs are used, including software development, education, and AI-driven code generation. Overcoming the verification ceiling can unlock stronger and more generalizable LLMs for code. The findings highlight the importance of balancing test diversity and complexity to improve model performance.

#### Conclusions

This work highlights the critical role of verification in LLMs for code generation and identifies key areas for improvement. By combining calibrated verification with diverse and challenging problem-solution pairs, the study outlines a path to break the verification ceiling and obtain stronger, more versatile code generation models. Future research will focus on further refining verification strategies and exploring their application in real-world scenarios.
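The relaxed pass thresholds described in the summary above can be illustrated with a small filtering sketch. This is not the authors' pipeline: the `Candidate` class, the `pass_rate` and `filter_by_threshold` helpers, and the toy absolute-value problem are all hypothetical names chosen for illustration, and a plain Python callable stands in for generated code.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Candidate:
    """Hypothetical container: one generated solution plus its synthetic tests."""
    name: str
    solution: Callable[[int], int]       # stand-in for model-generated code
    tests: Sequence[Tuple[int, int]]     # (input, expected output) pairs


def pass_rate(candidate: Candidate) -> float:
    """Fraction of tests passed; an exception counts as a failed test."""
    if not candidate.tests:
        return 0.0
    passed = 0
    for arg, expected in candidate.tests:
        try:
            if candidate.solution(arg) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(candidate.tests)


def filter_by_threshold(candidates: List[Candidate], threshold: float = 1.0) -> List[Candidate]:
    """Keep candidates whose pass rate is at least `threshold`.

    threshold=1.0 is the rigid "all tests must pass" criterion;
    lower values are the relaxed variant (assumed form, for illustration).
    """
    return [c for c in candidates if pass_rate(c) >= threshold]


# Toy problem (assumption): compute the absolute value of an integer.
ABS_TESTS = [(-2, 2), (0, 0), (3, 3), (-5, 5)]

CANDIDATES = [
    Candidate("exact", abs, ABS_TESTS),                                      # passes 4/4
    Candidate("near_miss", lambda x: 0 if x == -5 else abs(x), ABS_TESTS),   # passes 3/4
    Candidate("identity", lambda x: x, ABS_TESTS),                           # passes 2/4
]

if __name__ == "__main__":
    for t in (1.0, 0.75, 0.5):
        kept = [c.name for c in filter_by_threshold(CANDIDATES, t)]
        print(f"threshold={t}: kept {kept}")
```

Lowering the threshold from 1.0 to 0.75 admits the `near_miss` candidate in addition to `exact`. Whether such a sample is valuable training data or noise depends on how strong and diverse the tests are, which is the caveat the summary attaches to the reported gains.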

Abstract

Large language models for code generation increasingly rely on synthetic data, where both problem solutions and verification tests are generated by models. While this enables scalable data creation, it introduces a previously unexplored bottleneck: the verification ceiling, in which the quality and diversity of training data are fundamentally constrained by the capabilities of synthetic verifiers. In this work, we systematically study how verification design and strategies influence model performance. We investigate (i) what we verify, by analyzing the impact of test complexity and quantity: richer test suites improve code generation capabilities (on average +3 pass@1), while quantity alone yields diminishing returns; (ii) how we verify, by exploring relaxed pass thresholds: rigid 100% pass criteria can be overly restrictive, and allowing relaxed thresholds or incorporating LLM-based soft verification recovers valuable training data, leading to a 2-4 point improvement in pass@1 performance, although this benefit is contingent upon the strength and diversity of the test cases used; and (iii) why verification remains necessary, through controlled comparisons of formally correct versus incorrect solutions and human evaluation: retaining diverse correct solutions per problem yields consistent generalization gains. Our results show that verification as currently practiced is too rigid, filtering out valuable diversity, but it cannot be discarded, only recalibrated. By combining calibrated verification with diverse, challenging problem-solution pairs, we outline a path to break the verification ceiling and unlock stronger code generation models.
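The abstract reports its gains in pass@1 without describing the evaluation protocol here. As a reading aid, the sketch below shows the unbiased pass@k estimator widely used for code benchmarks (introduced with HumanEval by Chen et al., 2021); that the paper computes pass@1 this way is an assumption, and the function names are illustrative.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: budget of samples considered

    Computes 1 - C(n - c, k) / C(n, k), the probability that at least
    one of k samples drawn without replacement is correct.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the result is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems given (n, c) pairs, one per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("no problems given")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical per-problem outcomes: (samples drawn, samples correct).
    outcomes = [(10, 3), (10, 0), (10, 10)]
    print(f"pass@1 = {mean_pass_at_k(outcomes, k=1):.3f}")
```

For k = 1 the estimator reduces to the fraction of correct samples, c / n, so a "+3 pass@1" change is read here as three percentage points of that benchmark-averaged fraction.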
