Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models
2510.14232v1
cs.LG, cs.AI, cs.CL
2025-10-18
Авторы:
Mehrzad Samadi, Aleksander Ficek, Sean Narenthiran, Siddhartha Jain, Wasi Uddin Ahmad, Somshubra Majumdar, Vahid Noroozi, Boris Ginsburg
Abstract
Competitive programming has become a rigorous benchmark for evaluating the
reasoning and problem-solving capabilities of large language models (LLMs). The
International Olympiad in Informatics (IOI) stands out as one of the most
prestigious annual competitions in competitive programming and has become a key
benchmark for comparing human and AI-level programming ability. While several
proprietary models have been claimed to achieve gold medal-level performance at
the IOI, often with undisclosed methods, achieving comparable results with
open-weight models remains a significant challenge. In this paper, we present
\gencluster, a scalable and reproducible test-time compute framework that
attains IOI gold-level performance using open-weight models. It combines
large-scale generation, behavioral clustering, ranking, and a round-robin
submission strategy to efficiently explore diverse solution spaces under
limited validation budgets. Our experiments show that the performance of our
proposed approach scales consistently with available compute, narrowing the gap
between open and closed systems. Notably, we will show that GenCluster can
achieve a gold medal at IOI 2025 for the first time with an open-weight model
gpt-oss-120b, setting a new benchmark for transparent and reproducible
evaluation of reasoning in LLMs.
Ссылки и действия
Дополнительные ресурсы: