LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?
2510.09595v1
cs.AI, cs.CL, cs.LG
2025-10-14
Authors:
Kaijian Zou, Aaron Xiong, Yunxiang Zhang, Frederick Zhang, Yueqi Ren, Jirong Yang, Ayoung Lee, Shitanshu Bhushan, Lu Wang
Abstract
Competitive programming problems increasingly serve as valuable benchmarks to
evaluate the coding capabilities of large language models (LLMs) due to their
complexity and ease of verification. Yet, current coding benchmarks face
limitations such as a lack of exceptionally challenging problems, insufficient
test case coverage, and reliance on online platform APIs that limit accessibility.
To address these issues, we introduce LiveOIBench, a comprehensive benchmark
featuring 403 expert-curated Olympiad-level competitive programming problems,
each with an average of 60 expert-designed test cases. The problems are sourced
directly from 72 official Informatics Olympiads in different regions conducted
between 2023 and 2025. LiveOIBench distinguishes itself through four key
features: (1) meticulously curated high-quality tasks with detailed subtask
rubrics and extensive private test cases; (2) direct integration of elite
contestant performance data to enable informative comparison against
top-performing humans; (3) planned continuous, contamination-free updates from
newly released Olympiad problems; and (4) a self-contained evaluation system
facilitating offline and easy-to-reproduce assessments. Benchmarking 32 popular
general-purpose and reasoning LLMs, we find that GPT-5 achieves a notable
81.76th percentile, a strong result that nonetheless falls short of top human
contestants, who usually place above the 90th percentile. In contrast, among
open-weight reasoning models, GPT-OSS-120B reaches only the 60th percentile,
underscoring a significant capability gap relative to frontier closed models.
Detailed analyses indicate that robust reasoning models prioritize precise
problem analysis over excessive exploration, suggesting future models should
emphasize structured analysis and minimize unnecessary exploration. All data,
code, and leaderboard results will be made publicly available on our website.