The AI Productivity Index (APEX)
2509.25721v2
econ.GN, cs.AI, cs.CL, cs.HC, q-fin.EC
2025-10-03
Authors:
Bertie Vidgen, Abby Fennelly, Evan Pinnix, Chirag Mahapatra, Zach Richards, Austin Bridges, Calix Huang, Ben Hunsberger, Fez Zafar, Brendan Foody, Dominic Barton, Cass R. Sunstein, Eric Topol, Osvald Nitski
Abstract
We introduce the first version of the AI Productivity Index (APEX), a
benchmark for assessing whether frontier AI models can perform knowledge work
with high economic value. APEX addresses one of the largest inefficiencies in
AI research: outside of coding, benchmarks often fail to test economically
relevant capabilities. APEX-v1.0 contains 200 test cases and covers four
domains: investment banking, management consulting, law, and primary medical
care. It was built in three steps. First, we sourced experts with top-tier
experience, e.g., investment bankers from Goldman Sachs. Second, experts created
prompts that reflect high-value tasks in their day-to-day work. Third, experts
created rubrics for evaluating model responses. We evaluate 23 frontier models
on APEX-v1.0 using an LM judge. GPT 5 (Thinking = High) achieves the highest
mean score (64.2%), followed by Grok 4 (61.3%) and Gemini 2.5 Flash (Thinking =
On) (60.4%). Qwen 3 235B is the best performing open-source model and seventh
best overall. There is a large gap between the performance of even the best
models and human experts, highlighting the need for better measurement of
models' ability to produce economically valuable work.
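
The abstract does not state how rubric judgments are turned into a mean score, so the following is only a minimal sketch under stated assumptions: each rubric is treated as a list of pass/fail criteria, a response's score is the fraction of criteria the LM judge marks as met, and a model's mean score is the average over test cases. The function names and the example verdicts are hypothetical and are not taken from the paper.

```python
from statistics import mean


def case_score(criteria_met: list[bool]) -> float:
    """Fraction of rubric criteria judged as met for one response.

    Assumption: every criterion is pass/fail and equally weighted.
    """
    if not criteria_met:
        raise ValueError("rubric must contain at least one criterion")
    return sum(criteria_met) / len(criteria_met)


def model_mean_score(judgments: list[list[bool]]) -> float:
    """Mean of per-case scores across all test cases, as a percentage."""
    return 100.0 * mean(case_score(case) for case in judgments)


# Made-up judge verdicts for three test cases (illustrative only).
example = [
    [True, True, False, True],  # 3 of 4 criteria met
    [True, False],              # 1 of 2 criteria met
    [True, True, True],         # 3 of 3 criteria met
]

print(f"{model_mean_score(example):.1f}%")
```

Under these assumptions, averaging per-case fractions gives every test case equal weight regardless of how many criteria its rubric contains. Pooling all criteria across cases would instead weight longer rubrics more heavily.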