The AI Consumer Index (ACE)

arXiv:2512.04921v1 | cs.AI, cs.CL, cs.HC | 2025-12-06

Authors:

Julien Benchek, Rohit Shetty, Benjamin Hunsberger, Ajay Arun, Zach Richards, Brendan Foody, Osvald Nitski, Bertie Vidgen

Abstract

We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform high-value consumer tasks. ACE contains a hidden held-out set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open-sourcing 80 cases as a devset under a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with web search turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) (55.2%) and GPT 5.1 (Thinking = High) (55.1%). Models differ across domains, and in Shopping the top model scores under 50%. For some requests (such as giving the correct price or providing working links), models are highly prone to hallucination. Overall, ACE shows a substantial gap between the performance of even the best models and consumers' AI needs.

Related articles

Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Ra...

How Do AI Agents Do Human Work? Comparing AI and Human Workflows Across Diverse ...

Planning Ahead with RSA: Efficient Signalling in Dynamic Environments by Project...

Everyone prefers human writers, including AI

See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by ...
