QuArch: A Benchmark for Evaluating LLM Reasoning in Computer Architecture
2510.22087v1
cs.AR, cs.AI, cs.LG, cs.SE
2025-10-29
Authors:
Shvetank Prakash, Andrew Cheng, Arya Tschand, Mark Mazumder, Varun Gohil, Jeffrey Ma, Jason Yik, Zishen Wan, Jessica Quaye, Elisavet Lydia Alvanaki, Avinash Kumar, Chandrashis Mazumdar, Tuhin Khare, Alexander Ingare, Ikechukwu Uchendu, Radhika Ghosal, Abhishek Tyagi, Chenyu Wang, Andrea Mattia Garavagno, Sarah Gu, Alice Guo, Grace Hur, Luca Carloni, Tushar Krishna, Ankita Nayak, Amir Yazdanbakhsh, Vijay Janapa Reddi
Abstract
The field of computer architecture, which bridges high-level software
abstractions and low-level hardware implementations, remains absent from
current large language model (LLM) evaluations. To fill this gap, we present QuArch
(pronounced 'quark'), the first benchmark designed to facilitate the
development and evaluation of LLM knowledge and reasoning capabilities
specifically in computer architecture. QuArch provides a comprehensive
collection of 2,671 expert-validated question-answer (QA) pairs covering
various aspects of computer architecture, including processor design, memory
systems, and interconnection networks. Our evaluation reveals that while
frontier models possess domain-specific knowledge, they struggle with skills
that require higher-order thinking in computer architecture. Frontier model
accuracies vary widely (from 34% to 72%) on these advanced questions,
highlighting persistent gaps in architectural reasoning across analysis,
design, and implementation QA pairs. By holistically assessing fundamental skills,
QuArch provides a foundation for building and measuring LLM capabilities that
can accelerate innovation in computing systems. With over 140 contributors from
40 institutions, this benchmark represents a community effort to set the
standard for architectural reasoning in LLM evaluation.