EngChain: A Symbolic Benchmark for Verifiable Multi-Step Reasoning in Engineering
2511.01650v1
cs.CL, cs.AI, cs.LG
2025-11-06
Авторы:
Ayesha Gull, Muhammad Usman Safder, Rania Elbadry, Preslav Nakov, Zhuohan Xie
Abstract
Large Language Models (LLMs) are increasingly being applied to specialized,
high-stakes domains like engineering, which demands rigorous evaluation of
their complex reasoning capabilities. While current benchmarks assess language
understanding, factual recall, mathematics or code generation, none capture the
integrative reasoning central to engineering where scientific principles,
quantitative modeling and practical constraints must converge. To address this
gap, we introduce EngChain, a benchmark for verifiable multi-step engineering
problem-solving. EngChain contains 90 problems spanning three engineering
branches, organized into 9 domains and 20 distinct areas. The problems are
generated from symbolic templates with a high degree of randomization to ensure
diversity and eliminate the risk of contamination. With this benchmark, we move
beyond final answer accuracy with a two-stage evaluation: we first
quantitatively verify the numerical and semantic validity of each reasoning
step and then introduce LLM-As-A-Judge, an automated system to qualitatively
categorize the identified reasoning errors.
Ссылки и действия
Дополнительные ресурсы: