Thinking Longer, Not Always Smarter: Evaluating LLM Capabilities in Hierarchical Legal Reasoning
2510.08710v1
cs.CL, 68T50, I.2.7; I.2.4
2025-10-14
Authors:
Li Zhang, Matthias Grabmair, Morgan Gray, Kevin Ashley
Abstract
Case-based reasoning is a cornerstone of U.S. legal practice, requiring
professionals to argue about a current case by drawing analogies to precedents
and distinguishing the case from them. While Large Language Models (LLMs) have
shown remarkable capabilities, their proficiency in this complex, nuanced form
of reasoning needs further investigation. We propose a formal framework that
decomposes the process of identifying significant distinctions between cases
into three successive reasoning tasks. Our framework models cases using factual
predicates called factors, organizes them into a legal knowledge hierarchy, and
defines verifiable rules for identifying distinctions, analyzing their
argumentative support, and evaluating their significance. Through comprehensive
evaluation of modern reasoning LLMs, we reveal a paradox: while models achieve
high accuracy on surface-level reasoning (Task 1), performance degrades on
hierarchical reasoning (Task 2: 64.82%-92.09%) and collapses on integrated
analysis (Task 3: 11.46%-33.99%). Most strikingly, we find that models
consistently expend more computational resources on incorrect responses than
correct ones, suggesting that "thinking longer" does not always mean "thinking
smarter." Our work provides a methodology for fine-grained analysis of LLM
reasoning capabilities in complex domains and reveals fundamental limitations
that must be addressed for robust and trustworthy legal AI.