What Are the Facts? Automated Extraction of Court-Established Facts from Criminal-Court Opinions
2511.05320v1
cs.CL, cs.AI
2025-11-11
Авторы:
Klára Bendová, Tomáš Knap, Jan Černý, Vojtěch Pour, Jaromir Savelka, Ivana Kvapilíková, Jakub Drápal
Abstract
Criminal justice administrative data contain only a limited amount of
information about the committed offense. However, there is an unused source of
extensive information in continental European courts' decisions: descriptions
of criminal behaviors in verdicts by which offenders are found guilty. In this
paper, we study the feasibility of extracting these descriptions from publicly
available court decisions from Slovakia. We use two different approaches for
retrieval: regular expressions and large language models (LLMs). Our baseline
was a simple method employing regular expressions to identify typical words
occurring before and after the description. The advanced regular expression
approach further focused on "sparing" and its normalization (insertion of
spaces between individual letters), typical for delineating the description.
The LLM approach involved prompting the Gemini Flash 2.0 model to extract the
descriptions using predefined instructions. Although the baseline identified
descriptions in only 40.5% of verdicts, both methods significantly outperformed
it, achieving 97% with advanced regular expressions and 98.75% with LLMs, and
99.5% when combined. Evaluation by law students showed that both advanced
methods matched human annotations in about 90% of cases, compared to just 34.5%
for the baseline. LLMs fully matched human-labeled descriptions in 91.75% of
instances, and a combination of advanced regular expressions with LLMs reached
92%.
Ссылки и действия
Дополнительные ресурсы: