Towards Reliable Retrieval in RAG Systems for Large Legal Datasets
2510.06999v1
cs.CL, cs.IR, I.2.7; H.3.3; K.5.0
2025-10-10
Авторы:
Markus Reuter, Tobias Lingenberg, Rūta Liepiņa, Francesca Lagioia, Marco Lippi, Giovanni Sartor, Andrea Passerini, Burcu Sayin
Abstract
Retrieval-Augmented Generation (RAG) is a promising approach to mitigate
hallucinations in Large Language Models (LLMs) for legal applications, but its
reliability is critically dependent on the accuracy of the retrieval step. This
is particularly challenging in the legal domain, where large databases of
structurally similar documents often cause retrieval systems to fail. In this
paper, we address this challenge by first identifying and quantifying a
critical failure mode we term Document-Level Retrieval Mismatch (DRM), where
the retriever selects information from entirely incorrect source documents. To
mitigate DRM, we investigate a simple and computationally efficient technique
which we refer to as Summary-Augmented Chunking (SAC). This method enhances
each text chunk with a document-level synthetic summary, thereby injecting
crucial global context that would otherwise be lost during a standard chunking
process. Our experiments on a diverse set of legal information retrieval tasks
show that SAC greatly reduces DRM and, consequently, also improves text-level
retrieval precision and recall. Interestingly, we find that a generic
summarization strategy outperforms an approach that incorporates legal expert
domain knowledge to target specific legal elements. Our work provides evidence
that this practical, scalable, and easily integrable technique enhances the
reliability of RAG systems when applied to large-scale legal document datasets.