PluriHop: Exhaustive, Recall-Sensitive QA over Distractor-Rich Corpora
2510.14377v1
cs.CL, cs.IR, cs.LG
2025-10-18
Авторы:
Mykolas Sveistrys, Richard Kunert
Abstract
Recent advances in large language models (LLMs) and retrieval-augmented
generation (RAG) have enabled progress on question answering (QA) when relevant
evidence is in one (single-hop) or multiple (multi-hop) passages. Yet many
realistic questions about recurring report data - medical records, compliance
filings, maintenance logs - require aggregation across all documents, with no
clear stopping point for retrieval and high sensitivity to even one missed
passage. We term these pluri-hop questions and formalize them by three
criteria: recall sensitivity, exhaustiveness, and exactness. To study this
setting, we introduce PluriHopWIND, a diagnostic multilingual dataset of 48
pluri-hop questions built from 191 real-world wind industry reports in German
and English. We show that PluriHopWIND is 8-40% more repetitive than other
common datasets and thus has higher density of distractor documents, better
reflecting practical challenges of recurring report corpora. We test a
traditional RAG pipeline as well as graph-based and multimodal variants, and
find that none of the tested approaches exceed 40% in statement-wise F1 score.
Motivated by this, we propose PluriHopRAG, a RAG architecture that follows a
"check all documents individually, filter cheaply" approach: it (i) decomposes
queries into document-level subquestions and (ii) uses a cross-encoder filter
to discard irrelevant documents before costly LLM reasoning. We find that
PluriHopRAG achieves relative F1 score improvements of 18-52% depending on base
LLM. Despite its modest size, PluriHopWIND exposes the limitations of current
QA systems on repetitive, distractor-rich corpora. PluriHopRAG's performance
highlights the value of exhaustive retrieval and early filtering as a powerful
alternative to top-k methods.
Ссылки и действия
Дополнительные ресурсы: