PaperAsk: A Benchmark for Reliability Evaluation of LLMs in Paper Search and Reading
2510.22242v1
cs.IR, cs.AI, cs.CL
2025-10-29
Авторы:
Yutao Wu, Xiao Liu, Yunhao Feng, Jiale Ding, Xingjun Ma
Abstract
Large Language Models (LLMs) increasingly serve as research assistants, yet
their reliability in scholarly tasks remains under-evaluated. In this work, we
introduce PaperAsk, a benchmark that systematically evaluates LLMs across four
key research tasks: citation retrieval, content extraction, paper discovery,
and claim verification. We evaluate GPT-4o, GPT-5, and Gemini-2.5-Flash under
realistic usage conditions-via web interfaces where search operations are
opaque to the user. Through controlled experiments, we find consistent
reliability failures: citation retrieval fails in 48-98% of multi-reference
queries, section-specific content extraction fails in 72-91% of cases, and
topical paper discovery yields F1 scores below 0.32, missing over 60% of
relevant literature. Further human analysis attributes these failures to the
uncontrolled expansion of retrieved context and the tendency of LLMs to
prioritize semantically relevant text over task instructions. Across basic
tasks, the LLMs display distinct failure behaviors: ChatGPT often withholds
responses rather than risk errors, whereas Gemini produces fluent but
fabricated answers. To address these issues, we develop lightweight reliability
classifiers trained on PaperAsk data to identify unreliable outputs. PaperAsk
provides a reproducible and diagnostic framework for advancing the reliability
evaluation of LLM-based scholarly assistance systems.
Ссылки и действия
Дополнительные ресурсы: