PaperAsk: A Benchmark for Reliability Evaluation of LLMs in Paper Search and Reading

2510.22242v1 cs.IR, cs.AI, cs.CL 2025-10-29

Authors:

Yutao Wu, Xiao Liu, Yunhao Feng, Jiale Ding, Xingjun Ma

Abstract

Large Language Models (LLMs) increasingly serve as research assistants, yet their reliability in scholarly tasks remains under-evaluated. In this work, we introduce PaperAsk, a benchmark that systematically evaluates LLMs across four key research tasks: citation retrieval, content extraction, paper discovery, and claim verification. We evaluate GPT-4o, GPT-5, and Gemini-2.5-Flash under realistic usage conditions, via web interfaces where search operations are opaque to the user. Through controlled experiments, we find consistent reliability failures: citation retrieval fails in 48-98% of multi-reference queries, section-specific content extraction fails in 72-91% of cases, and topical paper discovery yields F1 scores below 0.32, missing over 60% of relevant literature. Further human analysis attributes these failures to the uncontrolled expansion of retrieved context and the tendency of LLMs to prioritize semantically relevant text over task instructions. Across basic tasks, the LLMs display distinct failure behaviors: ChatGPT often withholds responses rather than risk errors, whereas Gemini produces fluent but fabricated answers. To address these issues, we develop lightweight reliability classifiers trained on PaperAsk data to identify unreliable outputs. PaperAsk provides a reproducible and diagnostic framework for advancing the reliability evaluation of LLM-based scholarly assistance systems.
