Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation
2510.07926v1
cs.CL, I.2.7
2025-10-11
Авторы:
Adam Dejl, James Barry, Alessandra Pascale, Javier Carnerero Cano
Abstract
Despite demonstrating remarkable performance across a wide range of tasks,
large language models (LLMs) have also been found to frequently produce outputs
that are incomplete or selectively omit key information. In sensitive domains,
such omissions can result in significant harm comparable to that posed by
factual inaccuracies, including hallucinations. In this study, we address the
challenge of evaluating the comprehensiveness of LLM-generated texts, focusing
on the detection of missing information or underrepresented viewpoints. We
investigate three automated evaluation strategies: (1) an NLI-based method that
decomposes texts into atomic statements and uses natural language inference
(NLI) to identify missing links, (2) a Q&A-based approach that extracts
question-answer pairs and compares responses across sources, and (3) an
end-to-end method that directly identifies missing content using LLMs. Our
experiments demonstrate the surprising effectiveness of the simple end-to-end
approach compared to more complex methods, though at the cost of reduced
robustness, interpretability and result granularity. We further assess the
comprehensiveness of responses from several popular open-weight LLMs when
answering user queries based on multiple sources.
Ссылки и действия
Дополнительные ресурсы: