Investigating LLM Variability in Personalized Conversational Information Retrieval
2510.03795v1
cs.IR, cs.CL
2025-10-08
Авторы:
Simon Lupart, Daniël van Dijk, Eric Langezaal, Ian van Dort, Mohammad Aliannejadi
Abstract
Personalized Conversational Information Retrieval (CIR) has seen rapid
progress in recent years, driven by the development of Large Language Models
(LLMs). Personalized CIR aims to enhance document retrieval by leveraging
user-specific information, such as preferences, knowledge, or constraints, to
tailor responses to individual needs. A key resource for this task is the TREC
iKAT 2023 dataset, designed to evaluate personalization in CIR pipelines.
Building on this resource, Mo et al. explored several strategies for
incorporating Personal Textual Knowledge Bases (PTKB) into LLM-based query
reformulation. Their findings suggested that personalization from PTKBs could
be detrimental and that human annotations were often noisy. However, these
conclusions were based on single-run experiments using the GPT-3.5 Turbo model,
raising concerns about output variability and repeatability. In this
reproducibility study, we rigorously reproduce and extend their work, focusing
on LLM output variability and model generalization. We apply the original
methods to the new TREC iKAT 2024 dataset and evaluate a diverse range of
models, including Llama (1B-70B), Qwen-7B, GPT-4o-mini. Our results show that
human-selected PTKBs consistently enhance retrieval performance, while
LLM-based selection methods do not reliably outperform manual choices. We
further compare variance across datasets and observe higher variability on iKAT
than on CAsT, highlighting the challenges of evaluating personalized CIR.
Notably, recall-oriented metrics exhibit lower variance than precision-oriented
ones, a critical insight for first-stage retrievers. Finally, we underscore the
need for multi-run evaluations and variance reporting when assessing LLM-based
CIR systems. By broadening evaluation across models, datasets, and metrics, our
study contributes to more robust and generalizable practices for personalized
CIR.
Ссылки и действия
Дополнительные ресурсы: