Investigating Thematic Patterns and User Preferences in LLM Interactions using BERTopic
2510.07557v1
cs.LG, cs.AI, cs.CY, cs.HC
2025-10-11
Authors:
Abhay Bhandarkar, Gaurav Mishra, Khushi Juchani, Harsh Singhal
Abstract
This study applies BERTopic, a transformer-based topic modeling technique, to
the lmsys-chat-1m dataset, a multilingual conversational corpus built from
head-to-head evaluations of large language models (LLMs). Each user prompt is
paired with two anonymized LLM responses and a human preference label, which is
used to assess how users evaluate competing model outputs. The main objective is
uncovering thematic patterns in these conversations and examining their
relation to user preferences, particularly whether certain LLMs are consistently
preferred within specific topics. A robust preprocessing pipeline was designed
to handle multilingual variation, balance dialogue turns, and clean noisy or
redacted data. BERTopic extracted over 29 coherent topics including artificial
intelligence, programming, ethics, and cloud infrastructure. We analysed
relationships between topics and model preferences to identify trends in
model-topic alignment. Visualization techniques included inter-topic distance
maps, topic probability distributions, and model-versus-topic matrices. Our
findings inform domain-specific fine-tuning and optimization strategies for
improving real-world LLM performance and user satisfaction.
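
The model-versus-topic matrix mentioned above can be illustrated with a minimal sketch. It assumes each head-to-head comparison already carries a topic label (for example, one assigned by a topic model) and computes, per topic, the fraction of a model's appearances that it won. The records, the model names (`model_x`, `model_y`), and the helper `win_rate_matrix` are hypothetical and chosen only for illustration; they are not data or code from the paper, and the authors' actual aggregation may differ.

```python
from collections import defaultdict

# Hypothetical toy records: (topic label, model_a, model_b, preferred model).
# These values are illustrative assumptions, not results from the paper.
battles = [
    ("programming", "model_x", "model_y", "model_x"),
    ("programming", "model_x", "model_y", "model_x"),
    ("programming", "model_y", "model_x", "model_y"),
    ("ethics", "model_x", "model_y", "model_y"),
    ("ethics", "model_y", "model_x", "model_y"),
]


def win_rate_matrix(records):
    """Return {(topic, model): wins / appearances} for every pair seen."""
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for topic, model_a, model_b, winner in records:
        # Both models appear once in each comparison.
        for model in (model_a, model_b):
            appearances[(topic, model)] += 1
        # Only the preferred model is credited with a win.
        wins[(topic, winner)] += 1
    return {key: wins[key] / count for key, count in appearances.items()}


matrix = win_rate_matrix(battles)
for (topic, model), rate in sorted(matrix.items()):
    print(f"{topic:12s} {model:8s} {rate:.2f}")
```

A dictionary keyed by (topic, model) keeps the sketch dependency-free; in practice the same table could be pivoted into a two-dimensional matrix and rendered as a heatmap.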