Comparing the Performance of LLMs in RAG-based Question-Answering: A Case Study in Computer Science Literature
2511.03261v1
cs.CL, cs.AI, I.2.1; I.2.7
2025-11-07
Авторы:
Ranul Dayarathne, Uvini Ranaweera, Upeksha Ganegoda
Abstract
Retrieval Augmented Generation (RAG) is emerging as a powerful technique to
enhance the capabilities of Generative AI models by reducing hallucination.
Thus, the increasing prominence of RAG alongside Large Language Models (LLMs)
has sparked interest in comparing the performance of different LLMs in
question-answering (QA) in diverse domains. This study compares the performance
of four open-source LLMs, Mistral-7b-instruct, LLaMa2-7b-chat,
Falcon-7b-instruct and Orca-mini-v3-7b, and OpenAI's trending GPT-3.5 over QA
tasks within the computer science literature leveraging RAG support. Evaluation
metrics employed in the study include accuracy and precision for binary
questions and ranking by a human expert, ranking by Google's AI model Gemini,
alongside cosine similarity for long-answer questions. GPT-3.5, when paired
with RAG, effectively answers binary and long-answer questions, reaffirming its
status as an advanced LLM. Regarding open-source LLMs, Mistral AI's
Mistral-7b-instruct paired with RAG surpasses the rest in answering both binary
and long-answer questions. However, among the open-source LLMs, Orca-mini-v3-7b
reports the shortest average latency in generating responses, whereas
LLaMa2-7b-chat by Meta reports the highest average latency. This research
underscores the fact that open-source LLMs, too, can go hand in hand with
proprietary models like GPT-3.5 with better infrastructure.