RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG
2511.04502v1
cs.CL, cs.AI
2025-11-08
Authors:
Joshua Gao, Quoc Huy Pham, Subin Varghese, Silwal Saurav, Vedhus Hoskere
Abstract
Retrieval-Augmented Generation (RAG) is a critical technique for grounding
Large Language Models (LLMs) in factual evidence, yet evaluating RAG systems in
specialized, safety-critical domains remains a significant challenge. Existing
evaluation frameworks often rely on heuristic-based metrics that fail to
capture domain-specific nuances, while other works use LLM-as-a-Judge
approaches that lack validated alignment with human judgment. This paper
introduces RAGalyst, an automated, human-aligned agentic framework designed for
the rigorous evaluation of domain-specific RAG systems. RAGalyst features an
agentic pipeline that generates high-quality, synthetic question-answering (QA)
datasets from source documents, incorporating an agentic filtering step to
ensure data fidelity. The framework refines two key LLM-as-a-Judge
metrics, Answer Correctness and Answerability, using prompt optimization to
achieve a strong correlation with human annotations. Applying this framework to
evaluate various RAG components across three distinct domains (military
operations, cybersecurity, and bridge engineering), we find that performance is
highly context-dependent. No single embedding model, LLM, or hyperparameter
configuration proves universally optimal. Additionally, we analyze the most
common reasons for low Answer Correctness in RAG systems. These findings
highlight the necessity of a systematic evaluation framework like RAGalyst,
which empowers practitioners to uncover domain-specific trade-offs and make
informed design choices for building reliable and effective RAG systems.
RAGalyst is available on our GitHub.