Evaluating a RAG Pipeline Using DeepEval
Comprehensive Evaluation with Metrics for Retrieval-Augmented Generation (RAG)
Retrieval-augmented generation (RAG) pipelines combine knowledge retrieval with generation, making them highly effective for tasks like answering questions or generating text based on a specific knowledge base. Evaluating these pipelines is critical to ensuring their performance and accuracy. This is where DeepEval, a powerful evaluation framework, comes into play.
In this guide, we will walk through how to evaluate a RAG pipeline using the DeepEval-Haystack integration, covering metrics such as contextual precision, contextual recall, answer relevancy, and faithfulness.
Key Metrics Covered:
- Contextual Precision
- Contextual Recall
- Answer Relevancy
- Faithfulness
1. Setting Up Your RAG Pipeline
Before evaluating, we first need to set up a RAG pipeline. For a detailed tutorial on building a RAG pipeline, you can refer to this Haystack tutorial. In this example, we use the SQuAD v2 dataset as our knowledge base and test queries.
1.1 Prerequisites
- OpenAI Key: DeepEval uses OpenAI models to compute certain metrics, so an OpenAI API key is required (see the snippet after this list for one way to set it).
- Document Store: We’ll use Haystack’s In-Memory Document Store to store our documents.
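Both the OpenAIGenerator and DeepEval's judge models read the key from the OPENAI_API_KEY environment variable. The snippet below is a minimal sketch for providing it interactively; it also assumes the packages used throughout this guide (haystack-ai, deepeval-haystack, and datasets) are already installed.

import os
from getpass import getpass

# Prompt for the key only if it is not already set in the environment
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

With the key in place, we load the SQuAD v2 validation split and write its unique contexts into the document store: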
from datasets import load_dataset
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
# Initialize document store and load the dataset
document_store = InMemoryDocumentStore()
dataset = load_dataset("rajpurkar/squad_v2", split="validation")
# SQuAD repeats the same context across many questions, so keep only unique passages
documents = list(set(dataset["context"]))
docs = [Document(content=doc) for doc in documents]
document_store.write_documents(docs)
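As a quick sanity check, you can confirm how many unique contexts were written; InMemoryDocumentStore exposes a count_documents() method for this.

# Verify that the knowledge base was populated
print(f"Indexed {document_store.count_documents()} unique contexts")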
1.2 Creating a RAG Pipeline
To create a RAG pipeline, we combine a retriever, a prompt builder, and an OpenAI generator into a pipeline.
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
retriever = InMemoryBM25Retriever(document_store, top_k=3)
template = """
Given the following information, answer the question.
Context:
{% for document in documents %}
{{ document.content }}
{% endfor %}
Question: {{question}}
Answer:
"""
prompt_builder = PromptBuilder(template=template)
generator = OpenAIGenerator(model="gpt-3.5-turbo")
# Build the pipeline
from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
rag_pipeline = Pipeline()
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", generator)
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")
1.3 Testing the Pipeline
question = "In what country is Normandy located?"
response = rag_pipeline.run({
    "retriever": {"query": question},
    "prompt_builder": {"question": question},
    "answer_builder": {"query": question},
})
print(response["answer_builder"]["answers"][0].data)
2. Preparing for Evaluation
To evaluate the RAG pipeline, we need the following:
- Questions: Input queries for the model.
- Generated Responses: The model’s responses.
- Retrieved Contexts: The contexts retrieved by the retriever.
- Ground Truths: Correct answers to evaluate accuracy.
2.1 Helper Function for Getting Contexts and Responses
def get_contexts_and_responses(questions, pipeline):
    contexts = []
    responses = []
    for question in questions:
        response = pipeline.run(
            {
                "retriever": {"query": question},
                "prompt_builder": {"question": question},
                "answer_builder": {"query": question},
            }
        )
        contexts.append([d.content for d in response["answer_builder"]["answers"][0].documents])
        responses.append(response["answer_builder"]["answers"][0].data)
    return contexts, responses
question_map = {
    "Which mountain range influenced the split of the regions?": 0,
    "What is the prize offered for finding a solution to P=NP?": 1,
    "Which Californio is located in the upper part?": 2
}
questions = list(question_map.keys())
contexts, responses = get_contexts_and_responses(questions, rag_pipeline)
2.2 Ground Truths and Reviewing All Fields
Now that we have questions, contexts, and responses, we'll also collect the matching ground-truth answers and then print every field for a quick review.
ground_truths = [""] * len(question_map)
for question, index in question_map.items():
idx = dataset["question"].index(question)
ground_truths[index] = dataset["answers"][idx]["text"][0]
print("Questions:\n")
print("\n".join(questions))
print("Contexts:\n")
for c in contexts:
    print(c[0])
print("Responses:\n")
print("\n".join(responses))
print("Ground truths:\n")
print("\n".join(ground_truths))
3. Evaluating the Pipeline with DeepEval
Now that we have our questions, responses, and contexts, we can use DeepEval to evaluate the pipeline on several metrics.
3.1 Contextual Precision
This measures how well the pipeline’s retriever ranks relevant contexts higher than irrelevant ones.
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric
context_precision_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_PRECISION, metric_params={"model":"gpt-3.5-turbo"})
context_precision_pipeline.add_component("evaluator", evaluator)
evaluation_results = context_precision_pipeline.run(
{"evaluator": {"questions": questions, "contexts": contexts, "ground_truths": ground_truths, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])
3.2 Contextual Recall
Measures how well the retrieved contexts align with the ground truth.
context_recall_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_RECALL, metric_params={"model":"gpt-4"})
context_recall_pipeline.add_component("evaluator", evaluator)
evaluation_results = context_recall_pipeline.run(
{"evaluator": {"questions": questions, "contexts": contexts, "ground_truths": ground_truths, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])
3.3 Answer Relevancy
Measures how relevant the pipeline’s answers are in relation to the questions.
answer_relevancy_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.ANSWER_RELEVANCY, metric_params={"model":"gpt-4"})
answer_relevancy_pipeline.add_component("evaluator", evaluator)
evaluation_results = answer_relevancy_pipeline.run(
{"evaluator": {"questions": questions, "responses": responses, "contexts": contexts}}
)
print(evaluation_results["evaluator"]["results"])
3.4 Faithfulness
Evaluates if the generated answer is factually aligned with the retrieved context.
faithfulness_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.FAITHFULNESS, metric_params={"model":"gpt-4"})
faithfulness_pipeline.add_component("evaluator", evaluator)
evaluation_results = faithfulness_pipeline.run(
{"evaluator": {"questions": questions, "contexts": contexts, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])
Conclusion
Using DeepEval to evaluate a RAG pipeline gives you insight into each component of the pipeline and helps identify where improvements can be made. Whether you are assessing context retrieval, answer relevancy, or the faithfulness of responses, this structured evaluation helps ensure your pipeline delivers accurate and reliable results.