Evaluating a RAG Pipeline Using DeepEval
Comprehensive Evaluation with Metrics for Retrieval-Augmented Generation (RAG)
Retrieval-augmented generation (RAG) pipelines combine knowledge retrieval with generation, making them highly effective for tasks like answering questions or generating text based on a specific knowledge base. Evaluating these pipelines is critical to ensuring their performance and accuracy. This is where DeepEval, a powerful evaluation framework, comes into play.
In this guide, we will walk through how to evaluate a RAG pipeline using the DeepEval-Haystack integration, covering metrics such as contextual precision, contextual recall, answer relevancy, and faithfulness.
Key Metrics Covered:
- Contextual Precision
- Contextual Recall
- Answer Relevancy
- Faithfulness
1. Setting Up Your RAG Pipeline
Before evaluating, we first need to set up a RAG pipeline. For a detailed tutorial on building a RAG pipeline, you can refer to this Haystack tutorial. In this example, we use the SQuAD v2 dataset as our knowledge base and test queries.
1.1 Prerequisites
- OpenAI Key: DeepEval uses OpenAI models to compute certain metrics, so an OpenAI API key is required (see the snippet after this list for one way to set it).
- Document Store: We’ll use Haystack’s In-Memory Document Store to store our documents.
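Both the OpenAIGenerator and DeepEval's judge models read the key from the OPENAI_API_KEY environment variable. The snippet below is a minimal sketch for providing it interactively; it also assumes the packages used throughout this guide (haystack-ai, deepeval-haystack, and datasets) are already installed.

import os
from getpass import getpass

# Prompt for the key only if it is not already set in the environment
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

With the key in place, we load the SQuAD v2 validation split and write its unique contexts into the document store: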
from datasets import load_dataset
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
# Initialize document store and load the dataset
document_store = InMemoryDocumentStore()
dataset = load_dataset("rajpurkar/squad_v2", split="validation")
# SQuAD repeats the same context across many questions, so keep only unique passages
documents = list(set(dataset["context"]))
docs = [Document(content=doc) for doc in documents]
document_store.write_documents(docs)
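As a quick sanity check, you can confirm how many unique contexts were written; InMemoryDocumentStore exposes a count_documents() method for this.

# Verify that the knowledge base was populated
print(f"Indexed {document_store.count_documents()} unique contexts")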
1.2 Creating a RAG Pipeline
To create a RAG pipeline, we combine a retriever, a prompt builder, and an OpenAI generator into a pipeline.
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
retriever = InMemoryBM25Retriever(document_store, top_k=3)
template = """
Given the following information, answer the question.
Context:
{% for document in documents %}
{{ document.content }}
{% endfor %}
Question: {{question}}
Answer:
"""
prompt_builder = PromptBuilder(template=template)
generator = OpenAIGenerator(model="gpt-3.5-turbo")
# Build the pipeline
from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
rag_pipeline = Pipeline()
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", generator)
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")
1.3 Testing the Pipeline
question = "In what country is Normandy located?"
response = rag_pipeline.run({
    "retriever": {"query": question},
    "prompt_builder": {"question": question},
    "answer_builder": {"query": question},
})
print(response["answer_builder"]["answers"][0].data)
2. Preparing for Evaluation
To evaluate the RAG pipeline, we need the following:
- Questions: Input queries for the model.
- Generated Responses: The model’s responses.
- Retrieved Contexts: The contexts retrieved by the retriever.
- Ground Truths: Correct answers to evaluate accuracy.
2.1 Helper Function for Getting Contexts and Responses
def get_contexts_and_responses(questions, pipeline):
    contexts = []
    responses = []
    for question in questions:
        response = pipeline.run(
            {
                "retriever": {"query": question},
                "prompt_builder": {"question": question},
                "answer_builder": {"query": question},
            }
        )
        contexts.append([d.content for d in response["answer_builder"]["answers"][0].documents])
        responses.append(response["answer_builder"]["answers"][0].data)
    return contexts, responses
question_map = {
    "Which mountain range influenced the split of the regions?": 0,
    "What is the prize offered for finding a solution to P=NP?": 1,
    "Which Californio is located in the upper part?": 2
}
questions = list(question_map.keys())
contexts, responses = get_contexts_and_responses(questions, rag_pipeline)
2.2 Ground Truths and Reviewing All Fields
Now that we have questions, contexts, and responses, we'll also collect the matching ground-truth answers and then print every field for a quick review.
ground_truths = [""] * len(question_map)
for question, index in question_map.items():
idx = dataset["question"].index(question)
ground_truths[index] = dataset["answers"][idx]["text"][0]
print("Questions:\n")
print("\n".join(questions))
print("Contexts:\n")
for c in contexts:
    print(c[0])
print("Responses:\n")
print("\n".join(responses))
print("Ground truths:\n")
print("\n".join(ground_truths))
3. Evaluating the Pipeline with DeepEval
Now that we have our questions, responses, and contexts, we can use DeepEval to evaluate the pipeline on several metrics.
3.1 Contextual Precision
This measures how well the pipeline’s retriever ranks relevant contexts higher than irrelevant ones.
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric
context_precision_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_PRECISION, metric_params={"model":"gpt-3.5-turbo"})
context_precision_pipeline.add_component("evaluator", evaluator)
evaluation_results = context_precision_pipeline.run(
{"evaluator": {"questions": questions, "contexts": contexts, "ground_truths": ground_truths, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])
3.2 Contextual Recall
Measures how well the retrieved contexts align with the ground truth.
context_recall_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_RECALL, metric_params={"model":"gpt-4"})
context_recall_pipeline.add_component("evaluator", evaluator)
evaluation_results = context_recall_pipeline.run(
{"evaluator": {"questions": questions, "contexts": contexts, "ground_truths": ground_truths, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])
3.3 Answer Relevancy
Measures how relevant the pipeline’s answers are in relation to the questions.
answer_relevancy_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.ANSWER_RELEVANCY, metric_params={"model":"gpt-4"})
answer_relevancy_pipeline.add_component("evaluator", evaluator)
evaluation_results = answer_relevancy_pipeline.run(
{"evaluator": {"questions": questions, "responses": responses, "contexts": contexts}}
)
print(evaluation_results["evaluator"]["results"])
3.4 Faithfulness
Evaluates if the generated answer is factually aligned with the retrieved context.
faithfulness_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.FAITHFULNESS, metric_params={"model":"gpt-4"})
faithfulness_pipeline.add_component("evaluator", evaluator)
evaluation_results = faithfulness_pipeline.run(
{"evaluator": {"questions": questions, "contexts": contexts, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])
Conclusion
Using DeepEval to evaluate a RAG pipeline gives you insight into each component of the pipeline and helps identify where improvements can be made. Whether you are assessing context retrieval, answer relevancy, or the faithfulness of responses, this structured evaluation helps ensure your pipeline delivers accurate and reliable results.