LlamaIndex & Chroma: Building a Simple RAG Pipeline

Bhavik Jikadara
5 min read · Aug 9, 2024


In the world of AI and machine learning, Retrieval-Augmented Generation (RAG) has become a powerful technique to enhance the performance of language models. RAG combines the retrieval of relevant context from a large dataset with generative capabilities, enabling AI systems to produce more accurate and contextually aware responses.

In this article, we’ll explore how to build a simple RAG pipeline step by step. But first, let’s break the RAG process down into its two main stages: data indexing, and data retrieval and generation.

What is Retrieval-Augmented Generation (RAG)?

RAG (Retrieval-Augmented Generation) is a technique in natural language processing (NLP) that enhances the capabilities of language models by combining information retrieval with text generation. It allows a model to generate more accurate and contextually relevant responses by retrieving relevant pieces of information from a database or knowledge base before generating text.

Understanding the RAG Pipeline

Simple RAG Pipeline diagram by author.

1. Data Indexing

The first step in building a RAG pipeline is data indexing. This process involves converting text data into a searchable database of vector embeddings, which represent the meaning of the text in a format that computers can easily understand.

  • Document Chunking: The collection of documents is split into smaller chunks of text. This allows for more precise and relevant pieces of information to be fed into the language model when needed, avoiding information overload.
  • Vector Embeddings: The chunks of text are then transformed into vector embeddings. These embeddings encode the meaning of natural language text into numerical representations.
  • Vector Database: Finally, the vector embeddings are stored in a vector database, making them easily searchable (see the short code sketch below).
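
To make this stage concrete, here is a minimal, framework-free sketch of indexing using Chroma alone. The chunk texts and collection name are placeholders, and Chroma's built-in default embedding function stands in for the embedding model we configure later in the article:

import chromadb

# In-memory Chroma client; documents passed to `add` are embedded
# automatically with Chroma's default embedding function.
client = chromadb.EphemeralClient()
collection = client.create_collection("rag-demo")

# Hypothetical chunks produced by splitting a document.
chunks = [
    "Bhavik is a data scientist with strong Python experience.",
    "He has delivered NLP and machine learning projects end to end.",
]
collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])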

2. Data Retrieval and Generation

Once the context data is stored as vector embeddings, the process of data retrieval and generation begins.

  • Query Transformation: The user’s query (or prompt) is also transformed into a vector embedding, similar to how the context data was processed.
  • Context Matching: The query vector is compared against all the vectors in the vector database. The top-k most similar chunks of context data are selected.
  • Response Generation: The selected chunks of context, along with the user’s query, are fed into a large language model (LLM) to generate a relevant and accurate response (see the sketch after this list).
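
Continuing the indexing sketch above, retrieval and generation boil down to embedding the question, pulling the most similar chunks from Chroma, and handing them to the LLM together with the question. The question text and model name are placeholders, and this assumes a locally running Ollama server, the same setup the full implementation below relies on:

from llama_index.llms.ollama import Ollama

# `collection` is the Chroma collection populated in the indexing sketch above.
question = "Does the data scientist know Python?"

# Chroma embeds the question and returns the top-k most similar chunks.
results = collection.query(query_texts=[question], n_results=2)
context = "\n".join(results["documents"][0])

# Feed the retrieved context plus the question to the locally served LLM.
llm = Ollama(model="llama3")
prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
print(llm.complete(prompt))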

How I Built a Simple RAG Pipeline

Now that we’ve covered the theory behind a RAG pipeline, let’s dive into the practical implementation. Below are the steps we’ll follow:

  1. Set up the environment
  2. Import an LLM
  3. Import an embedding model
  4. Prepare the data
  5. Prompt Engineering
  6. Create the query engine

Setting Up the Environment

First, we need to install and import the necessary libraries:

  • Chroma: An AI-native, open-source vector database that will store our vector embeddings and make them searchable.
  • LlamaIndex: A framework for building context-augmented generative AI applications with LLMs. It handles reading the context data, creating vector embeddings, building prompt templates, and prompting the LLM locally.

Here’s how to get started:

# Install the required packages first:
#   pip install chromadb llama-index llama-index-llms-ollama llama-index-embeddings-huggingface llama-index-vector-stores-chroma

import chromadb
from llama_index.core import PromptTemplate, Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.chroma import ChromaVectorStore

Importing the Llama LLM

With the libraries imported, we can now bring in the Llama language model. I opted for Llama because it allows for local execution, which is both free and private. Using the Ollama library makes it simple:

llm = Ollama(model="llama3")
response = llm.complete("Who is Laurie Voss? Write in 10 words")
print(response)
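
Note that this assumes Ollama is installed and running locally, with the llama3 model already pulled (for example, by running ollama pull llama3 in a terminal).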

Importing an Embedding Model

Next, we need an embedding model to transform text into vector embeddings. I chose the “BAAI/bge-small-en-v1.5” model from Hugging Face, which is small and quick to implement — ideal for a proof of concept (POC).

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Register the LLM and embedding model as LlamaIndex's global defaults.
Settings.llm = llm
Settings.embed_model = embed_model
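
As an optional sanity check (not part of the original pipeline), you can embed a sample string and inspect the vector size; bge-small-en-v1.5 produces 384-dimensional embeddings:

# Quick check that the embedding model works: embed a sample sentence.
sample_vector = embed_model.get_text_embedding("Retrieval-Augmented Generation")
print(len(sample_vector))  # 384 for BAAI/bge-small-en-v1.5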

Preparing the Data

To prepare the data, we first read the context file using SimpleDirectoryReader. In this example, we're using a PDF of my one-page resume. We then create a vector database using Chroma and store the vector embeddings.

# Read the resume PDF into LlamaIndex documents.
documents = SimpleDirectoryReader(input_files=["./resume.pdf"]).load_data()

# Create an in-memory Chroma collection to hold the embeddings.
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("ollama")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Chunk the documents, embed the chunks, and store them in the vector database.
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
    transformations=[SentenceSplitter(chunk_size=256, chunk_overlap=10)],
)
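
One design note: EphemeralClient keeps everything in memory, so the index disappears when the script exits. If you want the embeddings to survive across runs, Chroma also offers a persistent client (the path below is just an example):

chroma_client = chromadb.PersistentClient(path="./chroma_db")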

Prompt Engineering

With the RAG pipeline set up, the next step is writing a template query. This template assigns the LLM a task and persona, provides context, and plugs in the user’s question.

template = (
    "Imagine you are a data scientist's assistant and "
    "you answer a recruiter's questions about the data scientist's experience. "
    "Here is some context from the data scientist's "
    "resume related to the query:\n"
    "-----------------------------------------\n"
    "{context_str}\n"
    "-----------------------------------------\n"
    "Considering the above information, "
    "please respond to the following inquiry:\n\n"
    "Question: {query_str}\n\n"
    "Answer succinctly and ensure your response is "
    "clear to someone without a data science background. "
    "The data scientist's name is Bhavik Jikadara."
)
qa_template = PromptTemplate(template)
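
Note that we never fill in {context_str} and {query_str} ourselves: at query time, LlamaIndex substitutes the retrieved chunks into {context_str} and the user’s question into {query_str}.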

Creating the Query Engine

Finally, we create a query engine that puts together all the components of our RAG pipeline.

query_engine = index.as_query_engine(
    text_qa_template=qa_template,
    similarity_top_k=3,
)
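
Here, text_qa_template plugs in our custom prompt, and similarity_top_k=3 tells the retriever to fetch the three most similar chunks from Chroma and place them in the {context_str} slot of the template.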

Running the RAG Pipeline

The exciting part of building an AI application is seeing it work! To run the RAG pipeline, simply prompt the query engine with a question:

response = query_engine.query("Do you have experience with Python?")
print(response.response)

This will generate a response like:

'Yes, I can confirm that Bhavik Jikadara has extensive experience working
with Python as a data scientist. According to his resume, he lists Python
as one of his core skills, indicating a strong proficiency in the
programming language. Additionally, his projects and achievements highlight
his ability to leverage Python for various data science tasks, such as
natural language processing (NLP), machine learning, and data visualization.'
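
Optionally, if you want to verify which resume chunks were actually retrieved, the response object also exposes the source nodes along with their similarity scores (a quick debugging aid, not part of the original pipeline):

# Inspect the retrieved chunks and their similarity scores.
for node_with_score in response.source_nodes:
    print(node_with_score.score, node_with_score.node.get_content()[:80])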

Conclusion

In this article, we explored the basics of building a simple RAG pipeline, from data indexing to query generation. This powerful technique can greatly enhance the accuracy and relevance of AI-generated responses by combining retrieval and generation in one seamless process.

In my upcoming articles, I will delve deeper into more advanced topics related to RAG pipelines; in the meantime, you can explore more articles related to RAG.

If you found this guide helpful, don’t forget to clap, comment, and share your own RAG pipeline creations!
