Building RAG Applications with Website Content

A Comprehensive Guide to Web Scraping, Chunking, and Vector Embeddings for RAG Applications

Bhavik Jikadara · Published in AI Agent Insider · 5 min read · Oct 1, 2024

Introduction

The recent advancements in Large Language Models (LLMs) have unlocked exciting possibilities for sophisticated natural language applications. These models, such as ChatGPT, Llama, and Mistral, are revolutionizing how we interact with AI, from generating human-like text to powering personalized chatbots. However, a major limitation persists: these models are restricted to the knowledge they were trained on and cannot update themselves with new information. This limitation hinders their ability to answer time-sensitive or domain-specific queries.

This is where Retrieval-Augmented Generation (RAG) comes into play. RAG enables us to input real-time contextual information into LLMs, allowing them to offer more pertinent and precise answers. One valuable source of contextual information is website content.

In this guide, we will explain how to extract content from websites and utilize it to improve the responses of LLMs in an RAG application. We will cover everything from the basics of web scraping to chunking strategies and creating vector embeddings for efficient retrieval. Let’s get started!

Web Scraping Fundamentals

To integrate website content into an RAG system, the first step is to extract the content. This process is known as web scraping. While some websites offer APIs for accessing their data, many do not. In such cases, web scraping becomes very valuable.

Several popular Python libraries can assist in extracting web data. For this guide, we will use Beautiful Soup to parse HTML content and requests to make HTTP requests. Advanced tools such as Selenium (for dynamic content) or Scrapy (for larger-scale scraping) can also be used.

Example: Scraping Wikipedia

Let’s start by scraping a Wikipedia page using BeautifulSoup.

import requests
from bs4 import BeautifulSoup

# Send a request to the Wikipedia page for Data Science
response = requests.get(
    url="https://en.wikipedia.org/wiki/Data_science",
)
# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')
# Get textual content inside the main body of the article
content = soup.find(id="bodyContent")
print(content.text)

This code sends a request to Wikipedia, fetches the content from the Data Science page, and extracts the main body text for further processing.
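
In practice, the page body also contains elements you may not want in your retrieval context, such as scripts, style blocks, and footnote markers. Here is a minimal cleanup sketch, assuming the soup object from above; the tags removed are illustrative choices you should adapt per site:

# Drop non-content elements before extracting text (tag list is just an example)
for tag in soup.find_all(["script", "style", "sup"]):
    tag.decompose()

# Re-extract the body text with normalized whitespace
clean_text = soup.find(id="bodyContent").get_text(separator=" ", strip=True)
print(clean_text[:500])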

Chunking: Breaking Down the Content

After successfully scraping some content, the next step is to break it into chunks. Chunking is important for several reasons:

  • Granularity: Breaking the text into smaller pieces makes it easier to retrieve the most relevant information.
  • Improved Semantics: Using a single embedding for an entire document can cause the loss of meaningful information.
  • Efficiency: Smaller text chunks lead to more efficient computation during the embedding process.

Fixed-Size vs. Context-Aware Chunking

The most common chunking methods are fixed-size and context-aware chunking. Fixed-size chunks split text at predefined intervals, while context-aware chunking adjusts the chunk size based on sentence or paragraph boundaries.
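
To make the difference concrete, here is a minimal, purely illustrative sketch in plain Python contrasting a naive fixed-size split with a simple sentence-boundary split on a toy string:

# Toy example: fixed-size splitting can cut sentences mid-way,
# while a boundary-aware split keeps them intact.
text = "Data science combines statistics and computing. It powers modern AI systems."

fixed_chunks = [text[i:i + 40] for i in range(0, len(text), 40)]
context_chunks = [s.strip() + "." for s in text.split(".") if s.strip()]

print(fixed_chunks)    # pieces may end mid-word or mid-sentence
print(context_chunks)  # each piece is a complete sentence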

For this guide, we’ll use the RecursiveCharacterTextSplitter from the LangChain framework to perform chunking, ensuring that splits occur at logical points in the text.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # target chunk size of 512 characters
    length_function=len,  # measure chunk length in characters
)
chunked_text = text_splitter.split_text(content.text)

This code splits the scraped text into chunks of approximately 512 characters, adjusting the splits based on natural breakpoints.
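
A quick sanity check on the result (assuming the chunked_text list from above) confirms how many chunks were produced and what they look like:

# Inspect the number of chunks and preview the first one
print(f"Number of chunks: {len(chunked_text)}")
print(chunked_text[0][:200])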

From Chunks to Vector Embeddings

Once we have the text chunks, the next step is to convert them into vector embeddings. Embeddings are numerical representations of the text that capture its semantic meaning, allowing for efficient similarity comparisons.

Types of Embeddings

There are two primary types of embeddings:

  • Dense Embeddings: Generated by deep learning models like those from OpenAI or Sentence Transformers. They encode semantic similarity well.
  • Sparse Embeddings: Generated by classical methods like TF-IDF or BM25. They are effective for keyword-based similarity (a brief sketch follows this list).
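
For intuition, here is a minimal sketch of what a sparse, keyword-based representation looks like, using scikit-learn's TfidfVectorizer (scikit-learn is an extra dependency not otherwise used in this guide):

from sklearn.feature_extraction.text import TfidfVectorizer

# Fit a TF-IDF vectorizer on the chunks and inspect the resulting sparse matrix
vectorizer = TfidfVectorizer()
sparse_vectors = vectorizer.fit_transform(chunked_text)
print(sparse_vectors.shape)  # (number of chunks, vocabulary size)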

For our RAG application, we’ll use dense embeddings generated by the all-MiniLM-L6-v2 model from Sentence Transformers.

from langchain.embeddings import SentenceTransformerEmbeddings

# Load the model for generating embeddings
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
# Create an embedding for a single chunk (index 3) as an example
chunk_embedding = embeddings.embed_documents([chunked_text[3]])

This code converts one of the chunks into a dense embedding using the all-MiniLM-L6-v2 model. In practice, you would generate embeddings for every chunk.
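
As a rough sketch of how those similarity comparisons work (the query string and NumPy-based cosine similarity below are purely illustrative; the vector database introduced in the next section performs this search for us):

import numpy as np

# Embed every chunk and a sample query
all_embeddings = embeddings.embed_documents(chunked_text)
query_embedding = embeddings.embed_query("What does a data scientist do?")

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank the chunks by similarity to the query and show the best match
scores = [cosine_similarity(query_embedding, emb) for emb in all_embeddings]
best = int(np.argmax(scores))
print(f"Best match is chunk {best} with score {scores[best]:.3f}")
print(chunked_text[best][:200])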

Storing and Retrieving Embeddings with Milvus

Once we’ve generated the embeddings, we need to store them in a vector database for efficient retrieval. Milvus is an open-source vector database that specializes in storing and searching embeddings. It integrates well with LangChain, making it an excellent choice for RAG applications.

Here’s how to store your chunk embeddings in Milvus:

from langchain.vectorstores.milvus import Milvus

# Store the embeddings in Milvus
vector_db = Milvus.from_texts(
    texts=chunked_text,
    embedding=embeddings,
    collection_name="rag_milvus",
)

This code creates a collection in Milvus and stores all the chunk embeddings for future retrieval.
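
Note that this assumes a Milvus server is already reachable; by default the LangChain integration looks for a local instance on port 19530. If your deployment differs, you can pass the connection details explicitly (the host and port below are placeholders):

# Connect to a specific Milvus deployment (values here are placeholders)
vector_db = Milvus.from_texts(
    texts=chunked_text,
    embedding=embeddings,
    collection_name="rag_milvus",
    connection_args={"host": "localhost", "port": "19530"},
)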

Building the RAG Pipeline

With the chunks stored and embeddings ready, it’s time to construct our RAG pipeline. This pipeline will retrieve the most relevant embeddings based on user queries and pass them to the LLM to generate responses.

Step 1: Set Up the Retriever

We first need to set up a retriever that fetches the most relevant embeddings from the vector database based on the user’s query.

retriever = vector_db.as_retriever()
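
The retriever returns the chunks most similar to the query. If you want explicit control over how many chunks are passed to the LLM, you can set search parameters when creating it; the value of k below is an arbitrary choice:

# Retrieve only the top 4 most similar chunks for each query
retriever = vector_db.as_retriever(search_kwargs={"k": 4})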

Step 2: Initialize the LLM

Next, we initialize our language model using OpenAI’s GPT-3.5-turbo:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")
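
This assumes your OpenAI API key is available to the client, which ChatOpenAI typically reads from the OPENAI_API_KEY environment variable (the value below is a placeholder):

import os

# Make the key available before constructing the client (placeholder value)
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"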

Step 3: Define a Custom Prompt

We need to create a prompt template that will guide the LLM to generate appropriate answers based on the retrieved content.

from langchain_core.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""

custom_rag_prompt = PromptTemplate.from_template(template)

Step 4: Build the RAG Chain

Finally, we’ll create the RAG chain that will retrieve the most relevant chunks, pass them to the LLM, and output the generated response.

from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

def format_docs(docs):
    # Join the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | custom_rag_prompt
    | llm
    | StrOutputParser()
)

With the RAG chain set up, you can now send a query to the pipeline and receive an answer based on the website content.

for chunk in rag_chain.stream("What is a Data Scientist?"):
    print(chunk, end="", flush=True)
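
If you prefer to receive the full answer at once rather than streaming it token by token, the same chain can be invoked directly:

# Get the complete answer in a single call
answer = rag_chain.invoke("What is a Data Scientist?")
print(answer)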

Conclusion

In this guide, we covered the process of extracting website content and using it to improve LLM responses in an RAG application. We discussed web scraping, text chunking, generating vector embeddings, and storing those embeddings in a vector database like Milvus.

By using this technique, you can develop more informed and contextually aware AI applications. Whether you’re creating a chatbot or a question-answering system, RAG boosts the relevance and accuracy of the generated responses.

It’s important to remember that the success of your RAG pipeline relies on the quality of your data and how you organize the chunks, embeddings, and retrieval process. Experiment with different models, chunk sizes, and retrieval methods to refine your system.

Happy coding, and thanks for reading!

