
Building RAG Applications with Website Content
A Comprehensive Guide to Web Scraping, Chunking, and Vector Embeddings for RAG Applications
Introduction
The recent advancements in Large Language Models (LLMs) have unlocked exciting possibilities for sophisticated natural language applications. These models, such as ChatGPT, Llama, and Mistral, are revolutionizing how we interact with AI, from generating human-like text to powering personalized chatbots. However, a major limitation persists: they are restricted to the knowledge they were trained on and cannot update themselves with new information. This limitation hinders their ability to respond to time-sensitive or domain-specific queries.
This is where Retrieval-Augmented Generation (RAG) comes into play. RAG enables us to input real-time contextual information into LLMs, allowing them to offer more pertinent and precise answers. One valuable source of contextual information is website content.
In this guide, we will explain how to extract content from websites and utilize it to improve the responses of LLMs in an RAG application. We will cover everything from the basics of web scraping to chunking strategies and creating vector embeddings for efficient retrieval. Let’s get started!
Web Scraping Fundamentals
To integrate website content into an RAG system, the first step is to extract the content. This process is known as web scraping. While some websites offer APIs for accessing their data, many do not. In such cases, web scraping becomes very valuable.
Several popular Python libraries can assist in extracting web data. In this case, we will use Beautiful Soup for parsing HTML content and requests for making HTTP requests. Advanced tools such as Selenium (for dynamic content) or Scrapy (for larger-scale scraping) can also be utilized.
Example: Scraping Wikipedia
Let’s start by scraping a Wikipedia page using BeautifulSoup.
import requests
from bs4 import BeautifulSoup
# Send a request to the Wikipedia page for Data Science
response = requests.get(
    url="https://en.wikipedia.org/wiki/Data_science",
)
# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')
# Get textual content inside the main body of the article
content = soup.find(id="bodyContent")
print(content.text)
This code sends a request to Wikipedia, fetches the content from the Data Science page, and extracts the main body text for further processing.
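In practice, the raw text from bodyContent still contains noise such as citation markers and table text. Below is a minimal cleanup sketch using Beautiful Soup's own methods; the tags removed here are an assumption based on Wikipedia's markup and may need adjusting for other sites.
# Remove elements that add noise to the extracted text
for unwanted in content.find_all(["script", "style", "table", "sup"]):
    unwanted.decompose()  # drop scripts, styles, tables, and citation superscripts
# Collapse whitespace into a single clean string
clean_text = " ".join(content.get_text(separator=" ").split())
print(clean_text[:500])  # preview the first 500 characters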
Chunking: Breaking Down the Content
After successfully scraping some content, the next step is to break it into chunks. Chunking is important for several reasons:
- Granularity: Breaking the text into smaller pieces makes it easier to retrieve the most relevant information.
- Improved Semantics: Using a single embedding for an entire document can cause the loss of meaningful information.
- Efficiency: Smaller text chunks lead to more efficient computation during the embedding process.
Fixed-Size vs. Context-Aware Chunking
The most common chunking methods are fixed-size and context-aware chunking. Fixed-size chunks split text at predefined intervals, while context-aware chunking adjusts the chunk size based on sentence or paragraph boundaries.
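To make the difference concrete, here is a minimal sketch of naive fixed-size chunking in plain Python; note how it can cut words and sentences in half, which is exactly what a context-aware splitter avoids.
def fixed_size_chunks(text, size=512):
    # Split every `size` characters, ignoring word and sentence boundaries
    return [text[i:i + size] for i in range(0, len(text), size)]

naive_chunks = fixed_size_chunks(content.text)
print(naive_chunks[0][-80:])  # the cut frequently lands mid-sentence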
For this guide, we’ll use the RecursiveCharacterTextSplitter from the LangChain framework to perform chunking, ensuring that splits occur at logical points in the text.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,  # Set chunk size to 512 characters
    length_function=len
)
chunked_text = text_splitter.split_text(content.text)
This code splits the scraped text into chunks of approximately 512 characters, adjusting the splits based on natural breakpoints.
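Before moving on, it is worth sanity-checking the output; a quick inspection of the chunk count and a sample chunk, using the variables defined above:
print(f"Number of chunks: {len(chunked_text)}")
print(chunked_text[0][:200])  # preview the start of the first chunk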
From Chunks to Vector Embeddings
Once we have the text chunks, the next step is to convert them into vector embeddings. Embeddings are numerical representations of the text that capture its semantic meaning, allowing for efficient similarity comparisons.
Types of Embeddings
There are two primary types of embeddings:
- Dense Embeddings: Generated by deep learning models like those from OpenAI or Sentence Transformers. They encode semantic similarity well.
- Sparse Embeddings: Generated by classical methods like TF-IDF or BM25. They are effective for keyword-based similarity.
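For comparison, here is a minimal sparse-embedding sketch using scikit-learn's TfidfVectorizer (the choice of library is an assumption; any TF-IDF implementation would work):
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit TF-IDF on the chunks to produce sparse, keyword-based vectors
vectorizer = TfidfVectorizer()
sparse_vectors = vectorizer.fit_transform(chunked_text)
print(sparse_vectors.shape)  # (number of chunks, vocabulary size)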
For our RAG application, we’ll use dense embeddings generated by the all-MiniLM-L6-v2 model from Sentence Transformers.
from langchain.embeddings import SentenceTransformerEmbeddings
# Load the model for generating embeddings
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
# Create an embedding for one of the chunks (here, the chunk at index 3)
chunk_embedding = embeddings.embed_documents([chunked_text[3]])
This code converts one of the chunks into a dense embedding using the all-MiniLM-L6-v2 model. In practice, you would generate embeddings for all chunks.
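As a quick sanity check, embeddings of related chunks should be close to each other; here is a small sketch comparing two chunks with cosine similarity (NumPy is assumed to be available):
import numpy as np

# Embed two chunks and measure how similar their vectors are
vec_a = np.array(embeddings.embed_query(chunked_text[0]))
vec_b = np.array(embeddings.embed_query(chunked_text[1]))
cosine = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(f"Cosine similarity: {cosine:.3f}")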
Storing and Retrieving Embeddings with Milvus
Once we’ve generated the embeddings, we need to store them in a vector database for efficient retrieval. Milvus is an open-source vector database that specializes in storing and searching embeddings. It integrates well with LangChain, making it an excellent choice for RAG applications.
Here’s how to store your chunk embeddings in Milvus (this assumes a Milvus instance is already running and reachable; by default, LangChain tries to connect to a local instance on port 19530):
from langchain.vectorstores.milvus import Milvus
# Store the embeddings in Milvus
vector_db = Milvus.from_texts(
    texts=chunked_text,
    embedding=embeddings,
    collection_name="rag_milvus",
)
This code creates a collection in Milvus and stores all the chunk embeddings for future retrieval.
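Before building the full pipeline, you can query the collection directly to confirm that retrieval works, using LangChain's similarity_search method:
# Retrieve the chunks most similar to a test query
results = vector_db.similarity_search("What does a data scientist do?", k=3)
for doc in results:
    print(doc.page_content[:100], "...")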
Building the RAG Pipeline
With the chunks stored and embeddings ready, it’s time to construct our RAG pipeline. This pipeline will retrieve the most relevant embeddings based on user queries and pass them to the LLM to generate responses.
Step 1: Set Up the Retriever
We first need to set up a retriever that fetches the most relevant embeddings from the vector database based on the user’s query.
retriever = vector_db.as_retriever()
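By default, the retriever returns the top matching chunks; if you want explicit control over how many are fetched per query, you can pass search_kwargs (an optional tweak):
# Optionally control how many chunks are retrieved per query
retriever = vector_db.as_retriever(search_kwargs={"k": 4})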
Step 2: Initialize the LLM
Next, we initialize our language model using OpenAI’s GPT-3.5-turbo:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-3.5-turbo-0125")
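ChatOpenAI expects your OpenAI API key to be available, typically via the OPENAI_API_KEY environment variable; one way to set it from code before initializing the model (the key below is a placeholder):
import os
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder; replace with your own key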
Step 3: Define a Custom Prompt
We need to create a prompt template that will guide the LLM to generate appropriate answers based on the retrieved content.
from langchain_core.prompts import PromptTemplate
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
custom_rag_prompt = PromptTemplate.from_template(template)
Step 4: Build the RAG Chain
Finally, we’ll create the RAG chain that will retrieve the most relevant chunks, pass them to the LLM, and output the generated response.
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | custom_rag_prompt
    | llm
    | StrOutputParser()
)
With the RAG chain set up, you can now send a query to the pipeline and receive an answer based on the website content.
for chunk in rag_chain.stream("What is a Data Scientist?"):
    print(chunk, end="", flush=True)
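If you don't need token-by-token streaming, the same chain can also be called with invoke to get the complete answer at once:
answer = rag_chain.invoke("What is a Data Scientist?")
print(answer)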
Conclusion
In this guide, we covered the process of extracting website content and using it to improve LLM responses in an RAG application. We discussed web scraping, text chunking, generating vector embeddings, and storing those embeddings in a vector database like Milvus.
By using this technique, you can develop more informed and contextually aware AI applications. Whether you’re creating a chatbot or a question-answering system, RAG boosts the relevance and accuracy of the generated responses.
It’s important to remember that the success of your RAG pipeline relies on the quality of your data and how you organize the chunks, embeddings, and retrieval process. Experiment with different models, chunk sizes, and retrieval methods to refine your system.
Happy coding, and thanks for reading!