
Developing RAG Systems with DeepSeek R1 & Ollama
Build robust RAG systems using DeepSeek R1 and Ollama. Discover setup procedures, best practices, and tips for developing intelligent AI solutions.
DeepSeek R1 and Ollama provide powerful tools for building Retrieval-Augmented Generation (RAG) systems. This guide covers the setup, implementation, and best practices for developing RAG applications using these technologies.
Why RAG Systems Are Game-Changing
Retrieval-augmented generation (RAG) systems combine the best of search and generative AI, enabling context-aware responses that are precise and accurate. With tools like DeepSeek R1 and Ollama, creating a RAG system is no longer daunting. Whether you’re building a chatbot, knowledge assistant, or an AI-powered search engine, this guide equips you with everything you need to know.
Prerequisites
- Python 3 and pip installed
- Ollama installed and running locally, with the DeepSeek R1 model pulled
- Basic familiarity with Python and LangChain
What You’ll Learn
- Setting up DeepSeek R1 and Ollama for RAG.
- Implementing document processing, vector storage, and query pipelines.
- Optimizing for performance, relevance, and user experience.
Steps to Build the RAG Pipeline
1. Setting Up the Environment and Importing Libraries
Ensure you have installed the required Python packages. You can install them using:
pip install langchain-core langchain-community langchain-ollama langchain-huggingface langchain-text-splitters sentence-transformers faiss-cpu psutil
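The pipeline also expects the DeepSeek R1 model to be available locally through Ollama. Assuming you use the same deepseek-r1:8b tag that appears later in the code, pull it once up front:
ollama pull deepseek-r1:8b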
Then import the required libraries:
from typing import List
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_ollama.llms import OllamaLLM
from langchain_community.vectorstores import FAISS
import logging
import psutil
import os
2. Initializing the RAGPipeline Class
The RAGPipeline class manages the entire process, including memory monitoring, document loading, embedding generation, and querying the model.
class RAGPipeline:
    def __init__(self, model_name: str = "deepseek-r1:8b", max_memory_gb: float = 3.0):
        self.setup_logging()
        self.check_system_memory(max_memory_gb)
        # Load the language model (LLM) served by Ollama
        self.llm = OllamaLLM(model=model_name)
        # Initialize embeddings using a lightweight model
        self.embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-mpnet-base-v2",
            model_kwargs={'device': 'cpu'}  # Use CPU for efficiency
        )
        # Define the prompt template
        self.prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the following context. Be concise.
If you cannot find the answer in the context, say "I cannot answer this based on the provided context."
Context: {context}
Question: {question}
Answer: """)
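Before building the full pipeline, it can help to confirm that Ollama is serving the model and that the template fills its placeholders as expected. A minimal standalone sketch, assuming deepseek-r1:8b is already pulled (the test strings are placeholders):
from langchain_ollama.llms import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate

llm = OllamaLLM(model="deepseek-r1:8b")
print(llm.invoke("Reply with the single word: ready"))  # quick connectivity check

template = ChatPromptTemplate.from_template("Context: {context}\nQuestion: {question}\nAnswer:")
# Render the template without calling the model to verify the placeholders
print(template.format(context="RAG combines retrieval with generation.", question="What does RAG combine?"))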
3. Memory Management and Logging
To prevent crashes in low-memory environments, we log and check available memory before execution.
def setup_logging(self):
    logging.basicConfig(level=logging.INFO)
    self.logger = logging.getLogger(__name__)

def check_system_memory(self, max_memory_gb: float):
    available_memory = psutil.virtual_memory().available / (1024 ** 3)
    self.logger.info(f"Available system memory: {available_memory:.1f} GB")
    if available_memory < max_memory_gb:
        self.logger.warning("Memory is below recommended threshold.")
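If you would rather stop than continue on a memory-constrained machine, the same check can raise instead of warn. A small variation (the method name check_system_memory_strict is hypothetical, not part of the pipeline above):
def check_system_memory_strict(self, max_memory_gb: float):
    available_memory = psutil.virtual_memory().available / (1024 ** 3)
    if available_memory < max_memory_gb:
        # Abort early rather than risk being killed by the OS mid-run
        raise MemoryError(f"Only {available_memory:.1f} GB available, {max_memory_gb} GB recommended")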
4. Loading and Splitting Documents
We use TextLoader and RecursiveCharacterTextSplitter to load documents and split them into overlapping chunks.
def load_and_split_documents(self, file_path: str) -> List[Document]:
    loader = TextLoader(file_path)
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        length_function=len,
        add_start_index=True,
    )
    splits = text_splitter.split_documents(documents)
    self.logger.info(f"Created {len(splits)} document chunks")
    return splits
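Because add_start_index=True is set, each chunk records where it begins in the source file, which is useful when debugging retrieval. A quick way to inspect the output, assuming rag is an initialized RAGPipeline (as in the main() function shown later) and the data/knowledge.txt path from that example:
splits = rag.load_and_split_documents("data/knowledge.txt")
print(splits[0].page_content[:200])  # first 200 characters of the first chunk
print(splits[0].metadata)            # e.g. {'source': 'data/knowledge.txt', 'start_index': 0}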
5. Creating a Vector Store with FAISS
We use FAISS for efficient document retrieval, processing documents in smaller batches to keep memory usage low.
def create_vectorstore(self, documents: List[Document]) -> FAISS:
    batch_size = 32
    # Build the index from the first batch, then add the remaining documents incrementally
    vectorstore = FAISS.from_documents(documents[:batch_size], self.embeddings)
    for i in range(batch_size, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        vectorstore.add_documents(batch)
        self.logger.info(f"Processed batch {i//batch_size + 1}")
    return vectorstore
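Re-embedding every document on each run is the slowest part of the pipeline. A FAISS index built with LangChain can be saved to disk and reloaded later; the faiss_index directory name below is just an example, and rag.embeddings refers to the embeddings object created in the pipeline's __init__:
vectorstore.save_local("faiss_index")
# Later, reload instead of re-embedding. The flag acknowledges that the index is
# deserialized with pickle, so only load index files you created yourself.
vectorstore = FAISS.load_local("faiss_index", rag.embeddings, allow_dangerous_deserialization=True)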
6. Setting Up the RAG Chain
We define the retrieval mechanism to fetch relevant documents efficiently.
def setup_rag_chain(self, vectorstore: FAISS):
    # Retrieve the 2 most similar chunks for each question
    retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 2, "fetch_k": 3})

    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | self.prompt
        | self.llm
        | StrOutputParser()
    )
    return rag_chain
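When answers look wrong, it is usually worth checking what the retriever returns before blaming the model. A small sketch, assuming the vectorstore built in the previous step:
results = vectorstore.similarity_search_with_score("What is AI?", k=2)
for doc, score in results:
    # With FAISS's default L2 distance, lower scores mean closer matches
    print(f"score={score:.3f}  {doc.page_content[:80]}")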
7. Querying the Model with Memory Monitoring
We log memory usage before executing the query.
def query(self, chain, question: str) -> str:
    # Log the resident memory of the current process before invoking the chain
    memory_usage = psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024
    self.logger.info(f"Memory usage: {memory_usage:.1f} MB")
    return chain.invoke(question)
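Because the chain is a standard LangChain runnable, you can also stream tokens as they are generated rather than waiting for the complete answer, which feels more responsive with larger models. A minimal sketch, assuming the chain returned by setup_rag_chain:
for token in chain.stream("What is AI?"):
    print(token, end="", flush=True)
print()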
8. Putting Everything Together in main()
We initialize the RAG pipeline, process documents, and run a sample query.
def main():
    rag = RAGPipeline(model_name="deepseek-r1:8b", max_memory_gb=3.0)
    documents = rag.load_and_split_documents("data/knowledge.txt")
    vectorstore = rag.create_vectorstore(documents)
    chain = rag.setup_rag_chain(vectorstore)
    question = "What is AI?"
    response = rag.query(chain, question)
    print(f"Question: {question}\nAnswer: {response}")

if __name__ == "__main__":
    main()
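For anything beyond a one-off demo you will probably want to reuse the vector store across several questions. One simple extension is an interactive loop built on the same pipeline; this is a sketch, and the prompt text and exit word are arbitrary:
def interactive_loop():
    rag = RAGPipeline(model_name="deepseek-r1:8b", max_memory_gb=3.0)
    documents = rag.load_and_split_documents("data/knowledge.txt")
    chain = rag.setup_rag_chain(rag.create_vectorstore(documents))
    while True:
        question = input("Ask a question (or type 'quit' to exit): ").strip()
        if question.lower() == "quit":
            break
        print(rag.query(chain, question))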
Conclusion
This blog detailed how to build a memory-efficient RAG pipeline using LangChain, Ollama, FAISS, and Hugging Face embeddings. By optimizing document chunking, vector storage, and memory monitoring, this approach ensures efficient AI-driven document retrieval even in low-resource environments. Try implementing this pipeline with your dataset and let us know your thoughts!