GraphRAG: Advanced Data Retrieval for Enhanced Insights

Bhavik Jikadara
5 min read · Jul 6, 2024

What is GraphRAG?

GraphRAG is a method for improving how computers understand and use information. It stands for “Graph-based Retrieval Augmented Generation.” This approach is more advanced than just searching through plain text. Instead of simply finding text snippets, GraphRAG builds a structured and hierarchical map of the information.

How does GraphRAG work?

  1. Extracting a Knowledge Graph: It starts by creating a “knowledge graph” from the raw text. A knowledge graph is like a network of connected ideas, where each idea (or “node”) is linked to others in meaningful ways.
  2. Building a Community Hierarchy: Next, it organizes these connected ideas into groups, or “communities.” Think of these communities as clusters of related concepts.
  3. Generating Summaries: For each community, GraphRAG generates summaries that capture the main points. This helps in understanding the key ideas without getting lost in details.
  4. Leveraging the Structures: When you need to perform tasks that involve retrieving and generating information (RAG-based tasks), GraphRAG uses this well-organized structure, which makes the process more efficient and accurate (a toy sketch of these four steps follows this list).
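To make these four steps concrete, here is a small, self-contained Python sketch using networkx. It is not GraphRAG's implementation: the entities and relationships are hard-coded stand-ins for what GraphRAG extracts with an LLM, community detection uses greedy modularity in place of the Leiden algorithm, and the "summaries" are simple placeholders.

# Conceptual sketch of the four GraphRAG steps (toy example, not the real pipeline)
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# 1. Knowledge graph: nodes are entities, edges are extracted relationships
G = nx.Graph()
G.add_edges_from([
    ("Scrooge", "Marley"),         # former business partners
    ("Scrooge", "Bob Cratchit"),   # employer / clerk
    ("Bob Cratchit", "Tiny Tim"),  # father / son
    ("Scrooge", "Fred"),           # uncle / nephew
])

# 2. Community hierarchy: cluster related entities
#    (greedy modularity here; GraphRAG uses Leiden)
communities = greedy_modularity_communities(G)

# 3. Community summaries: GraphRAG has an LLM write these;
#    here we just list the members as a placeholder
summaries = {i: "Community about: " + ", ".join(sorted(c))
             for i, c in enumerate(communities)}

# 4. At query time, the graph, communities, and summaries supply
#    structured context for the LLM
for i, text in summaries.items():
    print(i, text)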

Why use GraphRAG?

GraphRAG enhances your AI’s ability to reason about complex and private data. By structuring information more intelligently, it allows the AI to make better decisions and provide more accurate responses. It’s particularly useful for improving the performance of large language models (LLMs) when dealing with intricate datasets.

To learn more about how GraphRAG can benefit your AI projects, you can check out the detailed explanation in the Microsoft Research Blog Post.

GraphRAG vs Baseline RAG 🔍

Retrieval-augmented generation (RAG) is a technique used to enhance the outputs of large language models (LLMs) by integrating real-world information. This technique is essential for many tools that rely on LLMs. The most common RAG approaches use vector similarity for searching relevant information, which is known as Baseline RAG.
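For comparison, here is a minimal sketch of Baseline RAG's retrieval step: text chunks are ranked by vector similarity to the query and the top matches are pasted into the prompt. The embed() function is a hypothetical placeholder; a real system would call an embedding model.

# Baseline RAG sketch: rank chunks by cosine similarity to the query.
# embed() is a toy stand-in for a real embedding model.
import numpy as np

def embed(text: str) -> np.ndarray:
    vec = np.zeros(64)
    for i, ch in enumerate(text.lower()):
        vec[(i + ord(ch)) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

chunks = [
    "Scrooge was Marley's sole executor.",
    "Bob Cratchit worked as Scrooge's clerk.",
    "Tiny Tim is Bob Cratchit's youngest son.",
]

query = "Who worked for Scrooge?"
q = embed(query)
scores = [float(q @ embed(c)) for c in chunks]

# The top-k chunks become the context pasted into the LLM prompt
for score, chunk in sorted(zip(scores, chunks), reverse=True)[:2]:
    print(f"{score:.3f}  {chunk}")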

GraphRAG, on the other hand, uses knowledge graphs, resulting in significant improvements when dealing with complex information.

How RAG Techniques Help

RAG techniques are particularly useful for enabling LLMs to reason about private datasets — data that the LLM has never seen before, such as an enterprise’s proprietary research, business documents, or internal communications. While Baseline RAG was designed to solve this problem, it often falls short in certain scenarios.

Challenges with Baseline RAG

  1. Connecting the Dots: Baseline RAG struggles when it needs to link different pieces of information through their shared attributes to provide new insights. It can’t easily traverse and synthesize disparate data points.
  2. Understanding Large Data Collections: Baseline RAG performs poorly when it needs to understand summarized concepts across large data sets or even within single extensive documents.

The Advantage of GraphRAG

To address these shortcomings, the tech community, including Microsoft Research, has been developing advanced methods like GraphRAG. Here’s how GraphRAG works:

  1. Creating a Knowledge Graph: GraphRAG uses LLMs to build a knowledge graph from the input text. This graph represents relationships and connections between different pieces of information.
  2. Generating Community Summaries: It organizes the information into communities and generates summaries for these groups, capturing the main ideas.
  3. Augmenting Prompts: When a query is made, GraphRAG leverages the knowledge graph, community summaries, and graph machine learning outputs to enhance the prompts provided to the LLM (a small illustrative sketch of this augmentation follows this list).
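The third step can be pictured as assembling retrieved graph context into the prompt. The template and variable names below are illustrative, not GraphRAG's actual prompt format:

# Illustrative prompt augmentation: combine community summaries and
# graph relationships into the context handed to the LLM.
community_summaries = [
    "The Cratchit family is poor but close-knit; Bob works for Scrooge.",
    "Scrooge is a miser haunted by his late partner Marley.",
]
entity_relationships = [
    ("Scrooge", "employs", "Bob Cratchit"),
    ("Bob Cratchit", "father of", "Tiny Tim"),
]

question = "How is Tiny Tim connected to Scrooge?"

context = "\n".join(
    ["Community summaries:"]
    + [f"- {s}" for s in community_summaries]
    + ["Relationships:"]
    + [f"- {a} {rel} {b}" for a, rel, b in entity_relationships]
)

prompt = f"{context}\n\nQuestion: {question}\nAnswer using only the context above."
print(prompt)  # this augmented prompt is what the LLM receives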

Performance Improvements

GraphRAG shows substantial improvements in two key areas:

  1. Complex Information Traversal: It excels at connecting different pieces of information to provide new, synthesized insights.
  2. Holistic Understanding: It performs better at understanding and summarizing large data collections, offering a more comprehensive grasp of the information.

The GraphRAG Process 🤖

GraphRAG enhances traditional RAG methods by incorporating graph machine learning. Here’s an overview of the GraphRAG process:

Indexing Phase

  • TextUnits Creation: Slice the input text into smaller units called TextUnits. These serve as the fundamental building blocks for analysis and provide detailed references in the output (a toy chunking sketch follows this list).
  • Entity Extraction: Use an LLM to extract all entities (e.g., people, places, organizations), relationships, and key claims from the TextUnits.
  • Hierarchical Clustering: Perform hierarchical clustering on the graph using the Leiden technique. When the graph is visualized, each circle represents an entity, with its size indicating the entity’s importance (degree) and its color indicating its community.
  • Community Summarization: Generate summaries for each community and its components from the bottom up. This helps in gaining a comprehensive understanding of the dataset.
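As a rough picture of the TextUnits step, here is a character-based chunking sketch with overlap. GraphRAG itself chunks by tokens and reads its sizes from the project configuration, so the make_text_units() helper and the numbers below are illustrative only.

# Toy TextUnit creation: split raw text into overlapping chunks.
def make_text_units(text: str, size: int = 120, overlap: int = 20) -> list[str]:
    units, start, step = [], 0, size - overlap
    while start < len(text):
        units.append(text[start:start + size])
        start += step
    return units

sample = ("Marley was dead, to begin with. Scrooge knew he was dead. "
          "Scrooge was his sole executor, his sole administrator, "
          "his sole friend, and sole mourner. ") * 3

for i, unit in enumerate(make_text_units(sample)):
    print(f"TextUnit {i}: {unit[:60]}...")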

Query Phase

At query time, GraphRAG utilizes the structured information to enhance the context provided to the LLM. The primary query modes, both sketched in the toy example after this list, are:

  • Global Search: Use community summaries to reason about broad, holistic questions concerning the entire corpus.
  • Local Search: Focus on specific entities by expanding the search to their neighbors and associated concepts for more detailed reasoning.
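The difference between the two modes can be sketched with the toy graph and summaries from earlier. The global_search() and local_search() helpers below are hypothetical illustrations of the idea (global reasons over community summaries, local expands an entity’s neighborhood), not GraphRAG’s API.

# Toy illustration of the two query modes (not GraphRAG's API)
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Scrooge", "Marley"), ("Scrooge", "Bob Cratchit"),
    ("Bob Cratchit", "Tiny Tim"), ("Scrooge", "Fred"),
])
community_summaries = [
    "Scrooge's business world: Scrooge, Marley, and the counting-house.",
    "The Cratchit household: Bob Cratchit, Tiny Tim, and their family.",
]

def global_search(question: str) -> str:
    # Broad question: build context from community summaries over the whole corpus
    return "Context:\n" + "\n".join(community_summaries) + f"\nQ: {question}"

def local_search(entity: str, question: str, hops: int = 1) -> str:
    # Specific question: expand the entity's graph neighborhood
    neighbourhood = nx.ego_graph(G, entity, radius=hops)
    facts = [f"{a} -- {b}" for a, b in neighbourhood.edges()]
    return "Context:\n" + "\n".join(facts) + f"\nQ: {question}"

print(global_search("What are the top themes in this story?"))
print(local_search("Scrooge", "Who is Scrooge, and what are his main relationships?"))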

Installation

Requirements

Python 3.10–3.12

To get started with the GraphRAG system, you have a few options:

👉 Use the GraphRAG Accelerator solution
👉 Install from PyPI.
👉 Use it from source

Getting Started:

1. Install GraphRAG:
pip install graphrag

2. Set Up Your Workspace:

Next, we need to set up a data project and some initial configuration. We’ll use the default configuration mode, which you can customize as needed via a config file (recommended) or environment variables.

First, let’s get a sample dataset ready:

  • Create a sample dataset:
mkdir input

# Add your own data
touch input/book.txt
  • To initialize your workspace, run the graphrag.index --init command. Since the input data now lives in the current directory, run:
python -m graphrag.index --init --root .

3. Configuration:

The init command above creates a .env file and a settings.yaml file in your project root. Set GRAPHRAG_API_KEY in .env to your OpenAI (or Azure OpenAI) API key, and adjust settings.yaml (model, chunk size, and other pipeline options) as needed.

4. Running the Indexing Pipeline:

python -m graphrag.index --root .

This process will take some time to run, depending on the size of your input data, the model you’re using, and the text chunk size (these can be configured in your settings.yaml and .env files). Once the pipeline is complete, you should see a new folder called ./output/<timestamp>/artifacts with a series of parquet files.
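If you want to inspect those artifacts, they can be loaded directly with pandas (plus pyarrow or fastparquet for parquet support). The exact artifact file names vary by GraphRAG version, so the snippet below simply lists whatever parquet files the latest run produced.

# List and peek at the parquet artifacts from the most recent indexing run
from pathlib import Path
import pandas as pd  # requires pyarrow or fastparquet for read_parquet

artifact_dirs = sorted(Path("output").glob("*/artifacts"))
latest = artifact_dirs[-1]  # most recent run (assumes indexing has completed)

for parquet_file in sorted(latest.glob("*.parquet")):
    df = pd.read_parquet(parquet_file)
    print(f"{parquet_file.name}: {len(df)} rows, columns={list(df.columns)}")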

5. Using the Query Engine:

  • Global Search: Here is an example using Global search to ask a high-level question:
python -m graphrag.query --root . --method global "What are the top themes in this story?"
  • Local Search: Here is an example using Local search to ask a more specific question about a particular character:
python -m graphrag.query --root . --method local "Who is Scrooge, and what are his main relationships?"

Please refer to the Query Engine docs for detailed information on how to leverage the Local and Global search mechanisms to extract meaningful insights from your data once the indexer has finished.
