Hybrid Search: Implementing RAG with LanceDB for Production Applications
In Retrieval-Augmented Generation (RAG), what a Large Language Model (LLM) generates depends on what the retriever finds, so efficient and scalable search mechanisms are critical. Vector search has gained prominence, especially for semantic search, but challenges emerge when it is applied on its own to large-scale, real-world applications.
This article dives into Hybrid Search, an approach combining keyword-based and vector searches, utilizing LanceDB as the core database, to deliver precise and user-aligned search results.
What is Hybrid Search?
Hybrid Search combines traditional keyword-based searching with advanced techniques like natural language processing (NLP), semantic search, and machine learning to deliver more accurate and relevant results. Why stick to just one method when you can harness the strengths of multiple approaches? By blending these technologies, hybrid search can understand the context and meaning behind queries, going beyond simple keyword matching to retrieve information that meets the user’s intent.
How does this impact real-world applications? In the workplace, enterprise search engines using hybrid search enable employees to locate specific information within vast knowledge bases quickly. E-commerce platforms benefit from hybrid search by helping customers find the right products, even if they’re unsure of the exact name. Traditional web search engines are also embracing this approach, enhancing the accuracy and relevance of the results they provide to users.
How Does Hybrid Search Work?
Hybrid search operates by integrating traditional keyword-based search, which uses sparse vectors, with a modern semantic search that employs dense vectors, offering a more nuanced and accurate retrieval experience. But how exactly does it achieve this? Let’s break it down.
Keyword-Based Search (Sparse Vectors)
- In traditional search engines, keyword-based search utilizes sparse vectors, where each dimension corresponds to a unique term from a vast vocabulary. This method, facilitated by techniques like term frequency-inverse document frequency (TF-IDF) and inverted indexing, efficiently matches query keywords with documents. It excels in delivering fast and precise results for exact term matches but can fall short when relevant documents don’t contain the exact keywords.
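To make the sparse-vector idea concrete, here is a minimal sketch of TF-IDF scoring over an inverted index in plain Python. The toy corpus, the document ids, and the helper names (`build_inverted_index`, `tfidf_vector`, `keyword_search`) are assumptions made for illustration; production engines use more refined weighting and storage.

```python
import math
from collections import Counter, defaultdict

# Toy corpus; the documents and ids are illustrative only.
DOCS = {
    "d1": "hybrid search combines keyword search and vector search",
    "d2": "vector search uses dense embeddings",
    "d3": "keyword search uses an inverted index",
}

def tokenize(text):
    return text.lower().split()

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def tfidf_vector(text, index, n_docs):
    """Sparse TF-IDF vector as a dict: only terms present in the text get an entry."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    vec = {}
    for term, count in counts.items():
        df = len(index.get(term, ()))
        if df == 0:
            continue  # term never seen in the corpus
        idf = math.log(n_docs / df) + 1.0  # +1 keeps very common terms at a small weight
        vec[term] = (count / total) * idf
    return vec

def keyword_search(query, docs):
    index = build_inverted_index(docs)
    q_vec = tfidf_vector(query, index, len(docs))
    # Only documents sharing at least one query term become candidates.
    candidates = set().union(*(index[t] for t in q_vec)) if q_vec else set()
    scores = {}
    for doc_id in candidates:
        d_vec = tfidf_vector(docs[doc_id], index, len(docs))
        scores[doc_id] = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(keyword_search("inverted index", DOCS))
print(keyword_search("semantic meaning", DOCS))  # no shared terms, so no results
```

The second query shows the weakness described above: a query with no terms in common with any document returns nothing, however related the content may be.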
Semantic Search (Dense Vectors)
- Semantic search, on the other hand, leverages dense vectors created using advanced techniques like word embeddings (e.g., Word2vec, GloVe) or contextual embeddings (e.g., BERT, GPT). These dense vectors capture the semantic meaning and context of words and phrases, enabling the search system to understand and match the underlying intent of a query, even when the exact keywords are absent.
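As a rough illustration of dense-vector matching, the sketch below ranks texts by cosine similarity. The three-dimensional vectors are invented by hand for this example; a real embedding model produces hundreds or thousands of dimensions, and the function names are hypothetical.

```python
import math

# Hand-written toy "embeddings"; the values are invented for illustration only.
EMBEDDINGS = {
    "cheap laptop": [0.9, 0.1, 0.0],
    "affordable notebook computer": [0.8, 0.2, 0.1],
    "banana bread recipe": [0.0, 0.1, 0.9],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def semantic_search(query_vec, embeddings):
    """Rank every text by cosine similarity to the query vector, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in embeddings.items()]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)

for text, score in semantic_search(EMBEDDINGS["cheap laptop"], EMBEDDINGS):
    print(f"{score:.3f}  {text}")
```

Here "affordable notebook computer" scores close to "cheap laptop" even though the two share no words, which is the behavior a keyword index cannot provide.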
Combining Sparse and Dense Vectors
- In a hybrid search system, both sparse and dense vectors are generated and stored in respective indices. When a query is submitted, it is processed to produce both types of vectors, allowing the system to perform a search across both indices. The initial retrieval involves selecting candidate documents from both sparse (keyword match) and dense (semantic match) indices.
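The candidate-selection step can be sketched as a merge of two ranked hit lists. The function name `gather_candidates`, the hit lists, and the scores below are hypothetical; this assumes each index returns `(doc_id, score)` pairs ordered best first.

```python
def gather_candidates(keyword_hits, vector_hits, k=3):
    """Merge the top-k hits of each index into one candidate table.

    Each input is a list of (doc_id, score) pairs, best first.
    A score of None means that index did not return the document.
    """
    candidates = {}
    for doc_id, score in keyword_hits[:k]:
        candidates.setdefault(doc_id, {"keyword": None, "vector": None})["keyword"] = score
    for doc_id, score in vector_hits[:k]:
        candidates.setdefault(doc_id, {"keyword": None, "vector": None})["vector"] = score
    return candidates

# Illustrative hits; the two score scales differ, as they usually do in practice.
keyword_hits = [("d3", 4.2), ("d1", 2.9)]
vector_hits = [("d1", 0.91), ("d2", 0.88), ("d4", 0.40)]

for doc_id, scores in gather_candidates(keyword_hits, vector_hits).items():
    print(doc_id, scores)
```

Keeping both scores per candidate, including the missing ones, is what lets the later ranking step decide how to treat a document that only one index found.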
Retrieval and Ranking
- The next step is re-ranking the retrieved documents, combining relevance scores from both sparse and dense vectors. Machine learning models often fine-tune this ranking by considering factors like query context, user behavior, and overall document relevance, ensuring the most relevant documents appear at the top.
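One simple way to combine the two relevance scores is a weighted sum after min-max normalisation, sketched below. This is an assumption-laden illustration rather than the method of any particular engine: the parameter name `alpha`, the choice of min-max scaling, and the rule that a missing score counts as 0 are all choices made for this example.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def linear_fusion(keyword_scores, vector_scores, alpha=0.5):
    """Score = alpha * vector + (1 - alpha) * keyword, after normalisation.

    A document missing from one index contributes 0 for that side.
    """
    kw = min_max(keyword_scores)
    vec = min_max(vector_scores)
    fused = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(kw) | set(vec)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative raw scores on different scales.
keyword_scores = {"d1": 2.0, "d2": 8.0, "d3": 5.0}
vector_scores = {"d1": 0.9, "d3": 0.8, "d4": 0.5}

print(linear_fusion(keyword_scores, vector_scores, alpha=0.5))
```

With equal weights, `d3` comes out on top because it scores reasonably in both indices, while `d1` and `d2` each excel in only one. Normalising first matters because raw keyword and vector scores are not on comparable scales.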
Comparing Keyword, Semantic, and Hybrid Search
- Each search method — keyword, semantic, and hybrid — has its unique strengths and applications. While keyword search is fast and efficient for exact matches, semantic search excels in understanding context and meaning. Hybrid search, by combining these approaches, offers a balanced solution that enhances accuracy and relevance, making it suitable for a wide range of applications like enterprise search, digital libraries, and e-commerce.
Ultimately, hybrid search is often the best choice for modern applications due to its ability to leverage the strengths of both keyword and semantic search, providing users with the most relevant and precise results.
Why Hybrid Search?
- Enhanced Relevance and Precision: Combines keyword search’s exact matching with semantic search’s contextual understanding, retrieving precise and semantically relevant results.
- Better Query Handling: Handles both simple keyword queries and complex natural language queries, improving accuracy and user experience.
- Comprehensive Results: Reduces the chance that relevant documents are missed by covering both exact keyword matches and semantically related content.
- Adaptability: Lets you adjust the weight between keyword matches and semantic relevance, and the ranking can be tuned further with machine learning models.
- Optimized Performance: Balances computational load by filtering results with keyword search and fine-tuning with semantic search, enabling scalable performance.
- Versatility in Applications: Ideal for enterprise search, e-commerce, digital libraries, and more, catering to diverse and complex queries for better user satisfaction.
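The "Optimized Performance" point can be sketched as a two-stage pipeline: a cheap keyword filter narrows the pool, and the costlier similarity computation runs only on the survivors. The documents, toy embeddings, and the function name `two_stage_search` are hypothetical and serve only to illustrate the idea.

```python
import math

# Each document carries its text and a toy embedding (values invented for illustration).
DOCS = {
    "d1": ("wireless noise cancelling headphones", [0.9, 0.1]),
    "d2": ("wired studio headphones", [0.6, 0.4]),
    "d3": ("wireless phone charger", [0.1, 0.9]),
    "d4": ("garden hose", [0.0, 1.0]),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def two_stage_search(query_terms, query_vec, docs):
    # Stage 1: keep only documents that share at least one term with the query.
    survivors = {
        doc_id for doc_id, (text, _) in docs.items()
        if set(query_terms) & set(text.split())
    }
    # Stage 2: rank the survivors by vector similarity.
    return sorted(survivors, key=lambda d: cosine(query_vec, docs[d][1]), reverse=True)

print(two_stage_search(["wireless", "headphones"], [1.0, 0.0], DOCS))
```

The trade-off is worth stating: a strict keyword pre-filter saves compute but also discards documents that are semantically relevant yet share no terms with the query, so many systems retrieve from both indices in parallel instead.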
Examples of Hybrid Search
Now that we’ve gone over why you should consider implementing hybrid search, let’s discuss examples of hybrid search across different platforms. Each platform has unique features and capabilities that enhance search accuracy and relevance.
Couchbase
- Couchbase is a NoSQL cloud database platform that allows teams to build powerful search capabilities into applications. It supports vector, full-text, geolocation, ranges, and predicate search techniques, all within a single SQL query and index — delivering simplicity and lower latency. You can learn more about Couchbase’s hybrid search capabilities here.
Elasticsearch
- Elasticsearch is a powerful open-source search engine that supports keyword-based and semantic search functionalities. It integrates with various plugins and tools like Kibana for visualization and machine learning to enhance search relevance. You can learn more about Elasticsearch’s hybrid search capabilities in this blog post.
Algolia
- Algolia is a search-as-a-service platform that provides real-time search and discovery capabilities. It combines keyword-based search with features like typo tolerance, synonyms, and personalization, which are aspects of semantic search. You can learn more about Algolia’s AI search capabilities here.
Amazon Kendra
- Amazon Kendra is an intelligent search service powered by machine learning. It offers natural language understanding capabilities to deliver more relevant search results, combining keyword and semantic searches. You can learn more about Amazon Kendra’s features here.
Implementation
Project Structure for Your Hybrid Search Application: To structure your Hybrid Search application, you can follow this layout:
HybridSearch/
│
├── data/
│ └── BEIR/ (place your datasets here)
│
├── db/
│ └── (LanceDB database files)
│
├── scripts/
│ ├── setup.py (environment setup)
│ ├── load_data.py (load and preprocess datasets)
│ ├── hybrid_search.py (implement Hybrid Search)
│ └── rerankers.py (custom rerankers and filters)
│
└── README.md (project overview and instructions)
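If you would rather create this layout from a script than by hand, a small helper along these lines would do it. The function name `scaffold` is hypothetical; it creates empty placeholder files only.

```python
from pathlib import Path

def scaffold(root):
    """Create the HybridSearch directory layout under `root` and return its path."""
    base = Path(root) / "HybridSearch"
    for folder in ("data/BEIR", "db", "scripts"):
        (base / folder).mkdir(parents=True, exist_ok=True)
    for script in ("setup.py", "load_data.py", "hybrid_search.py", "rerankers.py"):
        (base / "scripts" / script).touch(exist_ok=True)
    (base / "README.md").touch(exist_ok=True)
    return base
```

Calling `scaffold(".")` builds the tree in the current directory; running it again is harmless because existing folders and files are left in place.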
Setting Up Your LanceDB Environment: To start implementing Hybrid Search, you’ll need to set up your environment. LanceDB is a great tool for this purpose, offering full-text search (using Tantivy) and re-ranking capabilities out of the box. Here’s how you can get started:
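import os
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector  # needed for the table schema
from datasets import load_dataset

os.environ["OPENAI_API_KEY"] = "sk-......." # Your API Key
embeddings = get_registry().get("openai").create()  # OpenAI embedding function from the LanceDB registry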
Creating a LanceDB Table with BEIR Data: The next step is to create a LanceDB table and load the BEIR dataset. This example shows how to create a table with additional metadata for a more “hybrid” search.
queries = load_dataset("BeIR/scidocs", "queries")["queries"].to_pandas()
full_docs = load_dataset("BeIR/scidocs", "corpus")["corpus"].to_pandas().dropna(subset=["text"])
docs = full_docs.head(64).copy()  # Copy so the new column is added to an independent frame
docs["num_words"] = docs["text"].apply(lambda x: len(x.split())) # Adding metadata

# Table schema; LanceModel and Vector are imported from lancedb.pydantic
class Documents(LanceModel):
    vector: Vector(embeddings.ndims()) = embeddings.VectorField()
    text: str = embeddings.SourceField()
    title: str
    num_words: int

data = docs.apply(lambda row: {"title": row["title"], "text": row["text"], "num_words": row["num_words"]}, axis=1).values.tolist()
db = lancedb.connect("./db")
table = db.create_table("documents", schema=Documents)
table.add(data) # Ingest documents with auto-vectorization
table.create_fts_index("text") # Create a full-text search index
Hybrid Search in Action: Fusion and Re-Ranking: Now that your table is set up, you can perform a hybrid search by applying fusion or re-ranking techniques.
from lancedb.rerankers import LinearCombinationReranker

# weight is the share given to the vector score; the remainder goes to the full-text score
reranker = LinearCombinationReranker(weight=0.3)
results = table.search("Confuse the AI with random terms", query_type="hybrid").rerank(reranker=reranker).limit(5).to_pandas()
Customizing Your Search: Filters and Modified Rerankers: You can further customize your Hybrid Search by implementing filters and creating modified rankers.
from typing import List, Union
import pandas as pd
import pyarrow as pa

class ModifiedLinearReranker(LinearCombinationReranker):
    def __init__(self, filters: Union[str, List[str]], **kwargs):
        super().__init__(**kwargs)
        # Accept either a single term or a list of terms
        self.filters = filters if isinstance(filters, list) else [filters]

    def rerank_hybrid(self, query: str, vector_results: pa.Table, fts_results: pa.Table) -> pa.Table:
        combined_result = super().rerank_hybrid(query, vector_results, fts_results)
        df = combined_result.to_pandas()
        # Keep only documents longer than 150 words
        df = df[df["num_words"] > 150]
        # Drop documents whose text contains any of the filter terms
        for term in self.filters:
            df = df[~df["text"].str.contains(term)]
        # preserve_index=False avoids adding the pandas index as an extra column
        return pa.Table.from_pandas(df, preserve_index=False)

modified_reranker = ModifiedLinearReranker(filters=["dual-band"])
table.search("Confuse the AI with random terms", query_type="hybrid").rerank(reranker=modified_reranker).limit(5).to_pandas()
This code applies custom filtering criteria: only results with more than 150 words that do not contain the filter term ("dual-band") are kept. You can also change the merging mechanism by inheriting from the built-in Reranker and adding some custom logic!
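As an example of a different merging mechanism, the sketch below implements reciprocal rank fusion (RRF) in plain Python. It is not LanceDB's reranker API: the function name, the input format (lists of document ids, best first), and the constant `k=60` are assumptions for illustration, and plugging this into a custom reranker would still require converting between `pa.Table` results and ranked id lists.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids; each list is ordered best first.

    Every document earns 1 / (k + rank) per list it appears in (rank starts at 1).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative rankings from a full-text index and a vector index.
fts_ranking = ["d3", "d1", "d2"]
vector_ranking = ["d1", "d4", "d3"]

print(reciprocal_rank_fusion([fts_ranking, vector_ranking]))
# ['d1', 'd3', 'd4', 'd2']
```

Because RRF uses only rank positions, it sidesteps the problem of keyword and vector scores living on different scales, at the cost of ignoring how large the score gaps are.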
Conclusion
Hybrid Search using LanceDB provides a balanced approach to search, combining the precision of keyword-based methods with the context-awareness of semantic search. This dual approach makes it ideal for modern, large-scale applications where relevance and precision are paramount. The structured implementation offers a scalable, adaptable solution for a wide range of search-driven applications.
Happy Learning!