Production RAG: Hybrid Search with BM25 and HNSW using RRF

Goh Ling Yong

The Semantic Search Blind Spot in Production RAG

For any senior engineer building Retrieval-Augmented Generation (RAG) systems, the initial excitement of semantic search quickly meets a harsh production reality: it often fails spectacularly on queries that depend on specific keywords, identifiers, or domain-specific jargon. A pure dense vector retrieval system, while brilliant at understanding conceptual similarity ("how to improve database query speed"), is fundamentally ill-equipped to handle a query for a specific error code like "OOM-Killed-Error-137" or a product SKU "XG-49-T2". The embedding model, trained on general language, simply doesn't encode these high-entropy, low-frequency tokens with the precision required for an exact match.

This creates a critical reliability gap. Your RAG system might perform beautifully on general questions but will fail your users on specific, high-intent queries. The solution isn't to abandon dense vectors but to augment them. This article presents a production-grade architecture for a hybrid search system that combines the best of both worlds: the keyword-matching prowess of a sparse retriever (BM25) and the semantic understanding of a dense retriever (HNSW-based vector index).

We will go beyond the theoretical and focus on the most critical implementation detail: fusing the results. We will dissect why simple score normalization is fragile and implement a superior, rank-based approach called Reciprocal Rank Fusion (RRF). By the end, you'll have a complete, benchmarked Python implementation and the architectural patterns to deploy a hybrid retriever that is robust, performant, and capable of handling the full spectrum of user queries.

This is not an introduction to RAG. We assume you are familiar with embeddings, vector stores, and the basic RAG pipeline. Our focus is on solving the next-level problem.


Architecture Deep Dive: The Two Pillars of Hybrid Search

A robust hybrid search system relies on two independent but complementary retrieval mechanisms operating over the same corpus of documents.

1. The Sparse Retriever: BM25 for Lexical Precision

Okapi BM25 (Best Matching 25) is a bag-of-words retrieval function that ranks documents based on the query terms they contain. It's an evolution of TF-IDF and remains a powerful baseline for keyword search.

Under the Hood:

* Term Frequency (TF): It considers how often a query term appears in a document. However, unlike simple TF, it uses a saturation function. The first few occurrences of a term are highly significant, but the marginal benefit diminishes. This prevents very long documents from dominating just by repeating a term.

* Inverse Document Frequency (IDF): It gives more weight to terms that are rare across the entire corpus. A term like "database" is less informative than "PostgreSQL-wal-async-commit".

* Document Length Normalization: It penalizes documents that are longer than the average document length in the corpus, preventing them from having an unfair advantage.

The BM25 score for a document D and a query Q containing terms q_1, ..., q_n is:

Score(D, Q) = Σ [ IDF(q_i) * (f(q_i, D) * (k_1 + 1)) / (f(q_i, D) + k_1 * (1 - b + b * |D| / avgdl)) ]

Where:

* f(q_i, D) is the term frequency of q_i in D.

* |D| is the length of document D.

* avgdl is the average document length in the corpus.

* k_1 and b are hyperparameters. k_1 (typically 1.2-2.0) controls the term frequency saturation, and b (typically 0.75) controls the document length penalty.
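
To make the saturation and length-normalization terms concrete, here is a minimal sketch of the per-term formula in plain Python. The function name and the numbers are illustrative assumptions, not part of any library:

python
def bm25_term_score(tf: float, idf: float, doc_len: int, avgdl: float,
                    k1: float = 1.5, b: float = 0.75) -> float:
    # Saturating TF component: grows with tf but flattens out, and is
    # scaled down for documents longer than the corpus average.
    numerator = tf * (k1 + 1)
    denominator = tf + k1 * (1 - b + b * doc_len / avgdl)
    return idf * numerator / denominator

# Illustrative numbers only: the term appears 3 times in a 120-token document,
# the average document length is 100 tokens, and the term's IDF is 2.2.
print(f"Per-term contribution: {bm25_term_score(tf=3, idf=2.2, doc_len=120, avgdl=100.0):.4f}")
# Summing this contribution over all query terms gives Score(D, Q).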

Python Implementation with rank_bm25:

Let's build a simple, encapsulated BM25 retriever.

python
import numpy as np
from rank_bm25 import BM25Okapi
from typing import List, Dict

# Sample documents for our system
docs = [
    {"id": "doc1", "text": "The FAISS library provides efficient similarity search. HNSW is a key algorithm."}, 
    {"id": "doc2", "text": "PostgreSQL's partial index can optimize queries for multi-tenant apps."}, 
    {"id": "doc3", "text": "Troubleshooting the OOM-Killed-Error-137 in Kubernetes pods often involves memory limits."},
    {"id": "doc4", "text": "BM25 is a sparse retrieval method, great for keyword matching like OOM-Killed-Error-137."},
    {"id": "doc5", "text": "Reciprocal Rank Fusion (RRF) is a technique to combine search results from different systems."},
    {"id": "doc6", "text": "Optimizing Kubernetes pod memory is crucial to prevent out-of-memory errors."}
]

class BM25Retriever:
    def __init__(self, corpus: List[Dict[str, str]]):
        self.corpus = corpus
        self.doc_map = {doc['id']: doc['text'] for doc in corpus}
        self.doc_ids = [doc['id'] for doc in corpus]
        
        tokenized_corpus = [self._tokenize(doc['text']) for doc in corpus]
        self.bm25 = BM25Okapi(tokenized_corpus)

    def _tokenize(self, text: str) -> List[str]:
        # Simple whitespace tokenizer, in production you'd use a more robust one
        return text.lower().split()

    def retrieve(self, query: str, k: int = 5) -> List[Dict[str, any]]:
        tokenized_query = self._tokenize(query)
        doc_scores = self.bm25.get_scores(tokenized_query)
        
        # Get top_k indices
        top_k_indices = np.argsort(doc_scores)[::-1][:k]
        
        results = []
        for i in top_k_indices:
            doc_id = self.doc_ids[i]
            score = doc_scores[i]
            if score > 0: # Only include documents with a positive score
                results.append({"id": doc_id, "score": score, "text": self.doc_map[doc_id]})
        return results

# --- Usage Example ---
bm25_retriever = BM25Retriever(docs)
results = bm25_retriever.retrieve("OOM-Killed-Error-137")
print("--- BM25 Results for 'OOM-Killed-Error-137' ---")
for res in results:
    print(f"ID: {res['id']}, Score: {res['score']:.4f}, Text: {res['text']}")

Notice how for the query "OOM-Killed-Error-137", BM25 will excel, ranking doc3 and doc4 highest due to the exact keyword match. A dense retriever would likely struggle with this.

2. The Dense Retriever: HNSW for Semantic Nuance

Dense retrieval maps documents and queries into a high-dimensional vector space using a pre-trained language model (e.g., a sentence-transformer). The goal is to find documents whose vectors are closest to the query vector, typically using cosine similarity.

Searching through millions of vectors naively is computationally infeasible. This is where Approximate Nearest Neighbor (ANN) algorithms like Hierarchical Navigable Small World (HNSW) come in.

HNSW Under the Hood:

HNSW builds a multi-layered graph structure.

* Top Layers: Contain a small number of nodes (vectors) with long-range connections, allowing for fast, coarse-grained traversal of the vector space.

* Bottom Layers: Are progressively denser, with more nodes and shorter-range connections, enabling fine-grained, precise search.

A search starts at an entry point in the top layer. It greedily traverses the graph, moving to the neighbor closest to the query vector. Once it can't find a closer neighbor in the current layer, it drops down to the layer below and repeats the process. This hierarchical approach avoids the need to compute distances to all vectors, making it incredibly fast.
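
How deep that greedy traversal goes is tunable. As a brief sketch (the parameter values here are illustrative, not tuned recommendations), FAISS exposes these knobs on the hnsw attribute of the index type used in the next section:

python
import faiss

dimension = 384                              # e.g. the all-MiniLM-L6-v2 embedding size
index = faiss.IndexHNSWFlat(dimension, 32)   # M = 32 neighbors per node
index.hnsw.efConstruction = 200              # candidate-list width while building the graph
index.hnsw.efSearch = 64                     # candidate-list width during query-time traversal
# Larger efSearch explores more of each layer: better recall, higher latency.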

Python Implementation with sentence-transformers and faiss:

python
import faiss
from sentence_transformers import SentenceTransformer
from typing import List, Dict
import numpy as np

class DenseRetriever:
    def __init__(self, corpus: List[Dict[str, str]], model_name: str = 'all-MiniLM-L6-v2'):
        self.corpus = corpus
        self.doc_map = {doc['id']: doc['text'] for doc in corpus}
        self.doc_ids = [doc['id'] for doc in corpus]
        
        self.model = SentenceTransformer(model_name)
        self.dimension = self.model.get_sentence_embedding_dimension()
        
        # Create embeddings
        embeddings = self.model.encode([doc['text'] for doc in corpus], convert_to_tensor=False)
        
        # Build HNSW index using FAISS
        self.index = faiss.IndexHNSWFlat(self.dimension, 32) # 32 is the number of neighbors per node
        self.index = faiss.IndexIDMap(self.index)
        
        # FAISS requires IDs as integers, so we create a mapping
        self.id_to_int = {doc_id: i for i, doc_id in enumerate(self.doc_ids)}
        self.int_to_id = {i: doc_id for doc_id, i in self.id_to_int.items()}
        
        doc_ints = np.array([self.id_to_int[doc_id] for doc_id in self.doc_ids])
        self.index.add_with_ids(embeddings.astype('float32'), doc_ints)

    def retrieve(self, query: str, k: int = 5) -> List[Dict[str, any]]:
        query_embedding = self.model.encode([query])[0].astype('float32')
        query_embedding = np.expand_dims(query_embedding, axis=0)
        
        # Search the index
        distances, indices = self.index.search(query_embedding, k)
        
        results = []
        for i in range(len(indices[0])):
            int_id = indices[0][i]
            if int_id != -1: # FAISS returns -1 for no result
                doc_id = self.int_to_id[int_id]
                # FAISS returns squared L2 distances (lower is better). For
                # unit-normalized embeddings, cosine similarity would be
                # 1 - distance / 2; a cosine-based variant is sketched below.
                # For ranking we can use the raw distance directly, and the
                # RRF fusion later only looks at ranks, not score scales.
                score = distances[0][i]
                results.append({"id": doc_id, "score": float(score), "text": self.doc_map[doc_id]})
        return results

# --- Usage Example ---
dense_retriever = DenseRetriever(docs)
results = dense_retriever.retrieve("how to fix out of memory problems in containers")
print("\n--- Dense Results for 'how to fix out of memory problems in containers' ---")
for res in results:
    print(f"ID: {res['id']}, Score: {res['score']:.4f}, Text: {res['text']}")

For the query "how to fix out of memory problems in containers", the dense retriever correctly identifies doc6 and doc3 as semantically relevant, even though they don't share many keywords with the query. This is where it shines.
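
If you want scores that read directly as cosine similarities, one option is to L2-normalize the embeddings before indexing; for unit vectors, FAISS's squared-L2 distance maps to cosine similarity as 1 - distance / 2. A minimal sketch, reusing the docs list and model name from above:

python
import faiss
from sentence_transformers import SentenceTransformer

# Sketch: cosine-flavoured dense search via L2-normalized embeddings.
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode([doc['text'] for doc in docs]).astype('float32')
faiss.normalize_L2(embeddings)               # in-place unit normalization

index = faiss.IndexHNSWFlat(embeddings.shape[1], 32)
index.add(embeddings)                        # row positions double as integer IDs

query_vec = model.encode(["out of memory errors in containers"]).astype('float32')
faiss.normalize_L2(query_vec)
distances, indices = index.search(query_vec, 3)

cosines = 1 - distances / 2                  # valid only because vectors are unit length
for pos, cos in zip(indices[0], cosines[0]):
    print(docs[pos]['id'], round(float(cos), 4))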


The Fusion Problem: Why You Can't Just Add Scores

Now we have two ranked lists of results. How do we combine them? A naive approach might be to normalize the scores from both systems (e.g., to a 0-1 range) and add them up. This is a production anti-pattern.

* Different Scales and Distributions: BM25 scores are unbounded and depend on term frequencies and corpus statistics. Cosine similarity scores are neatly bounded between -1 and 1. Their distributions are completely different. A min-max normalization is extremely sensitive to outliers; a single document with an unusually high BM25 score could skew the normalization for the entire result set.

* Lack of Comparability: A BM25 score of 10.5 and a cosine similarity of 0.85 are fundamentally incomparable. There's no principled way to assign a weight w to w * score_bm25 + (1-w) * score_dense that works reliably across all types of queries.

We need a method that is agnostic to the underlying scoring mechanisms (the short sketch below illustrates the outlier problem). This leads us to rank-based fusion.
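
Here is a tiny, self-contained illustration of the outlier problem; the scores are made up for the example:

python
import numpy as np

def min_max(scores: np.ndarray) -> np.ndarray:
    return (scores - scores.min()) / (scores.max() - scores.min())

# Hypothetical BM25 scores: three closely ranked documents plus one outlier.
bm25_scores = np.array([2.1, 2.0, 1.9, 35.0])
print(min_max(bm25_scores))   # -> approximately [0.006, 0.003, 0.0, 1.0]
# The three competitive documents are squashed to nearly zero, so any
# weighted sum with dense scores is dominated by the outlier and the dense side.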

The Solution: Reciprocal Rank Fusion (RRF)

RRF is a simple yet powerful technique that combines multiple ranked lists by looking only at the rank of each document, not its score. It was originally proposed for meta-search in information retrieval and is perfectly suited for our hybrid search problem.

The Formula:

For each document d, its RRF score is calculated as:

RRF_Score(d) = Σ [ 1 / (k + rank_i(d)) ]

Where:

* The sum is over all the result lists i (in our case, BM25 and Dense).

* rank_i(d) is the rank of document d in result list i. The rank starts at 1.

* k is a constant, typically set to 60. This constant dampens the influence of documents at lower ranks. A lower k gives more weight to the top-ranked documents.

Why RRF Works:

* Score Agnostic: It completely ignores the heterogeneous scores from BM25 and the dense retriever.

* Prioritizes Top Ranks: The 1 / (k + rank) formula heavily rewards documents that appear at the top of any list. A document ranked #1 in one list gets a significant score boost, even if it doesn't appear in the other.

* Cumulative Effect: If a document appears in both lists, its RRF score is the sum of its reciprocal ranks, naturally promoting documents that both systems agree on.
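
Worked example: with k = 60, a document ranked #1 by BM25 and #3 by the dense retriever scores 1/(60+1) + 1/(60+3) ≈ 0.0164 + 0.0159 ≈ 0.0323, while a document that appears at #1 in only one list scores about 0.0164. Agreement between the two retrievers roughly doubles the score.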

Implementing the Hybrid Retriever with RRF

Let's build a class that orchestrates the two retrievers and performs the RRF fusion.

python
from collections import defaultdict

class HybridRetriever:
    def __init__(self, sparse_retriever: BM25Retriever, dense_retriever: DenseRetriever):
        self.sparse_retriever = sparse_retriever
        self.dense_retriever = dense_retriever

    def retrieve(self, query: str, k: int = 5, rrf_k: int = 60) -> List[Dict[str, any]]:
        # 1. Get results from both retrievers
        # We retrieve more than k to have a good pool for re-ranking
        sparse_results = self.sparse_retriever.retrieve(query, k=k*2)
        dense_results = self.dense_retriever.retrieve(query, k=k*2)

        # 2. Calculate RRF scores
        rrf_scores = defaultdict(float)

        # Process sparse results
        for rank, result in enumerate(sparse_results, 1):
            rrf_scores[result['id']] += 1 / (rrf_k + rank)

        # Process dense results
        for rank, result in enumerate(dense_results, 1):
            rrf_scores[result['id']] += 1 / (rrf_k + rank)

        # 3. Sort documents by RRF score
        sorted_docs = sorted(rrf_scores.items(), key=lambda item: item[1], reverse=True)

        # 4. Format final results
        final_results = []
        doc_map = {**{res['id']: res for res in sparse_results},
                   **{res['id']: res for res in dense_results}} # Combine to get text

        for doc_id, score in sorted_docs[:k]:
            final_results.append({
                "id": doc_id,
                "score": score, # This is the RRF score
                "text": doc_map[doc_id]['text']
            })

        return final_results

# --- Usage Example ---
hybrid_retriever = HybridRetriever(bm25_retriever, dense_retriever)

# Test with a keyword-heavy query
keyword_query = "OOM-Killed-Error-137"
hybrid_results_keyword = hybrid_retriever.retrieve(keyword_query)
print(f"\n--- Hybrid Results for '{keyword_query}' ---")
for res in hybrid_results_keyword:
    print(f"ID: {res['id']}, RRF_Score: {res['score']:.6f}, Text: {res['text']}")

# Test with a semantic query
semantic_query = "how to fix out of memory problems in containers"
hybrid_results_semantic = hybrid_retriever.retrieve(semantic_query)
print(f"\n--- Hybrid Results for '{semantic_query}' ---")
for res in hybrid_results_semantic:
    print(f"ID: {res['id']}, RRF_Score: {res['score']:.6f}, Text: {res['text']}")

The output will demonstrate the system's robustness. For the keyword query, doc3 and doc4 will be at the top. For the semantic query, doc6 and doc3 will lead. The hybrid system correctly leverages the strengths of each component.


Production Considerations and Performance Optimization

Building this in a Jupyter notebook is one thing; deploying it to handle production traffic is another.

1. Latency and Parallelization

The HybridRetriever implementation above performs the sparse and dense searches sequentially. This is a latency bottleneck. In a production environment, these two I/O-bound (or CPU-bound, for FAISS) operations should be executed in parallel.

Here's how you can adapt the retrieve method using Python's concurrent.futures:

python
from concurrent.futures import ThreadPoolExecutor

class ParallelHybridRetriever(HybridRetriever):
    def retrieve(self, query: str, k: int = 5, rrf_k: int = 60) -> List[Dict[str, any]]:
        with ThreadPoolExecutor(max_workers=2) as executor:
            future_sparse = executor.submit(self.sparse_retriever.retrieve, query, k=k*2)
            future_dense = executor.submit(self.dense_retriever.retrieve, query, k=k*2)

            sparse_results = future_sparse.result()
            dense_results = future_dense.result()

        # RRF fusion logic remains the same from here...
        rrf_scores = defaultdict(float)
        for rank, result in enumerate(sparse_results, 1):
            rrf_scores[result['id']] += 1 / (rrf_k + rank)
        for rank, result in enumerate(dense_results, 1):
            rrf_scores[result['id']] += 1 / (rrf_k + rank)

        sorted_docs = sorted(rrf_scores.items(), key=lambda item: item[1], reverse=True)

        final_results = []
        doc_map = {**{res['id']: res for res in sparse_results},
                   **{res['id']: res for res in dense_results}}

        for doc_id, score in sorted_docs[:k]:
            final_results.append({"id": doc_id, "score": score, "text": doc_map[doc_id]['text']})

        return final_results

# Usage is the same, but now searches run in parallel
parallel_retriever = ParallelHybridRetriever(bm25_retriever, dense_retriever)

This simple change can nearly halve the retrieval latency, assuming you have the CPU cores to support it.

2. Indexing Pipeline

Your indexing process needs to be atomic and reliable. For a new batch of documents, you must:

* Chunk documents into manageable pieces.
* Generate embeddings for all new chunks.
* Update the FAISS index. For HNSW, you can't easily delete items, so a common pattern is to rebuild the index periodically offline and swap it into production (a sketch of this rebuild-and-swap follows below).
* Update the BM25 index. The rank_bm25 library is in-memory; for larger-scale systems, you'd use a persistent, searchable index like Lucene (via Elasticsearch or OpenSearch).
* Ensure the document ID mapping remains consistent across both systems.

This process should be managed by a background job queue (e.g., Celery, RQ) and not be part of the live request path.
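
A minimal sketch of that rebuild-and-swap step, assuming the index lives in a single file and the serving process re-reads it after a swap (the paths and parameters are illustrative):

python
import os
import faiss
import numpy as np

def rebuild_and_swap_index(embeddings: np.ndarray, index_path: str = "hnsw.index") -> None:
    """Offline rebuild: write the new index to a temp file, then atomically swap it in."""
    new_index = faiss.IndexHNSWFlat(embeddings.shape[1], 32)
    new_index.add(embeddings.astype('float32'))

    tmp_path = index_path + ".tmp"
    faiss.write_index(new_index, tmp_path)
    os.replace(tmp_path, index_path)   # atomic rename on the same filesystem

def load_index(index_path: str = "hnsw.index") -> faiss.Index:
    # The query service re-reads the file when notified that a swap happened.
    return faiss.read_index(index_path)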

3. Memory Footprint and HNSW Tuning

The dense vector index is memory-hungry.

* Vectors: num_docs * dimension * 4 bytes per float32 value.

* HNSW Graph: The graph structure adds significant overhead, controlled by the M parameter (number of neighbors per node). A higher M (e.g., 32, 64) improves recall but increases index size and build time. A lower M (e.g., 16) is faster and smaller but may be less accurate.

For very large indexes that don't fit in RAM, FAISS supports on-disk storage for IVF-based indexes (faiss.OnDiskInvertedLists) or Product Quantization (PQ) to compress the vectors (e.g., IndexHNSWPQ), trading some accuracy for a massive reduction in memory usage.
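
A quick back-of-the-envelope estimate helps when sizing hardware; the corpus size and dimension below are assumptions for illustration only:

python
# Rough memory estimate for the raw vectors only (hypothetical corpus).
num_docs = 5_000_000
dimension = 384                      # e.g. all-MiniLM-L6-v2
bytes_per_float = 4

vector_bytes = num_docs * dimension * bytes_per_float
print(f"Raw vectors: {vector_bytes / 1024**3:.1f} GiB")   # ~7.2 GiB
# The HNSW graph links add further overhead that grows with M, so budget
# comfortably above the raw vector size.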


Advanced Edge Cases and Refinements

Query Routing

For some queries, you might know a priori that one retriever is superior. For example, a query matching a regex for a UUID or a product code is almost certainly a job for the sparse retriever. You can implement a simple query classifier/router that sits in front of the hybrid retriever.

python
import re

def query_router(query: str):
    # If the query looks like a specific identifier (no spaces, contains digits
    # or hyphens, e.g. "OOM-Killed-Error-137"), heavily favor sparse search
    if re.match(r'^(?=.*[0-9-])[A-Za-z0-9-]{6,}$', query):
        return {"sparse_weight": 0.9, "dense_weight": 0.1}
    # For natural language questions, use a more balanced approach
    else:
        return {"sparse_weight": 0.5, "dense_weight": 0.5}

Instead of a simple RRF, you could use these weights to scale the reciprocal rank contributions:

rrf_scores[doc_id] += weight * (1 / (rrf_k + rank))

This adds a layer of intelligence but requires careful tuning and can be brittle; a sketch of this weighted variant follows.
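
As a sketch only (the class below is illustrative, inheriting from the HybridRetriever defined earlier; it is not a drop-in production component), the router weights can be folded into the fusion loop like this:

python
from collections import defaultdict

class WeightedHybridRetriever(HybridRetriever):
    """Sketch: scale each list's reciprocal-rank contribution by a router weight."""

    def retrieve(self, query: str, k: int = 5, rrf_k: int = 60):
        weights = query_router(query)
        sparse_results = self.sparse_retriever.retrieve(query, k=k * 2)
        dense_results = self.dense_retriever.retrieve(query, k=k * 2)

        rrf_scores = defaultdict(float)
        for rank, result in enumerate(sparse_results, 1):
            rrf_scores[result['id']] += weights["sparse_weight"] / (rrf_k + rank)
        for rank, result in enumerate(dense_results, 1):
            rrf_scores[result['id']] += weights["dense_weight"] / (rrf_k + rank)

        doc_map = {res['id']: res for res in sparse_results + dense_results}
        ranked = sorted(rrf_scores.items(), key=lambda item: item[1], reverse=True)[:k]
        return [{"id": doc_id, "score": score, "text": doc_map[doc_id]['text']}
                for doc_id, score in ranked]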

Handling No Results

What if one retriever returns an empty list? The current RRF implementation handles this gracefully—the loop for that retriever simply doesn't run, and the document scores are based solely on the other system. This is the desired behavior.

Benchmarking Your System

How do you prove this complexity is worthwhile? By benchmarking. Create an evaluation dataset with a mix of query types:

* Keyword queries: "OOM-Killed-Error-137"

* Semantic queries: "how to fix memory issues"

* Mixed queries: "troubleshooting OOM-Killed-Error-137 in Kubernetes"

For each query, manually label the relevant documents in your corpus. Then, evaluate each retriever (sparse-only, dense-only, hybrid) using standard information retrieval metrics like Mean Reciprocal Rank (MRR) and Hit Rate@K.

Here's a sketch of a benchmark runner:

python
# eval_dataset = [{"query": ..., "relevant_doc_ids": [...]}, ...]

def calculate_mrr(results, relevant_ids):
    for i, res in enumerate(results, 1):
        if res['id'] in relevant_ids:
            return 1 / i
    return 0.0

def run_benchmark(retriever, eval_dataset):
    total_mrr = 0.0
    for item in eval_dataset:
        results = retriever.retrieve(item['query'], k=10)
        mrr = calculate_mrr(results, item['relevant_doc_ids'])
        total_mrr += mrr
    return total_mrr / len(eval_dataset)

# Create your evaluation dataset
eval_dataset = [
    {"query": "OOM-Killed-Error-137", "relevant_doc_ids": ['doc3', 'doc4']},
    {"query": "how to fix memory issues", "relevant_doc_ids": ['doc3', 'doc6']},
    {"query": "combining search results", "relevant_doc_ids": ['doc5']}
]

# Run the benchmarks
mrr_sparse = run_benchmark(bm25_retriever, eval_dataset)
mrr_dense = run_benchmark(dense_retriever, eval_dataset)
mrr_hybrid = run_benchmark(parallel_retriever, eval_dataset)

print(f"\n--- Benchmark Results ---")
print(f"Sparse Only MRR: {mrr_sparse:.4f}")
print(f"Dense Only MRR:  {mrr_dense:.4f}")
print(f"Hybrid MRR:      {mrr_hybrid:.4f}")
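
The prose above also mentions Hit Rate@K; a minimal companion metric, written here as a sketch mirroring calculate_mrr, could look like this:

python
def calculate_hit_rate(results, relevant_ids) -> float:
    # 1.0 if any relevant document appears in the returned list, else 0.0.
    return 1.0 if any(res['id'] in relevant_ids for res in results) else 0.0

def run_hit_rate_benchmark(retriever, eval_dataset, k: int = 10) -> float:
    hits = sum(
        calculate_hit_rate(retriever.retrieve(item['query'], k=k), item['relevant_doc_ids'])
        for item in eval_dataset
    )
    return hits / len(eval_dataset)

print(f"Hybrid Hit Rate@10: {run_hit_rate_benchmark(parallel_retriever, eval_dataset):.4f}")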

On a mixed-query dataset, you will typically find that the hybrid system's MRR matches or exceeds that of either individual retriever, providing the quantitative evidence needed to justify the architecture.

Conclusion

Moving a RAG system from a prototype to a production-grade service requires confronting the limitations of its components. Pure semantic search, while powerful, has a clear blind spot for the keyword-driven, specific queries that are common in any real-world application.

By implementing a hybrid search architecture that combines a sparse retriever like BM25 with a dense retriever over an HNSW index, you create a system that is far more robust. The key to making this system work is a principled fusion strategy. Reciprocal Rank Fusion (RRF) provides an elegant, score-agnostic, and highly effective method for combining the results.

This pattern—parallel retrieval followed by rank-based fusion—is a cornerstone of modern, reliable search and RAG systems. It directly addresses a common failure mode, improves retrieval relevance across diverse query types, and is the kind of advanced implementation detail that separates toy projects from resilient, production-ready AI applications.
