Optimizing RAG: Hybrid Search and Re-ranking for Production Latency

Goh Ling Yong

Beyond Naive RAG: The Multi-Stage Retrieval Imperative

For any senior engineer tasked with deploying a Retrieval-Augmented Generation (RAG) system into production, the initial excitement of a simple vector search prototype quickly gives way to a harsh reality: it's often not good enough. A baseline RAG system, typically using a bi-encoder model to embed a query and perform a single vector search over a corpus, suffers from critical, production-breaking flaws:

  • The Keyword Problem: Dense retrievers excel at semantic understanding but frequently fail on queries requiring exact keyword matches, such as product SKUs, error codes, specific function names, or legal clauses. A query for Error_404_Not_Found may retrieve documents about general connection issues instead of the specific document defining that error code.
  • The Relevance Ceiling: The top-k documents from a single vector search are often semantically related but not necessarily the most relevant or precise answers. The bi-encoder's dot-product or cosine similarity is a coarse-grained relevance signal, insufficient for nuanced information needs.
  • Unpredictable Latency: A single, large ANN (Approximate Nearest Neighbor) index can exhibit variable latency under load, and it becomes both a single point of failure and a dominant cost driver as the corpus grows.
To overcome these limitations, we must evolve our mental model from a single retrieval step to a multi-stage retrieval pipeline, architected like a funnel. This post provides a deep, implementation-focused guide to building such a pipeline, focusing on two key architectural patterns: Hybrid Search and Re-ranking. We will architect and implement a system that:

    * Retrieves a broad set of candidate documents using parallel sparse (lexical) and dense (semantic) search.

    * Fuses the results from these disparate systems into a single, more robust candidate list.

    * Re-ranks the top candidates using a computationally expensive but highly accurate cross-encoder model.

    This is not an introduction. This is a production playbook. We assume you understand vector embeddings, ANN indexes, and the basic RAG concept. Our focus is on the advanced architecture, code patterns, and performance trade-offs required to build a system that is accurate, fast, and resilient.

    The Baseline: A Simple Vector Search Retriever

    To appreciate the improvements, let's first establish a baseline. A typical naive RAG implementation involves embedding a corpus and a query with a bi-encoder (like sentence-transformers) and using an ANN library (like faiss) for retrieval.

    Setup:

    First, let's prepare our environment and a sample document corpus. This corpus is intentionally designed to highlight the weaknesses of a pure dense-retrieval approach.

    python
    import numpy as np
    import faiss
    from sentence_transformers import SentenceTransformer
    
    # Sample corpus with technical and specific terms
    documents = [
        "The new JX-2024 GPU offers 2x performance for deep learning workloads.",
        "To resolve HTTP error 503, check the upstream service availability.",
        "Our system architecture is based on a microservices pattern with event-driven communication.",
        "The official documentation for the 'Stratus' framework is available online.",
        "Performance benchmarks for the JX-2024 GPU show significant gains in FP16 precision.",
        "A 503 Service Unavailable error indicates a server-side problem.",
        "The 'Stratus' framework facilitates building scalable cloud-native applications.",
        "Troubleshooting guide for the JX-2024: common issues and solutions."
    ]
    
    # 1. Initialize a bi-encoder model
    bi_encoder = SentenceTransformer('msmarco-MiniLM-L-6-v3')
    
    # 2. Embed the documents
    doc_embeddings = bi_encoder.encode(documents, convert_to_tensor=False)
    doc_embeddings = np.float32(doc_embeddings)
    
    # 3. Build a FAISS index
    index = faiss.IndexFlatL2(doc_embeddings.shape[1])
    index.add(doc_embeddings)
    
    def search_dense(query, k=3):
        """Performs a dense search using the FAISS index."""
        query_embedding = bi_encoder.encode([query], convert_to_tensor=False)
        query_embedding = np.float32(query_embedding)
        distances, indices = index.search(query_embedding, k)
        return [documents[i] for i in indices[0]]
    
    # --- Test Cases ---
    
    # Test Case 1: Semantic Query (Should work well)
    query1 = "how to improve model training speed?"
    results1 = search_dense(query1)
    print(f"Query: '{query1}'")
    # Expected: Documents about the JX-2024 GPU
    for res in results1: print(f"  - {res}")
    # Output is generally relevant.
    
    print("\n" + "-"*50 + "\n")
    
    # Test Case 2: Keyword-specific Query (The failure case)
    query2 = "JX-2024 manual"
    results2 = search_dense(query2)
    print(f"Query: '{query2}'")
    # Expected: Documents specifically about JX-2024.
    for res in results2: print(f"  - {res}")
    # Output might include generic GPU/performance docs, missing the most relevant ones.

    In Test Case 2, the query "JX-2024 manual" might semantically match documents about performance and workloads but could fail to prioritize the document that explicitly mentions "Troubleshooting guide for the JX-2024". The vector for "manual" might not be close enough to the vector for "guide". This is the classic failure mode we need to solve.

    Stage 1: Implementing Hybrid Search with Reciprocal Rank Fusion

    Hybrid search remedies the keyword problem by combining the strengths of two orthogonal search paradigms:

    * Dense (Semantic) Search: Finds documents that are conceptually similar. (Our baseline.)

    * Sparse (Lexical) Search: Finds documents that contain the exact query keywords. The classic algorithm here is BM25.

    Our strategy will be to run both retrievers in parallel for a given query and then intelligently fuse their results.

    A. The Sparse Retriever: BM25

    Okapi BM25 is a ranking function that scores documents based on the query terms they contain, factoring in term frequency (TF) and inverse document frequency (IDF), while also normalizing for document length. It's fast, effective, and doesn't require GPUs or complex models.

    For a production system, you'd use a robust search engine like Elasticsearch, OpenSearch, or Vespa. For this demonstration, the rank-bm25 library is perfectly sufficient.
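
    As a point of reference, a production sparse retriever behind the same search_sparse interface might look like the following sketch. It assumes the Elasticsearch 8.x Python client, a cluster at localhost:9200, and an index named docs with a content text field; those names are illustrative, not prescriptive.

    python
    from elasticsearch import Elasticsearch  # assumes the 8.x Python client

    es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

    # Index the corpus once; Elasticsearch scores text fields with BM25 by default.
    # `documents` is the corpus defined in the baseline section.
    for i, doc in enumerate(documents):
        es.index(index="docs", id=str(i), document={"content": doc})
    es.indices.refresh(index="docs")

    def search_sparse_es(query, k=3):
        """BM25 retrieval delegated to Elasticsearch's default text scoring."""
        resp = es.search(index="docs", query={"match": {"content": query}}, size=k)
        return [hit["_source"]["content"] for hit in resp["hits"]["hits"]]

    The in-memory rank-bm25 version used throughout this post follows.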

    python
    from rank_bm25 import BM25Okapi
    
    # Tokenize the documents for BM25
    tokenized_corpus = [doc.split(" ") for doc in documents]
    
    # Initialize and index with BM25
    bm25 = BM25Okapi(tokenized_corpus)
    
    def search_sparse(query, k=3):
        """Performs a sparse search using BM25."""
        tokenized_query = query.split(" ")
        # BM25 returns a list of the documents themselves
        top_docs = bm25.get_top_n(tokenized_query, documents, n=k)
        return top_docs
    
    # --- Test Case ---
    query = "JX-2024 manual"
    results = search_sparse(query)
    print(f"Query: '{query}'")
    # Expected: Documents containing 'JX-2024'
    for res in results: print(f"  - {res}")
    # This will correctly return the documents with the exact term 'JX-2024'.

    Now we have two search functions: search_dense and search_sparse. How do we combine their outputs?

    B. The Fusion Algorithm: Reciprocal Rank Fusion (RRF)

    We could try to normalize the scores from FAISS (L2 distance) and BM25 (unbounded score) and combine them, but this is notoriously difficult and brittle. A much more elegant and robust solution is Reciprocal Rank Fusion (RRF).

    RRF is a rank-based fusion method that doesn't care about the absolute scores. It calculates a new RRF score for each document based on its rank in each retriever's result list. The formula is:

    RRF_Score(d) = Σ (1 / (k + rank_i(d)))

    Where:

    * d is a document.

    * rank_i(d) is the rank of document d in the result list from retriever i.

    * k is a smoothing constant (typically set to 60) that dampens the outsized influence of top-ranked documents, so no single retriever can dominate the fused score.

    If a document doesn't appear in a result list, its rank is considered infinite, and its contribution to the sum is zero. This makes RRF resilient to retrievers returning no results.

    Let's implement a HybridRetriever.

    python
    import asyncio
    
    class HybridRetriever:
        def __init__(self, dense_retriever_func, sparse_retriever_func, k_dense, k_sparse, rrf_k=60):
            self.search_dense = dense_retriever_func
            self.search_sparse = sparse_retriever_func
            self.k_dense = k_dense
            self.k_sparse = k_sparse
            self.rrf_k = rrf_k
    
        async def retrieve(self, query):
            # Run both retrievers concurrently in worker threads, so total
            # retrieval latency is max(dense, sparse) rather than their sum
            dense_results, sparse_results = await asyncio.gather(
                asyncio.to_thread(self.search_dense, query, self.k_dense),
                asyncio.to_thread(self.search_sparse, query, self.k_sparse),
            )

            return self.fuse_results(dense_results, sparse_results)
    
        def fuse_results(self, dense_results, sparse_results):
            all_docs = list(set(dense_results + sparse_results))
            ranked_lists = [dense_results, sparse_results]
            
            rrf_scores = {doc: 0.0 for doc in all_docs}
    
            for doc in all_docs:
                for ranked_list in ranked_lists:
                    try:
                        rank = ranked_list.index(doc) + 1
                        rrf_scores[doc] += 1 / (self.rrf_k + rank)
                    except ValueError:
                        # Document not in this list, score contribution is 0
                        pass
            
            # Sort documents by their RRF score in descending order
            sorted_docs = sorted(rrf_scores.keys(), key=lambda x: rrf_scores[x], reverse=True)
            return sorted_docs, rrf_scores
    
    # --- Test Case ---
    async def main_hybrid():
        hybrid_retriever = HybridRetriever(
            dense_retriever_func=search_dense,
            sparse_retriever_func=search_sparse,
            k_dense=5,
            k_sparse=5
        )
    
        query = "JX-2024 manual"
        fused_results, _ = await hybrid_retriever.retrieve(query)
        
        print(f"Query: '{query}'")
        print("Hybrid Search Results:")
        for res in fused_results[:5]: # Show top 5
            print(f"  - {res}")
    
    # Run the async main function
    # In a real application, you would have an event loop running.
    # For a script, you can run it like this:
    if __name__ == "__main__":
        # Re-initialize everything for a clean run if this script is executed directly
        bi_encoder = SentenceTransformer('msmarco-MiniLM-L-6-v3')
        doc_embeddings = bi_encoder.encode(documents, convert_to_tensor=False)
        doc_embeddings = np.float32(doc_embeddings)
        index = faiss.IndexFlatL2(doc_embeddings.shape[1])
        index.add(doc_embeddings)
    
        tokenized_corpus = [doc.split(" ") for doc in documents]
        bm25 = BM25Okapi(tokenized_corpus)
        
        asyncio.run(main_hybrid())

    By running this, you'll see that for the query "JX-2024 manual", the fused results correctly prioritize documents containing the specific term "JX-2024" while also including semantically relevant documents about performance. We've successfully combined the best of both worlds.

    Stage 2: Adding a Re-ranking Layer for Precision

    Hybrid search gives us a high-recall set of candidate documents. However, the final ranking, determined by RRF, is still based on weak signals (the ranks from the initial retrievers). To achieve maximum precision, we can add a final re-ranking stage.

    This stage uses a more powerful but slower model to re-evaluate the relevance of only the top-k candidates from the fusion stage.

    Bi-Encoders vs. Cross-Encoders: The Right Tool for the Job

    This is a critical distinction for any serious RAG practitioner.

    * Bi-Encoders (used in our dense retriever): Create independent vector representations for the query and the document. Relevance is calculated via a fast distance metric (e.g., cosine similarity). This is scalable for searching over millions of documents.

    score = cosine_sim(model(query), model(document))

    * Cross-Encoders: Take both the query and a document as a single input and output a relevance score. They perform full self-attention across both inputs, making them far more accurate at capturing nuanced relevance. However, they are orders of magnitude slower and cannot be used for initial retrieval.

    score = model(query, document)

    Our architecture leverages this perfectly: use fast bi-encoders and BM25 for broad recall, then use a slow but accurate cross-encoder to re-rank the top ~25-50 candidates.

    Let's implement the ReRanker.

    python
    from sentence_transformers import CrossEncoder
    
    # Initialize a cross-encoder model
    # These models are trained specifically for relevance ranking
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    
    class ReRanker:
        def __init__(self, model):
            self.model = model
    
        def rerank(self, query, documents):
            # Create pairs of [query, document] for the cross-encoder
            pairs = [[query, doc] for doc in documents]
            
            # Predict scores
            scores = self.model.predict(pairs)
            
            # Combine documents with their new scores
            doc_score_pairs = list(zip(documents, scores))
            
            # Sort by score in descending order
            doc_score_pairs_sorted = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
            
            # Return just the documents, now re-ranked
            return [doc for doc, score in doc_score_pairs_sorted]
    
    # --- Test Case ---
    
    # Let's assume these are the top 5 results from our hybrid search
    candidate_docs = [
        "Troubleshooting guide for the JX-2024: common issues and solutions.",
        "The new JX-2024 GPU offers 2x performance for deep learning workloads.",
        "Performance benchmarks for the JX-2024 GPU show significant gains in FP16 precision.",
        "Our system architecture is based on a microservices pattern with event-driven communication.",
        "The 'Stratus' framework facilitates building scalable cloud-native applications."
    ]
    
    query = "JX-2024 manual"
    
    reranker = ReRanker(cross_encoder)
    reranked_results = reranker.rerank(query, candidate_docs)
    
    print(f"Query: '{query}'")
    print("Re-ranked Results:")
    for res in reranked_results:
        print(f"  - {res}")

    The cross-encoder, with its deeper understanding of the relationship between "manual" and "troubleshooting guide", is likely to promote the troubleshooting guide to the absolute top position, providing the most precise context for the final LLM generation step.

    The Complete Production-Grade Pipeline

    Now, let's orchestrate all the pieces into a single, cohesive pipeline. This final class will encapsulate the entire retrieve-fuse-rerank logic.

    python
    # Assuming all previous components (HybridRetriever, ReRanker, models, indexes) are defined and initialized
    
    class FullRAGPipeline:
        def __init__(self, hybrid_retriever, reranker, rerank_top_k=10):
            self.hybrid_retriever = hybrid_retriever
            self.reranker = reranker
            self.rerank_top_k = rerank_top_k
    
        async def execute(self, query):
            print(f"Executing pipeline for query: '{query}'")
            
            # 1. Retrieve & Fuse
            print("\n--- Stage 1: Hybrid Retrieval & Fusion ---")
            fused_docs, _ = await self.hybrid_retriever.retrieve(query)
            print(f"Retrieved {len(fused_docs)} unique candidates.")
            print("Top 5 fused results:")
            for doc in fused_docs[:5]:
                print(f"  - {doc}")
            
            # 2. Re-rank
            print(f"\n--- Stage 2: Re-ranking Top {self.rerank_top_k} ---")
            # Take only the top candidates for the expensive re-ranking step
            candidates_for_rerank = fused_docs[:self.rerank_top_k]
            reranked_docs = self.reranker.rerank(query, candidates_for_rerank)
            print("Re-ranked results:")
            for doc in reranked_docs:
                print(f"  - {doc}")
                
            # 3. Prepare context for LLM
            # The final step is to take the top N re-ranked docs and format them
            final_context = "\n\n".join(reranked_docs[:3]) # e.g., take top 3
            
            print("\n--- Final Context for LLM ---")
            print(final_context)
            
            return final_context
    
    async def main_full_pipeline():
        # Re-initialize everything for a self-contained example
        # In a real app, these would be long-lived objects
        docs = [
            "The new JX-2024 GPU offers 2x performance for deep learning workloads.",
            "To resolve HTTP error 503, check the upstream service availability.",
            "Our system architecture is based on a microservices pattern with event-driven communication.",
            "The official documentation for the 'Stratus' framework is available online.",
            "Performance benchmarks for the JX-2024 GPU show significant gains in FP16 precision.",
            "A 503 Service Unavailable error indicates a server-side problem.",
            "The 'Stratus' framework facilitates building scalable cloud-native applications.",
            "Troubleshooting guide for the JX-2024: common issues and solutions."
        ]
    
        bi_encoder_model = SentenceTransformer('msmarco-MiniLM-L-6-v3')
        doc_embs = bi_encoder_model.encode(docs, convert_to_tensor=False)
        doc_embs = np.float32(doc_embs)
        faiss_index = faiss.IndexFlatL2(doc_embs.shape[1])
        faiss_index.add(doc_embs)
    
        def dense_search_func(q, k): 
            query_emb = bi_encoder_model.encode([q], convert_to_tensor=False)
            _, indices = faiss_index.search(np.float32(query_emb), k)
            return [docs[i] for i in indices[0]]
    
        tok_corpus = [d.split(" ") for d in docs]
        bm25_index = BM25Okapi(tok_corpus)
        def sparse_search_func(q, k): 
            return bm25_index.get_top_n(q.split(" "), docs, n=k)
    
        hybrid_retriever = HybridRetriever(
            dense_retriever_func=dense_search_func,
            sparse_retriever_func=sparse_search_func,
            k_dense=10, k_sparse=10
        )
    
        cross_encoder_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
        reranker = ReRanker(cross_encoder_model)
    
        pipeline = FullRAGPipeline(hybrid_retriever, reranker, rerank_top_k=10)
        
        await pipeline.execute("JX-2024 manual")
        print("\n" + "="*80 + "\n")
        await pipeline.execute("how to fix a 503 error?")
    
    # Run the full pipeline demonstration
    if __name__ == "__main__":
        asyncio.run(main_full_pipeline())
    

    This final script demonstrates the power of the multi-stage approach. A query like "how to fix a 503 error?" benefits from both sparse search (matching "503") and dense search (matching "fix" with "resolve"/"troubleshooting"), and the re-ranker ensures the most direct answer is prioritized.

    Advanced Considerations & Production Edge Cases

    Building this pipeline is only half the battle. Operating it in production requires attention to performance, reliability, and maintainability.

    1. Latency Budgeting and Optimization

    Every stage adds latency. A typical p99 latency budget for the entire retrieval process (before the LLM generation) might be 200-500ms.
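
    One lightweight way to keep yourself honest about that budget is to time each stage explicitly and export the numbers to your metrics system. A minimal sketch follows; the stage names and the usage shown in the comments are hypothetical, not part of the pipeline above.

    python
    import time
    from contextlib import contextmanager

    @contextmanager
    def stage_timer(name: str, timings: dict):
        """Record wall-clock milliseconds for one pipeline stage."""
        start = time.perf_counter()
        try:
            yield
        finally:
            timings[name] = (time.perf_counter() - start) * 1000.0

    # Hypothetical usage inside FullRAGPipeline.execute():
    # timings = {}
    # with stage_timer("hybrid_retrieval", timings):
    #     fused_docs, _ = await self.hybrid_retriever.retrieve(query)
    # with stage_timer("rerank", timings):
    #     reranked = self.reranker.rerank(query, fused_docs[:self.rerank_top_k])
    # print(f"Stage latencies (ms): {timings}")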

    * Parallelization is Key: As shown in our HybridRetriever, running the dense and sparse retrievals in parallel using asyncio is non-negotiable. This means your retrieval latency is max(latency_dense, latency_sparse) rather than their sum.

    * Re-ranker Bottleneck: The re-ranker is typically the slowest part. Its latency scales as roughly N × latency_per_pair, where N is the number of candidates. You must benchmark carefully to find the optimal N (e.g., rerank_top_k). Increasing N from 25 to 50 might improve recall slightly but will double your re-ranking latency, often a poor trade-off. Quantizing the cross-encoder model can also yield significant speedups with minimal accuracy loss.

    * ANN Index Tuning: For dense search, using an optimized ANN index like faiss.IndexHNSWFlat instead of IndexFlatL2 is essential for large corpora. You must tune HNSW construction parameters (M and efConstruction) and search parameters (efSearch) to balance speed and recall.
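
    A minimal sketch of that tuning surface is shown below; the values chosen for M, efConstruction, and efSearch are illustrative starting points to benchmark against, not recommendations.

    python
    import numpy as np
    import faiss

    d = 384   # embedding dimension (384 for the MiniLM bi-encoder used above)
    M = 32    # HNSW graph connectivity: higher improves recall, costs memory

    hnsw_index = faiss.IndexHNSWFlat(d, M)
    hnsw_index.hnsw.efConstruction = 200   # build-time beam width (index quality vs. build time)

    # Placeholder vectors; in practice, add your real document embeddings
    xb = np.random.rand(100_000, d).astype("float32")
    hnsw_index.add(xb)

    # efSearch trades recall for query latency at search time; tune it under load
    hnsw_index.hnsw.efSearch = 64
    xq = np.random.rand(1, d).astype("float32")
    distances, ids = hnsw_index.search(xq, 10)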

    2. Index Management and Updates

    Your document corpus is rarely static. You need a strategy for updating the dense and sparse indexes without downtime.

    * Batch Updates: For many applications, a periodic (e.g., nightly) batch re-indexing is sufficient. The process involves building a new version of the sparse and dense indexes offline.

    * Blue-Green Deployments: Once the new indexes are built, you can atomically swap them into production. A load balancer or a simple file path symlink can redirect live traffic to the new indexes with zero downtime, preventing queries from failing or returning stale data during an update (a minimal sketch of the symlink swap follows this list).

    * Real-time Indexing: For applications requiring near-instant updates, systems like Elasticsearch or Vespa are designed for this. They use techniques like near-real-time (NRT) search over append-only segment files.
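
    For file-based indexes (e.g., FAISS plus a serialized BM25 object), the batch-build-then-swap pattern can be as simple as the sketch below. The directory layout and the indexes/current symlink are assumptions for illustration.

    python
    import os
    import faiss

    def build_dense_index(embeddings, out_dir: str):
        """Offline step: build a fresh FAISS index into a versioned directory."""
        os.makedirs(out_dir, exist_ok=True)
        index = faiss.IndexFlatL2(embeddings.shape[1])
        index.add(embeddings)
        faiss.write_index(index, os.path.join(out_dir, "dense.faiss"))

    def publish_index(new_index_dir: str, live_link: str = "indexes/current"):
        """Blue-green swap: atomically repoint the live symlink at the new directory."""
        tmp_link = live_link + ".tmp"
        if os.path.lexists(tmp_link):
            os.remove(tmp_link)
        os.symlink(os.path.abspath(new_index_dir), tmp_link)
        # os.replace (rename) is atomic on POSIX, so readers resolve either the
        # old or the new target, never a half-written state
        os.replace(tmp_link, live_link)

    # e.g. build_dense_index(doc_embeddings, "indexes/v2"); publish_index("indexes/v2")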

    3. Document Chunking Strategy

    The quality of your retrieval is fundamentally limited by the quality of your document chunks. A 10,000-word document embedded as a single vector is useless; so is a single sentence shredded across five tiny chunks.

    * Semantic Chunking: Instead of fixed-size chunks, consider semantic chunking. This involves splitting documents at logical boundaries (paragraphs, sections) or using NLP models to identify coherent semantic units. This ensures that the text within a single chunk (and thus a single vector) has high internal coherence.

    * Chunk Overlap: To avoid losing context at the edges of chunks, use a sliding window approach with a small overlap (e.g., 1-2 sentences). This ensures a single idea isn't split awkwardly across two chunks (see the chunking sketch after this list).

    * Metadata Association: Each chunk must retain metadata linking it back to its parent document, section, and page number. This is critical for providing citations and allowing users to navigate to the source.
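
    A minimal chunking sketch combining these three ideas follows. The sentence splitting is deliberately naive (a production system would use a proper sentence tokenizer), and the metadata fields are illustrative.

    python
    from typing import Dict, List

    def chunk_with_overlap(doc_id: str, text: str, max_sentences: int = 5, overlap: int = 1) -> List[Dict]:
        """Paragraph-aware sliding-window chunker with sentence overlap and source metadata."""
        chunks = []
        for p_idx, paragraph in enumerate(text.split("\n\n")):        # respect paragraph boundaries
            sentences = [s.strip() for s in paragraph.split(". ") if s.strip()]
            step = max(max_sentences - overlap, 1)                    # slide by (window - overlap)
            for start in range(0, len(sentences), step):
                window = sentences[start:start + max_sentences]
                if not window:
                    continue
                chunks.append({
                    "text": ". ".join(window),
                    "metadata": {"doc_id": doc_id, "paragraph": p_idx, "sentence_offset": start},
                })
        return chunks

    # e.g. chunk_with_overlap("kb-001", long_document_text)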

    4. Evaluation and Monitoring

    How do you know if a change (e.g., a new embedding model, a different fusion strategy) is actually better? You must have a robust evaluation framework.

    * Retrieval Metrics: Create a golden set of (query, relevant_document_id) pairs and evaluate your retrieval pipeline against it using metrics like the following (a minimal sketch computing two of them appears after this list):

        * Hit Rate @ K: Did the correct document appear in the top K results?

        * Mean Reciprocal Rank (MRR): 1/rank of the first correct document, averaged over all queries. Rewards systems that rank the correct answer higher.

        * Normalized Discounted Cumulative Gain (NDCG @ K): A more sophisticated metric that handles graded relevance (i.e., some documents are more relevant than others) and discounts the value of correct documents ranked lower in the list.

    * End-to-End Evaluation: Ultimately, you care about the quality of the final generated answer. This often requires human evaluation or using a powerful LLM (like GPT-4) as an automated judge to compare the outputs of different pipeline versions.
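
    The sketch below computes Hit Rate @ K and MRR over such a golden set. It assumes a retrieve_ids(query, k) function that returns an ordered list of document ids from your pipeline; that function and the golden set contents are placeholders.

    python
    from typing import Callable, Dict, List

    def evaluate_retrieval(golden_set: Dict[str, str],
                           retrieve_ids: Callable[[str, int], List[str]],
                           k: int = 5) -> Dict[str, float]:
        """Hit Rate @ K and MRR over (query -> relevant_doc_id) pairs."""
        hits = 0
        reciprocal_ranks = []
        for query, relevant_id in golden_set.items():
            ranked_ids = retrieve_ids(query, k)
            if relevant_id in ranked_ids:
                hits += 1
                reciprocal_ranks.append(1.0 / (ranked_ids.index(relevant_id) + 1))
            else:
                reciprocal_ranks.append(0.0)
        n = len(golden_set)
        return {"hit_rate@k": hits / n, "mrr": sum(reciprocal_ranks) / n}

    # e.g. evaluate_retrieval({"JX-2024 manual": "doc_7"}, retrieve_ids=my_pipeline_ids, k=5)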

    Conclusion: RAG as an Engineering Discipline

    Production-grade RAG is a distributed systems and information retrieval problem, not just a modeling problem. Moving from a naive prototype to a robust service requires a shift in mindset. By architecting a multi-stage pipeline that leverages the complementary strengths of sparse and dense retrieval, and refines results with a powerful re-ranker, we can build systems that are not only more accurate but also more reliable and performant.

    The patterns discussed here—hybrid search with RRF, cross-encoder re-ranking, and asynchronous execution—are not just theoretical improvements; they are battle-tested techniques used in large-scale commercial search and question-answering systems. By adopting this engineering-first approach, you can transform your RAG PoC into a system capable of handling the complexity and demands of a real-world production environment.
