Optimizing RAG: Hybrid Search and Re-ranking for Low-Latency LLMs

Goh Ling Yong

The Illusion of Simplicity in Production RAG

For senior engineers tasked with building LLM-powered features, the initial promise of Retrieval-Augmented Generation (RAG) seems deceptively simple: embed a corpus of documents, perform a vector similarity search on a user query, stuff the results into a prompt, and let the LLM synthesize an answer. This works for a proof-of-concept. It fails catastrophically in production.

Why? Because naive, single-stage vector search is a blunt instrument. It's susceptible to several critical failure modes:

  • Keyword Miss: Dense vectors excel at capturing semantic meaning but can fail on specific keywords, acronyms, or identifiers (e.g., product SKU X-48-B2, error code 0x80070005). A query for a specific term might not retrieve a document that contains it if the overall semantic context is weak.
  • The "Lost in the Middle" Problem: LLMs pay disproportionate attention to information at the beginning and end of their context window. When you naively inject 5-10 retrieved documents, the most relevant one might be buried in the middle, effectively ignored by the model.
  • Semantic Ambiguity: Queries like "What is the performance of our system?" are ambiguous. "Performance" could refer to latency, throughput, error rate, or financial results. A simple vector search might retrieve a mix of all, creating a noisy, unfocused context.
  • Latency Penalty: The bi-encoder models used for retrieval embeddings are fast, but the subsequent LLM call is not. Every irrelevant token added to the context window increases inference cost and latency. A noisy context not only produces worse answers but also slower and more expensive ones.

Production-grade RAG is not a single API call; it's a multi-stage data processing funnel designed to maximize precision at the top of the retrieved results (Precision@k where k is small) before ever invoking the expensive LLM. This article details the architecture and implementation of a two-stage pipeline that addresses these failures: Hybrid Retrieval followed by Cross-Encoder Re-ranking.


    Stage 1: Hybrid Retrieval with Reciprocal Rank Fusion (RRF)

    The first stage of our funnel is retrieval. Our goal here is high recall—we want to cast a wide net to ensure the relevant documents are captured, even if it means pulling in some noise. We achieve this by combining the strengths of two fundamentally different search paradigms: sparse and dense retrieval.

    * Sparse Retrieval (e.g., BM25): The workhorse of traditional search engines. It operates on inverted indexes and term frequencies (TF-IDF). It is exceptionally good at matching keywords, codes, and specific jargon. Its weakness is a lack of semantic understanding.

    * Dense Retrieval (Vector Search): Uses embedding models (bi-encoders) to map text to high-dimensional vectors. It excels at understanding semantic similarity, synonyms, and paraphrasing. Its weakness is the keyword miss problem mentioned earlier.

    By combining them, we get the best of both worlds. The challenge lies in merging their disparate result sets and scoring systems.
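
    To make the contrast concrete, here is a minimal sketch using the rank_bm25 package (the pipeline later in this article uses Elasticsearch for BM25; rank_bm25 is just a lightweight stand-in for illustration) and the same all-MiniLM-L6-v2 bi-encoder used later. The two-document corpus and the query are purely illustrative.

    python
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer, util
    
    # Tiny illustrative corpus; a real index would hold thousands of chunks.
    corpus = [
        "Error code 0x80070005 indicates an access-denied failure during installation.",
        "The installer failed because the account lacked the required permissions.",
    ]
    query = "0x80070005"
    
    # Sparse: BM25 scores exact term overlap, so only the document containing
    # the literal error code receives a nonzero score for this query.
    tokenized_corpus = [doc.lower().split() for doc in corpus]
    bm25 = BM25Okapi(tokenized_corpus)
    print("BM25 scores:", bm25.get_scores(query.lower().split()))
    
    # Dense: the bi-encoder embeds query and documents independently and ranks
    # by cosine similarity, which captures paraphrase and synonyms but has no
    # special handling for rare identifiers like error codes.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_embeddings = model.encode(corpus, convert_to_tensor=True)
    query_embedding = model.encode(query, convert_to_tensor=True)
    print("Cosine similarities:", util.cos_sim(query_embedding, doc_embeddings))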

    Fusing the Results: Reciprocal Rank Fusion (RRF)

    BM25 produces a relevance score, while vector search produces a distance/similarity score. These are not directly comparable. Instead of trying to normalize them, we can use a rank-based fusion method. Reciprocal Rank Fusion (RRF) is a simple, effective, and zero-tuning-required algorithm.

    The formula for the RRF score of a document d is:

    RRF_score(d) = Σ (1 / (k + rank_i(d)))

    Where:

    * rank_i(d) is the rank of document d in the i-th result set (e.g., the BM25 results or the vector search results).

    * k is a smoothing constant that prevents the top-ranked results of any single retriever from dominating the fused score (a common value is k=60).
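
    For example, with k=60, a document ranked 1st by BM25 and 3rd by vector search scores 1/(60+1) + 1/(60+3) ≈ 0.0323, while a document that appears in only one list at rank 10 scores 1/(60+10) ≈ 0.0143. Documents that rank reasonably well in both lists therefore rise to the top of the fused list.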

    Let's implement this. We'll assume we have two sets of results, one from Elasticsearch (BM25) and one from FAISS (a popular vector index library).

    Code Example: Implementing Hybrid Search and RRF

    First, let's set up our search clients. This assumes you have documents indexed in both Elasticsearch and a FAISS index with corresponding metadata.

    python
    import faiss
    import numpy as np
    from elasticsearch import Elasticsearch
    from sentence_transformers import SentenceTransformer
    
    # --- Assume these are pre-populated ---
    # Elasticsearch client
    es_client = Elasticsearch("http://localhost:9200")
    # Sentence Transformer model for embedding queries
    embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
    # FAISS index loaded from disk
    faiss_index = faiss.read_index("my_faiss_index.bin")
    # A mapping from FAISS index position to our document ID
    # e.g., {0: 'doc_abc', 1: 'doc_xyz', ...}
    index_to_doc_id = { ... }
    
    # --- Implementation ---
    
    def search_bm25(query_text: str, k: int = 50) -> list[tuple[str, float]]:
        """Performs a BM25 search and returns (doc_id, score) tuples."""
        response = es_client.search(
            index="my_documents",
            body={
                "size": k,
                "query": {
                    "match": {
                        "content": query_text
                    }
                }
            }
        )
        return [(hit['_id'], hit['_score']) for hit in response['hits']['hits']]
    
    def search_vector(query_text: str, k: int = 50) -> list[tuple[str, float]]:
        """Performs a vector search and returns (doc_id, score) tuples."""
        query_vector = embedding_model.encode([query_text])
        # FAISS L2 indexes return distances (lower is better), so we convert to a similarity below
        distances, indices = faiss_index.search(query_vector.astype('float32'), k)
        
        results = []
        for i, dist in zip(indices[0], distances[0]):
            if i != -1: # FAISS can return -1 for no result
                doc_id = index_to_doc_id[i]
                # Convert distance to a similarity score (0-1), 1 / (1 + dist)
                similarity = 1.0 / (1.0 + dist)
                results.append((doc_id, similarity))
        return results
    
    def reciprocal_rank_fusion(results_lists: list[list[tuple[str, float]]], k: int = 60) -> dict[str, float]:
        """Performs RRF on a list of search result lists."""
        fused_scores = {}
        
        # Create a map from doc_id to its rank in each list
        doc_ranks = {}
        for results in results_lists:
            for rank, (doc_id, _) in enumerate(results):
                if doc_id not in doc_ranks:
                    doc_ranks[doc_id] = []
                doc_ranks[doc_id].append(rank + 1) # rank is 1-based
                
        # Calculate RRF score for each document
        for doc_id, ranks in doc_ranks.items():
            rrf_score = 0.0
            for rank in ranks:
                rrf_score += 1.0 / (k + rank)
            fused_scores[doc_id] = rrf_score
            
        return fused_scores
    
    # --- Putting it together ---
    def hybrid_search(query: str, top_k_retrieval: int = 100):
        bm25_results = search_bm25(query, k=top_k_retrieval)
        vector_results = search_vector(query, k=top_k_retrieval)
    
        # RRF only uses the rank order of each list; the raw scores are ignored
        fused_scores = reciprocal_rank_fusion([bm25_results, vector_results])
    
        # Sort by RRF score in descending order
        sorted_fused_results = sorted(fused_scores.items(), key=lambda item: item[1], reverse=True)
        
        return sorted_fused_results
    
    # --- Example Usage ---
    query = "What is the p99 latency for the user-service API?"
    retrieved_docs = hybrid_search(query)
    print(f"Retrieved {len(retrieved_docs)} documents after fusion.")
    print("Top 5 results:", retrieved_docs[:5])

    In this stage, we might retrieve a large number of documents (e.g., k=100) to ensure our recall is high. The output of this stage is a single, relevance-ranked list of document IDs. This list is far superior to a single-modality search, but it's still too noisy and too large to pass directly to an LLM.


    Stage 2: Precision Enhancement with Cross-Encoder Re-ranking

    The retrieval stage used bi-encoders, which create embeddings for the query and documents independently. They are incredibly fast, allowing us to search over millions of documents in milliseconds. However, this independence is also their weakness; the model never sees the query and document together.

    Cross-encoders solve this. They take the query and a candidate document together as a single input and output a relevance score for the pair (depending on the model, a raw logit or a 0-1 probability). This allows deep, token-level attention between the query and the document, making them significantly more accurate than bi-encoders.

    The catch? They are orders of magnitude slower. Running a cross-encoder on your entire corpus is computationally infeasible. This is why our funnel architecture is so critical: we use the fast, high-recall hybrid search to find a few dozen candidates, and then use the slow, high-precision cross-encoder to re-rank only this small set.
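
    The difference is easy to see on a single pair. This short sketch reuses the all-MiniLM-L6-v2 bi-encoder from the retrieval stage and the ms-marco cross-encoder introduced below; the query and passage are illustrative. The bi-encoder compares two independently computed embeddings, while the cross-encoder reads the pair jointly and scores it.

    python
    from sentence_transformers import SentenceTransformer, util
    from sentence_transformers.cross_encoder import CrossEncoder
    
    query = "What is the p99 latency for the user-service API?"
    passage = "The user-service API has a p99 latency of 250ms."
    
    # Bi-encoder: query and passage are embedded independently, then compared.
    # Fast enough to pre-compute embeddings for millions of documents.
    bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
    emb_q = bi_encoder.encode(query, convert_to_tensor=True)
    emb_p = bi_encoder.encode(passage, convert_to_tensor=True)
    print("Bi-encoder cosine similarity:", util.cos_sim(emb_q, emb_p).item())
    
    # Cross-encoder: the pair is scored jointly with full token-level attention.
    # Far more accurate, but must be run per (query, document) pair at query time.
    cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    print("Cross-encoder relevance score:", cross_encoder.predict([(query, passage)])[0])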

    Implementation with sentence-transformers

    The sentence-transformers library provides pre-trained cross-encoder models perfect for this task.

    python
    from sentence_transformers.cross_encoder import CrossEncoder
    
    # Load a lightweight, pre-trained cross-encoder
    # These models are trained on tasks like question-answering and are ideal for re-ranking
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    
    # Assume 'retrieved_docs' is the output from our hybrid_search function
    # and we have a function to get the actual text content of a document
    def get_document_content(doc_id: str) -> str:
        # In a real system, this would fetch from a database or document store
        # For this example, we'll use a dummy dictionary
        dummy_db = {
            'doc1': 'The user-service API has a p99 latency of 250ms.',
            'doc2': 'System performance metrics are tracked in Grafana.',
            'doc3': 'API latency is a key performance indicator.',
            'doc4': 'The billing-service API latency is currently 150ms.'
            # ... more docs
        }
        return dummy_db.get(doc_id, "")
    
    def rerank_documents(query: str, doc_ids: list[str], top_n: int = 5) -> list[tuple[str, float]]:
        """Re-ranks a list of document IDs using a cross-encoder."""
        # Create pairs of [query, document_content] for the cross-encoder
        query_doc_pairs = [(query, get_document_content(doc_id)) for doc_id in doc_ids]
        
        if not query_doc_pairs:
            return []
            
        # Predict scores. The model will output a single relevance score for each pair.
        scores = cross_encoder.predict(query_doc_pairs)
        
        # Combine doc_ids with their new scores
        reranked_results = list(zip(doc_ids, scores))
        
        # Sort by the new score in descending order
        reranked_results.sort(key=lambda x: x[1], reverse=True)
        
        return reranked_results[:top_n]
    
    # --- Example Usage with the previous stage's output ---
    # Let's say hybrid_search returned a list of (doc_id, rrf_score) tuples
    retrieved_doc_ids = [doc_id for doc_id, score in retrieved_docs]
    
    # We only re-rank the top, say, 50 candidates from the retrieval stage
    rerank_candidates = retrieved_doc_ids[:50]
    
    final_docs_for_llm = rerank_documents(query, rerank_candidates, top_n=5)
    
    print("Final documents to be passed to LLM context:", final_docs_for_llm)
    # Illustrative output (the score scale depends on the model): [('doc1', 0.98), ('doc3', 0.75), ('doc2', 0.21), ...]
    # Notice how the most direct answer ('doc1') is pushed to the top.

    Now, instead of a context cluttered with 10-20 potentially relevant documents, we have a highly-focused, precision-ordered list of 3-5 documents. The most relevant document is almost guaranteed to be at the top, mitigating the "lost in the middle" problem and providing the LLM with a clean, signal-rich context.
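
    If you do need to pass more than a handful of documents, one optional mitigation for the "lost in the middle" effect (not part of the pipeline above, just a commonly used reordering trick) is to place the strongest documents at the edges of the context rather than in its middle. A minimal sketch, assuming the best-first output of rerank_documents:

    python
    def reorder_for_context(reranked: list[tuple[str, float]]) -> list[tuple[str, float]]:
        """Place the highest-scored documents at the start and end of the context.
    
        Assumes `reranked` is sorted best-first, e.g. the output of rerank_documents().
        Weaker documents end up in the middle, where the LLM attends to them least.
        """
        front, back = [], []
        for i, item in enumerate(reranked):
            # Alternate: best doc first, second-best last, third-best second, ...
            if i % 2 == 0:
                front.append(item)
            else:
                back.append(item)
        return front + back[::-1]
    
    # Example: [("doc1", 0.98), ("doc3", 0.75), ("doc2", 0.21), ("doc4", 0.10)]
    # becomes  [("doc1", 0.98), ("doc2", 0.21), ("doc4", 0.10), ("doc3", 0.75)]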


    End-to-End Production Pipeline and API

    Let's assemble this into a cohesive service. We can use a simple web framework like FastAPI to expose this advanced RAG pipeline as an API endpoint.

    python
    from fastapi import FastAPI
    from pydantic import BaseModel
    from openai import AsyncOpenAI
    
    # Assume all previous functions (search_bm25, search_vector, hybrid_search, rerank_documents, etc.) are defined here.
    # Also assume the OPENAI_API_KEY environment variable is set.
    
    app = FastAPI()
    openai_client = AsyncOpenAI()
    
    class RAGQuery(BaseModel):
        query: str
        user_id: str # For logging, personalization, etc.
    
    class RAGResponse(BaseModel):
        answer: str
        retrieved_doc_ids: list[str]
    
    def build_llm_prompt(query: str, context_docs: list[str]) -> str:
        """Builds the final prompt for the LLM."""
        context = "\n\n---\n\n".join(context_docs)
        prompt = f"""
        You are a helpful AI assistant. Answer the user's question based on the following context.
        If the context does not contain the answer, state that you don't have enough information.
    
        Context:
        {context}
    
        Question: {query}
    
        Answer:
        """
        return prompt
    
    @app.post("/query", response_model=RAGResponse)
    async def execute_rag_pipeline(rag_query: RAGQuery):
        # STAGE 1: Hybrid Retrieval
        # We retrieve a larger set of candidates (e.g., 50-100)
        retrieved_results = hybrid_search(rag_query.query, top_k_retrieval=75)
        retrieved_doc_ids = [doc_id for doc_id, score in retrieved_results]
    
        # STAGE 2: Cross-Encoder Re-ranking
        # We re-rank only the top candidates to keep latency down
        rerank_candidates = retrieved_doc_ids[:25]
        # We only need the top 3-5 documents for the final context
        final_reranked_docs = rerank_documents(rag_query.query, rerank_candidates, top_n=3)
        final_doc_ids = [doc_id for doc_id, score in final_reranked_docs]
    
        # STAGE 3: Context Generation and LLM Invocation
        context_contents = [get_document_content(doc_id) for doc_id in final_doc_ids]
        
        prompt = build_llm_prompt(rag_query.query, context_contents)
    
        # Call the LLM with the async client so the event loop is not blocked
        response = await openai_client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[
                {"role": "system", "content": "You are an expert Q&A system."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.1
        )
    
        answer = response.choices[0].message.content
    
        return RAGResponse(answer=answer, retrieved_doc_ids=final_doc_ids)
    

    This API endpoint encapsulates the entire sophisticated pipeline. It clearly shows the funnel: 75 candidates -> 25 to re-rank -> 3 for LLM. These numbers are hyperparameters you must tune based on your specific latency requirements and document characteristics.
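
    One lightweight way to keep these knobs in a single place is a small configuration object. The sketch below is purely illustrative (FunnelConfig is not part of any library); it simply makes the funnel sizes explicit and easy to tune per deployment:

    python
    from dataclasses import dataclass
    
    @dataclass(frozen=True)
    class FunnelConfig:
        """Hypothetical container for the pipeline's tunable funnel sizes."""
        retrieval_k: int = 75        # candidates pulled from each retriever (BM25 and vector)
        rerank_candidates: int = 25  # how many fused results the cross-encoder scores
        context_top_n: int = 3       # documents that actually reach the LLM prompt
    
    # Example: a latency-sensitive deployment might shrink every stage.
    LOW_LATENCY = FunnelConfig(retrieval_k=40, rerank_candidates=10, context_top_n=3)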


    Advanced Performance Considerations and Edge Cases

    Deploying this system requires more than just the Python code. Here are critical production considerations:

  • Re-ranker Latency: The cross-encoder is the new bottleneck. To optimize it:

    * GPU Inference: Deploy the re-ranking model on a GPU-enabled inference server like NVIDIA Triton Inference Server. This allows for batching requests and hardware acceleration.

    * Model Quantization/Pruning: Use tools like ONNX Runtime or TensorRT to convert the PyTorch model to a quantized, optimized format. This can reduce latency by 2-4x with a minimal drop in accuracy.

    * Smaller Models: Experiment with smaller cross-encoder models. A MiniLM might be sufficient and significantly faster than a BERT-base model.

  • Document Chunking Strategy: How you split long documents into indexable chunks is paramount. Naive fixed-size chunking can cut sentences in half, destroying context. Consider semantic chunking or sentence-aware chunking strategies. An advanced technique is sentence-window retrieval, where you embed each sentence but, upon retrieval, fetch the sentence along with the N sentences before and after it to provide a richer context.
  • Caching: Implement a multi-layer caching strategy. A Redis cache can store:

    * Retrieval Results: Cache the output of the hybrid search for common queries.

    * Re-ranked Results: Cache the final, re-ranked list of document IDs.

    * LLM Responses: Cache the final generated answer for identical queries.

    The cache key should be a hash of the query and potentially other parameters (e.g., user permissions).

  • Asynchronous Execution: The two retrieval steps (BM25 and vector) are independent and can be executed in parallel using asyncio.gather to reduce I/O wait time (see the sketch after this list).
  • Benchmarking and Evaluation: You cannot optimize what you cannot measure. Establish a robust evaluation set (a "golden set" of queries and their ideal document sets). Track metrics like:

    * Mean Reciprocal Rank (MRR): Measures the rank of the first correct answer.

    * Normalized Discounted Cumulative Gain (nDCG@k): Evaluates the quality of the ranking for the top k documents.

    * End-to-End Latency (p95, p99): The most important user-facing metric.
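
    As a concrete example of the asynchronous execution point above, here is a minimal sketch. The async wrappers are hypothetical helpers that offload the synchronous search_bm25 and search_vector functions from earlier to worker threads via asyncio.to_thread; natively asynchronous clients would slot into asyncio.gather the same way.

    python
    import asyncio
    
    async def search_bm25_async(query: str, k: int = 50):
        # Offload the synchronous Elasticsearch call from earlier to a worker thread.
        return await asyncio.to_thread(search_bm25, query, k)
    
    async def search_vector_async(query: str, k: int = 50):
        # Offload the synchronous FAISS call to a worker thread.
        return await asyncio.to_thread(search_vector, query, k)
    
    async def hybrid_search_async(query: str, top_k_retrieval: int = 100):
        # Run both retrievers concurrently instead of back-to-back.
        bm25_results, vector_results = await asyncio.gather(
            search_bm25_async(query, top_k_retrieval),
            search_vector_async(query, top_k_retrieval),
        )
        fused_scores = reciprocal_rank_fusion([bm25_results, vector_results])
        return sorted(fused_scores.items(), key=lambda item: item[1], reverse=True)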

    A Comparative Benchmark (Illustrative)

    | Architecture | MRR@10 (Retrieval Quality) | p95 End-to-End Latency | LLM Cost (Relative) |
    |---|---|---|---|
    | Naive Vector Search (10 docs) | 0.65 | 1.8s | 1.5x |
    | Hybrid Search (10 docs) | 0.78 | 2.0s | 1.5x |
    | Hybrid + Re-ranker (3 docs) | 0.92 | 1.5s | 1.0x (baseline) |

    This illustrative data shows our advanced pipeline is not just more accurate (MRR jumps from 0.65 to 0.92) but can also be faster and cheaper. The latency reduction comes from the smaller, cleaner context passed to the LLM, which more than compensates for the added re-ranking step (if optimized correctly). The cost reduction is a direct result of using fewer tokens in the LLM prompt.

    Conclusion

    Moving a RAG system from a demo to a robust, production service requires a paradigm shift from single-shot retrieval to a multi-stage refinement pipeline. By combining the keyword-matching strength of sparse search with the semantic power of dense search, we create a high-recall initial candidate set. Then, by applying a computationally intensive but highly accurate cross-encoder re-ranker, we distill this set into a small, precision-focused context for the LLM.

    This architecture directly addresses the core weaknesses of naive RAG, leading to more relevant answers, lower latency, and reduced operational costs. It is a complex system with more moving parts, but for applications where accuracy and performance are non-negotiable, this level of engineering rigor is not just beneficial—it's essential.
