Optimizing RAG with Hybrid Search and Cross-Encoder Re-ranking

29 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Production RAG Relevance Problem: Beyond Naive Vector Search

In the initial gold rush of building Retrieval-Augmented Generation (RAG) systems, a simple pipeline of embedding a query, performing a vector similarity search, and stuffing the results into a Large Language Model (LLM) context was sufficient for impressive demos. However, for senior engineers tasked with deploying these systems into production, the limitations of this naive approach become apparent almost immediately. The core issue is that pure semantic search, while powerful, is a blunt instrument.

It frequently fails in scenarios requiring lexical precision:

  • Keyword and Acronym Mismatch: Queries containing specific product SKUs (NVIDIA H100), error codes (ERR_CONN_RESET), legal terms, or internal acronyms (Project Chimera) often retrieve semantically related but factually incorrect documents because the exact terms are not given enough weight.
  • Over-generalization: A query like "database performance tuning" might retrieve high-level articles instead of a specific document titled "PostgreSQL index tuning guide," even if the latter is more relevant.
  • Lack of Nuance: Semantic search struggles to differentiate between documents that are about the same topic versus documents that contain the specific answer. The top-k results are often noisy, containing redundant or tangentially related information that pollutes the LLM's context window.
These failures lead to hallucinations, incorrect answers, and a general lack of user trust, all unacceptable outcomes for any serious application. The solution is not to abandon semantic search, but to augment it within a more sophisticated, multi-stage retrieval architecture. This article details a production-proven, two-stage pattern: Stage 1: Broad Retrieval via Hybrid Search and Stage 2: Fine-grained Re-ranking via Cross-Encoders.

    We will assume you are already familiar with the basics of RAG, vector embeddings, and similarity search. We will dive directly into building and optimizing this advanced pipeline.


    Part 1: Architecting Hybrid Search with Reciprocal Rank Fusion (RRF)

    Hybrid search addresses the core weakness of pure vector search by combining it with a traditional keyword-based search algorithm like BM25. This gives us the best of both worlds: the semantic understanding of dense vectors and the lexical precision of sparse vectors.

  • Dense Retrieval (Vector Search): Excellent for understanding intent, synonyms, and conceptual relationships. It answers the question, "What is the meaning of this query?"
  • Sparse Retrieval (BM25): A battle-tested algorithm based on TF-IDF that excels at matching exact keywords and rare terms. It answers the question, "Which documents contain these exact words?"

    Simply running both searches and concatenating the results is suboptimal. The score ranges are completely different and uncalibrated: a BM25 score of 20.5 and a cosine similarity of 0.85 cannot be directly compared. The key is in the fusion algorithm. While various techniques exist (like weighted score normalization), Reciprocal Rank Fusion (RRF) has emerged as a simple, effective, and parameter-free method.

    Understanding Reciprocal Rank Fusion (RRF)

    RRF's elegance lies in its simplicity. It disregards the raw scores from different retrieval systems and instead focuses only on the rank of each document in the result lists. The formula for a document d is:

    text
    RRF_score(d) = Σ (1 / (k + rank_i(d)))

    Where:

  • rank_i(d) is the rank of document d in the result list from search system i.
  • k is a small constant (typically set to 60, as per the original paper) that dampens the difference between adjacent ranks, so a document ranked first by a single system cannot dominate the fused score.

    By using rank, RRF naturally normalizes the contributions from disparate systems like BM25 and vector search. Documents that consistently appear at the top of multiple result lists receive a significant boost in their final score.
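
    To make the formula concrete, here is a minimal, self-contained RRF sketch applied to two toy ranked lists. The document IDs and orderings are illustrative only and are not taken from the corpus used later in this article.

    python
    # Minimal RRF sketch: fuse two ranked lists of document IDs.
    def rrf_fuse(ranked_lists, k=60):
        fused = {}
        for ranked in ranked_lists:
            for rank, doc_id in enumerate(ranked):  # rank is 0-indexed
                fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
        return sorted(fused.items(), key=lambda item: item[1], reverse=True)

    bm25_ranking = ["docA", "docB", "docC"]    # hypothetical BM25 ordering
    vector_ranking = ["docB", "docD", "docA"]  # hypothetical vector-search ordering

    for doc_id, score in rrf_fuse([bm25_ranking, vector_ranking]):
        print(f"{doc_id}: {score:.4f}")
    # docB and docA appear in both lists, so they rise to the top of the fused ranking.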

    Implementation: A Production-Grade Hybrid Retriever

    Let's build a HybridRetriever in Python. For this example, we'll use Elasticsearch for BM25 search and a generic vector store client (the interface could be adapted for Pinecone, Weaviate, Qdrant, etc.).

    First, let's set up our environment and data. We'll use a small corpus of technical documents.

    python
    # requirements.txt
    # pip install elasticsearch sentence-transformers numpy
    
    import json
    import numpy as np
    from elasticsearch import Elasticsearch
    from sentence_transformers import SentenceTransformer
    
    # --- 1. Data and Embedding Setup ---
    
    documents = [
        {"id": "doc1", "text": "The new H100 GPU from NVIDIA provides significant performance gains for large language model training."}, 
        {"id": "doc2", "text": "Optimizing PostgreSQL queries can be achieved through partial indexing, especially in multi-tenant systems."}, 
        {"id": "doc3", "text": "NVIDIA's Triton Inference Server is a solution for deploying models at scale, supporting GPUs like the A100 and H100."}, 
        {"id": "doc4", "text": "A common network error, ERR_CONN_RESET, indicates that the connection was unexpectedly closed by the peer."}, 
        {"id": "doc5", "text": "We are launching Project Chimera next quarter, a new initiative focused on AI-driven data analytics."}, 
        {"id": "doc6", "text": "Advanced GPU computing is essential for deep learning tasks and scientific simulations."}
    ]
    
    # Use a high-quality embedding model
    embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
    
    # --- 2. Setup Clients (Elasticsearch and a mock Vector DB) ---
    
    # Elasticsearch for BM25
    es_client = Elasticsearch("http://localhost:9200")
    INDEX_NAME = "hybrid_search_docs"
    
    if es_client.indices.exists(index=INDEX_NAME):
        es_client.indices.delete(index=INDEX_NAME)
    es_client.indices.create(index=INDEX_NAME)
    
    for doc in documents:
        es_client.index(index=INDEX_NAME, id=doc['id'], document={"text": doc['text']})
    
    # Refresh the index so the newly indexed documents are immediately searchable
    es_client.indices.refresh(index=INDEX_NAME)
    
    # Mock Vector Database for dense search
    # In a real system, this would be Pinecone, FAISS, etc.
    class MockVectorDB:
        def __init__(self, docs, model):
            self.model = model
            self.docs = {doc['id']: doc['text'] for doc in docs}
            self.vectors = {doc['id']: model.encode(doc['text']) for doc in docs}
            self.ids = list(self.docs.keys())
            self.matrix = np.array(list(self.vectors.values()))
    
        def search(self, query_text, top_k=5):
            # Encode the query with the same model used to embed the documents
            query_vector = self.model.encode(query_text)
            # Cosine similarity of the query against every document vector
            scores = np.dot(self.matrix, query_vector) / (np.linalg.norm(self.matrix, axis=1) * np.linalg.norm(query_vector))
            # Get top_k indices
            top_indices = np.argsort(scores)[-top_k:][::-1]
            return [(self.ids[i], scores[i]) for i in top_indices]
    
    vector_db = MockVectorDB(documents, embedding_model)
    
    # --- 3. The HybridRetriever Implementation ---
    
    class HybridRetriever:
        def __init__(self, es_client, vector_db, es_index, k=60):
            self.es_client = es_client
            self.vector_db = vector_db
            self.es_index = es_index
            self.k = k
    
        def retrieve(self, query, top_k=5):
            # 1. Sparse Search (BM25)
            bm25_results = self.es_client.search(
                index=self.es_index,
                query={"match": {"text": query}},
                size=top_k * 2 # Retrieve more to allow for fusion
            )
            bm25_docs = {hit['_id']: hit['_score'] for hit in bm25_results['hits']['hits']}
    
            # 2. Dense Search (Vector)
            vector_results = self.vector_db.search(query, top_k=top_k * 2)
            vector_docs = {doc_id: score for doc_id, score in vector_results}
    
            # 3. Reciprocal Rank Fusion (RRF)
            fused_scores = self._reciprocal_rank_fusion([bm25_docs, vector_docs])
    
            # 4. Sort and return results
            sorted_docs = sorted(fused_scores.items(), key=lambda item: item[1], reverse=True)
            
            # Fetch the actual document text for the top_k results
            final_results = []
            for doc_id, score in sorted_docs[:top_k]:
                text = next((doc['text'] for doc in documents if doc['id'] == doc_id), None)
                final_results.append({"id": doc_id, "text": text, "score": score})
                
            return final_results
    
        def _reciprocal_rank_fusion(self, result_sets):
            fused_scores = {}
            # result_sets is a list of dictionaries, like [bm25_docs, vector_docs]
            for doc_scores in result_sets:
                # Create a ranked list from the scores
                ranked_list = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
                for rank, (doc_id, _) in enumerate(ranked_list):
                    if doc_id not in fused_scores:
                        fused_scores[doc_id] = 0
                    fused_scores[doc_id] += 1 / (self.k + rank + 1) # rank is 0-indexed
            return fused_scores
    
    # --- 4. Running a Query ---
    retriever = HybridRetriever(es_client, vector_db, INDEX_NAME)
    
    # Query 1: A keyword-specific query
    query1 = "Project Chimera H100"
    
    print(f"--- Running Query: '{query1}' ---")
    results = retriever.retrieve(query1)
    for res in results:
        print(f"ID: {res['id']}, Score: {res['score']:.4f}, Text: {res['text']}")
    
    # Query 2: A semantic query
    query2 = "improving database speed"
    print(f"\n--- Running Query: '{query2}' ---")
    results = retriever.retrieve(query2)
    for res in results:
        print(f"ID: {res['id']}, Score: {res['score']:.4f}, Text: {res['text']}")

    Analysis of the Results:

    For query1 = "Project Chimera H100":

  • BM25 would strongly rank doc5 (Project Chimera) and doc1/doc3 (H100).
  • Vector Search might struggle, as "Project Chimera" has no semantic overlap with "H100". It might rank doc6 (GPU computing) highly.
  • RRF will combine these signals. doc5, doc1, and doc3 will likely appear in the top ranks of the BM25 results. doc1 and doc3 will also rank highly in the vector search. The fusion process will correctly identify these as the most relevant documents, placing them at the top.

    For query2 = "improving database speed":

  • BM25 might find no exact matches and perform poorly.
  • Vector Search will excel, understanding the semantic relationship between "improving database speed" and "Optimizing PostgreSQL queries". It will rank doc2 very highly.
  • RRF will be dominated by the strong signal from the vector search, correctly promoting doc2 to the top rank.

    This hybrid approach provides a much more robust retrieval baseline than either method alone, forming a solid foundation for the next stage.


    Part 2: The 'Last Mile' Problem and Cross-Encoder Re-ranking

    Hybrid search gives us a high-quality set of candidate documents. We might retrieve the top 50-100 candidates. However, within this set, the precise ordering might still be suboptimal. This is the "last mile" problem of relevance. We need a more powerful, computationally expensive model to scrutinize this smaller set of candidates and re-rank them with extreme precision.

    This is where cross-encoders shine.

    Bi-Encoders vs. Cross-Encoders: A Critical Distinction

  • Bi-Encoders: These are the standard models used for retrieval (like all-MiniLM-L6-v2). They create independent vector representations (embeddings) for the query and the documents. The comparison (cosine similarity) happens after the encoding. This is computationally efficient, allowing us to pre-compute document embeddings and search through millions of them in milliseconds.

    text
    Query -> [Bi-Encoder] -> Query Vector
    Document -> [Bi-Encoder] -> Document Vector
    Score = CosineSimilarity(Query Vector, Document Vector)

  • Cross-Encoders: These models take both the query and a document as a single, concatenated input. This allows the model to perform full self-attention across the query and document tokens simultaneously. The result is a much deeper contextual understanding and a highly accurate relevance score. The downside is extreme computational cost: nothing can be pre-computed, so the model must run for every query-document pair.

    text
    (Query, Document) -> [Cross-Encoder] -> Relevance Score (a single float)

    This computational cost makes cross-encoders unsuitable for the initial retrieval step over a large corpus. But they are perfectly suited for re-ranking a small set of promising candidates returned by our hybrid retriever.
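
    The difference is easy to see at the API level. The snippet below is a minimal sketch using sentence-transformers to score a single pair both ways; the query and passage strings are illustrative placeholders.

    python
    from sentence_transformers import SentenceTransformer, CrossEncoder, util

    query = "What causes a connection reset error?"  # illustrative query
    passage = "ERR_CONN_RESET indicates that the connection was unexpectedly closed by the peer."

    # Bi-encoder: encode query and passage independently, then compare the vectors.
    bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
    q_vec = bi_encoder.encode(query, convert_to_tensor=True)
    p_vec = bi_encoder.encode(passage, convert_to_tensor=True)
    bi_score = util.cos_sim(q_vec, p_vec).item()

    # Cross-encoder: score the (query, passage) pair jointly in one forward pass.
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    ce_score = cross_encoder.predict([(query, passage)])[0]

    print(f"Bi-encoder cosine similarity:  {bi_score:.4f}")
    print(f"Cross-encoder relevance score: {ce_score:.4f}")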

    Implementation: Integrating a Cross-Encoder

    We will now extend our HybridRetriever to include a re-ranking step using a model from the sentence-transformers library, such as cross-encoder/ms-marco-MiniLM-L-6-v2, which is trained specifically for relevance ranking.

    python
    # requirements.txt update
    # pip install torch
    # (sentence-transformers already installed)
    
    from sentence_transformers.cross_encoder import CrossEncoder
    
    # --- Add this to the setup ---
    # Load a cross-encoder model
    # This should be done once when the application starts
    cross_encoder_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    
    # --- 4. The Full Pipeline: HybridRetrieverWithReranker ---
    
    class HybridRetrieverWithReranker:
        def __init__(self, es_client, vector_db, cross_encoder, es_index, k=60):
            self.es_client = es_client
            self.vector_db = vector_db
            self.cross_encoder = cross_encoder
            self.es_index = es_index
            self.k = k
    
        def retrieve(self, query, hybrid_top_k=50, final_top_k=5):
            # --- Stage 1: Hybrid Retrieval (same as before) ---
            # 1. Sparse Search (BM25)
            bm25_results = self.es_client.search(
                index=self.es_index,
                query={"match": {"text": query}},
                size=hybrid_top_k
            )
            bm25_docs = {hit['_id']: hit['_score'] for hit in bm25_results['hits']['hits']}
    
            # 2. Dense Search (Vector)
            vector_results = self.vector_db.search(query, top_k=hybrid_top_k)
            vector_docs = {doc_id: score for doc_id, score in vector_results}
    
            # 3. Reciprocal Rank Fusion (RRF)
            fused_scores = self._reciprocal_rank_fusion([bm25_docs, vector_docs])
            
            # Get the top candidate documents from fusion
            sorted_fused_docs = sorted(fused_scores.items(), key=lambda item: item[1], reverse=True)
            candidate_ids = [doc_id for doc_id, _ in sorted_fused_docs[:hybrid_top_k]]
            
            if not candidate_ids:
                return []
    
            # --- Stage 2: Cross-Encoder Re-ranking ---
            # Prepare pairs for the cross-encoder: [ (query, doc_text), ... ]
            candidate_docs_map = {doc['id']: doc['text'] for doc in documents}
            pairs = [(query, candidate_docs_map[doc_id]) for doc_id in candidate_ids]
            
            # Predict scores. This is the computationally expensive step.
            ce_scores = self.cross_encoder.predict(pairs)
            
            # Combine IDs with their new scores
            reranked_results = list(zip(candidate_ids, ce_scores))
            
            # Sort by the new cross-encoder score
            reranked_results.sort(key=lambda x: x[1], reverse=True)
            
            # --- Format final output ---
            final_results = []
            for doc_id, score in reranked_results[:final_top_k]:
                final_results.append({
                    "id": doc_id,
                    "text": candidate_docs_map[doc_id],
                    "score": float(score) # Ensure score is a standard float
                })
            
            return final_results
    
        def _reciprocal_rank_fusion(self, result_sets):
            # (Same implementation as before)
            fused_scores = {}
            for doc_scores in result_sets:
                ranked_list = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
                for rank, (doc_id, _) in enumerate(ranked_list):
                    if doc_id not in fused_scores:
                        fused_scores[doc_id] = 0
                    fused_scores[doc_id] += 1 / (self.k + rank + 1)
            return fused_scores
    
    # --- 5. Running the Full Pipeline ---
    
    full_retriever = HybridRetrieverWithReranker(es_client, vector_db, cross_encoder_model, INDEX_NAME)
    
    query3 = "What is the consequence of a connection reset error?"
    
    print(f"--- Running Full Pipeline Query: '{query3}' ---")
    final_results = full_retriever.retrieve(query3)
    
    for res in final_results:
        print(f"ID: {res['id']}, Score: {res['score']:.4f}, Text: {res['text']}")

    With query3, the hybrid search might return doc4 (ERR_CONN_RESET) and doc6 (GPU computing, due to some semantic overlap with 'connection'). The cross-encoder, however, will analyze the pairs (query, doc4_text) and (query, doc6_text). It will assign a very high score to the doc4 pair due to the strong contextual alignment between "connection reset error" and the document text, while assigning a very low score to the doc6 pair, correctly identifying it as irrelevant. This precision is what justifies the computational cost.


    Part 3: Production Architecture and Performance Optimization

    Deploying this pipeline requires careful consideration of its performance characteristics, especially the latency introduced by the cross-encoder.

    System Architecture Diagram:

    text
                     +--------------+
                     |  User Query  |
                     +------+-------+
                            |
             +--------------+--------------+
             |                             |
             v                             v
    +--------------------+      +--------------------+
    | BM25 Search        |      | Vector Search      |
    | (Elasticsearch)    |      | (Pinecone/FAISS)   |
    +---------+----------+      +----------+---------+
              | Top 50 docs                | Top 50 docs
              +--------------+-------------+
                             v
                    +-----------------+
                    |   RRF Fusion    |
                    +--------+--------+
                             | Top 50 candidates
                             v
                    +--------------------+
                    | Cross-Encoder      |
                    | Re-ranking         |
                    | (GPU-accelerated)  |
                    +---------+----------+
                              | Top 5 ranked docs
                              v
                    +--------------------+
                    | LLM (answers using |
                    | retrieved context) |
                    +---------+----------+
                              |
                              v
                    +--------------------+
                    |    Final Answer    |
                    +--------------------+

    Latency Breakdown and Bottlenecks

    A typical request flow might have the following latencies:

  • BM25 Search (p95): 20-50ms
  • Vector Search (p95): 30-100ms (highly dependent on index size and hardware)
  • RRF Fusion: <1ms
  • Cross-Encoder Re-ranking (50 candidates):
    - On CPU: 200-800ms (the major bottleneck)
    - On GPU (e.g., T4/L4): 40-100ms

    Clearly, the cross-encoder is the primary performance concern. Here are several production strategies to mitigate this:

  • GPU Acceleration: This is the most effective solution. Deploying the re-ranking model on a service with GPU access (like AWS SageMaker, Google Vertex AI, or a custom Kubernetes cluster with GPU nodes) is essential for interactive applications. The parallel processing power of a GPU is perfectly suited for scoring the candidate pairs.
  • Model Quantization: Quantization reduces the precision of the model's weights (e.g., from FP32 to INT8). This leads to a smaller model size, lower memory usage, and significantly faster inference, often with a negligible impact on accuracy. Libraries like Optimum (from Hugging Face) can be used to apply ONNX Runtime quantization.

    python
    # Conceptual example of applying ONNX Runtime dynamic quantization via Optimum
    from optimum.onnxruntime import ORTQuantizer
    from optimum.onnxruntime.configuration import AutoQuantizationConfig

    # 1. Export the cross-encoder to ONNX format
    # ... (code to export the cross-encoder to ONNX, e.g. into ./onnx_cross_encoder)

    # 2. Quantize the exported model
    onnx_model_dir = "./onnx_cross_encoder"  # directory containing the exported ONNX model
    quantized_model_dir = "./onnx_cross_encoder_quantized"
    quantizer = ORTQuantizer.from_pretrained(onnx_model_dir)
    dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
    quantizer.quantize(save_dir=quantized_model_dir, quantization_config=dqconfig)
  • Dynamic Candidate Sizing: Don't always re-rank 50 candidates. If the RRF scores show a very steep drop-off after the top 10 results, it's a strong signal that the remaining candidates are likely irrelevant. You can dynamically adjust the number of documents sent to the re-ranker based on the score distribution, saving compute on easy queries (a sketch follows this list).
  • Asynchronous Execution & Streaming: For applications that can tolerate slightly higher perceived latency, consider a streaming approach. Immediately return the top result from the faster hybrid search. Then, run the re-ranking asynchronously and update the user's view or stream back the more refined results once they are ready. This improves the time-to-first-token for the user.
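
    To make the dynamic candidate sizing idea concrete, here is a minimal sketch of a helper that trims the fused candidate list once RRF scores drop well below the top score. The cutoff ratio and minimum count are illustrative knobs, not tuned values.

    python
    # Sketch: trim RRF candidates before re-ranking when scores fall off sharply.
    # `sorted_fused_docs` is the (doc_id, rrf_score) list produced by the fusion
    # step, already sorted by score in descending order.
    def select_rerank_candidates(sorted_fused_docs, max_candidates=50,
                                 min_candidates=10, cutoff_ratio=0.5):
        if not sorted_fused_docs:
            return []
        top_score = sorted_fused_docs[0][1]
        selected = []
        for doc_id, score in sorted_fused_docs[:max_candidates]:
            # Keep at least min_candidates, then stop once the score drops
            # below cutoff_ratio of the best score.
            if len(selected) >= min_candidates and score < cutoff_ratio * top_score:
                break
            selected.append(doc_id)
        return selected

    In HybridRetrieverWithReranker.retrieve, this helper would replace the fixed sorted_fused_docs[:hybrid_top_k] slice, so easy queries send far fewer pairs to the cross-encoder.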

    Part 4: Advanced Edge Cases and Failure Modes

    A robust production system must gracefully handle edge cases.

  • Contradictory Signals: What if BM25 returns a set of documents {A, B, C} and vector search returns a completely different set {X, Y, Z}? RRF handles this gracefully. Since no document appears in both lists, the final ranking will be an interleaving of the two lists based on their original ranks, which is a reasonable fallback.
  • Out-of-Domain Queries: If a user asks a question completely unrelated to the document corpus, the system should not hallucinate. In our pipeline, this failure mode is more graceful. Both BM25 and vector search will return low scores, and the cross-encoder, when presented with irrelevant pairs, will also produce very low relevance scores. You can set a final threshold on the cross-encoder's output score. If the top-ranked document is below this threshold (e.g., < 0.1 for a model that outputs probabilities; ms-marco cross-encoders emit raw logits, so calibrate the cutoff empirically on your data), you can return a canned response like "I could not find a relevant answer in my knowledge base" instead of feeding garbage to the LLM.
  • Re-ranker Bias: The pre-trained ms-marco cross-encoder is excellent for general-purpose web and question-answering documents. However, if your corpus is highly specialized (e.g., legal contracts, scientific papers, source code), the re-ranker's notion of relevance may not align with your domain. In this scenario, the ultimate optimization is to fine-tune the cross-encoder model. This involves creating a small, high-quality dataset of (query, relevant_passage, irrelevant_passage) triplets from your own domain and training the model for a few epochs. This can lead to a dramatic increase in domain-specific relevance.
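
    For teams that do reach the fine-tuning stage, the sketch below shows the general shape of that training loop using the sentence-transformers 2.x-style CrossEncoder.fit API. The two training examples are placeholders; a real dataset would contain thousands of domain-specific relevance judgments.

    python
    from torch.utils.data import DataLoader
    from sentence_transformers import InputExample
    from sentence_transformers.cross_encoder import CrossEncoder

    # Placeholder training data: (query, passage) pairs with binary relevance labels.
    train_samples = [
        InputExample(texts=["postgres index tuning",
                            "Optimizing PostgreSQL queries can be achieved through partial indexing."],
                     label=1.0),
        InputExample(texts=["postgres index tuning",
                            "The new H100 GPU provides significant performance gains for LLM training."],
                     label=0.0),
    ]

    model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', num_labels=1)
    train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)

    # A few epochs over a small, high-quality dataset is usually enough to shift
    # the model's notion of relevance toward the target domain.
    model.fit(
        train_dataloader=train_dataloader,
        epochs=2,
        warmup_steps=10,
        output_path="./cross-encoder-finetuned",
    )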

    Conclusion: From Demo to Production-Ready RAG

    Moving a RAG system from a proof-of-concept to a reliable production service requires a fundamental shift from a single-stage, naive retrieval process to a sophisticated, multi-stage pipeline. By combining the lexical precision of BM25 with the semantic power of dense vectors through Reciprocal Rank Fusion, we create a robust candidate generation system. By then applying a computationally intensive but highly accurate cross-encoder for re-ranking, we solve the critical 'last mile' problem, ensuring that the documents passed to the LLM are of the highest possible relevance.

    This two-stage architecture, coupled with performance optimizations like GPU acceleration and quantization, represents a mature, production-grade pattern. It directly addresses the common failure modes of basic RAG, resulting in fewer hallucinations, more accurate answers, and a system that engineers can confidently deploy and users can genuinely trust.
