Production RAG: Hybrid Search & Cross-Encoder Re-ranking on Kubernetes

20 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Failure of Naive RAG in Production

For any senior engineer tasked with deploying a Retrieval-Augmented Generation (RAG) system, the initial proof-of-concept is deceptively simple: embed documents, store them in a vector database, and perform a similarity search to find context for a Large Language Model (LLM). This works well for straightforward semantic queries. However, this naive approach collapses under the weight of real-world query diversity and performance requirements.

The primary failure mode is the inherent limitation of dense vector retrieval. Embedding models, while powerful, can struggle with queries dominated by specific keywords, acronyms, or identifiers. A query like "How to fix error Kube-Err-503 in v1.28.2?" may fail because the semantic embedding of Kube-Err-503 is not distinct enough from other errors, while a traditional sparse retrieval method like BM25 would excel.

Conversely, a query like "Why is my pod frequently restarting without logging any errors?" is semantically rich and a perfect use case for vector search: answering it requires understanding that "restarting" relates to concepts like crash loops and container exits, relationships that exact keyword matching cannot capture.

Relying solely on one method forces a compromise. Increasing the top_k in a vector search to compensate for missed keywords floods the context window with irrelevant documents, increasing LLM inference cost and introducing noise that degrades the quality of the final generation. This is not a production-ready solution. A robust RAG pipeline must maximize recall at the retrieval stage and then aggressively optimize for precision before handing off to the LLM. This requires a multi-stage architecture.

This article details the design and implementation of such an architecture, focusing on three core components:

  • Hybrid Search: Combining sparse (BM25) and dense (vector) retrieval in parallel to capture both lexical and semantic relevance.
  • Result Fusion: Implementing Reciprocal Rank Fusion (RRF) to intelligently merge disparate result sets without needing to normalize incomparable scores.
  • Cross-Encoder Re-ranking: Deploying a dedicated microservice with a more powerful (but slower) cross-encoder model to re-rank the fused candidates for maximum precision.

We will deploy this entire system on Kubernetes, architecting it for scalability, observability, and independent resource management.


    Architecting the Hybrid Search Retriever

    The goal of the initial retrieval stage is to cast a wide net, ensuring the ground-truth document is highly likely to be in our candidate set. We achieve this by running two specialized retrievers in parallel and fusing their results.

    * Sparse Retriever: We'll use Elasticsearch with a standard BM25 index. This is optimized for lexical matches—keywords, product IDs, error codes, and specific jargon.

    * Dense Retriever: A vector database like Weaviate, Milvus, or Pinecone, populated with embeddings from a model like all-mpnet-base-v2 or a fine-tuned equivalent. This is optimized for semantic or conceptual matches.
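
    Field naming matters for the code that follows. Below is a minimal bootstrap sketch for the sparse side, assuming the elasticsearch-py 8.x client; the index name and the doc_id/content fields match what the retriever code expects (the dense side needs the same doc_id and content properties on its Weaviate class), while analyzer and similarity settings are left at the Elasticsearch defaults (BM25).

    python
    # Illustrative index bootstrap for the sparse retriever (elasticsearch-py 8.x).
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://elasticsearch:9200")
    es.indices.create(
        index="documents",
        mappings={
            "properties": {
                "doc_id": {"type": "keyword"},   # exact-match identifier
                "content": {"type": "text"},     # analyzed full text, scored with BM25
            }
        },
    )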

    The Fusion Problem: Why Score Normalization Fails

    A common but flawed approach to merging results is to normalize the scores from both systems (e.g., to a 0-1 range) and then interleave them. This is problematic because BM25 scores and cosine similarity scores are not directly comparable. A BM25 score of 20.5 has no inherent relationship to a cosine similarity of 0.89. Normalization is sensitive to the distribution of scores in a given result set, making it unstable. A single outlier can skew the entire ranking.
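
    For example, with min-max normalization a BM25 result set of {24.1, 5.2, 5.0} maps the score 5.2 to roughly 0.01, while a result set of {6.0, 5.2, 5.0} maps that very same score to 0.2; the "normalized relevance" of an identical document depends entirely on what else happened to match the query.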

    Solution: Reciprocal Rank Fusion (RRF)

    RRF provides a simple, effective, and score-agnostic way to combine ranked lists. It disregards the absolute scores and focuses solely on the rank of each document in its respective list. The RRF score for a document is calculated by summing the reciprocal of its rank across all lists.

    Formula: RRF_Score(d) = Σ(1 / (k + rank_i(d)))

    Where:

    * d is the document.

    * rank_i(d) is the rank of document d in result list i.

    * k is a constant (typically 60) that dampens the influence of high ranks, preventing a single top result from dominating.
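
    For example, with k = 60, a document ranked 1st by BM25 and 3rd by the vector search scores 1/(60+1) + 1/(60+3) ≈ 0.0323, while a document that appears in only one list, even at rank 1, scores 1/(60+1) ≈ 0.0164. Agreement between the two retrievers is rewarded without ever comparing raw scores.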

    Let's implement a HybridRetriever in Python that queries Elasticsearch and a vector database (using the Weaviate client as an example) in parallel using asyncio and then applies RRF.

    python
    import asyncio
    import weaviate
    from elasticsearch import AsyncElasticsearch
    from collections import defaultdict
    
    # --- Configuration ---
    WEAVIATE_URL = "http://weaviate:8080"
    ES_URL = "http://elasticsearch:9200"
    WEAVIATE_CLASS_NAME = "Document"
    ES_INDEX_NAME = "documents"
    RRF_K = 60
    
    class HybridRetriever:
        def __init__(self):
            self.weaviate_client = weaviate.Client(WEAVIATE_URL)
            self.es_client = AsyncElasticsearch([ES_URL])
    
        async def _dense_search(self, query_text: str, top_k: int):
            # In a real app, you'd get the vector from a dedicated embedding service.
            # For this example, we assume Weaviate has a text2vec module.
            # The Weaviate v3 Python client is synchronous, so run the query in a
            # worker thread; otherwise it would block the event loop and defeat
            # the parallel asyncio.gather in search().
            def _query():
                return (
                    self.weaviate_client.query
                    .get(WEAVIATE_CLASS_NAME, ["doc_id", "content"])
                    .with_near_text({"concepts": [query_text]})
                    .with_limit(top_k)
                    .do()
                )

            response = await asyncio.to_thread(_query)
            results = response['data']['Get'][WEAVIATE_CLASS_NAME]
            return [{"id": item['doc_id'], "content": item['content']} for item in results]
    
        async def _sparse_search(self, query_text: str, top_k: int):
            response = await self.es_client.search(
                index=ES_INDEX_NAME,
                body={
                    "size": top_k,
                    "query": {
                        "match": {
                            "content": query_text
                        }
                    }
                }
            )
            
            results = response['hits']['hits']
            return [{ "id": hit['_source']['doc_id'], "content": hit['_source']['content'] } for hit in results]
    
        def _reciprocal_rank_fusion(self, ranked_lists: list[list[dict]], k: int = 60) -> list[dict]:
            rrf_scores = defaultdict(float)
            doc_content_map = {}
    
            for ranked_list in ranked_lists:
                for i, doc in enumerate(ranked_list):
                    doc_id = doc['id']
                    if doc_id not in doc_content_map:
                        doc_content_map[doc_id] = doc['content']
                    
                    rank = i + 1
                    rrf_scores[doc_id] += 1 / (k + rank)
    
            sorted_docs = sorted(rrf_scores.items(), key=lambda item: item[1], reverse=True)
            
            fused_results = []
            for doc_id, score in sorted_docs:
                fused_results.append({
                    "id": doc_id,
                    "content": doc_content_map[doc_id],
                    "rrf_score": score
                })
            
            return fused_results
    
        async def search(self, query_text: str, top_k: int = 50) -> list[dict]:
            """Performs parallel dense and sparse search, then fuses the results."""
            dense_task = self._dense_search(query_text, top_k)
            sparse_task = self._sparse_search(query_text, top_k)
    
            dense_results, sparse_results = await asyncio.gather(dense_task, sparse_task)
    
            fused_results = self._reciprocal_rank_fusion([dense_results, sparse_results], k=RRF_K)
            
            return fused_results
    
    # --- Example Usage ---
    async def main():
        retriever = HybridRetriever()
        query = "Kubernetes pod crashlooping without logs"
        
        # We retrieve a larger set of candidates (e.g., 50) for the re-ranker to process
        candidates = await retriever.search(query, top_k=50)
        
        print(f"Found {len(candidates)} candidates after RRF fusion.")
        for i, doc in enumerate(candidates[:5]):
            print(f"{i+1}. [ID: {doc['id']}] Score: {doc['rrf_score']:.4f} | Content: {doc['content'][:100]}...")
    
    if __name__ == "__main__":
        # Note: Requires running Elasticsearch and Weaviate instances
        # and populating them with documents having 'doc_id' and 'content' fields.
        # asyncio.run(main())
        pass
    

    This implementation fetches up to 100 candidate documents (50 from each retriever), deduplicates the overlap during fusion, and produces a single relevance-ranked list. This list now has high recall but may contain noise. The next step is to prune this list with a high-precision re-ranking model.


    The Re-ranking Stage: From Recall to Precision

    Hybrid search gives us a rich set of candidate documents. Now we need to re-order them with much higher accuracy. The models used for retrieval (bi-encoders) are optimized for speed over a massive corpus. They encode the query and documents into vectors independently. For re-ranking, we can afford a more computationally expensive but far more accurate model: a cross-encoder.

    A cross-encoder takes both the query and a candidate document as a single input and outputs a score representing their relevance. This joint processing allows the model to perform deep attention across the query and document tokens, capturing nuances that bi-encoders miss.

    We'll use a model from the sentence-transformers library, such as ms-marco-MiniLM-L-6-v2, which is specifically trained for re-ranking tasks.
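
    To make the joint scoring concrete, here is a minimal standalone sketch using the sentence-transformers CrossEncoder API; the query and passages are illustrative.

    python
    from sentence_transformers.cross_encoder import CrossEncoder

    # The cross-encoder reads the query and each passage together in a single
    # forward pass and emits one relevance score per pair.
    model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', max_length=512)

    query = "Why is my pod frequently restarting without logging any errors?"
    passages = [
        "A container that exits before flushing stdout can crashloop with empty logs.",
        "BM25 is a bag-of-words ranking function used by search engines.",
    ]

    scores = model.predict([[query, p] for p in passages])
    for passage, score in sorted(zip(passages, scores), key=lambda x: x[1], reverse=True):
        print(f"{score:.3f}  {passage}")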

    Architectural Decision: The Re-ranker as a Microservice

    A critical production decision is where to run the cross-encoder. Do not run it in the same service as the retrieval logic. Here's why:

  • Different Scaling Needs: Retrieval is I/O-bound (network calls to databases). Re-ranking is CPU/GPU-bound (model inference). A spike in traffic might require scaling up retrievers, but not re-rankers, or vice-versa. Co-locating them forces you to over-provision one resource to satisfy the other.
  • Specialized Hardware: Cross-encoders benefit immensely from GPUs. By deploying the re-ranker as a separate microservice, you can schedule its pods onto dedicated GPU nodes in your Kubernetes cluster, while the orchestrator and retrieval logic run on cheaper CPU nodes.
  • Resilience and Independent Deployment: You can update, canary release, or roll back the re-ranking model and service independently of the main RAG pipeline.

    Here is a simple FastAPI implementation for our re-ranker microservice:

    python
    # reranker_service/main.py
    from fastapi import FastAPI
    from pydantic import BaseModel
    from sentence_transformers.cross_encoder import CrossEncoder
    import torch
    
    # --- Model Loading ---
    # This happens once on startup.
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
    MODEL_NAME = 'cross-encoder/ms-marco-MiniLM-L-6-v2'
    print(f"Loading CrossEncoder model on device: {DEVICE}")
    model = CrossEncoder(MODEL_NAME, max_length=512, device=DEVICE)
    
    app = FastAPI()
    
    class RerankRequest(BaseModel):
        query: str
        documents: list[dict] # Expects [{'id': str, 'content': str}, ...]
    
    class RerankedDocument(BaseModel):
        id: str
        content: str
        rerank_score: float
    
    class RerankResponse(BaseModel):
        results: list[RerankedDocument]
    
    @app.post("/rerank", response_model=RerankResponse)
    def rerank(request: RerankRequest):
        # Create pairs of [query, document_content] for the model
        model_input_pairs = [[request.query, doc['content']] for doc in request.documents]
        
        # Predict scores
        scores = model.predict(model_input_pairs, convert_to_numpy=True)
        
        # Combine documents with their new scores
        for doc, score in zip(request.documents, scores):
            doc['rerank_score'] = float(score)
        
        # Sort documents by the new score in descending order
        reranked_docs = sorted(request.documents, key=lambda x: x['rerank_score'], reverse=True)
        
        return RerankResponse(results=reranked_docs)
    
    # Health check endpoint
    @app.get("/health")
    def health():
        return {"status": "ok", "device": DEVICE}
    

    This service exposes a single /rerank endpoint. The main RAG orchestrator will now call this service after the fusion step.
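
    The orchestrator needs a small HTTP client for this endpoint. Below is a minimal sketch using httpx; the RerankerClient name and rerank() signature match what the orchestrator code later in this article assumes, and retries and error handling are kept to the essentials.

    python
    # orchestrator/reranker_client.py
    import httpx

    class RerankerClient:
        """Thin async client for the re-ranker service's /rerank endpoint."""

        def __init__(self, base_url: str, timeout_seconds: float = 30.0):
            self.base_url = base_url.rstrip("/")
            self.timeout = timeout_seconds

        async def rerank(self, query: str, documents: list[dict]) -> list[dict]:
            # documents: [{"id": ..., "content": ...}, ...] straight from RRF fusion
            async with httpx.AsyncClient(timeout=self.timeout) as client:
                response = await client.post(
                    f"{self.base_url}/rerank",
                    json={"query": query, "documents": documents},
                )
                response.raise_for_status()
                # The service returns documents sorted by rerank_score, best first.
                return response.json()["results"]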


    Production Deployment on Kubernetes

    Now, let's tie everything together in a Kubernetes environment. Our architecture consists of several components:

    * RAG Orchestrator: The main service that receives the user query, calls the HybridRetriever, then the RerankerService, and finally the LLM.

    * Reranker Service: The dedicated microservice we just defined, potentially running on GPU nodes.

    * Elasticsearch & Weaviate: Deployed as StatefulSets within the cluster.

    * LLM Service: An external API (like OpenAI) or a self-hosted model (like Llama 2 via TGI), accessed via its own service.

    * Redis: A StatefulSet for distributed caching.

    System Flow Diagram

    Here's the request flow:

    mermaid
    graph TD
        A[User Request] --> B(API Gateway)
        B --> C{RAG Orchestrator Service}
        C --> D(Redis Cache Check)
        D -- Cache Miss --> E{Parallel Fan-out}
        E --> F[Sparse Search: Elasticsearch Pod]
        E --> G[Dense Search: Weaviate Pod]
        F --> H{"Result Fusion (RRF)"}
        G --> H
        H --> I[Reranker Service Call]
        I --> J(Reranker Pod on GPU Node)
        J --> I
        I --> K(Redis Cache Write)
        I --> L{LLM Service Call}
        L --> M[LLM Inference Pod/API]
        M --> L
        L --> N[Format Response]
        N --> B
        D -- Cache Hit --> L

    Kubernetes Manifests for the Re-ranker Service

    Deploying the re-ranker service requires a Deployment and a Service. The key is specifying resource requests/limits and potentially a nodeSelector or toleration to target GPU nodes.

    reranker-deployment.yaml:

    yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: reranker-deployment
      labels:
        app: reranker
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: reranker
      template:
        metadata:
          labels:
            app: reranker
        spec:
          # Use a node selector to target nodes with GPUs
          nodeSelector:
            nvidia.com/gpu.product: NVIDIA-A10G
          # Or use tolerations if your GPU nodes are tainted
          tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
          containers:
          - name: reranker
            image: your-registry/reranker-service:latest
            ports:
            - containerPort: 80
            resources:
              requests:
                cpu: "1"
                memory: "4Gi"
                nvidia.com/gpu: "1"
              limits:
                cpu: "2"
                memory: "8Gi"
                nvidia.com/gpu: "1"
            livenessProbe:
              httpGet:
                path: /health
                port: 80
              initialDelaySeconds: 30
              periodSeconds: 10
            readinessProbe:
              httpGet:
                path: /health
                port: 80
              initialDelaySeconds: 15
              periodSeconds: 5
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: reranker-service
    spec:
      selector:
        app: reranker
      ports:
        - protocol: TCP
          port: 80
          targetPort: 80
      type: ClusterIP

    This manifest ensures the re-ranker pods are scheduled on nodes with NVIDIA-A10G GPUs and are allocated one GPU each. The orchestrator can now reach it at the internal DNS name http://reranker-service.
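
    To absorb traffic spikes, the Deployment can be paired with a HorizontalPodAutoscaler. The sketch below is illustrative: the CPU target and replica bounds are assumptions, and for GPU-backed inference a custom metric such as request concurrency is often a better scaling signal than CPU.

    yaml
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: reranker-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: reranker-deployment
      minReplicas: 1        # keep at least one warm pod; model loading makes cold starts expensive
      maxReplicas: 6
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70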

    Performance Optimization: Caching

    This multi-stage pipeline introduces latency: the retrieval, re-ranking, and LLM stages each add to the end-to-end response time, so caching is essential. The most effective place to cache is after the re-ranking step, which captures the output of the most computationally expensive parts of the retrieval pipeline.

    Here's how to modify the orchestrator to use a Redis cache:

    python
    # In the RAG Orchestrator Service
    import redis.asyncio as redis  # async client so cache I/O does not block the event loop
    import json
    import hashlib
    
    REDIS_HOST = "redis"
    REDIS_PORT = 6379
    CACHE_EXPIRATION_SECONDS = 3600 # 1 hour
    
    class RAGOrchestrator:
        def __init__(self):
            self.retriever = HybridRetriever()
            # Assume reranker_client is an HTTP client for the reranker service
            self.reranker_client = RerankerClient("http://reranker-service") 
            self.redis_client = redis.Redis(host=REDIS_HOST, port=REDIS_PORT, db=0)
    
        def _get_query_cache_key(self, query: str, top_k: int) -> str:
            # Hash the query and params to create a stable key
            key_string = f"{query}:{top_k}"
            return f"rag_cache:{hashlib.md5(key_string.encode()).hexdigest()}"
    
        async def process_query(self, query: str):
            # 1. Check cache first
            cache_key = self._get_query_cache_key(query, top_k=5) # Cache top 5 results
            cached_result = await self.redis_client.get(cache_key)
            if cached_result:
                print("CACHE HIT")
                context_docs = json.loads(cached_result)
            else:
                print("CACHE MISS")
                # 2. Retrieve candidates
                candidates = await self.retriever.search(query, top_k=50)
                
                # 3. Re-rank candidates
                reranked_docs = await self.reranker_client.rerank(query, candidates)
                
                # 4. Select top N for context and cache them
                context_docs = reranked_docs[:5]
                await self.redis_client.set(
                    cache_key, 
                    json.dumps(context_docs), 
                    ex=CACHE_EXPIRATION_SECONDS
                )
            
            # 5. Generate response with LLM using context_docs
            # ... llm_call(query, context_docs) ...
            pass

    Advanced Edge Cases and Tuning

    Deploying this system is just the beginning. Continuous tuning and monitoring are required.

    * The k Parameter Dilemma: The top_k for the initial retrieval and the final top_n sent to the LLM are critical tuning parameters. A larger initial k (e.g., 100) increases the load on the re-ranker service but raises the probability of finding the ground truth (higher recall). A smaller k is faster but riskier. This is a classic latency vs. accuracy trade-off that must be tuned based on SLOs and user feedback.

    * Re-ranker Model Choice: The MiniLM-based model is fast and effective, but for domains requiring extreme precision, a larger cross-encoder (like a DeBERTa-v3 variant) might be necessary. This would likely require a more powerful GPU (e.g., A100) and further highlights the value of the microservice architecture, as you can upgrade the hardware for just this component.

    * Cold Starts and Autoscaling: Every newly scheduled re-ranker pod must load the cross-encoder before it can serve traffic, so the first requests after a scale-up incur a significant latency penalty. A standard Kubernetes Horizontal Pod Autoscaler (HPA) will not go below minReplicas (and cannot scale to zero without additional machinery such as KEDA or Knative), so keep minReplicas high enough to always have warm pods available and rely on the readiness probe to hold traffic from a new pod until its model is loaded. For more advanced scenarios, consider KServe or a similar model serving framework that can perform intelligent model loading and offloading.

    * Observability is Non-Negotiable: With a multi-service pipeline, identifying bottlenecks is impossible without distributed tracing. Instrument your code with OpenTelemetry to create spans for each major step: hybrid_retrieval, sparse_search, dense_search, rrf_fusion, rerank_call, and llm_generation. This will allow you to visualize the latency contribution of each component in tools like Jaeger or Honeycomb and pinpoint exactly where to optimize.
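
    As a minimal illustration, the orchestrator's major steps can be wrapped in OpenTelemetry spans. Exporter and tracer-provider setup are omitted, and the helper below assumes the RAGOrchestrator fields defined earlier; nested spans for sparse_search, dense_search, and rrf_fusion would live inside HybridRetriever itself.

    python
    # In the RAG Orchestrator Service
    from opentelemetry import trace

    tracer = trace.get_tracer("rag.orchestrator")

    async def traced_pipeline(orchestrator, query: str):
        with tracer.start_as_current_span("rag.process_query"):
            with tracer.start_as_current_span("hybrid_retrieval"):
                candidates = await orchestrator.retriever.search(query, top_k=50)
            with tracer.start_as_current_span("rerank_call"):
                reranked = await orchestrator.reranker_client.rerank(query, candidates)
            with tracer.start_as_current_span("llm_generation"):
                # ... llm_call(query, reranked[:5]) ...
                pass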

    Conclusion

    Moving from a simplistic RAG prototype to a production-grade system requires a fundamental architectural shift. By embracing a multi-stage pipeline that decouples retrieval from re-ranking, we build a system that is more accurate, scalable, and maintainable. Hybrid search with RRF maximizes recall by leveraging the complementary strengths of sparse and dense retrieval. A dedicated cross-encoder microservice, running on specialized hardware, then provides the surgical precision needed to feed clean, relevant context to the LLM. When deployed on Kubernetes with proper caching and observability, this architecture provides the robust foundation necessary to handle the complexity and performance demands of real-world AI applications.
