Production RAG: Hybrid Search & Cross-Encoder Re-ranking on Kubernetes

20 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Failure of Naive RAG in Production

For any senior engineer tasked with deploying a Retrieval-Augmented Generation (RAG) system, the initial proof-of-concept is deceptively simple: embed documents, store them in a vector database, and perform a similarity search to find context for a Large Language Model (LLM). This works well for straightforward semantic queries. However, this naive approach collapses under the weight of real-world query diversity and performance requirements.

The primary failure mode is the inherent limitation of dense vector retrieval. Embedding models, while powerful, can struggle with queries dominated by specific keywords, acronyms, or identifiers. A query like "How to fix error Kube-Err-503 in v1.28.2?" may fail because the semantic embedding of Kube-Err-503 is not distinct enough from other errors, while a traditional sparse retrieval method like BM25 would excel.

Conversely, a query like "Why is my pod frequently restarting without logging any errors?" is semantically rich and a perfect use case for vector search: answering it requires understanding that "restarting" relates to concepts like crash loops and container exits, relationships that exact keyword matching cannot capture.

Relying solely on one method forces a compromise. Increasing the top_k in a vector search to compensate for missed keywords floods the context window with irrelevant documents, increasing LLM inference cost and introducing noise that degrades the quality of the final generation. This is not a production-ready solution. A robust RAG pipeline must maximize recall at the retrieval stage and then aggressively optimize for precision before handing off to the LLM. This requires a multi-stage architecture.

This article details the design and implementation of such an architecture, focusing on three core components:

  • Hybrid Search: Combining sparse (BM25) and dense (vector) retrieval in parallel to capture both lexical and semantic relevance.
  • Result Fusion: Implementing Reciprocal Rank Fusion (RRF) to intelligently merge disparate result sets without needing to normalize incomparable scores.
  • Cross-Encoder Re-ranking: Deploying a dedicated microservice with a more powerful (but slower) cross-encoder model to re-rank the fused candidates for maximum precision.

We will deploy this entire system on Kubernetes, architecting it for scalability, observability, and independent resource management.


    Architecting the Hybrid Search Retriever

    The goal of the initial retrieval stage is to cast a wide net, ensuring the ground-truth document is highly likely to be in our candidate set. We achieve this by running two specialized retrievers in parallel and fusing their results.

    * Sparse Retriever: We'll use Elasticsearch with a standard BM25 index. This is optimized for lexical matches—keywords, product IDs, error codes, and specific jargon.

    * Dense Retriever: A vector database like Weaviate, Milvus, or Pinecone, populated with embeddings from a model like all-mpnet-base-v2 or a fine-tuned equivalent. This is optimized for semantic or conceptual matches.
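
    Field naming matters for the code that follows. Below is a minimal bootstrap sketch for the sparse side, assuming the elasticsearch-py 8.x client; the index name and the doc_id/content fields match what the retriever code expects (the dense side needs the same doc_id and content properties on its Weaviate class), while analyzer and similarity settings are left at the Elasticsearch defaults (BM25).

    python
    # Illustrative index bootstrap for the sparse retriever (elasticsearch-py 8.x).
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://elasticsearch:9200")
    es.indices.create(
        index="documents",
        mappings={
            "properties": {
                "doc_id": {"type": "keyword"},   # exact-match identifier
                "content": {"type": "text"},     # analyzed full text, scored with BM25
            }
        },
    )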

    The Fusion Problem: Why Score Normalization Fails

    A common but flawed approach to merging results is to normalize the scores from both systems (e.g., to a 0-1 range) and then interleave them. This is problematic because BM25 scores and cosine similarity scores are not directly comparable. A BM25 score of 20.5 has no inherent relationship to a cosine similarity of 0.89. Normalization is sensitive to the distribution of scores in a given result set, making it unstable. A single outlier can skew the entire ranking.
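
    For example, with min-max normalization a BM25 result set of {24.1, 5.2, 5.0} maps the score 5.2 to roughly 0.01, while a result set of {6.0, 5.2, 5.0} maps that very same score to 0.2; the "normalized relevance" of an identical document depends entirely on what else happened to match the query.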

    Solution: Reciprocal Rank Fusion (RRF)

    RRF provides a simple, effective, and score-agnostic way to combine ranked lists. It disregards the absolute scores and focuses solely on the rank of each document in its respective list. The RRF score for a document is calculated by summing the reciprocal of its rank across all lists.

    Formula: RRF_Score(d) = Σ(1 / (k + rank_i(d)))

    Where:

    * d is the document.

    * rank_i(d) is the rank of document d in result list i.

    * k is a constant (typically 60) that dampens the influence of high ranks, preventing a single top result from dominating.
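
    For example, with k = 60, a document ranked 1st by BM25 and 3rd by the vector search scores 1/(60+1) + 1/(60+3) ≈ 0.0323, while a document that appears in only one list, even at rank 1, scores 1/(60+1) ≈ 0.0164. Agreement between the two retrievers is rewarded without ever comparing raw scores.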

    Let's implement a HybridRetriever in Python that queries Elasticsearch and a vector database (using the Weaviate client as an example) in parallel using asyncio and then applies RRF.

    python
    import asyncio
    import weaviate
    from elasticsearch import AsyncElasticsearch
    from collections import defaultdict
    
    # --- Configuration ---
    WEAVIATE_URL = "http://weaviate:8080"
    ES_URL = "http://elasticsearch:9200"
    WEAVIATE_CLASS_NAME = "Document"
    ES_INDEX_NAME = "documents"
    RRF_K = 60
    
    class HybridRetriever:
        def __init__(self):
            self.weaviate_client = weaviate.Client(WEAVIATE_URL)
            self.es_client = AsyncElasticsearch([ES_URL])
    
        async def _dense_search(self, query_text: str, top_k: int):
            # In a real app, you'd get the vector from a dedicated embedding service.
            # For this example, we assume Weaviate has a text2vec module.
            # The Weaviate v3 Python client is synchronous, so run the query in a
            # worker thread; otherwise it would block the event loop and defeat
            # the parallel asyncio.gather in search().
            def _query():
                return (
                    self.weaviate_client.query
                    .get(WEAVIATE_CLASS_NAME, ["doc_id", "content"])
                    .with_near_text({"concepts": [query_text]})
                    .with_limit(top_k)
                    .do()
                )

            response = await asyncio.to_thread(_query)
            results = response['data']['Get'][WEAVIATE_CLASS_NAME]
            return [{"id": item['doc_id'], "content": item['content']} for item in results]
    
        async def _sparse_search(self, query_text: str, top_k: int):
            response = await self.es_client.search(
                index=ES_INDEX_NAME,
                body={
                    "size": top_k,
                    "query": {
                        "match": {
                            "content": query_text
                        }
                    }
                }
            )
            
            results = response['hits']['hits']
            return [{ "id": hit['_source']['doc_id'], "content": hit['_source']['content'] } for hit in results]
    
        def _reciprocal_rank_fusion(self, ranked_lists: list[list[dict]], k: int = 60) -> list[dict]:
            rrf_scores = defaultdict(float)
            doc_content_map = {}
    
            for ranked_list in ranked_lists:
                for i, doc in enumerate(ranked_list):
                    doc_id = doc['id']
                    if doc_id not in doc_content_map:
                        doc_content_map[doc_id] = doc['content']
                    
                    rank = i + 1
                    rrf_scores[doc_id] += 1 / (k + rank)
    
            sorted_docs = sorted(rrf_scores.items(), key=lambda item: item[1], reverse=True)
            
            fused_results = []
            for doc_id, score in sorted_docs:
                fused_results.append({
                    "id": doc_id,
                    "content": doc_content_map[doc_id],
                    "rrf_score": score
                })
            
            return fused_results
    
        async def search(self, query_text: str, top_k: int = 50) -> list[dict]:
            """Performs parallel dense and sparse search, then fuses the results."""
            dense_task = self._dense_search(query_text, top_k)
            sparse_task = self._sparse_search(query_text, top_k)
    
            dense_results, sparse_results = await asyncio.gather(dense_task, sparse_task)
    
            fused_results = self._reciprocal_rank_fusion([dense_results, sparse_results], k=RRF_K)
            
            return fused_results
    
    # --- Example Usage ---
    async def main():
        retriever = HybridRetriever()
        query = "Kubernetes pod crashlooping without logs"
        
        # We retrieve a larger set of candidates (e.g., 50) for the re-ranker to process
        candidates = await retriever.search(query, top_k=50)
        
        print(f"Found {len(candidates)} candidates after RRF fusion.")
        for i, doc in enumerate(candidates[:5]):
            print(f"{i+1}. [ID: {doc['id']}] Score: {doc['rrf_score']:.4f} | Content: {doc['content'][:100]}...")
    
    if __name__ == "__main__":
        # Note: Requires running Elasticsearch and Weaviate instances
        # and populating them with documents having 'doc_id' and 'content' fields.
        # asyncio.run(main())
        pass
    

    This implementation fetches up to 100 candidate documents (50 from each retriever), deduplicates the overlap during fusion, and produces a single relevance-ranked list. This list now has high recall but may contain noise. The next step is to prune this list with a high-precision re-ranking model.


    The Re-ranking Stage: From Recall to Precision

    Hybrid search gives us a rich set of candidate documents. Now we need to re-order them with much higher accuracy. The models used for retrieval (bi-encoders) are optimized for speed over a massive corpus. They encode the query and documents into vectors independently. For re-ranking, we can afford a more computationally expensive but far more accurate model: a cross-encoder.

    A cross-encoder takes both the query and a candidate document as a single input and outputs a score representing their relevance. This joint processing allows the model to perform deep attention across the query and document tokens, capturing nuances that bi-encoders miss.

    We'll use a model from the sentence-transformers library, such as ms-marco-MiniLM-L-6-v2, which is specifically trained for re-ranking tasks.
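
    To make the joint scoring concrete, here is a minimal standalone sketch using the sentence-transformers CrossEncoder API; the query and passages are illustrative.

    python
    from sentence_transformers.cross_encoder import CrossEncoder

    # The cross-encoder reads the query and each passage together in a single
    # forward pass and emits one relevance score per pair.
    model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', max_length=512)

    query = "Why is my pod frequently restarting without logging any errors?"
    passages = [
        "A container that exits before flushing stdout can crashloop with empty logs.",
        "BM25 is a bag-of-words ranking function used by search engines.",
    ]

    scores = model.predict([[query, p] for p in passages])
    for passage, score in sorted(zip(passages, scores), key=lambda x: x[1], reverse=True):
        print(f"{score:.3f}  {passage}")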

    Architectural Decision: The Re-ranker as a Microservice

    A critical production decision is where to run the cross-encoder. Do not run it in the same service as the retrieval logic. Here's why:

  • Different Scaling Needs: Retrieval is I/O-bound (network calls to databases). Re-ranking is CPU/GPU-bound (model inference). A spike in traffic might require scaling up retrievers, but not re-rankers, or vice-versa. Co-locating them forces you to over-provision one resource to satisfy the other.
  • Specialized Hardware: Cross-encoders benefit immensely from GPUs. By deploying the re-ranker as a separate microservice, you can schedule its pods onto dedicated GPU nodes in your Kubernetes cluster, while the orchestrator and retrieval logic run on cheaper CPU nodes.
  • Resilience and Independent Deployment: You can update, canary release, or roll back the re-ranking model and service independently of the main RAG pipeline.

    Here is a simple FastAPI implementation for our re-ranker microservice:

    python
    # reranker_service/main.py
    from fastapi import FastAPI
    from pydantic import BaseModel
    from sentence_transformers.cross_encoder import CrossEncoder
    import torch
    
    # --- Model Loading ---
    # This happens once on startup.
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
    MODEL_NAME = 'cross-encoder/ms-marco-MiniLM-L-6-v2'
    print(f"Loading CrossEncoder model on device: {DEVICE}")
    model = CrossEncoder(MODEL_NAME, max_length=512, device=DEVICE)
    
    app = FastAPI()
    
    class RerankRequest(BaseModel):
        query: str
        documents: list[dict] # Expects [{'id': str, 'content': str}, ...]
    
    class RerankedDocument(BaseModel):
        id: str
        content: str
        rerank_score: float
    
    class RerankResponse(BaseModel):
        results: list[RerankedDocument]
    
    @app.post("/rerank", response_model=RerankResponse)
    def rerank(request: RerankRequest):
        # Create pairs of [query, document_content] for the model
        model_input_pairs = [[request.query, doc['content']] for doc in request.documents]
        
        # Predict scores
        scores = model.predict(model_input_pairs, convert_to_numpy=True)
        
        # Combine documents with their new scores
        for doc, score in zip(request.documents, scores):
            doc['rerank_score'] = float(score)
        
        # Sort documents by the new score in descending order
        reranked_docs = sorted(request.documents, key=lambda x: x['rerank_score'], reverse=True)
        
        return RerankResponse(results=reranked_docs)
    
    # Health check endpoint
    @app.get("/health")
    def health():
        return {"status": "ok", "device": DEVICE}
    

    This service exposes a single /rerank endpoint. The main RAG orchestrator will now call this service after the fusion step.
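
    The orchestrator needs a small HTTP client for this endpoint. Below is a minimal sketch using httpx; the RerankerClient name and rerank() signature match what the orchestrator code later in this article assumes, and retries and error handling are kept to the essentials.

    python
    # orchestrator/reranker_client.py
    import httpx

    class RerankerClient:
        """Thin async client for the re-ranker service's /rerank endpoint."""

        def __init__(self, base_url: str, timeout_seconds: float = 30.0):
            self.base_url = base_url.rstrip("/")
            self.timeout = timeout_seconds

        async def rerank(self, query: str, documents: list[dict]) -> list[dict]:
            # documents: [{"id": ..., "content": ...}, ...] straight from RRF fusion
            async with httpx.AsyncClient(timeout=self.timeout) as client:
                response = await client.post(
                    f"{self.base_url}/rerank",
                    json={"query": query, "documents": documents},
                )
                response.raise_for_status()
                # The service returns documents sorted by rerank_score, best first.
                return response.json()["results"]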


    Production Deployment on Kubernetes

    Now, let's tie everything together in a Kubernetes environment. Our architecture consists of several components:

    * RAG Orchestrator: The main service that receives the user query, calls the HybridRetriever, then the RerankerService, and finally the LLM.

    * Reranker Service: The dedicated microservice we just defined, potentially running on GPU nodes.

    * Elasticsearch & Weaviate: Deployed as StatefulSets within the cluster.

    * LLM Service: An external API (like OpenAI) or a self-hosted model (like Llama 2 via TGI), accessed via its own service.

    * Redis: A StatefulSet for distributed caching.

    System Flow Diagram

    Here's the request flow:

    mermaid
    graph TD
        A[User Request] --> B(API Gateway)
        B --> C{RAG Orchestrator Service}
        C --> D(Redis Cache Check)
        D -- Cache Miss --> E{Parallel Fan-out}
        E --> F[Sparse Search: Elasticsearch Pod]
        E --> G[Dense Search: Weaviate Pod]
        F --> H{"Result Fusion (RRF)"}
        G --> H
        H --> I[Reranker Service Call]
        I --> J(Reranker Pod on GPU Node)
        J --> I
        I --> K(Redis Cache Write)
        I --> L{LLM Service Call}
        L --> M[LLM Inference Pod/API]
        M --> L
        L --> N[Format Response]
        N --> B
        D -- Cache Hit --> L

    Kubernetes Manifests for the Re-ranker Service

    Deploying the re-ranker service requires a Deployment and a Service. The key is specifying resource requests/limits and potentially a nodeSelector or toleration to target GPU nodes.

    reranker-deployment.yaml:

    yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: reranker-deployment
      labels:
        app: reranker
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: reranker
      template:
        metadata:
          labels:
            app: reranker
        spec:
          # Use a node selector to target nodes with GPUs
          nodeSelector:
            nvidia.com/gpu.product: NVIDIA-A10G
          # Or use tolerations if your GPU nodes are tainted
          tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
          containers:
          - name: reranker
            image: your-registry/reranker-service:latest
            ports:
            - containerPort: 80
            resources:
              requests:
                cpu: "1"
                memory: "4Gi"
                nvidia.com/gpu: "1"
              limits:
                cpu: "2"
                memory: "8Gi"
                nvidia.com/gpu: "1"
            livenessProbe:
              httpGet:
                path: /health
                port: 80
              initialDelaySeconds: 30
              periodSeconds: 10
            readinessProbe:
              httpGet:
                path: /health
                port: 80
              initialDelaySeconds: 15
              periodSeconds: 5
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: reranker-service
    spec:
      selector:
        app: reranker
      ports:
        - protocol: TCP
          port: 80
          targetPort: 80
      type: ClusterIP

    This manifest ensures the re-ranker pods are scheduled on nodes with NVIDIA-A10G GPUs and are allocated one GPU each. The orchestrator can now reach it at the internal DNS name http://reranker-service.
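
    To absorb traffic spikes, the Deployment can be paired with a HorizontalPodAutoscaler. The sketch below is illustrative: the CPU target and replica bounds are assumptions, and for GPU-backed inference a custom metric such as request concurrency is often a better scaling signal than CPU.

    yaml
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: reranker-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: reranker-deployment
      minReplicas: 1        # keep at least one warm pod; model loading makes cold starts expensive
      maxReplicas: 6
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70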

    Performance Optimization: Caching

    This multi-stage pipeline introduces latency: the retrieval, re-ranking, and LLM stages each add to the end-to-end response time, so caching is essential. The most effective place to cache is after the re-ranking step, which captures the output of the most computationally expensive parts of the retrieval pipeline.

    Here's how to modify the orchestrator to use a Redis cache:

    python
    # In the RAG Orchestrator Service
    import redis.asyncio as redis  # async client so cache I/O does not block the event loop
    import json
    import hashlib
    
    REDIS_HOST = "redis"
    REDIS_PORT = 6379
    CACHE_EXPIRATION_SECONDS = 3600 # 1 hour
    
    class RAGOrchestrator:
        def __init__(self):
            self.retriever = HybridRetriever()
            # Assume reranker_client is an HTTP client for the reranker service
            self.reranker_client = RerankerClient("http://reranker-service") 
            self.redis_client = redis.Redis(host=REDIS_HOST, port=REDIS_PORT, db=0)
    
        def _get_query_cache_key(self, query: str, top_k: int) -> str:
            # Hash the query and params to create a stable key
            key_string = f"{query}:{top_k}"
            return f"rag_cache:{hashlib.md5(key_string.encode()).hexdigest()}"
    
        async def process_query(self, query: str):
            # 1. Check cache first
            cache_key = self._get_query_cache_key(query, top_k=5) # Cache top 5 results
            cached_result = await self.redis_client.get(cache_key)
            if cached_result:
                print("CACHE HIT")
                context_docs = json.loads(cached_result)
            else:
                print("CACHE MISS")
                # 2. Retrieve candidates
                candidates = await self.retriever.search(query, top_k=50)
                
                # 3. Re-rank candidates
                reranked_docs = await self.reranker_client.rerank(query, candidates)
                
                # 4. Select top N for context and cache them
                context_docs = reranked_docs[:5]
                await self.redis_client.set(
                    cache_key, 
                    json.dumps(context_docs), 
                    ex=CACHE_EXPIRATION_SECONDS
                )
            
            # 5. Generate response with LLM using context_docs
            # ... llm_call(query, context_docs) ...
            pass

    Advanced Edge Cases and Tuning

    Deploying this system is just the beginning. Continuous tuning and monitoring are required.

    * The k Parameter Dilemma: The top_k for the initial retrieval and the final top_n sent to the LLM are critical tuning parameters. A larger initial k (e.g., 100) increases the load on the re-ranker service but raises the probability of finding the ground truth (higher recall). A smaller k is faster but riskier. This is a classic latency vs. accuracy trade-off that must be tuned based on SLOs and user feedback.

    * Re-ranker Model Choice: The MiniLM-based model is fast and effective, but for domains requiring extreme precision, a larger cross-encoder (like a DeBERTa-v3 variant) might be necessary. This would likely require a more powerful GPU (e.g., A100) and further highlights the value of the microservice architecture, as you can upgrade the hardware for just this component.

    * Cold Starts and Autoscaling: Every newly scheduled re-ranker pod must load the cross-encoder before it can serve traffic, so the first requests after a scale-up incur a significant latency penalty. A standard Kubernetes Horizontal Pod Autoscaler (HPA) will not go below minReplicas (and cannot scale to zero without additional machinery such as KEDA or Knative), so keep minReplicas high enough to always have warm pods available and rely on the readiness probe to hold traffic from a new pod until its model is loaded. For more advanced scenarios, consider KServe or a similar model serving framework that can perform intelligent model loading and offloading.

    * Observability is Non-Negotiable: With a multi-service pipeline, identifying bottlenecks is impossible without distributed tracing. Instrument your code with OpenTelemetry to create spans for each major step: hybrid_retrieval, sparse_search, dense_search, rrf_fusion, rerank_call, and llm_generation. This will allow you to visualize the latency contribution of each component in tools like Jaeger or Honeycomb and pinpoint exactly where to optimize.
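
    As a minimal illustration, the orchestrator's major steps can be wrapped in OpenTelemetry spans. Exporter and tracer-provider setup are omitted, and the helper below assumes the RAGOrchestrator fields defined earlier; nested spans for sparse_search, dense_search, and rrf_fusion would live inside HybridRetriever itself.

    python
    # In the RAG Orchestrator Service
    from opentelemetry import trace

    tracer = trace.get_tracer("rag.orchestrator")

    async def traced_pipeline(orchestrator, query: str):
        with tracer.start_as_current_span("rag.process_query"):
            with tracer.start_as_current_span("hybrid_retrieval"):
                candidates = await orchestrator.retriever.search(query, top_k=50)
            with tracer.start_as_current_span("rerank_call"):
                reranked = await orchestrator.reranker_client.rerank(query, candidates)
            with tracer.start_as_current_span("llm_generation"):
                # ... llm_call(query, reranked[:5]) ...
                pass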

    Conclusion

    Moving from a simplistic RAG prototype to a production-grade system requires a fundamental architectural shift. By embracing a multi-stage pipeline that decouples retrieval from re-ranking, we build a system that is more accurate, scalable, and maintainable. Hybrid search with RRF maximizes recall by leveraging the complementary strengths of sparse and dense retrieval. A dedicated cross-encoder microservice, running on specialized hardware, then provides the surgical precision needed to feed clean, relevant context to the LLM. When deployed on Kubernetes with proper caching and observability, this architecture provides the robust foundation necessary to handle the complexity and performance demands of real-world AI applications.
