Production RAG: Hybrid Search & Cross-Encoder Re-ranking on Kubernetes
The Failure of Naive RAG in Production
For any senior engineer tasked with deploying a Retrieval-Augmented Generation (RAG) system, the initial proof-of-concept is deceptively simple: embed documents, store them in a vector database, and perform a similarity search to find context for a Large Language Model (LLM). This works well for straightforward semantic queries. However, this naive approach collapses under the weight of real-world query diversity and performance requirements.
The primary failure mode is the inherent limitation of dense vector retrieval. Embedding models, while powerful, can struggle with queries dominated by specific keywords, acronyms, or identifiers. A query like "How to fix error Kube-Err-503 in v1.28.2?" may fail because the semantic embedding of Kube-Err-503 is not distinct enough from other errors, while a traditional sparse retrieval method like BM25 would excel.
Conversely, a query like "Why is my pod frequently restarting without logging any errors?" is semantically rich and a perfect use case for vector search: answering it means connecting related concepts such as restarts, crash loops, and missing logs, relationships a purely lexical match cannot capture.
Relying solely on one method forces a compromise. Increasing the top_k in a vector search to compensate for missed keywords floods the context window with irrelevant documents, increasing LLM inference cost and introducing noise that degrades the quality of the final generation. This is not a production-ready solution. A robust RAG pipeline must maximize recall at the retrieval stage and then aggressively optimize for precision before handing off to the LLM. This requires a multi-stage architecture.
This article details the design and implementation of such an architecture, focusing on three core components:
*   A hybrid retriever that runs sparse (BM25) and dense (vector) search in parallel and fuses the results with Reciprocal Rank Fusion (RRF).
*   A cross-encoder re-ranking stage, deployed as a dedicated microservice, that re-orders the fused candidates for precision.
*   A Redis caching layer that amortizes the cost of retrieval and re-ranking for repeated queries.
We will deploy this entire system on Kubernetes, architecting it for scalability, observability, and independent resource management.
Architecting the Hybrid Search Retriever
The goal of the initial retrieval stage is to cast a wide net, ensuring the ground-truth document is highly likely to be in our candidate set. We achieve this by running two specialized retrievers in parallel and fusing their results.
*   Sparse Retriever: We'll use Elasticsearch with a standard BM25 index. This is optimized for lexical matches—keywords, product IDs, error codes, and specific jargon.
*   Dense Retriever: A vector database like Weaviate, Milvus, or Pinecone, populated with embeddings from a model like all-mpnet-base-v2 or a fine-tuned equivalent. This is optimized for semantic or conceptual matches.
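Both retrievers below assume a shared document schema with doc_id and content fields. The following one-time setup sketch shows one way to create the corresponding indexes; the client versions (elasticsearch 8.x, weaviate-client v3) and the text2vec-transformers vectorizer are assumptions, not requirements of the architecture.
# index_setup.py -- one-time setup sketch (field names match the retriever code below;
# assumed clients: elasticsearch 8.x, weaviate-client v3).
import weaviate
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")
# BM25 index: 'content' is analyzed full text, 'doc_id' is an exact-match keyword.
es.indices.create(
    index="documents",
    mappings={
        "properties": {
            "doc_id": {"type": "keyword"},
            "content": {"type": "text"},
        }
    },
)

wv = weaviate.Client("http://weaviate:8080")
# Dense index: Weaviate vectorizes 'content' via its text2vec module (assumed to be enabled).
wv.schema.create_class({
    "class": "Document",
    "vectorizer": "text2vec-transformers",
    "properties": [
        {"name": "doc_id", "dataType": ["text"]},
        {"name": "content", "dataType": ["text"]},
    ],
})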
The Fusion Problem: Why Score Normalization Fails
A common but flawed approach to merging results is to normalize the scores from both systems (e.g., to a 0-1 range) and then interleave them. This is problematic because BM25 scores and cosine similarity scores are not directly comparable. A BM25 score of 20.5 has no inherent relationship to a cosine similarity of 0.89. Normalization is sensitive to the distribution of scores in a given result set, making it unstable. A single outlier can skew the entire ranking.
Solution: Reciprocal Rank Fusion (RRF)
RRF provides a simple, effective, and score-agnostic way to combine ranked lists. It disregards the absolute scores and focuses solely on the rank of each document in its respective list. The RRF score for a document is calculated by summing the reciprocal of its rank across all lists.
Formula: RRF_Score(d) = Σ(1 / (k + rank_i(d)))
Where:
*   d is the document.
*   rank_i(d) is the rank of document d in result list i.
*   k is a constant (typically 60) that dampens the advantage of top-ranked documents, so a single first-place result in one list cannot dominate the fused ranking.
For example, with k = 60, a document ranked 1st by BM25 and 10th by the dense retriever scores 1/61 + 1/70 ≈ 0.031, while a document appearing in only one list at rank 3 scores 1/63 ≈ 0.016.
Let's implement a HybridRetriever in Python that queries Elasticsearch and a vector database (using the Weaviate client as an example) in parallel with asyncio and then applies RRF. The v3 Weaviate client is synchronous, so its query is pushed to a worker thread to keep the two searches truly concurrent.
import asyncio
import weaviate
from elasticsearch import AsyncElasticsearch
from collections import defaultdict
# --- Configuration ---
WEAVIATE_URL = "http://weaviate:8080"
ES_URL = "http://elasticsearch:9200"
WEAVIATE_CLASS_NAME = "Document"
ES_INDEX_NAME = "documents"
RRF_K = 60
class HybridRetriever:
    def __init__(self):
        self.weaviate_client = weaviate.Client(WEAVIATE_URL)
        self.es_client = AsyncElasticsearch([ES_URL])
    async def _dense_search(self, query_text: str, top_k: int):
        # In a real app, you'd get the vector from a dedicated embedding service.
        # For this example, we assume Weaviate has a text2vec module configured.
        near_text = {"concepts": [query_text]}

        def _run_query():
            # The v3 Weaviate client is synchronous; run the query in a worker
            # thread so the dense and sparse searches actually overlap.
            return (
                self.weaviate_client.query
                .get(WEAVIATE_CLASS_NAME, ["doc_id", "content"])
                .with_near_text(near_text)
                .with_limit(top_k)
                .do()
            )

        response = await asyncio.to_thread(_run_query)
        results = response['data']['Get'][WEAVIATE_CLASS_NAME]
        return [{"id": item['doc_id'], "content": item['content']} for item in results]
    async def _sparse_search(self, query_text: str, top_k: int):
        response = await self.es_client.search(
            index=ES_INDEX_NAME,
            body={
                "size": top_k,
                "query": {
                    "match": {
                        "content": query_text
                    }
                }
            }
        )
        
        results = response['hits']['hits']
        return [{ "id": hit['_source']['doc_id'], "content": hit['_source']['content'] } for hit in results]
    def _reciprocal_rank_fusion(self, ranked_lists: list[list[dict]], k: int = 60) -> list[dict]:
        rrf_scores = defaultdict(float)
        doc_content_map = {}
        for ranked_list in ranked_lists:
            for i, doc in enumerate(ranked_list):
                doc_id = doc['id']
                if doc_id not in doc_content_map:
                    doc_content_map[doc_id] = doc['content']
                
                rank = i + 1
                rrf_scores[doc_id] += 1 / (k + rank)
        sorted_docs = sorted(rrf_scores.items(), key=lambda item: item[1], reverse=True)
        
        fused_results = []
        for doc_id, score in sorted_docs:
            fused_results.append({
                "id": doc_id,
                "content": doc_content_map[doc_id],
                "rrf_score": score
            })
        
        return fused_results
    async def search(self, query_text: str, top_k: int = 50) -> list[dict]:
        """Performs parallel dense and sparse search, then fuses the results."""
        dense_task = self._dense_search(query_text, top_k)
        sparse_task = self._sparse_search(query_text, top_k)
        dense_results, sparse_results = await asyncio.gather(dense_task, sparse_task)
        fused_results = self._reciprocal_rank_fusion([dense_results, sparse_results], k=RRF_K)
        
        return fused_results
# --- Example Usage ---
async def main():
    retriever = HybridRetriever()
    query = "Kubernetes pod crashlooping without logs"
    
    # We retrieve a larger set of candidates (e.g., 50) for the re-ranker to process
    candidates = await retriever.search(query, top_k=50)
    
    print(f"Found {len(candidates)} candidates after RRF fusion.")
    for i, doc in enumerate(candidates[:5]):
        print(f"{i+1}. [ID: {doc['id']}] Score: {doc['rrf_score']:.4f} | Content: {doc['content'][:100]}...")
if __name__ == "__main__":
    # Note: Requires running Elasticsearch and Weaviate instances
    # and populating them with documents having 'doc_id' and 'content' fields.
    # asyncio.run(main())
    pass
This implementation fetches up to 100 documents (50 from each retriever) and fuses them into a single, relevance-ranked list. This list now has high recall but may contain noise. The next step is to prune this list with a high-precision re-ranking model.
The Re-ranking Stage: From Recall to Precision
Hybrid search gives us a rich set of candidate documents. Now we need to re-order them with much higher accuracy. The models used for retrieval (bi-encoders) are optimized for speed over a massive corpus. They encode the query and documents into vectors independently. For re-ranking, we can afford a more computationally expensive but far more accurate model: a cross-encoder.
A cross-encoder takes both the query and a candidate document as a single input and outputs a score representing their relevance. This joint processing allows the model to perform deep attention across the query and document tokens, capturing nuances that bi-encoders miss.
We'll use a model from the sentence-transformers library, such as ms-marco-MiniLM-L-6-v2, which is specifically trained for re-ranking tasks.
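To make the joint scoring concrete, here is a minimal sketch of calling the cross-encoder directly with sentence-transformers before wrapping it in a service; the example query and passages are purely illustrative.
from sentence_transformers import CrossEncoder

# Load the re-ranking model (downloaded from the Hugging Face Hub on first use).
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', max_length=512)

query = "Why is my pod frequently restarting without logging any errors?"
passages = [
    "A CrashLoopBackOff with no logs often means the container exits before logging starts.",
    "BM25 is a ranking function used by search engines to estimate document relevance.",
]

# Each (query, passage) pair is scored jointly; higher scores mean higher relevance.
scores = model.predict([(query, passage) for passage in passages])
for passage, score in zip(passages, scores):
    print(f"{score:.3f}  {passage[:60]}")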
Architectural Decision: The Re-ranker as a Microservice
A critical production decision is where to run the cross-encoder. Do not run it in the same service as the retrieval logic. Here's why:
*   Resource isolation: the cross-encoder benefits from a GPU, while the orchestrator and retrieval logic are CPU- and I/O-bound. Separating them lets you schedule each on appropriate hardware.
*   Independent scaling: re-ranking and retrieval have different throughput and latency profiles; a dedicated deployment can be scaled (and autoscaled) on its own.
*   Independent upgrades: you can swap the re-ranking model or move it to a larger GPU class without redeploying the rest of the pipeline.
Here is a simple FastAPI implementation for our re-ranker microservice:
# reranker_service/main.py
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers.cross_encoder import CrossEncoder
import torch
# --- Model Loading ---
# This happens once on startup.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_NAME = 'cross-encoder/ms-marco-MiniLM-L-6-v2'
print(f"Loading CrossEncoder model on device: {DEVICE}")
model = CrossEncoder(MODEL_NAME, max_length=512, device=DEVICE)
app = FastAPI()
class RerankRequest(BaseModel):
    query: str
    documents: list[dict] # Expects [{'id': str, 'content': str}, ...]
class RerankedDocument(BaseModel):
    id: str
    content: str
    rerank_score: float
class RerankResponse(BaseModel):
    results: list[RerankedDocument]
@app.post("/rerank", response_model=RerankResponse)
def rerank(request: RerankRequest):
    # Create pairs of [query, document_content] for the model
    model_input_pairs = [[request.query, doc['content']] for doc in request.documents]
    
    # Predict scores
    scores = model.predict(model_input_pairs, convert_to_numpy=True)
    
    # Combine documents with their new scores
    for doc, score in zip(request.documents, scores):
        doc['rerank_score'] = float(score)
    
    # Sort documents by the new score in descending order
    reranked_docs = sorted(request.documents, key=lambda x: x['rerank_score'], reverse=True)
    
    return RerankResponse(results=reranked_docs)
# Health check endpoint
@app.get("/health")
def health():
    return {"status": "ok", "device": DEVICE}
This service exposes a single /rerank endpoint. The main RAG orchestrator will now call this service after the fusion step.
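The orchestrator needs a small HTTP client for this endpoint. Here is a minimal sketch of such a client (the RerankerClient referenced later in the caching section); the class name and the httpx dependency are illustrative choices rather than part of the service itself.
# orchestrator/reranker_client.py -- minimal async client sketch (assumes httpx is installed).
import httpx


class RerankerClient:
    def __init__(self, base_url: str, timeout_seconds: float = 10.0):
        self.base_url = base_url.rstrip("/")
        self.timeout = timeout_seconds

    async def rerank(self, query: str, documents: list[dict]) -> list[dict]:
        """Sends candidates to the re-ranker and returns them sorted by rerank_score."""
        payload = {
            "query": query,
            # The service only needs 'id' and 'content'; drop extra fields such as rrf_score.
            "documents": [{"id": d["id"], "content": d["content"]} for d in documents],
        }
        async with httpx.AsyncClient(timeout=self.timeout) as client:
            response = await client.post(f"{self.base_url}/rerank", json=payload)
            response.raise_for_status()
            return response.json()["results"]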
Production Deployment on Kubernetes
Now, let's tie everything together in a Kubernetes environment. Our architecture consists of several components:
*   RAG Orchestrator: The main service that receives the user query, calls the HybridRetriever, then the RerankerService, and finally the LLM.
*   Reranker Service: The dedicated microservice we just defined, potentially running on GPU nodes.
*   Elasticsearch & Weaviate: Deployed as StatefulSets within the cluster.
*   LLM Service: An external API (like OpenAI) or a self-hosted model (like Llama 2 via TGI), accessed via its own service.
*   Redis: A StatefulSet for distributed caching.
System Flow Diagram
Here's the request flow:
graph TD
    A[User Request] --> B(API Gateway)
    B --> C{RAG Orchestrator Service}
    C --> D(Redis Cache Check)
    D -- Cache Miss --> E{Parallel Fan-out}
    E --> F[Sparse Search: Elasticsearch Pod]
    E --> G[Dense Search: Weaviate Pod]
    F --> H{"Result Fusion (RRF)"}
    G --> H
    H --> I[Reranker Service Call]
    I --> J(Reranker Pod on GPU Node)
    J --> I
    I --> K(Redis Cache Write)
    I --> L{LLM Service Call}
    L --> M[LLM Inference Pod/API]
    M --> L
    L --> N[Format Response]
    N --> B
    D -- Cache Hit --> L
Kubernetes Manifests for the Re-ranker Service
Deploying the re-ranker service requires a Deployment and a Service. The key is specifying resource requests/limits and potentially a nodeSelector or toleration to target GPU nodes.
reranker-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reranker-deployment
  labels:
    app: reranker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: reranker
  template:
    metadata:
      labels:
        app: reranker
    spec:
      # Use a node selector to target nodes with GPUs
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A10G
      # Or use tolerations if your GPU nodes are tainted
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: reranker
        image: your-registry/reranker-service:latest
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: "1"
            memory: "4Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "2"
            memory: "8Gi"
            nvidia.com/gpu: "1"
        livenessProbe:
          httpGet:
            path: /health
            port: 80
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 80
          initialDelaySeconds: 15
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: reranker-service
spec:
  selector:
    app: reranker
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: ClusterIP
This manifest ensures the re-ranker pods are scheduled on nodes with NVIDIA-A10G GPUs and are allocated one GPU each. The orchestrator can now reach it at the internal DNS name http://reranker-service.
Performance Optimization: Caching
This multi-stage pipeline introduces latency: retrieval, re-ranking, and LLM generation run in sequence, and each adds to end-to-end response time. Caching is essential. The most effective place to cache is after the re-ranking step, since that captures the output of the most computationally expensive parts of the retrieval pipeline.
Here's how to modify the orchestrator to use a Redis cache:
# In the RAG Orchestrator Service
import redis.asyncio as redis  # async Redis client so cache I/O does not block the event loop
import json
import hashlib
REDIS_HOST = "redis"
REDIS_PORT = 6379
CACHE_EXPIRATION_SECONDS = 3600 # 1 hour
class RAGOrchestrator:
    def __init__(self):
        self.retriever = HybridRetriever()
        # Assume reranker_client is an HTTP client for the reranker service
        self.reranker_client = RerankerClient("http://reranker-service") 
        self.redis_client = redis.Redis(host=REDIS_HOST, port=REDIS_PORT, db=0)
    def _get_query_cache_key(self, query: str, top_k: int) -> str:
        # Hash the query and params to create a stable key
        key_string = f"{query}:{top_k}"
        return f"rag_cache:{hashlib.md5(key_string.encode()).hexdigest()}"
    async def process_query(self, query: str):
        # 1. Check cache first
        cache_key = self._get_query_cache_key(query, top_k=5) # Cache top 5 results
        cached_result = await self.redis_client.get(cache_key)
        if cached_result:
            print("CACHE HIT")
            context_docs = json.loads(cached_result)
        else:
            print("CACHE MISS")
            # 2. Retrieve candidates
            candidates = await self.retriever.search(query, top_k=50)
            
            # 3. Re-rank candidates
            reranked_docs = await self.reranker_client.rerank(query, candidates)
            
            # 4. Select top N for context and cache them
            context_docs = reranked_docs[:5]
            await self.redis_client.set(
                cache_key, 
                json.dumps(context_docs), 
                ex=CACHE_EXPIRATION_SECONDS
            )
        
        # 5. Generate response with LLM using context_docs
        # ... llm_call(query, context_docs) ...
        pass
Advanced Edge Cases and Tuning
Deploying this system is just the beginning. Continuous tuning and monitoring are required.
*   The k Parameter Dilemma: The top_k for the initial retrieval and the final top_n sent to the LLM are critical tuning parameters. A larger initial k (e.g., 100) increases the load on the re-ranker service but raises the probability of finding the ground truth (higher recall). A smaller k is faster but riskier. This is a classic latency vs. accuracy trade-off that must be tuned based on SLOs and user feedback.
*   Re-ranker Model Choice: The MiniLM-based model is fast and effective, but for domains requiring extreme precision, a larger cross-encoder (like a DeBERTa-v3 variant) might be necessary. This would likely require a more powerful GPU (e.g., A100) and further highlights the value of the microservice architecture, as you can upgrade the hardware for just this component.
*   Cold Starts and Autoscaling: When the re-ranker service is scaled by a Kubernetes Horizontal Pod Autoscaler (HPA), every newly scheduled pod pays a significant latency penalty while the cross-encoder model loads. To mitigate this, keep minReplicas high enough to absorb baseline traffic (at least 1) and rely on the readiness probe so no requests are routed to a pod before the model is loaded. For scale-to-zero or more advanced scenarios, consider KServe or a similar model serving framework that can perform intelligent model loading and offloading.
*   Observability is Non-Negotiable: With a multi-service pipeline, identifying bottlenecks is impossible without distributed tracing. Instrument your code with OpenTelemetry to create spans for each major step: hybrid_retrieval, sparse_search, dense_search, rrf_fusion, rerank_call, and llm_generation. This will allow you to visualize the latency contribution of each component in tools like Jaeger or Honeycomb and pinpoint exactly where to optimize.
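As a starting point, the sketch below shows what that instrumentation might look like in the orchestrator using the OpenTelemetry Python API; it assumes the SDK and an exporter are configured at application startup, and the helper function shown is illustrative rather than part of the orchestrator above.
# In the RAG Orchestrator Service -- tracing sketch (assumes the OpenTelemetry SDK
# and an exporter, e.g. OTLP, are configured at startup).
from opentelemetry import trace

tracer = trace.get_tracer("rag.orchestrator")

async def process_query_traced(orchestrator, query: str):
    with tracer.start_as_current_span("rag.request") as span:
        span.set_attribute("rag.query_length", len(query))

        with tracer.start_as_current_span("hybrid_retrieval"):
            # Covers sparse_search, dense_search and rrf_fusion; those steps can be
            # given child spans inside HybridRetriever in exactly the same way.
            candidates = await orchestrator.retriever.search(query, top_k=50)

        with tracer.start_as_current_span("rerank_call") as rerank_span:
            rerank_span.set_attribute("rag.candidate_count", len(candidates))
            reranked = await orchestrator.reranker_client.rerank(query, candidates)

        with tracer.start_as_current_span("llm_generation"):
            # ... llm_call(query, reranked[:5]) ...
            pass

        return reranked[:5]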
Conclusion
Moving from a simplistic RAG prototype to a production-grade system requires a fundamental architectural shift. By embracing a multi-stage pipeline that decouples retrieval from re-ranking, we build a system that is more accurate, scalable, and maintainable. Hybrid search with RRF maximizes recall by leveraging the complementary strengths of sparse and dense retrieval. A dedicated cross-encoder microservice, running on specialized hardware, then provides the surgical precision needed to feed clean, relevant context to the LLM. When deployed on Kubernetes with proper caching and observability, this architecture provides the robust foundation necessary to handle the complexity and performance demands of real-world AI applications.