Optimizing RAG: Hybrid Search and Re-ranking for Low-Latency LLMs
The Illusion of Simplicity in Production RAG
For senior engineers tasked with building LLM-powered features, the initial promise of Retrieval-Augmented Generation (RAG) seems deceptively simple: embed a corpus of documents, perform a vector similarity search on a user query, stuff the results into a prompt, and let the LLM synthesize an answer. This works for a proof-of-concept. It fails catastrophically in production.
Why? Because naive, single-stage vector search is a blunt instrument. It's susceptible to several critical failure modes:
* Keyword misses: Embedding models routinely miss exact identifiers and domain jargon (e.g., part number X-48-B2, error code 0x80070005). A query for a specific term might not retrieve a document that contains it if the overall semantic context is weak.
* Noisy context: Pulling in 10-20 loosely related chunks dilutes the signal, inflates token costs, and slows down the LLM call.
* "Lost in the middle": LLMs attend most strongly to the beginning and end of the prompt, so a relevant document buried mid-context is often ignored.
Production-grade RAG is not a single API call; it's a multi-stage data processing funnel designed to maximize precision at the top of the retrieved results (Precision@k where k is small) before ever invoking the expensive LLM. This article details the architecture and implementation of a two-stage pipeline that addresses these failures: Hybrid Retrieval followed by Cross-Encoder Re-ranking.
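To make "precision at the top" concrete, here is a minimal sketch of Precision@k; the ranked_doc_ids and relevant_doc_ids inputs are assumed to come from your own labeled evaluation data, not from any particular library.
def precision_at_k(ranked_doc_ids: list[str], relevant_doc_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are actually relevant to the query."""
    if k == 0:
        return 0.0
    top_k = ranked_doc_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_doc_ids) / k
# e.g., 2 of the top 3 retrieved documents are labeled relevant -> 0.67
print(round(precision_at_k(['doc1', 'doc7', 'doc3'], {'doc1', 'doc3'}, k=3), 2))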
Stage 1: Hybrid Retrieval with Reciprocal Rank Fusion (RRF)
The first stage of our funnel is retrieval. Our goal here is high recall—we want to cast a wide net to ensure the relevant documents are captured, even if it means pulling in some noise. We achieve this by combining the strengths of two fundamentally different search paradigms: sparse and dense retrieval.
* Sparse Retrieval (e.g., BM25): The workhorse of traditional search engines. It operates on inverted indexes and term frequencies (TF-IDF). It is exceptionally good at matching keywords, codes, and specific jargon. Its weakness is a lack of semantic understanding.
* Dense Retrieval (Vector Search): Uses embedding models (bi-encoders) to map text to high-dimensional vectors. It excels at understanding semantic similarity, synonyms, and paraphrasing. Its weakness is the keyword miss problem mentioned earlier.
By combining them, we get the best of both worlds. The challenge lies in merging their disparate result sets and scoring systems.
Fusing the Results: Reciprocal Rank Fusion (RRF)
BM25 produces a relevance score, while vector search produces a distance/similarity score. These are not directly comparable. Instead of trying to normalize them, we can use a rank-based fusion method. Reciprocal Rank Fusion (RRF) is a simple, effective, and zero-tuning-required algorithm.
The formula for the RRF score of a document d is:
RRF_score(d) = Σ (1 / (k + rank_i(d)))
Where:
*   rank_i(d) is the rank of document d in the i-th result set (e.g., the BM25 results or the vector search results).
*   k is a smoothing constant that keeps the very top-ranked documents from dominating the fused score (a common value is k=60).
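As a quick worked example with made-up ranks, consider a document that appears 2nd in the BM25 results and 5th in the vector results, using the conventional k=60:
k = 60
bm25_rank, vector_rank = 2, 5  # illustrative ranks for a single document
rrf_score = 1.0 / (k + bm25_rank) + 1.0 / (k + vector_rank)
print(round(rrf_score, 4))  # 0.0315 -- documents ranked highly by both retrievers score highest
A document that appears in only one result list simply contributes a single term, which is why RRF naturally rewards agreement between the retrievers without any score normalization.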
Let's implement this. We'll assume we have two sets of results, one from Elasticsearch (BM25) and one from FAISS (a popular vector index library).
Code Example: Implementing Hybrid Search and RRF
First, let's set up our search clients. This assumes you have documents indexed in both Elasticsearch and a FAISS index with corresponding metadata.
import faiss
import numpy as np
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer
# --- Assume these are pre-populated ---
# Elasticsearch client
es_client = Elasticsearch("http://localhost:9200")
# Sentence Transformer model for embedding queries
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
# FAISS index loaded from disk
faiss_index = faiss.read_index("my_faiss_index.bin")
# A mapping from FAISS index position to our document ID
# e.g., {0: 'doc_abc', 1: 'doc_xyz', ...}
index_to_doc_id = { ... }
# --- Implementation ---
def search_bm25(query_text: str, k: int = 50) -> list[tuple[str, float]]:
    """Performs a BM25 search and returns (doc_id, score) tuples."""
    response = es_client.search(
        index="my_documents",
        body={
            "size": k,
            "query": {
                "match": {
                    "content": query_text
                }
            }
        }
    )
    return [(hit['_id'], hit['_score']) for hit in response['hits']['hits']]
def search_vector(query_text: str, k: int = 50) -> list[tuple[str, float]]:
    """Performs a vector search and returns (doc_id, score) tuples."""
    query_vector = embedding_model.encode([query_text])
    # FAISS returns distances for an L2 index (lower is better); we convert to a similarity score below
    distances, indices = faiss_index.search(query_vector.astype('float32'), k)
    
    results = []
    for i, dist in zip(indices[0], distances[0]):
        if i != -1: # FAISS can return -1 for no result
            doc_id = index_to_doc_id[i]
            # Convert distance to a similarity score (0-1), 1 / (1 + dist)
            similarity = 1.0 / (1.0 + dist)
            results.append((doc_id, similarity))
    return results
def reciprocal_rank_fusion(results_lists: list[list[tuple[str, float]]], k: int = 60) -> dict[str, float]:
    """Performs RRF on a list of search result lists."""
    fused_scores = {}
    
    # Create a map from doc_id to its rank in each list
    doc_ranks = {}
    for results in results_lists:
        for rank, (doc_id, _) in enumerate(results):
            if doc_id not in doc_ranks:
                doc_ranks[doc_id] = []
            doc_ranks[doc_id].append(rank + 1) # rank is 1-based
            
    # Calculate RRF score for each document
    for doc_id, ranks in doc_ranks.items():
        rrf_score = 0.0
        for rank in ranks:
            rrf_score += 1.0 / (k + rank)
        fused_scores[doc_id] = rrf_score
        
    return fused_scores
# --- Putting it together ---
def hybrid_search(query: str, top_k_retrieval: int = 100):
    bm25_results = search_bm25(query, k=top_k_retrieval)
    vector_results = search_vector(query, k=top_k_retrieval)
    # RRF only uses the order of each result list, so we can pass the (doc_id, score) results straight through
    fused_scores = reciprocal_rank_fusion([bm25_results, vector_results])
    # Sort by RRF score in descending order
    sorted_fused_results = sorted(fused_scores.items(), key=lambda item: item[1], reverse=True)
    
    return sorted_fused_results
# --- Example Usage ---
query = "What is the p99 latency for the user-service API?"
retrieved_docs = hybrid_search(query)
print(f"Retrieved {len(retrieved_docs)} documents after fusion.")
print("Top 5 results:", retrieved_docs[:5])In this stage, we might retrieve a large number of documents (e.g., k=100) to ensure our recall is high. The output of this stage is a single, relevance-ranked list of document IDs. This list is far superior to a single-modality search, but it's still too noisy and too large to pass directly to an LLM.
Stage 2: Precision Enhancement with Cross-Encoder Re-ranking
The retrieval stage used bi-encoders, which create embeddings for the query and documents independently. They are incredibly fast, allowing us to search over millions of documents in milliseconds. However, this independence is also their weakness; the model never sees the query and document together.
Cross-encoders solve this. They take both the query and a potential document as a single input and output a score (e.g., 0 to 1) indicating their relevance. This allows for deep, token-level attention between the query and the document, making them significantly more accurate than bi-encoders.
The catch? They are orders of magnitude slower. Running a cross-encoder on your entire corpus is computationally infeasible. This is why our funnel architecture is so critical: we use the fast, high-recall hybrid search to find a few dozen candidates, and then use the slow, high-precision cross-encoder to re-rank only this small set.
Implementation with sentence-transformers
The sentence-transformers library provides pre-trained cross-encoder models perfect for this task.
from sentence_transformers.cross_encoder import CrossEncoder
# Load a lightweight, pre-trained cross-encoder
# These models are trained on tasks like question-answering and are ideal for re-ranking
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# Assume 'retrieved_docs' is the output from our hybrid_search function
# and we have a function to get the actual text content of a document
def get_document_content(doc_id: str) -> str:
    # In a real system, this would fetch from a database or document store
    # For this example, we'll use a dummy dictionary
    dummy_db = {
        'doc1': 'The user-service API has a p99 latency of 250ms.',
        'doc2': 'System performance metrics are tracked in Grafana.',
        'doc3': 'API latency is a key performance indicator.',
        'doc4': 'The billing-service API latency is currently 150ms.'
        # ... more docs
    }
    return dummy_db.get(doc_id, "")
def rerank_documents(query: str, doc_ids: list[str], top_n: int = 5) -> list[tuple[str, float]]:
    """Re-ranks a list of document IDs using a cross-encoder."""
    # Create pairs of [query, document_content] for the cross-encoder
    query_doc_pairs = [(query, get_document_content(doc_id)) for doc_id in doc_ids]
    
    if not query_doc_pairs:
        return []
        
    # Predict scores. The model will output a single relevance score for each pair.
    scores = cross_encoder.predict(query_doc_pairs)
    
    # Combine doc_ids with their new scores
    reranked_results = list(zip(doc_ids, scores))
    
    # Sort by the new score in descending order
    reranked_results.sort(key=lambda x: x[1], reverse=True)
    
    return reranked_results[:top_n]
# --- Example Usage with the previous stage's output ---
# Let's say hybrid_search returned a list of (doc_id, rrf_score) tuples
retrieved_doc_ids = [doc_id for doc_id, score in retrieved_docs]
# We only re-rank the top, say, 50 candidates from the retrieval stage
rerank_candidates = retrieved_doc_ids[:50]
final_docs_for_llm = rerank_documents(query, rerank_candidates, top_n=5)
print("Final documents to be passed to LLM context:", final_docs_for_llm)
# Expected output might be: [('doc1', 0.98), ('doc3', 0.75), ('doc2', 0.21), ...]
# Notice how the most direct answer ('doc1') is pushed to the top.
Now, instead of a context cluttered with 10-20 potentially relevant documents, we have a highly-focused, precision-ordered list of 3-5 documents. The most relevant document is almost guaranteed to be at the top, mitigating the "lost in the middle" problem and providing the LLM with a clean, signal-rich context.
End-to-End Production Pipeline and API
Let's assemble this into a cohesive service. We can use a simple web framework like FastAPI to expose this advanced RAG pipeline as an API endpoint.
from fastapi import FastAPI
from pydantic import BaseModel
from openai import AsyncOpenAI
# Assume all previous functions (search_bm25, search_vector, hybrid_search, rerank_documents, etc.) are defined here.
# Also assume the OPENAI_API_KEY environment variable is set.
openai_client = AsyncOpenAI()
app = FastAPI()
class RAGQuery(BaseModel):
    query: str
    user_id: str # For logging, personalization, etc.
class RAGResponse(BaseModel):
    answer: str
    retrieved_doc_ids: list[str]
def build_llm_prompt(query: str, context_docs: list[str]) -> str:
    """Builds the final prompt for the LLM."""
    context = "\n\n---\n\n".join(context_docs)
    prompt = f"""
    You are a helpful AI assistant. Answer the user's question based on the following context.
    If the context does not contain the answer, state that you don't have enough information.
    Context:
    {context}
    Question: {query}
    Answer:
    """
    return prompt
@app.post("/query", response_model=RAGResponse)
async def execute_rag_pipeline(rag_query: RAGQuery):
    # STAGE 1: Hybrid Retrieval
    # We retrieve a larger set of candidates (e.g., 50-100)
    retrieved_results = hybrid_search(rag_query.query, top_k_retrieval=75)
    retrieved_doc_ids = [doc_id for doc_id, score in retrieved_results]
    # STAGE 2: Cross-Encoder Re-ranking
    # We re-rank only the top candidates to keep latency down
    rerank_candidates = retrieved_doc_ids[:25]
    # We only need the top 3-5 documents for the final context
    final_reranked_docs = rerank_documents(rag_query.query, rerank_candidates, top_n=3)
    final_doc_ids = [doc_id for doc_id, score in final_reranked_docs]
    # STAGE 3: Context Generation and LLM Invocation
    context_contents = [get_document_content(doc_id) for doc_id in final_doc_ids]
    
    prompt = build_llm_prompt(rag_query.query, context_contents)
    # Use the async OpenAI client so the LLM call doesn't block the event loop
    response = await openai_client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": "You are an expert Q&A system."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.1
    )
    answer = response.choices[0].message.content
    return RAGResponse(answer=answer, retrieved_doc_ids=final_doc_ids)
This API endpoint encapsulates the entire pipeline and makes the funnel explicit: up to 75 candidates per retriever -> 25 sent to the re-ranker -> 3 placed in the LLM context. These numbers are hyperparameters you must tune based on your specific latency requirements and document characteristics, so it pays to keep them in one place rather than scattered as magic numbers; one way to do that is sketched below.
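A minimal sketch of lifting the funnel sizes into a single config object; the FunnelConfig and run_funnel names below are illustrative assumptions, not part of any framework, and they reuse the functions defined earlier.
from dataclasses import dataclass

@dataclass(frozen=True)
class FunnelConfig:
    top_k_retrieval: int = 75    # candidates pulled from each retriever in stage 1
    rerank_candidates: int = 25  # how many fused results the cross-encoder scores
    final_top_n: int = 3         # documents placed into the LLM context

def run_funnel(query: str, cfg: FunnelConfig = FunnelConfig()) -> list[str]:
    """Runs the two-stage funnel with explicit, tunable stage sizes."""
    retrieved = hybrid_search(query, top_k_retrieval=cfg.top_k_retrieval)
    candidate_ids = [doc_id for doc_id, _ in retrieved][:cfg.rerank_candidates]
    reranked = rerank_documents(query, candidate_ids, top_n=cfg.final_top_n)
    return [doc_id for doc_id, _ in reranked]
Keeping the funnel shape in one object also makes it straightforward to A/B test different stage sizes against the evaluation metrics discussed below.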
Advanced Performance Considerations and Edge Cases
Deploying this system requires more than just the Python code. Here are critical production considerations:
* Latency Optimization for the Re-ranker: The cross-encoder is the slowest component in the pipeline, so optimize its inference path:
    *   GPU Inference: Deploy the re-ranking model on a GPU-enabled inference server like NVIDIA Triton Inference Server. This allows for batching requests and hardware acceleration.
    *   Model Quantization/Pruning: Use tools like ONNX Runtime or TensorRT to convert the PyTorch model to a quantized, optimized format. This can reduce latency by 2-4x with a minimal drop in accuracy.
    *   Smaller Models: Experiment with smaller cross-encoder models. A MiniLM might be sufficient and significantly faster than a BERT-base model.
* Caching: Cache at multiple levels to avoid repeating work for common queries:
    *   Retrieval Results: Cache the output of the hybrid search for common queries.
    *   Re-ranked Results: Cache the final, re-ranked list of document IDs.
    *   LLM Responses: Cache the final generated answer for identical queries.
    The cache key should be a hash of the query and potentially other parameters (e.g., user permissions).
* Parallelize Retrieval: The BM25 and vector searches are independent; run them concurrently (e.g., with asyncio.gather) to reduce I/O wait time.
* Evaluation Metrics: Measure the pipeline continuously with metrics such as the following (a minimal MRR sketch follows this list):
    *   Mean Reciprocal Rank (MRR): Measures the rank of the first correct answer.
    *   Normalized Discounted Cumulative Gain (nDCG@k): Evaluates the quality of the ranking for the top k documents.
    *   End-to-End Latency (p95, p99): The most important user-facing metric.
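As a minimal sketch of tracking ranking quality offline, here is MRR@k computed against a small hand-labeled evaluation set. The eval_set contents and helper names are assumptions for illustration; they reuse the hybrid_search and rerank_documents functions defined earlier.
def mrr_at_k(ranked_doc_ids: list[str], relevant_doc_ids: set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant document within the top k (0.0 if none found)."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant_doc_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical labeled evaluation set: query -> set of known-relevant doc IDs
eval_set = {
    "What is the p99 latency for the user-service API?": {"doc1"},
    # ... more labeled queries
}

def evaluate_retrieval(k: int = 10) -> float:
    """Average MRR@k of the hybrid + re-rank pipeline over the labeled queries."""
    scores = []
    for eval_query, relevant in eval_set.items():
        candidate_ids = [doc_id for doc_id, _ in hybrid_search(eval_query)][:50]
        reranked = rerank_documents(eval_query, candidate_ids, top_n=k)
        scores.append(mrr_at_k([doc_id for doc_id, _ in reranked], relevant, k=k))
    return sum(scores) / len(scores) if scores else 0.0

print(f"MRR@10: {evaluate_retrieval():.3f}")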
A Comparative Benchmark (Illustrative)
| Architecture | MRR@10 (Retrieval Quality) | p95 End-to-End Latency | LLM Cost (Relative) | 
|---|---|---|---|
| Naive Vector Search (10 docs) | 0.65 | 1.8s | 1.5x | 
| Hybrid Search (10 docs) | 0.78 | 2.0s | 1.5x | 
| Hybrid + Re-ranker (3 docs) | 0.92 | 1.5s | 1.0x (baseline) | 
This illustrative data shows our advanced pipeline is not just more accurate (MRR jumps from 0.65 to 0.92) but can also be faster and cheaper. The latency reduction comes from the smaller, cleaner context passed to the LLM, which more than compensates for the added re-ranking step (if optimized correctly). The cost reduction is a direct result of using fewer tokens in the LLM prompt.
Conclusion
Moving a RAG system from a demo to a robust, production service requires a paradigm shift from single-shot retrieval to a multi-stage refinement pipeline. By combining the keyword-matching strength of sparse search with the semantic power of dense search, we create a high-recall initial candidate set. Then, by applying a computationally intensive but highly accurate cross-encoder re-ranker, we distill this set into a small, precision-focused context for the LLM.
This architecture directly addresses the core weaknesses of naive RAG, leading to more relevant answers, lower latency, and reduced operational costs. It is a complex system with more moving parts, but for applications where accuracy and performance are non-negotiable, this level of engineering rigor is not just beneficial—it's essential.