Advanced RAG: HyDE & Re-ranking for High-Fidelity Q&A Systems

20 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Illusion of Simplicity in Naive RAG

For any senior engineer who has moved past introductory tutorials, the promise of Retrieval-Augmented Generation (RAG) quickly meets a harsh production reality. The standard pattern—embedding a user query and performing a vector similarity search to find context for an LLM—is powerful but brittle. It fundamentally fails when the user's query lacks semantic overlap with the source documents, even if the conceptual answer is present.

Consider a query like: "What were the key architectural decisions that led to our platform's scalability issues last quarter?"

A naive vector search for this query will likely fail. Your knowledge base contains architectural decision records (ADRs), incident post-mortems, and infrastructure diagrams. None of these documents are likely to contain the exact phrase "scalability issues last quarter." The query is a synthesis question, while the documents contain the evidence. This semantic gap is where naive RAG breaks down.

This article deconstructs this problem and implements a robust, multi-stage pipeline that directly addresses it. We will architect a system that first transforms the query into a more potent search vector using Hypothetical Document Embeddings (HyDE) and then ruthlessly prunes and re-orders the retrieved candidates with a cross-encoder re-ranking model. This isn't a theoretical exercise; it's a production-ready pattern for building high-fidelity Q&A systems that deliver accurate, contextually aware answers.

Setting Up Our Environment

Before we begin, let's establish a common ground with a reproducible environment. We'll use sentence-transformers for embeddings, faiss-cpu for a lightweight local vector store, transformers for our re-ranker, and an LLM provider client (we'll use OpenAI's API as an example, but the concepts are provider-agnostic).

bash
# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate

# Install necessary libraries
pip install sentence-transformers faiss-cpu openai transformers torch

Let's also prepare a sample document corpus that exemplifies the challenge. We'll create a few snippets representing internal technical documentation.

python
# setup.py
import faiss
from sentence_transformers import SentenceTransformer

# 1. Our Document Corpus (The Knowledge Base)
documents = [
    "ADR-001: We will use a microservices architecture. Services will communicate via gRPC. This decision was made for team autonomy and independent scaling.",
    "Post-mortem Q4 2023: A cascading failure in the 'user-auth' service led to a 3-hour outage. The root cause was identified as a database connection pool exhaustion under high load.",
    "System Design: The 'inventory-service' uses a PostgreSQL database with read replicas to handle query load. All writes go to the primary instance.",
    "ADR-005: To handle increased traffic, we are introducing a Kafka-based event queue between the 'order-service' and the 'fulfillment-service'. This decouples the services and provides better resilience.",
    "Performance Report Q4 2023: Latency in the 'order-service' increased by 200% during peak hours. Investigation points to synchronous calls to the 'inventory-service' creating a bottleneck."
]

# 2. Embedding Model (Bi-Encoder)
# A bi-encoder creates independent embeddings for query and document.
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# 3. Create Embeddings and Vector Store
doc_embeddings = embedding_model.encode(documents)
index = faiss.IndexFlatL2(doc_embeddings.shape[1])
index.add(doc_embeddings)

print("Setup complete. Vector store is ready.")

def search_naive_rag(query: str, k: int = 2):
    """Performs a standard vector search."""
    query_embedding = embedding_model.encode([query])
    distances, indices = index.search(query_embedding, k)
    return [documents[i] for i in indices[0]]

# Let's test the naive approach with our challenging query
query = "What were the key architectural decisions that led to our platform's scalability issues last quarter?"

retrieved_docs = search_naive_rag(query)

print(f"--- Naive RAG Retrieval for query: '{query}' ---")
for doc in retrieved_docs:
    print(f"- {doc}")

Running this will likely yield suboptimal results. The retrieved documents might be ADR-001 and ADR-005, which mention architecture but miss the critical performance report and post-mortem that actually explain the problem. The vector search latches onto "architectural decisions" but fails to connect it to "scalability issues."

Output of Naive RAG:

text
--- Naive RAG Retrieval for query: 'What were the key architectural decisions that led to our platform's scalability issues last quarter?' ---
- ADR-001: We will use a microservices architecture. Services will communicate via gRPC. This decision was made for team autonomy and independent scaling.
- ADR-005: To handle increased traffic, we are introducing a Kafka-based event queue between the 'order-service' and the 'fulfillment-service'. This decouples the services and provides better resilience.

This context is insufficient. An LLM given this would hallucinate an answer or state it cannot find the information. This is our baseline failure case.

Stage 1: Hypothetical Document Embeddings (HyDE)

HyDE, introduced in the paper "Precise Zero-Shot Dense Retrieval without Relevance Labels", proposes a clever solution: instead of searching with the query's embedding, we first use an LLM to generate a hypothetical document—a fictional but plausible answer to the query. We then embed this hypothetical document and use its embedding for the vector search.

The Core Insight: The embedding of a well-formed answer is more likely to be located in the vector space near the embeddings of actual documents that contain the real answer. It transforms the search from query -> document to answer -> document, bridging the semantic gap.

Implementation of the HyDE Generator

First, we need a component that takes a query and uses an LLM to generate the hypothetical document. This is a simple instruction-following task.

python
# hyde_module.py
import os
from openai import OpenAI

# Ensure you have your OPENAI_API_KEY set as an environment variable
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

HYDE_PROMPT_TEMPLATE = """
Please write a short, hypothetical passage that provides a plausible answer to the following user query. 
Do not use any prior knowledge, and focus on creating a document that sounds like a definitive answer found in a technical knowledge base.

USER QUERY: {query}

HYPOTHETICAL DOCUMENT:"""

def generate_hypothetical_document(query: str, model: str = "gpt-3.5-turbo") -> str:
    """Generates a hypothetical document using an LLM."""
    prompt = HYDE_PROMPT_TEMPLATE.format(query=query)
    
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant that generates plausible documents."}, 
            {"role": "user", "content": prompt}
        ],
        temperature=0.7, # Allow for some creativity but not too much
        max_tokens=150
    )
    return response.choices[0].message.content.strip()

# Example usage with our query
query = "What were the key architectural decisions that led to our platform's scalability issues last quarter?"
hypothetical_doc = generate_hypothetical_document(query)

print(f"--- Generated Hypothetical Document for query: '{query}' ---")
print(hypothetical_doc)

For our query, the LLM might generate something like this:

Hypothetical Document Output:

text
The primary architectural decision impacting last quarter's scalability was the synchronous communication pattern between the order-service and the inventory-service. This created a severe bottleneck during peak traffic, as every order request had to wait for a direct response from inventory. A post-mortem revealed this tight coupling led to cascading failures and increased latency, which was later addressed by introducing an asynchronous event queue.

This generated text is a goldmine. It contains keywords and concepts like "synchronous communication," "bottleneck," "peak traffic," "post-mortem," and "event queue." These are far more likely to have strong vector similarity to our actual documents (Performance Report Q4 2023 and Post-mortem Q4 2023) than the original query.
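
To see this shift concretely, we can compare how similar each corpus document is to the raw query versus the hypothetical document. The following is a minimal sketch, assuming the embedding_model and documents objects from setup.py and the query and hypothetical_doc variables from above are in scope.

python
# similarity_check.py (sketch; reuses objects from the earlier snippets)
from sentence_transformers import util

query_embedding = embedding_model.encode(query, convert_to_tensor=True)
hyde_embedding = embedding_model.encode(hypothetical_doc, convert_to_tensor=True)
doc_embeddings = embedding_model.encode(documents, convert_to_tensor=True)

# Cosine similarity of every document against the raw query and the HyDE document
query_sims = util.cos_sim(query_embedding, doc_embeddings)[0]
hyde_sims = util.cos_sim(hyde_embedding, doc_embeddings)[0]

for doc, q_sim, h_sim in zip(documents, query_sims, hyde_sims):
    print(f"query={q_sim.item():.3f}  hyde={h_sim.item():.3f}  {doc[:60]}...")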

Integrating HyDE into the Retrieval Pipeline

Now, let's modify our search function to use this new approach.

python
# In setup.py, add the HyDE search function

# (Assuming setup code from before is present: documents, embedding_model, index)
# (Also assuming generate_hypothetical_document is imported or in the same file)

def search_hyde(query: str, k: int = 4):
    """Performs a vector search using a HyDE document."""
    print("\n1. Generating hypothetical document...")
    hypothetical_doc = generate_hypothetical_document(query)
    print(f"   -> Generated: '{hypothetical_doc[:100]}...' ")

    print("2. Embedding the hypothetical document...")
    hyde_embedding = embedding_model.encode([hypothetical_doc])

    print("3. Performing vector search...")
    distances, indices = index.search(hyde_embedding, k)
    return [documents[i] for i in indices[0]]

# Let's test the HyDE approach
retrieved_docs_hyde = search_hyde(query)

print(f"\n--- HyDE Retrieval for query: '{query}' ---")
for doc in retrieved_docs_hyde:
    print(f"- {doc}")

Expected HyDE Retrieval Output:

text
--- HyDE Retrieval for query: 'What were the key architectural decisions that led to our platform's scalability issues last quarter?' ---
- Performance Report Q4 2023: Latency in the 'order-service' increased by 200% during peak hours. Investigation points to synchronous calls to the 'inventory-service' creating a bottleneck.
- Post-mortem Q4 2023: A cascading failure in the 'user-auth' service led to a 3-hour outage. The root cause was identified as a database connection pool exhaustion under high load.
- ADR-005: To handle increased traffic, we are introducing a Kafka-based event queue between the 'order-service' and the 'fulfillment-service'. This decouples the services and provides better resilience.
- ADR-001: We will use a microservices architecture. Services will communicate via gRPC. This decision was made for team autonomy and independent scaling.

The results are dramatically better. We've successfully retrieved the performance report, the single most relevant document, along with the Q4 post-mortem, another load-related incident record. However, we've also pulled in ADR-001, which is only tangentially related. Recall has improved substantially, but precision still lags. This is where our next stage comes in.

HyDE Edge Cases and Considerations

  • Hallucination Risk: What if the HyDE-generated document is completely wrong? This is a valid concern. However, the vector search acts as a grounding mechanism. A wildly inaccurate hypothetical document will likely have no close neighbors in the vector space, leading to poor retrieval, but it's unlikely to retrieve documents that are confidently wrong. The risk is retrieving irrelevant documents, not actively misleading ones. A common mitigation is sketched just after this list.
  • Latency Cost: HyDE introduces an additional LLM call at the start of the process, adding latency (typically 500ms - 2s). This is a critical trade-off. For user-facing real-time applications, this might be unacceptable. For offline analysis or less time-sensitive bots, it's a small price for a large accuracy gain.
  • Prompt Engineering: The quality of the HYDE_PROMPT_TEMPLATE matters. It should guide the model to produce text that mimics the style and content of your document corpus.
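
One practical mitigation, in the spirit of the original HyDE paper (which averages the embeddings of several generated documents together with the query), is to build the search vector from an ensemble rather than a single generation, so that no one hallucinated passage dominates. Below is a minimal sketch under that assumption, reusing generate_hypothetical_document, embedding_model, index, and documents from the earlier snippets.

python
# hyde_ensemble.py (sketch; reuses objects from the earlier snippets)
def search_hyde_ensemble(query: str, n_hypotheses: int = 3, k: int = 4):
    """Searches with the mean of the query embedding and several HyDE embeddings."""
    texts = [query] + [generate_hypothetical_document(query) for _ in range(n_hypotheses)]
    embeddings = embedding_model.encode(texts)               # shape: (n_hypotheses + 1, dim)
    mean_embedding = embeddings.mean(axis=0, keepdims=True)  # single averaged search vector
    _, indices = index.search(mean_embedding.astype("float32"), k)
    return [documents[i] for i in indices[0]]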

Stage 2: The Re-ranking Precision Engine

Our HyDE-powered retrieval is good, but not perfect. We cast a wider net (k=4 or more) to boost recall, but now we need to identify the absolute best candidates to pass to the final LLM. A simple vector similarity score (like L2 distance or cosine similarity) is not nuanced enough for this final step.

This is the role of a re-ranker. We'll use a cross-encoder model. Here's the critical difference:

  • Bi-Encoders (for retrieval): Create separate embeddings for the query and each document. The comparison is fast (a dot product), making it suitable for searching over millions of documents. This is what SentenceTransformer does for initial retrieval.
  • Cross-Encoders (for re-ranking): Take the query and a single document together as a single input [CLS] query [SEP] document [SEP]. The model then performs full self-attention across both, allowing for a much deeper, token-level interaction. It outputs a single score (e.g., 0 to 1) indicating relevance. This is far more accurate but computationally expensive, making it unsuitable for initial retrieval but perfect for re-ranking a small set of candidates. The short sketch after this list makes the contrast concrete.
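
The sketch below scores the same query/document pair with both model types. It uses the CrossEncoder convenience wrapper from sentence-transformers purely for brevity; the next section builds the same re-ranker directly on transformers. The shortened query here is our own illustrative example, while the document is one of our corpus entries.

python
# encoder_comparison.py (illustrative sketch)
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = "What caused the order-service latency spike?"
document = ("Performance Report Q4 2023: Latency in the 'order-service' increased by 200% "
            "during peak hours. Investigation points to synchronous calls to the "
            "'inventory-service' creating a bottleneck.")

# Bi-encoder: two independent embeddings compared with cosine similarity (fast, approximate)
bi_score = util.cos_sim(bi_encoder.encode(query), bi_encoder.encode(document)).item()

# Cross-encoder: one joint forward pass over the concatenated pair (slower, more precise)
cross_score = cross_encoder.predict([(query, document)])[0]

print(f"bi-encoder cosine similarity: {bi_score:.3f}")
print(f"cross-encoder relevance score: {cross_score:.3f}")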

Implementation of a Cross-Encoder Re-ranker

We will use a pre-trained model from Hugging Face specifically designed for this task, like cross-encoder/ms-marco-MiniLM-L-6-v2.

python
# reranker_module.py
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

class ReRanker:
    def __init__(self, model_name: str = 'cross-encoder/ms-marco-MiniLM-L-6-v2'):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.model.to(self.device)
        print(f"ReRanker initialized on {self.device}")

    def compute_scores(self, query: str, documents: list[str]) -> list[float]:
        """Computes relevance scores for a list of documents against a query."""
        features = self.tokenizer([query] * len(documents), documents, padding=True, truncation=True, return_tensors="pt").to(self.device)

        self.model.eval()
        with torch.no_grad():
            scores = self.model(**features).logits
            # Apply sigmoid to get a score between 0 and 1
            normalized_scores = torch.sigmoid(scores).squeeze().cpu().numpy().tolist()

        # Handle case where only one document is passed
        if isinstance(normalized_scores, float):
            return [normalized_scores]
        return normalized_scores

    def rerank(self, query: str, documents: list[str], top_n: int = 2) -> list[str]:
        """Reranks documents and returns the top_n most relevant ones."""
        if not documents:
            return []

        scores = self.compute_scores(query, documents)

        # Pair documents with their scores and sort
        scored_docs = list(zip(documents, scores))
        scored_docs.sort(key=lambda x: x[1], reverse=True)

        # Return the top_n documents
        return [doc for doc, score in scored_docs[:top_n]]

# Example usage
reranker = ReRanker()

query = "What were the key architectural decisions that led to our platform's scalability issues last quarter?"
# Use the documents we got from the HyDE step
candidate_docs = retrieved_docs_hyde

reranked_docs = reranker.rerank(query, candidate_docs, top_n=2)

print(f"\n--- Re-ranked Documents (Top 2) ---")
for doc in reranked_docs:
    print(f"- {doc}")

Expected Re-ranked Output:

text
--- Re-ranked Documents (Top 2) ---
- Performance Report Q4 2023: Latency in the 'order-service' increased by 200% during peak hours. Investigation points to synchronous calls to the 'inventory-service' creating a bottleneck.
- ADR-005: To handle increased traffic, we are introducing a Kafka-based event queue between the 'order-service' and the 'fulfillment-service'. This decouples the services and provides better resilience.

Notice the improvement. The cross-encoder correctly identified that the Performance Report is the most relevant document because it directly addresses the "latency" and "bottleneck" concepts implicitly linked to "scalability issues." It also correctly prioritized ADR-005 (the solution) over the generic ADR-001 and the post-mortem about the unrelated 'user-auth' outage.

We now have a highly relevant, precision-focused context to feed our final LLM prompt. The chance of a correct, well-supported answer has increased dramatically.
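
Before we formalize everything in the full pipeline below, here is a minimal sketch of that final generation step, assuming the client from hyde_module.py and the reranked_docs and query variables from the re-ranking example are in scope. The prompt wording and model choice are illustrative.

python
# answer_generation.py (sketch; reuses `client`, `reranked_docs`, and `query` from above)
ANSWER_PROMPT = """Answer the user's query using only the context below.
If the context is insufficient, say so explicitly.

CONTEXT:
{context}

USER QUERY: {query}

ANSWER:"""

# Join the re-ranked documents into a single context block
context = "\n\n".join(reranked_docs)
response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": ANSWER_PROMPT.format(context=context, query=query)}],
    temperature=0.0,  # deterministic, grounded answering
)
print(response.choices[0].message.content.strip())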

The Complete Production Pipeline

Let's assemble these components into a single, cohesive class that represents our advanced RAG pipeline. This encapsulates the logic and makes it easy to manage and deploy.

python
# full_pipeline.py
import os
import faiss
from sentence_transformers import SentenceTransformer
from openai import OpenAI
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import time

# --- Component Definitions (condensed from above) ---

class HyDEGenerator:
    def __init__(self, model: str = "gpt-3.5-turbo"):
        self.client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
        self.model = model
        self.prompt_template = """Please write a short, hypothetical passage that provides a plausible answer to the following user query. Do not use any prior knowledge, and focus on creating a document that sounds like a definitive answer found in a technical knowledge base. USER QUERY: {query} HYPOTHETICAL DOCUMENT:"""

    def generate(self, query: str) -> str:
        prompt = self.prompt_template.format(query=query)
        response = self.client.chat.completions.create(
            model=self.model, messages=[{"role": "user", "content": prompt}], temperature=0.0
        )
        return response.choices[0].message.content.strip()

class VectorStore:
    def __init__(self, documents: list[str], model_name: str = 'all-MiniLM-L6-v2'):
        self.documents = documents
        self.embedding_model = SentenceTransformer(model_name)
        doc_embeddings = self.embedding_model.encode(self.documents)
        self.index = faiss.IndexFlatL2(doc_embeddings.shape[1])
        self.index.add(doc_embeddings)

    def retrieve(self, query_embedding, k: int) -> list[str]:
        _, indices = self.index.search(query_embedding, k)
        return [self.documents[i] for i in indices[0]]

class ReRanker:
    def __init__(self, model_name: str = 'cross-encoder/ms-marco-MiniLM-L-6-v2'):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.model.to(self.device)

    def rerank(self, query: str, documents: list[str], top_n: int) -> list[str]:
        if not documents: return []
        features = self.tokenizer([query] * len(documents), documents, padding=True, truncation=True, return_tensors="pt").to(self.device)
        with torch.no_grad():
            # squeeze(-1) keeps a 1-D tensor even when only one candidate is passed
            scores = self.model(**features).logits.squeeze(-1)
        sorted_indices = torch.argsort(scores, descending=True)
        return [documents[i] for i in sorted_indices[:top_n]]

class FinalAnswerGenerator:
    def __init__(self, model: str = "gpt-4-turbo-preview"):
        self.client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
        self.model = model
        self.prompt_template = """Based on the following context, please provide a direct and concise answer to the user's query. Cite the source documents by their content. CONTEXT: {context} USER QUERY: {query} ANSWER:"""

    def generate(self, query: str, context: str) -> str:
        prompt = self.prompt_template.format(context=context, query=query)
        response = self.client.chat.completions.create(
            model=self.model, messages=[{"role": "user", "content": prompt}], temperature=0.0
        )
        return response.choices[0].message.content.strip()

# --- The Master Pipeline Class ---

class AdvancedRAGPipeline:
    def __init__(self, documents: list[str]):
        self.vector_store = VectorStore(documents)
        self.hyde_generator = HyDEGenerator()
        self.reranker = ReRanker()
        self.answer_generator = FinalAnswerGenerator()

    def run(self, query: str, retrieve_k: int = 5, rerank_top_n: int = 2):
        print(f"--- Executing Advanced RAG for Query: '{query}' ---")

        # 1. HyDE Stage
        start_time = time.time()
        hypothetical_doc = self.hyde_generator.generate(query)
        hyde_latency = time.time() - start_time
        print(f"[1. HyDE] Generated hypothetical doc in {hyde_latency:.2f}s")

        # 2. Retrieval Stage
        start_time = time.time()
        hyde_embedding = self.vector_store.embedding_model.encode([hypothetical_doc])
        candidate_docs = self.vector_store.retrieve(hyde_embedding, k=retrieve_k)
        retrieval_latency = time.time() - start_time
        print(f"[2. Retrieval] Retrieved {len(candidate_docs)} candidates in {retrieval_latency:.2f}s")

        # 3. Re-ranking Stage
        start_time = time.time()
        reranked_docs = self.reranker.rerank(query, candidate_docs, top_n=rerank_top_n)
        rerank_latency = time.time() - start_time
        print(f"[3. Re-ranking] Re-ranked to top {len(reranked_docs)} in {rerank_latency:.2f}s")

        # 4. Final Answer Generation Stage
        context = "\n\n".join(reranked_docs)
        start_time = time.time()
        final_answer = self.answer_generator.generate(query, context)
        generation_latency = time.time() - start_time
        print(f"[4. Generation] Generated final answer in {generation_latency:.2f}s")

        total_latency = hyde_latency + retrieval_latency + rerank_latency + generation_latency
        print(f"--- Total Pipeline Latency: {total_latency:.2f}s ---")

        return {
            "final_answer": final_answer,
            "reranked_context": reranked_docs,
            "candidate_docs": candidate_docs
        }

# --- Execution ---
if __name__ == '__main__':
    # Use the same documents from setup.py
    documents = [
        "ADR-001: We will use a microservices architecture. Services will communicate via gRPC. This decision was made for team autonomy and independent scaling.",
        "Post-mortem Q4 2023: A cascading failure in the 'user-auth' service led to a 3-hour outage. The root cause was identified as a database connection pool exhaustion under high load.",
        "System Design: The 'inventory-service' uses a PostgreSQL database with read replicas to handle query load. All writes go to the primary instance.",
        "ADR-005: To handle increased traffic, we are introducing a Kafka-based event queue between the 'order-service' and the 'fulfillment-service'. This decouples the services and provides better resilience.",
        "Performance Report Q4 2023: Latency in the 'order-service' increased by 200% during peak hours. Investigation points to synchronous calls to the 'inventory-service' creating a bottleneck."
    ]

    pipeline = AdvancedRAGPipeline(documents)
    query = "What were the key architectural decisions that led to our platform's scalability issues last quarter?"
    result = pipeline.run(query)

    print("\n--- FINAL ANSWER ---")
    print(result['final_answer'])

Performance and Latency Breakdown

Running this pipeline provides a clear view of the performance trade-offs:

  • HyDE Latency: ~0.5-1.5s (dependent on LLM provider and model).
  • Retrieval Latency: < 50ms (FAISS is extremely fast).
  • Re-ranking Latency: ~100-300ms (for ~5 candidates on a CPU; faster on GPU). The latency scales linearly with the number of candidates.
  • Generation Latency: ~1.0-3.0s (dependent on LLM provider and answer length).
  • Total Latency: Typically in the 2-5 second range. This is acceptable for many applications but highlights the need for optimization.

The key tuning parameters are retrieve_k and rerank_top_n:

  • Increasing retrieve_k: Improves recall (higher chance of finding the right document) but increases the latency of the re-ranking step. A good starting point is between 10 and 20.
  • Increasing rerank_top_n: Provides more context to the final LLM, which can be good for complex questions, but increases the size of the final prompt and can introduce noise if lower-ranked documents are irrelevant. top_n=3 is often a sweet spot. A simple way to tune both is sketched after this list.
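
A rough way to ground these choices is to sweep both parameters over a handful of representative queries and record the end-to-end latency (and, if you have labeled relevant documents, recall). This is a minimal sketch that assumes AdvancedRAGPipeline and the documents list from full_pipeline.py are in scope (for example, run from the same file); the parameter ranges are illustrative and capped by our tiny five-document corpus.

python
# param_sweep.py (sketch; assumes AdvancedRAGPipeline and documents are in scope)
import time

pipeline = AdvancedRAGPipeline(documents)
test_queries = [
    "What were the key architectural decisions that led to our platform's scalability issues last quarter?",
    # ...add more representative queries from real traffic
]

for retrieve_k in (3, 5):            # with a real corpus, try 10-20 as suggested above
    for rerank_top_n in (1, 2, 3):
        start = time.time()
        for q in test_queries:
            pipeline.run(q, retrieve_k=retrieve_k, rerank_top_n=rerank_top_n)
        avg_latency = (time.time() - start) / len(test_queries)
        print(f"retrieve_k={retrieve_k}, rerank_top_n={rerank_top_n}: avg latency {avg_latency:.2f}s")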

Conclusion: From Retrieval to a Reasoning Pipeline

By composing HyDE and re-ranking, we've transformed our RAG system from a simple lookup tool into a multi-stage reasoning pipeline. This architecture acknowledges a critical truth: retrieval is not a solved problem, and vector similarity is merely a powerful first-pass filter.

For senior engineers building systems that require high-fidelity, trustworthy answers from large document sets, this pattern is not an academic curiosity; it is a necessary evolution. It trades a manageable increase in latency and complexity for a substantial and often decisive improvement in answer quality.

The next steps in this journey involve even more sophisticated techniques, such as query decomposition for multi-hop questions, graph-based RAG for structured data, and adaptive retrieval strategies that choose which pipeline to run based on query complexity. But the foundational principle remains the same: treat retrieval as a dynamic, multi-step process of hypothesis, filtering, and refinement.
