Advanced RAG: HyDE & Self-Correction for Production-Ready LLMs

23 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Silent Failures of Naive RAG

Every engineer building with Large Language Models (LLMs) has implemented a basic Retrieval-Augmented Generation (RAG) pipeline. The pattern is deceptively simple: take a user query, embed it, perform a vector similarity search against a document store, prepend the retrieved chunks as context, and ask the LLM to generate an answer. For a significant percentage of queries, this works remarkably well.

The problem lies in the long tail of complex, nuanced, or abstract queries where this simple approach silently fails. The generated output might look plausible, but it's often superficial, factually inconsistent with the source documents, or based on irrelevant context. The core issue is a fundamental semantic mismatch: the embedding of a user's question may not reside in the same vector space region as the embedding of the document chunk containing the answer.

Consider a knowledge base of internal software engineering design documents. A junior engineer might ask a direct question:

"What is the timeout for the billing-service API?"

An embedding of this query will likely have high similarity to a document chunk that contains the words "timeout," "billing-service," and "API." Naive RAG excels here.

Now, consider a query from a principal engineer:

"How do we ensure idempotent transaction processing during payment gateway failures?"

This query is abstract. The document containing the solution might not even use the word "idempotent." It might describe a pattern using idempotency keys, a persistent outbox, and a state machine. A naive vector search on the query's embedding will likely retrieve high-level architectural overviews or documents that mention "failures," but miss the specific, detailed implementation guide. The LLM, given this suboptimal context, will generate a generic, unhelpful answer about idempotency.
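
You can make this mismatch concrete by comparing embeddings directly. The sketch below is purely illustrative (the snippet text, the queries, and whatever similarity values you get are not from a real knowledge base); it uses the same OpenAIEmbeddings model the rest of the post relies on:

python
# semantic_gap.py -- an illustrative probe, not part of the pipelines built below
import numpy as np
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings

load_dotenv()
embeddings = OpenAIEmbeddings()

def cosine(a, b) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A chunk that answers the abstract question without ever using the word "idempotent"
chunk = ("The client sends a unique UUIDv4 key in a request header; the server stores it "
         "and returns the saved response if the same key is seen again, so retries never "
         "re-run the charge.")
chunk_vec = embeddings.embed_query(chunk)

queries = {
    "direct": "What header does the client send to avoid duplicate charges?",
    "abstract": "How do we ensure idempotent transaction processing during payment gateway failures?",
}
for label, q in queries.items():
    print(f"{label:>8}: cosine similarity = {cosine(embeddings.embed_query(q), chunk_vec):.3f}")

If the abstract query lands noticeably further from the chunk than the direct one, you are looking at exactly the gap the techniques below are designed to close.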

This is the crux of the problem we're solving. Production-grade RAG isn't about a single API call; it's about building a multi-step reasoning pipeline that actively works to bridge the semantic gap and validate its own output.

In this post, we will dissect and implement two powerful patterns to elevate your RAG system from a prototype to a production-ready asset:

  • Hypothetical Document Embeddings (HyDE): A technique to align the query and document embeddings by first generating a hypothetical answer and using its embedding for the search.
  • Self-Correction Loops: A multi-agent pattern where the system generates an initial answer, critiques it against the source documents, and refines it if factual inconsistencies are found.
    Let's start by building our baseline—a naive RAG system—and demonstrating its failure.

    Setting Up the Environment

    We'll use Python with a few key libraries. Ensure you have them installed:

    bash
    pip install langchain langchain-openai langchain-community openai faiss-cpu python-dotenv

    Create a .env file for your OpenAI API key:

    text
    OPENAI_API_KEY="your-api-key-here"

    Our setup will use a local FAISS vector store for simplicity, but these patterns are agnostic to the vector database you choose (Pinecone, Weaviate, etc.).
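
    For example, if you preferred Chroma, only the vector-store construction would change; everything downstream consumes the same retriever interface. A minimal sketch (assuming the `chromadb` package is installed; `get_retriever_chroma` is just an illustrative name):

    python
    # Hypothetical variant of the retriever factory using Chroma instead of FAISS
    from langchain_community.vectorstores import Chroma
    
    def get_retriever_chroma(docs, embeddings):
        vectorstore = Chroma.from_documents(docs, embeddings)
        return vectorstore.as_retriever(search_kwargs={"k": 2})

    The full setup below sticks with FAISS.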

    python
    # setup.py
    import os
    from dotenv import load_dotenv
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.document_loaders import TextLoader
    from langchain_community.vectorstores import FAISS
    from langchain_openai import OpenAIEmbeddings, ChatOpenAI
    
    load_dotenv()
    
    # Initialize LLMs and Embeddings
    llm = ChatOpenAI(model_name="gpt-4-turbo", temperature=0)
    embeddings = OpenAIEmbeddings()
    
    # Create a sample knowledge base
    KB_CONTENT = """
    # Design Doc: Idempotency in Payment Service
    
    **Author:** Alex Chen
    **Date:** 2023-05-12
    
    ## 1. Problem Statement
    
    Network failures between our service and the external payment gateway (Stripe) can lead to duplicate transaction charges. If we send a charge request, don't receive a response due to a timeout, and then retry the same request, the customer could be charged twice.
    
    ## 2. Proposed Solution: Idempotency Keys
    
    To prevent duplicate processing, we will implement idempotency keys for all mutating API calls to the payment gateway.
    
    ### 2.1. Key Generation
    
    - An `Idempotency-Key` header, containing a unique UUIDv4 value, will be generated by the client for each distinct transaction attempt.
    - This key will be stored in a dedicated `idempotency_keys` table in our PostgreSQL database with a TTL of 24 hours.
    
    ### 2.2. Server-Side Logic
    
    1.  When a request with an `Idempotency-Key` is received, the server first checks if this key exists in our database.
    2.  **New Key:** If the key is not found, it's saved to the database, and the request is processed. The resulting status code and response body are stored alongside the key.
    3.  **Existing Key:** If the key is found, the server immediately stops processing and returns the stored response (status code and body) from the initial request.
    
    This ensures that even if a client retries a request multiple times, the core transaction logic is executed only once.
    
    ## 3. Edge Cases
    
    - **Concurrent Requests:** A unique constraint on the `idempotency_key` column in the database will prevent race conditions where two identical requests arrive simultaneously.
    - **Key Expiration:** The 24-hour TTL is a trade-off between resource usage and the likelihood of a legitimate retry after a prolonged period.
    """
    
    with open("kb.md", "w") as f:
        f.write(KB_CONTENT)
    
    # Load and process the document
    def get_retriever():
        loader = TextLoader("./kb.md")
        documents = loader.load()
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
        docs = text_splitter.split_documents(documents)
        
        print(f"Created {len(docs)} document chunks.")
    
        # Create and return the retriever
        vectorstore = FAISS.from_documents(docs, embeddings)
        return vectorstore.as_retriever(search_kwargs={"k": 2})
    
    if __name__ == "__main__":
        retriever = get_retriever()
        print("Retriever created successfully.")

    The Naive RAG Implementation and Its Failure

    Now, let's implement a standard RAG chain and test it with our complex query.

    python
    # naive_rag.py
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnablePassthrough
    from langchain_core.output_parsers import StrOutputParser
    from setup import get_retriever, llm
    
    retriever = get_retriever()
    
    template = """Answer the question based only on the following context:
    {context}
    
    Question: {question}
    """
    prompt = ChatPromptTemplate.from_template(template)
    
    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)
    
    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    
    if __name__ == "__main__":
        query = "How do we ensure idempotent transaction processing during payment gateway failures?"
        
        # Let's inspect the retrieved documents first
        retrieved_docs = retriever.invoke(query)
        print("--- RETRIEVED DOCS ---")
        for i, doc in enumerate(retrieved_docs):
            print(f"DOC {i+1}:\n{doc.page_content}\n")
    
        print("--- NAIVE RAG RESPONSE ---")
        response = rag_chain.invoke(query)
        print(response)

    Running this will produce output similar to the following:

    text
    Created 4 document chunks.
    --- RETRIEVED DOCS ---
    DOC 1:
    # Design Doc: Idempotency in Payment Service
    
    **Author:** Alex Chen
    **Date:** 2023-05-12
    
    ## 1. Problem Statement
    
    Network failures between our service and the external payment gateway (Stripe) can lead to duplicate transaction charges. If we send a charge request, don't receive a response due to a timeout, and then retry the same request, the customer could be charged twice.
    
    ## 2. Proposed Solution: Idempotency Keys
    
    To prevent duplicate processing, we will implement idempotency keys for all mutating API calls to the payment gateway.
    
    DOC 2:
    - **Concurrent Requests:** A unique constraint on the `idempotency_key` column in the database will prevent race conditions where two identical requests arrive simultaneously.
    - **Key Expiration:** The 24-hour TTL is a trade-off between resource usage and the likelihood of a legitimate retry after a prolonged period.
    
    --- NAIVE RAG RESPONSE ---
    Based on the context provided, idempotent transaction processing during payment gateway failures is ensured by implementing idempotency keys. This involves the client generating a unique `Idempotency-Key` for each transaction attempt. On the server side, this key is checked against a database. If the key is new, the request is processed and the result is stored. If the key already exists, the stored response from the initial request is returned, preventing the transaction logic from being executed a second time. This handles retries after network failures and timeouts. Edge cases like concurrent requests are managed with a unique database constraint on the key.

    In this controlled example, the retrieval is actually quite good because our document is small and focused. However, in a real-world knowledge base with thousands of documents, the query's abstract nature would cause it to match with less relevant documents. To simulate this, imagine the query was slightly different: "What is our strategy for financial data consistency?" The term "idempotency" might not even be in the top-k results. The failure mode is subtle but critical.
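
    If you want to see this drift numerically rather than just eyeballing the retrieved chunks, you can ask FAISS for the raw distances instead of going through the retriever. The probe below is illustrative only; it rebuilds the small store from kb.md, and the exact distance values will vary:

    python
    # score_probe.py -- inspect raw FAISS distances for our query variants
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.document_loaders import TextLoader
    from langchain_community.vectorstores import FAISS
    from setup import embeddings  # importing setup also (re)writes kb.md
    
    docs = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(
        TextLoader("./kb.md").load()
    )
    vectorstore = FAISS.from_documents(docs, embeddings)
    
    queries = [
        "How do we ensure idempotent transaction processing during payment gateway failures?",
        "What is our strategy for financial data consistency?",
    ]
    for q in queries:
        print(f"\nQuery: {q}")
        for doc, score in vectorstore.similarity_search_with_score(q, k=2):
            # FAISS returns a distance here: lower means closer in vector space.
            print(f"  distance={score:.3f}  chunk starts: {doc.page_content[:60]!r}")

    On a corpus this small both queries still land on the right chunks, but watching the distances widen as queries become more abstract is a useful early-warning signal on larger knowledge bases.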

    Pattern 1: Hypothetical Document Embeddings (HyDE)

    HyDE, introduced in the paper "Precise Zero-Shot Dense Retrieval without Relevance Labels" by Gao et al., flips the retrieval process on its head. Instead of using the query's embedding directly, it follows a two-step process:

  • Generate: Pass the user's query to an LLM and ask it to generate a hypothetical document or answer that it believes would perfectly answer the query. This is done without any context from our knowledge base (zero-shot).
  • Embed & Search: Take this generated hypothetical document, create an embedding for it, and then use this new embedding to perform the vector search against your actual document store.
    Why does this work? The generated document, even if it is factually incorrect, is rich in the vocabulary, structure, and semantic concepts that are likely to appear in the actual answer. Its embedding is therefore much closer in vector space to the true source documents than the original, often sparse or abstract, query embedding.

    Implementing HyDE

    Let's build a chain that implements this logic. We'll need a prompt to instruct the LLM to generate the hypothetical document.

    python
    # hyde_rag.py
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnableLambda
    from langchain_core.output_parsers import StrOutputParser
    from setup import get_retriever, llm
    from naive_rag import format_docs  # Reuse the doc formatter
    
    retriever = get_retriever()
    
    # 1. HyDE Prompt: Generate a hypothetical document
    hyde_template = """Please write a short, hypothetical document that answers the following user question. 
    Focus on providing a detailed, technical explanation.
    Question: {question}
    Hypothetical Document:"""
    hyde_prompt = ChatPromptTemplate.from_template(hyde_template)
    
    # 2. HyDE Chain: Query -> LLM -> Hypothetical Document
    hyde_chain = hyde_prompt | llm | StrOutputParser()
    
    # 3. Main RAG Prompt (same as before)
    rag_template = """Answer the question based only on the following context:
    {context}
    
    Question: {question}
    """
    rag_prompt = ChatPromptTemplate.from_template(rag_template)
    
    # 4. The Full Chain
    # This is more complex. We need to pass the original question through.
    # We'll use a custom function for the retrieval step.
    
    def hyde_retriever(query: str):
        print("--- GENERATING HYPOTHETICAL DOCUMENT ---")
        hypothetical_document = hyde_chain.invoke({"question": query})
        print(hypothetical_document)
        
        print("\n--- PERFORMING VECTOR SEARCH WITH HYDE EMBEDDING ---")
        return retriever.invoke(hypothetical_document)
    
    # The final chain combines HyDE retrieval with the generation step.
    # Plain functions and lambdas must be wrapped in RunnableLambda before they
    # can be composed with the | operator.
    extract_question = RunnableLambda(lambda x: x["question"])
    
    full_hyde_chain = (
        {
            "context": extract_question | RunnableLambda(hyde_retriever) | format_docs,
            "question": extract_question,
        }
        | rag_prompt
        | llm
        | StrOutputParser()
    )
    
    if __name__ == "__main__":
        query = "How do we ensure idempotent transaction processing during payment gateway failures?"
        
        response = full_hyde_chain.invoke({"question": query})
        
        print("\n--- HYDE RAG RESPONSE ---")
        print(response)

    When we run this, observe the intermediate output:

    text
    Created 4 document chunks.
    --- GENERATING HYPOTHETICAL DOCUMENT ---
    To ensure idempotent transaction processing during payment gateway failures, a robust system typically employs a combination of client-generated idempotency keys and server-side state management.
    
    First, the client application initiating the transaction generates a unique identifier, often a UUID, known as an `Idempotency-Key`. This key is sent in the header of the API request for the payment.
    
    On the server, when a request is received, the system first inspects for an `Idempotency-Key`. It then performs a lookup in a dedicated idempotency store (e.g., a Redis cache or a database table) for this key. 
    
    - If the key is not found, it signifies a new request. The server saves the key with a 'processing' status, executes the payment transaction with the external gateway, and upon completion, stores the final result (success or failure) associated with the key before sending the response to the client. 
    - If the key is found, the server checks its status. If it's 'completed', the stored result is immediately returned without re-processing the transaction. If it's 'processing', it indicates a potential race condition or a retry of a request that is still in flight, and the server can respond with a specific error code indicating the transaction is in progress.
    
    This mechanism guarantees that even if the client retries a request due to a network timeout, the core financial transaction is only executed once, preventing duplicate charges.
    
    --- PERFORMING VECTOR SEARCH WITH HYDE EMBEDDING ---
    
    --- HYDE RAG RESPONSE ---
    To ensure idempotent transaction processing during payment gateway failures, the system uses idempotency keys. A unique `Idempotency-Key` (UUIDv4) is generated by the client for each transaction attempt and included in the request header. 
    
    The server-side logic is as follows:
    1.  Upon receiving a request, the server checks a database for the provided `Idempotency-Key`.
    2.  If the key is new, it's saved, the transaction is processed, and the response is stored with the key.
    3.  If the key already exists, the server skips processing and immediately returns the previously stored response.
    
    This ensures that a transaction is executed only once, even if the client sends multiple retries. Edge cases like concurrent requests are handled by a unique constraint on the `idempotency_key` column in the database.

    Notice the generated hypothetical document. It's a perfect, textbook explanation of idempotency keys. It's rich with keywords and concepts like UUID, Redis cache, database table, race condition, and processing status. The embedding of this document is a far better search query for our knowledge base than the original question's embedding. It acts as a semantic bridge, dramatically increasing the likelihood of retrieving the most relevant chunks.
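
    You can quantify that bridge with the same kind of distance probe used earlier, searching once with the raw query and once with the hypothetical document. Again, this is an illustrative sketch rather than part of the main chain, and the distance values will vary from run to run:

    python
    # hyde_probe.py -- compare raw-query vs. hypothetical-document retrieval distances
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.document_loaders import TextLoader
    from langchain_community.vectorstores import FAISS
    from setup import embeddings
    from hyde_rag import hyde_chain  # the query -> hypothetical document chain
    
    docs = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(
        TextLoader("./kb.md").load()
    )
    vectorstore = FAISS.from_documents(docs, embeddings)
    
    query = "How do we ensure idempotent transaction processing during payment gateway failures?"
    hypothetical = hyde_chain.invoke({"question": query})
    
    for label, search_text in [("raw query", query), ("hypothetical document", hypothetical)]:
        doc, distance = vectorstore.similarity_search_with_score(search_text, k=1)[0]
        # Lower distance means the search text sits closer to the winning chunk.
        print(f"{label:>21}: distance={distance:.3f}, chunk starts: {doc.page_content[:50]!r}")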

    Performance and Cost Considerations for HyDE

  • Latency: HyDE introduces a significant latency penalty—an entire extra LLM call—before the retrieval step even begins. For user-facing applications requiring real-time responses, this can be a deal-breaker. It's best suited for asynchronous tasks or scenarios where accuracy is paramount and users can tolerate a few seconds of delay.
  • Cost: You are doubling the number of LLM calls per query. This can substantially increase operational costs. A common optimization is to use a smaller, faster, and cheaper model (e.g., GPT-3.5-Turbo or a fine-tuned open-source model) for the hypothetical document generation step, while reserving the more powerful model (e.g., GPT-4) for the final answer generation.
  • When to Use HyDE: Apply HyDE selectively. You could implement a router or classifier that analyzes the user's query. If the query is detected as abstract or complex, route it through the HyDE pipeline. If it's a simple, direct question, use the naive RAG path to save on latency and cost. A minimal router sketch follows below.
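
    A minimal version of that router is a single classification call to a cheap model. The prompt wording, the labels, and the model choice below are illustrative assumptions rather than a prescribed design:

    python
    # query_router.py -- route simple queries to naive RAG, complex ones to HyDE
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.output_parsers import StrOutputParser
    from langchain_openai import ChatOpenAI
    from naive_rag import rag_chain
    from hyde_rag import full_hyde_chain
    
    # A small, cheap model is plenty for a two-way classification.
    router_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
    
    router_prompt = ChatPromptTemplate.from_template(
        """Classify the user query as 'simple' (a direct, factual lookup) or 'complex'
    (abstract, multi-concept, or reasoning-heavy). Reply with a single word.
    
    Query: {question}"""
    )
    router_chain = router_prompt | router_llm | StrOutputParser()
    
    def answer(question: str) -> str:
        label = router_chain.invoke({"question": question}).strip().lower()
        if "complex" in label:
            return full_hyde_chain.invoke({"question": question})
        return rag_chain.invoke(question)

    The same cheaper model can also handle the hypothetical document generation itself, reserving the larger model for final answer synthesis.
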
    Pattern 2: Self-Correction Loops for Factual Consistency

    Even with perfect context retrieval, LLMs can still hallucinate or misinterpret the provided text. A self-correction or self-refinement loop addresses this by making the generation process an explicit, multi-step dialogue with itself.

    The core idea is to force the LLM to act as its own worst critic.

  • Generate Initial Answer: Use the RAG pipeline (ideally the HyDE-enhanced one) to produce a first-draft answer.
  • Critique: Make a second LLM call. Provide it with the original query, the retrieved source documents, and the generated answer. Use a carefully crafted prompt that asks it to act as a fact-checker. Its task is to critique the answer strictly based on the provided source documents, checking for inconsistencies, omissions, or hallucinations.
  • Decide & Refine: Based on the critique, decide if the answer is sufficient. If the critique finds flaws, make a third LLM call. Provide the original context, the flawed answer, and the critique, and ask it to generate a new, refined answer that addresses the identified issues.
    This pattern transforms a simple generation call into a robust reasoning agent.

    Implementing a Self-Correcting RAG Agent

    We'll structure this using a graph-based approach, which is a natural fit for these conditional, multi-step chains. While libraries like LangGraph are designed for this, we can implement the logic explicitly in Python to understand the mechanics.

    python
    # self_correcting_rag.py
    from typing import TypedDict, List
    from pydantic import BaseModel, Field
    from langchain_core.documents import Document
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.output_parsers.json import JsonOutputParser
    from setup import llm
    from hyde_rag import hyde_retriever, format_docs
    
    # Define the state that is passed between the nodes of our graph
    class GraphState(TypedDict):
        question: str
        documents: List[Document]  # retrieved LangChain Document objects
        generation: str
        critique: dict
    
    # Pydantic model for the structured critique output
    class Critique(BaseModel):
        is_supported: bool = Field(description="Whether ALL claims in the generation are supported by the context.")
        feedback: str = Field(description="Detailed feedback on unsupported claims.")
    
    # 1. Initial Generation Node
    def generate(state: GraphState):
        print("--- GENERATING INITIAL ANSWER ---")
        question = state["question"]
        documents = state["documents"]
        
        prompt = ChatPromptTemplate.from_template(
            """Answer the user's question based ONLY on the following context.
            If the context doesn't contain the answer, state that you don't have enough information.
            
            Context:
            {context}
            
            Question:
            {question}
            """
        )
        
        chain = prompt | llm | StrOutputParser()
        generation = chain.invoke({"context": format_docs(documents), "question": question})
        
        print(f"Initial Generation: {generation}")
        return {"generation": generation, "documents": documents, "question": question}
    
    # 2. Critique Node
    def critique(state: GraphState):
        print("--- CRITIQUING THE GENERATION ---")
        question = state["question"]
        documents = state["documents"]
        generation = state["generation"]
        
        parser = JsonOutputParser(pydantic_object=Critique)
        
        prompt = ChatPromptTemplate.from_template(
            """You are a meticulous fact-checker. Your task is to evaluate a generated answer against a set of source documents.
            Analyze the generated answer for any claims that are NOT supported by the provided context.
            
            Respond in a JSON format with two keys:
            'is_supported': boolean, true if ALL claims in the generation are supported by the context, false otherwise.
            'feedback': string, a detailed explanation of any unsupported claims. If all claims are supported, say 'All claims are supported.'
            
            Source Documents (Context):
            {context}
            
            Generated Answer:
            {generation}
            
            Question:
            {question}
            
            {format_instructions}
            """
        )
        
        chain = prompt | llm | parser
        critique_output = chain.invoke({
            "context": format_docs(documents), 
            "generation": generation, 
            "question": question, 
            "format_instructions": parser.get_format_instructions()
        })
        
        print(f"Critique Output: {critique_output}")
        return {"critique": critique_output, **state}
    
    # 3. Refinement Node
    def refine(state: GraphState):
        print("--- REFINING THE GENERATION ---")
        question = state["question"]
        documents = state["documents"]
        generation = state["generation"]
        critique = state["critique"]
        
        prompt = ChatPromptTemplate.from_template(
            """The user's question was: {question}
            
            The original answer you generated was:
            {generation}
            
            However, a fact-checker provided the following feedback:
            {feedback}
            
            Please generate a new, improved answer that addresses the feedback, using only the provided source documents.
            
            Source Documents (Context):
            {context}
            """
        )
        
        chain = prompt | llm | StrOutputParser()
        new_generation = chain.invoke({
            "question": question,
            "generation": generation,
            "feedback": critique['feedback'],
            "context": format_docs(documents)
        })
        
        return {"generation": new_generation, **state}
    
    # The main agent loop
    def run_self_correcting_rag(query: str, max_iterations: int = 2):
        state = {"question": query}
        
        # 1. Retrieve documents first (using HyDE)
        state["documents"] = hyde_retriever(query)
        
        # 2. Generate the initial answer once, before the critique loop
        state = generate(state)
        
        for i in range(max_iterations):
            print(f"\n--- ITERATION {i+1} ---")
            # 3. Critique the current generation
            state = critique(state)
            
            # 4. Decision Gate
            if state['critique']['is_supported']:
                print("--- CRITIQUE PASSED, FINAL ANSWER ---")
                return state['generation']
            
            # 5. Refine, then loop back so the refined answer is critiqued again
            print("--- CRITIQUE FAILED, REFINING ---")
            state = refine(state)
        
        print("--- MAX ITERATIONS REACHED, RETURNING LAST GENERATION ---")
        return state['generation']
    
    if __name__ == "__main__":
        # Let's use a query that might tempt the LLM to add outside knowledge
        query = "How does our payment service's idempotency key TTL compare to industry standards?"
        
        final_answer = run_self_correcting_rag(query)
        
        print("\n\n--- FINAL SELF-CORRECTED RAG RESPONSE ---")
        print(final_answer)

    Let's analyze the execution flow with our new query:

  • Query: "How does our payment service's idempotency key TTL compare to industry standards?"
  • HyDE Retrieval: HyDE will generate a document about idempotency keys and TTLs, successfully retrieving our design doc chunk that mentions the "24-hour TTL."
  • Iteration 1 - Generate: The LLM, given the context, might be tempted to use its general knowledge. It could generate an answer like: "Our service uses a 24-hour TTL for idempotency keys. This is a common practice, as industry standards typically range from a few minutes to 24 hours to balance safety and resource usage."
  • Iteration 1 - Critique: The critique agent will receive this answer and the source document. The source document only states there is a 24-hour TTL. It says nothing about industry standards. The critique prompt will force the LLM to be pedantic. The JSON output will be: {"is_supported": false, "feedback": "The claim that industry standards typically range from a few minutes to 24 hours is not supported by the provided context. The document only mentions our specific 24-hour TTL."}
  • Iteration 1 - Decision: is_supported is false. The loop proceeds to the refinement step.
  • Iteration 1 - Refine: The refinement agent receives the original answer, the critique, and the context. It is explicitly told to fix the answer based on the feedback. It will now generate a much more precise and factually grounded response: "Our payment service implements a 24-hour Time-To-Live (TTL) for idempotency keys. The provided context does not contain information about how this compares to general industry standards."
  • Iteration 2 - Critique: The refined answer is passed to the critique node again. This time, every claim is directly supported by the text. The critique would return {"is_supported": true, ...}.
  • Iteration 2 - Decision: is_supported is true. The loop terminates, returning the corrected, factually consistent answer.
    Production Considerations for Self-Correction

  • Prompt Engineering is Critical: The success of this pattern hinges almost entirely on the quality of your critique and refinement prompts. They must be explicit, constraint-driven, and guide the LLM to the desired behavior. Using structured output (JSON) for the critique step is highly recommended as it makes the decision gate deterministic.
  • Cost and Latency Amplification: This pattern is even more expensive than HyDE. A single query can result in 3, 5, or more LLM calls. This is untenable for most real-time applications. It is best suited for high-stakes, asynchronous report generation, complex query analysis, or as a component in an autonomous agent where accuracy is the absolute priority.
  • Preventing Loops: Set a maximum number of iterations to prevent the agent from getting stuck in a refinement loop where it fails to satisfy the critique. The LLM might oscillate between two slightly different but equally flawed answers.
  • Observability: Tracing the execution of these graphs is paramount for debugging. Tools like LangSmith, Arize, or even custom logging that saves the state (generation, critique, refinement) at each step are essential. Without this, understanding why a query produced a specific output is nearly impossible. A bare-bones logging example follows below.
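
    If you are not using a hosted tracing tool, even a few lines of custom logging go a long way. A minimal sketch; the file name and record schema are illustrative assumptions:

    python
    # trace_log.py -- append a JSONL record after each node in the agent loop
    import json
    import time
    from pathlib import Path
    
    TRACE_FILE = Path("rag_traces.jsonl")
    
    def log_step(step: str, state: dict) -> None:
        """Persist a snapshot of the agent state so failed runs can be replayed."""
        snapshot = {
            # Documents are LangChain Document objects; keep only their text.
            k: ([d.page_content for d in v] if k == "documents" else v)
            for k, v in state.items()
        }
        record = {"ts": time.time(), "step": step, "state": snapshot}
        with TRACE_FILE.open("a") as f:
            f.write(json.dumps(record, default=str) + "\n")

    Calling log_step("generate", state), log_step("critique", state), and log_step("refine", state) inside run_self_correcting_rag gives you a replayable record of every intermediate generation and critique.
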
    Conclusion: RAG as a Multi-Agent System

    The journey from a basic RAG prototype to a production-ready system is one of increasing complexity and robustness. By moving away from the single-shot paradigm and embracing multi-step reasoning pipelines, we can mitigate the core weaknesses of naive RAG.

  • HyDE tackles the input problem: the semantic gap between query and context. It's a powerful and relatively simple pattern to improve retrieval accuracy for complex queries, at the cost of one extra LLM call.
  • Self-Correction tackles the output problem: the LLM's tendency to hallucinate or misrepresent the retrieved context. It enforces factual consistency by making the LLM its own auditor, at a significant cost to latency and budget.
    Senior engineers building these systems should think of themselves not as prompt engineers, but as architects of small, specialized agentic systems. The solution to a complex query is rarely a single, perfect prompt. Instead, it's often a well-orchestrated collaboration between multiple, focused LLM calls, each with a specific job: generating hypotheses, retrieving information, synthesizing answers, critiquing facts, and refining outputs. By combining patterns like HyDE and Self-Correction, you can build RAG systems that are not only powerful but also reliable, trustworthy, and ready for production.
