Self-Correcting RAG: Advanced Query Rewriting for Production LLMs
The Fragility of Production RAG: Why Simple Retrieval Fails
In any production-grade Retrieval-Augmented Generation (RAG) system, the quality of the final generated answer is overwhelmingly dependent on the quality of the initial retrieval step. A simple vector similarity search works beautifully for straightforward queries that map cleanly to document chunks in your knowledge base. However, senior engineers know that production workloads are never that simple. User queries are often ambiguous, multi-faceted, or use terminology that differs from the source documents. This semantic gap leads to irrelevant context, which in turn causes factual inaccuracies, hallucinations, and ultimately, a loss of user trust.
Consider a standard RAG pipeline for an internal engineering documentation bot. A user might ask: "How did we solve the N+1 query problem in the billing service last quarter?"
A naive RAG system might perform a vector search on this entire query. It might find documents about "billing service," documents about "N+1 queries," and maybe some quarterly reports. The retrieved context is a scattered, unfocused collection of chunks, leading to a generic, unhelpful answer like: *"The billing service is a critical component. N+1 query problems are performance issues that should be optimized."*
This is the critical failure mode we're addressing. The core problem is that the initial, user-provided query is not always the optimal query for retrieval. The solution is not to simply tweak embedding models or chunking strategies, but to build a dynamic, responsive system that can recognize poor retrieval and correct its own course. This is the essence of a Self-Correcting RAG Pipeline.
This article details the architecture and implementation of such a system. We will build a closed-loop pipeline that:
- Performs an initial retrieval.
- Uses a dedicated LLM call (an "LLM-as-a-Judge") to evaluate the relevance of the retrieved context against the original query.
- If the context is deemed insufficient, it triggers a query rewriting step (using an "LLM-as-a-Rewriter").
- Performs a second, more precise retrieval using the refined query.
- Generates the final answer with the superior context.
We'll move beyond simplistic rewriting and implement advanced, targeted techniques like Step-Back Prompting for broader context and Hypothetical Document Embeddings (HyDE) for semantic alignment.
Architectural Overview of the Self-Correcting Loop
Before diving into code, let's visualize the architecture. This is not a simple linear flow; it's a conditional graph.
graph TD
A[User Query] --> B{Initial Retrieval};
B --> C{Retrieved Context v1};
C --> D{LLM-as-a-Judge};
A --> D;
D -- Context Sufficient --> F[Final Generation];
D -- Context Insufficient --> E{LLM-as-a-Rewriter};
A --> E;
E --> G{Rewritten Query};
G --> H{Final Retrieval};
H --> I{Retrieved Context v2};
I --> F;
C --> F;
- LLM-as-a-Judge: Evaluates Context v1 with respect to the User Query. It outputs a decision: SUFFICIENT or INSUFFICIENT.
- LLM-as-a-Rewriter: Activated only when the judge returns INSUFFICIENT. This component uses more sophisticated prompting to transform the original query.
- Final Retrieval: Uses the Rewritten Query to fetch Context v2, which should be significantly more relevant.
- Final Generation: Produces the answer from the best available context (v2 if it exists, otherwise v1).
Implementation Deep Dive: Building the Components
Let's build this system in Python. We'll use langchain for orchestration, but the core principles are framework-agnostic. Assume we have a pre-existing vector store (e.g., FAISS or a managed service like Pinecone) populated with our engineering documentation.
Setup and Baseline Retriever
First, let's establish our baseline. We'll use a langchain retriever. For production, you'd likely use a more robust hybrid search implementation, but this demonstrates the principle.
import os
from langchain_openai import ChatOpenAI, OpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.schema.document import Document
# Assume environment variables for API keys are set
# os.environ["OPENAI_API_KEY"] = "..."
# --- 1. Setup Models and Embeddings ---
# gpt-4-turbo-preview is a chat model and requires ChatOpenAI;
# gpt-3.5-turbo-instruct is a completion model, so plain OpenAI is correct for the judge.
llm_generator = ChatOpenAI(model_name="gpt-4-turbo-preview", temperature=0.1)
llm_judge = OpenAI(model_name="gpt-3.5-turbo-instruct", temperature=0, top_p=1)
llm_rewriter = ChatOpenAI(model_name="gpt-4-turbo-preview", temperature=0.3)
embeddings = OpenAIEmbeddings()
# --- 2. Create a Dummy Vector Store ---
# In a real scenario, this would be your actual knowledge base.
dummy_texts = [
"In Q3, the billing service team resolved ticket BIL-789 by implementing batch processing for invoice generation, which mitigated the N+1 database query bottleneck.",
"The N+1 query pattern is an anti-pattern in ORM usage where one query to retrieve parent items results in N additional queries to retrieve child items.",
"Quarterly engineering reviews are held to discuss major architectural improvements. The Q3 review focused on performance optimizations across services.",
"Our primary Java microservices, including the billing service, use the Spring Boot framework with JPA/Hibernate for data persistence.",
"A deep-dive analysis of the billing service's performance degradation was presented in the October all-hands meeting, highlighting the ORM-level inefficiencies."
]
dummy_docs = [Document(page_content=t) for t in dummy_texts]
vector_store = FAISS.from_documents(dummy_docs, embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 2})
# --- 3. Baseline RAG Chain (for comparison) ---
baseline_prompt_template = """
Answer the question based only on the following context:
{context}
Question: {question}
Answer:
"""
BASELINE_PROMPT = PromptTemplate(
template=baseline_prompt_template, input_variables=["context", "question"]
)
baseline_rag_chain = RetrievalQA.from_chain_type(
llm=llm_generator,
chain_type="stuff",
retriever=retriever,
chain_type_kwargs={"prompt": BASELINE_PROMPT}
)
# --- Test Baseline ---
query = "How did we solve the N+1 query problem in the billing service last quarter?"
result = baseline_rag_chain.invoke(query)
print(f"Baseline Answer: {result['result']}")
# Expected (poor) baseline result might be:
# Baseline Answer: The N+1 query pattern is an anti-pattern. The billing service had performance issues discussed in Q3.
# This answer is generic because the retriever likely pulled the general N+1 definition and the general Q3 review document, but not the specific solution.
Component 1: The LLM-as-a-Judge
This is the decision engine. It must be fast and reliable. We use a smaller, cheaper model (gpt-3.5-turbo-instruct) and a tightly constrained prompt that forces a binary JSON output. This avoids parsing flaky natural language responses.
import json
judge_prompt_template = """
Given the user's question and the retrieved context, evaluate if the context is sufficient to provide a detailed, factual answer. The context should directly address the key entities and relationships in the question.
Respond with a single JSON object containing two keys: "reason" and "decision".
- "reason": A brief explanation of your evaluation.
- "decision": Either "SUFFICIENT" or "INSUFFICIENT".
**User Question:**
{question}
**Retrieved Context:**
---
{context}
---
**JSON Response:**
"""
JUDGE_PROMPT = PromptTemplate(
template=judge_prompt_template, input_variables=["question", "context"]
)
def run_judge(question: str, context_docs: list[Document]) -> dict:
context_str = "\n\n".join([doc.page_content for doc in context_docs])
prompt = JUDGE_PROMPT.format(question=question, context=context_str)
response = llm_judge.invoke(prompt)
try:
# Using a robust JSON parser is key for production
parsed_response = json.loads(response.strip())
return parsed_response
except json.JSONDecodeError:
# Fallback for malformed JSON from the LLM
return {"reason": "Failed to parse LLM response.", "decision": "INSUFFICIENT"}
# --- Test the Judge ---
initial_docs = retriever.get_relevant_documents(query)
judge_result = run_judge(query, initial_docs)
print(f"Initial Docs: {[doc.page_content for doc in initial_docs]}")
print(f"Judge Decision: {judge_result['decision']}")
print(f"Judge Reason: {judge_result['reason']}")
# Expected Judge Output for the poor retrieval:
# Judge Decision: INSUFFICIENT
# Judge Reason: The context mentions the N+1 pattern and the billing service in Q3, but it does not specify the exact solution implemented to resolve the problem (ticket BIL-789).
This explicit check prevents the generator LLM from even attempting to answer with bad context, saving a costly gpt-4 call and preventing confident-sounding hallucinations.
Component 2: The LLM-as-a-Rewriter
When the judge says INSUFFICIENT, we activate the rewriter. This is where advanced prompting comes into play. We won't just ask it to "rephrase the query." We'll implement two powerful, distinct strategies.
Strategy A: Step-Back Prompting
This technique involves generating a more general, higher-level query to retrieve broader, contextual documents. The original, specific query is then used on this broader context. It's excellent for queries that are too specific and miss the bigger picture.
step_back_prompt_template = """
You are an expert at query formulation. Your task is to transform a user's question into a more general, 'step-back' question. This step-back question should aim to retrieve broader context that is relevant to the original question.
For example:
Original Question: 'What is the capital of France?'
Step-Back Question: 'What are the political capitals of countries in Europe?'
Original Question: 'Which specific algorithm is used for fraud detection in our payment system?'
Step-Back Question: 'What are the different fraud detection techniques and systems used in our company?'
Now, generate a step-back question for the following user question.
**Original Question:**
{question}
**Step-Back Question:**
"""
STEP_BACK_PROMPT = PromptTemplate(
template=step_back_prompt_template, input_variables=["question"]
)
def generate_step_back_query(question: str) -> str:
prompt = STEP_BACK_PROMPT.format(question=question)
response = llm_rewriter.invoke(prompt)
    return response.content.strip()  # ChatOpenAI returns an AIMessage; take its text content
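The main pipeline below uses HyDE, so here is a minimal sketch of how the step-back query could instead be wired into retrieval: fetch broad context with the step-back question, fetch specific context with the original question, and merge the two. The de-duplication-by-content approach is an illustrative assumption, not a fixed recipe.
def retrieve_with_step_back(question: str) -> list[Document]:
    step_back_query = generate_step_back_query(question)
    # Broad context from the step-back question plus specific context from the original.
    candidate_docs = retriever.get_relevant_documents(step_back_query)
    candidate_docs += retriever.get_relevant_documents(question)
    # De-duplicate by content while preserving retrieval order.
    seen, merged = set(), []
    for doc in candidate_docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            merged.append(doc)
    return merged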
Strategy B: Hypothetical Document Embeddings (HyDE)
HyDE is a counter-intuitive but highly effective technique. Instead of rewriting the query, we ask an LLM to generate a hypothetical answer to the original query. This answer, while not factually correct, will contain the vocabulary, phrasing, and semantic structure of a good answer. We then create an embedding of this hypothetical document and use it to find real documents that are semantically similar. This bridges the terminology gap between the user's query and the source documents.
hyde_prompt_template = """
Write a detailed, hypothetical document that provides a perfect answer to the user's question. This document should be plausible and rich in keywords and concepts, even if it's not factually correct. It will be used to find similar real documents.
**User Question:**
{question}
**Hypothetical Document:**
"""
HYDE_PROMPT = PromptTemplate(
template=hyde_prompt_template, input_variables=["question"]
)
def generate_hyde_document(question: str) -> Document:
prompt = HYDE_PROMPT.format(question=question)
response = llm_rewriter.invoke(prompt)
    return Document(page_content=response.content.strip())  # AIMessage -> text content
# HyDE retriever is different - it embeds the generated doc, not the query
def retrieve_with_hyde(question: str) -> list[Document]:
hyde_doc = generate_hyde_document(question)
hyde_embedding = embeddings.embed_query(hyde_doc.page_content)
# The key step: perform similarity search with the new embedding
similar_docs = vector_store.similarity_search_by_vector(hyde_embedding, k=2)
return similar_docs
Tying It All Together: The Self-Correcting Chain
Now we orchestrate the full logic. We'll prioritize HyDE as it's often more effective for specific queries like our example.
def self_correcting_rag(question: str):
print(f"\n--- Starting Self-Correcting RAG for query: '{question}' ---")
# 1. Initial Retrieval
print("\nStep 1: Initial Retrieval")
initial_docs = retriever.get_relevant_documents(question)
print(f"Retrieved {len(initial_docs)} documents initially.")
# 2. Judge the initial context
print("\nStep 2: LLM-as-a-Judge Evaluation")
judge_result = run_judge(question, initial_docs)
print(f"Judge Decision: {judge_result['decision']}. Reason: {judge_result['reason']}")
final_docs = initial_docs
# 3. Conditional Rewriting and Final Retrieval
if judge_result['decision'] == 'INSUFFICIENT':
print("\nStep 3: Context is insufficient. Triggering Query Rewriting (HyDE).")
# Generate hypothetical document
hyde_doc = generate_hyde_document(question)
print(f"\nGenerated HyDE Document:\n---\n{hyde_doc.page_content}\n---")
        # Retrieve using the embedding of the document we just generated
        # (avoids regenerating a different hypothetical document inside retrieve_with_hyde)
        print("\nStep 4: Final Retrieval using HyDE embedding")
        hyde_embedding = embeddings.embed_query(hyde_doc.page_content)
        final_docs = vector_store.similarity_search_by_vector(hyde_embedding, k=2)
        print(f"Retrieved {len(final_docs)} documents with HyDE.")
else:
print("\nStep 3 & 4: Skipped. Initial context is sufficient.")
# 5. Final Generation
print("\nStep 5: Final Generation")
context_str = "\n\n".join([doc.page_content for doc in final_docs])
final_prompt = BASELINE_PROMPT.format(context=context_str, question=question)
    final_answer = llm_generator.invoke(final_prompt).content
    print(f"\n--- Final Answer ---\n{final_answer.strip()}")
return final_answer, final_docs
# --- Run the full pipeline ---
query = "How did we solve the N+1 query problem in the billing service last quarter?"
final_answer, final_docs = self_correcting_rag(query)
# Expected HyDE Document:
# The N+1 query problem in the billing service during Q3 was a critical performance issue identified in ticket BIL-789. The root cause was the lazy loading of customer invoice lines within the main invoice processing loop. To resolve this, the engineering team refactored the data access layer to use a batch processing strategy. Specifically, they replaced the iterative fetching with a single, optimized JPA query that utilized a JOIN FETCH on the associated entities, effectively pre-loading all necessary data and eliminating the subsequent N queries.
# The embedding of THIS document will be very close to our *actual* source document about BIL-789.
# Expected Final Answer (after HyDE retrieval):
# The N+1 query problem in the billing service was resolved in Q3 under ticket BIL-789 by implementing batch processing for invoice generation, which mitigated the database query bottleneck.
This final answer is specific, accurate, and directly addresses the user's query, a stark contrast to the baseline's vague response.
Advanced Considerations and Production Patterns
Implementing this architecture in production requires careful consideration of several trade-offs and edge cases.
1. Latency Management
The most significant drawback is increased latency. Each corrective loop adds at least two network hops and two LLM inference times.
- Model selection: The choice of gpt-3.5-turbo-instruct for the judge was deliberate. It's significantly faster and cheaper than GPT-4. For the rewriter, you might use a fine-tuned open-source model (such as an 8B-parameter Llama 3) hosted on dedicated hardware to reduce latency further.
- Caching: If a query (or a semantically similar one) was judged INSUFFICIENT before, skip straight to the rewriting step. Cache the rewritten query itself to save that LLM call.
2. Cost Control
This pipeline can be 2-3x more expensive than a simple RAG chain due to the extra LLM calls.
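One concrete way to claw back both cost and latency is the caching idea mentioned above: memoize the rewriter output so repeated queries skip that LLM call entirely. Below is a minimal in-process sketch assuming exact-match keys; a production system would more likely use Redis or a semantic cache keyed on query embeddings.
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_hyde_text(question: str) -> str:
    # Only invoked on a cache miss; repeated queries reuse the stored hypothetical document.
    return generate_hyde_document(question).page_content

def retrieve_with_hyde_cached(question: str) -> list[Document]:
    hyde_embedding = embeddings.embed_query(cached_hyde_text(question))
    return vector_store.similarity_search_by_vector(hyde_embedding, k=2)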
3. Preventing Degenerate Loops
What if the rewritten query also results in poor context? You could, in theory, loop forever. In practice, cap the number of correction attempts (one rewrite is usually enough) and fall back to generating an answer from the best context retrieved so far, as sketched below.
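A minimal bounded version of the pipeline, assuming a hypothetical max_attempts parameter that caps the loop and answers from whatever context is available once the budget is exhausted:
def self_correcting_rag_bounded(question: str, max_attempts: int = 2) -> str:
    # One initial retrieval plus at most (max_attempts - 1) rewrites.
    docs = retriever.get_relevant_documents(question)
    for attempt in range(max_attempts):
        verdict = run_judge(question, docs)
        if verdict["decision"] == "SUFFICIENT":
            break
        if attempt + 1 == max_attempts:
            break  # Budget exhausted; answer from the best context we have.
        docs = retrieve_with_hyde(question)
    context_str = "\n\n".join(doc.page_content for doc in docs)
    return llm_generator.invoke(BASELINE_PROMPT.format(context=context_str, question=question)).content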
4. Evaluation and A/B Testing
How do you prove this complex system is better? Standard RAG metrics are a starting point, but you need more; a minimal comparison harness is sketched after the list below.
- Context Precision/Recall: Does the final context (v2) have a higher relevance score to the query than the initial context (v1)?
- Faithfulness: Is the final answer grounded in the provided context? This is crucial to measure if the rewriter introduced any artifacts.
- Answer Relevance: How well does the final answer address the original user query?
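As a starting point, you can reuse the judge to compare both retrieval passes head-to-head on a held-out query set. The eval_queries list and the simple win-rate metric below are illustrative assumptions; dedicated evaluation frameworks such as Ragas provide more principled versions of the metrics listed above.
def compare_contexts(question: str) -> dict:
    # Score both retrieval passes with the same judge so the comparison is apples-to-apples.
    v1_docs = retriever.get_relevant_documents(question)
    v2_docs = retrieve_with_hyde(question)
    return {
        "question": question,
        "v1_sufficient": run_judge(question, v1_docs)["decision"] == "SUFFICIENT",
        "v2_sufficient": run_judge(question, v2_docs)["decision"] == "SUFFICIENT",
    }

# Illustrative held-out queries; in practice, sample these from production logs.
eval_queries = ["How did we solve the N+1 query problem in the billing service last quarter?"]
results = [compare_contexts(q) for q in eval_queries]
win_rate = sum(r["v2_sufficient"] and not r["v1_sufficient"] for r in results) / len(results)
print(f"Rewriting win rate over baseline retrieval: {win_rate:.0%}")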
Conclusion: From Brittle to Robust AI Systems
The shift from simple, linear RAG pipelines to dynamic, self-correcting loops represents a significant step in the maturity of applied AI engineering. It's an acknowledgment that real-world user input is messy and that our systems must be resilient enough to handle it. By treating components like retrieval evaluation and query formulation as first-class problems solvable by targeted LLM calls, we can build applications that are not only more accurate but also more reliable and trustworthy.
The techniques explored here—LLM-as-a-Judge, Step-Back Prompting, and HyDE—are powerful tools in a senior engineer's arsenal. While they introduce complexity in latency and cost, the resulting improvement in answer quality for challenging, high-stakes queries can be the deciding factor between a proof-of-concept and a production-ready AI system that users depend on.