Self-Correcting RAG: Advanced Query Rewriting for Production LLMs
The Fragility of Production RAG: Why Simple Retrieval Fails
In any production-grade Retrieval-Augmented Generation (RAG) system, the quality of the final generated answer is overwhelmingly dependent on the quality of the initial retrieval step. A simple vector similarity search works beautifully for straightforward queries that map cleanly to document chunks in your knowledge base. However, senior engineers know that production workloads are never that simple. User queries are often ambiguous, multi-faceted, or use terminology that differs from the source documents. This semantic gap leads to irrelevant context, which in turn causes factual inaccuracies, hallucinations, and ultimately, a loss of user trust.
Consider a standard RAG pipeline for an internal engineering documentation bot. A user might ask: "How did we solve the N+1 query problem in the billing service last quarter?"
A naive RAG system might perform a vector search on this entire query. It might find documents about "billing service," documents about "N+1 queries," and maybe some quarterly reports. The retrieved context is a scattered, unfocused collection of chunks, leading to a generic, unhelpful answer like: *"The billing service is a critical component. N+1 query problems are performance issues that should be optimized."*
This is the critical failure mode we're addressing. The core problem is that the initial, user-provided query is not always the optimal query for retrieval. The solution is not to simply tweak embedding models or chunking strategies, but to build a dynamic, responsive system that can recognize poor retrieval and correct its own course. This is the essence of a Self-Correcting RAG Pipeline.
This article details the architecture and implementation of such a system. We will build a closed-loop pipeline that:
- Performs an initial retrieval.
- Uses a dedicated LLM call (an "LLM-as-a-Judge") to evaluate the relevance of the retrieved context against the original query.
- If the context is deemed insufficient, it triggers a query rewriting step (using an "LLM-as-a-Rewriter").
- Performs a second, more precise retrieval using the refined query.
- Generates the final answer with the superior context.
We'll move beyond simplistic rewriting and implement advanced, targeted techniques like Step-Back Prompting for broader context and Hypothetical Document Embeddings (HyDE) for semantic alignment.
Architectural Overview of the Self-Correcting Loop
Before diving into code, let's visualize the architecture. This is not a simple linear flow; it's a conditional graph.
graph TD
A[User Query] --> B{Initial Retrieval};
B --> C{Retrieved Context v1};
C --> D{LLM-as-a-Judge};
A --> D;
D -- Context Sufficient --> F[Final Generation];
D -- Context Insufficient --> E{LLM-as-a-Rewriter};
A --> E;
E --> G{Rewritten Query};
G --> H{Final Retrieval};
H --> I{Retrieved Context v2};
I --> F;
C --> F;
- LLM-as-a-Judge: Evaluates Context v1 with respect to the User Query. It outputs a decision: SUFFICIENT or INSUFFICIENT.
- LLM-as-a-Rewriter: Activated only when the judge returns INSUFFICIENT. This component uses more sophisticated prompting to transform the original query.
- Final Retrieval: Uses the Rewritten Query to fetch Context v2, which should be significantly more relevant.
- Final Generation: Produces the answer from the best available context (v2 if it exists, otherwise v1).
Implementation Deep Dive: Building the Components
Let's build this system in Python. We'll use langchain for orchestration, but the core principles are framework-agnostic. Assume we have a pre-existing vector store (e.g., FAISS or a managed service like Pinecone) populated with our engineering documentation.
Setup and Baseline Retriever
First, let's establish our baseline. We'll use a langchain retriever. For production, you'd likely use a more robust hybrid search implementation, but this demonstrates the principle.
import os
from langchain_openai import ChatOpenAI, OpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.schema.document import Document
# Assume environment variables for API keys are set
# os.environ["OPENAI_API_KEY"] = "..."
# --- 1. Setup Models and Embeddings ---
# gpt-4-turbo-preview is a chat model and requires ChatOpenAI;
# gpt-3.5-turbo-instruct is a completion model, so plain OpenAI is correct for the judge.
llm_generator = ChatOpenAI(model_name="gpt-4-turbo-preview", temperature=0.1)
llm_judge = OpenAI(model_name="gpt-3.5-turbo-instruct", temperature=0, top_p=1)
llm_rewriter = ChatOpenAI(model_name="gpt-4-turbo-preview", temperature=0.3)
embeddings = OpenAIEmbeddings()
# --- 2. Create a Dummy Vector Store ---
# In a real scenario, this would be your actual knowledge base.
dummy_texts = [
"In Q3, the billing service team resolved ticket BIL-789 by implementing batch processing for invoice generation, which mitigated the N+1 database query bottleneck.",
"The N+1 query pattern is an anti-pattern in ORM usage where one query to retrieve parent items results in N additional queries to retrieve child items.",
"Quarterly engineering reviews are held to discuss major architectural improvements. The Q3 review focused on performance optimizations across services.",
"Our primary Java microservices, including the billing service, use the Spring Boot framework with JPA/Hibernate for data persistence.",
"A deep-dive analysis of the billing service's performance degradation was presented in the October all-hands meeting, highlighting the ORM-level inefficiencies."
]
dummy_docs = [Document(page_content=t) for t in dummy_texts]
vector_store = FAISS.from_documents(dummy_docs, embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 2})
# --- 3. Baseline RAG Chain (for comparison) ---
baseline_prompt_template = """
Answer the question based only on the following context:
{context}
Question: {question}
Answer:
"""
BASELINE_PROMPT = PromptTemplate(
template=baseline_prompt_template, input_variables=["context", "question"]
)
baseline_rag_chain = RetrievalQA.from_chain_type(
llm=llm_generator,
chain_type="stuff",
retriever=retriever,
chain_type_kwargs={"prompt": BASELINE_PROMPT}
)
# --- Test Baseline ---
query = "How did we solve the N+1 query problem in the billing service last quarter?"
result = baseline_rag_chain.invoke(query)
print(f"Baseline Answer: {result['result']}")
# Expected (poor) baseline result might be:
# Baseline Answer: The N+1 query pattern is an anti-pattern. The billing service had performance issues discussed in Q3.
# This answer is generic because the retriever likely pulled the general N+1 definition and the general Q3 review document, but not the specific solution.
Component 1: The LLM-as-a-Judge
This is the decision engine. It must be fast and reliable. We use a smaller, cheaper model (gpt-3.5-turbo-instruct) and a tightly constrained prompt that forces a binary JSON output. This avoids parsing flaky natural language responses.
import json
judge_prompt_template = """
Given the user's question and the retrieved context, evaluate if the context is sufficient to provide a detailed, factual answer. The context should directly address the key entities and relationships in the question.
Respond with a single JSON object containing two keys: "reason" and "decision".
- "reason": A brief explanation of your evaluation.
- "decision": Either "SUFFICIENT" or "INSUFFICIENT".
**User Question:**
{question}
**Retrieved Context:**
---
{context}
---
**JSON Response:**
"""
JUDGE_PROMPT = PromptTemplate(
template=judge_prompt_template, input_variables=["question", "context"]
)
def run_judge(question: str, context_docs: list[Document]) -> dict:
context_str = "\n\n".join([doc.page_content for doc in context_docs])
prompt = JUDGE_PROMPT.format(question=question, context=context_str)
response = llm_judge.invoke(prompt)
try:
# Using a robust JSON parser is key for production
parsed_response = json.loads(response.strip())
return parsed_response
except json.JSONDecodeError:
# Fallback for malformed JSON from the LLM
return {"reason": "Failed to parse LLM response.", "decision": "INSUFFICIENT"}
# --- Test the Judge ---
initial_docs = retriever.get_relevant_documents(query)
judge_result = run_judge(query, initial_docs)
print(f"Initial Docs: {[doc.page_content for doc in initial_docs]}")
print(f"Judge Decision: {judge_result['decision']}")
print(f"Judge Reason: {judge_result['reason']}")
# Expected Judge Output for the poor retrieval:
# Judge Decision: INSUFFICIENT
# Judge Reason: The context mentions the N+1 pattern and the billing service in Q3, but it does not specify the exact solution implemented to resolve the problem (ticket BIL-789).
This explicit check prevents the generator LLM from even attempting to answer with bad context, saving a costly gpt-4 call and preventing confident-sounding hallucinations.
Component 2: The LLM-as-a-Rewriter
When the judge says INSUFFICIENT, we activate the rewriter. This is where advanced prompting comes into play. We won't just ask it to "rephrase the query." We'll implement two powerful, distinct strategies.
Strategy A: Step-Back Prompting
This technique involves generating a more general, higher-level query to retrieve broader, contextual documents. The original, specific query is then used on this broader context. It's excellent for queries that are too specific and miss the bigger picture.
step_back_prompt_template = """
You are an expert at query formulation. Your task is to transform a user's question into a more general, 'step-back' question. This step-back question should aim to retrieve broader context that is relevant to the original question.
For example:
Original Question: 'What is the capital of France?'
Step-Back Question: 'What are the political capitals of countries in Europe?'
Original Question: 'Which specific algorithm is used for fraud detection in our payment system?'
Step-Back Question: 'What are the different fraud detection techniques and systems used in our company?'
Now, generate a step-back question for the following user question.
**Original Question:**
{question}
**Step-Back Question:**
"""
STEP_BACK_PROMPT = PromptTemplate(
template=step_back_prompt_template, input_variables=["question"]
)
def generate_step_back_query(question: str) -> str:
prompt = STEP_BACK_PROMPT.format(question=question)
response = llm_rewriter.invoke(prompt)
    return response.content.strip()  # ChatOpenAI returns an AIMessage; take its text content
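The main pipeline below uses HyDE, so here is a minimal sketch of how the step-back query could instead be wired into retrieval: fetch broad context with the step-back question, fetch specific context with the original question, and merge the two. The de-duplication-by-content approach is an illustrative assumption, not a fixed recipe.
def retrieve_with_step_back(question: str) -> list[Document]:
    step_back_query = generate_step_back_query(question)
    # Broad context from the step-back question plus specific context from the original.
    candidate_docs = retriever.get_relevant_documents(step_back_query)
    candidate_docs += retriever.get_relevant_documents(question)
    # De-duplicate by content while preserving retrieval order.
    seen, merged = set(), []
    for doc in candidate_docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            merged.append(doc)
    return merged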
Strategy B: Hypothetical Document Embeddings (HyDE)
HyDE is a counter-intuitive but highly effective technique. Instead of rewriting the query, we ask an LLM to generate a hypothetical answer to the original query. This answer, while not factually correct, will contain the vocabulary, phrasing, and semantic structure of a good answer. We then create an embedding of this hypothetical document and use it to find real documents that are semantically similar. This bridges the terminology gap between the user's query and the source documents.
hyde_prompt_template = """
Write a detailed, hypothetical document that provides a perfect answer to the user's question. This document should be plausible and rich in keywords and concepts, even if it's not factually correct. It will be used to find similar real documents.
**User Question:**
{question}
**Hypothetical Document:**
"""
HYDE_PROMPT = PromptTemplate(
template=hyde_prompt_template, input_variables=["question"]
)
def generate_hyde_document(question: str) -> Document:
prompt = HYDE_PROMPT.format(question=question)
response = llm_rewriter.invoke(prompt)
    return Document(page_content=response.content.strip())  # AIMessage -> text content
# HyDE retriever is different - it embeds the generated doc, not the query
def retrieve_with_hyde(question: str) -> list[Document]:
hyde_doc = generate_hyde_document(question)
hyde_embedding = embeddings.embed_query(hyde_doc.page_content)
# The key step: perform similarity search with the new embedding
similar_docs = vector_store.similarity_search_by_vector(hyde_embedding, k=2)
return similar_docs
Tying It All Together: The Self-Correcting Chain
Now we orchestrate the full logic. We'll prioritize HyDE as it's often more effective for specific queries like our example.
def self_correcting_rag(question: str):
print(f"\n--- Starting Self-Correcting RAG for query: '{question}' ---")
# 1. Initial Retrieval
print("\nStep 1: Initial Retrieval")
initial_docs = retriever.get_relevant_documents(question)
print(f"Retrieved {len(initial_docs)} documents initially.")
# 2. Judge the initial context
print("\nStep 2: LLM-as-a-Judge Evaluation")
judge_result = run_judge(question, initial_docs)
print(f"Judge Decision: {judge_result['decision']}. Reason: {judge_result['reason']}")
final_docs = initial_docs
# 3. Conditional Rewriting and Final Retrieval
if judge_result['decision'] == 'INSUFFICIENT':
print("\nStep 3: Context is insufficient. Triggering Query Rewriting (HyDE).")
# Generate hypothetical document
hyde_doc = generate_hyde_document(question)
print(f"\nGenerated HyDE Document:\n---\n{hyde_doc.page_content}\n---")
        # Retrieve using the embedding of the document we just generated
        # (avoids regenerating a different hypothetical document inside retrieve_with_hyde)
        print("\nStep 4: Final Retrieval using HyDE embedding")
        hyde_embedding = embeddings.embed_query(hyde_doc.page_content)
        final_docs = vector_store.similarity_search_by_vector(hyde_embedding, k=2)
        print(f"Retrieved {len(final_docs)} documents with HyDE.")
else:
print("\nStep 3 & 4: Skipped. Initial context is sufficient.")
# 5. Final Generation
print("\nStep 5: Final Generation")
context_str = "\n\n".join([doc.page_content for doc in final_docs])
final_prompt = BASELINE_PROMPT.format(context=context_str, question=question)
    final_answer = llm_generator.invoke(final_prompt).content
    print(f"\n--- Final Answer ---\n{final_answer.strip()}")
return final_answer, final_docs
# --- Run the full pipeline ---
query = "How did we solve the N+1 query problem in the billing service last quarter?"
final_answer, final_docs = self_correcting_rag(query)
# Expected HyDE Document:
# The N+1 query problem in the billing service during Q3 was a critical performance issue identified in ticket BIL-789. The root cause was the lazy loading of customer invoice lines within the main invoice processing loop. To resolve this, the engineering team refactored the data access layer to use a batch processing strategy. Specifically, they replaced the iterative fetching with a single, optimized JPA query that utilized a JOIN FETCH on the associated entities, effectively pre-loading all necessary data and eliminating the subsequent N queries.
# The embedding of THIS document will be very close to our *actual* source document about BIL-789.
# Expected Final Answer (after HyDE retrieval):
# The N+1 query problem in the billing service was resolved in Q3 under ticket BIL-789 by implementing batch processing for invoice generation, which mitigated the database query bottleneck.
This final answer is specific, accurate, and directly addresses the user's query, a stark contrast to the baseline's vague response.
Advanced Considerations and Production Patterns
Implementing this architecture in production requires careful consideration of several trade-offs and edge cases.
1. Latency Management
The most significant drawback is increased latency. Each corrective loop adds at least two network hops and two LLM inference times.
- Model selection: The choice of gpt-3.5-turbo-instruct for the judge was deliberate. It's significantly faster and cheaper than GPT-4. For the rewriter, you might use a fine-tuned open-source model (such as an 8B-parameter Llama 3) hosted on dedicated hardware to reduce latency further.
- Caching: If a query (or a semantically similar one) was judged INSUFFICIENT before, skip straight to the rewriting step. Cache the rewritten query itself to save that LLM call.
2. Cost Control
This pipeline can be 2-3x more expensive than a simple RAG chain due to the extra LLM calls.
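One concrete way to claw back both cost and latency is the caching idea mentioned above: memoize the rewriter output so repeated queries skip that LLM call entirely. Below is a minimal in-process sketch assuming exact-match keys; a production system would more likely use Redis or a semantic cache keyed on query embeddings.
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_hyde_text(question: str) -> str:
    # Only invoked on a cache miss; repeated queries reuse the stored hypothetical document.
    return generate_hyde_document(question).page_content

def retrieve_with_hyde_cached(question: str) -> list[Document]:
    hyde_embedding = embeddings.embed_query(cached_hyde_text(question))
    return vector_store.similarity_search_by_vector(hyde_embedding, k=2)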
3. Preventing Degenerate Loops
What if the rewritten query also results in poor context? You could, in theory, loop forever. In practice, cap the number of correction attempts (one rewrite is usually enough) and fall back to generating an answer from the best context retrieved so far, as sketched below.
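A minimal bounded version of the pipeline, assuming a hypothetical max_attempts parameter that caps the loop and answers from whatever context is available once the budget is exhausted:
def self_correcting_rag_bounded(question: str, max_attempts: int = 2) -> str:
    # One initial retrieval plus at most (max_attempts - 1) rewrites.
    docs = retriever.get_relevant_documents(question)
    for attempt in range(max_attempts):
        verdict = run_judge(question, docs)
        if verdict["decision"] == "SUFFICIENT":
            break
        if attempt + 1 == max_attempts:
            break  # Budget exhausted; answer from the best context we have.
        docs = retrieve_with_hyde(question)
    context_str = "\n\n".join(doc.page_content for doc in docs)
    return llm_generator.invoke(BASELINE_PROMPT.format(context=context_str, question=question)).content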
4. Evaluation and A/B Testing
How do you prove this complex system is better? Standard RAG metrics are a starting point, but you need more; a minimal comparison harness is sketched after the list below.
- Context Precision/Recall: Does the final context (v2) have a higher relevance score to the query than the initial context (v1)?
- Faithfulness: Is the final answer grounded in the provided context? This is crucial to measure if the rewriter introduced any artifacts.
- Answer Relevance: How well does the final answer address the original user query?
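As a starting point, you can reuse the judge to compare both retrieval passes head-to-head on a held-out query set. The eval_queries list and the simple win-rate metric below are illustrative assumptions; dedicated evaluation frameworks such as Ragas provide more principled versions of the metrics listed above.
def compare_contexts(question: str) -> dict:
    # Score both retrieval passes with the same judge so the comparison is apples-to-apples.
    v1_docs = retriever.get_relevant_documents(question)
    v2_docs = retrieve_with_hyde(question)
    return {
        "question": question,
        "v1_sufficient": run_judge(question, v1_docs)["decision"] == "SUFFICIENT",
        "v2_sufficient": run_judge(question, v2_docs)["decision"] == "SUFFICIENT",
    }

# Illustrative held-out queries; in practice, sample these from production logs.
eval_queries = ["How did we solve the N+1 query problem in the billing service last quarter?"]
results = [compare_contexts(q) for q in eval_queries]
win_rate = sum(r["v2_sufficient"] and not r["v1_sufficient"] for r in results) / len(results)
print(f"Rewriting win rate over baseline retrieval: {win_rate:.0%}")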
Conclusion: From Brittle to Robust AI Systems
The shift from simple, linear RAG pipelines to dynamic, self-correcting loops represents a significant step in the maturity of applied AI engineering. It's an acknowledgment that real-world user input is messy and that our systems must be resilient enough to handle it. By treating components like retrieval evaluation and query formulation as first-class problems solvable by targeted LLM calls, we can build applications that are not only more accurate but also more reliable and trustworthy.
The techniques explored here—LLM-as-a-Judge, Step-Back Prompting, and HyDE—are powerful tools in a senior engineer's arsenal. While they introduce complexity in latency and cost, the resulting improvement in answer quality for challenging, high-stakes queries can be the deciding factor between a proof-of-concept and a production-ready AI system that users depend on.