Advanced RAG: HyDE & Re-ranking for Production-Grade Retrieval
The Production RAG Problem: Beyond Naive Semantic Search
For any senior engineer who has deployed a Retrieval-Augmented Generation (RAG) system, the initial excitement of a working prototype quickly gives way to the harsh realities of production traffic. The core retrieval mechanism—typically a bi-encoder model creating embeddings for vector search—exhibits a critical flaw: the semantic gap. User queries are often short, abstract, or use different terminology than the source documents. A naive semantic search for a query like "impact of supply chain issues on Q3 profits" might retrieve documents that mention "supply chain" and "profits" but completely miss a nuanced paragraph discussing "logistical bottlenecks affecting quarterly revenue". This is the failure mode that separates toy RAGs from production-grade systems.
Standard vector search, while powerful, optimizes for semantic similarity, not necessarily for contextual relevance or answerability. This post assumes you've already hit this wall. We will not cover the basics of creating embeddings or setting up a vector store. Instead, we will dissect and implement two advanced, complementary techniques to bridge the semantic gap:
* Hypothetical Document Embeddings (HyDE): have an LLM generate a hypothetical answer to the query and embed that document instead of the raw query, improving recall.
* Cross-Encoder Re-ranking: score each retrieved candidate jointly with the query using a cross-encoder and keep only the top results, improving precision.
By combining these two patterns, we construct a multi-stage retrieval pipeline that maximizes both recall (getting all potentially relevant documents) and precision (ensuring the top documents are the most relevant), leading to a significant improvement in the final generated output.
Baseline Failure: A Concrete Example of Naive RAG's Limits
Let's establish a clear failure case to demonstrate the problem. We'll set up a minimal RAG system using langchain, FAISS for our in-memory vector store, and a standard sentence-transformers model.
Setup:
First, let's install the necessary libraries:
pip install langchain openai faiss-cpu sentence-transformers transformers torch
Now, let's create our baseline RAG implementation.
import os
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.docstore.document import Document
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
# Ensure you have your OPENAI_API_KEY set in your environment
# os.environ["OPENAI_API_KEY"] = "your_key_here"
# 1. Our Document Corpus
# A small, technical corpus where nuance matters.
documents_text = [
"The Q4 financial report indicates a 15% increase in revenue, primarily driven by the new FusionX processor line. However, operational costs rose by 22% due to unforeseen logistical bottlenecks.",
"Our primary logistical challenge this year has been the semiconductor shortage, which delayed the production of our flagship graphics cards, impacting our quarterly revenue targets.",
"The marketing team's 'Innovate a Better Tomorrow' campaign resulted in a 5% brand visibility increase, but failed to translate into direct sales growth for the period.",
"FusionX processor architecture leverages a 3nm process, offering a 30% performance boost over the previous generation. This has been a key differentiator in a competitive market.",
"To mitigate supply chain risks, we are diversifying our component sourcing to three new international partners, a strategic move expected to stabilize production by mid-next year."
]
docs = [Document(page_content=t) for t in documents_text]
# 2. Setup the Retriever
# Using a standard, high-performance bi-encoder for embedding.
model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=model_name)
vector_store = FAISS.from_documents(docs, embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 2})
# 3. Setup the QA Chain
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever
)
# 4. The Problematic Query
query = "What caused the drop in expected earnings?"
# Execute and print results
print(f"Executing Naive RAG Query: {query}\n")
retrieved_docs = retriever.get_relevant_documents(query)
print("--- Retrieved Documents ---")
for i, doc in enumerate(retrieved_docs):
    print(f"Doc {i+1}: {doc.page_content}\n")
llm_response = qa_chain.run(query)
print("--- LLM Response ---")
print(llm_response)
Expected Output:
Executing Naive RAG Query: What caused the drop in expected earnings?
--- Retrieved Documents ---
Doc 1: The Q4 financial report indicates a 15% increase in revenue, primarily driven by the new FusionX processor line. However, operational costs rose by 22% due to unforeseen logistical bottlenecks.
Doc 2: The marketing team's 'Innovate a Better Tomorrow' campaign resulted in a 5% brand visibility increase, but failed to translate into direct sales growth for the period.
--- LLM Response ---
Based on the documents, one factor that could have contributed to a drop in expected earnings is the rise in operational costs by 22% due to unforeseen logistical bottlenecks. Additionally, a marketing campaign failed to translate into direct sales growth.
The result is mediocre. It correctly identifies the "operational costs" issue but misses the more direct and crucial "semiconductor shortage" document. Why? The query "What caused the drop in expected earnings?" is semantically closer to documents discussing financial outcomes ("revenue", "sales growth") than to the document about production issues ("semiconductor shortage", "delayed production"). The bi-encoder model captures this surface-level semantic similarity but fails to infer the causal link that a human would instantly recognize.
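You can make this gap visible by scoring the raw query against each document with the same bi-encoder. The snippet below is a minimal diagnostic sketch, not part of the pipeline itself; it assumes the documents_text list from the setup above is still in scope.
from sentence_transformers import SentenceTransformer, util

# Embed the raw query and the corpus with the same bi-encoder used for the FAISS index above.
bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query_embedding = bi_encoder.encode("What caused the drop in expected earnings?", convert_to_tensor=True)
doc_embeddings = bi_encoder.encode(documents_text, convert_to_tensor=True)

# Cosine similarity between the query and every document; higher means closer in vector space.
similarities = util.cos_sim(query_embedding, doc_embeddings)[0]
for text, score in sorted(zip(documents_text, similarities.tolist()), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {text[:80]}...")
Ranking the corpus this way typically mirrors the retrieval above: documents that discuss financial outcomes score higher than the semiconductor-shortage document, even though the latter contains the actual answer.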
This is the precise problem HyDE is designed to solve.
Section 1: Implementing Hypothetical Document Embeddings (HyDE)
HyDE flips the script. Instead of embedding the user's query, we ask an LLM to first generate a hypothetical answer to that query. This generated text, being a full document, is much more likely to exist in the same vector space as the actual source documents.
The flow becomes: Query -> LLM -> Hypothetical Document -> Embedding Model -> Vector Search
Let's implement a custom HyDERetriever that wraps our base retriever.
from typing import Any, List

from langchain.retrievers import ContextualCompressionRetriever
from langchain.schema import BaseRetriever, Document
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser
from langchain.pydantic_v1 import BaseModel, Field
# Pydantic model for structured output
class HypotheticalDocument(BaseModel):
    text: str = Field(description="A concise, factual-sounding document that answers the user's query.")

class HyDERetriever(BaseRetriever):
    """Retriever that generates a hypothetical document and retrieves documents based on it.

    This class implements the Hypothetical Document Embeddings (HyDE) strategy.
    It uses a language model to generate a hypothetical document based on the user's query.
    The embedding of this hypothetical document is then used to perform a vector search
    in the underlying vector store.
    """
    base_retriever: BaseRetriever
    llm: ChatOpenAI
    # Declared as fields so the pydantic-based BaseRetriever accepts the assignments in __init__.
    prompt_template: Any = None
    output_parser: Any = None
    chain: Any = None

    def __init__(self, base_retriever: BaseRetriever, llm: ChatOpenAI, **kwargs):
        super().__init__(base_retriever=base_retriever, llm=llm, **kwargs)
        self.output_parser = PydanticOutputParser(pydantic_object=HypotheticalDocument)
        self.prompt_template = PromptTemplate(
            template=(
                "You are an expert in information retrieval. Your task is to generate a concise, "
                "hypothetical document that directly answers the following user query. "
                "The document should be neutral, factual, and written in the same style as a technical report. "
                "Do not include any preambles like 'Here is a hypothetical document'.\n"
                "USER QUERY: {query}\n\n"
                "{format_instructions}"
            ),
            input_variables=["query"],
            partial_variables={
                "format_instructions": self.output_parser.get_format_instructions()
            },
        )
        self.chain = self.prompt_template | self.llm | self.output_parser

    def _get_relevant_documents(self, query: str, *, run_manager=None) -> List[Document]:
        print(f"\n--- HyDE: Generating hypothetical document for query: '{query}' ---")
        # Generate the hypothetical document
        hypothetical_doc_obj = self.chain.invoke({"query": query})
        hypothetical_doc_text = hypothetical_doc_obj.text
        print(f"--- HyDE: Generated Document ---\n{hypothetical_doc_text}\n")
        # Retrieve documents based on the hypothetical document's embedding
        return self.base_retriever.get_relevant_documents(hypothetical_doc_text)

    async def _aget_relevant_documents(self, query: str, *, run_manager=None) -> List[Document]:
        # Async version for concurrent execution
        hypothetical_doc_obj = await self.chain.ainvoke({"query": query})
        hypothetical_doc_text = hypothetical_doc_obj.text
        return await self.base_retriever.aget_relevant_documents(hypothetical_doc_text)
# --- Now, let's re-run our experiment with the HyDE retriever ---
# Re-use the vector_store and LLM from the previous setup
base_retriever = vector_store.as_retriever(search_kwargs={"k": 2})
hyde_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
hyde_retriever = HyDERetriever(base_retriever=base_retriever, llm=hyde_llm)
# The same problematic query
query = "What caused the drop in expected earnings?"
print(f"Executing HyDE RAG Query: {query}\n")
retrieved_docs_hyde = hyde_retriever.get_relevant_documents(query)
print("--- HyDE Retrieved Documents ---")
for i, doc in enumerate(retrieved_docs_hyde):
    print(f"Doc {i+1}: {doc.page_content}\n")
# We can plug this new retriever directly into our QA chain
qa_chain_hyde = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=hyde_retriever
)
llm_response_hyde = qa_chain_hyde.run(query)
print("--- HyDE LLM Response ---")
print(llm_response_hyde)
Expected Output (HyDE):
Executing HyDE RAG Query: What caused the drop in expected earnings?
--- HyDE: Generating hypothetical document for query: 'What caused the drop in expected earnings?' ---
--- HyDE: Generated Document ---
A drop in expected earnings was primarily caused by significant delays in product manufacturing due to a global semiconductor shortage. This disruption in the supply chain directly impacted production schedules for key products, leading to lower than anticipated quarterly revenue.
--- HyDE Retrieved Documents ---
Doc 1: Our primary logistical challenge this year has been the semiconductor shortage, which delayed the production of our flagship graphics cards, impacting our quarterly revenue targets.
Doc 2: To mitigate supply chain risks, we are diversifying our component sourcing to three new international partners, a strategic move expected to stabilize production by mid-next year.
--- HyDE LLM Response ---
The drop in expected earnings was caused by a semiconductor shortage which delayed the production of flagship graphics cards, impacting quarterly revenue targets.
The difference is night and day. The LLM generated a hypothetical document that contained key phrases like "semiconductor shortage" and "delayed product manufacturing". The embedding for this document is now much closer in the vector space to the correct source document. The final LLM response is now precise and accurate because it was fed the correct context.
Performance, Cost, and Edge Cases of HyDE
* Performance & Cost: The obvious trade-off is an additional LLM call at the start of every retrieval. For a model like gpt-3.5-turbo, this can add 200-500ms of latency and a small cost per query. This is often an acceptable price for a massive accuracy boost. For high-throughput systems, consider using a smaller, faster, or self-hosted model (like Mistral 7B) for the generation step.
* Edge Case - Hallucination: What if the LLM hallucinates a factually incorrect hypothetical document? If it had generated "...caused by a fire at our main factory...", for example, it could poison the search, retrieving documents about factory safety instead of finance. Mitigation: The prompt is crucial. Using phrases like "factual-sounding", "neutral", and "technical report style" guides the LLM away from creative writing. Using smaller, less powerful models for HyDE can sometimes be better as they are less prone to elaborate fabrications.
* Edge Case - Query Specificity: HyDE works best for abstract or question-based queries. For highly specific queries with unique identifiers (e.g., "search for ticket FUSIONX-1234"), HyDE can be detrimental. The LLM might generate a generic document about ticketing systems, losing the specific ID. A production system should include logic to bypass HyDE for queries that match certain patterns (e.g., contain UUIDs, ticket numbers, etc.).
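As a rough illustration of that bypass logic, here is a minimal sketch of a pre-retrieval check; the should_bypass_hyde helper and its regex patterns are hypothetical examples to adapt to your own identifier formats, and it reuses the base_retriever and hyde_retriever objects defined earlier.
import re

# Hypothetical patterns for queries that should skip HyDE: ticket IDs and UUIDs.
BYPASS_PATTERNS = [
    re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b"),  # e.g. FUSIONX-1234 style ticket identifiers
    re.compile(r"\b[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}\b"),  # UUIDs
]

def should_bypass_hyde(query: str) -> bool:
    """Return True if the query looks like an exact-match lookup rather than an abstract question."""
    return any(pattern.search(query) for pattern in BYPASS_PATTERNS)

def retrieve_with_routing(query: str):
    # Exact-match style queries go straight to the base retriever; everything else goes through HyDE.
    if should_bypass_hyde(query):
        return base_retriever.get_relevant_documents(query)
    return hyde_retriever.get_relevant_documents(query)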
Section 2: Precision with Cross-Encoder Re-ranking
HyDE improves our recall: it gets more of the relevant documents into our initial candidate pool. But what if we need to be certain that the top 1-2 documents fed to the LLM are the best possible matches? This is where Cross-Encoders come in: they trade retrieval speed for scoring precision.
Bi-Encoder vs. Cross-Encoder:
* Bi-Encoder (for retrieval): Creates independent vector embeddings for the query and each document. The system then performs an efficient similarity search (e.g., cosine similarity) in the vector space. It's fast and scalable, perfect for searching over millions of documents.
* Cross-Encoder (for re-ranking): Does not create separate embeddings. Instead, it takes the query and a document together as a single input and outputs a single relevance score for the pair (higher means more relevant). This allows the model to perform full self-attention across both the query and document text, making it far more accurate. However, it's computationally expensive and cannot be pre-indexed. You must run the model for every query-document pair.
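To make the contrast concrete, the sketch below scores the same query-document pair with both model types, reusing the model names from this post; the absolute numbers will differ across environments, so treat it as a demonstration rather than a benchmark.
from sentence_transformers import SentenceTransformer, util
from sentence_transformers.cross_encoder import CrossEncoder

query = "What caused the drop in expected earnings?"
doc = ("Our primary logistical challenge this year has been the semiconductor shortage, "
       "which delayed the production of our flagship graphics cards, impacting our quarterly revenue targets.")

# Bi-encoder: embed query and document independently, then compare with cosine similarity.
bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
q_emb, d_emb = bi_encoder.encode([query, doc], convert_to_tensor=True)
print("Bi-encoder cosine similarity:", util.cos_sim(q_emb, d_emb).item())

# Cross-encoder: score the (query, document) pair jointly with full cross-attention.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print("Cross-encoder relevance score:", cross_encoder.predict([(query, doc)])[0])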
The strategy is to use the best of both worlds: use a fast Bi-Encoder (with HyDE) to retrieve a larger set of candidates (e.g., k=50), then use a slow but accurate Cross-Encoder to re-rank just those 50 candidates.
Let's implement a re-ranking step.
from typing import Any, Sequence

from langchain.retrievers.document_compressors.base import BaseDocumentCompressor
from sentence_transformers.cross_encoder import CrossEncoder

class CrossEncoderReRanker(BaseDocumentCompressor):
    """Document compressor that uses a Cross-Encoder model to re-rank documents.

    This class takes an initial list of retrieved documents and re-orders them based
    on the relevance scores calculated by a powerful Cross-Encoder model. It keeps
    only the top N documents with the highest scores.
    """
    # Declared as fields so the pydantic-based BaseDocumentCompressor accepts them.
    model: Any = None
    top_n: int = 3

    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2", top_n: int = 3, **kwargs):
        # Load the Cross-Encoder once at construction time; it is reused for every query.
        super().__init__(model=CrossEncoder(model_name), top_n=top_n, **kwargs)

    def compress_documents(
        self,
        documents: Sequence[Document],
        query: str,
        callbacks=None,
    ) -> Sequence[Document]:
        if not documents:
            return []
        print(f"\n--- CrossEncoder: Re-ranking {len(documents)} documents for query: '{query}' ---")
        # Create pairs of [query, doc_content] for the model
        doc_query_pairs = [[query, doc.page_content] for doc in documents]
        # Get the scores from the Cross-Encoder model
        scores = self.model.predict(doc_query_pairs)
        # Combine documents with their scores and sort by descending relevance
        doc_scores = list(zip(documents, scores))
        doc_scores.sort(key=lambda x: x[1], reverse=True)
        # Keep only the top N documents
        top_docs = [doc for doc, score in doc_scores[:self.top_n]]
        print(f"--- CrossEncoder: Top {self.top_n} documents selected. ---")
        return top_docs

    async def acompress_documents(self, documents, query, callbacks=None):
        # Simple async wrapper: falls back to the synchronous implementation.
        return self.compress_documents(documents, query, callbacks)
# --- Let's build the final, combined pipeline ---
# 1. Base retriever now fetches a larger pool of candidates (e.g., k=5)
# In a real system with many docs, this might be k=50 or k=100.
base_retriever_for_rerank = vector_store.as_retriever(search_kwargs={"k": 5})
# 2. HyDE retriever to improve the initial candidate pool
hyde_retriever_for_rerank = HyDERetriever(base_retriever=base_retriever_for_rerank, llm=hyde_llm)
# 3. The Re-ranker compressor
reranker = CrossEncoderReRanker(top_n=2) # We only want the best 2 docs for the final context
# 4. The final pipeline using ContextualCompressionRetriever
# This LangChain class orchestrates the process: it calls the base retriever (HyDE in our case)
# and then passes the results to the compressor (our Re-ranker).
final_retriever = ContextualCompressionRetriever(
base_compressor=reranker,
base_retriever=hyde_retriever_for_rerank
)
# The same problematic query
query = "What caused the drop in expected earnings?"
print(f"Executing Full Advanced RAG Pipeline Query: {query}\n")
# This single call will trigger the entire HyDE -> Retrieve -> Re-rank flow
final_docs = final_retriever.get_relevant_documents(query)
print("--- Final Retrieved & Re-ranked Documents ---")
for i, doc in enumerate(final_docs):
    print(f"Doc {i+1}: {doc.page_content}\n")
# And finally, the QA chain with the fully advanced retriever
qa_chain_final = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=final_retriever
)
llm_response_final = qa_chain_final.run(query)
print("--- Final LLM Response ---")
print(llm_response_final)
Expected Output (Full Pipeline):
The output will show the full chain of operations: HyDE generation, followed by the re-ranker processing the initial 5 documents and selecting the top 2. The final result will be the most precise context possible, leading to the best answer.
Executing Full Advanced RAG Pipeline Query: What caused the drop in expected earnings?
--- HyDE: Generating hypothetical document for query: 'What caused the drop in expected earnings?' ---
--- HyDE: Generated Document ---
A drop in expected earnings was primarily caused by significant delays in product manufacturing due to a global semiconductor shortage. This disruption in the supply chain directly impacted production schedules for key products, leading to lower than anticipated quarterly revenue.
--- CrossEncoder: Re-ranking 5 documents for query: 'What caused the drop in expected earnings?' ---
--- CrossEncoder: Top 2 documents selected. ---
--- Final Retrieved & Re-ranked Documents ---
Doc 1: Our primary logistical challenge this year has been the semiconductor shortage, which delayed the production of our flagship graphics cards, impacting our quarterly revenue targets.
Doc 2: The Q4 financial report indicates a 15% increase in revenue, primarily driven by the new FusionX processor line. However, operational costs rose by 22% due to unforeseen logistical bottlenecks.
--- Final LLM Response ---
The primary cause for the drop in expected earnings was a semiconductor shortage that delayed the production of flagship graphics cards, impacting quarterly revenue targets. Additionally, operational costs rose by 22% due to unforeseen logistical bottlenecks.
Notice the subtle but important improvement. The re-ranker correctly identified that the "semiconductor shortage" document is the #1 most relevant, and the "operational costs" document is a strong #2. It pushed the less relevant document about "diversifying sourcing" out of the final context. This level of precision is critical when context window sizes are limited and every token counts.
Performance and Production Considerations for Re-ranking
* Performance is Key: The re-ranking step is the main latency bottleneck in this pipeline; its cost scales linearly with the number of documents being re-ranked.
* Benchmark: On a standard CPU, re-ranking 50 documents with ms-marco-MiniLM-L-6-v2 can take ~400-600ms. Re-ranking 100 can take over a second. This is a significant addition to your response latency.
* Mitigation: Only re-rank a small number of initial candidates (k=25 to k=50 is a common range). Use a GPU for inference if possible, which can reduce this time by an order of magnitude; a minimal timing sketch follows this list. For interactive applications like chatbots, consider running the re-ranking asynchronously: show an initial answer based on the top-5 un-ranked results, then update the response once the re-ranked results are available.
* Model Selection: There's a trade-off between cross-encoder model size/accuracy and speed.
* ms-marco-MiniLM-L-6-v2: Small, fast, and a great starting point.
* ms-marco-MiniLM-L-12-v2: Larger, slightly slower, but more accurate.
* Specialized models (e.g., trained on your own domain data) can provide the best results.
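As a rough sketch of the GPU and batching mitigation mentioned above, the snippet below times one batched re-ranking pass over a synthetic candidate pool of 50 documents; the device selection and batch size are assumptions to tune for your hardware, and the measured latency will vary accordingly.
import time
import torch
from sentence_transformers.cross_encoder import CrossEncoder

# Place the model on a GPU when one is available; otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
reranker_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device=device)

# Synthetic candidate pool of 50 documents for a single query.
query = "What caused the drop in expected earnings?"
candidates = [f"Placeholder candidate document {i} about revenue, logistics, and supply chains." for i in range(50)]
pairs = [(query, doc) for doc in candidates]

start = time.perf_counter()
scores = reranker_model.predict(pairs, batch_size=32, show_progress_bar=False)
print(f"Re-ranked {len(pairs)} candidates on {device} in {time.perf_counter() - start:.3f}s")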
Conclusion: A Production-Ready Retrieval Architecture
We have moved from a naive RAG system prone to semantic errors to a robust, multi-stage retrieval pipeline designed for production accuracy.
The final architecture, Query -> HyDE -> Bi-Encoder Retrieval (Top-K) -> Cross-Encoder Re-ranking (Top-N) -> LLM Generation, represents a state-of-the-art pattern for building high-fidelity RAG systems. While it introduces complexity and latency, the trade-offs are often necessary to achieve the level of reliability required for user-facing applications. The key for senior engineers is not just to implement these components, but to understand their performance characteristics, tune them for the specific application's latency budget, and build in the logic to handle their respective edge cases.