Advanced RAG: Sentence-Window Retrieval & Cross-Encoder Reranking
The Production Gap in Naive RAG
Retrieval-Augmented Generation (RAG) has become a cornerstone for building context-aware LLM applications. The standard approach—chunking documents, embedding the chunks, and performing vector search—is a powerful starting point. However, senior engineers quickly discover its limitations in production. Naive RAG systems often suffer from a fundamental context-granularity problem.
Consider a fixed-size chunking strategy (e.g., 1,000 characters with 200 characters of overlap). This approach is blind to semantic boundaries and can lead to two critical failure modes:
* Context fragmentation: a single fact or instruction is split across two chunks, so no retrieved chunk contains the complete answer.
* Context dilution: a chunk bundles the relevant sentence with hundreds of characters of unrelated text, weakening its embedding and burying the signal the LLM needs.
This leads to a frustrating paradox: your vector database contains the correct information, but the LLM never sees it in a usable form. The result is hallucination, low-quality responses, and a system that isn't trusted by its users.
To bridge this gap, we need to evolve our retrieval strategy. This article presents a production-grade, two-stage pipeline that directly addresses these issues:
1. Sentence-Window Retrieval: embed individual sentences for precise matching, but hand the LLM a wider window of surrounding sentences for coherent context.
2. Cross-Encoder Reranking: cast a wide net in the first stage, then rescore the candidates with a cross-encoder and keep only the most relevant contexts.
Let's demonstrate the failure of naive RAG before building our advanced solution.
The Failure Case: A Concrete Example
Imagine a technical document about a fictional database system called 'ChronoDB'.
... The system's architecture is based on a Log-Structured Merge-Tree (LSM-Tree), which optimizes for high write throughput. This is critical for ingestion-heavy workloads. However, read performance can be a concern. To mitigate this, ChronoDB employs a tiered caching mechanism. The primary L1 cache is an in-memory LRU cache for hot data. The L2 cache utilizes a memory-mapped file on NVMe storage for warm data that doesn't fit in RAM. The most impactful optimization for read latency, however, is not the caching. The 'Query Pipelining' feature, which is disabled by default, can reduce P99 latency by over 50% for complex analytical queries. To enable it, the configuration flag `enable_query_pipelining` must be set to `true` in the `chronos.yaml` file. This feature parallelizes query execution stages...
Query: "How do I reduce P99 read latency in ChronoDB?"
A naive chunker might create a chunk like this:
Chunk A: "...The L2 cache utilizes a memory-mapped file on NVMe storage for warm data that doesn't fit in RAM. The most impactful optimization for read latency, however, is not the caching. The 'Query Pipelining' feature, which is disabled by default, can reduce P99 latency by over 50%..."
And another chunk:
Chunk B: "...for complex analytical queries. To enable it, the configuration flag enable_query_pipelining must be set to true in the chronos.yaml file. This feature parallelizes query execution stages..."
Vector search for our query might rank Chunk A highly because it contains "reduce P99 latency". However, it completely misses the actionable instruction on how to enable the feature. The LLM might respond, "You can reduce P99 latency by using the 'Query Pipelining' feature," which is correct but useless. Chunk B, containing the critical configuration detail, might not even be in the top-k results.
This is the problem we will solve.
# Setup: Ensure you have the necessary libraries installed
# !pip install -U langchain langchain-openai sentence-transformers faiss-cpu pypdf
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA
# NOTE: Set your OpenAI API key
# os.environ["OPENAI_API_KEY"] = "your_api_key"
# For demonstration, let's create a dummy text file that exemplifies the problem
dummy_text = """
The ChronoDB system architecture is based on a Log-Structured Merge-Tree (LSM-Tree), which optimizes for high write throughput. This is critical for ingestion-heavy workloads. However, read performance can be a concern, especially for analytical workloads that require scanning large data ranges.
To mitigate this, ChronoDB employs a tiered caching mechanism. The primary L1 cache is an in-memory LRU cache for hot data, providing sub-millisecond access times. The L2 cache utilizes a memory-mapped file on NVMe storage for warm data that doesn't fit in RAM, offering single-digit millisecond latency.
While caching helps, the most impactful optimization for read latency is not the caching system itself. The 'Query Pipelining' feature, which is disabled by default for stability reasons in mixed workloads, can reduce P99 latency by over 50% for complex analytical queries. It achieves this by breaking a query into parallel execution stages.
To enable this powerful feature, the configuration flag `enable_query_pipelining` must be set to `true` in the `chronos.yaml` configuration file. After changing the setting, a full restart of the ChronoDB service is required for the change to take effect. Failure to restart will result in the setting not being applied.
"""
with open("chronodb_docs.txt", "w") as f:
f.write(dummy_text)
from langchain_community.document_loaders import TextLoader
loader = TextLoader("chronodb_docs.txt")
docs = loader.load()
# Naive RAG: RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=50)
splits = text_splitter.split_documents(docs)
print(f"Naive splitting created {len(splits)} chunks.")
for i, split in enumerate(splits):
    print(f"--- Chunk {i+1} ---\n{split.page_content}\n")
# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore_naive = FAISS.from_documents(documents=splits, embedding=embeddings)
# Create QA Chain
llm = ChatOpenAI(model_name="gpt-4o", temperature=0)
qa_chain_naive = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore_naive.as_retriever(search_kwargs={'k': 2})
)
query = "How do I enable the feature that reduces P99 latency by 50% in ChronoDB?"
result_naive = qa_chain_naive.invoke({"query": query})
print("\n--- Naive RAG Query ---")
print(f"Query: {query}")
print(f"Result: {result_naive['result']}")
Running this code will likely yield an incomplete answer. The retriever might pull the chunk mentioning "Query Pipelining reduces P99 latency by 50%" but miss the chunk with `enable_query_pipelining: true`. The LLM, therefore, cannot provide the full, actionable answer.
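Before moving on, it is worth inspecting exactly which chunks the naive retriever hands to the LLM. A minimal inspection sketch, reusing the vectorstore_naive and query objects defined above (the scores are raw FAISS distances, so lower means more similar):

# Inspect the raw top-k chunks the naive retriever returns.
# If the `enable_query_pipelining` chunk is missing here, the LLM cannot
# produce an actionable answer no matter how capable the model is.
naive_hits = vectorstore_naive.similarity_search_with_score(query, k=2)
for rank, (doc, distance) in enumerate(naive_hits, start=1):
    print(f"Rank {rank} (distance={distance:.4f}):\n{doc.page_content}\n")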
Section 1: Implementing Sentence-Window Retrieval
The core idea is to decouple the unit of embedding from the unit of retrieval. We embed fine-grained sentences to achieve high-precision similarity search, but we provide the LLM with a wider, more coherent context window around that sentence.
The Strategy
1. Split each document into individual sentences using a robust sentence tokenizer.
2. Embed and index each sentence on its own, so similarity search operates with sentence-level precision.
3. At query time, for every matching sentence, return a context window containing the k sentences before it and the k sentences after it, and pass that window (not the lone sentence) to the LLM.
Production-Grade Implementation
We'll build a custom data processing pipeline to handle this logic. While libraries like LlamaIndex offer a built-in SentenceWindowNodeParser, building it ourselves provides deeper understanding and full control over the behavior.
import nltk
from langchain.docstore.document import Document
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
import re
# Download sentence tokenizer models if not already present
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    # nltk.data.find raises LookupError when the resource is missing;
    # newer NLTK releases may also require the 'punkt_tab' resource
    nltk.download('punkt')
class SentenceWindowRetriever:
    def __init__(self, docs, window_size=2, embedding_model=None):
        self.docs = docs
        self.window_size = window_size
        self.embedding_model = embedding_model or OpenAIEmbeddings()
        self.vector_store = self._build_vector_store()

    def _process_documents(self):
        processed_nodes = []
        for doc in self.docs:
            text = doc.page_content
            # Use a more robust sentence splitter
            sentences = nltk.sent_tokenize(text)
            for i, sentence in enumerate(sentences):
                # Edge case handling: window boundaries
                start_index = max(0, i - self.window_size)
                end_index = min(len(sentences), i + self.window_size + 1)
                window_sentences = sentences[start_index:end_index]
                context_window = " ".join(window_sentences)
                # Create a document for the single sentence (for embedding)
                # and store the full window in metadata
                node = Document(
                    page_content=sentence,
                    metadata={
                        'source': doc.metadata.get('source', 'unknown'),
                        'window': context_window,
                        'original_sentence_index': i
                    }
                )
                processed_nodes.append(node)
        return processed_nodes

    def _build_vector_store(self):
        nodes = self._process_documents()
        # We embed only the single sentence (node.page_content)
        vector_store = FAISS.from_documents(documents=nodes, embedding=self.embedding_model)
        return vector_store

    def retrieve(self, query, k=4):
        # 1. Retrieve the most relevant sentence nodes
        sentence_nodes = self.vector_store.similarity_search(query, k=k)
        # 2. Extract the full context windows from metadata
        #    De-duplicate windows to avoid feeding redundant context to the LLM
        unique_windows = {}
        for node in sentence_nodes:
            window = node.metadata['window']
            # A simple way to de-duplicate based on the window content itself
            unique_windows[window] = True
        # 3. Create new documents from the unique context windows
        context_docs = [Document(page_content=window) for window in unique_windows.keys()]
        return context_docs
# --- Let's use it with our ChronoDB example ---
# Load the document again
loader = TextLoader("chronodb_docs.txt")
docs = loader.load()
# Initialize our custom retriever
sentence_window_retriever = SentenceWindowRetriever(docs, window_size=2)
# Create a LangChain retriever from our custom logic
from langchain_core.retrievers import BaseRetriever
from langchain_core.callbacks import CallbackManagerForRetrieverRun

class CustomRetriever(BaseRetriever):
    custom_logic_retriever: SentenceWindowRetriever  # declared as a pydantic field
    k: int = 4

    class Config:
        arbitrary_types_allowed = True  # allow the non-pydantic SentenceWindowRetriever type

    def _get_relevant_documents(self, query: str, *, run_manager: CallbackManagerForRetrieverRun):
        return self.custom_logic_retriever.retrieve(query, k=self.k)
# Instantiate the custom retriever for the QA chain
custom_retriever_instance = CustomRetriever(
    custom_logic_retriever=sentence_window_retriever,
    k=4
)
# Create the new QA chain
qa_chain_advanced = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-4o", temperature=0),
    retriever=custom_retriever_instance
)
query = "How do I enable the feature that reduces P99 latency by 50% in ChronoDB?"
result_advanced = qa_chain_advanced.invoke({"query": query})
print("\n--- Advanced RAG (Sentence-Window) Query ---")
print(f"Query: {query}")
print(f"Result: {result_advanced['result']}")
# Let's inspect the retrieved context to see why it's better
retrieved_context = sentence_window_retriever.retrieve(query, k=4)
print("\n--- Retrieved Context (Sentence-Window) ---")
for i, doc in enumerate(retrieved_context):
    print(f"Context {i+1}:\n{doc.page_content}\n")
When you run this, the retrieved context will be far superior. The vector search will likely match the sentence "To enable this powerful feature, the configuration flag enable_query_pipelining must be set to true in the chronos.yaml configuration file." Because our window size is 2, the retrieved context passed to the LLM will be a coherent block including the sentences before and after, such as: "...The 'Query Pipelining' feature... can reduce P99 latency by over 50%... To enable this powerful feature, the configuration flag enable_query_pipelining must be set to true... After changing the setting, a full restart... is required...". The LLM now has all the necessary information to give a complete, actionable answer.
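To see the decoupling in action, print the matched sentence next to the window stored in its metadata. A quick inspection sketch, reusing sentence_window_retriever and query from above (exact output depends on the NLTK sentence splits):

# Compare the fine-grained unit we embedded (the sentence) with the
# coarse-grained unit we hand to the LLM (the window stored in metadata).
hits = sentence_window_retriever.vector_store.similarity_search(query, k=2)
for rank, node in enumerate(hits, start=1):
    print(f"--- Match {rank} ---")
    print(f"Embedded sentence : {node.page_content}")
    print(f"Window for the LLM: {node.metadata['window']}\n")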
Section 2: The Reranking Imperative: Bi-Encoders vs. Cross-Encoders
Sentence-window retrieval dramatically improves context quality. However, the initial retrieval is still driven by a bi-encoder (such as OpenAI's embedding models or all-MiniLM-L6-v2). Bi-encoders are fast and scalable because they embed the query and the documents independently of one another.
Bi-Encoder: cos_sim(embed(query), embed(document))
This is efficient for searching over millions of documents but has a drawback: it lacks deep interaction. The model never sees the query and the document at the same time. It's comparing two abstract representations, which can sometimes miss subtle but critical relevance cues.
Enter the cross-encoder. A cross-encoder takes both the query and a document as a single input and outputs a relevance score (e.g., from 0 to 1). It performs full self-attention across the combined text.
Cross-Encoder: model([query, document]) -> relevance_score
This is computationally expensive. You cannot run a cross-encoder over your entire corpus for every query. But it is far more accurate at determining true relevance.
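The difference is easy to observe empirically. Here is a small, self-contained comparison sketch; the model names (all-MiniLM-L6-v2 as the bi-encoder, cross-encoder/ms-marco-MiniLM-L-6-v2 as the cross-encoder) are common open-source choices rather than a prescription:

from sentence_transformers import SentenceTransformer, CrossEncoder, util

sample_query = "How do I reduce P99 read latency in ChronoDB?"
sample_passage = ("To enable this powerful feature, the configuration flag "
                  "enable_query_pipelining must be set to true in the chronos.yaml file.")

# Bi-encoder: query and passage are embedded independently, then compared.
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = bi_encoder.encode([sample_query, sample_passage], convert_to_tensor=True)
print("Bi-encoder cosine similarity:", util.cos_sim(embeddings[0], embeddings[1]).item())

# Cross-encoder: query and passage pass through the model together, so full
# attention between the two texts informs the relevance score.
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
print("Cross-encoder relevance score:", cross_encoder.predict([(sample_query, sample_passage)])[0])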
The production pattern is a two-stage process:
1. Retrieve: use the fast bi-encoder (vector search) to pull a generous candidate set, for example the top 10 to 50 context windows. This stage optimizes for recall.
2. Rerank: score every (query, candidate) pair with the cross-encoder and keep only the top few, for example 3 to 5. This stage optimizes for precision.
Implementing a Cross-Encoder Reranking Stage
We'll use the sentence-transformers library, which provides pre-trained cross-encoder models well suited to this task. A popular choice is cross-encoder/ms-marco-MiniLM-L-6-v2, which is trained on the MS MARCO dataset of real search queries.
from sentence_transformers import CrossEncoder
from typing import List
class Reranker:
    def __init__(self, model_name='cross-encoder/ms-marco-MiniLM-L-6-v2'):
        # Initialize the CrossEncoder model
        # Using a GPU is highly recommended for performance
        self.model = CrossEncoder(model_name, max_length=512)

    def rerank(self, query: str, documents: List[Document], top_n: int = 3):
        if not documents:
            return []
        # Create pairs of [query, doc_content] for the model
        doc_contents = [doc.page_content for doc in documents]
        pairs = [[query, doc_content] for doc_content in doc_contents]
        # Predict scores. The model handles batching internally.
        scores = self.model.predict(pairs, show_progress_bar=False)
        # Combine documents with their scores
        doc_scores = list(zip(documents, scores))
        # Sort by score in descending order
        doc_scores.sort(key=lambda x: x[1], reverse=True)
        # Return the top N documents
        return [doc for doc, score in doc_scores[:top_n]]
# --- Integrating the Reranker into our full pipeline ---
class FullAdvancedRAGPipeline:
    def __init__(self, docs, window_size=2, retrieval_k=10, rerank_top_n=3):
        self.retrieval_k = retrieval_k
        self.rerank_top_n = rerank_top_n
        print("Initializing Sentence-Window Retriever...")
        self.retriever = SentenceWindowRetriever(docs, window_size=window_size)
        print("Initializing Cross-Encoder Reranker...")
        self.reranker = Reranker()
        print("Initializing LLM...")
        self.llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

    def query(self, query_text: str):
        # 1. Retrieve initial candidates using the Sentence-Window method
        print(f"\nStep 1: Retrieving top {self.retrieval_k} candidates...")
        initial_candidates = self.retriever.retrieve(query_text, k=self.retrieval_k)
        print(f"Retrieved {len(initial_candidates)} unique context windows.")
        # 2. Rerank the candidates using the Cross-Encoder
        print(f"Step 2: Reranking candidates to select top {self.rerank_top_n}...")
        reranked_docs = self.reranker.rerank(query_text, initial_candidates, top_n=self.rerank_top_n)
        # 3. Augment the prompt and generate the answer
        print("Step 3: Generating response with reranked context...")
        context_for_llm = "\n\n---\n\n".join([doc.page_content for doc in reranked_docs])
        prompt_template = f"""
Answer the following question based *only* on the provided context.
If the answer is not in the context, say you don't know.
Context:
{context_for_llm}
Question: {query_text}
Answer:
"""
        response = self.llm.invoke(prompt_template)
        return response.content, reranked_docs
# --- Run the full pipeline ---
# Load docs
loader = TextLoader("chronodb_docs.txt")
docs = loader.load()
# Create and run the pipeline
full_pipeline = FullAdvancedRAGPipeline(docs, retrieval_k=10, rerank_top_n=3)
final_answer, final_context = full_pipeline.query(query)
print("\n--- Full Advanced RAG (Retrieval + Reranking) ---")
print(f"Query: {query}")
print(f"Final Answer: {final_answer}")
print("\n--- Final Context Provided to LLM ---")
for i, doc in enumerate(final_context):
    print(f"Context {i+1}:\n{doc.page_content}\n")
This final pipeline embodies a robust, production-ready pattern. We retrieve a wide net of 10 candidates to maximize our chances of capturing the relevant information (high recall). Then, we use the computationally expensive but highly precise cross-encoder to distill these down to the 3 most relevant contexts (high precision). This ensures the LLM's limited context window is filled with signal, not noise.
Section 3: Performance, Optimization, and Edge Cases
Deploying this advanced pipeline requires careful consideration of performance and potential failure modes.
Performance and Latency
The reranking step is the primary source of additional latency. Let's consider the trade-offs:
* retrieval_k (Initial Candidates): This is the most critical tuning parameter. A larger k increases the likelihood of capturing the correct document but linearly increases the reranker's workload (see the timing sketch after this list). A typical starting point is between 20 and 50; our toy example uses 10 only because the corpus is tiny.
* Model Choice: The size of the cross-encoder model directly impacts latency. ms-marco-MiniLM-L-6-v2 is small and fast. Larger models like ms-marco-DeBERTa-v3-large might offer higher accuracy at the cost of significantly more computation. Always benchmark.
* Hardware: Cross-encoders are transformer models and benefit immensely from GPUs. Running on a CPU will be substantially slower. For production systems, a GPU endpoint (e.g., on SageMaker, Vertex AI, or a dedicated server) is essential for acceptable latency.
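To make the retrieval_k trade-off concrete, the following timing sketch measures wall-clock reranking latency for different candidate counts. It reuses the Reranker class and Document import from earlier; the passages are synthetic stand-ins for context windows, and absolute numbers depend entirely on your hardware:

import time

# Measure how cross-encoder reranking latency grows with the candidate count.
reranker = Reranker()
sample_query = "How do I reduce P99 read latency?"
candidates = [
    Document(page_content=f"Synthetic context window number {i} about database tuning.")
    for i in range(50)
]

for k in (5, 10, 20, 50):
    start = time.perf_counter()
    reranker.rerank(sample_query, candidates[:k], top_n=3)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"retrieval_k={k:>2}: rerank took {elapsed_ms:.1f} ms")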
Optimization: Model Quantization and Compilation
For CPU-based deployments or to further optimize GPU inference, consider these techniques:
* Quantization: converting the cross-encoder's weights from FP32 to INT8 shrinks the model and can substantially speed up CPU inference, usually at a small cost in accuracy (a minimal sketch follows this list).
* ONNX export and optimized runtimes: exporting the model to ONNX and serving it with ONNX Runtime enables graph-level optimizations and faster inference. Hugging Face's optimum library can simplify this process.
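As one illustration, here is a hedged sketch of dynamic INT8 quantization for CPU serving. It assumes the sentence-transformers CrossEncoder keeps its underlying Hugging Face model on the .model attribute (true in current releases, but verify against your installed version) and uses PyTorch's built-in dynamic quantization:

import torch
from sentence_transformers import CrossEncoder

# Sketch: dynamic INT8 quantization of the cross-encoder for CPU inference.
# Assumption: CrossEncoder exposes its Hugging Face model on `.model`.
cpu_reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', device='cpu')
cpu_reranker.model = torch.quantization.quantize_dynamic(
    cpu_reranker.model,      # the underlying AutoModelForSequenceClassification
    {torch.nn.Linear},       # quantize only the Linear layers
    dtype=torch.qint8
)

# Scores should stay close to the FP32 model; benchmark accuracy before deploying.
print(cpu_reranker.predict([["reduce P99 latency", "Enable the query pipelining feature."]]))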
Handling Edge Cases
* Noisy Documents: The sentence tokenizer can be brittle. A document with many abbreviations, no periods, or unusual formatting can lead to poor sentence splits. Pre-processing the text to clean up formatting and normalize punctuation is a critical, often overlooked, step.
* Short Documents: Our SentenceWindowRetriever's windowing logic must gracefully handle documents with fewer than 2 * window_size + 1 sentences. Our implementation using max(0, ...) and min(len(sentences), ...) already accounts for this by shrinking the window near document boundaries.
* Contradictory Information: What if the top-3 reranked documents contain conflicting facts? This is a difficult open problem. A practical strategy is to adjust the final prompt to the LLM:
"Based on the following contexts, answer the question. If the contexts provide conflicting information, please highlight the contradiction in your answer."
This empowers the LLM to act as a synthesizer rather than a simple extractor, which is often more useful for the end-user.
* The 'Lost in the Middle' Problem: LLMs sometimes pay more attention to information at the beginning or end of their context window. When concatenating your final documents, consider whether the most relevant document (as determined by the reranker score) should be placed first or last in the context string to maximize its impact. A small reordering sketch follows this list.
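Below is a minimal reordering sketch along these lines: it interleaves the reranked documents so the strongest sit at the two ends of the final context and the weakest land in the middle. The helper name reorder_for_context is ours, not a library API:

from typing import List
from langchain.docstore.document import Document

def reorder_for_context(docs_best_first: List[Document]) -> List[Document]:
    """Place the strongest documents at the start and end of the context,
    pushing the weakest toward the middle, where LLM attention tends to dip."""
    front, back = [], []
    for i, doc in enumerate(docs_best_first):  # docs arrive sorted best-first
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# Example: rank order [1, 2, 3, 4, 5] becomes [1, 3, 5, 4, 2] in the final context.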
Conclusion: From Prototype to Production
Moving from a naive RAG implementation to a sophisticated retrieval pipeline is a defining step in building AI systems that are not just impressive demos, but reliable, production-grade tools. By combining the high-recall, context-aware approach of Sentence-Window Retrieval with the high-precision filtering of Cross-Encoder Reranking, we directly attack the core weaknesses of basic RAG.
This two-stage retrieval pattern provides a robust framework:
* It respects semantic boundaries, preventing context fragmentation.
* It focuses embeddings on granular concepts (sentences), avoiding context dilution.
* It uses a computationally-aware hybrid model, leveraging the speed of bi-encoders for broad search and the accuracy of cross-encoders for final selection.
While more complex, this architecture is a powerful investment. It drastically reduces hallucinations, improves the factual accuracy of generated responses, and ultimately builds user trust in your LLM-powered applications. The next time your RAG system gives a frustratingly incomplete answer, you'll know that the solution lies not just in a better LLM, but in a fundamentally better retriever.