Advanced RAG: Sentence-Window Retrieval & Cross-Encoder Reranking

20 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Production Gap in Naive RAG

Retrieval-Augmented Generation (RAG) has become a cornerstone for building context-aware LLM applications. The standard approach—chunking documents, embedding the chunks, and performing vector search—is a powerful starting point. However, senior engineers quickly discover its limitations in production. Naive RAG systems often suffer from a fundamental context-granularity problem.

Consider a fixed-size chunking strategy (e.g., 1000 characters with 200 characters of overlap). This approach is blind to semantic boundaries and can lead to two critical failure modes:

  • Context Fragmentation: A single, coherent idea is split across two or more chunks. The most relevant chunk might be retrieved, but it's missing the crucial preceding or succeeding context, leading the LLM to generate incomplete or incorrect answers.
  • Context Dilution: A chunk contains a single relevant sentence buried within a sea of irrelevant text. The overall embedding for the chunk is diluted, causing it to rank lower than less relevant but more thematically consistent chunks during retrieval.

    This leads to a frustrating paradox: your vector database contains the correct information, but the LLM never sees it in a usable form. The result is hallucination, low-quality responses, and a system that isn't trusted by its users.

    To bridge this gap, we need to evolve our retrieval strategy. This article presents a production-grade, two-stage pipeline that directly addresses these issues:

  • Sentence-Window Retrieval: We embed individual sentences for high-precision semantic search. However, we retrieve a larger window of sentences surrounding the matched sentence to provide the LLM with rich, coherent context.
  • Cross-Encoder Reranking: We use the initial retrieval as a candidate generation step. We then apply a more powerful, computationally intensive cross-encoder model to rerank these candidates based on true semantic relevance to the query, ensuring only the most pertinent information reaches the LLM's context window.

    Let's demonstrate the failure of naive RAG before building our advanced solution.

    The Failure Case: A Concrete Example

    Imagine a technical document about a fictional database system called 'ChronoDB'.

    text
    ... The system's architecture is based on a Log-Structured Merge-Tree (LSM-Tree), which optimizes for high write throughput. This is critical for ingestion-heavy workloads. However, read performance can be a concern. To mitigate this, ChronoDB employs a tiered caching mechanism. The primary L1 cache is an in-memory LRU cache for hot data. The L2 cache utilizes a memory-mapped file on NVMe storage for warm data that doesn't fit in RAM. The most impactful optimization for read latency, however, is not the caching. The 'Query Pipelining' feature, which is disabled by default, can reduce P99 latency by over 50% for complex analytical queries. To enable it, the configuration flag `enable_query_pipelining` must be set to `true` in the `chronos.yaml` file. This feature parallelizes query execution stages... 

    Query: "How do I reduce P99 read latency in ChronoDB?"

    A naive chunker might create a chunk like this:

    Chunk A: "...The L2 cache utilizes a memory-mapped file on NVMe storage for warm data that doesn't fit in RAM. The most impactful optimization for read latency, however, is not the caching. The 'Query Pipelining' feature, which is disabled by default, can reduce P99 latency by over 50%..."

    And another chunk:

    Chunk B: "...for complex analytical queries. To enable it, the configuration flag enable_query_pipelining must be set to true in the chronos.yaml file. This feature parallelizes query execution stages..."

    Vector search for our query might rank Chunk A highly because it contains "reduce P99 latency". However, it completely misses the actionable instruction on how to enable the feature. The LLM might respond, "You can reduce P99 latency by using the 'Query Pipelining' feature," which is correct but useless. Chunk B, containing the critical configuration detail, might not even be in the top-k results.

    This is the problem we will solve.

    python
    # Setup: Ensure you have the necessary libraries installed
    # !pip install -U langchain langchain-community langchain-openai sentence-transformers faiss-cpu pypdf nltk
    
    import os
    from langchain_community.document_loaders import PyPDFLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.vectorstores import FAISS
    from langchain_openai import OpenAIEmbeddings, ChatOpenAI
    from langchain.chains import RetrievalQA
    
    # NOTE: Set your OpenAI API key
    # os.environ["OPENAI_API_KEY"] = "your_api_key"
    
    # For demonstration, let's create a dummy text file that exemplifies the problem
    dummy_text = """
    The ChronoDB system architecture is based on a Log-Structured Merge-Tree (LSM-Tree), which optimizes for high write throughput. This is critical for ingestion-heavy workloads. However, read performance can be a concern, especially for analytical workloads that require scanning large data ranges.
    
    To mitigate this, ChronoDB employs a tiered caching mechanism. The primary L1 cache is an in-memory LRU cache for hot data, providing sub-millisecond access times. The L2 cache utilizes a memory-mapped file on NVMe storage for warm data that doesn't fit in RAM, offering single-digit millisecond latency.
    
    While caching helps, the most impactful optimization for read latency is not the caching system itself. The 'Query Pipelining' feature, which is disabled by default for stability reasons in mixed workloads, can reduce P99 latency by over 50% for complex analytical queries. It achieves this by breaking a query into parallel execution stages.
    
    To enable this powerful feature, the configuration flag `enable_query_pipelining` must be set to `true` in the `chronos.yaml` configuration file. After changing the setting, a full restart of the ChronoDB service is required for the change to take effect. Failure to restart will result in the setting not being applied.
    """
    
    with open("chronodb_docs.txt", "w") as f:
        f.write(dummy_text)
    
    from langchain_community.document_loaders import TextLoader
    
    loader = TextLoader("chronodb_docs.txt")
    docs = loader.load()
    
    # Naive RAG: RecursiveCharacterTextSplitter
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=50)
    splits = text_splitter.split_documents(docs)
    
    print(f"Naive splitting created {len(splits)} chunks.")
    for i, split in enumerate(splits):
        print(f"--- Chunk {i+1} ---\n{split.page_content}\n")
    
    # Create vector store
    embeddings = OpenAIEmbeddings()
    vectorstore_naive = FAISS.from_documents(documents=splits, embedding=embeddings)
    
    # Create QA Chain
    llm = ChatOpenAI(model_name="gpt-4o", temperature=0)
    qa_chain_naive = RetrievalQA.from_chain_type(
        llm,
        retriever=vectorstore_naive.as_retriever(search_kwargs={'k': 2})
    )
    
    query = "How do I enable the feature that reduces P99 latency by 50% in ChronoDB?"
    result_naive = qa_chain_naive.invoke({"query": query})
    
    print("\n--- Naive RAG Query ---")
    print(f"Query: {query}")
    print(f"Result: {result_naive['result']}")

    Running this code will likely yield an incomplete answer. The retriever might pull the chunk mentioning "Query Pipelining reduces P99 latency by 50%" but miss the chunk with enable_query_pipelining: true. The LLM, therefore, cannot provide the full, actionable answer.
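
    One way to confirm this failure mode is to inspect exactly which chunks the naive retriever returns. The short diagnostic below reuses the vectorstore_naive and query objects from the snippet above (the printed score is FAISS's raw L2 distance, so lower means closer):

    python
    # Diagnostic: inspect the chunks the naive retriever returns for our query.
    # Reuses `vectorstore_naive` and `query` from the previous snippet.
    naive_hits = vectorstore_naive.similarity_search_with_score(query, k=2)

    print("\n--- Naive Retrieval Diagnostics ---")
    for rank, (doc, score) in enumerate(naive_hits, start=1):
        has_flag = "enable_query_pipelining" in doc.page_content
        print(f"Rank {rank} | distance={score:.4f} | contains config flag: {has_flag}")
        print(doc.page_content[:200], "...\n")

    If the chunk containing the configuration flag is absent from (or ranked below) the top-k results, the LLM never sees it, which is exactly the fragmentation problem described above.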

    Section 1: Implementing Sentence-Window Retrieval

    The core idea is to decouple the unit of embedding from the unit of retrieval. We embed fine-grained sentences to achieve high-precision similarity search, but we provide the LLM with a wider, more coherent context window around that sentence.

    The Strategy

  • Split Document into Sentences: Use a reliable sentence tokenizer (like NLTK or spaCy) to split the document into individual sentences.
  • Define a Context Window: For each sentence, create a "window" of context that includes k sentences before it and k sentences after it.
  • Store Metadata: In our vector store, each object will represent a single sentence. The vector will be the embedding of that single sentence. Crucially, we store the full context window as metadata associated with that vector.
  • Retrieve and Augment: During retrieval, we perform a vector search against the single-sentence embeddings. When we find the top-N matching sentences, we don't return the sentences themselves. Instead, we return the full context windows stored in their metadata.

    Production-Grade Implementation

    We'll build a custom data processing pipeline to handle this logic. While libraries like LlamaIndex have built-in SentenceWindowNodeParser, building it ourselves provides deeper understanding and customizability.
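
    For reference, the LlamaIndex equivalent looks roughly like this (a sketch assuming the llama-index-core package and the dummy_text string from earlier; we continue with the hand-rolled version below):

    python
    # Rough LlamaIndex equivalent of the sentence-window split, for comparison only.
    from llama_index.core import Document as LIDocument
    from llama_index.core.node_parser import SentenceWindowNodeParser

    parser = SentenceWindowNodeParser.from_defaults(
        window_size=2,                               # sentences on each side
        window_metadata_key="window",                # where the full window is stored
        original_text_metadata_key="original_sentence",
    )

    nodes = parser.get_nodes_from_documents([LIDocument(text=dummy_text)])
    print(nodes[0].metadata["window"])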

    python
    import nltk
    from langchain.docstore.document import Document
    from langchain_openai import OpenAIEmbeddings
    from langchain_community.vectorstores import FAISS
    from langchain.chains import RetrievalQA
    from langchain_openai import ChatOpenAI
    import re
    
    # Download sentence tokenizer models if not already present
    try:
        nltk.data.find('tokenizers/punkt')
    except LookupError:  # nltk.data.find raises LookupError when the resource is missing
        nltk.download('punkt')
    
    class SentenceWindowRetriever:
        def __init__(self, docs, window_size=2, embedding_model=None):
            self.docs = docs
            self.window_size = window_size
            self.embedding_model = embedding_model or OpenAIEmbeddings()
            self.vector_store = self._build_vector_store()
    
        def _process_documents(self):
            processed_nodes = []
            for doc in self.docs:
                text = doc.page_content
                # Use a more robust sentence splitter
                sentences = nltk.sent_tokenize(text)
                for i, sentence in enumerate(sentences):
                    # Edge case handling: window boundaries
                    start_index = max(0, i - self.window_size)
                    end_index = min(len(sentences), i + self.window_size + 1)
                    
                    window_sentences = sentences[start_index:end_index]
                    context_window = " ".join(window_sentences)
                    
                    # Create a document for the single sentence (for embedding)
                    # and store the full window in metadata
                    node = Document(
                        page_content=sentence, 
                        metadata={
                            'source': doc.metadata.get('source', 'unknown'),
                            'window': context_window,
                            'original_sentence_index': i
                        }
                    )
                    processed_nodes.append(node)
            return processed_nodes
    
        def _build_vector_store(self):
            nodes = self._process_documents()
            # We embed only the single sentence (node.page_content)
            vector_store = FAISS.from_documents(documents=nodes, embedding=self.embedding_model)
            return vector_store
    
        def retrieve(self, query, k=4):
            # 1. Retrieve the most relevant sentence nodes
            sentence_nodes = self.vector_store.similarity_search(query, k=k)
            
            # 2. Extract the full context windows from metadata
            # De-duplicate windows to avoid feeding redundant context to the LLM
            unique_windows = {}
            for node in sentence_nodes:
                window = node.metadata['window']
                # A simple way to de-duplicate based on the window content itself
                unique_windows[window] = True
            
            # 3. Create new documents from the unique context windows
            context_docs = [Document(page_content=window) for window in unique_windows.keys()]
            return context_docs
    
    # --- Let's use it with our ChronoDB example ---
    
    # Load the document again
    loader = TextLoader("chronodb_docs.txt")
    docs = loader.load()
    
    # Initialize our custom retriever
    sentence_window_retriever = SentenceWindowRetriever(docs, window_size=2)
    
    # Create a LangChain retriever from our custom logic by subclassing BaseRetriever
    from typing import Any, List
    from langchain_core.callbacks import CallbackManagerForRetrieverRun
    from langchain_core.retrievers import BaseRetriever

    class CustomRetriever(BaseRetriever):
        # Declared as pydantic fields so they can be set via the constructor
        custom_logic_retriever: Any
        search_kwargs: dict = {}

        def _get_relevant_documents(self, query: str, *, run_manager: CallbackManagerForRetrieverRun) -> List[Document]:
            return self.custom_logic_retriever.retrieve(query, k=self.search_kwargs.get('k', 4))
    
    # Instantiate the custom retriever for the QA chain
    custom_retriever_instance = CustomRetriever(
        custom_logic_retriever=sentence_window_retriever,
        search_kwargs={'k': 4}
    )
    
    # Create the new QA chain
    qa_chain_advanced = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(model_name="gpt-4o", temperature=0),
        retriever=custom_retriever_instance
    )
    
    query = "How do I enable the feature that reduces P99 latency by 50% in ChronoDB?"
    result_advanced = qa_chain_advanced.invoke({"query": query})
    
    print("\n--- Advanced RAG (Sentence-Window) Query ---")
    print(f"Query: {query}")
    print(f"Result: {result_advanced['result']}")
    
    # Let's inspect the retrieved context to see why it's better
    retrieved_context = sentence_window_retriever.retrieve(query, k=4)
    print("\n--- Retrieved Context (Sentence-Window) ---")
    for i, doc in enumerate(retrieved_context):
        print(f"Context {i+1}:\n{doc.page_content}\n")

    When you run this, the retrieved context will be far superior. The vector search will likely match the sentence "To enable this powerful feature, the configuration flag enable_query_pipelining must be set to true in the chronos.yaml configuration file." Because our window size is 2, the retrieved context passed to the LLM will be a coherent block including the sentences before and after, such as: "...The 'Query Pipelining' feature... can reduce P99 latency by over 50%... To enable this powerful feature, the configuration flag enable_query_pipelining must be set to true... After changing the setting, a full restart... is required...". The LLM now has all the necessary information to give a complete, actionable answer.
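
    It is also instructive to compare the single sentence the bi-encoder matched with the window that is actually handed to the LLM. A quick check using the retriever we just built:

    python
    # Compare the matched sentence (what was embedded) with its context window
    # (what the LLM actually receives). Reuses `sentence_window_retriever` and `query`.
    top_nodes = sentence_window_retriever.vector_store.similarity_search(query, k=1)
    matched = top_nodes[0]

    print("Matched sentence:")
    print(matched.page_content)
    print("\nWindow passed to the LLM:")
    print(matched.metadata["window"])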

    Section 2: The Reranking Imperative: Bi-Encoders vs. Cross-Encoders

    Sentence-window retrieval dramatically improves context quality. However, the initial retrieval still relies on a bi-encoder model (such as OpenAI's embedding models or all-MiniLM-L6-v2). Bi-encoders are fast and scalable because they embed the query and the documents independently.

    Bi-Encoder: cos_sim(embed(query), embed(document))

    This is efficient for searching over millions of documents but has a drawback: it lacks deep interaction. The model never sees the query and the document at the same time. It's comparing two abstract representations, which can sometimes miss subtle but critical relevance cues.

    Enter the cross-encoder. A cross-encoder takes both the query and a document as a single input and outputs a relevance score (e.g., from 0 to 1). It performs full self-attention across the combined text.

    Cross-Encoder: model([query, document]) -> relevance_score

    This is computationally expensive. You cannot run a cross-encoder over your entire corpus for every query. But it is far more accurate at determining true relevance.
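
    The difference is easy to see empirically. The minimal comparison below (assuming the sentence-transformers package and the small all-MiniLM-L6-v2 bi-encoder) scores the same two passages both ways:

    python
    # Bi-encoder vs. cross-encoder scoring on the same query/passage pairs.
    from sentence_transformers import SentenceTransformer, CrossEncoder, util

    sample_query = "How do I reduce P99 read latency in ChronoDB?"
    passages = [
        "The L2 cache utilizes a memory-mapped file on NVMe storage for warm data.",
        "To enable Query Pipelining, set enable_query_pipelining to true in chronos.yaml.",
    ]

    # Bi-encoder: embed query and passages independently, then compare vectors.
    bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
    q_emb = bi_encoder.encode(sample_query, convert_to_tensor=True)
    p_embs = bi_encoder.encode(passages, convert_to_tensor=True)
    print("Bi-encoder cosine similarities:", util.cos_sim(q_emb, p_embs))

    # Cross-encoder: score each (query, passage) pair jointly with full attention.
    cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    print("Cross-encoder scores:", cross_encoder.predict([[sample_query, p] for p in passages]))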

    The production pattern is a two-stage process:

  • Recall Stage: Use a fast bi-encoder (our sentence-window retriever) to fetch a large set of candidate documents (e.g., top 20-50).
  • Rerank Stage: Use a highly accurate cross-encoder to rerank only this small set of candidates and select the final top-N (e.g., top 3-5) to pass to the LLM.

    Implementing a Cross-Encoder Reranking Stage

    We'll use the sentence-transformers library, which provides pre-trained cross-encoder models perfect for this task. A popular choice is ms-marco-MiniLM-L-6-v2, which is trained on a massive dataset of search queries.

    python
    from sentence_transformers import CrossEncoder
    from typing import List
    
    class Reranker:
        def __init__(self, model_name='cross-encoder/ms-marco-MiniLM-L-6-v2'):
            # Initialize the CrossEncoder model
            # Using a GPU is highly recommended for performance
            self.model = CrossEncoder(model_name, max_length=512)
    
        def rerank(self, query: str, documents: List[Document], top_n: int = 3):
            if not documents:
                return []
    
            # Create pairs of [query, doc_content] for the model
            doc_contents = [doc.page_content for doc in documents]
            pairs = [[query, doc_content] for doc_content in doc_contents]
    
            # Predict scores. The model handles batching internally.
            scores = self.model.predict(pairs, show_progress_bar=False)
    
            # Combine documents with their scores
            doc_scores = list(zip(documents, scores))
    
            # Sort by score in descending order
            doc_scores.sort(key=lambda x: x[1], reverse=True)
    
            # Return the top N documents
            return [doc for doc, score in doc_scores[:top_n]]
    
    # --- Integrating the Reranker into our full pipeline ---
    
    class FullAdvancedRAGPipeline:
        def __init__(self, docs, window_size=2, retrieval_k=10, rerank_top_n=3):
            self.retrieval_k = retrieval_k
            self.rerank_top_n = rerank_top_n
            print("Initializing Sentence-Window Retriever...")
            self.retriever = SentenceWindowRetriever(docs, window_size=window_size)
            print("Initializing Cross-Encoder Reranker...")
            self.reranker = Reranker()
            print("Initializing LLM...")
            self.llm = ChatOpenAI(model_name="gpt-4o", temperature=0)
    
        def query(self, query_text: str):
            # 1. Retrieve initial candidates using Sentence-Window method
            print(f"\nStep 1: Retrieving top {self.retrieval_k} candidates...")
            initial_candidates = self.retriever.retrieve(query_text, k=self.retrieval_k)
            print(f"Retrieved {len(initial_candidates)} unique context windows.")
    
            # 2. Rerank the candidates using the Cross-Encoder
            print(f"Step 2: Reranking candidates to select top {self.rerank_top_n}...")
            reranked_docs = self.reranker.rerank(query_text, initial_candidates, top_n=self.rerank_top_n)
    
            # 3. Augment the prompt and generate the answer
            print("Step 3: Generating response with reranked context...")
            context_for_llm = "\n\n---\n\n".join([doc.page_content for doc in reranked_docs])
            
            prompt_template = f"""
            Answer the following question based *only* on the provided context.
            If the answer is not in the context, say you don't know.
    
            Context:
            {context_for_llm}
    
            Question: {query_text}
            Answer:
            """
            
            response = self.llm.invoke(prompt_template)
            return response.content, reranked_docs
    
    # --- Run the full pipeline ---
    
    # Load docs
    loader = TextLoader("chronodb_docs.txt")
    docs = loader.load()
    
    # Create and run the pipeline
    full_pipeline = FullAdvancedRAGPipeline(docs, retrieval_k=10, rerank_top_n=3)
    final_answer, final_context = full_pipeline.query(query)
    
    print("\n--- Full Advanced RAG (Retrieval + Reranking) ---")
    print(f"Query: {query}")
    print(f"Final Answer: {final_answer}")
    
    print("\n--- Final Context Provided to LLM ---")
    for i, doc in enumerate(final_context):
        print(f"Context {i+1}:\n{doc.page_content}\n")

    This final pipeline embodies a robust, production-ready pattern. We retrieve a wide net of 10 candidates to maximize our chances of capturing the relevant information (high recall). Then, we use the computationally expensive but highly precise cross-encoder to distill these down to the 3 most relevant contexts (high precision). This ensures the LLM's limited context window is filled with signal, not noise.

    Section 3: Performance, Optimization, and Edge Cases

    Deploying this advanced pipeline requires careful consideration of performance and potential failure modes.

    Performance and Latency

    The reranking step is the primary source of additional latency. Let's consider the trade-offs:

    * retrieval_k (Initial Candidates): This is the most critical tuning parameter. A larger k increases the likelihood of finding the correct document but linearly increases the workload for the reranker. A typical starting point is between 20 and 50.

    * Model Choice: The size of the cross-encoder model directly impacts latency. ms-marco-MiniLM-L-6-v2 is small and fast. Larger models like ms-marco-DeBERTa-v3-large might offer higher accuracy at the cost of significantly more computation. Always benchmark; a timing sketch follows these bullets.

    * Hardware: Cross-encoders are transformer models and benefit immensely from GPUs. Running on a CPU will be substantially slower. For production systems, a GPU endpoint (e.g., on SageMaker, Vertex AI, or a dedicated server) is essential for acceptable latency.
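
    A rough way to quantify these trade-offs is to time the reranker for different candidate counts. The sketch below uses illustrative candidate sizes; set device="cuda" if a GPU is available:

    python
    import time
    from sentence_transformers import CrossEncoder

    # Time cross-encoder scoring for different candidate set sizes.
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512, device="cpu")

    sample_query = "How do I reduce P99 read latency in ChronoDB?"
    candidate_doc = "The Query Pipelining feature can reduce P99 latency by over 50%. " * 5

    for k in (10, 20, 50):
        pairs = [[sample_query, candidate_doc]] * k
        start = time.perf_counter()
        model.predict(pairs, show_progress_bar=False)
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"k={k}: {elapsed_ms:.1f} ms")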

    Optimization: Model Quantization and Compilation

    For CPU-based deployments or to further optimize GPU inference, consider these techniques:

  • ONNX (Open Neural Network Exchange): Convert the PyTorch model to the ONNX format. The ONNX Runtime is often more optimized for inference than native PyTorch.
  • Quantization: Reduce the precision of the model's weights (e.g., from FP32 to INT8). This can provide a 2-4x speedup with a minimal drop in accuracy. Tools like Hugging Face's optimum library can simplify this process; a sketch of both techniques follows this list.
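
    The export-and-quantize flow with optimum looks roughly like the following (the paths, quantization preset, and file layout here are illustrative assumptions; check the optimum documentation for your version):

    python
    # Sketch: export the cross-encoder to ONNX, then apply dynamic INT8 quantization.
    from transformers import AutoTokenizer
    from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
    from optimum.onnxruntime.configuration import AutoQuantizationConfig

    model_id = "cross-encoder/ms-marco-MiniLM-L-6-v2"

    # 1. Export the PyTorch checkpoint to ONNX
    ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
    ort_model.save_pretrained("reranker-onnx")

    # 2. Dynamic INT8 quantization of the exported model
    quantizer = ORTQuantizer.from_pretrained("reranker-onnx")
    qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
    quantizer.quantize(save_dir="reranker-onnx-int8", quantization_config=qconfig)

    # 3. Score a (query, document) pair with the exported ONNX model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    inputs = tokenizer("How do I reduce P99 latency?",
                       "Enable query pipelining in chronos.yaml.",
                       return_tensors="pt", truncation=True)
    print(ort_model(**inputs).logits)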

    Handling Edge Cases

    * Noisy Documents: The sentence tokenizer can be brittle. A document with many abbreviations, no periods, or unusual formatting can lead to poor sentence splits. Pre-processing the text to clean up formatting and normalize punctuation is a critical, often overlooked, step.
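
    A minimal normalization pass (pure Python; the specific rules are examples you would tune to your corpus) can go a long way before sentence tokenization:

    python
    import re

    def normalize_for_sentence_splitting(text: str) -> str:
        """Light-touch clean-up before sentence tokenization (tune to your corpus)."""
        text = text.replace("\r\n", "\n")                  # normalize line endings
        text = re.sub(r"-\n(\w)", r"\1", text)             # rejoin words hyphenated across line breaks
        text = re.sub(r"\n{2,}", "\n\n", text)             # collapse runs of blank lines
        text = re.sub(r"[ \t]{2,}", " ", text)             # collapse repeated spaces/tabs
        text = text.replace("\u201c", '"').replace("\u201d", '"')  # curly -> straight quotes
        return text.strip()

    print(normalize_for_sentence_splitting("ChronoDB   offers high write through-\nput.\n\n\nReads vary."))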

    * Short Documents: Our SentenceWindowRetriever's windowing logic must gracefully handle documents with fewer sentences than `2 * window_size + 1`. Our implementation using max(0, ...) and min(len(sentences), ...) already accounts for this by shrinking the window near document boundaries.

    * Contradictory Information: What if the top-3 reranked documents contain conflicting facts? This is a difficult open problem. A practical strategy is to adjust the final prompt to the LLM:

    text
        "Based on the following contexts, answer the question. If the contexts provide conflicting information, please highlight the contradiction in your answer."

    This empowers the LLM to act as a synthesizer rather than a simple extractor, which is often more useful for the end-user.

    * The 'Lost in the Middle' Problem: LLMs sometimes pay more attention to information at the beginning or end of their context window. When concatenating your final documents, consider whether the most relevant document (as determined by the reranker score) should always be placed first or last in the context string to maximize its impact.
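
    One lightweight mitigation is to reorder the reranked documents so the strongest candidates sit at the edges of the prompt. A small sketch (pure Python, assuming the documents arrive sorted best-first from the reranker):

    python
    def reorder_for_long_context(docs_best_first):
        """Place the strongest documents at the start and end of the context,
        pushing weaker ones toward the middle ('lost in the middle' mitigation)."""
        front, back = [], []
        for i, doc in enumerate(docs_best_first):
            # Alternate placement: rank 1 -> front, rank 2 -> back, rank 3 -> front, ...
            (front if i % 2 == 0 else back).append(doc)
        return front + back[::-1]

    # Example with ranked placeholders: the two best documents end up at the edges.
    print(reorder_for_long_context(["doc_rank1", "doc_rank2", "doc_rank3", "doc_rank4"]))
    # -> ['doc_rank1', 'doc_rank3', 'doc_rank4', 'doc_rank2']

    LangChain also ships a similar utility (LongContextReorder in langchain_community.document_transformers) if you prefer not to hand-roll this.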

    Conclusion: From Prototype to Production

    Moving from a naive RAG implementation to a sophisticated retrieval pipeline is a defining step in building AI systems that are not just impressive demos, but reliable, production-grade tools. By combining the high-recall, context-aware approach of Sentence-Window Retrieval with the high-precision filtering of Cross-Encoder Reranking, we directly attack the core weaknesses of basic RAG.

    This two-stage retrieval pattern provides a robust framework:

    * It respects semantic boundaries, preventing context fragmentation.

    * It focuses embeddings on granular concepts (sentences), avoiding context dilution.

    * It uses a computationally-aware hybrid model, leveraging the speed of bi-encoders for broad search and the accuracy of cross-encoders for final selection.

    While more complex, this architecture is a powerful investment. It drastically reduces hallucinations, improves the factual accuracy of generated responses, and ultimately builds user trust in your LLM-powered applications. The next time your RAG system gives a frustratingly incomplete answer, you'll know that the solution lies not just in a better LLM, but in a fundamentally better retriever.
