Optimizing RAG: Sentence Window Retrieval & Cross-Encoder Reranking

20 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Precision Ceiling of Naive Chunk-Based RAG

For any engineer who has moved a Retrieval-Augmented Generation (RAG) system from a Jupyter notebook to a staging environment, the fragility of naive chunk-based retrieval becomes painfully apparent. The standard approach—splitting documents into fixed-size, often overlapping chunks, embedding them, and retrieving the top-k chunks based on vector similarity—is a fundamentally lossy process. It operates on a precarious trade-off: small chunks provide semantic precision but lack surrounding context, while large chunks provide context but introduce significant noise and suffer from the "lost in the middle" problem, where the LLM overlooks critical information buried within a large context block.

This approach fails consistently in complex, real-world scenarios. Consider a financial report where a CEO's comment on page 5 is only fully understood in the context of a data table on page 4 and a footnote on page 6. A naive chunking strategy will almost certainly sever these dependencies, leading to incomplete or factually incorrect responses from the LLM.

Our goal is to shatter this precision ceiling. We will architect a sophisticated, two-stage retrieval pipeline that addresses these fundamental flaws. This involves two synergistic techniques:

  • Sentence Window Retrieval: A method that decouples the unit of embedding (a single sentence) from the unit of retrieval (a larger window of sentences). This allows us to achieve the precision of sentence-level similarity search while providing the LLM with the necessary surrounding context for coherent reasoning.
  • Cross-Encoder Reranking: A secondary processing step that applies a more computationally intensive but far more accurate model to re-score the initial candidates retrieved. While our first stage (the retriever) focuses on recall, this second stage ruthlessly prioritizes precision, ensuring only the most relevant context reaches the LLM's prompt.

This article assumes you have a working knowledge of RAG architecture, vector databases, and embedding models. We will bypass introductory concepts and dive directly into the implementation and performance characteristics of these advanced patterns.

    The Failure Mode: A Concrete Example

    Let's establish a baseline by demonstrating the failure of a standard RAG pipeline. We'll use a text snippet where two related but separate sentences are required to answer a question. A fixed-size chunker is likely to place them in different chunks or drown the relevant sentence in irrelevant surrounding text.

    Scenario Setup:

    We'll use llama-index for this demonstration. First, ensure you have the necessary libraries installed:

    bash
    pip install llama-index llama-index-embeddings-huggingface sentence-transformers

    Now, let's define our sample document and the query that a naive RAG system will struggle with.

    python
    import os
    from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
    from llama_index.core.node_parser import SimpleNodeParser
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    from llama_index.llms.openai import OpenAI
    
    # For demonstration, we'll use LlamaIndex's built-in MockLLM, which echoes the
    # prompt back so we can inspect the exact context the LLM receives.
    # In a real scenario, you'd use a powerful model like GPT-4 or Llama 3.
    # os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
    from llama_index.core.llms import MockLLM
    
    # --- Document Setup ---
    text_doc = """
    The Falcon-9 rocket, developed by SpaceX, is a partially reusable launch vehicle. 
    Its first stage is capable of re-entering the atmosphere and landing vertically. 
    This capability significantly reduces the cost of access to space. 
    The Merlin engine, which powers the Falcon-9, runs on a combination of Rocket Propellant-1 (RP-1) and liquid oxygen. 
    The company's ultimate goal is to make humanity a multi-planetary species. 
    Elon Musk has stated that the development of the Starship system is the primary focus to achieve this. 
    Unlike the Falcon-9, the Starship is designed to be fully and rapidly reusable.
    """
    
    with open("documents/rocket_science.txt", "w") as f:
        f.write(text_doc)
    
    # --- Global Settings ---
    Settings.llm = MockLLM() # Using mock to inspect context
    Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
    Settings.chunk_size = 35 # Deliberately small to demonstrate the problem
    Settings.chunk_overlap = 10
    
    # --- Naive RAG Pipeline ---
    def run_naive_rag_pipeline(query):
        print("--- Running Naive RAG Pipeline ---")
        documents = SimpleDirectoryReader("documents").load_data()
        
        # Standard node parser with a fixed chunk size (using the global Settings values)
        node_parser = SimpleNodeParser.from_defaults(
            chunk_size=Settings.chunk_size, chunk_overlap=Settings.chunk_overlap
        )
        nodes = node_parser.get_nodes_from_documents(documents)
        
        # Print out the chunks to see how they were split
        print("Generated Chunks:")
        for i, node in enumerate(nodes):
            content = node.get_content().replace("\n", " ")
            print(f"Chunk {i}: {content}")
            
        index = VectorStoreIndex(nodes)
        query_engine = index.as_query_engine(similarity_top_k=2)
        response = query_engine.query(query)
        
        print("\nQuery:", query)
        print("\nRetrieved Context for LLM:")
        print("----------------------------")
        print(response)
        print("----------------------------\n")
    
    query = "Why is Starship considered a advancement over the Falcon-9?"
    run_naive_rag_pipeline(query)
    

    Expected Output and Analysis:

    With a chunk_size of 35, the document will be split in a way that separates the key pieces of information.

    text
    --- Running Naive RAG Pipeline ---
    Generated Chunks:
    Chunk 0: The Falcon-9 rocket, developed by SpaceX, is a partially reusable launch vehicle. Its first stage is capable of re-entering the atmosphere and landing vertically.
    Chunk 1: This capability significantly reduces the cost of access to space. The Merlin engine, which powers the Falcon-9, runs on a combination of Rocket Propellant-1 (RP-1) and liquid oxygen.
    Chunk 2: The company's ultimate goal is to make humanity a multi-planetary species. Elon Musk has stated that the development of the Starship system is the primary focus to achieve this.
    Chunk 3: Unlike the Falcon-9, the Starship is designed to be fully and rapidly reusable.
    
    Query: Why is Starship considered an advancement over the Falcon-9?
    
    Retrieved Context for LLM:
    ----------------------------
    Context information is below.
    ---------------------
    Chunk 3: Unlike the Falcon-9, the Starship is designed to be fully and rapidly reusable.
    Chunk 0: The Falcon-9 rocket, developed by SpaceX, is a partially reusable launch vehicle. Its first stage is capable of re-entering the atmosphere and landing vertically.
    ---------------------
    Given the context information and not prior knowledge, answer the query.
    Query: Why is Starship considered an advancement over the Falcon-9?
    Answer: 
    ----------------------------

    The retriever correctly identifies Chunk 3 and Chunk 0 as the most relevant. While this context does contain the answer, it forces the LLM to synthesize information from two completely separate, non-contiguous blocks of text. The crucial link—that Falcon-9 is partially reusable while Starship is fully reusable—is present but disjointed. In more complex documents, this disjointedness is a primary source of hallucinations and incomplete answers.

    Part 1: Precision Retrieval with Sentence Windowing

    Sentence Window Retrieval directly targets this context fragmentation problem. The core principle is elegant:

  • Index by Sentence: The document is parsed into individual sentences. Each sentence becomes a distinct Node and is embedded into the vector store. This makes the similarity search extremely precise, targeting the exact semantic unit that matches the query.
  • Retrieve by Window: Each sentence Node stores metadata pointing to a surrounding "window" of sentences from the original document. When a sentence is retrieved via similarity search, we don't pass the sentence itself to the LLM. Instead, we pass the larger window of context it belongs to.

    This gives us the best of both worlds: the search precision of small chunks (sentences) and the contextual richness of large chunks (windows).

    Implementation with `SentenceWindowNodeParser`

    LlamaIndex provides a built-in SentenceWindowNodeParser that makes this pattern straightforward to implement.

    python
    from llama_index.core.node_parser import SentenceWindowNodeParser
    
    # We'll reuse the same document and settings from before
    
    def run_sentence_window_rag_pipeline(query):
        print("--- Running Sentence Window RAG Pipeline ---")
        documents = SimpleDirectoryReader("documents").load_data()
    
        # Create the SentenceWindowNodeParser
        node_parser = SentenceWindowNodeParser.from_defaults(
            window_size=1,  # Sentences on each side of the central sentence; kept small for this short demo document
            window_metadata_key="window", # The key to store the window in metadata
            original_text_metadata_key="original_text", # The key to store the original sentence in metadata
        )
    
        nodes = node_parser.get_nodes_from_documents(documents)
        
        # In this setup, the base nodes are the individual sentences for embedding
        # The full window is stored in metadata
        print(f"Generated {len(nodes)} sentence nodes.")
        # Example of a single node
        print("\nExample Node (Sentence):")
        print(nodes[3].get_content().replace('\n', ' '))
        print("\nExample Node's Window Metadata:")
        print(nodes[3].metadata["window"].replace('\n', ' '))
        
        # We need a different postprocessor to replace the sentence with the window
        # before sending it to the LLM.
        from llama_index.core.postprocessor import MetadataReplacementPostProcessor
    
        index = VectorStoreIndex(nodes)
        query_engine = index.as_query_engine(
            similarity_top_k=2,
            # This postprocessor looks at the metadata and replaces the node content
            # with the content of the metadata key specified.
            node_postprocessors=[
                MetadataReplacementPostProcessor(target_metadata_key="window")
            ]
        )
        
        response = query_engine.query(query)
    
        print("\nQuery:", query)
        print("\nRetrieved Context for LLM:")
        print("----------------------------")
        print(response)
        print("----------------------------\n")
    
    # Let's run it with the same query
    run_sentence_window_rag_pipeline(query)
    

    Output and Analysis:

    text
    --- Running Sentence Window RAG Pipeline ---
    Generated 7 sentence nodes.
    
    Example Node (Sentence):
    The Merlin engine, which powers the Falcon-9, runs on a combination of Rocket Propellant-1 (RP-1) and liquid oxygen. 
    
    Example Node's Window Metadata:
    This capability significantly reduces the cost of access to space.  The Merlin engine, which powers the Falcon-9, runs on a combination of Rocket Propellant-1 (RP-1) and liquid oxygen.  The company's ultimate goal is to make humanity a multi-planetary species.
    
    Query: Why is Starship considered an advancement over the Falcon-9?
    
    Retrieved Context for LLM:
    ----------------------------
    Context information is below.
    ---------------------
    Node 1: Elon Musk has stated that the development of the Starship system is the primary focus to achieve this.  Unlike the Falcon-9, the Starship is designed to be fully and rapidly reusable. 
    Node 2: The Falcon-9 rocket, developed by SpaceX, is a partially reusable launch vehicle.  Its first stage is capable of re-entering the atmosphere and landing vertically.  This capability significantly reduces the cost of access to space. 
    ---------------------
    Given the context information and not prior knowledge, answer the query.
    Query: Why is Starship considered an advancement over the Falcon-9?
    Answer: 
    ----------------------------

    The difference is subtle but profound. The similarity search likely identified the sentence "Unlike the Falcon-9, the Starship is designed to be fully and rapidly reusable." as highly relevant. Instead of returning just that sentence, the MetadataReplacementPostProcessor swapped it for its associated window. The same happened for a sentence related to the Falcon-9's reusability.

    The resulting context provided to the LLM is now two coherent, contiguous blocks of text. The key concepts of "partially reusable" (for Falcon-9) and "fully reusable" (for Starship) are presented within their original, logical flow. The LLM's task has been simplified from synthesis to extraction, dramatically increasing the probability of a correct and well-supported answer.

    Tuning and Performance Considerations

  • window_size: This is the most critical parameter. A window_size of 3 means 3 sentences before, the central sentence, and 3 sentences after. The optimal size is domain-specific: for technical documents, a smaller window (1-2) may suffice, while narrative or legal text might need a larger window (3-5) to capture sufficient context. A quick sizing experiment is sketched after this list.
  • Sentence Splitting: The quality of your sentence splitter (SentenceSplitter in LlamaIndex) is paramount. Poorly split sentences will undermine the entire process. Invest time in configuring it for your specific document structure, especially with documents containing lists, tables, or code snippets.
  • Index Size: This approach increases the number of nodes in your index compared to chunking (one node per sentence vs. one node per chunk). For extremely large corpora, this can increase storage costs and potentially slow down the initial retrieval step, though the impact on query latency is often negligible for modern vector stores.
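
    One low-effort way to feel out window_size for a new corpus is to parse the same documents at several settings and compare how much context each setting pulls into the retrieved window. A minimal sketch, reusing the toy documents directory from above:

    python
    from llama_index.core import SimpleDirectoryReader
    from llama_index.core.node_parser import SentenceWindowNodeParser

    documents = SimpleDirectoryReader("documents").load_data()

    # Parse the same corpus at several window sizes and compare how much
    # surrounding context each setting attaches to every sentence node.
    for window_size in (1, 2, 3, 5):
        parser = SentenceWindowNodeParser.from_defaults(window_size=window_size)
        nodes = parser.get_nodes_from_documents(documents)
        avg_window_chars = sum(len(n.metadata["window"]) for n in nodes) / len(nodes)
        print(f"window_size={window_size}: {len(nodes)} nodes, "
              f"avg window length ~{avg_window_chars:.0f} chars")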

    Part 2: Surgical Precision with Cross-Encoder Reranking

    Sentence Window Retrieval significantly improves the quality of the context we retrieve. However, the retrieval itself is still governed by the raw cosine similarity of a bi-encoder embedding model. Bi-encoders are incredibly fast because they create embeddings for the query and documents independently. But this speed comes at the cost of a nuanced understanding of relevance.

    This is where cross-encoders come in. A cross-encoder does not produce an embedding. Instead, it takes a pair of texts—(query, document)—as a single input and outputs a single relevance score (for many models a raw logit, often normalized to 0-1 with a sigmoid). This allows the model to perform full self-attention across both the query and the document, capturing much more complex relationships and nuances.
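
    To make the distinction concrete, here is a minimal standalone sketch that scores two candidate passages against a query with the sentence-transformers CrossEncoder class, using the same ms-marco model we adopt for the reranker below; the passages are taken from our earlier toy document.

    python
    from sentence_transformers import CrossEncoder

    # Score (query, passage) pairs directly; the model sees both texts at once,
    # so it attends across them rather than comparing independent embeddings.
    cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    query = "Why is Starship considered an advancement over the Falcon-9?"
    candidates = [
        "Unlike the Falcon-9, the Starship is designed to be fully and rapidly reusable.",
        "The Merlin engine, which powers the Falcon-9, runs on RP-1 and liquid oxygen.",
    ]

    # predict() returns one relevance score per pair (raw logits for this model;
    # higher means more relevant).
    scores = cross_encoder.predict([(query, c) for c in candidates])
    for candidate, score in sorted(zip(candidates, scores), key=lambda pair: -pair[1]):
        print(f"{score:+.3f}  {candidate}")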

    The trade-off is speed. Running a cross-encoder on your entire corpus is computationally infeasible. Therefore, the production pattern is a two-stage process:

  • Stage 1: Retrieval (High Recall): Use a fast bi-encoder based retriever (like our Sentence Window retriever) to fetch a generous number of candidate nodes (e.g., top_k=10 or top_k=20). The goal here is to ensure the correct answer is somewhere in this initial set.
  • Stage 2: Reranking (High Precision): Take these 10-20 candidate nodes and pass them through a cross-encoder. The cross-encoder scores each (query, node_text) pair. We then sort the nodes by this new, more accurate score and take the final top_n (e.g., top_n=3) to pass to the LLM.

    Implementation with `SentenceTransformerRerank`

    We can integrate a cross-encoder from the sentence-transformers library directly into our LlamaIndex pipeline as a node_postprocessor.

    python
    from llama_index.core.postprocessor import (
        MetadataReplacementPostProcessor,
        SentenceTransformerRerank,
    )
    
    # Let's define a more complex document to better showcase the reranker's power
    complex_text_doc = """
    Project Titan: Q3 Financial Report
    
    Section 1: Overview
    The project's overall budget for the fiscal year is $5M. In Q3, we expended $1.2M, which is slightly over the projected $1.1M. The primary cause for this overage was an unforeseen increase in hardware procurement costs. The project remains on track for its EOY delivery deadline.
    
    Section 2: Team Performance
    Lead Engineer, Dr. Aris Thorne, reported that the software development milestone for the 'Phoenix' module was completed ahead of schedule. However, the 'Odyssey' module is facing minor delays due to integration challenges. The team's morale is high, and collaboration between the software and hardware divisions has been exemplary. The Starship system, while not directly related to Project Titan, serves as an inspiration for our reusable component philosophy.
    
    Section 3: Risk Assessment
    A key dependency is the delivery of custom ASICs from our vendor, ChipCorp. Any delays from their side could impact our Q4 timeline. The current projection from ChipCorp is a delivery date of Nov 15th. The Falcon-9's reliability is a testament to what can be achieved with iterative design, a principle we apply daily.
    """
    
    with open("documents/complex_report.txt", "w") as f:
        f.write(complex_text_doc)
    
    def run_full_advanced_rag_pipeline(query, similarity_top_k=10, rerank_top_n=3):
        print("--- Running Full Advanced RAG (Sentence Window + Reranker) ---")
        documents = SimpleDirectoryReader("documents", input_files=["complex_report.txt"]).load_data()
    
        # 1. Use the SentenceWindowNodeParser from before
        node_parser = SentenceWindowNodeParser.from_defaults(
            window_size=2
        )
        nodes = node_parser.get_nodes_from_documents(documents)
        index = VectorStoreIndex(nodes)
    
        # 2. Define the Reranker
        # Models are from huggingface.co/cross-encoder
        # ms-marco-MiniLM-L-6-v2 is a very fast and decent model.
        # bge-reranker-large is more powerful but slower.
        reranker = SentenceTransformerRerank(
            model="cross-encoder/ms-marco-MiniLM-L-6-v2", 
            top_n=rerank_top_n # The final number of nodes to return
        )
    
        # 3. Build the query engine with both Sentence Window retrieval and the reranker
        query_engine = index.as_query_engine(
            similarity_top_k=similarity_top_k, # Fetch more candidates for the reranker
            node_postprocessors=[
                MetadataReplacementPostProcessor(target_metadata_key="window"),
                reranker
            ]
        )
    
        response = query_engine.query(query)
    
        print("\nQuery:", query)
        print("\nFinal Context sent to LLM after Reranking:")
        print("---------------------------------------------")
        print(response)
        print("---------------------------------------------\n")
    
    # A query that could be easily confused by keyword matching
    complex_query = "What were the main project risks and inspirations mentioned?"
    run_full_advanced_rag_pipeline(complex_query)

    Analysis of the Pipeline Flow and Expected Output:

  • Query: "What were the main project risks and inspirations mentioned?"
  • Retrieval (similarity_top_k=10): The bi-encoder will fetch 10 sentences. Due to keyword overlap ("Starship", "Falcon-9"), it's highly likely to retrieve sentences from Sections 2 and 3, including potentially irrelevant ones like "The Starship system...serves as an inspiration..." and "The Falcon-9's reliability is a testament..." alongside the actual risk sentence about ChipCorp.
  • Post-processing (Window Replacement): The 10 retrieved sentences are replaced by their context windows.
  • Reranking (rerank_top_n=3): The cross-encoder now gets 10 (query, window_text) pairs. It will analyze them with much deeper semantic understanding. It will recognize that the query asks about project risks and inspirations. It will score the ChipCorp dependency window very high and the Falcon-9/Starship inspiration windows high. It will likely score other retrieved windows that mention financial overages (a different kind of risk, but maybe less of a 'main' one) or team performance lower, as they are less directly related to the query's dual focus.
  • Final Context: The LLM receives only the top 3 most relevant, contiguous blocks of text as determined by the powerful cross-encoder.

    Expected Output:

    text
    --- Running Full Advanced RAG (Sentence Window + Reranker) ---
    
    Query: What were the main project risks and inspirations mentioned?
    
    Final Context sent to LLM after Reranking:
    ---------------------------------------------
    Context information is below.
    ---------------------
    Node 1: A key dependency is the delivery of custom ASICs from our vendor, ChipCorp. Any delays from their side could impact our Q4 timeline. The current projection from ChipCorp is a delivery date of Nov 15th.
    Node 2: The Starship system, while not directly related to Project Titan, serves as an inspiration for our reusable component philosophy. Section 3: Risk Assessment A key dependency is the delivery of custom ASICs from our vendor, ChipCorp.
    Node 3: Any delays from their side could impact our Q4 timeline. The current projection from ChipCorp is a delivery date of Nov 15th. The Falcon-9's reliability is a testament to what can be achieved with iterative design, a principle we apply daily.
    ---------------------
    Given the context information and not prior knowledge, answer the query.
    Query: What were the main project risks and inspirations mentioned?
    Answer: 
    ---------------------------------------------

    The reranker has successfully identified the three most critical pieces of context, even though they were scattered across the document, and filtered out less relevant initial candidates. The context is clean, dense, and directly answers both parts of the user's query.

    Part 3: Production Patterns, Performance, and Edge Cases

    Deploying this advanced pipeline requires careful consideration of performance, cost, and scalability.

    Performance Benchmarking and Latency

    The cross-encoder is the primary performance bottleneck. Its impact is directly proportional to similarity_top_k (the number of candidates it must score). It's crucial to benchmark latency.
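
    A minimal timing harness is often enough to quantify the reranker's cost. The sketch below assumes the query engines built in the previous sections are in scope; the variable names in the usage comments are illustrative.

    python
    import time

    def avg_latency_ms(query_engine, query, runs=5):
        """Average end-to-end latency of a query engine, in milliseconds."""
        query_engine.query(query)  # warm-up so model loading doesn't skew the numbers
        start = time.perf_counter()
        for _ in range(runs):
            query_engine.query(query)
        return (time.perf_counter() - start) / runs * 1000

    # Illustrative usage; `window_only_engine` and `window_plus_rerank_engine`
    # stand in for the two query engines built earlier.
    # print(f"Sentence window only: {avg_latency_ms(window_only_engine, complex_query):.0f} ms")
    # print(f"With reranker:        {avg_latency_ms(window_plus_rerank_engine, complex_query):.0f} ms")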

    Hypothetical Latency Benchmark:

    Pipeline Configuration                            | Avg. Latency (ms) | Accuracy (RAGAS score)
    Naive RAG (top_k=3)                               | 150               | 0.78
    Sentence Window (top_k=3)                         | 175               | 0.85
    Sentence Window + Reranker (k=20, n=3)            | 850               | 0.94
    Sentence Window + Distilled Reranker (k=20, n=3)  | 450               | 0.91

    Mitigation Strategies:

  • Model Choice: The choice of cross-encoder model is critical. A model like ms-marco-MiniLM-L-6-v2 is significantly faster than a larger one like bge-reranker-large. Always start with a smaller, faster model and only upgrade if accuracy metrics demand it.
  • Hardware: Cross-encoders benefit immensely from GPUs. A small T4 GPU can reduce reranking latency by an order of magnitude compared to a CPU. For production systems, hosting the reranker model on a dedicated GPU-enabled microservice (e.g., using NVIDIA Triton Inference Server) is a common pattern.
  • Asynchronous Reranking: For non-interactive applications (e.g., document summarization, report generation), the reranking step can be performed asynchronously. The initial, faster retrieval can provide a preliminary result while the reranked, high-fidelity result is computed in the background.
  • Intelligent Caching: Cache the reranked results. If the same query is seen again, you can serve the highly-ranked context directly from the cache, bypassing the entire pipeline. A minimal sketch follows this list.
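
    As a minimal illustration of the caching idea, the sketch below keys an in-memory dictionary on the normalized query string; a production system would more likely use an external cache (e.g., Redis) and include the index or corpus version in the key.

    python
    # Simplest possible cache: keyed on the normalized query text.
    reranked_cache = {}

    def cached_query(query_engine, query):
        """Serve repeated queries from an in-memory cache of reranked results."""
        key = query.strip().lower()
        if key not in reranked_cache:
            reranked_cache[key] = query_engine.query(query)
        return reranked_cache[key]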

    Cost-Benefit Analysis

    Is the complexity and compute cost worth it? The answer depends entirely on the application's tolerance for incorrectness.

  • High-Value, Low-Tolerance (e.g., LegalTech, MedTech): For applications where a wrong answer has significant consequences, the accuracy gains from a reranker are not just beneficial, they are mandatory. The cost of a GPU instance is negligible compared to the cost of an error.
  • High-Volume, High-Tolerance (e.g., General Chatbot): For a free-to-use chatbot answering general questions, the added latency and cost of a cross-encoder may not be justifiable. A well-tuned Sentence Window retriever might be sufficient.

    Edge Case: Handling Heterogeneous Documents

    This pipeline excels with prose-heavy documents. It can struggle with semi-structured or multi-modal content.

  • Tables: Pre-process tables by converting them into structured text. For example, convert each row into a descriptive sentence: "In 2023, revenue was $5M with a profit of $500k." This makes the table content accessible to sentence-based processing (a small conversion sketch follows this list).
  • Code: Use a specialized code-aware splitter that respects syntax and logical blocks rather than just splitting by lines or punctuation.
  • PDFs: The quality of PDF text extraction is a common failure point. Invest in robust OCR and layout-aware parsing tools (e.g., unstructured.io) before the text ever reaches the node parser. Garbage in, garbage out applies doubly so here.
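
    As a small illustration of the table pre-processing suggestion above, the sketch below converts rows (represented as dicts with illustrative column names) into the kind of self-contained sentences the sentence-level parser can handle:

    python
    def table_rows_to_sentences(rows):
        """Convert table rows (dicts with illustrative column names) into
        self-contained sentences suitable for sentence-level indexing."""
        return [
            f"In {row['year']}, revenue was {row['revenue']} "
            f"with a profit of {row['profit']}."
            for row in rows
        ]

    rows = [{"year": 2023, "revenue": "$5M", "profit": "$500k"}]
    print(table_rows_to_sentences(rows))
    # ['In 2023, revenue was $5M with a profit of $500k.']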

    Conclusion: From Heuristics to Precision Engineering

    Moving from naive RAG to a two-stage retrieve-and-rerank architecture is a significant step in maturing a system from a proof-of-concept to a production-ready tool. By combining the contextual awareness of Sentence Window Retrieval with the semantic precision of Cross-Encoder Reranking, we replace brittle, heuristic-based chunking with a more robust, accurate, and tunable pipeline.

    This approach is not a silver bullet, but it provides the engineering levers necessary to systematically address the most common failure modes of RAG systems. It allows you to make deliberate, measurable trade-offs between latency, cost, and accuracy, which is the hallmark of advanced system design. The next time your RAG system returns a non-sequitur, you'll know that the solution isn't just to tweak the chunk size, but to re-architect the very nature of how your system defines and pursues relevance.
