Advanced RAG: Sentence Window Retrieval & Cross-Encoder Re-ranking

20 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Production RAG Bottleneck: Context Fragmentation and Semantic Noise

If you've moved a Retrieval-Augmented Generation (RAG) system from a proof-of-concept to a production-candidate, you've inevitably encountered the limitations of naive chunking. The standard approach—splitting documents into fixed-size, often overlapping, chunks—is a serviceable starting point, but it consistently fails on complex documents where context is key.

The core problem is a fundamental trade-off:

  • Small Chunks (e.g., 128-256 tokens): These create highly specific, dense embeddings, which are great for vector search precision. However, they often sever critical context. A retrieved chunk might contain the exact answer to a query, but lack the preceding sentence that defines a key acronym or the following sentence that provides crucial nuance. The LLM, fed this decontextualized snippet, hallucinates or gives an incomplete answer.
  • Large Chunks (e.g., 1024+ tokens): These preserve more context, mitigating the fragmentation problem. However, they introduce significant noise. A large chunk might contain one highly relevant sentence buried among paragraphs of tangentially related information. This noise dilutes the signal, making it harder for the LLM to synthesize a precise answer and increasing the risk of it focusing on the wrong part of the text.

    Consider this snippet from a hypothetical quarterly financial report:

    "...The 'Phoenix Initiative' was a major driver of Q3 growth, exceeding initial projections by 15%. However, initial capital expenditures for the initiative were higher than anticipated. The primary reason for this overrun was unforeseen supply chain disruptions in our Southeast Asia manufacturing facilities. These disruptions led to a 5% increase in raw material costs. Consequently, the project's overall ROI is now projected to be 12% over a five-year period, down from the initial 14% estimate..."

    A query like, "Why was the Phoenix Initiative's ROI revised downwards?" is challenging for a naive RAG system.

  • A small chunk might only contain "consequently, the project's overall ROI is now projected to be 12% over a five-year period, down from the initial 14% estimate...". This identifies the revision but misses the why.
  • Another small chunk might retrieve "unforeseen supply chain disruptions in our Southeast Asia manufacturing facilities". This is part of the reason, but lacks the connection to the ROI.
  • A large chunk might retrieve the whole paragraph, but if the query was slightly different, it could also retrieve adjacent, irrelevant paragraphs about marketing spend or executive compensation, confusing the LLM.
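
    To make the baseline concrete, here is a minimal sketch of naive fixed-size chunking using LlamaIndex's SentenceSplitter (the chunk_size and chunk_overlap values are illustrative, not recommendations). Printing the chunks shows how the cause (the supply chain disruptions) and the effect (the ROI revision) can land in different chunks:

    python
    from llama_index.core import Document
    from llama_index.core.node_parser import SentenceSplitter

    report = Document(text=(
        "The 'Phoenix Initiative' was a major driver of Q3 growth, exceeding initial projections by 15%. "
        "However, initial capital expenditures for the initiative were higher than anticipated. "
        "The primary reason for this overrun was unforeseen supply chain disruptions in our "
        "Southeast Asia manufacturing facilities. These disruptions led to a 5% increase in raw material costs. "
        "Consequently, the project's overall ROI is now projected to be 12% over a five-year period, "
        "down from the initial 14% estimate."
    ))

    # Naive fixed-size chunking: each chunk is a blind slice of roughly chunk_size tokens.
    splitter = SentenceSplitter(chunk_size=64, chunk_overlap=8)
    chunks = splitter.get_nodes_from_documents([report])

    for i, chunk in enumerate(chunks):
        print(f"--- Chunk {i} ---\n{chunk.text}\n")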

    To overcome this, we need a more sophisticated, multi-stage retrieval architecture. This post details a powerful, two-stage pattern that we've found highly effective in production: Sentence Window Retrieval for context enrichment, followed by Cross-Encoder Re-ranking for precision enhancement.


    Part 1: Context Enrichment with Sentence Window Retrieval

    Sentence Window Retrieval directly addresses the context fragmentation problem. The core principle is simple but powerful: retrieve based on a single sentence, but provide the LLM with a window of sentences surrounding it.

    This approach combines the best of both worlds:

    * Indexing & Retrieval: Embeddings are generated for individual sentences. This makes the vector search highly precise, as each vector represents a very specific semantic unit.

    * Context Augmentation: Once the most relevant sentence is identified via vector search, we retrieve it *plus* the k sentences before and after it. This bundle of sentences is then passed to the LLM, ensuring the core retrieved fact is surrounded by its original context.
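
    Before reaching for a framework, the mechanism is easy to sketch in plain Python. The helper below is purely illustrative (it is not part of any library) and uses a crude regex sentence splitter:

    python
    import re

    def build_sentence_windows(text: str, window_size: int = 1):
        """Pair each sentence with a window of its neighbouring sentences.

        The sentence is what you would embed; the window is what you would
        hand to the LLM after retrieval.
        """
        # Crude sentence splitting; production systems should use a proper splitter.
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
        windows = []
        for i, sentence in enumerate(sentences):
            start = max(0, i - window_size)
            end = min(len(sentences), i + window_size + 1)
            windows.append({"sentence": sentence, "window": " ".join(sentences[start:end])})
        return windows

    sample = (
        "Acme's revenue grew 20% in Q3. The growth was driven by the APAC region. "
        "Margins, however, fell by two points."
    )
    for item in build_sentence_windows(sample, window_size=1):
        print(f"{item['sentence']!r} -> {item['window']!r}")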

    Implementation with LlamaIndex

    LlamaIndex provides an elegant, out-of-the-box solution for this pattern with its SentenceWindowNodeParser.

    Let's set up a working example. First, ensure you have the necessary libraries:

    bash
    pip install llama-index llama-index-embeddings-huggingface sentence-transformers

    Now, let's write a script that ingests a document, parses it using the sentence window strategy, and executes a query.

    python
    import os
    from llama_index.core import Document, VectorStoreIndex, Settings
    from llama_index.core.node_parser import SentenceWindowNodeParser
    from llama_index.core.postprocessor import MetadataReplacementPostProcessor
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    from llama_index.llms.openai import OpenAI # Replace with your preferred LLM
    
    # For this example, we'll use a mock LLM. In production, use a real one.
    # from llama_index.core.llms import MockLLM
    
    # --- 1. Configuration ---
    # Set up your OpenAI API Key if you are using it
    # os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
    
    # Use a local embedding model
    Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
    
    # In a real scenario, configure your LLM
    # For this example, we can use a mock to see the retrieval process without API calls
    # Settings.llm = MockLLM(max_tokens=256)
    # If using OpenAI:
    Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
    
    # --- 2. Document Preparation ---
    # Create a more complex document to demonstrate the pattern's strength
    document_text = ( 
        "This document details the project 'Odyssey'. Project Odyssey began in 2021. "
        "Its primary goal was to refactor the legacy monolithic backend. The project team consisted of 12 engineers. "
        "The budget for Odyssey was set at $2.5 million. Initial phases focused on infrastructure setup. "
        "A major challenge encountered was data migration from the old OracleDB. This migration was complex due to schema drift over 15 years. "
        "The team adopted a microservices architecture using Kubernetes. The chosen programming language was Go for its performance characteristics. "
        "Security was a top priority, with Vault used for secrets management. The 'Phoenix Initiative', a related sub-project, focused on the frontend rewrite. "
        "The Phoenix Initiative's ROI is now projected to be 12% over five years. This revision was due to unforeseen supply chain disruptions. "
        "These disruptions caused a 5% increase in hardware procurement costs. Final deployment of Odyssey is scheduled for Q4 2024. "
        "Post-launch, a dedicated SRE team will manage the new infrastructure. Key performance indicators will be latency and uptime."
    )
    document = Document(text=document_text)
    
    # --- 3. The Sentence Window Node Parser ---
    note = """
    Key Parameters:
    - sentence_splitter: The function to split text into sentences. Defaults to a regex-based splitter.
    - window_size: The number of sentences on each side of the central sentence to include in the window.
      A window_size of 1 means 1 sentence before and 1 sentence after.
    - window_metadata_key: The key in the node's metadata to store the windowed text.
    - original_sentence_metadata_key: The key to store the original sentence that was embedded.
    """
    
    node_parser = SentenceWindowNodeParser.from_defaults(
        window_size=3,  # The 'k' value
        window_metadata_key="window",
        original_text_metadata_key="original_sentence",
    )
    
    # --- 4. Indexing Pipeline ---
    # This will create nodes where the text is the single sentence, but the metadata contains the full window.
    nodes = node_parser.get_nodes_from_documents([document])
    
    # Let's inspect a node to understand the structure
    print(f"--- Inspecting a sample node ---")
    print(f"Original Sentence (for embedding): '{nodes[5].metadata['original_sentence']}'")
    print(f"Window (for LLM context): '{nodes[5].metadata['window']}'")
    print(f"Node text (what's embedded): '{nodes[5].text}'")
    print("-" * 30)
    
    # Build the vector index over the 'original_sentence' embeddings
    index = VectorStoreIndex(nodes)
    
    # --- 5. Querying Pipeline with Context Replacement ---
    # The MetadataReplacementPostProcessor is the magic ingredient.
    # It replaces the node's text (the single sentence) with the text from the metadata key ('window').
    # This happens *after* retrieval but *before* synthesis.
    query_engine = index.as_query_engine(
        similarity_top_k=2,
        node_postprocessors=[
            MetadataReplacementPostProcessor(target_metadata_key="window")
        ]
    )
    
    query = "What was the primary challenge related to the Odyssey project's database?"
    response = query_engine.query(query)
    
    print(f"--- Query ---
    {query}
    ")
    print(f"--- Response ---
    {response}
    ")
    
    # Let's inspect the source nodes to see what the LLM received
    print(f"--- Source Nodes for LLM ---")
    for node in response.source_nodes:
        print(f"Score: {node.score:.4f}")
        print(f"Content: {node.text}") # This will be the full window!
        print("-"*5)
    

    Analysis of the Implementation

  • SentenceWindowNodeParser: This is the core component. During indexing, it first splits the document into sentences. Then, for each sentence, it creates a Node object. The text property of the node (which gets embedded) is the single sentence. Crucially, it also creates a window in the metadata containing the sentence itself plus the 3 sentences before and after it.
  • MetadataReplacementPostProcessor: This is the critical link in the query chain. By default, after retrieving nodes from the vector store, the query engine would pass the node.text (the single sentence) to the LLM. This postprocessor intercepts the retrieved nodes and replaces their text attribute with the content of metadata['window']. The result is that the vector search is performed on precise single sentences, but the LLM receives the full, context-rich window.
  • Running the script above, you'll see the source node provided to the LLM isn't just "A major challenge encountered was data migration from the old OracleDB." Instead, it's a much richer context block:

    text
    Content: The budget for Odyssey was set at $2.5 million. Initial phases focused on infrastructure setup. A major challenge encountered was data migration from the old OracleDB. This migration was complex due to schema drift over 15 years. The team adopted a microservices architecture using Kubernetes. The chosen programming language was Go for its performance characteristics. Security was a top priority, with Vault used for secrets management.

    This provides the LLM with everything it needs to understand the challenge (data migration), the database type (OracleDB), and the reason for the complexity (schema drift).

    Edge Cases and Performance Considerations

    * Window Size (k): The choice of k is application-dependent. For dense technical manuals, a k of 1 or 2 might be sufficient. For narrative or legal documents where context builds over several paragraphs, a k of 3-5 might be better. A larger k increases the context size sent to the LLM, which can increase cost and latency, and potentially re-introduce noise if set too high. Always test with a representative evaluation set.

    * Document Boundaries: The parser correctly handles sentences at the beginning or end of a document, simply including fewer sentences on one side of the window.

    * Indexing Overhead: This method creates more Node objects than simple chunking (one per sentence vs. one per chunk). This increases the size of your vector index and can slightly increase indexing time. However, the query-time benefits usually outweigh this one-time cost.
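
    A quick way to sanity-check both the window_size choice and the indexing overhead is to parse a sample document at a few window sizes. The sketch below reuses a shortened version of the sample document from the ingestion script; note that the node count stays fixed at one node per sentence, while only the window text grows with k:

    python
    from llama_index.core import Document
    from llama_index.core.node_parser import SentenceWindowNodeParser

    # A shortened version of the sample document used earlier.
    document = Document(text=(
        "Project Odyssey began in 2021. Its primary goal was to refactor the legacy monolithic backend. "
        "The project team consisted of 12 engineers. The budget for Odyssey was set at $2.5 million. "
        "A major challenge encountered was data migration from the old OracleDB. "
        "Final deployment of Odyssey is scheduled for Q4 2024."
    ))

    for k in (1, 2, 3, 5):
        parser = SentenceWindowNodeParser.from_defaults(
            window_size=k,
            window_metadata_key="window",
            original_text_metadata_key="original_sentence",
        )
        nodes = parser.get_nodes_from_documents([document])
        avg_window_chars = sum(len(n.metadata["window"]) for n in nodes) / len(nodes)
        print(f"window_size={k}: {len(nodes)} nodes, avg window length ~{avg_window_chars:.0f} chars")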


    Part 2: Precision Enhancement with Cross-Encoder Re-ranking

    Sentence Window Retrieval dramatically improves the quality of the context we retrieve. However, the initial retrieval is still based on vector similarity (cosine distance or dot product) from a bi-encoder. Bi-encoders (like bge-small-en-v1.5 used above) are fast because they generate embeddings for the query and documents independently. The search is a nearest-neighbor search in a vector space.

    This is efficient but has a flaw: the model never sees the query and the document at the same time. It's a semantic search, but it can still return results that are topically related but not truly relevant to the user's specific question.

    This is where cross-encoders come in.

    A cross-encoder is a different type of Transformer model. Instead of creating separate embeddings, it takes both the query and a document as a single input ([CLS] query [SEP] document [SEP]) and outputs a single relevance score for the pair (depending on the model and its activation function, this is either a 0-1 probability or an unbounded logit). This process is much more computationally expensive because it requires a full model forward pass for every query-document pair. However, it is significantly more accurate, because the model can apply full attention across the query and the document simultaneously.
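
    To see this behaviour in isolation, here is a minimal example of scoring query-document pairs with a cross-encoder from the sentence-transformers library (the absolute score values will vary by model):

    python
    from sentence_transformers.cross_encoder import CrossEncoder

    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    query = "Why was the Phoenix Initiative's ROI revised downwards?"
    candidates = [
        "The project's overall ROI is now projected to be 12% over a five-year period.",
        "Unforeseen supply chain disruptions led to a 5% increase in raw material costs.",
        "The marketing budget for Q3 was reallocated to digital channels.",
    ]

    # Each pair is scored jointly: the model attends over the query and the candidate together.
    scores = model.predict([(query, doc) for doc in candidates])
    for doc, score in sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True):
        print(f"{score:8.4f}  {doc}")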

    We can't use a cross-encoder for initial retrieval from a corpus of millions of documents—it would be far too slow. The production pattern is to use them for re-ranking:

  • Retrieve: Use the fast bi-encoder and vector search to retrieve a larger-than-needed set of candidate documents (e.g., top_k=10).
  • Re-rank: Pass these 10 candidates and the query to a cross-encoder. The cross-encoder calculates a precise relevance score for each.
  • Prune: Sort the candidates by their new cross-encoder scores and take the new top_n (e.g., top_n=3) to pass to the LLM.

    Implementation with `sentence-transformers` and LlamaIndex

    Let's integrate a cross-encoder re-ranker into our previous pipeline.

    First, install the necessary library:

    bash
    pip install sentence-transformers

    We'll create a custom re-ranker class that conforms to LlamaIndex's BaseNodePostprocessor interface, using a model from the sentence-transformers library.

    python
    import os
    import torch
    from typing import List, Optional
    from llama_index.core.schema import NodeWithScore, QueryBundle
    from llama_index.core.postprocessor.types import BaseNodePostprocessor
    from llama_index.core.bridge.pydantic import PrivateAttr
    from sentence_transformers.cross_encoder import CrossEncoder
    
    # --- (Previous setup code from Part 1 remains the same) ---
    from llama_index.core import Document, VectorStoreIndex, Settings
    from llama_index.core.node_parser import SentenceWindowNodeParser
    from llama_index.core.postprocessor import MetadataReplacementPostProcessor
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    from llama_index.llms.openai import OpenAI
    
    Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
    Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
    
    document_text = ( 
        "This document details the project 'Odyssey'. Project Odyssey began in 2021. "
        "Its primary goal was to refactor the legacy monolithic backend. The project team consisted of 12 engineers. "
        "The budget for Odyssey was set at $2.5 million. Initial phases focused on infrastructure setup. "
        "A major challenge encountered was data migration from the old OracleDB. This migration was complex due to schema drift over 15 years. "
        "The team adopted a microservices architecture using Kubernetes. The chosen programming language was Go for its performance characteristics. "
        "Security was a top priority, with Vault used for secrets management. The 'Phoenix Initiative', a related sub-project, focused on the frontend rewrite. "
        "The Phoenix Initiative's ROI is now projected to be 12% over five years. This revision was due to unforeseen supply chain disruptions. "
        "These disruptions caused a 5% increase in hardware procurement costs. Final deployment of Odyssey is scheduled for Q4 2024. "
        "Post-launch, a dedicated SRE team will manage the new infrastructure. Key performance indicators will be latency and uptime."
    )
    document = Document(text=document_text)
    
    node_parser = SentenceWindowNodeParser.from_defaults(
        window_size=3,
        window_metadata_key="window",
        original_text_metadata_key="original_sentence",
    )
    nodes = node_parser.get_nodes_from_documents([document])
    index = VectorStoreIndex(nodes)
    
    # --- 1. Custom Cross-Encoder Re-ranker Class ---
    class CrossEncoderReRank(BaseNodePostprocessor):
        # BaseNodePostprocessor is a Pydantic model, so attributes that are not regular
        # fields must be declared as private attributes before they can be assigned.
        _model: CrossEncoder = PrivateAttr()
        _top_n: int = PrivateAttr()

        def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2", top_n: int = 3, device: str = "cpu"):
            super().__init__()
            self._model = CrossEncoder(model_name, device=device)
            self._top_n = top_n
    
        def _postprocess_nodes(
            self, nodes: List[NodeWithScore], query_bundle: Optional[QueryBundle] = None
        ) -> List[NodeWithScore]:
            if query_bundle is None:
                raise ValueError("Query bundle is required for re-ranking.")
            if not nodes:
                return []
    
            query_str = query_bundle.query_str
            node_texts = [n.get_content() for n in nodes]
    
            # Create pairs of [query, node_text] for the cross-encoder
            query_node_pairs = [[query_str, text] for text in node_texts]
    
            # Get scores from the cross-encoder model
            scores = self._model.predict(query_node_pairs)
    
            # Create a new list of nodes carrying the cross-encoder scores
            # (cast to float so the numpy scalar passes Pydantic validation)
            new_nodes = []
            for i, node in enumerate(nodes):
                new_node = NodeWithScore(node=node.node, score=float(scores[i]))
                new_nodes.append(new_node)
    
            # Sort nodes by the new scores in descending order
            new_nodes.sort(key=lambda x: x.score, reverse=True)
    
            # Return the top_n nodes
            return new_nodes[:self._top_n]
    
    # --- 2. Build the Production-Grade Query Engine ---
    
    # Instantiate the re-ranker
    # Use a GPU if available: device="cuda"
    reranker = CrossEncoderReRank(top_n=2)
    
    # In the query engine, we retrieve more documents initially (similarity_top_k=5)
    # then the re-ranker will prune them down to top_n=2.
    query_engine = index.as_query_engine(
        similarity_top_k=5, # Retrieve more to give the re-ranker more to work with
        node_postprocessors=[
            # First, expand the context with the window
            MetadataReplacementPostProcessor(target_metadata_key="window"),
            # Second, re-rank the expanded context nodes
            reranker
        ]
    )
    
    # --- 3. Execute and Analyze ---
    query = "What was the budget for the Odyssey project and what was its main technical challenge?"
    response = query_engine.query(query)
    
    print(f"--- Query ---
    {query}
    ")
    print(f"--- Response ---
    {response}
    ")
    
    print(f"--- Source Nodes for LLM (after re-ranking) ---")
    for node in response.source_nodes:
        # The score is now the cross-encoder's score, not the vector similarity
        print(f"Score: {node.score:.4f}") 
        print(f"Content: {node.text}")
        print("-"*5)
    

    Analysis of the Re-ranking Pipeline

  • Increased Candidate Pool (similarity_top_k=5): We deliberately fetch more documents from the vector store than we intend to show the LLM. This creates a pool of potentially relevant candidates for the more intelligent cross-encoder to analyze.
  • Order of Postprocessors: The order is critical. MetadataReplacementPostProcessor runs first to ensure the text content of each node is the full sentence window. The CrossEncoderReRank then runs on these expanded context windows, which is exactly what we want. It's re-ranking based on the actual text the LLM will see.
  • CrossEncoderReRank Logic: The custom class takes the retrieved nodes, pairs their content with the query string, and uses the cross-encoder model to generate a new, more accurate relevance score. It then re-sorts the nodes based on this new score and returns only the top N.
  • With a complex query like "What was the budget for the Odyssey project and what was its main technical challenge?", the bi-encoder might retrieve nodes about the budget and nodes about the technical challenge with similar vector scores. The cross-encoder is much better at identifying that the node containing "The budget for Odyssey was set at $2.5 million" and the node containing "A major challenge encountered was data migration" are both highly relevant to the composite query and will score them appropriately high.

    Performance and Latency Trade-offs

    This is the most critical consideration for using cross-encoders in production.

    * Latency: Re-ranking adds a non-trivial latency penalty. A forward pass through a cross-encoder model is orders of magnitude slower than a vector similarity calculation.

    | Model (on Hugging Face) | Size | Relative Speed (CPU) | Relative Accuracy (MS MARCO) |
    | --- | --- | --- | --- |
    | cross-encoder/ms-marco-MiniLM-L-6-v2 | ~90MB | Fast | Good |
    | cross-encoder/ms-marco-TinyBERT-L-2-v2 | ~58MB | Fastest | Decent |
    | cross-encoder/ms-marco-electra-base | ~440MB | Slow | Excellent |
    | BAAI/bge-reranker-large | ~1.3GB | Very Slow | State-of-the-Art |

    * Benchmarking is Non-Negotiable: Before deploying, you must benchmark. On a typical cloud CPU instance, re-ranking 10 candidates with MiniLM-L-6-v2 might add 100-200ms to your response time. On a T4 GPU, this could drop to 20-40ms. The larger models will be significantly slower.
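
    A rough benchmark is easy to write yourself; the sketch below times re-ranking a batch of 10 synthetic candidates on whatever hardware you run it on (the figures quoted above are indicative only):

    python
    import time
    from sentence_transformers.cross_encoder import CrossEncoder

    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device="cpu")  # or device="cuda"

    query = "What was the main technical challenge of the Odyssey project?"
    pairs = [(query, f"Candidate passage {i} about data migration and schema drift.") for i in range(10)]

    model.predict(pairs)  # warm-up pass so model loading is not counted

    runs = 20
    start = time.perf_counter()
    for _ in range(runs):
        model.predict(pairs)
    elapsed_ms = (time.perf_counter() - start) / runs * 1000
    print(f"Average re-ranking latency for 10 candidates: {elapsed_ms:.1f} ms")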

    * Hardware: For any user-facing application requiring low latency, running the cross-encoder on a GPU is practically a requirement.

    * Strategic Application: You don't need to use re-ranking for every RAG query. You can use it selectively for applications where precision is more important than raw speed, such as document analysis, legal Q&A, or complex financial reporting.

    * Token Limits: Cross-encoders have a maximum sequence length (typically 512 tokens). Since the input is query + document, a long query combined with a large sentence window can exceed this. The sentence-transformers library handles the overflow by truncating the input, which can silently discard part of the window, so ensure your window_size produces context blocks that comfortably fit within the limit.
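
    A quick way to verify that your windows fit is to run the query and window through the re-ranker's own tokenizer via Hugging Face transformers:

    python
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")

    query = "What was the budget for the Odyssey project and what was its main technical challenge?"
    window = (
        "The budget for Odyssey was set at $2.5 million. Initial phases focused on infrastructure setup. "
        "A major challenge encountered was data migration from the old OracleDB. "
        "This migration was complex due to schema drift over 15 years."
    )

    # The pair is encoded as a single sequence: [CLS] query [SEP] window [SEP]
    n_tokens = len(tokenizer(query, window)["input_ids"])
    print(f"Pair length: {n_tokens} tokens (this model's limit is 512)")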


    Part 3: The Complete Production-Grade Pipeline

    Let's combine everything into a final, unified architecture. This represents a robust, high-fidelity RAG system that addresses the core weaknesses of naive implementations.

    Architectural Diagram:

    Query → Bi-Encoder Embedding → Vector DB Search (Top K) → Retrieve K Nodes
    → [Stage 1: Context Enrichment] MetadataReplacementPostProcessor (Sentence Window)
    → [Stage 2: Precision Enhancement] CrossEncoderReRank (Score & Prune to Top N)
    → Construct Final Prompt → LLM → Response

    The consolidated code is a direct combination of the previous two scripts, demonstrating the chained post-processing pipeline.

    python
    # This script represents the final, combined pipeline.
    # It assumes all previous setup (imports, class definitions, etc.) is present.
    
    # 1. Instantiate the re-ranker
    reranker = CrossEncoderReRank(
        model_name="cross-encoder/ms-marco-MiniLM-L-6-v2", 
        top_n=2
    )
    
    # 2. Build the query engine with a chained post-processing pipeline
    query_engine = index.as_query_engine(
        similarity_top_k=10, # Retrieve a wide net of 10 candidates
        node_postprocessors=[
            # The order here is crucial
            MetadataReplacementPostProcessor(target_metadata_key="window"),
            reranker
        ]
    )
    
    # 3. Execute a complex query
    final_query = "Compare the financial aspects of the Odyssey project with the ROI of its sub-project, the Phoenix Initiative."
    response = query_engine.query(final_query)
    
    # 4. Print results and observe the high-quality source nodes
    print(f"--- Final Query ---
    {final_query}
    ")
    print(f"--- Final Response ---
    {response}
    ")
    
    print(f"--- Final Source Nodes for LLM (Enriched & Re-ranked) ---")
    for node in response.source_nodes:
        print(f"Re-ranked Score: {node.score:.4f}")
        print(f"Content: {node.text}")
        print("-"*10)

    This architecture is a significant leap forward. It systematically addresses the dual challenges of context and relevance, resulting in LLM prompts that are both information-rich and semantically precise. The final responses from the LLM will be demonstrably more accurate, comprehensive, and less prone to hallucination.

    Final Considerations: Evaluation and Cost

    * Quantitative Evaluation: Do not rely on anecdotal evidence. To justify the added complexity and latency of this pipeline, you must have a robust evaluation framework. Libraries like Ragas, TruLens, or DeepEval are essential. Create a golden set of question-answer pairs and measure metrics like faithfulness, answer_relevancy, and context_precision to prove the superiority of this advanced pipeline over a baseline.
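
    As one illustration, a minimal evaluation loop with Ragas might look like the sketch below. It assumes the Ragas 0.1-style API and column names; newer releases have changed both, so treat this as a starting point and check the current documentation:

    python
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import answer_relevancy, context_precision, faithfulness

    # A (tiny) golden set; in practice you want dozens to hundreds of examples.
    eval_dataset = Dataset.from_dict({
        "question": ["What was the main technical challenge of the Odyssey project?"],
        "answer": ["Data migration from the legacy OracleDB, complicated by 15 years of schema drift."],
        "contexts": [[
            "A major challenge encountered was data migration from the old OracleDB. "
            "This migration was complex due to schema drift over 15 years."
        ]],
        "ground_truth": ["Migrating data out of the old OracleDB despite 15 years of schema drift."],
    })

    result = evaluate(eval_dataset, metrics=[faithfulness, answer_relevancy, context_precision])
    print(result)  # compare these scores against a naive-chunking baseline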

    * Cost Analysis: This system has higher operational costs.

    * Compute: A dedicated GPU endpoint for the cross-encoder might be necessary for real-time applications, which adds to your cloud bill.

    * LLM Tokens: The sentence window approach sends larger contexts to the LLM, increasing token consumption per query. You must balance the improved quality against these increased API costs.
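
    As a back-of-the-envelope sketch of the token impact, the arithmetic below uses placeholder values; replace the average sentence length with a measurement from your own corpus and the price with your provider's current rate:

    python
    # Rough per-query context budget for the sentence-window pipeline.
    avg_tokens_per_sentence = 25        # placeholder: measure this on your corpus
    window_size = 3                     # sentences on each side of the hit
    top_n = 2                           # windows kept after re-ranking

    sentences_per_window = 2 * window_size + 1
    context_tokens = top_n * sentences_per_window * avg_tokens_per_sentence

    price_per_1k_input_tokens = 0.0005  # placeholder price, not a quoted rate
    print(f"~{context_tokens} context tokens per query, "
          f"~${context_tokens / 1000 * price_per_1k_input_tokens:.5f} at the placeholder rate")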

    Conclusion

    Moving from a basic RAG prototype to a production-ready system requires a shift in thinking from simple chunking to a multi-stage, precision-oriented retrieval pipeline. The naive approach is a leaky abstraction that breaks down under the weight of real-world document complexity.

    By combining Sentence Window Retrieval to solve for context fragmentation and Cross-Encoder Re-ranking to solve for semantic relevance, we can build RAG systems that are significantly more accurate and reliable. While this architecture introduces new considerations around latency and cost, the dramatic improvement in response quality makes it an essential pattern for any senior engineer tasked with building high-stakes LLM applications.
