Optimizing RAG: Sentence-Window Retrieval & Cohere Re-ranking


The Plateau of Naive RAG: Why Your Vector Search Fails

If you've deployed a Retrieval-Augmented Generation (RAG) system into production, you've likely encountered the frustrating plateau of 'good enough' but not 'great'. The system works for simple queries, but complex questions that require nuanced understanding across sentence or paragraph boundaries often yield hallucinated or incomplete answers. The root cause is almost always a failure in the retrieval step, not the generation step. The LLM is powerful, but it's operating on a garbage-in, garbage-out basis.

The most common culprit is naive chunking. Splitting documents by a fixed character count (RecursiveCharacterTextSplitter with chunk_size=1024, for example) is a blunt instrument. It's fast and simple, but it systematically destroys the contextual integrity of the source material. A critical sentence might be split from its preceding explanatory sentence, rendering its embedding less meaningful and making it less likely to be retrieved. When it is retrieved, it lacks the surrounding context the LLM needs for accurate synthesis.

Consider this text snippet:

"The primary controller communicates with the replica sets via a dedicated gRPC channel. This channel is secured using mTLS with certificates rotated every 24 hours. Therefore, any authentication failure will trigger a 'ReplicaAuthError' and immediately halt the synchronization process."

A naive 100-character chunker might split this into:

  • "The primary controller communicates with the replica sets via a dedicated gRPC channel. This channel i"
  • "s secured using mTLS with certificates rotated every 24 hours. Therefore, any authentication failu"
  • "re will trigger a 'ReplicaAuthError' and immediately halt the synchronization process."
A query like "What causes a ReplicaAuthError?" might only match the third chunk based on semantic similarity. The LLM receives this fragment, devoid of the crucial context about mTLS and certificate rotation, and can only provide a superficial answer. This is the core problem we must solve to elevate RAG performance.
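
To make the failure mode concrete, here is a minimal, dependency-free sketch that mimics a fixed-size character splitter on the snippet above. Real splitters add overlap and prefer separator boundaries, so your exact chunk boundaries will differ from the fragments shown, but the failure mode is the same.

```python
# Minimal sketch of fixed-size character chunking (no overlap), for illustration only.
# Real splitters add overlap and prefer separator boundaries, but the failure mode
# is the same: sentences get cut mid-thought and lose their context.
text = (
    "The primary controller communicates with the replica sets via a dedicated gRPC channel. "
    "This channel is secured using mTLS with certificates rotated every 24 hours. "
    "Therefore, any authentication failure will trigger a 'ReplicaAuthError' and "
    "immediately halt the synchronization process."
)

chunk_size = 100
chunks = [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {chunk!r}")
# The chunk containing 'ReplicaAuthError' never mentions mTLS or certificate rotation,
# so an LLM that only sees it cannot explain the cause.
```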

    This article presents a production-tested, two-part strategy to overcome this limitation:

  • Sentence-Window Retrieval: We will index small, precise units (sentences) to ensure high-quality embedding matches, but retrieve a larger "window" of surrounding sentences to provide the LLM with the necessary context.
  • Post-Retrieval Re-ranking: We will intentionally retrieve more documents than needed (top_n) and then use a highly accurate cross-encoder model (like Cohere's Re-ranker) to re-evaluate and select the most relevant documents (top_k) before passing them to the LLM.
We will use LlamaIndex for this implementation, as its modular architecture is well-suited for these advanced pipeline customizations.


    Part 1: Sentence-Window Retrieval - Recapturing Lost Context

    The fundamental idea behind Sentence-Window Retrieval is to decouple the unit of embedding from the unit of retrieval. We want the precision of sentence-level embeddings for similarity search, but the contextual richness of paragraph-level chunks for LLM synthesis.

    How It Works Under the Hood

    The SentenceWindowNodeParser in LlamaIndex orchestrates this process during indexing:

  • Sentence Splitting: The document is first split into individual sentences using a configurable sentence splitter.
  • Node Creation: For each sentence, a Node object is created. The text of this node is the single sentence.
  • Window Metadata: The magic happens here. For each sentence Node, the parser collects the window_size sentences before and the window_size sentences after it. This entire block of text (the "window") is stored in the metadata of the Node.
  • Embedding: The embedding is generated only from the single sentence in the Node.text field.
  • Indexing: The sentence embedding is stored in the vector database, linked to the Node which contains both the single sentence and the larger window in its metadata.

When a query is executed:

  • The query is embedded.
  • A similarity search is performed against the sentence embeddings in the vector store.
  • The top matching Nodes are returned.
  • The RAG pipeline is configured to extract the window text from the metadata of these nodes, not the single-sentence text.
  • This expanded context is then passed to the LLM.
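
Before wiring up the full pipeline (installation steps follow in the next section), you can sanity-check this behavior by running the parser on its own. This is a minimal sketch using a toy three-sentence document and the same default metadata keys as the indexing code below.

```python
from llama_index.core import Document
from llama_index.core.node_parser import SentenceWindowNodeParser

# Toy document with three sentences; window_size=1 keeps the output easy to read.
parser = SentenceWindowNodeParser.from_defaults(
    window_size=1,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

doc = Document(
    text=(
        "The controller talks to replicas over gRPC. "
        "The channel is secured with mTLS. "
        "Authentication failures raise a ReplicaAuthError."
    )
)
nodes = parser.get_nodes_from_documents([doc])

middle = nodes[1]
print(middle.text)                # only the single middle sentence is embedded
print(middle.metadata["window"])  # the sentence plus one neighbor on each side
```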

    Detailed Implementation

    Let's build this. First, ensure you have the necessary libraries installed.

```bash
pip install llama-index llama-index-llms-openai llama-index-embeddings-openai llama-index-postprocessor-cohere-rerank cohere
```

    We'll set up our environment and load a sample document. For this example, we'll use a local file policy_document.md containing complex, interconnected information.

    policy_document.md

```markdown
    # Internal Data Handling Policy
    
    ## Section 1: Data Classification
    
    All internal data is classified into three tiers: Public, Confidential, and Restricted. Public data requires no special handling. Confidential data must be encrypted at rest using AES-256. The keys for this encryption are managed by the central KMS.
    
    Restricted data, which includes PII and financial records, requires an additional layer of security. It must be stored in dedicated hardware security modules (HSMs). Access to Restricted data is logged to a write-only audit trail which is reviewed quarterly by the compliance team. The 'ComplianceOverwatch' system is responsible for this review process.
    
    ## Section 2: Access Control
    
    Access to Confidential data is granted based on role-based access control (RBAC) policies defined in our identity provider. Any request for temporary elevated access must be approved by the data steward. This approval is logged as a 'PrivilegeEscalationEvent'.
    
    For Restricted data, access requires multi-factor authentication and is limited to specific IP ranges. Any access attempt from an unauthorized IP will trigger a 'HighRiskAuthAlert' and lock the account. This is a non-negotiable security posture.
```

    Now for the Python implementation. We'll configure the SentenceWindowNodeParser and build our index.

```python
    import os
    import cohere
    from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
    from llama_index.core.node_parser import SentenceWindowNodeParser
    from llama_index.core.postprocessor import MetadataReplacementPostProcessor
    from llama_index.llms.openai import OpenAI
    from llama_index.embeddings.openai import OpenAIEmbedding
    
    # --- Configuration ---
    os.environ["OPENAI_API_KEY"] = "sk-..."
    os.environ["COHERE_API_KEY"] = "..."
    
    # Configure global settings for consistent behavior
    Settings.llm = OpenAI(model="gpt-4-turbo-preview")
    Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")
    
    # --- Load Data ---
    documents = SimpleDirectoryReader(input_files=["policy_document.md"]).load_data()
    
    # --- Build the Sentence-Window Index ---
    def build_sentence_window_index(documents, window_size=3):
        """Builds an index using the SentenceWindowNodeParser."""
        print(f"\nBuilding index with window size: {window_size}\n")
        
        # Create the node parser
        node_parser = SentenceWindowNodeParser.from_defaults(
            window_size=window_size,
            window_metadata_key="window",
            original_text_metadata_key="original_text",
        )
        nodes = node_parser.get_nodes_from_documents(documents)
    
        # Build the vector index
        sentence_index = VectorStoreIndex(nodes)
        return sentence_index
    
    sentence_index = build_sentence_window_index(documents, window_size=3)
    
    # --- Setup the Query Engine ---
    def get_sentence_window_query_engine(sentence_index, similarity_top_k=6):
        """Builds a query engine that retrieves windows of text."""
        
        # Postprocessor to replace the sentence with the full window
        postproc = MetadataReplacementPostProcessor(target_metadata_key="window")
    
        # The query engine
        query_engine = sentence_index.as_query_engine(
            similarity_top_k=similarity_top_k,
            node_postprocessors=[postproc],
        )
        return query_engine
    
    query_engine = get_sentence_window_query_engine(sentence_index)
    
    # --- Query and Observe ---
    query = "What triggers a HighRiskAuthAlert and what is the consequence?"
    response = query_engine.query(query)
    
    print("--- Query ---")
    print(f"{query}\n")
    print("--- Response ---")
    print(f"{response}\n")
    
    print("--- Source Nodes ---")
    for node in response.source_nodes:
        print(f"Score: {node.score:.4f}")
        print(f"Original Sentence: {node.metadata['original_text']}")
        print(f"Window: {node.metadata['window']}")
        print("-" * 20)
    

    When you run this, observe the output for the source nodes. The Original Sentence will be a single, highly relevant sentence like "Any access attempt from an unauthorized IP will trigger a 'HighRiskAuthAlert' and lock the account." However, the Window metadata will contain that sentence plus the three sentences before and after it, providing the LLM with the crucial context about Restricted data and MFA. The MetadataReplacementPostProcessor ensures this window is what the LLM actually sees.
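
If you want to watch the swap happen outside of a query engine, you can apply the postprocessor directly to retrieved nodes. A small sketch, reusing the sentence_index built above:

```python
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# Retrieve raw sentence nodes, then swap each node's content for its window.
retriever = sentence_index.as_retriever(similarity_top_k=2)
retrieved = retriever.retrieve("What triggers a HighRiskAuthAlert?")

postproc = MetadataReplacementPostProcessor(target_metadata_key="window")
window_nodes = postproc.postprocess_nodes(retrieved)

for node_with_score in window_nodes:
    print("matched sentence :", node_with_score.node.metadata["original_text"])
    print("sent to the LLM  :", node_with_score.node.get_content())
    print("-" * 20)
```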

    This single change dramatically improves contextuality. However, it can also introduce noise. What if some sentences in the window are irrelevant? This leads us to the second part of our strategy: re-ranking.


    Part 2: Post-Retrieval Re-ranking - Surgical Relevance

    Vector similarity search is powerful but imperfect. It's a measure of semantic closeness, not necessarily contextual relevance to a specific query. A query might be semantically close to several retrieved chunks, but only a subset of them are truly essential for forming a correct answer. This is where re-rankers excel.

    Bi-Encoders vs. Cross-Encoders: A Critical Distinction

    * Bi-Encoders (like our text-embedding-3-large model) create embeddings for the query and documents independently. The system then calculates a cheap distance metric (like cosine similarity) between them. This is fast and scalable, making it ideal for the initial retrieval from a massive corpus.

* Cross-Encoders (like Cohere's Re-rank model) work differently. They take the query and a single document together as input and output a relevance score. This allows the model to perform a much deeper, token-by-token analysis of the relationship between the query and the document. This process is far more computationally expensive and thus unsuitable for initial retrieval, but it is vastly more accurate for ranking a small set of candidate documents.
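
To make the distinction concrete, the sketch below scores the same three candidate passages both ways: once with cosine similarity over independently computed embeddings (bi-encoder), and once by calling Cohere's rerank endpoint directly (cross-encoder). The rerank model name is an assumption; substitute whichever rerank model your account exposes.

```python
import os

import cohere
import numpy as np
from llama_index.embeddings.openai import OpenAIEmbedding

query = "What causes a ReplicaAuthError?"
docs = [
    "Any authentication failure will trigger a 'ReplicaAuthError' and halt synchronization.",
    "This channel is secured using mTLS with certificates rotated every 24 hours.",
    "Public data requires no special handling.",
]

# Bi-encoder: embed the query and each document independently, then compare via cosine similarity.
embed_model = OpenAIEmbedding(model="text-embedding-3-large")
q_vec = np.array(embed_model.get_query_embedding(query))
for doc in docs:
    d_vec = np.array(embed_model.get_text_embedding(doc))
    cosine = q_vec @ d_vec / (np.linalg.norm(q_vec) * np.linalg.norm(d_vec))
    print(f"bi-encoder    {cosine:.3f}  {doc[:45]}...")

# Cross-encoder: the query and each document are scored together by the rerank model.
# The model name is an assumption; use whichever rerank model your Cohere account exposes.
co = cohere.Client(os.environ["COHERE_API_KEY"])
reranked = co.rerank(model="rerank-english-v3.0", query=query, documents=docs, top_n=len(docs))
for result in reranked.results:
    print(f"cross-encoder {result.relevance_score:.3f}  {docs[result.index][:45]}...")
```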

    The strategy is to combine the strengths of both:

  • Use the fast bi-encoder and vector search to retrieve a larger-than-needed set of candidate documents (e.g., similarity_top_k=10).
  • Pass these 10 documents and the original query to a cross-encoder re-ranker.
  • The re-ranker returns a new, more accurate relevance score for each document.
  • Keep only the top few documents from the re-ranked list (e.g., rerank_top_n=3) and pass them to the LLM.

Implementation with Cohere Re-rank

Cohere provides a high-performance re-ranking endpoint that is easy to integrate. LlamaIndex ships a CohereRerank node postprocessor in the llama-index-postprocessor-cohere-rerank integration package (installed above).

    Let's modify our query engine setup to include the re-ranker. We will build on the sentence-window index we created earlier.

```python
    import os
    import cohere
    from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
    from llama_index.core.node_parser import SentenceWindowNodeParser
    from llama_index.core.postprocessor import MetadataReplacementPostProcessor
    from llama_index.postprocessor.cohere_rerank import CohereRerank
    from llama_index.llms.openai import OpenAI
    from llama_index.embeddings.openai import OpenAIEmbedding
    
    # --- Assume previous setup code is here (Settings, document loading, index building) ---
    # ...
    # sentence_index = build_sentence_window_index(documents, window_size=3)
    
    # --- Build a Full Production-Grade Query Engine ---
    def get_advanced_query_engine(sentence_index, similarity_top_k=10, rerank_top_n=3):
        """Builds an advanced query engine with sentence-window replacement and re-ranking."""
        
        # Postprocessor to replace the sentence with the full window
        replace_proc = MetadataReplacementPostProcessor(target_metadata_key="window")
    
        # Cohere re-ranker
        cohere_rerank = CohereRerank(
            api_key=os.environ["COHERE_API_KEY"],
            top_n=rerank_top_n  # number of context windows kept after re-ranking
        )
    
        # The query engine
        query_engine = sentence_index.as_query_engine(
            similarity_top_k=similarity_top_k,  # size of the candidate pool sent to the re-ranker
            node_postprocessors=[replace_proc, cohere_rerank],
        )
        return query_engine
    
    # Let's use the previously built index
    sentence_index = build_sentence_window_index(documents, window_size=3)
    advanced_query_engine = get_advanced_query_engine(sentence_index)
    
    # --- Run a complex query ---
    query = "What is the process for reviewing access to PII and what system is involved?"
    response = advanced_query_engine.query(query)
    
    print("--- Query ---")
    print(f"{query}\n")
    print("--- Response ---")
    print(f"{response}\n")
    
    print("--- Re-ranked Source Nodes ---")
    for node in response.source_nodes:
        print(f"Re-rank Score: {node.score:.4f}")
        print(f"Window: {node.metadata['window']}")
        print("-" * 20)
    

    In this setup:

  • similarity_top_k=10: The vector store will initially retrieve the 10 most semantically similar sentences.
  • replace_proc: The MetadataReplacementPostProcessor runs first, expanding these 10 sentences into their full context windows.
  • cohere_rerank: The CohereRerank postprocessor then takes these 10 context windows, sends them to the Cohere API along with the query, and receives a new relevance score for each.
  • top_n=3: The re-ranker discards all but the top 3 most relevant windows.
  • These 3 highly relevant, context-rich windows are passed to the LLM.

    This pipeline is significantly more robust. It finds potentially relevant information in a wide net (similarity_top_k) and then uses a precision tool (CohereRerank) to select only the most valuable pieces for the final synthesis.


    Performance, Cost, and Tuning Considerations

    This advanced pipeline introduces trade-offs that senior engineers must manage.

    Latency

    The re-ranking step is a network call that adds latency. A typical re-rank call for 10-20 documents of ~500 tokens each can add 200-500ms to your response time. This is a significant consideration for real-time applications.

    * Mitigation: Use the re-ranker judiciously. For applications where speed is paramount and queries are simple, you might fall back to a simpler retrieval strategy. For complex analysis where accuracy is critical, the added latency is often an acceptable price.
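
One pragmatic pattern is to keep both engines around and only pay the re-ranking cost when a query looks complex. The routing heuristic below (word count) is deliberately naive and just a placeholder for whatever signal fits your application; it reuses the builder functions defined earlier.

```python
# Hypothetical router: cheap sentence-window engine for short lookups,
# re-ranked engine for longer analytical questions.
fast_engine = get_sentence_window_query_engine(sentence_index, similarity_top_k=3)
accurate_engine = get_advanced_query_engine(sentence_index, similarity_top_k=10, rerank_top_n=3)

def answer(query: str):
    # Placeholder heuristic (word count); in practice a query classifier,
    # user intent signal, or an explicit "deep search" toggle works better.
    engine = accurate_engine if len(query.split()) > 12 else fast_engine
    return engine.query(query)

print(answer("Define Restricted data."))
print(answer("What is the process for reviewing access to PII and what system is involved?"))
```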

    Cost

    Re-ranking services are not free. Cohere, for instance, charges per document processed. If you retrieve 10 documents and re-rank them for every query, the cost can add up quickly.

    * Mitigation: The most important lever is similarity_top_k. Tune this value carefully. Retrieving too many documents (e.g., 50) for re-ranking can be slow and expensive. A value between 8 and 15 is often a good starting point. You are looking for the sweet spot that is large enough to capture the relevant documents but small enough to manage cost and latency.
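
A quick back-of-the-envelope calculation makes the trade-off visible. The unit price below is a hypothetical placeholder, not Cohere's actual rate; substitute your provider's current pricing and billing unit.

```python
# Rough monthly cost estimate for the re-ranking step. The unit price is a
# hypothetical placeholder; check your provider's current pricing and billing unit.
QUERIES_PER_DAY = 5_000
SIMILARITY_TOP_K = 10        # documents sent to the re-ranker per query
PRICE_PER_1K_DOCS = 1.00     # hypothetical USD figure, not an actual quote

docs_per_month = QUERIES_PER_DAY * SIMILARITY_TOP_K * 30
monthly_cost = docs_per_month / 1_000 * PRICE_PER_1K_DOCS
print(f"{docs_per_month:,} documents re-ranked per month ≈ ${monthly_cost:,.2f}")
# Halving SIMILARITY_TOP_K halves this line item, which is why tuning it matters.
```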

    Tuning `similarity_top_k` vs. `rerank_top_n`

    These two parameters are the primary knobs for tuning the pipeline:

    * similarity_top_k (The Net): This determines the size of the candidate pool for the re-ranker. If this value is too small, you risk not even retrieving the correct document in the initial pass (recall error), and the re-ranker can't fix that. If it's too large, you increase cost and latency.

    * rerank_top_n (The Scalpel): This determines how much context is passed to the final LLM prompt. A smaller value (2-3) results in a more focused, concise prompt, which can reduce the chance of the LLM getting distracted. A larger value (4-5) provides more context but increases prompt size and the risk of including marginally relevant information.

    A good tuning strategy:

  • Start with similarity_top_k=10 and rerank_top_n=3.
  • Create an evaluation dataset of 20-50 representative queries and their ideal answers.
  • Run the pipeline and measure retrieval metrics like Mean Reciprocal Rank (MRR) and Hit Rate (did the correct document appear in the top k results?). A minimal way to compute both is sketched after this list.
  • If your Hit Rate is low, you need to increase similarity_top_k to cast a wider net.
  • If your Hit Rate is high but the final LLM answers are poor, the context might be too noisy. Try reducing rerank_top_n to be more selective.
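
A lightweight way to compute both metrics, without any extra tooling, is to label each evaluation query with a marker that must appear in the correct window and check where it lands in the retrieved list. The queries and markers below are illustrative and assume the sentence_index built earlier; in production you would match on stable node or document IDs instead.

```python
# Minimal retrieval evaluation: Hit Rate and MRR over a few labeled queries.
# Each query maps to a marker string that must appear in the correct window.
eval_set = {
    "What triggers a HighRiskAuthAlert?": "unauthorized IP",
    "Who reviews the audit trail for Restricted data?": "ComplianceOverwatch",
    "How is Confidential data encrypted at rest?": "AES-256",
}

retriever = sentence_index.as_retriever(similarity_top_k=10)

hits = 0
reciprocal_ranks = []
for query, marker in eval_set.items():
    results = retriever.retrieve(query)
    rank = next(
        (i + 1 for i, r in enumerate(results) if marker in r.node.metadata["window"]),
        None,
    )
    if rank is None:
        reciprocal_ranks.append(0.0)
    else:
        hits += 1
        reciprocal_ranks.append(1.0 / rank)

print(f"Hit Rate: {hits / len(eval_set):.2f}")
print(f"MRR:      {sum(reciprocal_ranks) / len(eval_set):.2f}")
```
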
Edge Cases and Nuances

    * Document Structure: This strategy excels on prose-heavy documents (legal text, technical manuals, knowledge bases). For structured data like code or logs, the concept of a "sentence" is less meaningful. You may need to revert to other chunking strategies (CodeSplitter) or use a hybrid approach.

    * Window Size: The optimal window_size is domain-dependent. For dense technical text, a smaller window (window_size=1 or 2) might be sufficient. For narrative text where context is spread out, a larger window (window_size=4 or 5) might be necessary. This is a hyperparameter you should tune based on your specific corpus.

    * Noisy Re-ranking: Occasionally, a re-ranker can be swayed by keyword stuffing and might demote a semantically rich but less keyword-dense document. This is rare with high-quality models like Cohere's but is a possibility. It underscores the need for an evaluation set to catch such regressions.

    Conclusion: From Heuristics to Precision Engineering

    Moving a RAG system from a prototype to a reliable production service requires moving beyond simple heuristics. Naive chunking is a heuristic that fails under the pressure of complex information retrieval.

    The two-stage retrieval process detailed here—Sentence-Window Retrieval followed by Cross-Encoder Re-ranking—represents a significant step up in architectural maturity. It addresses the core problem of context fragmentation by separating the unit of embedding from the unit of retrieval, and then refines the results by applying a computationally expensive but highly accurate relevance model.

    By implementing this pattern, you are no longer just performing a similarity search; you are engineering a multi-stage pipeline that balances the speed of vector search with the precision of deep language models. This is the level of detail required to build RAG systems that don't just answer questions, but provide accurate, context-aware, and trustworthy insights.
