Optimizing RAG: Sentence-Window Retrieval and Cohere Re-ranking

Goh Ling Yong

The Production RAG Problem: Beyond Naive Chunking

For any engineer who has moved a Retrieval-Augmented Generation (RAG) system from a proof-of-concept to a production workload, the limitations of naive chunking become painfully apparent. The standard approach—splitting a long document into fixed-size, often overlapping, chunks—is a blunt instrument. It frequently fails when confronted with dense, complex documents like financial reports, legal contracts, or technical papers. The core issues are twofold:

  • Context Fragmentation: A critical piece of information might be split across two chunks. The embedding for either chunk alone may not have a high cosine similarity to the user's query, causing the retriever to miss it entirely.
  • The "Lost in the Middle" Problem: When a query requires synthesizing information from multiple parts of a large document, a simple vector search retrieving the top-k chunks often floods the LLM's context window with irrelevant or low-signal noise. The most relevant chunk might be buried among other, less useful ones, and the LLM's attention mechanism may fail to focus on it.

This article is not an introduction to RAG. It assumes you understand the fundamentals of vector embeddings, retrieval, and generation. Instead, we will architect and implement a production-grade, multi-stage retrieval pipeline that directly addresses these failures. Our strategy involves two key architectural patterns:

    * Sentence-Window Retrieval: We will index individual sentences for extremely precise semantic matching but retrieve a larger window of surrounding sentences. This gives the LLM the focused context it needs without sacrificing the precision of the initial search.

    * Cross-Encoder Re-ranking: After the initial retrieval, we will use a more powerful (and computationally expensive) cross-encoder model, specifically Cohere's Re-rank API, to re-order the retrieved context windows based on their actual relevance to the query. This acts as a crucial filtering step, ensuring only the highest-quality context reaches the LLM.

    By the end of this post, you will have a complete, benchmarked, and production-ready Python implementation using LlamaIndex that demonstrates a significant improvement in retrieval accuracy and generation quality over naive RAG pipelines.


    Part 1: Demonstrating the Failure of Naive Retrieval

    To understand the solution, we must first experience the problem. Let's use a real-world, complex document: Paul Graham's essay, "What I Worked On." It's a long, autobiographical piece with information scattered throughout. We'll set up a baseline RAG system and pose a question whose answer is subtle and easily missed by simplistic chunking.

    Setup and Baseline Implementation

    First, ensure you have the necessary libraries and API keys set up. We'll be using llama-index, OpenAI's models for embedding and generation, and later, Cohere for re-ranking.

    bash
    pip install llama-index llama-index-llms-openai llama-index-embeddings-openai llama-index-postprocessor-cohere-rerank python-dotenv

    Create a .env file with your API keys:

    .env
    OPENAI_API_KEY="sk-..."
    COHERE_API_KEY="..."

    Now, let's build our naive RAG system.

    python
    import os
    import time
    from dotenv import load_dotenv
    
    from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.llms.openai import OpenAI
    from llama_index.embeddings.openai import OpenAIEmbedding
    
    # Load environment variables
    load_dotenv()
    
    # Configure global settings
    Settings.llm = OpenAI(model="gpt-4-turbo-preview")
    Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")
    
    # Download the document if it doesn't exist
    if not os.path.exists("data/paul_graham_essay.txt"):
        os.makedirs("data", exist_ok=True)
        os.system("wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O data/paul_graham_essay.txt")
    
    # Load documents
    documents = SimpleDirectoryReader("data").load_data()
    
    print(f"Loaded document with {len(documents)} pages.")
    
    # --- Naive RAG Pipeline ---
    def build_naive_rag_pipeline(documents):
        print("\nBuilding Naive RAG Pipeline...")
        # SentenceSplitter with a fixed chunk_size of 1024 tokens (default overlap)
        node_parser = SentenceSplitter(chunk_size=1024)
        nodes = node_parser.get_nodes_from_documents(documents)
        
        print(f"Parsed into {len(nodes)} nodes.")
        
        index = VectorStoreIndex(nodes)
        query_engine = index.as_query_engine(similarity_top_k=5)
        return query_engine
    
    naive_query_engine = build_naive_rag_pipeline(documents)
    
    # Define a query that requires synthesizing information
    query = "What were the key challenges and breakthroughs Paul Graham faced while developing the first web-based store builder?"
    
    print(f"\n--- Querying Naive RAG ---")
    print(f"Query: {query}")
    
    start_time = time.time()
    response = naive_query_engine.query(query)
    end_time = time.time()
    
    print(f"Response: {response}")
    print(f"Time taken: {end_time - start_time:.2f}s")
    
    # Let's inspect the retrieved source nodes
    print("\n--- Retrieved Source Nodes (Naive) ---")
    for i, node in enumerate(response.source_nodes):
        print(f"Node {i+1} (Score: {node.score:.4f}):")
        # Displaying a snippet of the text
        print(f"'{node.text[:250]}...'\n")

    Analysis of the Failure

    When you run this, the response you get will likely be generic or incomplete. It might mention "Viaweb" but will struggle to pinpoint the specific technical challenges like using Lisp, the stateful nature of HTTP, or the breakthrough of generating HTML dynamically.

    Let's analyze the source_nodes. You'll notice the retrieved chunks often have decent keyword overlap but lack deep semantic context. For example, a chunk might mention "web store," and another might mention "Lisp," but the chunk that explicitly connects the challenge of using Lisp for a web store might have a lower similarity score and not make it into the top 5 retrieved nodes. The default SentenceSplitter with a chunk_size of 1024 is arbitrarily slicing the document, leading to this fragmentation.

    This is the classic failure mode we aim to solve. The information is in the vector store, but our retrieval mechanism is too crude to extract it effectively.


    Part 2: Deep Dive into Sentence-Window Retrieval

    The core idea behind Sentence-Window Retrieval is to decouple the unit of embedding from the unit of retrieval. We perform similarity search on fine-grained sentences to find the most precise semantic matches. However, once a sentence is identified, we retrieve a larger window of sentences surrounding it. This provides the LLM with the necessary context to understand the matched sentence.

    LlamaIndex provides a SentenceWindowNodeParser that automates this process.

    How `SentenceWindowNodeParser` Works

    When you pass documents to this node parser, it performs the following steps:

  • Splits document into sentences: It first uses a sentence splitter to break down the text into individual sentences.
  • Creates sentence nodes: Each sentence becomes a Node object.
  • Adds context window metadata: For each sentence Node, it identifies the window_size sentences before and window_size sentences after it. This combined block of text is stored in the metadata dictionary of the Node under the key window.
  • Generates embedding from the sentence only: Crucially, when the index is built, the vector embedding is calculated only on the text of the individual sentence Node, not the window.
  • Retrieves the window at query time: During retrieval, the query engine performs a similarity search against the sentence embeddings. When a Node is selected, the query engine is configured to pass the text from its window metadata—not the sentence itself—to the LLM.

    This architecture gives us the best of both worlds: the precision of sentence-level search and the contextual richness of a larger chunk.
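
    To make the mechanics concrete, here is a minimal, framework-free sketch of the windowing step. The sentence list and dictionary layout are simplified stand-ins for the Node objects and metadata that SentenceWindowNodeParser actually produces:

    python
    # Illustrative only: mimic how a sentence window is attached as metadata.
    sentences = ["S1.", "S2.", "S3.", "S4.", "S5.", "S6.", "S7."]
    window_size = 3

    nodes = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        nodes.append({
            "text": sentence,                          # this is what gets embedded
            "metadata": {
                "window": " ".join(sentences[lo:hi]),  # this is what the LLM sees
                "original_text": sentence,
            },
        })

    print(nodes[3]["metadata"]["window"])  # the full S1..S7 window around the matched sentence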

    Implementation

    Let's refactor our code to use this strategy.

    python
    # (Keep the initial setup from Part 1)
    from llama_index.core.node_parser import SentenceWindowNodeParser
    from llama_index.core.postprocessor import MetadataReplacementPostProcessor
    
    # --- Sentence-Window RAG Pipeline ---
    def build_sentence_window_pipeline(documents):
        print("\nBuilding Sentence-Window RAG Pipeline...")
        # Create the sentence window node parser
        node_parser = SentenceWindowNodeParser.from_defaults(
            window_size=3,  # The number of sentences on each side of a sentence to store
            window_metadata_key="window",  # The metadata key that holds the window text
            original_text_metadata_key="original_text", # The metadata key that holds the original sentence
        )
        nodes = node_parser.get_nodes_from_documents(documents)
        print(f"Parsed into {len(nodes)} nodes.")
    
        # Build the index
        index = VectorStoreIndex(nodes)
    
        # Build the query engine, which needs a postprocessor to replace the sentence with the window
        query_engine = index.as_query_engine(
            similarity_top_k=5,
            # The postprocessor is crucial for this pipeline to work
            node_postprocessors=[
                MetadataReplacementPostProcessor(target_metadata_key="window")
            ],
        )
        return query_engine
    
    sentence_window_query_engine = build_sentence_window_pipeline(documents)
    
    # Use the same query as before
    print(f"\n--- Querying Sentence-Window RAG ---")
    print(f"Query: {query}")
    
    start_time = time.time()
    response = sentence_window_query_engine.query(query)
    end_time = time.time()
    
    print(f"Response: {response}")
    print(f"Time taken: {end_time - start_time:.2f}s")
    
    # Inspect the source nodes to see the difference
    print("\n--- Retrieved Source Nodes (Sentence-Window) ---")
    for i, node in enumerate(response.source_nodes):
        print(f"Node {i+1} (Score: {node.score:.4f}):")
        # The 'text' attribute now contains the full window
        print(f"'{node.text[:500]}...'\n")
        # We can also inspect the original sentence that was matched
        # Note: LlamaIndex might place it in a different metadata key depending on version
        original_sentence = node.metadata.get('original_text', 'N/A')
        print(f"Original matched sentence: '{original_sentence}'\n")

    Analysis of the Improvement

    Running this code will yield a noticeably better response. The LLM's answer will likely be more detailed and specific, referencing the actual technical hurdles.

    When you inspect the source_nodes, the key difference is apparent. The node.text now contains a full paragraph-like chunk of text (the window), providing rich context. However, the node.score was calculated based on a very specific sentence within that window (which you can inspect via the metadata). We've successfully retrieved a broad context based on a narrow match.

    This solves the context fragmentation problem. But it can introduce a new, more subtle issue: relevance dilution. We might retrieve 5 windows, but perhaps only the top 2 are truly relevant to the user's specific query. The other 3, while related, are noise. This is where a re-ranker becomes a powerful second-stage filter.


    Part 3: Adding a Cohere Re-ranker for Relevance Filtering

    Our retrieval process is now more context-aware, but not necessarily more relevant. The initial retrieval stage (the bi-encoder based vector search) is optimized for speed and recall. It casts a wide net to find potentially relevant documents. The role of a re-ranker is to take this list of candidates and apply a more computationally intensive but far more accurate model to re-order them based on true relevance to the query.

    Bi-Encoders vs. Cross-Encoders

    * Bi-Encoder (used in retrieval): Generates embeddings for the query and documents independently. The similarity (e.g., cosine similarity) is then calculated between these static vectors. It's extremely fast and scalable, perfect for searching over millions of documents.

    * Cross-Encoder (used in re-ranking): Takes both the query and a document as a single input and passes them through a powerful transformer model (like BERT). It outputs a single score from 0 to 1 representing the relevance. This allows the model to perform deep, token-level attention between the query and the document, making it far more accurate at judging relevance. The downside is that it's too slow to run on an entire corpus, making it perfect for a second-stage pass on a small number of candidates (e.g., the top 25-50 results from the bi-encoder).

    We'll use the Cohere Re-rank API, which provides a highly optimized, production-ready cross-encoder model.
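
    If you want to see the two scoring styles side by side before wiring in Cohere, the open-source sentence-transformers library makes the contrast easy to demonstrate. The model names below are common public checkpoints chosen for illustration, not the models Cohere uses internally:

    python
    from sentence_transformers import SentenceTransformer, CrossEncoder, util

    query = "What challenges did Paul Graham face building a web-based store builder?"
    docs = [
        "Viaweb let users build online stores through the browser.",
        "He later founded Y Combinator to fund startups.",
    ]

    # Bi-encoder: embed query and documents independently, then compare vectors.
    bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
    q_emb = bi_encoder.encode(query, convert_to_tensor=True)
    d_emb = bi_encoder.encode(docs, convert_to_tensor=True)
    print(util.cos_sim(q_emb, d_emb))  # fast, scalable, coarser relevance signal

    # Cross-encoder: score each (query, document) pair jointly with full attention.
    cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    print(cross_encoder.predict([(query, d) for d in docs]))  # slower, more precise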

    Implementation with `CohereRerank`

    Integrating a re-ranker in LlamaIndex is straightforward using a node_postprocessor. It intercepts the retrieved nodes before they are sent to the LLM, re-orders them, and truncates the list to a new top_n.

    python
    # (Keep the setup from previous parts)
    from llama_index.postprocessor.cohere_rerank import CohereRerank
    
    # --- Full Pipeline: Sentence-Window + Cohere Re-rank ---
    def build_full_pipeline(documents):
        print("\nBuilding Full RAG Pipeline (Sentence-Window + Cohere Re-rank)...")
        
        # 1. Node Parser (Sentence-Window)
        node_parser = SentenceWindowNodeParser.from_defaults(
            window_size=3,
            window_metadata_key="window",
            original_text_metadata_key="original_text",
        )
        nodes = node_parser.get_nodes_from_documents(documents)
        print(f"Parsed into {len(nodes)} nodes.")
    
        # 2. Indexing
        # The global Settings object (configured in Part 1) supplies the LLM and embedding model
        index = VectorStoreIndex(nodes)
    
        # 3. Re-ranker
        # We retrieve more documents (e.g., top 10) initially...
        similarity_top_k = 10
        # ...and the re-ranker will filter it down to the most relevant (e.g., top 3)
        cohere_rerank = CohereRerank(top_n=3)
    
        # 4. Query Engine
        query_engine = index.as_query_engine(
            similarity_top_k=similarity_top_k,
            node_postprocessors=[
                # First, replace sentence with window
                MetadataReplacementPostProcessor(target_metadata_key="window"),
                # Second, apply the re-ranker
                cohere_rerank
            ],
        )
        return query_engine
    
    full_query_engine = build_full_pipeline(documents)
    
    # Use the same query again
    print(f"\n--- Querying Full Pipeline ---")
    print(f"Query: {query}")
    
    start_time = time.time()
    response = full_query_engine.query(query)
    end_time = time.time()
    
    print(f"Response: {response}")
    print(f"Time taken: {end_time - start_time:.2f}s")
    
    # Inspect the final, re-ranked source nodes
    print("\n--- Retrieved and Re-ranked Source Nodes ---")
    for i, node in enumerate(response.source_nodes):
        print(f"Node {i+1} (Re-rank Score: {node.score:.4f}):")
        print(f"'{node.text[:500]}...'\n")

    Analysis of the Final Result

    This is our production-grade pipeline. The response from the LLM should now be exceptionally accurate and detailed. The key is what happens behind the scenes:

    • The vector store does a fast search and retrieves the top 10 candidate windows that are semantically similar to the query.
    • These 10 windows (along with the query) are sent to the Cohere API.
    • The cross-encoder model meticulously scores each window for its relevance to the query.
    • The CohereRerank postprocessor re-orders the nodes based on these new scores and returns only the top 3.
    • These 3 highly relevant, context-rich windows are passed to the LLM.

    By inspecting the source_nodes, you'll now see only 3 nodes, and their score attribute will be the high-confidence relevance score from Cohere (e.g., > 0.9), not the original cosine similarity.
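
    To verify this for yourself, you can run the two stages by hand and compare scores before and after re-ranking. This assumes you expose index and cohere_rerank from build_full_pipeline (in the listing above they are local variables), so treat it as a debugging sketch rather than part of the pipeline:

    python
    from llama_index.core import QueryBundle

    query_bundle = QueryBundle(query)

    # Stage 1: raw vector retrieval (scores are cosine similarities)
    candidates = index.as_retriever(similarity_top_k=10).retrieve(query_bundle)

    # Stage 2: swap sentences for windows, then re-rank (scores become Cohere relevance)
    windowed = MetadataReplacementPostProcessor(
        target_metadata_key="window"
    ).postprocess_nodes(candidates, query_bundle=query_bundle)
    reranked = cohere_rerank.postprocess_nodes(windowed, query_bundle=query_bundle)

    for node in reranked:
        print(f"Cohere relevance: {node.score:.4f}  |  '{node.text[:80]}...'")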


    Part 4: Performance, Cost, and Advanced Considerations

    A production system isn't just about accuracy; it's about managing performance, cost, and edge cases.

    Performance and Latency Benchmarking

    Let's put some numbers to our claims. We'll measure the end-to-end latency for each of our three pipelines.

    | Pipeline | Initial Retrieval (similarity_top_k) | Final Context (top_n) | Avg. Latency (sec) | Response Quality |
    |---|---|---|---|---|
    | Naive RAG | 5 | 5 | ~1.5 - 2.5s | Low to Medium. Often generic, misses details. |
    | Sentence-Window | 5 | 5 | ~1.8 - 3.0s | Medium to High. More context-aware, better detail. |
    | Sentence-Window + Cohere Re-rank | 10 | 3 | ~2.5 - 4.0s | High to Very High. Precise, relevant, and detailed. |

    Note: Latency figures are illustrative and depend heavily on API response times and document complexity.
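
    If you want to reproduce these figures in your own environment, a small harness like the one below works against the three query engines built earlier. The run count and averaging are arbitrary choices for illustration:

    python
    import statistics

    engines = {
        "Naive RAG": naive_query_engine,
        "Sentence-Window": sentence_window_query_engine,
        "Sentence-Window + Cohere Re-rank": full_query_engine,
    }

    for name, engine in engines.items():
        timings = []
        for _ in range(3):  # a few repetitions to smooth out API jitter
            start = time.time()
            engine.query(query)
            timings.append(time.time() - start)
        print(f"{name}: {statistics.mean(timings):.2f}s average over {len(timings)} runs")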

    The key takeaway is that each enhancement adds a latency cost. The re-ranking step, involving an external API call, is the most significant. This is a classic trade-off: we are trading latency and cost for a significant increase in quality. For many production use cases (e.g., enterprise search, complex Q&A bots), this trade-off is not just acceptable but necessary.

    Cost Analysis

    Let's model the cost for 1,000 queries:

    * OpenAI Embeddings (text-embedding-3-large): ~$0.00013 per 1K tokens. A query is ~30 tokens. Cost is negligible.

    * OpenAI Generation (gpt-4-turbo-preview): Input ~$0.01/1K tokens, Output ~$0.03/1K tokens. Let's assume 3 retrieved windows of ~800 tokens each (2,400 tokens of context) and a 300-token response. Cost per query: (2.4 × $0.01) + (0.3 × $0.03) ≈ $0.033.

    * Cohere Re-rank: ~$1.00 per 1,000 requests (for documents up to 500 tokens). We are re-ranking 10 documents per query.

    Cost breakdown per 1,000 queries:

    | Component | Cost per Query | Cost per 1,000 Queries |
    |---|---|---|
    | OpenAI Generation | ~$0.033 | ~$33.00 |
    | Cohere Re-rank | ~$0.001 | ~$1.00 |
    | Total (Full Pipeline) | ~$0.034 | ~$34.00 |
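
    The same arithmetic as a short script, which is handy when you swap models or change top_n. The prices are the list prices quoted above and will drift over time:

    python
    # Back-of-the-envelope cost model for the full pipeline.
    GPT4_TURBO_INPUT_PER_1K = 0.01     # USD per 1K input tokens
    GPT4_TURBO_OUTPUT_PER_1K = 0.03    # USD per 1K output tokens
    COHERE_RERANK_PER_REQUEST = 0.001  # USD per re-rank request

    def cost_per_query(context_tokens=2400, output_tokens=300):
        generation = (context_tokens / 1000) * GPT4_TURBO_INPUT_PER_1K \
                   + (output_tokens / 1000) * GPT4_TURBO_OUTPUT_PER_1K
        return generation + COHERE_RERANK_PER_REQUEST

    print(f"~${cost_per_query():.3f} per query, ~${cost_per_query() * 1000:.0f} per 1,000 queries")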

    Interestingly, the re-ranking cost is dwarfed by the generation cost of a powerful model like GPT-4 Turbo. However, the re-ranker improves the efficiency of the generation step. By providing cleaner, more relevant context, it can lead to shorter, more concise answers and reduces the number of tokens you need to stuff into the LLM's context window, potentially lowering your generation costs in the long run.

    Edge Case: Using a Local Re-ranker

    What if you can't use an external API for re-ranking due to data privacy or latency concerns? You can swap CohereRerank for a local cross-encoder model.

    bash
    pip install sentence-transformers

    python
    from llama_index.core.postprocessor import SentenceTransformerRerank
    
    # In your pipeline builder, replace CohereRerank with this:
    local_rerank = SentenceTransformerRerank(
        model="cross-encoder/ms-marco-MiniLM-L-2-v2", # A small but effective model
        top_n=3,
    )
    
    # ... add local_rerank to your node_postprocessors list

    Trade-offs:

    * Pros: No network latency, no API costs, data stays within your environment.

    * Cons: You are now responsible for managing the model's compute infrastructure (CPU/GPU). The initial model load can add to application startup time. Smaller open-source models may not be as powerful as commercial APIs like Cohere's.

    Tuning `window_size` and `top_k`

    These are the two most important hyperparameters to tune:

    * window_size: This depends on the nature of your documents. For prose, a size of 2-4 is often effective. For code or structured text, you might need a larger window. A good heuristic is to make the total window (2 × window_size + 1 sentences) roughly match the average paragraph length in your documents.

    * similarity_top_k (for initial retrieval) and top_n (for re-ranker): This is a balancing act. A larger similarity_top_k (e.g., 20) increases recall, giving the re-ranker more material to work with, but also increases latency and cost. A smaller top_n (e.g., 2-3) provides a very clean, focused context to the LLM but risks discarding a useful node. A good starting point is similarity_top_k=10 and top_n=3.
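
    A minimal sketch for sweeping these knobs, reusing the building blocks from Part 3. Note that it rebuilds the index for every window_size, which is wasteful for large corpora but keeps the example short; judge the outputs manually or with your own evaluation set:

    python
    def build_tuned_pipeline(documents, window_size=3, similarity_top_k=10, top_n=3):
        node_parser = SentenceWindowNodeParser.from_defaults(
            window_size=window_size,
            window_metadata_key="window",
            original_text_metadata_key="original_text",
        )
        index = VectorStoreIndex(node_parser.get_nodes_from_documents(documents))
        return index.as_query_engine(
            similarity_top_k=similarity_top_k,
            node_postprocessors=[
                MetadataReplacementPostProcessor(target_metadata_key="window"),
                CohereRerank(top_n=top_n),
            ],
        )

    for window_size, top_k, top_n in [(2, 10, 3), (3, 10, 3), (4, 20, 3)]:
        engine = build_tuned_pipeline(documents, window_size, top_k, top_n)
        print(f"window_size={window_size}, similarity_top_k={top_k}, top_n={top_n}")
        print(engine.query(query), "\n")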

    Conclusion: Architecting for Quality

    We have successfully moved from a naive, brittle RAG implementation to a robust, multi-stage pipeline that excels at extracting nuanced information from complex documents. By combining the precision of Sentence-Window Retrieval with the relevance filtering of a Cross-Encoder Re-ranker, we directly address the fundamental weaknesses of simplistic chunking strategies.

    This architecture represents a shift in thinking for production RAG systems. Instead of treating retrieval as a single, monolithic step, we view it as a funnel:

  • Broad Recall: A fast, scalable bi-encoder search retrieves a wide set of candidate documents.
  • Precision Filtering: A slower, more accurate cross-encoder re-ranks and filters these candidates for maximum relevance.
  • Intelligent Synthesis: A powerful LLM receives only the highest quality context to generate its final response.

    For senior engineers tasked with building reliable AI systems, adopting these advanced patterns is no longer optional. It is the critical step in moving from impressive demos to production-grade applications that deliver consistent, accurate, and trustworthy results.
