Advanced RAG: Implementing a Re-ranking Layer with Cohere

Goh Ling Yong

The Production RAG Bottleneck: Beyond Naive Vector Search

In any production-grade Retrieval-Augmented Generation (RAG) system, the quality of the final output is inextricably linked to the quality of the retrieved context. The principle of "Garbage In, Garbage Out" is brutally apparent. While basic vector similarity search provides a functional baseline, senior engineers quickly discover its limitations in real-world scenarios. Semantic search can be imprecise, often surfacing documents that are topically related but contextually irrelevant, leading to LLM hallucinations or unhelpful answers.

The common reaction is to increase the top_k parameter in the initial retrieval, hoping to cast a wider net and catch the correct information. This is an anti-pattern: it bloats the context window, increases token costs, and risks overwhelming the LLM with noise, potentially obscuring the very information you need due to the "lost in the middle" problem.

The solution is not to retrieve more, but to retrieve smarter. This requires a multi-stage architecture that decouples the initial, high-recall retrieval from a subsequent, high-precision re-ranking step. This post details the implementation of such a system, using a hybrid search for the first stage and Cohere's sophisticated re-ranking model for the second.

This architecture addresses a fundamental trade-off in information retrieval:

  • Stage 1 (Retrieval): Uses efficient models (like bi-encoders for dense vectors and BM25 for sparse vectors) to quickly scan a massive corpus and retrieve a large set of potentially relevant candidates (k=50 or k=100). The goal here is high recall. We want to ensure the correct document is somewhere in this initial set.
  • Stage 2 (Re-ranking): Uses a more powerful, computationally expensive model (a cross-encoder) to scrutinize this smaller candidate set. The cross-encoder examines the query and each document simultaneously, capturing fine-grained semantic relevance that bi-encoders miss. The goal is high precision. We want to identify the most relevant documents from the candidate pool and rank them correctly.

We will build this system from the ground up, focusing on production concerns like latency, cost, and error handling.


    Phase 1: High-Recall Hybrid Search Implementation

    Our first step is to retrieve a superset of candidate documents. Relying solely on dense vector search is often insufficient. It can struggle with keyword matching and specific terminology. A hybrid approach, combining sparse (keyword-based, e.g., BM25) and dense (semantic-based) vectors, provides a more robust foundation.
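
    To make the fusion step concrete, below is a minimal sketch of Reciprocal Rank Fusion (RRF), one common way to merge a sparse (BM25) ranking and a dense ranking into a single candidate list. The function and its k=60 smoothing constant are illustrative choices, not part of any particular vector database's API.

    python
    from typing import Dict, List

    def reciprocal_rank_fusion(result_lists: List[List[str]], k: int = 60) -> List[str]:
        """Merge several ranked lists of document IDs into one fused ranking.

        Each document's fused score is the sum of 1 / (k + rank) over every list
        it appears in; k=60 is a conventional smoothing constant.
        """
        scores: Dict[str, float] = {}
        for ranked_ids in result_lists:
            for rank, doc_id in enumerate(ranked_ids, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # Example: a keyword-heavy BM25 ranking fused with a semantic dense ranking.
    bm25_ranking = ["doc5", "doc1", "doc7", "doc2"]
    dense_ranking = ["doc1", "doc6", "doc4", "doc5"]
    print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
    # Documents that rank well in either list surface near the top of the fused list.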

    For this example, we will simulate a client for a vector database that supports sparse-dense hybrid search, such as Pinecone, Weaviate, or Elasticsearch (with ELSER and a dense vector field). The implementation details of the vector DB are abstracted to focus on the orchestration logic.

    Simulating the Hybrid Search Client

    Let's define a mock client that returns a list of documents. In a real application, this would be a network call to your vector database.

    python
    import os
    import time
    import uuid
    from typing import List, Dict, Any
    
    # Mock document corpus
    DOCUMENT_CORPUS = [
        {"id": "doc1", "text": "The new X1 GPU offers 24GB of VRAM and is optimized for deep learning workloads."}, 
        {"id": "doc2", "text": "Our latest CPU, the Z-Processor, has 16 cores and a boost clock of 5.2GHz."}, 
        {"id": "doc3", "text": "Cloud storage solutions now include tiered access for cost optimization."}, 
        {"id": "doc4", "text": "To fine-tune a language model, you need a substantial dataset and significant GPU memory, often more than 16GB."}, 
        {"id": "doc5", "text": "The X1 GPU driver update addresses performance issues in parallel computing tasks."}, 
        {"id": "doc6", "text": "When selecting a GPU for AI, VRAM capacity is a critical factor to consider."}, 
        {"id": "doc7", "text": "Our new server rack is designed for high-density computing with efficient cooling for multiple X1 GPUs."}
    ]
    
    class MockVectorDBClient:
        """A mock client simulating a hybrid search call to a vector database."""
        def __init__(self, corpus: List[Dict[str, Any]]):
            self._corpus = {doc['id']: doc for doc in corpus}
    
        def hybrid_search(self, query: str, top_k: int = 10) -> List[Dict[str, Any]]:
            """
            Simulates a hybrid search. In a real system, this would involve:
            1. Creating sparse and dense vectors for the query.
            2. Sending a query to the vector DB (e.g., Pinecone, Elasticsearch).
            3. Receiving a ranked list of documents.
            
            This mock implementation returns a fixed, slightly noisy result set 
            to demonstrate the need for re-ranking.
            """
            print(f"[VectorDB] Performing hybrid search for query: '{query}' with top_k={top_k}")
            # Simulate network latency
            time.sleep(0.15) 
    
            # Let's craft a specific scenario to show the weakness of basic retrieval.
            # Query: "How much memory does the X1 GPU have for AI tasks?"
            # Expected best docs: doc1, doc4, doc6
            # A naive search might rank them poorly.
            if "x1 gpu" in query.lower() and "memory" in query.lower():
                # Naive retrieval might rank by keyword frequency or imperfect semantics
                retrieved_ids = ["doc5", "doc1", "doc7", "doc2", "doc6", "doc4", "doc3"]
                results = [self._corpus[id] for id in retrieved_ids[:top_k]]
                print(f"[VectorDB] Retrieved {len(results)} candidate documents.")
                return results
            
            # Generic fallback
            return list(self._corpus.values())[:top_k]
    
    # Instantiate the client
    vector_db_client = MockVectorDBClient(DOCUMENT_CORPUS)
    
    # --- Example Usage of Phase 1 ---
    query = "How much memory does the X1 GPU have for AI tasks?"
    candidate_docs = vector_db_client.hybrid_search(query, top_k=7)
    
    print("\n--- Initial Retrieval Results (Potentially Noisy) ---")
    for i, doc in enumerate(candidate_docs):
        print(f"{i+1}. ID: {doc['id']}, Text: {doc['text']}")
    

    Running this code will produce an output like this:

    text
    [VectorDB] Performing hybrid search for query: 'How much memory does the X1 GPU have for AI tasks?' with top_k=7
    [VectorDB] Retrieved 7 candidate documents.
    
    --- Initial Retrieval Results (Potentially Noisy) ---
    1. ID: doc5, Text: The X1 GPU driver update addresses performance issues in parallel computing tasks.
    2. ID: doc1, Text: The new X1 GPU offers 24GB of VRAM and is optimized for deep learning workloads.
    3. ID: doc7, Text: Our new server rack is designed for high-density computing with efficient cooling for multiple X1 GPUs.
    4. ID: doc2, Text: Our latest CPU, the Z-Processor, has 16 cores and a boost clock of 5.2GHz.
    5. ID: doc6, Text: When selecting a GPU for AI, VRAM capacity is a critical factor to consider.
    6. ID: doc4, Text: To fine-tune a language model, you need a substantial dataset and significant GPU memory, often more than 16GB.
    7. ID: doc3, Text: Cloud storage solutions now include tiered access for cost optimization.

    Notice the suboptimal ranking. The most relevant document (doc1) is second. A completely irrelevant document about a CPU (doc2) is ranked higher than two highly relevant documents (doc6 and doc4). Feeding this context directly to an LLM would force it to sift through noise, increasing the chance of an incorrect or incomplete answer. This is the precise problem we will solve with a re-ranking layer.


    Phase 2: High-Precision Re-ranking with Cohere

    Now we take the noisy candidate list and apply a more powerful model to re-order it based on true semantic relevance. This is where a cross-encoder shines. Unlike bi-encoders, which create vector embeddings for the query and documents independently, a cross-encoder processes them together ([CLS] query [SEP] document [SEP]), allowing it to model the interactions between their terms much more deeply.

    Cohere provides a managed, highly optimized cross-encoder model via their rerank API, saving us the significant effort of hosting and scaling such a model ourselves.
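
    If you want to see the cross-encoder idea in isolation (or prototype without an API key), here is a minimal sketch using an open-source cross-encoder from the sentence-transformers library. This is not the model behind Cohere's API; it simply illustrates how query-document pairs are scored jointly.

    python
    # Assumes: pip install sentence-transformers
    from sentence_transformers import CrossEncoder

    # A small open-source cross-encoder trained for passage relevance (MS MARCO).
    cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    query = "How much memory does the X1 GPU have for AI tasks?"
    docs = [
        "The new X1 GPU offers 24GB of VRAM and is optimized for deep learning workloads.",
        "Our latest CPU, the Z-Processor, has 16 cores and a boost clock of 5.2GHz.",
    ]

    # Each (query, document) pair is scored together, unlike independent bi-encoder embeddings.
    scores = cross_encoder.predict([(query, doc) for doc in docs])
    for doc, score in sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True):
        print(f"{score:.3f}  {doc}")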

    Integrating the Cohere Re-rank API

    First, ensure you have the Cohere Python SDK installed and an API key set as an environment variable (COHERE_API_KEY):

    bash
    pip install cohere

    Now, let's write the code to interact with the API.

    python
    import cohere
    
    class Reranker:
        """A client for the Cohere Rerank API."""
        def __init__(self, api_key: str):
            if not api_key:
                raise ValueError("Cohere API key is required.")
            self.co = cohere.Client(api_key)
    
        def rerank_documents(
            self, 
            query: str, 
            documents: List[Dict[str, Any]], 
            top_n: int = 3,
            model: str = 'rerank-english-v2.0'
        ) -> List[Dict[str, Any]]:
            """
            Reranks a list of documents based on a query using the Cohere API.
    
            Args:
                query: The user's query.
                documents: The list of documents retrieved from the first stage.
                top_n: The number of top documents to return after re-ranking.
                model: The re-ranking model to use.
    
            Returns:
                A sorted list of the top_n most relevant documents with relevance scores.
            """
            print(f"\n[Reranker] Sending {len(documents)} documents to Cohere for re-ranking...")
            
            # The API expects a list of strings.
            doc_texts = [doc['text'] for doc in documents]
    
            try:
                start_time = time.time()
                # The core API call
                rerank_results = self.co.rerank(
                    model=model,
                    query=query,
                    documents=doc_texts,
                    top_n=top_n
                )
                latency = time.time() - start_time
                print(f"[Reranker] Cohere API call took {latency:.4f} seconds.")
    
                # The API returns a list of RerankResult objects.
                # We need to map these back to our original document objects.
                sorted_docs = []
                for result in rerank_results.results:
                    original_doc = documents[result.index]
                    sorted_doc = {
                        'id': original_doc['id'],
                        'text': original_doc['text'],
                        'relevance_score': result.relevance_score
                    }
                    sorted_docs.append(sorted_doc)
                
                return sorted_docs
    
            except Exception as e:
                # Catching broadly keeps the pipeline alive on any SDK or network error;
                # a more specific Cohere exception type could be substituted here.
                print(f"[Reranker] Error calling Cohere API: {e}")
                # Fallback strategy: return the original top_n documents without re-ranking
                return documents[:top_n]
    
    # --- Example Usage of Phase 2 ---
    
    # Ensure your API key is set in your environment
    # export COHERE_API_KEY="YOUR_API_KEY"
    api_key = os.getenv("COHERE_API_KEY")
    
    if api_key:
        reranker = Reranker(api_key)
    
        # Use the noisy documents from Phase 1
        reranked_docs = reranker.rerank_documents(
            query=query, 
            documents=candidate_docs, 
            top_n=3
        )
    
        print("\n--- Re-ranked Results (High Precision) ---")
        for i, doc in enumerate(reranked_docs):
            print(f"{i+1}. ID: {doc['id']}, Score: {doc['relevance_score']:.4f}, Text: {doc['text']}")
    else:
        print("\nSkipping re-ranking as COHERE_API_KEY is not set.")
    

    The output demonstrates the power of the re-ranking step:

    text
    [Reranker] Sending 7 documents to Cohere for re-ranking...
    [Reranker] Cohere API call took 0.2815 seconds.
    
    --- Re-ranked Results (High Precision) ---
    1. ID: doc1, Score: 0.9987, Text: The new X1 GPU offers 24GB of VRAM and is optimized for deep learning workloads.
    2. ID: doc6, Score: 0.9521, Text: When selecting a GPU for AI, VRAM capacity is a critical factor to consider.
    3. ID: doc4, Score: 0.8912, Text: To fine-tune a language model, you need a substantial dataset and significant GPU memory, often more than 16GB.

    The transformation is dramatic. The re-ranker correctly identified doc1 as the most relevant, followed by doc6 and doc4, which provide excellent supporting context. The irrelevant documents (doc2, doc3, doc5, doc7) have been completely eliminated from the final context that will be passed to the LLM.


    End-to-End Production-Grade Orchestration

    Now let's tie everything together into a single, robust class that represents our advanced RAG pipeline. This class will handle orchestration, configuration, and error handling.

    python
    import openai
    
    class AdvancedRAGPipeline:
        def __init__(self, vector_db_client, reranker_client, llm_client):
            self.vector_db = vector_db_client
            self.reranker = reranker_client
            self.llm = llm_client
            self.config = {
                'retrieval_k': 50,  # Retrieve a large number of candidates
                'rerank_n': 5,      # Re-rank and return the top N
            }
    
        def execute_query(self, query: str) -> Dict[str, Any]:
            """Executes the full retrieve -> rerank -> generate pipeline."""
            request_id = str(uuid.uuid4())
            print(f"\n[Pipeline:{request_id}] Starting advanced RAG query: '{query}'")
    
            # 1. High-Recall Retrieval
            try:
                candidate_docs = self.vector_db.hybrid_search(query, top_k=self.config['retrieval_k'])
                if not candidate_docs:
                    print(f"[Pipeline:{request_id}] No documents found in initial retrieval.")
                    return {"answer": "I could not find any information related to your query.", "context": []}
            except Exception as e:
                print(f"[Pipeline:{request_id}] Error during retrieval phase: {e}")
                # Critical failure, cannot proceed
                return {"answer": "An error occurred while searching for information.", "context": []}
    
            # 2. High-Precision Re-ranking
            final_context_docs = self.reranker.rerank_documents(
                query=query,
                documents=candidate_docs,
                top_n=self.config['rerank_n']
            )
    
            # 3. Generation with the LLM
            prompt = self._build_prompt(query, final_context_docs)
            
            try:
                response = self.llm.chat.completions.create(
                    model="gpt-3.5-turbo",
                    messages=[
                        {"role": "system", "content": "You are a helpful assistant. Answer the user's question based ONLY on the provided context."},
                        {"role": "user", "content": prompt}
                    ]
                )
                answer = response.choices[0].message.content
            except Exception as e:
                print(f"[Pipeline:{request_id}] Error during generation phase: {e}")
                answer = "An error occurred while generating the answer."
    
            return {
                "answer": answer,
                "retrieved_context": final_context_docs
            }
    
        def _build_prompt(self, query: str, context_docs: List[Dict[str, Any]]) -> str:
            """Builds the final prompt for the LLM."""
            context_str = "\n\n".join([f"Document {i+1} (ID: {doc['id']}):\n{doc['text']}" for i, doc in enumerate(context_docs)])
            
            prompt = f"""
            Context information is provided below.
            ---------------------
            {context_str}
            ---------------------
            Given the context information and not prior knowledge, answer the query.
            Query: {query}
            Answer:
            """
            return prompt
    
    # --- Full Pipeline Execution ---
    
    # Setup clients
    openai_api_key = os.getenv("OPENAI_API_KEY")
    cohere_api_key = os.getenv("COHERE_API_KEY")
    
    if openai_api_key and cohere_api_key:
        mock_db_client = MockVectorDBClient(DOCUMENT_CORPUS)
        cohere_reranker = Reranker(cohere_api_key)
        openai_client = openai.OpenAI(api_key=openai_api_key)
    
        pipeline = AdvancedRAGPipeline(
            vector_db_client=mock_db_client,
            reranker_client=cohere_reranker,
            llm_client=openai_client
        )
    
        # Let's use our test query again
        final_query = "How much memory does the X1 GPU have for AI tasks?"
        result = pipeline.execute_query(final_query)
    
        print("\n--- FINAL RAG PIPELINE OUTPUT ---")
        print(f"Query: {final_query}")
        print(f"\nGenerated Answer:\n{result['answer']}")
        print("\n--- Context Used for Generation ---")
        for doc in result['retrieved_context']:
            # The fallback path returns documents without a relevance_score,
            # so guard the formatting instead of assuming the key exists.
            score = doc.get('relevance_score')
            score_str = f"{score:.4f}" if score is not None else "N/A"
            print(f"- ID: {doc['id']}, Score: {score_str}, Text: {doc['text']}")
    else:
        print("\nSkipping full pipeline execution as OpenAI or Cohere API keys are not set.")
    

    With the clean, re-ranked context, the LLM can now easily synthesize the correct answer:

    text
    Generated Answer:
    Based on the provided context, the new X1 GPU offers 24GB of VRAM. This is important for AI tasks like deep learning and fine-tuning language models, which require significant GPU memory.

    This answer is accurate, concise, and directly derived from the high-quality context we provided, demonstrating the success of the two-stage retrieval pattern.


    Performance, Latency, and Cost Considerations

    Introducing a network call to a third-party API in your retrieval path has significant implications that must be carefully managed in a production environment.

    Latency Analysis

    The re-ranking step adds a serial network hop, directly increasing the total time-to-first-token for your RAG system.

    * Initial Retrieval (Vector DB): Typically 50-200ms.

    * Re-ranking (Cohere API): Our tests showed ~280ms for 7 documents. Latency grows with the number and length of documents sent, since the cross-encoder must score the query against each document.

    * Total Added Latency: ~300-500ms of retrieval plus re-ranking before the LLM even starts generating.

    Mitigation Strategies:

  • Limit Documents to Re-rank: The most critical optimization. Do not send your entire k=100 retrieval set to the re-ranker. A common pattern is to retrieve k=50 to k=100 from the vector DB, but only pass the top 10-25 of those to the re-ranker. This balances recall and latency.
  • Aggressive Timeouts: Implement a strict timeout (e.g., 500ms) on the re-ranker call. If it fails or times out, fall back to using the top N results from the initial retrieval. This preserves system availability at the cost of occasional sub-optimal ranking (a sketch follows this list).
  • Caching: If you have common queries, cache the re-ranked document IDs. A query-hash -> sorted-doc-IDs mapping in a Redis cache can serve subsequent identical requests instantly.
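
    As an illustration of the timeout-and-fallback pattern above, the sketch below wraps the Reranker from Phase 2 in a worker thread with a hard deadline. The 500ms budget and the helper name are illustrative choices.

    python
    from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

    def rerank_with_timeout(reranker, query, documents, top_n=3, timeout_s=0.5):
        """Re-rank with a hard deadline; fall back to the original order on timeout."""
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(reranker.rerank_documents, query, documents, top_n)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            print("[Reranker] Timed out; falling back to initial retrieval order.")
            return documents[:top_n]
        finally:
            # Don't block on the abandoned call; it finishes quietly in the background.
            pool.shutdown(wait=False)
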
    Cost Analysis

    Re-ranking is not free. Cohere's Rerank API is metered, and pricing models change over time, so check the current pricing page rather than relying on a snapshot here. Whatever the exact rate, the effective cost scales with the number of queries and the number (and length) of documents sent per query. A single call is cheap, but at production traffic volumes it adds up.

    Cost-Benefit Calculation (a toy calculator follows the list):

    * Cost of Re-ranking: (Number of queries) x (Documents per query) x (Cost per document).

    * Cost Savings (Potentially): By providing a cleaner, more concise context, you might be able to use a smaller, cheaper LLM (e.g., GPT-3.5-Turbo instead of GPT-4) or use fewer total tokens in the prompt. You must benchmark this trade-off.

    * Value of Accuracy: For many applications (e.g., customer support bots, internal knowledge search), the cost of providing a wrong answer (customer churn, wasted employee time) far exceeds the sub-cent cost of a re-ranking call. The ROI is often justified by the significant increase in quality.
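
    A toy cost model makes the trade-off concrete. Every rate below is a made-up placeholder; substitute your own traffic figures and the providers' current published pricing.

    python
    def monthly_rag_cost(queries_per_month: int,
                         rerank_cost_per_query: float,
                         llm_cost_without_rerank: float,
                         llm_cost_with_rerank: float) -> dict:
        """Compare monthly spend with and without the re-ranking layer.

        All per-query rates are placeholders; plug in current published pricing
        and your own measured prompt sizes.
        """
        without = queries_per_month * llm_cost_without_rerank
        with_rerank = queries_per_month * (llm_cost_with_rerank + rerank_cost_per_query)
        return {
            "without_rerank": without,
            "with_rerank": with_rerank,
            "delta": with_rerank - without,
        }

    # Example with placeholder rates: a leaner prompt can offset the re-ranking fee.
    print(monthly_rag_cost(queries_per_month=100_000,
                           rerank_cost_per_query=0.002,
                           llm_cost_without_rerank=0.006,
                           llm_cost_with_rerank=0.003))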


    Advanced Edge Cases and Nuances

    Deploying this pattern in the wild reveals several subtleties that must be handled.

    Document Chunking Strategy Matters

    The re-ranker's effectiveness is highly dependent on the quality of your document chunks. If your chunks are arbitrary 256-token blocks sliced from a larger document, the re-ranker may struggle to find relevance.

    Best Practice: Implement semantic or logical chunking. Chunks should represent self-contained ideas—paragraphs, sections, or list items. This ensures that the text sent to the re-ranker has sufficient context to be judged for relevance. The re-ranker's context window should also inform your maximum chunk size.
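
    As a rough illustration of logical chunking, the sketch below groups whole paragraphs into chunks under a size budget. The whitespace word count is a crude stand-in for real tokenization, and the 256-word budget is an arbitrary example value.

    python
    from typing import List

    def chunk_by_paragraph(text: str, max_words: int = 256) -> List[str]:
        """Group whole paragraphs into chunks without ever splitting mid-paragraph."""
        chunks: List[str] = []
        current: List[str] = []
        current_len = 0
        for para in (p.strip() for p in text.split("\n\n") if p.strip()):
            para_len = len(para.split())
            # Close the current chunk if adding this paragraph would exceed the budget.
            if current and current_len + para_len > max_words:
                chunks.append("\n\n".join(current))
                current, current_len = [], 0
            current.append(para)
            current_len += para_len
        if current:
            chunks.append("\n\n".join(current))
        return chunks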

    The "Lost in the Middle" Problem

    Research has shown that LLMs pay more attention to information at the beginning and end of their context window. Even with perfectly re-ranked documents, the order of presentation matters.

    Solution: Always place the most relevant document (the one with the highest re-ranking score) first in the context you build for the LLM. The default behavior of our pipeline—sorting by relevance_score—already accomplishes this. This simple step ensures the most critical information is in a prime position for the LLM's attention mechanism.

    Conditional Bypass for Re-ranking

    Is a second API call always necessary? Not always. If the initial retrieval from your vector database returns documents with extremely high similarity scores (e.g., a score > 0.95), it might indicate a very high-confidence match.

    Implementation Pattern (sketched in code after the list):

    • Perform the initial hybrid search.
    • Check the scores of the top few documents.
    • If top_1_score > threshold and (top_1_score - top_2_score) > delta_threshold, you can infer a confident, unambiguous result.
    • In this case, you can bypass the re-ranking step, save the latency and cost, and proceed directly to generation. This adds complexity but can optimize performance for common, easy queries.
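
    A minimal sketch of that check, assuming the vector database returns a similarity score with each hit; the 0.95 and 0.1 thresholds are illustrative and should be tuned against your own score distribution.

    python
    from typing import Any, Dict, List

    def should_bypass_rerank(hits: List[Dict[str, Any]],
                             score_threshold: float = 0.95,
                             delta_threshold: float = 0.1) -> bool:
        """Return True when the top hit is high-confidence and clearly ahead of the rest.

        `hits` is assumed to be a score-sorted list of dicts with a 'score' field.
        """
        if not hits:
            return False
        if len(hits) == 1:
            return hits[0]["score"] > score_threshold
        top1, top2 = hits[0]["score"], hits[1]["score"]
        return top1 > score_threshold and (top1 - top2) > delta_threshold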

    Offline Evaluation is Crucial

    How do you prove the re-ranker is adding value? You need a robust offline evaluation framework before deploying.

  • Create a Golden Dataset: Collect a set of representative queries and manually label the most relevant documents for each from your corpus.
  • Run Both Pipelines: For each query, run your RAG system with and without the re-ranking layer.
  • Measure IR Metrics: Calculate metrics like Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) on the ranked lists produced by both systems. A significant lift in these metrics provides quantitative evidence that the re-ranker is improving retrieval quality. This data is essential for justifying the added complexity and cost of the architecture. A minimal MRR sketch follows this list.
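
    A minimal MRR computation, assuming a gold mapping from each query to its single most relevant document ID (the data shapes here are assumptions for the sketch):

    python
    from typing import Dict, List

    def mean_reciprocal_rank(ranked_ids_by_query: Dict[str, List[str]],
                             relevant_id_by_query: Dict[str, str]) -> float:
        """Average of 1/rank of the first relevant document per query (0 if absent)."""
        reciprocal_ranks = []
        for query, ranked_ids in ranked_ids_by_query.items():
            relevant = relevant_id_by_query[query]
            if relevant in ranked_ids:
                reciprocal_ranks.append(1.0 / (ranked_ids.index(relevant) + 1))
            else:
                reciprocal_ranks.append(0.0)
        return sum(reciprocal_ranks) / len(reciprocal_ranks)

    # Example: the same query ranked without and with the re-ranking layer.
    baseline = {"x1 gpu memory": ["doc5", "doc1", "doc7"]}
    reranked = {"x1 gpu memory": ["doc1", "doc6", "doc4"]}
    gold = {"x1 gpu memory": "doc1"}
    print(mean_reciprocal_rank(baseline, gold))  # 0.5
    print(mean_reciprocal_rank(reranked, gold))  # 1.0
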
    Conclusion

    The transition from a basic RAG prototype to a production-ready AI system requires moving beyond single-stage, naive retrieval. By implementing a two-stage architecture—using a high-recall hybrid search to find candidates followed by a high-precision cross-encoder re-ranker—engineers can solve the critical problem of noisy context. This pattern directly addresses the core weakness of many RAG systems, ensuring the LLM receives a clean, relevant, and well-ordered context.

    While this approach introduces added latency and cost, the dramatic improvement in answer quality, accuracy, and reliability often provides a compelling return on investment. For any serious RAG application, mastering the "retrieve and re-rank" pattern is no longer an option, but a necessity for delivering state-of-the-art performance.
