Optimizing RAG with Cohere Re-rank and PGVector IVFFlat Indexes

Goh Ling Yong

The Semantic Gap in Production RAG Systems

In any mature Retrieval-Augmented Generation (RAG) system, the initial euphoria of functional semantic search quickly gives way to the harsh reality of the "semantic gap." Standard vector search, typically based on cosine similarity or Euclidean distance, is exceptionally good at identifying documents that share statistical correlations in their embedding space. However, it often fails to capture the nuanced, contextual relevance required by a Large Language Model (LLM) to generate a precise and accurate response.

A query for "What were the financial implications of the Q3 server migration?" might retrieve documents about "Q3 financials" and "server migration planning" but miss the critical document titled "Post-Mortem: Q3 Infrastructure Overspend" because its embedding is marginally further away than the more keyword-heavy, yet less insightful, alternatives. This is the core problem: we are optimizing for vector proximity, not true relevance. This leads to a classic precision/recall trade-off. A simple vector search that retrieves a large number of candidates (top_k=100) might have high recall (the correct document is likely in the set) but suffers from abysmal precision (most of the 100 documents are noise).

To bridge this gap in a production environment, we must evolve from a single-stage retrieval process to a multi-stage pipeline that separates the concerns of broad candidate sourcing (recall) from precise relevance filtering (precision). This is the foundation of the Retrieve-and-Re-rank architecture.

The Two-Stage Retrieval Architecture: Retrieve and Re-rank

This pattern decomposes the retrieval problem into two distinct phases:

  • L1 Retrieval (The Recaller): A highly optimized, low-latency search mechanism designed to fetch a large set of potentially relevant candidates from the entire document corpus. The goal here is speed and recall. We accept that many of the results will be irrelevant. We'll use PostgreSQL with pgvector and a specifically tuned IVFFlat index for this stage.
  • L2 Re-ranking (The Refiner): A more computationally intensive, but far more sophisticated, model that takes the candidate set from L1 and re-scores it based on a deeper understanding of contextual relevance. This model, typically a cross-encoder, directly compares the query against each candidate document. We'll use the Cohere Re-rank API for this stage.

This architecture allows us to use the right tool for each job: a scalable and fast Approximate Nearest Neighbor (ANN) index for the initial wide net, and a powerful, context-aware neural model for the final, precise selection.

    mermaid
    graph TD
        A[User Query] --> B{Embedding Model};
        B --> C[L1 Retrieval: PGVector IVFFlat];
        C -->|"Top-K Candidates (e.g., K=100)"| D{L2 Re-ranking: Cohere API};
        A --> D;
        D -->|"Top-N Relevant Docs (e.g., N=5)"| E{LLM for Generation};
        A --> E;
        E --> F[Final Answer];
    
        subgraph Legend
            direction LR
            subgraph L1 - Recall
                C
            end
            subgraph L2 - Precision
                D
            end
        end

    Deep Dive: Configuring PGVector with IVFFlat Indexes

    While HNSW (Hierarchical Navigable Small World) is a popular choice for pgvector indexes due to its excellent speed-recall balance, IVFFlat (inverted file with flat, i.e. uncompressed, vector storage) offers compelling advantages for massive, largely static datasets: faster index builds, lower memory usage, and query behavior we can tune explicitly at query time.

    IVFFlat works by partitioning the vector space into a predefined number of clusters, or lists. During a query, instead of searching the entire space, the system identifies the clusters closest to the query vector and searches only within them. This is controlled by the probes parameter.
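
    To make the mechanics concrete, here is a minimal, purely illustrative Python sketch of the IVF search idea. This is not how pgvector implements it internally (and it uses Euclidean distance for brevity); it simply shows how lists and probes interact:

    python
    import numpy as np

    def ivf_search(query, centroids, lists, probes, top_k):
        """Conceptual IVF search: scan only the `probes` closest lists.

        centroids: (num_lists, dim) array of cluster centers
        lists: dict mapping list index -> list of (doc_id, vector) pairs
        """
        # 1. Rank the cluster centroids by distance from the query vector
        centroid_dist = np.linalg.norm(centroids - query, axis=1)
        closest_lists = np.argsort(centroid_dist)[:probes]

        # 2. Brute-force search only within the selected lists
        candidates = []
        for list_idx in closest_lists:
            for doc_id, vec in lists[list_idx]:
                candidates.append((np.linalg.norm(vec - query), doc_id))

        # 3. Return the top_k nearest candidates found
        candidates.sort()
        return [doc_id for _, doc_id in candidates[:top_k]]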

    Why IVFFlat over HNSW for this Architecture?

  • Tunable Precision/Performance: The probes parameter gives us explicit, granular control over the trade-off between speed and recall. For our L1 stage, we can choose a lower probes value, accepting slightly lower recall for significantly lower latency, knowing the L2 re-ranker will compensate.
  • Memory Usage: IVFFlat can be more memory-efficient than HNSW at very large scales, as the index structure is simpler.
  • Predictable Performance: Query time is more directly proportional to the number of probes, making latency more predictable.

    Implementation and Parameter Tuning

    Let's assume the following table for our documents. The embedding dimension is 768, matching a model like sentence-transformers/all-mpnet-base-v2.

    sql
    -- Ensure the pgvector extension is enabled
    CREATE EXTENSION IF NOT EXISTS vector;
    
    -- Document table schema
    CREATE TABLE documents (
        id BIGSERIAL PRIMARY KEY,
        content TEXT NOT NULL,
        metadata JSONB,
        embedding VECTOR(768)
    );
    
    -- Example of inserting a document
    -- INSERT INTO documents (content, metadata, embedding) VALUES (...);
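
    For reference, here is a minimal Python sketch of the ingestion path, using sentence-transformers and psycopg2 against the schema above (error handling and batching omitted):

    python
    from psycopg2.extras import Json
    from sentence_transformers import SentenceTransformer

    # all-mpnet-base-v2 produces 768-dimensional embeddings, matching VECTOR(768)
    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

    def insert_document(conn, content: str, metadata: dict | None = None):
        embedding = model.encode(content).tolist()
        with conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO documents (content, metadata, embedding) VALUES (%s, %s, %s)",
                (content, Json(metadata or {}), str(embedding)),
            )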

    The crucial step is creating the IVFFlat index. The most important parameter is lists.

    Choosing the lists Parameter:

    The number of lists determines how many partitions the vector space is divided into. A good starting point is:

  • For up to 1 million rows: lists = num_rows / 1000
  • For over 1 million rows: lists = sqrt(num_rows)

    Let's assume we have 500,000 documents. A reasonable lists value would be 500000 / 1000 = 500.
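
    A tiny helper makes this heuristic explicit (a sketch of pgvector's documented rule of thumb; treat it as a starting point, not a law):

    python
    import math

    def recommended_lists(num_rows: int) -> int:
        """pgvector's rule of thumb for the IVFFlat `lists` parameter."""
        if num_rows <= 1_000_000:
            return max(1, num_rows // 1000)
        return int(math.sqrt(num_rows))

    print(recommended_lists(500_000))     # 500
    print(recommended_lists(10_000_000))  # 3162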

    sql
    -- Create the IVFFlat index.
    -- Build the index after the table is populated so the list centroids reflect
    -- the real data distribution; this can take significant time on a large table.
    CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 500);

    Tuning the probes Query Parameter:

    The probes parameter is not set at index creation; it's set at query time. It dictates how many of the nearby lists (clusters) are searched. This is our primary lever for balancing latency and recall.

    We can set this parameter for our session before running the query:

    sql
    -- Set the number of probes for the current session
    SET ivfflat.probes = 10;
    
    -- Now execute the query
    SELECT id, content
    FROM documents
    ORDER BY embedding <=> '[...your_query_vector...]' -- <=> is pgvector's cosine distance operator
    LIMIT 100;
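
    Before benchmarking, it is worth confirming that the planner is actually using the IVFFlat index rather than falling back to a sequential scan. A quick sanity check from Python (assuming an open psycopg2 connection and a query_vector, as in the examples below):

    python
    with conn.cursor() as cur:
        cur.execute("SET LOCAL ivfflat.probes = 10;")
        cur.execute(
            "EXPLAIN ANALYZE SELECT id FROM documents "
            "ORDER BY embedding <=> %s LIMIT 100",
            (str(query_vector),),
        )
        for (line,) in cur.fetchall():
            print(line)  # expect an "Index Scan" line naming the ivfflat index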

    Benchmarking `probes` for Production

    To choose the right probes value, you must benchmark. The goal is to find the "knee" in the curve—the point where increasing probes yields diminishing returns in recall for a significant latency cost.

    Here is a Python script demonstrating how to perform this benchmark. It requires a ground truth dataset where you know the truly relevant document IDs for a set of test queries.

    python
    import psycopg2
    import numpy as np
    import time
    import os
    
    # --- Configuration ---
    DB_CONN_STRING = os.getenv("DB_URL")
    TEST_QUERIES = { # A map of test queries to their ground truth document IDs
        "query_vector_1": [101, 205, 308],
        "query_vector_2": [415, 522, 631, 749]
    }
    PROBE_VALUES = [1, 2, 4, 8, 16, 32]
    TOP_K_RETRIEVAL = 100
    
    def get_embedding(query_text): # Placeholder for your embedding function
        # In a real scenario, this would call your embedding model
        return np.random.rand(768).tolist()
    
    def run_benchmark():
        results = {}
        conn = psycopg2.connect(DB_CONN_STRING)
        
        for probes in PROBE_VALUES:
            latencies = []
            recalls = []
            
            with conn.cursor() as cur:
                cur.execute(f"SET LOCAL ivfflat.probes = {probes};")
                print(f"\n--- Benchmarking with probes = {probes} ---")
    
                for query_text, true_ids in TEST_QUERIES.items():
                    query_vector = get_embedding(query_text)
                    
                    start_time = time.perf_counter()
                    
                    cur.execute(
                        "SELECT id FROM documents ORDER BY embedding <=> %s LIMIT %s",
                        (str(query_vector), TOP_K_RETRIEVAL)
                    )
                    retrieved_ids = {row[0] for row in cur.fetchall()}
                    
                    end_time = time.perf_counter()
                    latencies.append((end_time - start_time) * 1000) # milliseconds
                    
                    # Calculate recall@K
                    hits = len(retrieved_ids.intersection(set(true_ids)))
                    recall_at_k = hits / len(true_ids) if true_ids else 0
                    recalls.append(recall_at_k)
    
            avg_latency = np.mean(latencies)
            avg_recall = np.mean(recalls)
            results[probes] = {"latency_ms": avg_latency, "recall": avg_recall}
            
            print(f"Avg Latency: {avg_latency:.2f} ms")
            print(f"Avg Recall@{TOP_K_RETRIEVAL}: {avg_recall:.4f}")
    
        conn.close()
        return results
    
    if __name__ == "__main__":
        benchmark_results = run_benchmark()
        print("\n--- Final Results ---")
        print("| Probes | Avg Latency (ms) | Recall@100 |")
        print("|--------|------------------|------------|")
        for p, r in benchmark_results.items():
            print(f"| {p:<6} | {r['latency_ms']:<16.2f} | {r['recall']:<10.4f} |")
    

    Hypothetical Benchmark Results:

    | Probes | Avg Latency (ms) | Recall@100 |
    |--------|------------------|------------|
    | 1      | 8.51             | 0.8214     |
    | 2      | 12.33            | 0.8929     |
    | 4      | 19.87            | 0.9643     |
    | 8      | 35.12            | 0.9821     |
    | 16     | 65.45            | 0.9821     |
    | 32     | 121.90           | 0.9821     |

    From this data, probes = 8 is the clear winner. It achieves nearly maximum recall with a manageable latency of ~35ms. Increasing to 16 nearly doubles the latency for no improvement in recall. This is the data-driven decision-making required for production systems.
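
    The same decision can be automated: pick the smallest probes value whose recall is within a small tolerance of the best recall observed. A short sketch operating on the benchmark_results dict produced by the script above:

    python
    def pick_probes(benchmark_results: dict, tolerance: float = 0.005) -> int:
        """Smallest `probes` whose recall is within `tolerance` of the best observed recall."""
        best_recall = max(r["recall"] for r in benchmark_results.values())
        return min(
            probes
            for probes, r in benchmark_results.items()
            if r["recall"] >= best_recall - tolerance
        )

    # With the hypothetical results above, pick_probes(...) returns 8.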

    Implementing the Cohere Re-ranker Layer

    With our fast L1 retriever tuned, we now feed its output—a list of 100 candidate documents—to our L2 re-ranker. The Cohere Re-rank model is a cross-encoder. Unlike a bi-encoder (which creates embeddings independently), a cross-encoder processes the (query, document) pair together, allowing it to capture far more subtle and complex relevance signals.
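
    If you want to feel the difference locally before wiring up an API, the sentence-transformers library ships a CrossEncoder class that scores (query, document) pairs jointly. The checkpoint below is an illustrative open model, not a substitute for Cohere Re-rank; it simply demonstrates the cross-encoder scoring pattern:

    python
    from sentence_transformers import CrossEncoder

    # An open cross-encoder checkpoint used purely for illustration
    cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    query = "What were the financial implications of the Q3 server migration?"
    candidates = [
        "Post-Mortem: Q3 Infrastructure Overspend",
        "Server migration planning checklist",
    ]

    # Each (query, document) pair is scored together, not via independent embeddings
    scores = cross_encoder.predict([(query, doc) for doc in candidates])
    for doc, score in sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True):
        print(f"{score:.4f}  {doc}")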

    The API Call and Implementation

    The implementation is straightforward. We take the raw text of the documents retrieved from PostgreSQL and pass them to the Cohere API.

    python
    import cohere
    
    class Reranker:
        def __init__(self, api_key):
            self.co = cohere.Client(api_key)
    
        def rerank_documents(self, query: str, documents: list[dict], top_n: int = 5) -> list[dict]:
            """
            Reranks a list of documents based on a query.
    
            Args:
                query (str): The user's query.
                documents (list[dict]): A list of dictionaries, each with 'id' and 'content'.
                top_n (int): The number of top documents to return.
    
            Returns:
                list[dict]: A sorted list of the top_n most relevant documents.
            """
            if not documents:
                return []
    
            # Extract the content for the API call
            doc_contents = [doc['content'] for doc in documents]
    
            try:
                # The model 'rerank-english-v2.0' is optimized for performance and quality.
                response = self.co.rerank(
                    model='rerank-english-v2.0',
                    query=query,
                    documents=doc_contents,
                    top_n=top_n
                )
    
                # Map the reranked results back to our original document objects
                reranked_indices = [r.index for r in response.results]
                reranked_docs = [documents[i] for i in reranked_indices]
                
                # Optionally, you can attach the relevance score
                for i, result in enumerate(response.results):
                    reranked_docs[i]['relevance_score'] = result.relevance_score
    
                return reranked_docs
    
            except Exception as e:  # the concrete exception class varies across cohere SDK versions; catch broadly
                # Production-grade error handling is crucial
                print(f"Cohere API error: {e}")
                # Fallback strategy: return the original top_n documents from L1 retrieval
                return documents[:top_n]
    
    # Example Usage:
    # cohere_reranker = Reranker(api_key=os.getenv("COHERE_API_KEY"))
    # retrieved_docs = [...] # List of {'id': 1, 'content': '...'} from PGVector
    # final_docs = cohere_reranker.rerank_documents("my query", retrieved_docs, top_n=5)

    This implementation includes a critical fallback mechanism. If the Cohere API fails, we don't fail the entire request; we gracefully degrade by returning the top N documents from the initial L1 retrieval. This ensures system resilience.
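
    Beyond the fallback, transient failures (rate limits, timeouts) are often worth a retry or two before degrading. A minimal standard-library sketch; the attempt count and backoff values are illustrative, not tuned:

    python
    import random
    import time

    def call_with_retries(fn, max_attempts: int = 3, base_delay: float = 0.5):
        """Call fn() with exponential backoff and jitter; re-raise after the final attempt."""
        for attempt in range(1, max_attempts + 1):
            try:
                return fn()
            except Exception:
                if attempt == max_attempts:
                    raise
                # Exponential backoff plus jitter to avoid synchronized retries
                time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))

    # Example: retry the rerank call, then let the caller fall back to L1 order on failure.
    # reranked = call_with_retries(lambda: reranker.rerank_documents(query, docs, top_n=5))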

    End-to-End Production Pipeline

    Now, let's assemble the full pipeline into a cohesive service class. This class will manage the database connection, embedding generation, L1 retrieval, and L2 re-ranking.

    python
    import psycopg2
    from psycopg2.extras import RealDictCursor
    import cohere
    from sentence_transformers import SentenceTransformer
    import os
    
    class AdvancedRAGPipeline:
        def __init__(self, db_conn_string, cohere_api_key, embedding_model_name='all-mpnet-base-v2'):
            self.db_conn = psycopg2.connect(db_conn_string)
            self.cohere_client = cohere.Client(cohere_api_key)
            self.embedding_model = SentenceTransformer(embedding_model_name)
            self.ivfflat_probes = 8 # Determined from our benchmark
    
        def _embed(self, text: str) -> list[float]:
            return self.embedding_model.encode(text).tolist()
    
        def _retrieve_candidates(self, query_vector: list[float], top_k: int) -> list[dict]:
            # psycopg2 opens a transaction implicitly; using the connection as a
            # context manager commits on exit, so SET LOCAL is scoped to this call.
            with self.db_conn:
                with self.db_conn.cursor(cursor_factory=RealDictCursor) as cur:
                    cur.execute(f"SET LOCAL ivfflat.probes = {self.ivfflat_probes};")
                    cur.execute(
                        "SELECT id, content FROM documents ORDER BY embedding <=> %s LIMIT %s",
                        (str(query_vector), top_k)
                    )
                    candidates = cur.fetchall()
            return candidates
    
        def _rerank(self, query: str, documents: list[dict], top_n: int) -> list[dict]:
            if not documents:
                return []
            
            doc_contents = [doc['content'] for doc in documents]
            try:
                response = self.cohere_client.rerank(
                    model='rerank-english-v2.0',
                    query=query,
                    documents=doc_contents,
                    top_n=top_n
                )
                reranked_docs = [documents[r.index] for r in response.results]
                for i, result in enumerate(response.results):
                    reranked_docs[i]['relevance_score'] = result.relevance_score
                return reranked_docs
            except Exception as e:  # the concrete exception class varies across cohere SDK versions; catch broadly
                print(f"Cohere API error during rerank: {e}. Falling back to L1 results.")
                return documents[:top_n]
    
        def execute(self, query: str, retrieve_k: int = 100, rerank_n: int = 5) -> list[dict]:
            """Executes the full Retrieve-and-Re-rank pipeline."""
            if not query:
                return []
    
            # 1. Embed the query
            query_vector = self._embed(query)
    
            # 2. L1 Retrieval
            candidate_docs = self._retrieve_candidates(query_vector, retrieve_k)
            if not candidate_docs:
                return []
    
            # 3. L2 Re-ranking
            final_docs = self._rerank(query, candidate_docs, rerank_n)
    
            return final_docs
    
    # --- Main execution block ---
    if __name__ == '__main__':
        pipeline = AdvancedRAGPipeline(
            db_conn_string=os.getenv("DB_URL"),
            cohere_api_key=os.getenv("COHERE_API_KEY")
        )
    
        user_query = "What were the financial implications of the Q3 server migration?"
        
        final_documents = pipeline.execute(user_query)
    
        print(f"Final documents for query: '{user_query}'")
        for doc in final_documents:
            score = doc.get('relevance_score')
            score_str = f"{score:.4f}" if score is not None else "N/A"
            print(f"  - ID: {doc['id']}, Score: {score_str}, Content: {doc['content'][:100]}...")
    

    Edge Case Handling and Production Considerations

  • Empty Candidate Set: If the L1 retrieval returns zero documents, the pipeline correctly short-circuits. This should be monitored; if it happens frequently, it may indicate issues with the embedding model or a corpus that is too narrow.
  • Re-ranker Latency: The re-ranker is the primary bottleneck. For a top_k of 100, Cohere's API typically responds in 200-400ms. If this is too slow, consider reducing the number of candidates (retrieve_k=50). This is another trade-off: a smaller candidate set for the re-ranker is faster and cheaper but increases the risk that the truly relevant document was never passed to it.
  • Cost Management: The Cohere API is a metered service. The cost is proportional to the number of documents re-ranked. Caching strategies (e.g., using Redis for identical queries) can mitigate costs for high-frequency queries.
  • Relevance Thresholding: In the final output, you might want to filter out documents where relevance_score is below a certain threshold (e.g., 0.5). This prevents the LLM from being fed low-quality context, even if it made it into the top N.
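
    As a sketch of that thresholding step (the 0.5 cutoff is illustrative and should be tuned empirically; documents from the L1 fallback path carry no score and are kept):

    python
    def filter_by_relevance(docs: list[dict], threshold: float = 0.5) -> list[dict]:
        """Drop re-ranked documents whose relevance_score falls below the threshold."""
        return [
            doc for doc in docs
            if doc.get("relevance_score") is None or doc["relevance_score"] >= threshold
        ]

    # final_docs = filter_by_relevance(pipeline.execute(user_query), threshold=0.5)
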
    Performance and Cost Analysis

    Let's break down the expected performance of our end-to-end system:

    | Stage                    | Typical Latency (ms) | Notes                                                        |
    |--------------------------|----------------------|--------------------------------------------------------------|
    | 1. Query Embedding       | 30-50                | Depends on model and hardware (GPU vs CPU).                  |
    | 2. PGVector L1 Retrieval | 30-40                | Based on our benchmark with probes=8 for top_k=100.          |
    | 3. Cohere L2 Re-ranking  | 200-400              | For 100 documents. This is the main contributor to latency.  |
    | Total Retrieval Time     | 260-490              | Before passing context to the LLM for generation.            |

    This sub-500ms retrieval time is generally acceptable for interactive applications, especially given the immense quality gains. The cost is also a factor. If PGVector costs are part of your existing database infrastructure, the main new operational cost is the Cohere API, which should be factored into the per-query cost model of your application.

    Conclusion

    By moving from a naive single-stage vector search to a sophisticated two-stage Retrieve-and-Re-rank architecture, we solve the critical problem of the semantic gap in production RAG systems. This pattern allows us to leverage the strengths of different technologies: the raw speed and scalability of PGVector with a tuned IVFFlat index for high-recall candidate sourcing, and the deep contextual understanding of a cross-encoder model like Cohere Re-rank for high-precision final selection.

    The key takeaways for senior engineers implementing this pattern are:

  • Benchmark Your L1 Index: Don't guess your probes value. Measure the latency/recall trade-off to make a data-informed decision that fits your application's specific needs.
  • Isolate and Decouple: The L1 and L2 stages are decoupled. You can swap out the re-ranker, or even the entire vector database, without fundamentally altering the architecture.
  • Build for Resilience: Implement robust error handling and fallback logic, especially for external API calls. A degraded experience (falling back to L1 results) is better than a complete failure.

    This architecture represents a significant step up in maturity for any RAG system, moving it from a promising prototype to a reliable, production-grade service capable of delivering consistently relevant and accurate results.
