Optimizing RAG: Hybrid Search in Postgres with pgvector & FTS

19 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Brittle Nature of Pure Vector Search in Production RAG

As engineers scaling Retrieval-Augmented Generation (RAG) systems, we've moved past the initial excitement of semantic search. We're now in the trenches, dealing with production workloads, and the reality is that pure vector search, while powerful, is a blunt instrument. It's brilliant at finding documents that are conceptually similar to a query, but it often flounders when precision is paramount.

Consider these common failure modes in a production RAG system that relies solely on a vector database or an extension like pgvector:

  • Keyword Obliviousness: A user queries for a specific error code, ERR_CONNECTION_RESET, or a product SKU like QNAP-TS-453D. A pure semantic search might return documents about general network connectivity issues or NAS storage devices, completely missing the documents containing the exact string because the semantic meaning of the alphanumeric code is lost during the embedding process.
  • Acronym and Jargon Ambiguity: In specialized domains (medical, legal, financial), acronyms are critical. A search for "GDPR compliance" might not effectively retrieve a document that only uses the full phrase "General Data Protection Regulation."
  • Out-of-Vocabulary (OOV) Terms: Embedding models are trained on a finite corpus. Newly coined technical terms, project codenames, or specific identifiers might not have meaningful vector representations, causing them to be treated as noise.
  • Conversely, traditional Full-Text Search (FTS), based on sparse vector models like TF-IDF or BM25, is the polar opposite. It offers surgical precision for keywords but has zero semantic understanding. A query for "how to secure a web server" would miss a critical document titled "hardening an Nginx reverse proxy."

    This is not a theoretical problem. It's a daily reality that leads to poor quality RAG outputs and user frustration. The solution is not to choose one over the other, but to architect a system that leverages the strengths of both: Hybrid Search. This article provides a deep, implementation-focused guide to building a high-performance, production-ready hybrid search engine directly within PostgreSQL, combining pgvector for semantic search and built-in FTS for keyword search, and crucially, using an advanced fusion algorithm to intelligently merge the results.


    Architectural Blueprint: Parallel Search and Late Fusion

    A naive approach to hybrid search might involve running a vector query, then a separate FTS query, and clumsily concatenating the results in the application layer. This is inefficient, difficult to paginate, and makes relevance tuning a nightmare. A robust architecture performs these operations in parallel within the database and uses a sophisticated fusion strategy to rank the final results.

    Our target architecture within PostgreSQL will look like this:

  • Unified Data Model: A single table will store the document content, its pre-computed embedding vector, and its pre-computed FTS tsvector.
  • Parallel Query Execution: A single SQL query, using Common Table Expressions (CTEs), will execute the vector search and the FTS search concurrently.
  • Score Normalization: The raw outputs from each search (distance for vectors, rank for FTS) are not directly comparable. We will normalize these scores into a consistent range within the query itself.
  • Advanced Result Fusion: Instead of a simple weighted average (alpha score_vec + (1-alpha) score_fts), which is highly sensitive to score distribution, we will implement Reciprocal Rank Fusion (RRF). RRF is a rank-based fusion method that is more robust and often yields superior results without tedious manual tuning of weights.
  • Let's start by defining our database schema.

    1. Schema and Indexing Strategy

    We'll create a documents table to store our knowledge base. The schema is designed for efficient hybrid queries.

    sql
    -- Ensure the pgvector extension is installed
    CREATE EXTENSION IF NOT EXISTS vector;
    
    -- Create the documents table
    CREATE TABLE documents (
        id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        tenant_id UUID, -- For multi-tenancy
        content TEXT NOT NULL,
        metadata JSONB,
        embedding VECTOR(768), -- Dimension depends on your model (e.g., 768 for all-MiniLM-L6-v2)
        content_tsvector TSVECTOR
    );
    
    -- Index for Full-Text Search on the tsvector
    -- A GIN index is generally preferred for FTS
    CREATE INDEX idx_documents_tsvector ON documents USING GIN(content_tsvector);
    
    -- Index for pgvector (HNSW)
    -- HNSW is often the best choice for high-recall, high-performance ANN search.
    -- m: max number of connections per layer (higher = better recall, slower build)
    -- ef_construction: size of the dynamic candidate list during build (higher = better recall, slower build)
    CREATE INDEX idx_documents_embedding ON documents USING HNSW (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
    
    -- Composite B-tree index for multi-tenancy filtering
    CREATE INDEX idx_documents_tenant ON documents(tenant_id);
    
    -- Optional: A trigger to automatically update the tsvector on content changes
    CREATE OR REPLACE FUNCTION update_tsvector()
    RETURNS TRIGGER AS $$
    BEGIN
        NEW.content_tsvector := to_tsvector('english', NEW.content);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;
    
    CREATE TRIGGER tsvector_update_trigger
    BEFORE INSERT OR UPDATE ON documents
    FOR EACH ROW EXECUTE FUNCTION update_tsvector();
    

    Key Architectural Decisions:

    * VECTOR(768): The dimension must match your embedding model. Using a model like sentence-transformers/all-MiniLM-L6-v2 (384) or text-embedding-ada-002 (1536) would require changing this value.

    * HNSW vs. IVFFlat: We chose HNSW (Hierarchical Navigable Small World) as our vector index. It generally provides a better recall-latency trade-off than IVFFlat, especially on complex datasets. While IVFFlat can be faster if you have a well-clustered dataset and can tune the probes parameter perfectly, HNSW is more robust out-of-the-box. The ef_search parameter can be set at query time to tune the recall/performance trade-off without re-indexing.

    * vector_cosine_ops: We are indexing for cosine similarity. If you normalize your embeddings to unit length, cosine similarity is equivalent to dot product and inversely related to L2 distance. Using cosine distance (<=>) is conventional for semantic search.

    * Automatic tsvector Generation: The trigger is a classic production pattern. It ensures the FTS vector is always in sync with the content, decoupling this logic from your application layer and reducing the risk of stale data.

    2. The Data Ingestion Pipeline

    Your application needs to generate embeddings before inserting data. Here is a Python example demonstrating a robust ingestion function.

    python
    import psycopg2
    import uuid
    from sentence_transformers import SentenceTransformer
    
    # It's recommended to initialize the model once and reuse it.
    EMBEDDING_MODEL = SentenceTransformer('all-MiniLM-L6-v2')
    
    def add_document_to_db(tenant_id: str, content: str, metadata: dict):
        """
        Generates embedding and inserts a document into the PostgreSQL database.
        """
        conn = None
        try:
            # Connect to your PostgreSQL database
            conn = psycopg2.connect(
                dbname="your_db", 
                user="your_user", 
                password="your_password", 
                host="localhost"
            )
            cur = conn.cursor()
    
            # 1. Generate the embedding
            # The model outputs a numpy array, which needs to be converted to a list
            embedding = EMBEDDING_MODEL.encode(content).tolist()
    
            # 2. Prepare the SQL statement
            # The trigger will handle the tsvector automatically
            sql = """
            INSERT INTO documents (tenant_id, content, metadata, embedding)
            VALUES (%s, %s, %s, %s::VECTOR)
            RETURNING id;
            """
    
            # 3. Execute the insertion
            cur.execute(sql, (tenant_id, content, psycopg2.extras.Json(metadata), embedding))
            document_id = cur.fetchone()[0]
            
            conn.commit()
            cur.close()
            print(f"Successfully inserted document with ID: {document_id}")
            return document_id
    
        except (Exception, psycopg2.DatabaseError) as error:
            print(f"Error: {error}")
            if conn:
                conn.rollback()
        finally:
            if conn:
                conn.close()
    
    # Example Usage:
    add_document_to_db(
        tenant_id=str(uuid.uuid4()),
        content="PostgreSQL's HNSW index provides fast approximate nearest neighbor search.",
        metadata={'source': 'pgvector_docs'}
    )
    
    add_document_to_db(
        tenant_id=str(uuid.uuid4()),
        content="The GIN index is ideal for full-text search using tsvector columns.",
        metadata={'source': 'postgres_docs'}
    )
    
    add_document_to_db(
        tenant_id=str(uuid.uuid4()),
        content="A critical vulnerability, CVE-2023-4863, was found in libwebp.",
        metadata={'source': 'security_bulletin'}
    )
    

    Advanced Querying: The Unified Hybrid Search Function

    This is the core of our system. We will build a PL/pgSQL function that encapsulates the entire hybrid search logic. This provides a clean interface for your application, simplifies query logic, and allows the database's query planner to optimize the entire operation holistically.

    Our function will implement Reciprocal Rank Fusion (RRF). The formula for RRF is simple but powerful:

    RRF_Score(d) = Σ (1 / (k + rank_i(d)))

    Where:

    * d is a document.

    * The sum is over each result set i (in our case, vector search and FTS).

    * rank_i(d) is the rank of document d in result set i.

    * k is a constant to mitigate the impact of high ranks (a common value is 60).

    Unlike weighted score averaging, RRF doesn't care about the absolute score values, only their relative rank. This makes it incredibly robust when combining search systems with different scoring mechanisms, like our vector distance and FTS rank.

    Here is the complete PL/pgSQL function.

    sql
    CREATE OR REPLACE FUNCTION hybrid_search(
        p_tenant_id UUID,
        p_query_text TEXT,
        p_embedding VECTOR(768),
        p_match_limit INT,
        p_fts_weight REAL := 1.0, -- Not used for RRF, but useful for weighted sum
        p_vector_weight REAL := 1.0, -- Not used for RRF, but useful for weighted sum
        p_rrf_k INT := 60
    )
    RETURNS TABLE (id UUID, content TEXT, metadata JSONB, score REAL)
    AS $$
    BEGIN
        RETURN QUERY
        WITH vector_results AS (
            -- Perform vector search
            SELECT 
                d.id,
                -- Cosine distance is 0 for identical, 2 for opposite. Convert to similarity (1 to -1).
                -- We'll use 1 - (distance / 2) to get a 0-1 similarity score.
                1 - (d.embedding <=> p_embedding) AS score,
                ROW_NUMBER() OVER (ORDER BY d.embedding <=> p_embedding ASC) AS rank
            FROM documents d
            WHERE d.tenant_id = p_tenant_id
            ORDER BY d.embedding <=> p_embedding ASC
            LIMIT p_match_limit
        ), fts_results AS (
            -- Perform full-text search
            SELECT 
                d.id,
                -- ts_rank_cd normalizes the rank score from 0 to 1
                ts_rank_cd(d.content_tsvector, websearch_to_tsquery('english', p_query_text)) AS score,
                ROW_NUMBER() OVER (ORDER BY ts_rank_cd(d.content_tsvector, websearch_to_tsquery('english', p_query_text)) DESC) AS rank
            FROM documents d
            WHERE d.tenant_id = p_tenant_id
              AND d.content_tsvector @@ websearch_to_tsquery('english', p_query_text)
            ORDER BY score DESC
            LIMIT p_match_limit
        ), 
        -- Reciprocal Rank Fusion (RRF) implementation
        fused_results AS (
            SELECT 
                COALESCE(vr.id, fr.id) AS id,
                -- Calculate RRF score
                COALESCE(1.0 / (p_rrf_k + vr.rank), 0.0) + COALESCE(1.0 / (p_rrf_k + fr.rank), 0.0) AS rrf_score
            FROM vector_results vr
            FULL OUTER JOIN fts_results fr ON vr.id = fr.id
        )
        SELECT 
            d.id,
            d.content,
            d.metadata,
            fr.rrf_score
        FROM fused_results fr
        JOIN documents d ON fr.id = d.id
        ORDER BY fr.rrf_score DESC
        LIMIT p_match_limit;
    END;
    $$ LANGUAGE plpgsql;
    

    Deconstructing the Query

  • vector_results CTE: This performs the Approximate Nearest Neighbor (ANN) search.
  • * ORDER BY d.embedding <=> p_embedding is the core operation that utilizes the HNSW index.

    * We compute a rank using ROW_NUMBER(), which is essential for RRF.

    * Crucially, we also calculate a normalized similarity score 1 - (d.embedding <=> p_embedding). While not used by RRF, it's good practice and vital if you were to fall back to a weighted sum approach.

  • fts_results CTE: This performs the keyword search.
  • * websearch_to_tsquery is a user-friendly function that converts plain text into a valid tsquery, handling things like quotes and operators.

    * ts_rank_cd is used to score the relevance. It's a good general-purpose ranking function.

    * Again, ROW_NUMBER() provides the rank for RRF.

  • fused_results CTE: This is the heart of the RRF implementation.
  • * A FULL OUTER JOIN is used on the document IDs. This is critical because a document might appear in one result set but not the other. An INNER JOIN would discard these valuable unique results.

    * COALESCE(vr.id, fr.id) ensures we get the ID from whichever result set it appeared in.

    * The RRF score is calculated. COALESCE(..., 0.0) handles cases where a document is only in one list, correctly assigning a score of 0 from the list it's absent from.

  • Final SELECT: We join back to the documents table to retrieve the full content and metadata, order by the final rrf_score, and apply the final limit.
  • Calling the Function from Python

    python
    import psycopg2
    import psycopg2.extras
    from sentence_transformers import SentenceTransformer
    
    # Reuse the model
    EMBEDDING_MODEL = SentenceTransformer('all-MiniLM-L6-v2')
    
    def search(tenant_id: str, query: str, limit: int = 10):
        conn = None
        try:
            conn = psycopg2.connect(
                dbname="your_db", 
                user="your_user", 
                password="your_password", 
                host="localhost"
            )
            cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
    
            # Generate query embedding
            embedding = EMBEDDING_MODEL.encode(query).tolist()
    
            # Execute the function
            cur.execute(
                "SELECT * FROM hybrid_search(%s, %s, %s::VECTOR, %s)", 
                (tenant_id, query, embedding, limit)
            )
    
            results = cur.fetchall()
            cur.close()
            return results
    
        except (Exception, psycopg2.DatabaseError) as error:
            print(f"Error: {error}
    ")
        finally:
            if conn:
                conn.close()
    
    # Example Query:
    # This query has both a semantic component ("vulnerability") and a specific keyword ("CVE-2023-4863")
    query_term = "vulnerability in CVE-2023-4863"
    results = search(tenant_id="your_tenant_id", query=query_term)
    
    for row in results:
        print(f"ID: {row['id']}")
        print(f"Score: {row['score']:.4f}")
        print(f"Content: {row['content'][:100]}...")
        print("---")
    

    This architecture provides a clean, maintainable, and highly performant solution. The application layer is blissfully unaware of the complexity of hybrid search; it simply calls a function.


    Performance Tuning and Production Edge Cases

    Deploying this to production requires attention to performance and potential pitfalls.

    1. Analyzing Query Performance with `EXPLAIN ANALYZE`

    The most critical tool in your arsenal is EXPLAIN ANALYZE. Let's analyze our hybrid search query.

    sql
    EXPLAIN ANALYZE
    -- ... (the body of the function) ...

    You should see a query plan where:

    * The vector_results CTE uses the HNSW index (Index Scan using idx_documents_embedding).

    * The fts_results CTE uses the GIN index (Bitmap Heap Scan on documents using idx_documents_tsvector).

    * The database executes these two scans in parallel.

    * The FULL OUTER JOIN is performed on the limited, much smaller result sets, which is highly efficient.

    Tuning ef_search: If you notice low recall (relevant documents aren't being returned) from the vector search portion, you can tune the ef_search parameter at query time. This increases the size of the search graph explored during the HNSW scan, improving recall at the cost of latency.

    sql
    -- Set for the current session before running the query
    SET hnsw.ef_search = 100;
    SELECT * FROM hybrid_search(...);
    -- Reset to default
    RESET hnsw.ef_search;

    2. The Multi-Tenancy Pagination Problem

    Our schema includes a tenant_id, and our function correctly filters by it. This is crucial for security and data isolation. The WHERE d.tenant_id = p_tenant_id clause in both CTEs ensures that the index scans are constrained to the relevant data partition, which is highly efficient if you have a composite index or the table is partitioned by tenant_id.

    Pagination: Implementing pagination correctly is simple with this function-based approach. The final ORDER BY and LIMIT are applied after the fusion. To get the second page of results, you would use standard LIMIT and OFFSET on the function's output.

    sql
    -- Get page 2 (results 11-20)
    SELECT * FROM hybrid_search(...) 
    ORDER BY score DESC -- The function already orders, but explicit is good
    LIMIT 10 OFFSET 10;

    Why this is safe: Because the fusion happens before the final pagination, the ranking is consistent across all pages. A naive application-layer approach that fetches 10 vector results and 10 FTS results, merges them, and then tries to paginate will produce incorrect and unstable rankings from page to page.

    3. Asynchronous Embedding Generation

    The data ingestion script we wrote is synchronous. For every INSERT, the application waits for the embedding model to run. In a high-throughput write environment, this is a major bottleneck. A production architecture should offload embedding generation to a background worker queue.

  • Initial Insert: The application inserts the document with a NULL embedding and a status field (e.g., 'PENDING_EMBEDDING'). This is a very fast operation.
  • Job Queue: The application pushes the document_id to a message queue like RabbitMQ or a job queue like Celery.
  • Background Worker: A separate pool of workers consumes from this queue. A worker fetches the document content, generates the embedding, and then performs an UPDATE on the documents table to set the embedding vector and change the status to 'COMPLETED'.
  • This decouples the write path from the CPU-intensive ML inference, ensuring your application's write latency remains low and stable.


    Benchmark: RRF vs. Weighted Sum

    To demonstrate the superiority of RRF, let's consider a scenario.

    Query: Fixing libwebp bug CVE-2023-4863

    * Vector Search: Might rank documents about "image library security fixes" or "memory corruption vulnerabilities" highly. Let's say it ranks the true CVE document at position #8.

    * FTS: Will rank the document containing the exact string CVE-2023-4863 at position #1.

    Weighted Sum Approach:

    Let's say the CVE document gets a semantic similarity score of 0.6 (moderately relevant) and an FTS score of 1.0 (perfect match). A different, more general document about memory bugs gets a semantic score of 0.9 and an FTS score of 0.1. With a 50/50 weighting, their final scores would be:

    CVE Doc: 0.5 0.6 + 0.5 * 1.0 = 0.8

    General Doc: 0.5 0.9 + 0.5 * 0.1 = 0.5

    This works, but what if the FTS scores were on a different scale (e.g., 0-100)? Or if the vector scores were unusually high? The weighting (alpha) becomes a finicky parameter you have to constantly tune.

    RRF Approach (k=60):

    * CVE Doc: (1 / (60 + 8)) [from vector] + (1 / (60 + 1)) [from FTS] = 0.0147 + 0.0164 = 0.0311

    * General Doc (Rank #1 from vector, doesn't appear in FTS): (1 / (60 + 1)) + 0 = 0.0164

    The CVE document is correctly ranked higher. RRF rewards documents that appear high up in any of the rankings, and its score gracefully combines evidence from multiple sources without being sensitive to the magnitude of the scores themselves.

    Conclusion

    For senior engineers building sophisticated RAG systems, moving beyond simplistic single-mode retrieval is a necessity. By leveraging the power of PostgreSQL as a unified platform for structured data, full-text search, and vector search, we can build an exceptionally powerful and maintainable hybrid search engine.

    The key takeaways are:

    * Architecture Matters: Use a parallel search pattern with CTEs for efficiency.

    * Encapsulate Logic: A PL/pgSQL function provides a clean, high-performance abstraction for your application.

    * Fusion is Key: Reciprocal Rank Fusion (RRF) is a robust, production-proven method for merging results from disparate ranking systems, consistently outperforming naive weighted-score approaches.

    * Tune for Production: Use EXPLAIN ANALYZE to validate index usage, tune parameters like ef_search, and implement asynchronous pipelines for write-heavy workloads.

    This approach allows you to build RAG systems that are not only semantically aware but also precise, delivering a significantly better user experience by finding the exact information users are looking for, whether it's hidden in a concept or a keyword.

    Found this article helpful?

    Share it with others who might benefit from it.

    More Articles