Real-time Hybrid Search in Postgres with pgvector and BM25

Goh Ling Yong

The Unimodal Search Dilemma: Why Production Systems Need More

In modern search applications, particularly Retrieval-Augmented Generation (RAG) systems, relying on a single search modality is a significant architectural liability. Senior engineers recognize this trade-off immediately:

  • Pure Dense Retrieval (Vector Search): Excellent for semantic, conceptual, and multilingual queries. It understands user intent. However, it can be lossy. Specific, rare keywords, product SKUs, or exact phrases (e.g., function names in code search) can be 'averaged out' in the embedding space, leading to a failure to retrieve documents that are a perfect keyword match.
  • Pure Sparse Retrieval (Keyword Search): Algorithms like TF-IDF or its successor, Okapi BM25, are masters of precision. They excel at matching exact terms and phrases. Their fatal flaw is the 'keyword gap'—they have no inherent understanding that "laptop" and "notebook computer" are semantically identical. They treat them as entirely different tokens.
    A production-grade search system must serve both query types flawlessly. The solution is hybrid search, but implementing it often involves complex, distributed architectures with separate vector databases (e.g., Pinecone, Weaviate) and text search engines (e.g., Elasticsearch, OpenSearch). This introduces network latency, data synchronization challenges, and operational overhead.

    This article details a robust, performant, and operationally simple alternative: implementing a sophisticated hybrid search system entirely within PostgreSQL by leveraging the pgvector extension and PostgreSQL's powerful programmability.

    We will build a system that:

  • Performs parallel vector and full-text searches in a single query.
  • Implements the BM25 ranking algorithm as a custom PL/pgSQL function, since Postgres's native ts_rank is TF-IDF-based rather than BM25.
  • Fuses the results using Reciprocal Rank Fusion (RRF), a superior rank-based merging strategy.
  • Is optimized for real-time INSERT/UPDATE operations with appropriate indexing strategies.

    System Architecture and Schema Design

    Our goal is to store document text, its corresponding embedding vector, and its pre-processed tsvector for full-text search within a single table. This co-location is key to transactional consistency and query performance.

    Prerequisites

    Ensure your PostgreSQL instance (v12+) has the pgvector extension enabled; the HNSW indexes used below require pgvector 0.5.0 or later:

    sql
    CREATE EXTENSION IF NOT EXISTS vector;

    Table Schema

    Let's define a documents table. We'll assume a 768-dimensional embedding model (like sentence-transformers/all-mpnet-base-v2).

    sql
    CREATE TABLE documents (
        id BIGSERIAL PRIMARY KEY,
        content TEXT NOT NULL,
        content_embedding VECTOR(768),
        content_tsv TSVECTOR,
        created_at TIMESTAMPTZ DEFAULT NOW(),
        updated_at TIMESTAMPTZ DEFAULT NOW()
    );

    Automated `tsvector` Generation

    To ensure the content_tsv column is always synchronized with content, we'll use a trigger. This is a standard production pattern that offloads the logic from the application layer. (On PostgreSQL 12+, a stored generated column, i.e. `content_tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED`, achieves the same result without a trigger.)

    sql
    CREATE OR REPLACE FUNCTION update_documents_tsv() RETURNS TRIGGER AS $$
    BEGIN
        NEW.content_tsv := to_tsvector('english', NEW.content);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;
    
    CREATE TRIGGER tsvector_update_trigger
    BEFORE INSERT OR UPDATE ON documents
    FOR EACH ROW EXECUTE FUNCTION update_documents_tsv();

    Critical Indexing Strategy

    This is where performance is made or broken. We need two distinct indexes for our two search modalities.

  • Vector Search Index (HNSW): For approximate nearest neighbor (ANN) search. HNSW (Hierarchical Navigable Small World) provides an excellent balance of speed and accuracy. The m and ef_construction parameters are critical tuning knobs.
    - m: Maximum number of connections per layer. Higher m improves recall but increases index size and build time.
    - ef_construction: Size of the dynamic candidate list during index construction. Higher values lead to a better-quality index at the cost of build time.
  • Full-Text Search Index (GIN): For tsvector queries. GIN (Generalized Inverted Index) is ideal for indexing composite values whose elements appear many times within the table (i.e., words).

    sql
    -- HNSW Index for Vector Search. The vector_cosine_ops operator class matches
    -- the <=> cosine distance operator used at query time (vector_l2_ops would
    -- leave a <=> query unable to use the index).
    -- A good starting point: m=16, ef_construction=64
    CREATE INDEX ON documents USING hnsw (content_embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);
    
    -- GIN Index for Full-Text Search
    CREATE INDEX ON documents USING gin (content_tsv);

    With this schema, we are prepared to handle both types of queries efficiently. An INSERT will now automatically generate the tsvector and the data will be ready for inclusion in both indexes.


    Implementing BM25 Ranking in PL/pgSQL

    PostgreSQL's built-in ranking functions, ts_rank and ts_rank_cd, are based on TF-IDF. While decent, BM25 is widely considered the state-of-the-art for sparse retrieval ranking. It offers better handling of document length normalization.

    The BM25 formula is:

    $$ \text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})} $$

    Where:

  • IDF(q_i): Inverse document frequency of query term q_i.
  • f(q_i, D): Term frequency of q_i in document D.
  • |D|: Length of document D.
  • avgdl: Average document length in the corpus.
  • k1, b: Hyperparameters (typically k1 is 1.2-2.0, b is 0.75).
    Implementing this directly in SQL is cumbersome. A PL/pgSQL function is the correct approach. It allows us to pre-calculate corpus-wide statistics like avgdl and total document count for the IDF calculation.
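    Before translating the formula into PL/pgSQL, it can help to sanity-check it with a tiny, self-contained Python sketch. The toy corpus and the `bm25_score` helper below are illustrative only; they are not part of the Postgres implementation.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document against a query over a small in-memory corpus.

    corpus: list of documents, each a list of tokens; used only for the
    corpus-wide statistics (document frequencies and average length).
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)                # f(q_i, D)
        if tf == 0:
            continue
        df = sum(1 for d in corpus if term in d)  # docs containing q_i
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avgdl))
    return score

corpus = [
    ["database", "index", "tuning"],
    ["vector", "search", "postgres"],
    ["database", "vector", "hybrid", "search"],
]
# The third document matches both query terms, so it scores highest.
scores = [bm25_score(["database", "search"], d, corpus) for d in corpus]
```

    Note how a document containing every query term outranks one containing only some of them, and how rarer terms (higher IDF) contribute more; these are the behaviors the PL/pgSQL version must reproduce.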

    Here is a PL/pgSQL function for BM25 ranking; note the performance caveat that follows before using it as-is.

    sql
    CREATE OR REPLACE FUNCTION bm25(tsq tsquery, tsv tsvector, k1 float4 := 1.2, b float4 := 0.75)
    RETURNS float4 AS $$
    DECLARE
        -- Pre-calculated corpus statistics. In a real system, you would periodically
        -- refresh these values and store them in a statistics table or as configuration parameters.
        total_docs INT := 1000000; -- Replace with `SELECT count(*) FROM documents;`
        avg_doc_len FLOAT4 := 350.5; -- Replace with `SELECT avg(length(content)) FROM documents;`

        -- Function variables
        lexemes TEXT[];
        lexeme TEXT;
        term_freq FLOAT4;
        doc_len INT;
        idf FLOAT4;
        score FLOAT4 := 0.0;
        docs_with_term INT;
    BEGIN
        -- Extract query lexemes. Casting a tsquery to text quotes each lexeme
        -- ('advanc' & 'databas'), so strip the quotes along with the operators.
        lexemes := string_to_array(regexp_replace(tsq::TEXT, '[()|&!'']', ' ', 'g'), ' ');

        -- Approximate |D| as the number of indexed word occurrences, i.e. the total
        -- number of positions recorded in the tsvector. For perfect accuracy, store
        -- the raw text length alongside the document.
        SELECT COALESCE(sum(array_length(t.positions, 1)), 0)
        INTO doc_len
        FROM unnest(tsv) AS t;

        IF doc_len = 0 THEN
            RETURN 0.0;
        END IF;

        FOREACH lexeme IN ARRAY lexemes
        LOOP
            -- Skip operator residue and empty strings
            IF lexeme = '' OR lexeme IN ('and', 'or', 'not') THEN
                CONTINUE;
            END IF;

            -- True term frequency f(q_i, D): the number of positions recorded for this
            -- lexeme. `bm25.lexeme` qualifies the PL/pgSQL variable to disambiguate it
            -- from the `lexeme` column produced by unnest().
            SELECT COALESCE(array_length(t.positions, 1), 0)
            INTO term_freq
            FROM unnest(tsv) AS t
            WHERE t.lexeme = bm25.lexeme;

            IF term_freq > 0 THEN
                -- Calculate IDF. We need the number of documents containing the term.
                -- THIS IS THE SLOW PART. In production, create a statistics table/materialized view
                -- for document frequencies of each lexeme and query that instead.
                EXECUTE 'SELECT count(*) FROM documents WHERE content_tsv @@ to_tsquery($1)'
                INTO docs_with_term
                USING lexeme;

                IF docs_with_term = 0 THEN
                    idf := 0;
                ELSE
                    idf := ln((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0);
                END IF;

                -- Calculate the BM25 contribution for this term
                score := score + idf * (term_freq * (k1 + 1)) / (term_freq + k1 * (1 - b + b * (doc_len / avg_doc_len)));
            END IF;
        END LOOP;

        RETURN score;
    END;
    $$ LANGUAGE plpgsql STABLE;

    Performance Caveat: The EXECUTE statement inside the loop to get docs_with_term is prohibitively slow for real-time queries. In a production environment, you must pre-calculate document frequencies. Create a materialized view or a table that stores each lexeme and its document count, and refresh it periodically. The function would then query this fast lookup table instead of scanning the main index.


    The Hybrid Query: Parallel Search and RRF Fusion

    Now we combine everything into a single, powerful SQL query. We will use Common Table Expressions (CTEs) to execute the vector and FTS searches independently before fusing their results.

    Fusion Strategy: Reciprocal Rank Fusion (RRF)

    Why not just normalize scores (e.g., min-max) and add them up? Because scores from different systems (vector distance vs. BM25 score) are not directly comparable. Vector distances are about geometric proximity in a high-dimensional space, while BM25 is a probabilistic relevance score. Combining them is like adding apples and oranges.

    Reciprocal Rank Fusion (RRF) solves this elegantly. It's a rank-based method that disregards the absolute scores and focuses only on the position of a document in each result list. The formula for a document d is:

    $$ \text{RRF\_score}(d) = \sum_{r \in \text{rankings}} \frac{1}{k + \text{rank}_r(d)} $$

  • rank_r(d) is the rank of document d in result set r.
  • k is a smoothing constant that damps the advantage of top-ranked items (i.e., it reduces the score gap between rank 1 and rank 2). A common value for k is 60.
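    The computation is easy to sanity-check with a short Python sketch. The document ids and result lists below are made up for illustration:

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document ids using Reciprocal Rank Fusion.

    rankings: iterable of lists ordered best-first. A document absent from a
    list simply contributes nothing from that list, mirroring the
    1 / (k + rank) term going to zero as rank goes to infinity.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]  # best-first output of the vector search
fts_hits = ["d1", "d9", "d3"]     # best-first output of the BM25 search
fused = rrf_fuse([vector_hits, fts_hits])  # → ['d1', 'd3', 'd9', 'd7']
```

    Documents appearing in both lists (d1, d3) rise above documents that rank well in only one list, which is exactly the consensus behavior we want from the SQL query below.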
    The Complete Query

    Here is the full query structure. We'll search for "advanced database optimization techniques".

    sql
    -- Query parameters
    WITH params AS (
        SELECT
            -- Generate the query embedding in your application and pass it as a parameter
            '[...]'::vector(768) AS query_embedding,
            -- to_tsquery applies the same English stemming as to_tsvector, so the query
            -- terms match the stemmed lexemes in content_tsv (a bare ::tsquery cast would not)
            to_tsquery('english', 'advanced & database & optimization & techniques') AS query_tsquery,
            60 AS rrf_k, -- RRF constant
            100 AS search_limit -- How many results to fetch from each search method
    ),
    -- 1. Vector Search (Dense Retrieval)
    -- Uses the HNSW index for fast ANN search.
    vector_search AS (
        SELECT
            id,
            -- The rank is crucial for RRF.
            ROW_NUMBER() OVER (ORDER BY content_embedding <=> (SELECT query_embedding FROM params)) AS rank
        FROM documents
        ORDER BY content_embedding <=> (SELECT query_embedding FROM params)
        LIMIT (SELECT search_limit FROM params)
    ),
    -- 2. Full-Text Search (Sparse Retrieval)
    -- Uses the GIN index and our custom BM25 function.
    fts_search AS (
        SELECT
            id,
            -- The rank is based on our custom BM25 score.
            ROW_NUMBER() OVER (ORDER BY bm25((SELECT query_tsquery FROM params), content_tsv) DESC) AS rank
        FROM documents
        WHERE content_tsv @@ (SELECT query_tsquery FROM params)
        ORDER BY bm25((SELECT query_tsquery FROM params), content_tsv) DESC
        LIMIT (SELECT search_limit FROM params)
    )
    -- 3. Reciprocal Rank Fusion
    -- Combine the results, calculate RRF score, and return the final ranked list.
    SELECT
        COALESCE(vs.id, fs.id) AS id,
        -- Calculate the RRF score.
        -- If a document is in one list but not the other, its rank in the missing list is
        -- effectively infinity, so its 1 / (k + rank) term must contribute 0. COALESCE
        -- therefore wraps the whole fraction, not the rank (a NULL rank coalesced to 0
        -- would instead maximize that term).
        COALESCE(1.0 / ((SELECT rrf_k FROM params) + vs.rank), 0.0) +
        COALESCE(1.0 / ((SELECT rrf_k FROM params) + fs.rank), 0.0) AS rrf_score,
        -- Also return the individual ranks for debugging/analysis
        vs.rank AS vector_rank,
        fs.rank AS fts_rank
    FROM vector_search vs
    FULL OUTER JOIN fts_search fs ON vs.id = fs.id
    ORDER BY rrf_score DESC
    LIMIT 20; -- Final limit for the user

    Query Analysis (`EXPLAIN ANALYZE`)

    Running EXPLAIN ANALYZE on this query is crucial. You should expect to see:

  • For vector_search CTE: An Index Scan using your HNSW index on documents. This should be very fast.
  • For fts_search CTE: A Bitmap Heap Scan using the GIN index on documents. The WHERE content_tsv @@ ... clause should be filtered by the index.
  • For the final JOIN: A Hash Full Outer Join is likely, which is efficient for combining the two relatively small, pre-filtered result sets (search_limit of 100 each).
    If you see a Sequential Scan on the documents table in either of the first two CTEs, your indexes are not being used, and performance will be abysmal. This could be due to outdated table statistics (ANALYZE documents;), incorrect query formulation, or an issue with the index itself.


    Production Considerations and Edge Cases

    Deploying this system requires attention to detail beyond the query itself.

    Real-time Updates and Index Bloat

  • HNSW Index: pgvector's HNSW graph is insert-only. UPDATEs and DELETEs leave dead tuples behind rather than immediately reclaiming space, so frequent VACUUM operations are essential to prevent index bloat and maintain performance. For write-heavy workloads, you may need to schedule periodic REINDEX operations during maintenance windows; a plain REINDEX takes an exclusive lock, while REINDEX CONCURRENTLY (PostgreSQL 12+) avoids it at the cost of extra disk space and build time.
  • GIN Index: GIN indexes have a fastupdate mechanism that defers index insertion into a pending list. This speeds up writes, but can slow down reads if the pending list grows too large. Autovacuum is usually effective at managing this, but for extremely high-throughput systems, you might need to tune its settings (autovacuum_work_mem, gin_pending_list_limit) to be more aggressive.
    Tuning `ef_search`

    The HNSW index has a query-time parameter, hnsw.ef_search, that controls the size of the search graph to explore. A higher value increases accuracy (recall) at the cost of latency. This can be set per-transaction:

    sql
    BEGIN;
    SET LOCAL hnsw.ef_search = 100; -- Higher value for better recall
    -- ... run your hybrid search query here ...
    COMMIT;

    This allows you to offer different performance/accuracy tiers in your application. A background job might use a high ef_search, while a user-facing API uses a lower, faster value.

    Handling Zero-Result Scenarios

    The FULL OUTER JOIN gracefully handles cases where one of the search methods returns no results. If fts_search is empty, the fs.id and fs.rank columns will be NULL, and the RRF calculation for that component will correctly contribute zero to the final score. The results will be ranked purely based on the vector_search output, which is the desired behavior.

    Tuning the RRF `k` Constant

    The k value in RRF (rrf_k in our query) controls how much to penalize lower-ranked items. A smaller k gives more weight to the top few results. The default of 60 is a good starting point, but you should tune this based on empirical evaluation of your specific dataset and query patterns. If you find that your keyword search is extremely precise and should be trusted more, you could use a weighted RRF, though this adds complexity.
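    As a sketch of that weighted variant, each result list can simply contribute w / (k + rank) instead of 1 / (k + rank). The weights and lists below are illustrative assumptions, not a standard pgvector or Postgres feature:

```python
def weighted_rrf(weighted_rankings, k=60):
    """Weighted RRF: each (weight, ranking) pair contributes weight / (k + rank).

    weighted_rankings: list of (weight, ranking) pairs, rankings best-first.
    Plain RRF is the special case where every weight is 1.0.
    """
    scores = {}
    for weight, ranking in weighted_rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Trust the keyword ranking twice as much as the vector ranking: the keyword
# list's top document (d1) now wins the tie-break.
fused = weighted_rrf([(1.0, ["d3", "d1"]), (2.0, ["d1", "d3"])])
```

    The same effect can be achieved in the SQL query by multiplying each COALESCE'd fraction by a per-source weight, at the cost of another tuning knob to evaluate empirically.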

    Cold Starts and Cache Warming

    PostgreSQL relies heavily on the filesystem cache. The first time a query is run after a server restart, it will be significantly slower as it reads index and table data from disk. For latency-sensitive applications, consider implementing a cache warming procedure that runs a few representative queries upon startup to load key index pages into memory.

    Conclusion

    By combining pgvector, custom PL/pgSQL functions for advanced ranking like BM25, and a well-structured query using CTEs and RRF, it is entirely feasible to build a world-class, real-time hybrid search engine within PostgreSQL. This single-database approach dramatically simplifies the operational stack, ensures transactional consistency between text and vector data, and eliminates network latency inherent in distributed microservice architectures.

    While this pattern requires a deep understanding of PostgreSQL's indexing, query planner, and extensibility, the payoff is a highly performant, scalable, and maintainable search system that can serve as the backbone for sophisticated AI-powered applications.
