Hybrid Search in Postgres: Fusing pgvector and BM25 with RRF

18 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Dichotomy of Modern Search: Why Your Single-Method Approach is Flawed

In production search systems, particularly for e-commerce or document retrieval, the architectural choice between semantic vector search and traditional keyword-based full-text search (FTS) presents a false dichotomy. Senior engineers recognize that deploying either in isolation creates predictable, business-impacting failure modes.

Vector Search's Achilles' Heel: While transformative for understanding user intent and semantic relationships (e.g., mapping "durable phone cover" to a product named "rugged smartphone case"), vector search models often struggle with specificity. They can fail to prioritize exact matches for model numbers, SKUs, brand names, or technical terms that were underrepresented in their training data. A query for "iPhone 15 Pro Max case" might surface a semantically similar "iPhone 14 Pro case" if the model prioritizes the concept of "Pro Max case" over the specific model number 15.

FTS's Semantic Blindness: Conversely, algorithms like BM25, the workhorse of lexical search, are ruthlessly precise but semantically naive. They excel at finding documents containing the exact query terms, ranked by term frequency (TF) and inverse document frequency (IDF). However, they have no inherent understanding that "laptop sleeve" and "notebook protector" are conceptually identical. This lexical gap leads to missed recall and a frustrating user experience.
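To make the lexical gap concrete, here is a quick check in psql (a minimal sketch using the built-in english text-search configuration) showing that stemming alone cannot bridge synonyms:

sql
-- Stemming normalizes word forms but has no synonym knowledge
SELECT to_tsvector('english', 'laptop sleeve') @@ to_tsquery('english', 'notebook & protector');
-- Returns false: 'laptop'/'notebook' and 'sleeve'/'protector' are unrelated lexemes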

This article architects a superior solution: a hybrid search system implemented entirely within PostgreSQL. We will leverage the pgvector extension for state-of-the-art approximate nearest neighbor (ANN) search and combine it with a robust FTS implementation using a custom, from-scratch BM25 ranking function. The critical innovation lies in fusing these two distinct, ranked result sets into a single, superior list using Reciprocal Rank Fusion (RRF), all within a single, albeit complex, SQL query. This approach provides the best of both worlds while avoiding the operational overhead of synchronizing data with an external search engine like Elasticsearch or OpenSearch.


Architectural Blueprint: Schema, Indexing, and Data Flow

Our foundation is a well-designed PostgreSQL schema that accommodates both lexical and semantic data representations. We'll use a trigger to maintain data integrity and select specialized index types for high-performance queries.

1. Table and Extension Setup

First, ensure the necessary extensions are enabled. We need the vector extension for pgvector; btree_gin can sometimes be useful when combining FTS with scalar-column filters, though we will primarily use a standard GIN index.

sql
-- Enable the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
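-- Optional, only if you later combine GIN with scalar columns (see note above):
-- CREATE EXTENSION IF NOT EXISTS btree_gin;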

Next, we define our products table. The key columns are embedding for our vectors and fts_document for our pre-processed tsvector.

sql
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    sku VARCHAR(50) UNIQUE NOT NULL,
    name VARCHAR(255) NOT NULL,
    description TEXT,
    -- Storing the 384-dimensional vector from our chosen model
    embedding vector(384),
    -- Pre-calculated tsvector for full-text search
    fts_document tsvector
);

2. Automated `tsvector` Generation with a Trigger

Manually maintaining the fts_document column is an anti-pattern prone to error and data drift. A trigger function is the production-standard approach to ensure the tsvector is always synchronized with the textual content.

We'll combine the name and description fields, assigning different weights. A match in the product name is typically more relevant than a match in its description. Here, we assign an 'A' weight to the name and a 'B' weight to the description.

sql
CREATE OR REPLACE FUNCTION update_products_tsvector() RETURNS trigger AS $$
BEGIN
    -- Coalesce fields to handle potential NULLs
    -- Set weights: 'A' for name, 'B' for description. 'A' is the highest weight.
    NEW.fts_document := 
        setweight(to_tsvector('english', coalesce(NEW.name, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(NEW.description, '')), 'B');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Create the trigger to execute the function before any insert or update
CREATE TRIGGER products_tsvector_update
BEFORE INSERT OR UPDATE ON products
FOR EACH ROW EXECUTE FUNCTION update_products_tsvector();
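One operational note: the trigger fires only on writes that happen after it is created. If products already holds rows, a no-op update (a small backfill trick, not part of the schema above) recomputes fts_document for every row:

sql
-- Touch every row so the BEFORE UPDATE trigger recomputes fts_document
UPDATE products SET name = name;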

3. The Crucial Indexing Strategy

This is where performance is won or lost. We need two distinct, highly specialized indexes:

  • HNSW for Vector Search: For high-speed Approximate Nearest Neighbor (ANN) search, a Hierarchical Navigable Small World (HNSW) index is the state of the art. It typically offers better recall/latency trade-offs than IVFFlat, at the cost of slower index builds and higher memory use. We'll tune m (connections per layer) and ef_construction (size of the dynamic candidate list during build).
  • GIN for Full-Text Search: A Generalized Inverted Index (GIN) is the standard for tsvector columns. It efficiently maps lexemes to the rows containing them, making it ideal for FTS queries.
sql
-- HNSW index for fast ANN search on the embedding vector
-- m = 16 and ef_construction = 64 are reasonable starting points.
CREATE INDEX ON products USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);

-- GIN index for fast full-text search on the tsvector
CREATE INDEX ON products USING gin (fts_document);

A Note on vector_cosine_ops: We are explicitly creating the index for cosine similarity. If your embedding model is optimized for L2 distance or inner product, you must use vector_l2_ops or vector_ip_ops respectively. Mismatching the distance metric between your model and your index will produce incorrect results.
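As a quick reference, each operator class pairs with one query operator, and the index is only used when the two match (a sketch, reusing the document's :query_embedding placeholder):

sql
-- vector_cosine_ops pairs with <=> (cosine distance)
-- vector_l2_ops     pairs with <-> (Euclidean/L2 distance)
-- vector_ip_ops     pairs with <#> (negative inner product)
SELECT id, name
FROM products
ORDER BY embedding <=> :query_embedding  -- matches the vector_cosine_ops index above
LIMIT 5;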


Populating the System: Embedding Generation

With the database schema ready, we need to populate it. This involves an offline process where we take our product data, generate vector embeddings using a sentence-transformer model, and insert it into the table. The trigger we created will handle the tsvector automatically.

For this example, we'll use the all-MiniLM-L6-v2 model, which produces 384-dimensional vectors and offers a good balance of performance and quality.

python
# requirements: sentence-transformers, psycopg2-binary, pgvector
import psycopg2
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

# --- Configuration ---
DB_CONFIG = {
    "dbname": "search_db",
    "user": "user",
    "password": "password",
    "host": "localhost",
    "port": "5432"
}
MODEL_NAME = 'all-MiniLM-L6-v2'  # 384-dimensional vectors

# --- Sample Data ---
products_data = [
    {"sku": "A-101", "name": "Durable Waterproof Hiking Backpack", "description": "A 50L capacity backpack designed for multi-day treks. Features reinforced stitching and a built-in rain cover."},
    {"sku": "B-202", "name": "Lightweight Running Shoes", "description": "Breathable mesh shoes with cushioned soles, perfect for marathon training or daily jogs."},
    {"sku": "C-303", "name": "Professional DSLR Camera Kit", "description": "Includes a 24MP camera body, 18-55mm lens, and a carrying case. Ideal for photographers."},
    {"sku": "D-404", "name": "Ergonomic Office Chair", "description": "A comfortable mesh chair with lumbar support and adjustable armrests for long work hours."},
    {"sku": "E-505", "name": "Smart LED TV 55-inch 4K UHD", "description": "A television with vibrant colors, HDR support, and built-in streaming apps. Model X-VISION-55."},
    {"sku": "F-606", "name": "Noise-Cancelling Bluetooth Headphones", "description": "Over-ear headphones with 30-hour battery life and superior sound isolation technology."},
]

# --- Main Script ---
def main():
    print(f"Loading sentence transformer model: {MODEL_NAME}...")
    model = SentenceTransformer(MODEL_NAME)

    try:
        with psycopg2.connect(**DB_CONFIG) as conn:
            # Teach psycopg2 to adapt numpy arrays to the Postgres vector type
            register_vector(conn)
            with conn.cursor() as cur:
                print("Connection to PostgreSQL successful.")

                for product in products_data:
                    print(f"Processing product SKU: {product['sku']}")

                    # Combine text fields for embedding generation
                    text_to_embed = f"{product['name']}. {product['description']}"

                    # Generate embedding (a numpy array, adapted by register_vector)
                    embedding = model.encode(text_to_embed)

                    # Insert into database
                    # The trigger will automatically populate fts_document
                    cur.execute(
                        """INSERT INTO products (sku, name, description, embedding) 
                           VALUES (%s, %s, %s, %s) ON CONFLICT (sku) DO NOTHING;""",
                        (product['sku'], product['name'], product['description'], embedding)
                    )
                conn.commit()
                print("\nAll products have been processed and inserted.")

    except psycopg2.OperationalError as e:
        print(f"Could not connect to the database: {e}")
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    main()
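After the script runs, a quick sanity check in psql confirms that both representations landed and the trigger fired (vector_dims is a pgvector helper function):

sql
-- Verify the trigger populated fts_document and the embedding has 384 dimensions
SELECT sku, vector_dims(embedding) AS dims, fts_document
FROM products
WHERE sku = 'A-101';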

Implementing the Ranking Algorithms

Now we move to the core of the retrieval logic. We need two separate ranking functions: one for vector similarity and one for keyword relevance (BM25).

1. Vector Similarity Search

pgvector makes this straightforward. The <=> operator calculates cosine distance (1 - cosine similarity). We query for the lowest distance to find the most similar items.

sql
-- Example Vector Search for 'travel bag'
-- Assume :query_embedding is a placeholder for the 384-dim vector
SELECT 
    id, 
    name, 
    embedding <=> :query_embedding AS distance
FROM products
ORDER BY distance
LIMIT 10;

2. Implementing BM25 in PL/pgSQL

PostgreSQL's built-in ts_rank and ts_rank_cd functions rank by term frequency and cover density; they implement neither IDF weighting nor Okapi BM25. BM25 is generally considered superior because it incorporates IDF, document-length normalization, and term-frequency saturation. To use it, we must implement it ourselves as a PL/pgSQL function.
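For contrast, here is what the built-in ranking looks like; it is a usable baseline, but it lacks BM25's IDF and saturation behavior (the normalization flag 1 divides the rank by 1 plus the logarithm of document length):

sql
-- Built-in ranking (not BM25)
SELECT id, name,
       ts_rank(fts_document, to_tsquery('english', 'camera & lens'), 1) AS score
FROM products
WHERE fts_document @@ to_tsquery('english', 'camera & lens')
ORDER BY score DESC;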

BM25's formula is:

Score(q, d) = Σ over query terms qi of [ IDF(qi) · ( TF(qi, d) · (k1 + 1) ) / ( TF(qi, d) + k1 · (1 - b + b · |d| / avgdl) ) ]

Where:

  • IDF(qi) is the inverse document frequency of query term qi.
  • TF(qi, d) is the term frequency of qi in document d.
  • |d| is the length of document d.
  • avgdl is the average document length in the corpus.
  • k1 and b are tuning parameters (typically k1 between 1.2 and 2.0, b around 0.75).

Below is a PL/pgSQL implementation. For clarity it computes the corpus-level statistics (average document length via a subquery, document frequencies via ts_stat) on the fly; as the performance warning after the code explains, a production deployment should precompute and cache these.

sql
CREATE OR REPLACE FUNCTION bm25(
    ts_query text, 
    ts_doc tsvector, 
    k1 float4 := 1.5, 
    b float4 := 0.75
) RETURNS float4 AS $$
DECLARE
    doc_len int := length(ts_doc);
    -- Average document length is computed on the fly here, which is inefficient.
    -- In a real system, store this in a separate stats table and update it periodically.
    avg_doc_len float4 := (SELECT avg(length(fts_document)) FROM products);
    score float4 := 0.0;
    term text;
    term_tf float4;
    term_idf float4;
    total_docs int := (SELECT count(*) FROM products);
    docs_with_term int;
BEGIN
    IF avg_doc_len IS NULL OR avg_doc_len = 0 THEN
        RETURN 0.0;
    END IF;

    -- Loop through each query term (expects pre-normalized lexemes joined by ' & ')
    FOR term IN SELECT unnest(string_to_array(ts_query, ' & ')) LOOP
        -- Document frequency: in how many documents does the term appear?
        SELECT ndoc INTO docs_with_term
        FROM pg_catalog.ts_stat('SELECT fts_document FROM products')
        WHERE word = term;
        IF docs_with_term IS NULL OR docs_with_term = 0 THEN
            CONTINUE;
        END IF;

        -- Calculate IDF (using a common smoothed variant)
        term_idf := ln((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1);

        -- Term frequency in this document: the number of positions recorded for
        -- the lexeme in the tsvector (falls back to 1 if positions were stripped)
        SELECT coalesce(array_length(t.positions, 1), 1) INTO term_tf
        FROM unnest(ts_doc) AS t
        WHERE t.lexeme = term;
        IF term_tf IS NULL THEN
            CONTINUE;  -- term not present in this document
        END IF;

        -- BM25 term score calculation
        score := score + (term_idf * (term_tf * (k1 + 1)) / (term_tf + k1 * (1 - b + b * doc_len / avg_doc_len)));
    END LOOP;

    RETURN score;
END;
$$ LANGUAGE plpgsql STABLE;

Performance Warning: The on-the-fly calculation of avg_doc_len and ts_stat inside the function is highly inefficient for production. A robust implementation would store these corpus-level statistics in a separate table and update them periodically via a background job.
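One way to do that, as a sketch (the corpus_stats table and refresh_corpus_stats function are hypothetical names, not part of the schema above):

sql
-- Hypothetical corpus-level stats cache, refreshed by a background job
CREATE TABLE IF NOT EXISTS corpus_stats (
    id           int PRIMARY KEY DEFAULT 1,
    total_docs   bigint NOT NULL,
    avg_doc_len  float4 NOT NULL,
    refreshed_at timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION refresh_corpus_stats() RETURNS void AS $$
    INSERT INTO corpus_stats (id, total_docs, avg_doc_len, refreshed_at)
    SELECT 1, count(*), coalesce(avg(length(fts_document)), 0), now()
    FROM products
    ON CONFLICT (id) DO UPDATE
        SET total_docs   = EXCLUDED.total_docs,
            avg_doc_len  = EXCLUDED.avg_doc_len,
            refreshed_at = EXCLUDED.refreshed_at;
$$ LANGUAGE sql;

-- Schedule e.g. hourly via cron or pg_cron: SELECT refresh_corpus_stats();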

Now we can run a BM25-ranked FTS query:

sql
-- Example BM25 Search for 'camera lens'
SELECT 
    id, 
    name, 
    bm25('camera & lens', fts_document) AS score
FROM products
WHERE fts_document @@ to_tsquery('english', 'camera & lens')
ORDER BY score DESC
LIMIT 10;

The Fusion Algorithm: Reciprocal Rank Fusion (RRF) in SQL

We now have two independent, ranked lists of results. The challenge is to combine them intelligently. Naively adding or multiplying scores is problematic because the score ranges are completely different (cosine distance is bounded between 0 and 2, while BM25 is unbounded). Normalization is an option, but it can be brittle.

Reciprocal Rank Fusion (RRF) provides an elegant, parameter-free solution. It disregards the raw scores and focuses solely on the rank of each item in its respective list. The RRF score for a document is the sum of the reciprocals of its ranks across all lists.

RRF_Score(d) = Σ over result lists of 1 / (k + rank(d))

  • rank(d) is the rank of document d in a given result list.
  • k is a constant that dampens the contribution of lower-ranked items (a common value is 60).

We can implement this entire logic in a single, powerful SQL query using Common Table Expressions (CTEs).
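As a quick worked example with k = 60: a document ranked 1st by vector search and 3rd by FTS receives 1/(60+1) + 1/(60+3) ≈ 0.0164 + 0.0159 = 0.0323, while a document that appears 2nd in only one list receives 1/(60+2) ≈ 0.0161. Appearing in both lists roughly doubles a document's score, which is exactly the consensus-rewarding behavior we want from fusion.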

The Unified Hybrid Search Query

This query is the centerpiece of our system. It performs both searches, ranks each result set independently, and then fuses the results.

sql
-- Full Hybrid Search Query for: "rugged outdoor camera"
-- Placeholders: :query_embedding (vector), :query_text ('rugged & outdoor & camera'), :fts_query_text ('rugged | outdoor | camera')
-- Note: the BM25 scorer splits :query_text on ' & ', while the OR-form :fts_query_text keeps the candidate filter broad.

WITH 
-- Step 1: Vector Search Results
-- The inner ORDER BY ... LIMIT lets the planner satisfy the top-N via the HNSW
-- index; without a LIMIT, Postgres falls back to a full scan and sort.
vector_results AS (
    SELECT 
        id,
        -- Rank is determined by the ordering
        ROW_NUMBER() OVER (ORDER BY distance) as rank,
        -- The raw score/distance can be useful for debugging
        distance
    FROM (
        SELECT id, embedding <=> :query_embedding as distance
        FROM products
        ORDER BY embedding <=> :query_embedding
        LIMIT 200
    ) top_vector
),
-- Step 2: Full-Text Search (BM25) Results
fts_results AS (
    SELECT 
        id, 
        ROW_NUMBER() OVER (ORDER BY bm25(:query_text, fts_document) DESC) as rank,
        bm25(:query_text, fts_document) as score
    FROM products
    -- The @@ operator is the filter; it uses the GIN index efficiently
    WHERE fts_document @@ to_tsquery('english', :fts_query_text)
),
-- Step 3: Reciprocal Rank Fusion (RRF)
-- k is the ranking constant, typically 60.
-- We use a FULL OUTER JOIN to include documents that appear in only one result set.
fused_results AS (
    SELECT
        COALESCE(v.id, f.id) as id,
        -- Calculate RRF score. If a doc is not in a list, its rank component is 0.
        (CASE WHEN v.rank IS NULL THEN 0.0 ELSE 1.0 / (60 + v.rank) END) + 
        (CASE WHEN f.rank IS NULL THEN 0.0 ELSE 1.0 / (60 + f.rank) END) as rrf_score
    FROM vector_results v
    FULL OUTER JOIN fts_results f ON v.id = f.id
)
-- Step 4: Final Selection and Ordering
SELECT 
    p.id,
    p.name,
    p.sku,
    fr.rrf_score
FROM fused_results fr
JOIN products p ON fr.id = p.id
ORDER BY fr.rrf_score DESC
LIMIT 20;
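Before trusting the plan analysis below, it is worth profiling each branch in isolation. For example, profiling the FTS candidate filter on its own (on a non-trivial table you should see a bitmap scan over the GIN index):

sql
-- Profile timings, buffer usage, and the chosen index path for the FTS branch
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM products
WHERE fts_document @@ to_tsquery('english', 'rugged | outdoor | camera');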

Analysis of the Query Plan:

  • The vector_results CTE will perform an Index Scan on the HNSW index, driven by the inner ORDER BY ... LIMIT.
  • The fts_results CTE will perform a Bitmap Heap Scan using the GIN index.
  • PostgreSQL can apply intra-query parallelism to parts of these scans, but the two CTEs are not executed concurrently as independent queries.
  • The FULL OUTER JOIN is the fusion step, combining the two lists.
  • The final JOIN back to the products table retrieves the full document data for the top-ranked results.

This architecture ensures that a product like the "Professional DSLR Camera Kit" (SKU C-303) will rank highly for a semantic query like "photo taking gear" (via vector search) and also for a specific keyword query like "24MP DSLR" (via FTS), with the RRF score reflecting its relevance in both contexts.


Advanced Considerations and Edge Cases

Building a production-ready system requires addressing several edge cases.

1. Performance Tuning and `EXPLAIN ANALYZE`

Always profile your queries with EXPLAIN ANALYZE. Key parameters to tune:

  • hnsw.ef_search: This session-level parameter controls the size of the dynamic candidate list during HNSW search. Higher values increase accuracy (recall) at the cost of latency. You can set it before your query: SET LOCAL hnsw.ef_search = 100;.
  • work_mem: Sufficient work_mem is critical for GIN index performance and for the hash join that will likely be used in the fusion step.
  • Pre-filtering: If the user applies filters (e.g., category = 'electronics'), it's crucial to apply these before the expensive search operations. Add the WHERE clause to both the vector_results and fts_results CTEs, as shown in the sketch after this list. This dramatically reduces the search space.
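A minimal sketch of both tunings together, assuming a category column exists on products (it is not part of the schema defined above):

sql
-- Raise recall for this transaction only, then pre-filter both branches
BEGIN;
SET LOCAL hnsw.ef_search = 100;

WITH vector_results AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY distance) AS rank
    FROM (
        SELECT id, embedding <=> :query_embedding AS distance
        FROM products
        WHERE category = 'electronics'          -- filter the vector branch
        ORDER BY embedding <=> :query_embedding
        LIMIT 200
    ) top_vector
),
fts_results AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY bm25(:query_text, fts_document) DESC) AS rank
    FROM products
    WHERE category = 'electronics'              -- same filter on the FTS branch
      AND fts_document @@ to_tsquery('english', :fts_query_text)
)
SELECT COALESCE(v.id, f.id) AS id,
       COALESCE(1.0 / (60 + v.rank), 0) + COALESCE(1.0 / (60 + f.rank), 0) AS rrf_score
FROM vector_results v
FULL OUTER JOIN fts_results f ON v.id = f.id
ORDER BY rrf_score DESC
LIMIT 20;
COMMIT;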
2. Weighting Vector vs. FTS Results

While RRF is parameter-free, you might want to give more importance to one search method. You can introduce a weighting factor into the fusion formula:

sql
-- Weighted RRF inside the fused_results CTE (:w_vector and :w_fts supplied as parameters)
(:w_vector * (CASE WHEN v.rank IS NULL THEN 0.0 ELSE 1.0 / (60 + v.rank) END)) +
(:w_fts * (CASE WHEN f.rank IS NULL THEN 0.0 ELSE 1.0 / (60 + f.rank) END)) as rrf_score

Where w_vector + w_fts = 1.0. For example, for queries containing model numbers or SKUs (detectable via regex), you might dynamically set w_fts to 0.7 and w_vector to 0.3.

3. Handling Pagination

Standard OFFSET/LIMIT pagination becomes inefficient on large result sets. Keyset pagination (or "cursor-based" pagination) is the superior pattern. Order by rrf_score DESC, id ASC; because the two sort directions differ, a single row-value comparison such as (rrf_score, id) < (:last_score, :last_id) does not express the right boundary, so subsequent pages need an explicit predicate: WHERE rrf_score < :last_score OR (rrf_score = :last_score AND id > :last_id).
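A sketch of a next-page fetch, assuming the fused results have been materialized into a hypothetical hybrid_results relation (for example a materialized view over the fusion query):

sql
-- Fetch the page after the last row seen (:last_score, :last_id)
SELECT id, rrf_score
FROM hybrid_results
WHERE rrf_score < :last_score
   OR (rrf_score = :last_score AND id > :last_id)
ORDER BY rrf_score DESC, id ASC
LIMIT 20;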

4. Zero-Result Edge Cases

The FULL OUTER JOIN gracefully handles cases where one of the search methods returns no results. If FTS finds nothing, f.rank will be NULL, its RRF component will be zero, and the final ranking will depend solely on the vector results, and vice versa. This makes the system resilient.

Conclusion: An Integrated Search Powerhouse

By moving beyond monolithic search strategies and embracing a hybrid model, you can build a significantly more intelligent and resilient search experience. This deep dive has demonstrated that it's not only possible but also highly effective to construct this sophisticated system entirely within PostgreSQL, leveraging pgvector for semantic understanding, a custom BM25 function for lexical precision, and Reciprocal Rank Fusion for elegant result combination.

This architecture avoids the latency and operational complexity of external search clusters for many common use cases. It centralizes your data and logic, simplifies your stack, and provides a powerful, tunable foundation for a production-grade search API. The provided SQL patterns, while complex, are a template for building a state-of-the-art search engine directly on top of your primary application database.
