Hybrid Search with pgvector & BM25 for Advanced RAG Systems

October 5, 2025

19 min read

Goh Ling Yong

Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Semantic Gap in Production RAG Systems

Retrieval-Augmented Generation (RAG) has become a cornerstone of modern LLM applications. The standard implementation, however, relies almost exclusively on dense vector retrieval using embeddings. While powerful for capturing semantic similarity, this approach exhibits a critical weakness in production environments: a surprising fragility to keyword-specific queries. Acronyms, product SKUs, specific function names, or unique identifiers are often poorly represented in the dense vector space, leading to a frustrating semantic gap where the model understands the concept but misses the specifics.

For senior engineers tasked with building robust, reliable AI systems, this is an unacceptable limitation. The solution is not to abandon vector search, but to augment it. This is where hybrid search—the fusion of dense (semantic) and sparse (lexical) retrieval—becomes essential. This article bypasses introductory concepts and dives directly into a production-grade implementation of a sophisticated hybrid search system entirely within PostgreSQL, leveraging the pgvector extension for dense search and advanced SQL techniques to simulate a BM25-like sparse search.

We will cover:

Architecting a Unified Schema: Designing a PostgreSQL table to handle structured metadata, text content, dense vectors, and pre-computed sparse vector representations.

Advanced Indexing: Combining HNSW indexes for approximate nearest neighbor search with GIN indexes for high-performance full-text search.

Simulating BM25 in SQL: A detailed walkthrough of creating custom functions and queries to calculate BM25 scores without an external search engine like Elasticsearch.

Reciprocal Rank Fusion (RRF): Implementing the fusion algorithm to intelligently merge and re-rank results from both retrieval methods.

Performance Tuning and Scalability: Analyzing query plans, tuning index parameters, and strategies for managing large-scale document collections.

1. Architectural Foundation: A Unified Schema in PostgreSQL

Consolidating data into a single system is a significant architectural advantage. It simplifies operations, reduces data synchronization issues, and enables powerful transactional queries that combine different data types. Here's a schema designed for a multi-modal hybrid search system.

sql

-- Ensure required extensions are enabled
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS btree_gin;

-- Main table for documents
CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    source_id TEXT NOT NULL UNIQUE,  -- An external identifier
    document_type TEXT NOT NULL DEFAULT 'text', -- e.g., 'text', 'image', 'pdf_page'
    content TEXT, -- Full text content, can be NULL for non-textual data
    metadata JSONB DEFAULT '{}'::jsonb, -- Structured data, e.g., {'image_url': '...', 'alt_text': '...'} 
    
    -- Dense Vector Representation (for pgvector)
    embedding VECTOR(768), -- Dimension depends on your model, e.g., 768 for MiniLM-L6-v2
    
    -- Sparse Vector Representation (for BM25)
    -- Storing term frequencies as a JSONB object: { 'term': count, ... }
    term_frequencies JSONB,
    
    -- Full-Text Search Vector (for PostgreSQL's native FTS)
    content_tsv TSVECTOR,

    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Indexing Strategy

-- 1. HNSW Index for Approximate Nearest Neighbor (ANN) search on embeddings
-- This is the workhorse for semantic search. ef_construction and m are build-time params.
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);

-- 2. GIN Index for Full-Text Search on the tsvector
-- This powers the keyword-based part of our BM25 simulation.
CREATE INDEX ON documents USING gin (content_tsv);

-- 3. GIN Index for term frequencies in our sparse vector
-- Crucial for efficiently finding documents containing specific query terms.
CREATE INDEX ON documents USING gin (term_frequencies jsonb_path_ops);

-- 4. B-Tree index on metadata for filtering (example)
CREATE INDEX ON documents USING gin ((metadata->'tags'));

Architectural Decisions Explained:

* embedding VECTOR(768): This column stores the dense vector generated by a sentence-transformer model. We use vector_cosine_ops for the HNSW index, as cosine similarity is standard for normalized embeddings from these models.

* term_frequencies JSONB: This is the core of our BM25 simulation. Instead of storing a full sparse vector, we pre-calculate and store the term frequencies (TF) for each document. The format {'term': count, ...} is highly indexable with a GIN index.

* content_tsv TSVECTOR: This is a pre-processed representation of the content for PostgreSQL's native Full-Text Search (FTS). It handles stemming and stop-word removal. We'll use this to calculate Inverse Document Frequency (IDF) and for initial document filtering.

* HNSW vs. IVFFlat: We choose HNSW (Hierarchical Navigable Small World) over IVFFlat for our vector index. While IVFFlat can be faster for very high-recall requirements, HNSW generally provides a better balance of speed and accuracy, especially on modern hardware, and doesn't require ivf.probes tuning at query time. The ef_construction and m parameters are tuned during index creation to balance build time and graph quality.

To populate the content_tsv and term_frequencies columns automatically, we'll use a trigger.

sql

-- Function to calculate term frequencies from text
CREATE OR REPLACE FUNCTION calculate_term_frequencies(text_content TEXT) 
RETURNS JSONB AS $$
DECLARE
    lexemes TEXT[];
    lexeme TEXT;
    tf_map JSONB := '{}'::jsonb;
BEGIN
    IF text_content IS NULL THEN
        RETURN '{}'::jsonb;
    END IF;

    lexemes := array(SELECT unnest(string_to_array(regexp_replace(lower(text_content), '[^a-z0-9\s]+', '', 'g'), ' ')));
    
    FOREACH lexeme IN ARRAY lexemes
    LOOP
        IF lexeme <> '' THEN
            tf_map := jsonb_set(
                tf_map,
                ARRAY[lexeme],
                to_jsonb(COALESCE((tf_map->>lexeme)::INT, 0) + 1),
                true
            );
        END IF;
    END LOOP;

    RETURN tf_map;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- Trigger to auto-populate derived columns
CREATE OR REPLACE FUNCTION update_document_derived_columns() 
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' OR NEW.content <> OLD.content THEN
        NEW.content_tsv := to_tsvector('english', NEW.content);
        NEW.term_frequencies := calculate_term_frequencies(NEW.content);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_update_document_derived_columns
BEFORE INSERT OR UPDATE ON documents
FOR EACH ROW
EXECUTE FUNCTION update_document_derived_columns();

This setup ensures our data is always ready for both dense and sparse retrieval queries upon insertion or update.

2. Implementing the Retrieval Components

With the architecture in place, we can now build the Python and SQL logic for each retrieval component.

Component A: Dense Retrieval with `pgvector`

This is the straightforward part. We generate a query embedding and use the <=> operator to find the k-nearest neighbors.

python

import psycopg2
from sentence_transformers import SentenceTransformer

# It's highly recommended to cache the model loading
model = SentenceTransformer('all-MiniLM-L6-v2')

def get_dense_results(query_text: str, conn, limit: int = 10):
    """Performs dense vector search in PostgreSQL."""
    embedding = model.encode(query_text).tolist()
    
    with conn.cursor() as cur:
        cur.execute(
            """SELECT id, 1 - (embedding <=> %s) AS similarity
               FROM documents
               ORDER BY embedding <=> %s
               LIMIT %s;""",
            (str(embedding), str(embedding), limit)
        )
        results = cur.fetchall()
    return results

# Example Usage
# conn = psycopg2.connect(DSN)
# dense_res = get_dense_results("challenges in distributed systems", conn)
# print(dense_res)

Performance Consideration: The query planner will use the HNSW index effectively. However, performance is sensitive to the hnsw.ef_search parameter, which can be set per-transaction. A higher ef_search increases accuracy (recall) at the cost of latency.

sql

-- For higher accuracy queries
BEGIN;
SET LOCAL hnsw.ef_search = 100;
SELECT id, 1 - (embedding <=> '[...embedding...]') AS similarity
FROM documents ORDER BY embedding <=> '[...embedding...]' LIMIT 10;
COMMIT;

This dynamic tuning is crucial for applications that may have different latency/accuracy requirements for different types of queries.

Component B: Sparse Retrieval with BM25 Simulation in SQL

This is where the complexity lies. The Okapi BM25 ranking function is:

$$ ext{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{avgdl})} $$

Where:

$q_i$ is the i-th query term.
$f(q_i, D)$ is the term frequency of $q_i$ in document $D$.
$|D|$ is the length of document $D$.
$avgdl$ is the average document length in the collection.
$k_1$ and $b$ are hyperparameters (typically $k_1 \in [1.2, 2.0]$ and $b=0.75$).

We need to compute this within a single SQL query. It's a multi-step process.

Step 1: Create a SQL function for the BM25 score calculation.

This function will encapsulate the complex formula, making our final query cleaner.

sql

CREATE OR REPLACE FUNCTION bm25_score(
    term_freq INTEGER,           -- f(q_i, D)
    doc_length INTEGER,          -- |D|
    avg_doc_length FLOAT,        -- avgdl
    doc_count_with_term BIGINT,  -- N(q_i)
    total_doc_count BIGINT,      -- N
    k1 FLOAT = 1.5,
    b FLOAT = 0.75
)
RETURNS FLOAT AS $$
DECLARE
    idf FLOAT;
BEGIN
    -- Calculate Inverse Document Frequency (IDF) using a common variant
    idf := log((total_doc_count - doc_count_with_term + 0.5) / (doc_count_with_term + 0.5) + 1.0);
    
    RETURN idf * (term_freq * (k1 + 1)) / (term_freq + k1 * (1 - b + b * (doc_length / avg_doc_length)));
END;
$$ LANGUAGE plpgsql IMMUTABLE;

Step 2: The Master Query for Sparse Retrieval.

This query is advanced. It needs to tokenize the user query, fetch statistics (like avgdl and total_doc_count), and then join back to the documents table to calculate the score for each document that contains at least one query term.

python

import psycopg2
import re

def get_sparse_results(query_text: str, conn, limit: int = 10):
    """Performs BM25-like sparse search in PostgreSQL."""
    # Simple whitespace and punctuation tokenizer
    query_terms = list(set(re.sub(r'[^a-z0-9\s]+', '', query_text.lower()).split()))
    
    if not query_terms:
        return []

    sql_query = """
    WITH query_terms AS (
        SELECT unnest(%(query_terms)s::TEXT[]) as term
    ),
    collection_stats AS (
        -- Pre-calculate average document length and total document count
        SELECT 
            avg(length(content)) as avgdl,
            count(1) as total_docs
        FROM documents
    ),
    doc_term_stats AS (
        -- For each query term, find how many documents contain it
        SELECT
            qt.term,
            count(d.id) as doc_count_with_term
        FROM query_terms qt
        JOIN documents d ON d.term_frequencies ? qt.term
        GROUP BY qt.term
    )
    SELECT 
        d.id,
        -- Sum the BM25 score for each query term present in the document
        SUM(
            bm25_score(
                (d.term_frequencies->>qts.term)::INT, -- f(q_i, D)
                length(d.content), -- |D|
                cs.avgdl,          -- avgdl
                dts.doc_count_with_term, -- N(q_i)
                cs.total_docs,      -- N
                1.5, -- k1
                0.75 -- b
            )
        ) as score
    FROM documents d
    -- Join with a filtered list of terms relevant to this document
    JOIN (
        SELECT term FROM query_terms
    ) qts ON d.term_frequencies ? qts.term
    CROSS JOIN collection_stats cs
    JOIN doc_term_stats dts ON qts.term = dts.term
    -- CRITICAL: Filter to only documents that have at least one query term.
    -- The GIN index on term_frequencies makes this fast.
    WHERE d.term_frequencies ?| (SELECT array_agg(term) FROM query_terms)
    GROUP BY d.id, cs.avgdl, cs.total_docs
    ORDER BY score DESC
    LIMIT %(limit)s;
    """

    with conn.cursor() as cur:
        cur.execute(sql_query, {'query_terms': query_terms, 'limit': limit})
        results = cur.fetchall()
    return results

# Example Usage
# conn = psycopg2.connect(DSN)
# sparse_res = get_sparse_results("PostgreSQL HNSW index tuning", conn)
# print(sparse_res)

Query Breakdown and Performance:

* CTEs (WITH clauses): We use Common Table Expressions to structure the query logically. collection_stats and doc_term_stats compute global values once.

* ?| operator: The WHERE d.term_frequencies ?| array['term1', 'term2'] clause is the performance linchpin. It translates to "does the term_frequencies JSONB contain any key from this array?". This operation is heavily accelerated by the GIN index on term_frequencies, allowing PostgreSQL to rapidly identify a candidate set of documents before performing expensive scoring calculations.

* EXPLAIN ANALYZE: Running EXPLAIN ANALYZE on this query would show an initial fast Bitmap Heap Scan using the GIN index, followed by joins and aggregations on that much smaller subset of documents. Without the GIN index, this query would perform a full table scan and be prohibitively slow.

3. The Fusion Layer: Reciprocal Rank Fusion (RRF)

Now we have two ranked lists of document IDs: one from dense search, one from sparse. We cannot simply add their scores, as they are on different scales and represent different concepts (semantic similarity vs. lexical relevance). Reciprocal Rank Fusion (RRF) is a simple yet remarkably effective algorithm for combining ranked lists without needing score normalization.

The formula for a document's RRF score is:

$$ ext{RRF_score}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)} $$

Where:

$R$ is the set of result lists (in our case, dense and sparse).
$ ext{rank}_r(d)$ is the rank of document $d$ in result list $r$. If $d$ is not in the list, its contribution is 0.

$k$ is a constant to mitigate the impact of high ranks (small rank numbers). A common value is 60.

python

from collections import defaultdict

def reciprocal_rank_fusion(results_lists, k=60):
    """
    Performs Reciprocal Rank Fusion on a list of search result lists.

    Args:
        results_lists: A list of lists, where each inner list contains (id, score) tuples.
        k: The constant used in the RRF formula.

    Returns:
        A sorted list of (id, score) tuples.
    """
    ranked_items = defaultdict(float)
    
    for results in results_lists:
        for rank, (doc_id, _) in enumerate(results, 1):
            ranked_items[doc_id] += 1 / (k + rank)
            
    # Sort by the fused score in descending order
    fused_results = sorted(ranked_items.items(), key=lambda item: item[1], reverse=True)
    return fused_results

# --- Putting It All Together ---
def hybrid_search(query_text: str, conn, limit: int = 10):
    # In a production system, these two calls should be run concurrently
    # e.g., using asyncio or threading, to reduce latency.
    dense_results = get_dense_results(query_text, conn, limit=50) # Fetch more to allow for re-ranking
    sparse_results = get_sparse_results(query_text, conn, limit=50)
    
    fused = reciprocal_rank_fusion([dense_results, sparse_results])
    
    # You might want to fetch the full document content for the top N results
    top_ids = [doc_id for doc_id, score in fused[:limit]]
    
    if not top_ids:
        return []
        
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, content, metadata FROM documents WHERE id = ANY(%s)",
            (top_ids,)
        )
        # Re-order the fetched docs according to the fused rank
        docs_by_id = {doc[0]: doc for doc in cur.fetchall()}
        final_results = [(doc_id, score, docs_by_id[doc_id]) for doc_id, score in fused[:limit] if doc_id in docs_by_id]

    return final_results

# Example Usage
# conn = psycopg2.connect(DSN)
# final_res = hybrid_search("how to fix CVE-2021-44228 in our java services", conn)
# for doc_id, score, doc_data in final_res:
#     print(f"ID: {doc_id}, Score: {score:.4f}, Content: {doc_data[1][:100]}...")

This hybrid approach shines on a query like "how to fix CVE-2021-44228 in our java services". The dense search will find documents about Java vulnerabilities and logging frameworks, while the sparse search will ensure that documents containing the exact identifier "CVE-2021-44228" are ranked extremely highly. RRF will then balance these signals to produce a final ranking that is superior to either method alone.

4. Advanced Considerations and Production Patterns

Edge Case: Score Normalization and Weighted Fusion

While RRF avoids the need for score normalization, sometimes you may want to give more weight to one retriever. A simple modification to RRF is to add a weight multiplier:

$$ ext{Weighted_RRF_score}(d) = \sum_{r \in R} w_r \cdot \frac{1}{k + \text{rank}_r(d)} $$

For example, you could set w_dense = 1.0 and w_sparse = 1.5 to give a slight boost to keyword matches. This requires empirical tuning based on your specific dataset and query patterns.

Asynchronous Indexing Pipeline

The trigger-based approach is simple but can add latency to INSERT operations, especially with the overhead of embedding generation. A robust production architecture uses a message queue (e.g., RabbitMQ, Kafka) for an asynchronous indexing pipeline:

API Service: Receives a new document, saves the raw content to the documents table with embedding and term_frequencies as NULL, and publishes the document id to a message queue.

Indexing Worker: A separate service consumes messages from the queue.

For each id, the worker:

* Fetches the document content from the database.

* Generates the dense embedding using a GPU-accelerated model service.

* Calculates term frequencies.

* Updates the corresponding row in the documents table with the new data.

This decouples the write path from the expensive computation, ensuring low-latency ingestion.

Scaling with Table Partitioning

For collections with hundreds of millions or billions of documents, a single table and its indexes can become unwieldy. PostgreSQL's declarative partitioning can be a powerful tool.

For example, you could partition the documents table by a tenant_id for multi-tenant applications or by a date range (created_at) for time-series data.

sql

-- Example of partitioning by tenant_id
CREATE TABLE documents (
    -- columns as before
    tenant_id INT NOT NULL
) PARTITION BY LIST (tenant_id);

CREATE TABLE documents_tenant_1 PARTITION OF documents FOR VALUES IN (1);
CREATE TABLE documents_tenant_2 PARTITION OF documents FOR VALUES IN (2);
-- etc.

When a query includes a WHERE tenant_id = X clause, PostgreSQL's query planner will perform partition pruning, only scanning the relevant partition and its indexes. This dramatically reduces the search space and improves query performance at scale.

Conclusion

By moving beyond pure vector search and implementing a sophisticated hybrid retrieval system, we can build RAG applications that are significantly more robust and accurate. This approach, which marries the semantic understanding of dense vectors with the lexical precision of BM25, directly addresses the shortcomings of vector-only systems.

Implementing this entire stack within PostgreSQL offers compelling operational advantages, creating a unified, transaction-safe data store for complex AI applications. The techniques explored here—combining HNSW and GIN indexes, simulating BM25 with advanced SQL, and fusing results with RRF—are not just theoretical. They represent a production-ready blueprint for senior engineers to build next-generation search and retrieval systems that are powerful, scalable, and resilient to the nuances of real-world user queries.

The Semantic Gap in Production RAG Systems

1. Architectural Foundation: A Unified Schema in PostgreSQL

Architectural Decisions Explained:

2. Implementing the Retrieval Components

Component A: Dense Retrieval with `pgvector`

Component B: Sparse Retrieval with BM25 Simulation in SQL

3. The Fusion Layer: Reciprocal Rank Fusion (RRF)

4. Advanced Considerations and Production Patterns

Edge Case: Score Normalization and Weighted Fusion

Asynchronous Indexing Pipeline

Scaling with Table Partitioning

Conclusion

Found this article helpful?