Real-time Hybrid Search in Postgres with pgvector and BM25
The Unimodal Search Dilemma: Why Production Systems Need More
In modern search applications, particularly Retrieval-Augmented Generation (RAG) systems, relying on a single search modality is a significant architectural liability. Senior engineers recognize the trade-off immediately:
- Vector search excels at semantic similarity (paraphrases, conceptual matches) but can miss exact terms such as product codes, acronyms, and proper names.
- Full-text (keyword) search nails exact terms but fails when the query and the document express the same concept in different words.
A production-grade search system must serve both query types flawlessly. The solution is hybrid search, but implementing it often involves complex, distributed architectures with separate vector databases (e.g., Pinecone, Weaviate) and text search engines (e.g., Elasticsearch, OpenSearch). This introduces network latency, data synchronization challenges, and operational overhead.
This article details a robust, performant, and operationally simple alternative: implementing a sophisticated hybrid search system entirely within PostgreSQL by leveraging the pgvector extension and PostgreSQL's powerful programmability.
We will build a system that:
- Performs parallel vector and full-text searches in a single query.
- Ranks full-text matches with a BM25 function written in PL/pgSQL, improving on the built-in ts_rank.
- Fuses the results using Reciprocal Rank Fusion (RRF), a superior rank-based merging strategy.
- Handles real-time INSERT/UPDATE operations with appropriate indexing strategies.
System Architecture and Schema Design
Our goal is to store document text, its corresponding embedding vector, and its pre-processed tsvector for full-text search within a single table. This co-location is key to transactional consistency and query performance.
Prerequisites
Ensure your PostgreSQL instance (v12+) has the pgvector extension enabled:
CREATE EXTENSION IF NOT EXISTS vector;
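The HNSW indexes used below require pgvector 0.5.0 or later. A quick sanity check of the installed version:
SELECT extversion FROM pg_extension WHERE extname = 'vector';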
Table Schema
Let's define a documents table. We'll assume a 768-dimensional embedding model (like sentence-transformers/all-mpnet-base-v2).
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
content TEXT NOT NULL,
content_embedding VECTOR(768),
content_tsv TSVECTOR,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
Automated `tsvector` Generation
To ensure the content_tsv column is always synchronized with content, we'll use a trigger. This is a standard production pattern to offload this logic from the application layer.
CREATE OR REPLACE FUNCTION update_documents_tsv() RETURNS TRIGGER AS $$
BEGIN
NEW.content_tsv := to_tsvector('english', NEW.content);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER tsvector_update_trigger
BEFORE INSERT OR UPDATE ON documents
FOR EACH ROW EXECUTE FUNCTION update_documents_tsv();
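A quick sanity check that the trigger fires (the embedding is omitted here purely for illustration):
INSERT INTO documents (content) VALUES ('PostgreSQL makes hybrid search practical.');
SELECT content_tsv FROM documents ORDER BY id DESC LIMIT 1;
-- Expected with the 'english' config:
-- 'hybrid':3 'make':2 'postgresql':1 'practic':5 'search':4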
Critical Indexing Strategy
This is where performance is made or broken. We need two distinct indexes for our two search modalities.
For the HNSW vector index, the m and ef_construction parameters are critical tuning knobs.
- m: Maximum number of connections per layer. Higher m improves recall but increases index size and build time.
- ef_construction: Size of the dynamic candidate list during index construction. Higher values lead to a better-quality index at the cost of build time.
For full-text search, a GIN index accelerates tsvector queries. GIN (Generalized Inverted Index) is ideal for indexing composite values where elements appear many times within the table (i.e., words).
-- HNSW Index for Vector Search
-- A good starting point: m=16, ef_construction=64
-- We use vector_cosine_ops because the hybrid query below orders by the cosine
-- distance operator (<=>); the index opclass must match the query operator.
CREATE INDEX ON documents USING hnsw (content_embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);
-- GIN Index for Full-Text Search
CREATE INDEX ON documents USING gin (content_tsv);
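HNSW builds are memory- and CPU-hungry. Giving the build more maintenance memory, and parallel workers on recent pgvector versions, can shorten build times considerably; the values below are illustrative, not prescriptive:
SET maintenance_work_mem = '2GB';
SET max_parallel_maintenance_workers = 4; -- the leader process participates as well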
With this schema, we are prepared to handle both types of queries efficiently. An INSERT will now automatically generate the tsvector and the data will be ready for inclusion in both indexes.
Implementing BM25 Ranking in PL/pgSQL
PostgreSQL's built-in ranking functions, ts_rank and ts_rank_cd, score documents by term frequency and proximity but apply no corpus-level IDF weighting. BM25, by contrast, is widely considered the standard for sparse retrieval ranking: it down-weights common terms and offers better document length normalization.
The BM25 formula is:
$$ \text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)} $$
Where:
- IDF(q_i): Inverse document frequency of query term q_i.
- f(q_i, D): Term frequency of q_i in document D.
- |D|: Length of document D.
- avgdl: Average document length in the corpus.
- k1, b: Hyperparameters (typically k1 is 1.2-2.0 and b is 0.75).
Implementing this directly in SQL is cumbersome. A PL/pgSQL function is the better approach: it lets us use pre-calculated corpus-wide statistics, such as avgdl and the total document count, for the IDF calculation.
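To make the formula concrete, here is a worked example with illustrative values: term frequency 3, document length 100, avgdl 350.5, k1 = 1.2, b = 0.75, and an assumed IDF of 2.0:
$$ 2.0 \cdot \frac{3 \cdot (1.2 + 1)}{3 + 1.2 \cdot \left(1 - 0.75 + 0.75 \cdot \frac{100}{350.5}\right)} = 2.0 \cdot \frac{6.6}{3.557} \approx 3.71 $$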
Here is a PL/pgSQL function for BM25 ranking. It works as written, but read the performance caveat that follows before putting it in front of real-time traffic.
CREATE OR REPLACE FUNCTION bm25(tsq tsquery, tsv tsvector, k1 float4 := 1.2, b float4 := 0.75)
RETURNS float4 AS $$
DECLARE
-- Pre-calculated corpus statistics. In a real system, you would periodically
-- update these values and store them in a separate table or as configuration parameters.
total_docs INT := 1000000; -- Replace with `SELECT count(*) FROM documents;`
avg_doc_len FLOAT4 := 350.5; -- Replace with `SELECT avg(length(content)) FROM documents;`
-- Function variables
lexemes TEXT[];
lex TEXT; -- loop variable; named to avoid clashing with unnest(tsvector)'s lexeme column
term_freq FLOAT4;
doc_len INT;
idf FLOAT4;
score FLOAT4 := 0.0;
docs_with_term INT;
BEGIN
-- Extract query lexemes: strip the tsquery operators and the quotes around each lexeme
lexemes := string_to_array(regexp_replace(tsq::TEXT, '[()|&!'']', ' ', 'g'), ' ');
-- Approximate |D| as the total number of token occurrences recorded in the tsvector.
-- Note: this is an approximation; for exact accuracy, store the raw token count per row.
SELECT COALESCE(sum(array_length(t.positions, 1)), 0)
INTO doc_len
FROM unnest(tsv) AS t;
IF doc_len = 0 THEN
RETURN 0.0;
END IF;
FOREACH lex IN ARRAY lexemes
LOOP
-- Skip empty strings left over from operator removal
IF lex = '' THEN
CONTINUE;
END IF;
-- f(q_i, D): the true term frequency, read from the lexeme's position list
SELECT COALESCE(array_length(t.positions, 1), 0)
INTO term_freq
FROM unnest(tsv) AS t
WHERE t.lexeme = lex;
term_freq := COALESCE(term_freq, 0); -- no matching row means the term is absent
IF term_freq > 0 THEN
-- Calculate IDF. We need the number of documents containing the term.
-- THIS IS THE SLOW PART. In production, create a statistics table/materialized view
-- for document frequencies of each lexeme (see the caveat below).
-- The 'simple' config is used because lex is already a normalized lexeme.
SELECT count(*) INTO docs_with_term
FROM documents
WHERE content_tsv @@ to_tsquery('simple', lex);
IF docs_with_term = 0 THEN
idf := 0;
ELSE
idf := ln((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0);
END IF;
-- Calculate the BM25 contribution of this term
score := score + idf * (term_freq * (k1 + 1)) / (term_freq + k1 * (1 - b + b * (doc_len / avg_doc_len)));
END IF;
END LOOP;
RETURN score;
END;
$$ LANGUAGE plpgsql STABLE;
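A standalone usage sketch (the query terms are illustrative):
SELECT id,
bm25(to_tsquery('english', 'database & optimization'), content_tsv) AS score
FROM documents
WHERE content_tsv @@ to_tsquery('english', 'database & optimization')
ORDER BY score DESC
LIMIT 10;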
Performance Caveat: The per-lexeme SELECT count(*) inside the loop that computes docs_with_term is prohibitively slow for real-time queries. In a production environment, you must pre-calculate document frequencies: create a materialized view or a table that stores each lexeme and its document count, and refresh it periodically. The function then reads from this fast lookup table instead of scanning the main index.
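One way to pre-compute those document frequencies, sketched here with an assumed name lexeme_stats and built on PostgreSQL's ts_stat:
CREATE MATERIALIZED VIEW lexeme_stats AS
SELECT word AS lexeme, ndoc AS docs_with_term
FROM ts_stat('SELECT content_tsv FROM documents');
-- A unique index is required for REFRESH ... CONCURRENTLY
CREATE UNIQUE INDEX ON lexeme_stats (lexeme);
-- Refresh on a schedule (cron, pg_cron, etc.)
REFRESH MATERIALIZED VIEW CONCURRENTLY lexeme_stats;
The function's inner count then collapses to a single indexed lookup: SELECT docs_with_term FROM lexeme_stats WHERE lexeme = lex.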
The Hybrid Query: Parallel Search and RRF Fusion
Now we combine everything into a single, powerful SQL query. We will use Common Table Expressions (CTEs) to execute the vector and FTS searches independently before fusing their results.
Fusion Strategy: Reciprocal Rank Fusion (RRF)
Why not just normalize scores (e.g., min-max) and add them up? Because scores from different systems (vector distance vs. BM25 score) are not directly comparable. Vector distances are about geometric proximity in a high-dimensional space, while BM25 is a probabilistic relevance score. Combining them is like adding apples and oranges.
Reciprocal Rank Fusion (RRF) solves this elegantly. It's a rank-based method that disregards the absolute scores and focuses only on the position of a document in each result list. The formula for a document d is:
$$ \text{RRF}(d) = \sum_{r \in \text{rankings}} \frac{1}{k + \text{rank}_r(d)} $$
Where:
- rank_r(d) is the rank of document d in result set r.
- k is a smoothing constant that dampens the outsized influence of the very top ranks (i.e., it shrinks the gap between rank 1 and rank 2). A common value for k is 60.
For example, with k = 60, a document ranked 3rd by vector search and 5th by full-text search scores 1/63 + 1/65 ≈ 0.031.
The Complete Query
Here is the full query structure. We'll search for "advanced database optimization techniques".
-- Query parameters
WITH params AS (
SELECT
-- Generate the query embedding in your application and pass it as a parameter
'[...]'::vector(768) AS query_embedding,
-- to_tsquery stems each term so it matches the stemmed content_tsv;
-- a bare ::tsquery cast skips normalization and would silently miss matches.
to_tsquery('english', 'advanced & database & optimization & techniques') AS query_tsquery,
60 AS rrf_k, -- RRF constant
100 AS search_limit -- How many results to fetch from each search method
),
-- 1. Vector Search (Dense Retrieval)
-- Uses the HNSW index for fast ANN search.
vector_search AS (
SELECT
id,
-- The rank is crucial for RRF.
ROW_NUMBER() OVER (ORDER BY content_embedding <=> (SELECT query_embedding FROM params)) AS rank
FROM documents
ORDER BY content_embedding <=> (SELECT query_embedding FROM params)
LIMIT (SELECT search_limit FROM params)
),
-- 2. Full-Text Search (Sparse Retrieval)
-- Uses the GIN index and our custom BM25 function.
fts_search AS (
SELECT
id,
-- The rank is based on our custom BM25 score.
ROW_NUMBER() OVER (ORDER BY bm25((SELECT query_tsquery FROM params), content_tsv) DESC) AS rank
FROM documents
WHERE content_tsv @@ (SELECT query_tsquery FROM params)
ORDER BY bm25((SELECT query_tsquery FROM params), content_tsv) DESC
LIMIT (SELECT search_limit FROM params)
)
-- 3. Reciprocal Rank Fusion
-- Combine the results, calculate RRF score, and return the final ranked list.
SELECT
COALESCE(vs.id, fs.id) AS id,
-- Calculate the RRF score.
-- If a document is missing from one list, its rank there is effectively infinity,
-- so that component must contribute 0. Wrapping the whole term in COALESCE achieves
-- this; COALESCE on the rank alone would wrongly award that list's maximum score.
COALESCE(1.0 / ((SELECT rrf_k FROM params) + vs.rank), 0.0) +
COALESCE(1.0 / ((SELECT rrf_k FROM params) + fs.rank), 0.0) AS rrf_score,
-- Also return the individual ranks for debugging/analysis
vs.rank AS vector_rank,
fs.rank AS fts_rank
FROM vector_search vs
FULL OUTER JOIN fts_search fs ON vs.id = fs.id
ORDER BY rrf_score DESC
LIMIT 20; -- Final limit for the user
Query Analysis (`EXPLAIN ANALYZE`)
Running EXPLAIN ANALYZE on this query is crucial. You should expect to see:
- vector_search CTE: An Index Scan using your HNSW index on documents. This should be very fast.
- fts_search CTE: A Bitmap Heap Scan driven by the GIN index on documents. The WHERE content_tsv @@ ... clause should be satisfied by the index.
- JOIN: A Hash Full Join is likely, which is efficient for combining the two relatively small, pre-filtered result sets (search_limit of 100 each).
If you see a Sequential Scan on the documents table in either of the first two CTEs, your indexes are not being used and performance will be abysmal. This could be due to outdated table statistics (run ANALYZE documents;), incorrect query formulation, or an issue with the index itself.
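It often helps to check each leg in isolation first; for example, the FTS leg (query terms illustrative):
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM documents
WHERE content_tsv @@ to_tsquery('english', 'database & optimization')
LIMIT 100;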
Production Considerations and Edge Cases
Deploying this system requires attention to detail beyond the query itself.
Real-time Updates and Index Bloat
pgvector's HNSW implementation is append-only: UPDATEs and DELETEs mark old rows as dead but don't immediately reclaim space. Frequent VACUUM operations are essential to prevent index bloat and maintain performance. For write-heavy workloads, you may need to schedule periodic REINDEX operations during maintenance windows, though this requires an exclusive lock.
GIN indexes, for their part, use a fastupdate mechanism that defers index insertion into a pending list. This speeds up writes but can slow down reads if the pending list grows too large. Autovacuum is usually effective at managing this, but for extremely high-throughput systems you might need to tune the relevant settings (autovacuum_work_mem, gin_pending_list_limit) to be more aggressive.
Tuning `ef_search`
The HNSW index has a query-time parameter, hnsw.ef_search, that controls the size of the search graph to explore. A higher value increases accuracy (recall) at the cost of latency. This can be set per-transaction:
BEGIN;
SET LOCAL hnsw.ef_search = 100; -- Higher value for better recall
-- ... run your hybrid search query here ...
COMMIT;
This allows you to offer different performance/accuracy tiers in your application. A background job might use a high ef_search, while a user-facing API uses a lower, faster value.
Handling Zero-Result Scenarios
The FULL OUTER JOIN gracefully handles cases where one of the search methods returns no results. If fts_search is empty, the fs.id and fs.rank columns will be NULL, and the RRF calculation for that component will correctly contribute zero to the final score. The results will be ranked purely based on the vector_search output, which is the desired behavior.
Tuning the RRF `k` Constant
The k value in RRF (rrf_k in our query) controls how much to penalize lower-ranked items. A smaller k gives more weight to the top few results. The default of 60 is a good starting point, but you should tune this based on empirical evaluation of your specific dataset and query patterns. If you find that your keyword search is extremely precise and should be trusted more, you could use a weighted RRF, though this adds complexity.
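A minimal sketch of that weighted variant, assuming illustrative weights of 0.7 for FTS and 0.3 for vector search; only the score expression in the final SELECT changes:
COALESCE(0.7 / ((SELECT rrf_k FROM params) + fs.rank), 0.0) +
COALESCE(0.3 / ((SELECT rrf_k FROM params) + vs.rank), 0.0) AS rrf_score,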
Cold Starts and Cache Warming
PostgreSQL relies heavily on the filesystem cache. The first time a query is run after a server restart, it will be significantly slower as it reads index and table data from disk. For latency-sensitive applications, consider implementing a cache warming procedure that runs a few representative queries upon startup to load key index pages into memory.
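The pg_prewarm extension handles this directly. The index names below are the auto-generated defaults and are assumptions; check pg_indexes for the actual names in your schema:
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('documents'); -- heap
SELECT pg_prewarm('documents_content_embedding_idx'); -- HNSW index (assumed name)
SELECT pg_prewarm('documents_content_tsv_idx'); -- GIN index (assumed name)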
Conclusion
By combining pgvector, custom PL/pgSQL functions for advanced ranking like BM25, and a well-structured query using CTEs and RRF, it is entirely feasible to build a world-class, real-time hybrid search engine within PostgreSQL. This single-database approach dramatically simplifies the operational stack, ensures transactional consistency between text and vector data, and eliminates network latency inherent in distributed microservice architectures.
While this pattern requires a deep understanding of PostgreSQL's indexing, query planner, and extensibility, the payoff is a highly performant, scalable, and maintainable search system that can serve as the backbone for sophisticated AI-powered applications.