Hybrid Search in Postgres: Combining pgvector HNSW with BM25 FTS
The Relevance Problem: Semantic vs. Lexical Search
In modern search applications, the chasm between user intent and query results is a constant battle. We have two powerful, yet fundamentally flawed, paradigms at our disposal:
XG-500-PRO
might get lost in the semantic noise, returning conceptually similar but incorrect items.XG-500-PRO
with perfect precision. Its weakness is the inverse of semantic search; it has no understanding that "summer clothing" and "sun dresses" are related concepts.For senior engineers, the goal is clear: we need a system that leverages the strengths of both. This is the domain of hybrid search. While many solutions involve complex, multi-system architectures (e.g., PostgreSQL for metadata, Elasticsearch for FTS, and a dedicated vector database like Pinecone or Weaviate), this post demonstrates a powerful, single-database solution using modern PostgreSQL capabilities.
We will implement a hybrid search system that combines pgvector
's high-performance HNSW index for semantic search with a robust, custom-built BM25 ranking function for lexical search. The final, crucial step will be to fuse these two result sets using Reciprocal Rank Fusion (RRF) to produce a unified, superior ranking.
This is not a beginner's guide. We assume you understand what vector embeddings are, the basics of SQL, and the fundamental concepts of full-text search.
Our Technical Stack
* Database: PostgreSQL 15+ (for MERGE command and other improvements)
* Vector Extension: pgvector
0.5.0+ (for HNSW index support)
* Language for Data Prep: Python 3.9+ with psycopg2
, sentence-transformers
, numpy
.
1. Schema Design and Data Population
Our foundation is a well-designed table that accommodates both lexical and semantic data. Let's model a product catalog for an e-commerce site.
-- Ensure the pgvector extension is installed
CREATE EXTENSION IF NOT EXISTS vector;
-- A table to store global statistics for our custom BM25 implementation
CREATE TABLE IF NOT EXISTS product_search_stats (
stat_key TEXT PRIMARY KEY,
stat_value NUMERIC NOT NULL
);
-- Our main products table
CREATE TABLE IF NOT EXISTS products (
id SERIAL PRIMARY KEY,
sku TEXT UNIQUE NOT NULL,
name TEXT NOT NULL,
description TEXT,
-- For pgvector semantic search
embedding VECTOR(384) NOT NULL,
-- For FTS lexical search
fts_document TSVECTOR
);
-- Create a trigger to update the tsvector automatically
CREATE OR REPLACE FUNCTION update_fts_document() RETURNS TRIGGER AS $$
BEGIN
NEW.fts_document :=
setweight(to_tsvector('english', coalesce(NEW.name, '')), 'A') ||
setweight(to_tsvector('english', coalesce(NEW.description, '')), 'B');
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER products_fts_update_trigger
BEFORE INSERT OR UPDATE ON products
FOR EACH ROW EXECUTE FUNCTION update_fts_document();
Key Design Decisions:
embedding VECTOR(384)
: We're using the all-MiniLM-L6-v2
model from sentence-transformers
, which produces 384-dimension vectors. The dimension must* match your embedding model.
* fts_document TSVECTOR
: Instead of creating a tsvector
on the fly during queries, we pre-materialize it in a dedicated column. This is a critical performance optimization.
* Trigger-based tsvector
Update: A trigger ensures the fts_document
is always synchronized with the name
and description
fields. We use setweight
to give matches in the product name
(Weight 'A') higher importance than matches in the description
(Weight 'B').
* product_search_stats
: This small table will hold pre-calculated statistics like the average document length, which is required for our BM25 implementation. This avoids costly table scans during search queries.
Populating with Embeddings
Next, we'll write a Python script to populate the products
table. This involves generating embeddings for our product data.
# populate_products.py
import psycopg2
import numpy as np
from sentence_transformers import SentenceTransformer
def get_db_connection():
return psycopg2.connect(
dbname="your_db",
user="your_user",
password="your_password",
host="localhost"
)
def main():
products_data = [
{'sku': 'TS-001', 'name': 'Quantum Weave T-Shirt', 'description': 'A comfortable cotton t-shirt with a futuristic design.'},
{'sku': 'JN-001', 'name': 'Starlight Denim Jeans', 'description': 'Durable and stylish jeans, perfect for any occasion.'},
{'sku': 'DR-001', 'name': 'Summer Breeze Sundress', 'description': 'A light and airy dress for warm sunny days.'},
{'sku': 'SH-001', 'name': 'Linen Holiday Shorts', 'description': 'Breathable shorts ideal for a summer vacation.'},
{'sku': 'XG-500', 'name': 'Pro-Grade Graphics Card', 'description': 'High-performance GPU for gaming and professional workloads.'},
{'sku': 'XG-500-PRO', 'name': 'Pro-Grade Graphics Card - Pro Edition', 'description': 'The ultimate GPU for 4K gaming and demanding creative tasks.'},
]
print("Loading sentence transformer model...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Generating embeddings...")
texts_to_embed = [f"{p['name']}: {p['description']}" for p in products_data]
embeddings = model.encode(texts_to_embed, show_progress_bar=True)
conn = get_db_connection()
try:
with conn.cursor() as cur:
print("Inserting data into database...")
for i, p in enumerate(products_data):
embedding_list = embeddings[i].tolist()
cur.execute(
"""INSERT INTO products (sku, name, description, embedding)
VALUES (%s, %s, %s, %s)
ON CONFLICT (sku) DO NOTHING""",
(p['sku'], p['name'], p['description'], embedding_list)
)
print("Updating search statistics...")
cur.execute("""
INSERT INTO product_search_stats (stat_key, stat_value)
SELECT 'total_docs', COUNT(*) FROM products
ON CONFLICT (stat_key) DO UPDATE SET stat_value = EXCLUDED.stat_value;
""")
cur.execute("""
INSERT INTO product_search_stats (stat_key, stat_value)
SELECT 'avg_doc_length', AVG(pg_column_size(fts_document)) FROM products
ON CONFLICT (stat_key) DO UPDATE SET stat_value = EXCLUDED.stat_value;
""")
conn.commit()
print("Data population complete.")
finally:
conn.close()
if __name__ == '__main__':
main()
After running this script, our products
table will be populated, and the fts_document
for each row will be automatically generated by our trigger.
2. Component 1: Semantic Search with `pgvector` HNSW
The pgvector
extension provides several index types. For production workloads where query latency is critical, HNSW (Hierarchical Navigable Small World) is the superior choice over IVFFlat. While HNSW has a longer build time and higher memory usage, its query performance is significantly better, especially on large datasets.
-- Create the HNSW index on our embedding column
-- The cosine operator is recommended for sentence-transformer models
CREATE INDEX ON products USING hnsw (embedding vector_cosine_ops);
Tuning HNSW Index Creation:
For larger datasets, you can tune the index creation parameters:
CREATE INDEX ON products USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
* m
: The maximum number of connections per layer (default 16). Higher m
improves recall but increases index size and build time.
* ef_construction
: The size of the dynamic candidate list for constructing the graph (default 64). Higher values lead to a better-quality graph (and better recall) at the cost of a longer build time.
Executing a Semantic Search Query:
A pure semantic search query uses a vector similarity operator. For all-MiniLM-L6-v2
, cosine similarity is the standard. pgvector
provides the <=>
operator for this.
Let's find products semantically similar to "clothes for warm weather".
# search_semantic.py
import psycopg2
import numpy as np
from sentence_transformers import SentenceTransformer
def search(query_text, limit=5):
model = SentenceTransformer('all-MiniLM-L6-v2')
query_embedding = model.encode(query_text).tolist()
conn = psycopg2.connect(...) # connection details
try:
with conn.cursor() as cur:
cur.execute(
"""SELECT id, name, description, 1 - (embedding <=> %s) AS similarity
FROM products
ORDER BY embedding <=> %s
LIMIT %s""",
(query_embedding, query_embedding, limit)
)
results = cur.fetchall()
for row in results:
print(f"ID: {row[0]}, Name: {row[1]}, Similarity: {row[3]:.4f}")
finally:
conn.close()
if __name__ == '__main__':
search("clothes for warm weather")
Expected Output:
ID: 3, Name: Summer Breeze Sundress, Similarity: 0.7581
ID: 4, Name: Linen Holiday Shorts, Similarity: 0.7345
ID: 1, Name: Quantum Weave T-Shirt, Similarity: 0.5123
...
This works perfectly for the conceptual query. But if we search for XG-500-PRO
, the results will be poor because the semantic meaning is weak.
Query Performance Analysis (EXPLAIN ANALYZE
):
EXPLAIN ANALYZE SELECT id FROM products ORDER BY embedding <=> '[...]' LIMIT 5;
Limit (cost=...) (actual time=0.852..0.853 rows=5 loops=1)
-> Index Scan using products_embedding_idx on products ...
Order By: (embedding <=> '[...]')
Planning Time: ...
Execution Time: 0.912 ms
The key is Index Scan
. This confirms PostgreSQL is using our HNSW index, resulting in sub-millisecond query times even on millions of rows, rather than performing a slow, full table scan.
3. Component 2: Lexical Search with a Custom BM25 Function
PostgreSQL's built-in FTS provides ranking functions like ts_rank
and ts_rank_cd
. These are based on TF-IDF and are decent, but BM25 is the de-facto standard for modern lexical search, offering better relevance tuning.
Unfortunately, PostgreSQL does not have a native BM25 implementation. We must build our own using PL/pgSQL
.
The BM25 formula is:
$$ ext{score}(q, d) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, d) \cdot (k_1 + 1)}{f(q_i, d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})} $$
Where:
* IDF(q_i)
: Inverse Document Frequency of term q_i
.
* f(q_i, d)
: Term frequency of q_i
in document d
.
* |d|
: Length of document d
.
* avgdl
: Average document length across the entire collection.
* k1
and b
: Hyperparameters (typically k1
is 1.2-2.0, b
is 0.75).
We can implement this in PL/pgSQL.
CREATE OR REPLACE FUNCTION bm25(
query_tsquery TSQUERY,
document_tsvector TSVECTOR,
k1 REAL DEFAULT 1.2,
b REAL DEFAULT 0.75
) RETURNS REAL AS $$
DECLARE
query_lexemes TEXT[];
lexeme TEXT;
-- These should be pre-calculated and stored for performance
total_docs NUMERIC;
avg_doc_length NUMERIC;
-- Document specific
doc_length INT;
term_freq INT;
-- Calculation variables
idf REAL;
score REAL := 0.0;
BEGIN
-- Retrieve pre-calculated stats
SELECT stat_value INTO total_docs FROM product_search_stats WHERE stat_key = 'total_docs';
SELECT stat_value INTO avg_doc_length FROM product_search_stats WHERE stat_key = 'avg_doc_length';
IF total_docs IS NULL OR avg_doc_length IS NULL THEN
-- Fallback or raise an error if stats are not available
RETURN 0.0;
END IF;
-- Extract individual lexemes from the query
query_lexemes := string_to_array(regexp_replace(tsquery_to_string(query_tsquery), '[()|&!]', '', 'g'), ' ');
-- Get the length of the current document's tsvector
doc_length := array_length(string_to_array(tsvector_to_string(document_tsvector), ' '), 1);
-- Loop through each term in the query
FOREACH lexeme IN ARRAY query_lexemes
LOOP
-- Skip if lexeme is empty
CONTINUE WHEN lexeme = '';
-- Calculate IDF for the current lexeme
-- ts_stat provides the nentry (number of documents the term appears in)
DECLARE
n_docs_with_term INT;
BEGIN
SELECT nentry INTO n_docs_with_term
FROM ts_stat('SELECT fts_document FROM products')
WHERE word = lexeme;
IF n_docs_with_term IS NULL OR n_docs_with_term = 0 THEN
idf := 0;
ELSE
idf := ln((total_docs - n_docs_with_term + 0.5) / (n_docs_with_term + 0.5) + 1.0);
END IF;
END;
-- Calculate term frequency in the current document
-- A trick is to convert tsvector to string and count occurrences
term_freq := array_length(ts_lexize('english', lexeme), 1) > 0 AND
array_length(string_to_array(tsvector_to_string(document_tsvector), lexeme), 1) - 1;
-- Calculate the BM25 term score
IF term_freq > 0 THEN
score := score + idf * (term_freq * (k1 + 1)) / (term_freq + k1 * (1 - b + b * (doc_length / avg_doc_length)));
END IF;
END LOOP;
RETURN score;
END;
$$ LANGUAGE plpgsql STABLE;
Production Caveat: The ts_stat
call inside the function is a performance bottleneck as it scans the index/table. In a high-throughput production environment, you would pre-calculate IDF scores for your entire vocabulary and store them in a lookup table, which this function would query instead.
Now, let's create the necessary GIN index for FTS.
CREATE INDEX ON products USING GIN (fts_document);
Executing a Lexical Search Query:
Let's search for the specific SKU XG-500-PRO
.
-- The plainto_tsquery function is good for user-provided search terms
SELECT
id,
name,
bm25(plainto_tsquery('english', 'XG-500-PRO'), fts_document) AS score
FROM products
WHERE fts_document @@ plainto_tsquery('english', 'XG-500-PRO')
ORDER BY score DESC
LIMIT 5;
Expected Output:
id | name | score
----+------------------------------------------+------------------
6 | Pro-Grade Graphics Card - Pro Edition | 2.02732563018799
5 | Pro-Grade Graphics Card | 1.01366281509399
...
This correctly and precisely identifies the product with the exact SKU, ranking it highest. This is something pure semantic search would struggle with.
4. Fusing the Results: Reciprocal Rank Fusion (RRF)
We now have two ordered lists of results, each with its own scoring system that is not directly comparable. How do we merge them?
A naive approach might be to normalize scores (e.g., min-max scaling) and add them. This is fragile and highly sensitive to the distribution of scores in each system. A much more robust and parameter-free method is Reciprocal Rank Fusion (RRF).
The RRF formula is simple. For each document, its final score is the sum of the reciprocal of its ranks in each result set:
$$ RRF_{score}(d) = \sum_{k \in \text{result_sets}} \frac{1}{c + \text{rank}_k(d)} $$
* rank_k(d)
is the rank of document d
in result set k
.
* c
is a constant to de-emphasize lower-ranked results (a common value is 60).
We can implement this elegantly in SQL using Common Table Expressions (CTEs).
The Complete Hybrid Search Query:
Here is the full query that performs vector search, FTS search, and fuses them with RRF.
WITH vector_results AS (
-- Get top 100 results from semantic search
SELECT
id,
-- Rank is needed for RRF
ROW_NUMBER() OVER (ORDER BY embedding <=> '[...your_query_vector...]' ASC) as rank
FROM products
ORDER BY embedding <=> '[...your_query_vector...]'
LIMIT 100
),
fts_results AS (
-- Get top 100 results from lexical search
SELECT
id,
-- Rank is needed for RRF
ROW_NUMBER() OVER (ORDER BY bm25(plainto_tsquery('english', 'your search query'), fts_document) DESC) as rank
FROM products
WHERE fts_document @@ plainto_tsquery('english', 'your search query')
ORDER BY bm25(plainto_tsquery('english', 'your search query'), fts_document) DESC
LIMIT 100
)
SELECT
p.id,
p.name,
p.description,
-- Calculate the RRF score. The COALESCE handles documents present in only one result set.
COALESCE(1.0 / (60 + vr.rank), 0.0) + COALESCE(1.0 / (60 + fr.rank), 0.0) AS rrf_score
FROM products p
LEFT JOIN vector_results vr ON p.id = vr.id
LEFT JOIN fts_results fr ON p.id = fr.id
-- We only want to see results that appeared in at least one of the searches
WHERE vr.id IS NOT NULL OR fr.id IS NOT NULL
ORDER BY rrf_score DESC
LIMIT 10;
Let's test this with a query that has both semantic and lexical components: stylish summer clothes
.
* Vector Search will find Summer Breeze Sundress
and Linen Holiday Shorts
.
* FTS will also find these because of the word summer
.
If we search for pro-grade gpu
, a query where both systems can contribute:
* Vector Search will find both graphics cards due to semantic similarity to gpu
.
* FTS will strongly prefer the results with the exact words pro-grade
.
RRF will combine these signals, pushing the Pro Edition
to the top because it gets a high rank from both systems, demonstrating the power of the hybrid approach.
5. Performance Tuning and Production Considerations
This query is complex. To make it production-ready, we must optimize it.
HNSW Query-Time Tuning: `ef_search`
The most important parameter for HNSW query performance is ef_search
. It controls the size of the candidate list during the search. A higher ef_search
increases recall (accuracy) at the cost of latency.
-- Can be set per-transaction
BEGIN;
SET LOCAL hnsw.ef_search = 100;
-- Your hybrid query here...
COMMIT;
You must benchmark this value for your specific dataset and latency requirements. For a 1 million vector dataset, you might test values from 40 to 200.
Example Benchmark:
ef_search | p95 Latency (ms) | Recall@10 |
---|---|---|
40 | 5.2 | 0.96 |
100 | 11.8 | 0.985 |
200 | 25.1 | 0.991 |
This table (illustrative) shows the classic trade-off. A value of 100 might be a good balance.
CTE Result Set Size
In our query, we used LIMIT 100
in both CTEs. This is a critical tuning parameter.
* Too small (e.g., LIMIT 10
): You might miss out on documents that rank low in one search but high in the other, which could have a high final RRF score.
* Too large (e.g., LIMIT 1000
): You increase the amount of work the database has to do for ranking and joining, increasing latency.
A value between 50 and 200 is typically a good starting point. This depends heavily on how "spiky" your relevance scores are.
Managing BM25 Statistics
As noted, our BM25 function has a performance issue with ts_stat
. The solution is to pre-calculate and cache these statistics.
CREATE TABLE term_idf_stats (
term TEXT PRIMARY KEY,
idf REAL NOT NULL
);
Populate this table with a periodic script:
-- This can be a slow query, run it offline.
INSERT INTO term_idf_stats (term, idf)
SELECT
word,
ln(((SELECT stat_value FROM product_search_stats WHERE stat_key = 'total_docs') - nentry + 0.5) / (nentry + 0.5) + 1.0)
FROM ts_stat('SELECT fts_document FROM products')
ON CONFLICT (term) DO UPDATE SET idf = EXCLUDED.idf;
Then, modify the bm25
function to read from this term_idf_stats
table instead of calling ts_stat
in a loop. This will provide a massive performance boost.
Edge Case: Updating Embeddings and Re-indexing
Updating a row's embedding is costly because it requires updating the HNSW index. HNSW is not designed for frequent updates.
* Strategy: For systems with frequent updates, consider partitioning your table (e.g., by date). New data goes into a new partition with its own index. Queries can UNION ALL
across recent, active partitions. Older, static partitions remain untouched. Alternatively, batch updates and perform a full REINDEX
during off-peak hours.
Conclusion
We have successfully designed and implemented a sophisticated hybrid search system entirely within PostgreSQL. By combining the semantic power of pgvector
with the lexical precision of a custom BM25 FTS implementation, and fusing the results with Reciprocal Rank Fusion, we've built a system that provides far superior relevance than either method alone.
This approach offers significant architectural simplification by avoiding the need for multiple disparate data stores. While it requires a deep understanding of SQL, indexing, and the underlying search algorithms, the result is a powerful, maintainable, and highly performant search backend. For senior engineers looking to push the boundaries of in-database capabilities, this pattern represents a compelling and production-ready solution.