Hybrid Search with pgvector & BM25 for Advanced RAG Systems
The Semantic Gap in Production RAG Systems
Retrieval-Augmented Generation (RAG) has become a cornerstone of modern LLM applications. The standard implementation, however, relies almost exclusively on dense vector retrieval using embeddings. While powerful for capturing semantic similarity, this approach exhibits a critical weakness in production environments: a surprising fragility to keyword-specific queries. Acronyms, product SKUs, specific function names, or unique identifiers are often poorly represented in the dense vector space, leading to a frustrating semantic gap where the model understands the concept but misses the specifics.
For senior engineers tasked with building robust, reliable AI systems, this is an unacceptable limitation. The solution is not to abandon vector search, but to augment it. This is where hybrid search—the fusion of dense (semantic) and sparse (lexical) retrieval—becomes essential. This article bypasses introductory concepts and dives directly into a production-grade implementation of a sophisticated hybrid search system entirely within PostgreSQL, leveraging the pgvector extension for dense search and advanced SQL techniques to simulate a BM25-like sparse search.
We will cover:
1. Architectural Foundation: A Unified Schema in PostgreSQL
Consolidating data into a single system is a significant architectural advantage. It simplifies operations, reduces data synchronization issues, and enables powerful transactional queries that combine different data types. Here's a schema designed for a multi-modal hybrid search system.
-- Ensure required extensions are enabled
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS btree_gin;
-- Main table for documents
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
source_id TEXT NOT NULL UNIQUE, -- An external identifier
document_type TEXT NOT NULL DEFAULT 'text', -- e.g., 'text', 'image', 'pdf_page'
content TEXT, -- Full text content, can be NULL for non-textual data
metadata JSONB DEFAULT '{}'::jsonb, -- Structured data, e.g., {'image_url': '...', 'alt_text': '...'}
-- Dense Vector Representation (for pgvector)
embedding VECTOR(768), -- Dimension depends on your model, e.g., 768 for MiniLM-L6-v2
-- Sparse Vector Representation (for BM25)
-- Storing term frequencies as a JSONB object: { 'term': count, ... }
term_frequencies JSONB,
-- Full-Text Search Vector (for PostgreSQL's native FTS)
content_tsv TSVECTOR,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Indexing Strategy
-- 1. HNSW Index for Approximate Nearest Neighbor (ANN) search on embeddings
-- This is the workhorse for semantic search. ef_construction and m are build-time params.
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);
-- 2. GIN Index for Full-Text Search on the tsvector
-- This powers the keyword-based part of our BM25 simulation.
CREATE INDEX ON documents USING gin (content_tsv);
-- 3. GIN Index for term frequencies in our sparse vector
-- Crucial for efficiently finding documents containing specific query terms.
CREATE INDEX ON documents USING gin (term_frequencies jsonb_path_ops);
-- 4. B-Tree index on metadata for filtering (example)
CREATE INDEX ON documents USING gin ((metadata->'tags'));
Architectural Decisions Explained:
* embedding VECTOR(768): This column stores the dense vector generated by a sentence-transformer model. We use vector_cosine_ops for the HNSW index, as cosine similarity is standard for normalized embeddings from these models.
* term_frequencies JSONB: This is the core of our BM25 simulation. Instead of storing a full sparse vector, we pre-calculate and store the term frequencies (TF) for each document. The format {'term': count, ...} is highly indexable with a GIN index.
* content_tsv TSVECTOR: This is a pre-processed representation of the content for PostgreSQL's native Full-Text Search (FTS). It handles stemming and stop-word removal. We'll use this to calculate Inverse Document Frequency (IDF) and for initial document filtering.
* HNSW vs. IVFFlat: We choose HNSW (Hierarchical Navigable Small World) over IVFFlat for our vector index. While IVFFlat can be faster for very high-recall requirements, HNSW generally provides a better balance of speed and accuracy, especially on modern hardware, and doesn't require ivf.probes tuning at query time. The ef_construction and m parameters are tuned during index creation to balance build time and graph quality.
To populate the content_tsv and term_frequencies columns automatically, we'll use a trigger.
-- Function to calculate term frequencies from text
CREATE OR REPLACE FUNCTION calculate_term_frequencies(text_content TEXT)
RETURNS JSONB AS $$
DECLARE
lexemes TEXT[];
lexeme TEXT;
tf_map JSONB := '{}'::jsonb;
BEGIN
IF text_content IS NULL THEN
RETURN '{}'::jsonb;
END IF;
lexemes := array(SELECT unnest(string_to_array(regexp_replace(lower(text_content), '[^a-z0-9\s]+', '', 'g'), ' ')));
FOREACH lexeme IN ARRAY lexemes
LOOP
IF lexeme <> '' THEN
tf_map := jsonb_set(
tf_map,
ARRAY[lexeme],
to_jsonb(COALESCE((tf_map->>lexeme)::INT, 0) + 1),
true
);
END IF;
END LOOP;
RETURN tf_map;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
-- Trigger to auto-populate derived columns
CREATE OR REPLACE FUNCTION update_document_derived_columns()
RETURNS TRIGGER AS $$
BEGIN
IF TG_OP = 'INSERT' OR NEW.content <> OLD.content THEN
NEW.content_tsv := to_tsvector('english', NEW.content);
NEW.term_frequencies := calculate_term_frequencies(NEW.content);
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_update_document_derived_columns
BEFORE INSERT OR UPDATE ON documents
FOR EACH ROW
EXECUTE FUNCTION update_document_derived_columns();
This setup ensures our data is always ready for both dense and sparse retrieval queries upon insertion or update.
2. Implementing the Retrieval Components
With the architecture in place, we can now build the Python and SQL logic for each retrieval component.
Component A: Dense Retrieval with `pgvector`
This is the straightforward part. We generate a query embedding and use the <=> operator to find the k-nearest neighbors.
import psycopg2
from sentence_transformers import SentenceTransformer
# It's highly recommended to cache the model loading
model = SentenceTransformer('all-MiniLM-L6-v2')
def get_dense_results(query_text: str, conn, limit: int = 10):
"""Performs dense vector search in PostgreSQL."""
embedding = model.encode(query_text).tolist()
with conn.cursor() as cur:
cur.execute(
"""SELECT id, 1 - (embedding <=> %s) AS similarity
FROM documents
ORDER BY embedding <=> %s
LIMIT %s;""",
(str(embedding), str(embedding), limit)
)
results = cur.fetchall()
return results
# Example Usage
# conn = psycopg2.connect(DSN)
# dense_res = get_dense_results("challenges in distributed systems", conn)
# print(dense_res)
Performance Consideration: The query planner will use the HNSW index effectively. However, performance is sensitive to the hnsw.ef_search parameter, which can be set per-transaction. A higher ef_search increases accuracy (recall) at the cost of latency.
-- For higher accuracy queries
BEGIN;
SET LOCAL hnsw.ef_search = 100;
SELECT id, 1 - (embedding <=> '[...embedding...]') AS similarity
FROM documents ORDER BY embedding <=> '[...embedding...]' LIMIT 10;
COMMIT;
This dynamic tuning is crucial for applications that may have different latency/accuracy requirements for different types of queries.
Component B: Sparse Retrieval with BM25 Simulation in SQL
This is where the complexity lies. The Okapi BM25 ranking function is:
$$ ext{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{avgdl})} $$
Where:
- $q_i$ is the i-th query term.
- $f(q_i, D)$ is the term frequency of $q_i$ in document $D$.
- $|D|$ is the length of document $D$.
- $avgdl$ is the average document length in the collection.
- $k_1$ and $b$ are hyperparameters (typically $k_1 \in [1.2, 2.0]$ and $b=0.75$).
We need to compute this within a single SQL query. It's a multi-step process.
Step 1: Create a SQL function for the BM25 score calculation.
This function will encapsulate the complex formula, making our final query cleaner.
CREATE OR REPLACE FUNCTION bm25_score(
term_freq INTEGER, -- f(q_i, D)
doc_length INTEGER, -- |D|
avg_doc_length FLOAT, -- avgdl
doc_count_with_term BIGINT, -- N(q_i)
total_doc_count BIGINT, -- N
k1 FLOAT = 1.5,
b FLOAT = 0.75
)
RETURNS FLOAT AS $$
DECLARE
idf FLOAT;
BEGIN
-- Calculate Inverse Document Frequency (IDF) using a common variant
idf := log((total_doc_count - doc_count_with_term + 0.5) / (doc_count_with_term + 0.5) + 1.0);
RETURN idf * (term_freq * (k1 + 1)) / (term_freq + k1 * (1 - b + b * (doc_length / avg_doc_length)));
END;
$$ LANGUAGE plpgsql IMMUTABLE;
Step 2: The Master Query for Sparse Retrieval.
This query is advanced. It needs to tokenize the user query, fetch statistics (like avgdl and total_doc_count), and then join back to the documents table to calculate the score for each document that contains at least one query term.
import psycopg2
import re
def get_sparse_results(query_text: str, conn, limit: int = 10):
"""Performs BM25-like sparse search in PostgreSQL."""
# Simple whitespace and punctuation tokenizer
query_terms = list(set(re.sub(r'[^a-z0-9\s]+', '', query_text.lower()).split()))
if not query_terms:
return []
sql_query = """
WITH query_terms AS (
SELECT unnest(%(query_terms)s::TEXT[]) as term
),
collection_stats AS (
-- Pre-calculate average document length and total document count
SELECT
avg(length(content)) as avgdl,
count(1) as total_docs
FROM documents
),
doc_term_stats AS (
-- For each query term, find how many documents contain it
SELECT
qt.term,
count(d.id) as doc_count_with_term
FROM query_terms qt
JOIN documents d ON d.term_frequencies ? qt.term
GROUP BY qt.term
)
SELECT
d.id,
-- Sum the BM25 score for each query term present in the document
SUM(
bm25_score(
(d.term_frequencies->>qts.term)::INT, -- f(q_i, D)
length(d.content), -- |D|
cs.avgdl, -- avgdl
dts.doc_count_with_term, -- N(q_i)
cs.total_docs, -- N
1.5, -- k1
0.75 -- b
)
) as score
FROM documents d
-- Join with a filtered list of terms relevant to this document
JOIN (
SELECT term FROM query_terms
) qts ON d.term_frequencies ? qts.term
CROSS JOIN collection_stats cs
JOIN doc_term_stats dts ON qts.term = dts.term
-- CRITICAL: Filter to only documents that have at least one query term.
-- The GIN index on term_frequencies makes this fast.
WHERE d.term_frequencies ?| (SELECT array_agg(term) FROM query_terms)
GROUP BY d.id, cs.avgdl, cs.total_docs
ORDER BY score DESC
LIMIT %(limit)s;
"""
with conn.cursor() as cur:
cur.execute(sql_query, {'query_terms': query_terms, 'limit': limit})
results = cur.fetchall()
return results
# Example Usage
# conn = psycopg2.connect(DSN)
# sparse_res = get_sparse_results("PostgreSQL HNSW index tuning", conn)
# print(sparse_res)
Query Breakdown and Performance:
* CTEs (WITH clauses): We use Common Table Expressions to structure the query logically. collection_stats and doc_term_stats compute global values once.
* ?| operator: The WHERE d.term_frequencies ?| array['term1', 'term2'] clause is the performance linchpin. It translates to "does the term_frequencies JSONB contain any key from this array?". This operation is heavily accelerated by the GIN index on term_frequencies, allowing PostgreSQL to rapidly identify a candidate set of documents before performing expensive scoring calculations.
* EXPLAIN ANALYZE: Running EXPLAIN ANALYZE on this query would show an initial fast Bitmap Heap Scan using the GIN index, followed by joins and aggregations on that much smaller subset of documents. Without the GIN index, this query would perform a full table scan and be prohibitively slow.
3. The Fusion Layer: Reciprocal Rank Fusion (RRF)
Now we have two ranked lists of document IDs: one from dense search, one from sparse. We cannot simply add their scores, as they are on different scales and represent different concepts (semantic similarity vs. lexical relevance). Reciprocal Rank Fusion (RRF) is a simple yet remarkably effective algorithm for combining ranked lists without needing score normalization.
The formula for a document's RRF score is:
$$ ext{RRF_score}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)} $$
Where:
- $R$ is the set of result lists (in our case, dense and sparse).
- $ ext{rank}_r(d)$ is the rank of document $d$ in result list $r$. If $d$ is not in the list, its contribution is 0.
60.from collections import defaultdict
def reciprocal_rank_fusion(results_lists, k=60):
"""
Performs Reciprocal Rank Fusion on a list of search result lists.
Args:
results_lists: A list of lists, where each inner list contains (id, score) tuples.
k: The constant used in the RRF formula.
Returns:
A sorted list of (id, score) tuples.
"""
ranked_items = defaultdict(float)
for results in results_lists:
for rank, (doc_id, _) in enumerate(results, 1):
ranked_items[doc_id] += 1 / (k + rank)
# Sort by the fused score in descending order
fused_results = sorted(ranked_items.items(), key=lambda item: item[1], reverse=True)
return fused_results
# --- Putting It All Together ---
def hybrid_search(query_text: str, conn, limit: int = 10):
# In a production system, these two calls should be run concurrently
# e.g., using asyncio or threading, to reduce latency.
dense_results = get_dense_results(query_text, conn, limit=50) # Fetch more to allow for re-ranking
sparse_results = get_sparse_results(query_text, conn, limit=50)
fused = reciprocal_rank_fusion([dense_results, sparse_results])
# You might want to fetch the full document content for the top N results
top_ids = [doc_id for doc_id, score in fused[:limit]]
if not top_ids:
return []
with conn.cursor() as cur:
cur.execute(
"SELECT id, content, metadata FROM documents WHERE id = ANY(%s)",
(top_ids,)
)
# Re-order the fetched docs according to the fused rank
docs_by_id = {doc[0]: doc for doc in cur.fetchall()}
final_results = [(doc_id, score, docs_by_id[doc_id]) for doc_id, score in fused[:limit] if doc_id in docs_by_id]
return final_results
# Example Usage
# conn = psycopg2.connect(DSN)
# final_res = hybrid_search("how to fix CVE-2021-44228 in our java services", conn)
# for doc_id, score, doc_data in final_res:
# print(f"ID: {doc_id}, Score: {score:.4f}, Content: {doc_data[1][:100]}...")
This hybrid approach shines on a query like "how to fix CVE-2021-44228 in our java services". The dense search will find documents about Java vulnerabilities and logging frameworks, while the sparse search will ensure that documents containing the exact identifier "CVE-2021-44228" are ranked extremely highly. RRF will then balance these signals to produce a final ranking that is superior to either method alone.
4. Advanced Considerations and Production Patterns
Edge Case: Score Normalization and Weighted Fusion
While RRF avoids the need for score normalization, sometimes you may want to give more weight to one retriever. A simple modification to RRF is to add a weight multiplier:
$$ ext{Weighted_RRF_score}(d) = \sum_{r \in R} w_r \cdot \frac{1}{k + \text{rank}_r(d)} $$
For example, you could set w_dense = 1.0 and w_sparse = 1.5 to give a slight boost to keyword matches. This requires empirical tuning based on your specific dataset and query patterns.
Asynchronous Indexing Pipeline
The trigger-based approach is simple but can add latency to INSERT operations, especially with the overhead of embedding generation. A robust production architecture uses a message queue (e.g., RabbitMQ, Kafka) for an asynchronous indexing pipeline:
documents table with embedding and term_frequencies as NULL, and publishes the document id to a message queue.id, the worker:* Fetches the document content from the database.
* Generates the dense embedding using a GPU-accelerated model service.
* Calculates term frequencies.
* Updates the corresponding row in the documents table with the new data.
This decouples the write path from the expensive computation, ensuring low-latency ingestion.
Scaling with Table Partitioning
For collections with hundreds of millions or billions of documents, a single table and its indexes can become unwieldy. PostgreSQL's declarative partitioning can be a powerful tool.
For example, you could partition the documents table by a tenant_id for multi-tenant applications or by a date range (created_at) for time-series data.
-- Example of partitioning by tenant_id
CREATE TABLE documents (
-- columns as before
tenant_id INT NOT NULL
) PARTITION BY LIST (tenant_id);
CREATE TABLE documents_tenant_1 PARTITION OF documents FOR VALUES IN (1);
CREATE TABLE documents_tenant_2 PARTITION OF documents FOR VALUES IN (2);
-- etc.
When a query includes a WHERE tenant_id = X clause, PostgreSQL's query planner will perform partition pruning, only scanning the relevant partition and its indexes. This dramatically reduces the search space and improves query performance at scale.
Conclusion
By moving beyond pure vector search and implementing a sophisticated hybrid retrieval system, we can build RAG applications that are significantly more robust and accurate. This approach, which marries the semantic understanding of dense vectors with the lexical precision of BM25, directly addresses the shortcomings of vector-only systems.
Implementing this entire stack within PostgreSQL offers compelling operational advantages, creating a unified, transaction-safe data store for complex AI applications. The techniques explored here—combining HNSW and GIN indexes, simulating BM25 with advanced SQL, and fusing results with RRF—are not just theoretical. They represent a production-ready blueprint for senior engineers to build next-generation search and retrieval systems that are powerful, scalable, and resilient to the nuances of real-world user queries.