Production RAG: Hybrid Search with pgvector, BM25, and RRF
The Achilles' Heel of Pure Vector Search in Production RAG
As senior engineers building Retrieval-Augmented Generation (RAG) systems, we've all witnessed the magic of semantic search. A user asks about "cloud billing anomalies," and our system, powered by pgvector and a high-quality embedding model, correctly surfaces documents discussing "unforeseen GCP expenditures." This is a powerful capability. However, we've also seen its critical failure mode in production: a query for a specific error code like GKE-1128-B or a product SKU XJ-481-Z returns vaguely related concepts instead of the one document that contains the exact identifier.
This is the fundamental limitation of dense retrieval. It excels at understanding semantic meaning but is inherently lossy regarding lexical precision. For production systems where factual accuracy is paramount, this isn't just an inconvenience; it's a critical flaw that erodes user trust. The solution isn't to abandon vector search, but to augment it.
This article details an advanced, production-ready pattern for implementing hybrid search directly within PostgreSQL. We will combine dense vector retrieval (pgvector) with sparse, keyword-based retrieval (PostgreSQL's Full-Text Search, a powerful proxy for BM25) and fuse the results using a sophisticated, parameter-free ranking algorithm: Reciprocal Rank Fusion (RRF). This architecture provides the best of both worlds—semantic understanding and lexical precision—within a single, operationally simple database, eliminating the complexity of managing separate vector and keyword search indices (like Elasticsearch).
Architectural Blueprint: Dense + Sparse Retrieval in a Single Schema
Our goal is to execute two distinct types of searches in parallel and intelligently merge their results.
pgvector. It finds documents that are conceptually similar to the query, even if they don't share keywords.To support this, we need a carefully designed table schema and corresponding indices.
Database Schema for Hybrid Search
Let's define a documents table that stores our content, its embedding, and a pre-calculated tsvector for full-text search.
-- Ensure the vector extension is enabled
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
content TEXT NOT NULL,
embedding VECTOR(1536) NOT NULL, -- Example: OpenAI ada-002 embedding size
content_tsv TSVECTOR,
created_at TIMESTAMPTZ DEFAULT NOW()
);
The key element here is the content_tsv column of type TSVECTOR. This column will store a processed version of the content field, optimized for text search—it's stripped of stop words, and words are reduced to their root form (stemming). Storing this pre-calculated vector is far more performant than computing it at query time.
Automating `tsvector` Generation with a Trigger
Manually keeping content_tsv in sync with content is error-prone. A database trigger is the canonical solution, ensuring atomicity and data integrity. We'll create a function that updates the content_tsv column whenever a document's content is inserted or updated.
-- Create a function to update the tsvector column
CREATE OR REPLACE FUNCTION update_content_tsv()
RETURNS TRIGGER AS $$
BEGIN
-- 'english' is a predefined text search configuration.
-- It handles stemming and stop words for the English language.
NEW.content_tsv := to_tsvector('english', NEW.content);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
-- Create a trigger that calls the function before insert or update
CREATE TRIGGER documents_tsv_update
BEFORE INSERT OR UPDATE ON documents
FOR EACH ROW
EXECUTE FUNCTION update_content_tsv();
Now, any INSERT or UPDATE operation on the documents table will automatically and performantly populate the content_tsv column. This is a robust, set-and-forget pattern for production systems.
High-Performance Indexing Strategy
Without proper indexing, our hybrid search query will result in sequential scans, which is unacceptable for any non-trivial dataset. We need two distinct types of indices.
-- HNSW index for fast ANN search on embeddings
-- The cosine operator (<=>) is used for distance calculation
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
Production Considerations for HNSW:
* m: The maximum number of connections per layer. Higher values increase accuracy (recall) but also increase index size and build time. 16 is a solid default.
* ef_construction: The size of the dynamic candidate list during index construction. A higher value leads to a better quality index and higher accuracy, at the cost of longer build times. 64 is a reasonable starting point.
tsvector column, a GIN (Generalized Inverted Index) is the ideal choice. It's highly optimized for queries that check for the existence of terms within the tsvector. -- GIN index for fast full-text search
CREATE INDEX ON documents USING gin (content_tsv);
Why GIN over GiST? While both can index tsvector, GIN is generally faster for lookups because it stores a mapping from each lexeme (word) to the documents containing it. GiST uses a more complex tree structure that can be slower for the common text search queries we'll be running.
With this schema and indexing strategy, our database is primed for high-performance hybrid queries.
The Fusion Problem: Merging Incomparable Scores
We can now get two ranked lists of documents:
pgvector, ranked by cosine distance (a value from 0 to 2).ts_rank_cd (an arbitrary positive float, typically between 0 and 1, but can be higher).These scores are on completely different, un-normalized scales. A simple weighted average (0.5 vector_score + 0.5 fts_score) is a naive approach that rarely works well. The relative magnitudes of the scores are unpredictable, and finding the right weights is a dark art that changes with the dataset and query patterns.
The Solution: Reciprocal Rank Fusion (RRF)
Reciprocal Rank Fusion (RRF) is an elegant, powerful, and—best of all—parameter-free algorithm for combining ranked lists. It disregards the raw scores entirely and uses only the rank of each item in the result lists.
The formula is simple: for each document, its RRF score is the sum of the reciprocal of its rank in each list.
RRF_Score(doc) = Σ (1 / (k + rank_i(doc)))
* rank_i(doc) is the rank of the document in result list i.
* k is a constant used to mitigate the impact of high ranks (i.e., documents ranked 1st or 2nd) having disproportionately large scores. A k value of 60 is a widely used and effective default.
Why RRF is superior:
* No Normalization Needed: It completely sidesteps the problem of incomparable scores.
* Robustness: It rewards documents that appear in multiple result lists, regardless of their raw scores.
* Simplicity: It requires no tuning or domain-specific weighting, making it a general-purpose and resilient choice.
The Master Query: Implementing Hybrid Search with RRF in SQL
We will now construct a single, powerful SQL query that performs both searches, ranks the results, and computes the RRF score. We'll use Common Table Expressions (CTEs) to keep the logic clean and readable.
Let's assume our input is a user query string and its corresponding embedding vector.
-- Placeholders:
-- :query_embedding -> The 1536-dimensional vector for the user's query
-- :query_text -> The user's query string, formatted for to_tsquery, e.g., 'cloud & billing & anomalies'
-- :k_rrf -> The RRF constant, e.g., 60
-- :top_k_vector -> How many results to fetch from vector search, e.g., 50
-- :top_k_keyword -> How many results to fetch from keyword search, e.g., 50
-- :final_limit -> The final number of results to return, e.g., 10
WITH vector_search AS (
-- 1. Perform dense vector search
SELECT
id,
-- Rank is calculated based on the cosine distance ordering
ROW_NUMBER() OVER (ORDER BY embedding <=> :query_embedding) AS rank
FROM documents
ORDER BY embedding <=> :query_embedding
LIMIT :top_k_vector
), keyword_search AS (
-- 2. Perform sparse keyword search
SELECT
id,
-- Rank is calculated based on the FTS score
ROW_NUMBER() OVER (ORDER BY ts_rank_cd(content_tsv, to_tsquery('english', :query_text)) DESC) AS rank
FROM documents
WHERE content_tsv @@ to_tsquery('english', :query_text)
ORDER BY ts_rank_cd(content_tsv, to_tsquery('english', :query_text)) DESC
LIMIT :top_k_keyword
), combined_results AS (
-- 3. Combine results from both searches
SELECT id, rank FROM vector_search
UNION ALL
SELECT id, rank FROM keyword_search
)
-- 4. Calculate RRF score and aggregate
SELECT
cr.id,
SUM(1.0 / (:k_rrf + cr.rank)) AS rrf_score,
d.content -- Also fetch the content for the final result
FROM combined_results cr
JOIN documents d ON cr.id = d.id
GROUP BY cr.id, d.content
ORDER BY rrf_score DESC
LIMIT :final_limit;
Deconstructing the Query
vector_search CTE: This performs the ANN search using the <=> (cosine distance) operator. We use ROW_NUMBER() to assign a rank to each result based on its distance. Critically, we LIMIT the number of results to a reasonable number (:top_k_vector, e.g., 50) to keep the query fast.keyword_search CTE: This performs the FTS search. The @@ operator matches the tsvector against a tsquery. ts_rank_cd calculates the relevance score, which we use for ordering. Again, we use ROW_NUMBER() to assign a rank and LIMIT the results.combined_results CTE: We UNION ALL the two result sets. A document that appears in both searches will now have two rows in this CTE, one with its vector rank and one with its keyword rank.id, and for each document, we calculate its final rrf_score by summing the 1 / (k + rank) values. The results are then ordered by this new, unified score to produce our final, relevance-fused ranking.This single query is the heart of our hybrid search engine. It's declarative, efficient, and leverages the full power of PostgreSQL's query planner.
Production Implementation: Python Application Logic
Let's wrap this SQL logic in a Python function using the asyncpg library for high-performance, asynchronous database access.
import asyncio
import asyncpg
import numpy as np
from typing import List, Dict, Any
# Assume you have an embedding function
# from some_embedding_service import get_embedding
# Placeholder for a real embedding function
def get_embedding(text: str) -> List[float]:
# In a real application, this would call an API like OpenAI, Cohere, or a local model.
# Using a random 1536-dim vector for demonstration.
return np.random.rand(1536).tolist()
def format_tsquery(query: str) -> str:
"""Formats a user query string into a tsquery-compatible string.
Example: 'cloud billing anomalies' -> 'cloud & billing & anomalies'
"""
return " & ".join(query.strip().split())
class HybridSearcher:
def __init__(self, dsn: str):
self.dsn = dsn
self.pool = None
async def connect(self):
self.pool = await asyncpg.create_pool(self.dsn)
async def close(self):
if self.pool:
await self.pool.close()
async def search(self,
query_text: str,
top_k_vector: int = 50,
top_k_keyword: int = 50,
final_limit: int = 10,
k_rrf: int = 60) -> List[Dict[str, Any]]:
if not self.pool:
raise ConnectionError("Database connection pool is not initialized. Call connect() first.")
query_embedding = get_embedding(query_text)
ts_query_text = format_tsquery(query_text)
sql_query = """
WITH vector_search AS (
SELECT
id,
ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rank
FROM documents
ORDER BY embedding <=> $1
LIMIT $2
), keyword_search AS (
SELECT
id,
ROW_NUMBER() OVER (ORDER BY ts_rank_cd(content_tsv, to_tsquery('english', $3)) DESC) AS rank
FROM documents
WHERE content_tsv @@ to_tsquery('english', $3)
ORDER BY ts_rank_cd(content_tsv, to_tsquery('english', $3)) DESC
LIMIT $4
), combined_results AS (
SELECT id, rank FROM vector_search
UNION ALL
SELECT id, rank FROM keyword_search
)
SELECT
cr.id,
SUM(1.0 / ($5 + cr.rank)) AS rrf_score,
d.content
FROM combined_results cr
JOIN documents d ON cr.id = d.id
GROUP BY cr.id, d.content
ORDER BY rrf_score DESC
LIMIT $6;
"""
async with self.pool.acquire() as connection:
results = await connection.fetch(
sql_query,
query_embedding,
top_k_vector,
ts_query_text,
top_k_keyword,
k_rrf,
final_limit
)
return [dict(row) for row in results]
# Example Usage
async def main():
DATABASE_URL = "postgresql://user:password@host:port/dbname"
searcher = HybridSearcher(dsn=DATABASE_URL)
await searcher.connect()
try:
# Scenario 1: Semantic search
results_semantic = await searcher.search("unexpected charges on my cloud bill")
print("--- Semantic Search Results ---")
for res in results_semantic:
print(f"ID: {res['id']}, Score: {res['rrf_score']:.4f}, Content: {res['content'][:100]}...")
print("\n")
# Scenario 2: Keyword-heavy search
results_keyword = await searcher.search("GKE-1128-B autoscaler error")
print("--- Keyword Search Results ---")
for res in results_keyword:
print(f"ID: {res['id']}, Score: {res['rrf_score']:.4f}, Content: {res['content'][:100]}...")
finally:
await searcher.close()
if __name__ == "__main__":
asyncio.run(main())
This Python code provides a clean, reusable class for interacting with our backend. It correctly parameterizes the query to prevent SQL injection and handles the logic of getting embeddings and formatting the text for tsquery.
Performance Analysis and Edge Case Handling
A complex query like this requires scrutiny to ensure it performs under load.
Analyzing the Query Plan
Running EXPLAIN ANALYZE on the query is non-negotiable. Here’s what you should look for in the query plan:
-> Limit (cost=...)
-> Sort (cost=... sort_key: (sum(...)))
-> HashAggregate (cost=...)
-> Append (cost=...)
-> Subquery Scan on vector_search (cost=...)
-> Limit (cost=...)
-> Index Scan using documents_embedding_idx on documents (cost=...)
-> Subquery Scan on keyword_search (cost=...)
-> Limit (cost=...)
-> Sort (cost=... sort_key: ...)
-> Bitmap Heap Scan on documents (cost=...)
-> Bitmap Index Scan using documents_content_tsv_idx (cost=...)
Key things to verify:
documents_embedding_idx: The plan for vector_search must show an HNSW index scan. If it shows a sequential scan, your index is not being used, and performance will be abysmal.documents_content_tsv_idx: The plan for keyword_search should leverage the GIN index. A Bitmap scan is a common and highly efficient pattern for GIN indices.LIMIT clauses should be pushed down effectively, meaning the database should only retrieve 50 rows (or whatever your limit is) from each index scan, not the entire table.The `:top_k` Tuning Dilemma
The values for :top_k_vector and :top_k_keyword are critical tuning parameters.
* Too small: You risk missing relevant documents that might have been ranked higher after RRF fusion. For example, a document ranked #51 in vector search and #1 in keyword search would be missed entirely if top_k_vector is 50.
* Too large: You increase the amount of data being processed in the final aggregation step, which can slow down the query.
A good starting point is between 50 and 100. The optimal value depends on your data distribution and can be found by running benchmarks with a representative query set.
Edge Case: Advanced Text Search Configuration
The default 'english' text search configuration is a good start, but for specialized domains (e.g., medical, legal, or technical documentation), it can be insufficient. It might incorrectly stem technical terms or remove important stop words (like 'a' in 'Vitamin A').
You can create custom configurations to handle these cases.
-- Example: Create a new configuration that uses a simple dictionary (no stemming or stop words)
CREATE TEXT SEARCH CONFIGURATION public.technical (
PARSER = default
);
ALTER TEXT SEARCH CONFIGURATION public.technical
ADD MAPPING FOR asciiword WITH simple;
-- Then, use it in your trigger and queries:
-- to_tsvector('public.technical', NEW.content)
-- to_tsquery('public.technical', :query_text)
This gives you fine-grained control over how your text is tokenized, which is crucial for maximizing the effectiveness of the keyword search component.
Scaling Beyond a Single Node
This single-instance PostgreSQL pattern is remarkably scalable and can comfortably handle millions to tens of millions of documents on appropriate hardware (especially with high RAM to cache indices). When you outgrow a single vertical node:
* Read Replicas: The most straightforward scaling path for read-heavy workloads. Direct search queries to a pool of read replicas.
* Sharding (e.g., Citus): For truly massive datasets, you can use extensions like Citus to shard your documents table across multiple nodes. Both pgvector and FTS are compatible with Citus, but this introduces significant operational complexity.
* Dedicated Systems: At a certain scale, it may become prudent to migrate the search component to a dedicated system like Vespa or a combination of Elasticsearch and a dedicated vector database. However, the pattern described here will carry you an incredibly long way, offering an unmatched combination of power and operational simplicity for a vast range of production RAG applications.
Conclusion
By moving beyond naive vector-only retrieval, we have architected a sophisticated, production-ready hybrid search system. This approach directly addresses the lexical precision failures of pure semantic search by integrating the strengths of keyword-based FTS. By using Reciprocal Rank Fusion, we fuse the results in a principled, parameter-free manner, achieving a final ranking superior to either method alone.
Implementing this entire pipeline within PostgreSQL—combining pgvector, FTS, and advanced SQL—delivers a powerful, consolidated, and maintainable retrieval engine. It's a testament to the extensibility of modern relational databases and a pattern that enables senior engineers to build truly robust and trustworthy RAG systems.