PostgreSQL Hybrid Search: Combining pgvector & FTS for Advanced RAG
The Retrieval Problem in Production RAG
In building sophisticated Retrieval-Augmented Generation (RAG) systems, the quality of the retrieval step is paramount. A common starting point is pure vector similarity search, which excels at finding semantically related documents. However, senior engineers quickly discover its Achilles' heel: an inability to reliably match specific keywords, acronyms, or product codes. A query for "PostgreSQL pg_stat_statements" might surface documents about general database monitoring but miss the exact technical document because the semantic meaning is close but the keywords are not dominant in the vector space.
Conversely, traditional full-text search (FTS), like PostgreSQL's native implementation, is a master of keyword matching but is semantically naive. It cannot understand that "database performance tuning" and "optimizing SQL queries" are related concepts.
This dichotomy leads to a suboptimal retrieval context for the LLM, resulting in less accurate and relevant generated responses. The solution is not to choose one over the other, but to combine their strengths. This article provides a production-focused blueprint for implementing a high-performance hybrid search system directly within PostgreSQL, leveraging the pgvector extension and native FTS. We will dispense with the basics and dive directly into schema design, advanced indexing, query execution patterns, and a robust result fusion strategy.
Schema and Data Modeling for Duality
Our foundation is a table designed to efficiently serve both FTS and vector search queries. The key is to pre-process and store the data in a query-optimized format.
-- Ensure the vector extension is enabled
CREATE EXTENSION IF NOT EXISTS vector;
-- Main table for our documents
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
content TEXT NOT NULL,
metadata JSONB,
-- Assuming OpenAI's text-embedding-ada-002 model with 1536 dimensions
embedding VECTOR(1536),
-- Pre-calculated tsvector for FTS performance
fts_document TSVECTOR,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Create a GIN index on the tsvector for fast FTS
CREATE INDEX idx_documents_fts ON documents USING GIN(fts_document);
The Critical Role of the Pre-computed `tsvector`
A common mistake is to call to_tsvector() within the WHERE clause of a query. This forces PostgreSQL to convert the content text to a tsvector on-the-fly for every row being considered, which is disastrous for performance as it prevents an index from being used effectively.
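For contrast, here is a quick sketch of the two query shapes (the search terms are placeholders):

-- Anti-pattern: re-parses content for every candidate row, so the GIN index
-- on the stored tsvector column cannot help.
SELECT id FROM documents
WHERE to_tsvector('english', content) @@ websearch_to_tsquery('english', 'hybrid search');
-- Indexed pattern: match against the pre-computed column instead.
SELECT id FROM documents
WHERE fts_document @@ websearch_to_tsquery('english', 'hybrid search');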
The production pattern is to pre-compute the tsvector and store it in its own column. We enforce this with a trigger that automatically updates the fts_document column whenever a document's content changes.
-- Function to update the fts_document column
CREATE OR REPLACE FUNCTION update_fts_document() RETURNS TRIGGER AS $$
BEGIN
-- Using 'english' configuration, but this can be customized
NEW.fts_document := to_tsvector('english', NEW.content);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
-- Trigger to call the function before any insert or update
CREATE TRIGGER trg_update_fts_document
BEFORE INSERT OR UPDATE OF content ON documents
FOR EACH ROW
EXECUTE FUNCTION update_fts_document();
This setup ensures that the expensive to_tsvector operation happens only once at write time, making our read queries for FTS significantly faster by allowing them to directly use the idx_documents_fts GIN index.
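As an aside, on PostgreSQL 12 and later a stored generated column can replace the trigger entirely. A minimal sketch, assuming the same 'english' configuration (the table name here is purely illustrative):

-- Alternative on PostgreSQL 12+: declare the column as generated so PostgreSQL
-- keeps it in sync on every write, with no trigger to maintain.
CREATE TABLE documents_generated (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
content TEXT NOT NULL,
fts_document TSVECTOR
GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
);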
Advanced Indexing: A Tale of Two Search Types
With our schema in place, the most critical performance consideration is indexing. We need to optimize for two fundamentally different query patterns.
Vector Indexing: IVFFlat vs. HNSW
pgvector offers two primary index types for Approximate Nearest Neighbor (ANN) search: IVFFlat and HNSW. Choosing the right one is crucial.
1. IVFFlat (Inverted File with Flat Compression)
IVFFlat works by partitioning the vector space into a number of lists (clusters). At query time, it identifies the closest lists (probes) to the query vector and then performs an exhaustive search only within those lists.
- Key parameter: lists (the number of clusters to partition the data into).
- Pros: Faster to build than HNSW. Uses less memory during index construction.
- Cons: The speed-recall trade-off is often less favorable than HNSW. Performance is highly dependent on tuning the probes parameter at query time.
Implementation:
-- Example for a dataset of 1 million vectors
-- A common starting point for 'lists' is N / 1000 for smaller datasets,
-- or sqrt(N) for larger ones. Let's use sqrt(1,000,000) = 1000.
CREATE INDEX idx_documents_embedding_ivfflat ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);
At query time, you must set the ivfflat.probes parameter. A higher value increases accuracy (recall) at the cost of latency.
-- SET LOCAL applies only to the current transaction; use plain SET for session scope
SET LOCAL ivfflat.probes = 10;
SELECT id FROM documents ORDER BY embedding <=> $1 LIMIT 10;
When to consider IVFFlat: You have a truly massive dataset (tens or hundreds of millions of vectors) where HNSW build times are prohibitive, or you are operating in a memory-constrained environment during indexing.
2. HNSW (Hierarchical Navigable Small World)
HNSW builds a multi-layered graph of vectors where links are established between neighbors. Searching involves traversing this graph from a coarse layer to a fine layer, making it extremely efficient.
- Key parameters: m (max number of connections per node) and ef_construction (size of the dynamic candidate list during build).
- Pros: Generally provides a much better speed-recall trade-off. Query performance is excellent.
- Cons: Slower to build and more memory-intensive during index construction.
Implementation:
-- m: 16-32 is a good range. Higher m = better recall, higher index size.
-- ef_construction: 64-128 is a good range. Higher = better recall, slower build.
CREATE INDEX idx_documents_embedding_hnsw ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 128);
At query time, you tune hnsw.ef_search, which is the size of the dynamic candidate list during the search. This is the primary knob for balancing speed and recall.
-- SET LOCAL applies only to the current transaction; use plain SET for session scope
SET LOCAL hnsw.ef_search = 64;
SELECT id FROM documents ORDER BY embedding <=> $1 LIMIT 10;
Verdict for most RAG use cases: HNSW is the superior choice. The improved recall-to-latency ratio directly translates to better retrieval quality. The longer build time is a one-time cost that is well worth the query-time performance gains.
The Fusion Strategy: Reciprocal Rank Fusion (RRF)
Now that we can efficiently execute both FTS and vector searches, we need a robust method to combine their results. A naive approach like UNION in SQL is problematic because the scores (ts_rank for FTS, vector distance for similarity) are not directly comparable. FTS scores relate to keyword frequency and density, while vector distances relate to position in a high-dimensional space.
The industry-standard solution is Reciprocal Rank Fusion (RRF). It's an elegant, score-agnostic algorithm that prioritizes documents that rank highly in multiple result sets.
The formula is:
RRF_Score(doc) = Σ (1 / (k + rank(doc)))
Where rank(doc) is the 1-based rank of the document in a given result list, and k is a constant to reduce the impact of documents with very low ranks. A common value for k is 60.
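To make the arithmetic concrete (the ranks below are chosen purely for illustration): a document that comes back 2nd in the FTS list and 5th in the vector list, with k = 60, accumulates:

-- Rank 2 in the FTS results plus rank 5 in the vector results, k = 60
SELECT 1.0 / (60 + 2) + 1.0 / (60 + 5) AS rrf_score;  -- ≈ 0.0315

A document that appears in only one list contributes a single term, so documents that rank well in both lists naturally rise to the top.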
Crucially, RRF is performed in the application layer, not in the database. This allows each database query to be highly optimized for its specific task (FTS or vector search) using its dedicated index.
Production Implementation in Python
Here is a production-oriented Python implementation using asyncpg for concurrent query execution and a clear RRF implementation (the embedding call is stubbed out for brevity).
import asyncio
import asyncpg
from typing import List, Dict, Tuple
# Assume this function exists to get embeddings for a query
# from your embedding_service import get_embedding
async def get_embedding(text: str) -> List[float]:
# In a real app, this would call an embedding model (e.g., OpenAI, Cohere)
# For this example, we'll return a dummy vector.
import numpy as np
return np.random.rand(1536).tolist()
class HybridSearchService:
def __init__(self, pool: asyncpg.Pool):
self.pool = pool
# RRF constant, a common choice
self.RRF_K = 60
async def _execute_fts_query(self, query_text: str, limit: int) -> List[Tuple[str, float]]:
"""Executes a full-text search query and returns (id, rank)."""
# websearch_to_tsquery is more robust for user input than plainto_tsquery
sql = """
SELECT id, ts_rank(fts_document, websearch_to_tsquery('english', $1)) AS rank
FROM documents
WHERE fts_document @@ websearch_to_tsquery('english', $1)
ORDER BY rank DESC
LIMIT $2;
"""
async with self.pool.acquire() as conn:
return await conn.fetch(sql, query_text, limit)
    async def _execute_vector_query(self, query_embedding: List[float], limit: int, ef_search: int) -> List[Tuple[str, float]]:
        """Executes a vector similarity search and returns (id, distance)."""
        sql = """
            SELECT id, embedding <=> $1::vector AS distance
            FROM documents
            ORDER BY distance
            LIMIT $2;
        """
        # asyncpg has no built-in codec for the vector type; pass the embedding in
        # pgvector's text form (or register pgvector.asyncpg's codec on the pool).
        embedding_param = "[" + ",".join(map(str, query_embedding)) + "]"
        async with self.pool.acquire() as conn:
            # SET LOCAL only takes effect inside a transaction block
            async with conn.transaction():
                await conn.execute(f"SET LOCAL hnsw.ef_search = {ef_search}")
                return await conn.fetch(sql, embedding_param, limit)
def _fuse_results(
self,
fts_results: List[Tuple[str, float]],
vector_results: List[Tuple[str, float]]
) -> Dict[str, float]:
"""Combines results using Reciprocal Rank Fusion."""
fused_scores = {}
# Process FTS results
for i, (doc_id, _) in enumerate(fts_results):
rank = i + 1
if doc_id not in fused_scores:
fused_scores[doc_id] = 0.0
fused_scores[doc_id] += 1.0 / (self.RRF_K + rank)
# Process Vector results
for i, (doc_id, _) in enumerate(vector_results):
rank = i + 1
if doc_id not in fused_scores:
fused_scores[doc_id] = 0.0
fused_scores[doc_id] += 1.0 / (self.RRF_K + rank)
return fused_scores
async def search(self, query_text: str, top_n: int = 10) -> List[Dict]:
"""
Performs a full hybrid search: embedding, parallel queries, RRF fusion,
and finally fetching the full document content.
"""
if not query_text:
return []
# We fetch more results than needed (e.g., 5x) from each source
# to give the fusion algorithm more data to work with.
fetch_limit = top_n * 5
hnsw_ef_search = 128 # Higher value for better recall
# 1. Get query embedding
query_embedding = await get_embedding(query_text)
# 2. Run FTS and vector queries concurrently
fts_task = self._execute_fts_query(query_text, fetch_limit)
vector_task = self._execute_vector_query(query_embedding, fetch_limit, hnsw_ef_search)
fts_results, vector_results = await asyncio.gather(fts_task, vector_task)
# Handle edge case where one or both searches return no results
if not fts_results and not vector_results:
return []
# 3. Fuse the results using RRF
fused_scores = self._fuse_results(fts_results, vector_results)
# 4. Sort by fused score and get the top N document IDs
sorted_docs = sorted(fused_scores.items(), key=lambda item: item[1], reverse=True)
top_doc_ids = [doc_id for doc_id, score in sorted_docs[:top_n]]
if not top_doc_ids:
return []
# 5. Fetch the full document content for the top IDs
# Use array parameter for efficient fetching
sql_fetch_docs = """
SELECT id, content, metadata FROM documents WHERE id = ANY($1::UUID[]);
"""
async with self.pool.acquire() as conn:
final_docs_raw = await conn.fetch(sql_fetch_docs, top_doc_ids)
# Preserve the RRF order in the final result
doc_map = {str(doc['id']): doc for doc in final_docs_raw}
final_results = [doc_map[str(doc_id)] for doc_id in top_doc_ids if str(doc_id) in doc_map]
return final_results
# Example Usage:
async def main():
db_pool = await asyncpg.create_pool(user='user', password='password', database='db', host='127.0.0.1')
search_service = HybridSearchService(db_pool)
query = "how to optimize postgresql query performance"
results = await search_service.search(query, top_n=5)
for result in results:
print(f"ID: {result['id']}\nContent: {result['content'][:100]}...\n")
await db_pool.close()
if __name__ == "__main__":
asyncio.run(main())
This implementation demonstrates several production-ready patterns:
- asyncio.gather executes the FTS and vector queries in parallel, minimizing I/O wait time.
- asyncpg's connection pool is used correctly, with connections acquired per query and released automatically.
- The final document fetch uses WHERE id = ANY(...), which is far more efficient than fetching documents one by one.

Performance Tuning and Query Analysis
Writing the code is only half the battle. In production, you must verify your queries are performant and using the correct indexes. The tool for this is EXPLAIN ANALYZE.
Analyzing the Vector Search Query
BEGIN;
SET LOCAL hnsw.ef_search = 64;
EXPLAIN ANALYZE
SELECT id, embedding <=> '[...]' AS distance
FROM documents
ORDER BY distance
LIMIT 10;
ROLLBACK;
What to look for in the output:
Limit (cost=...) (actual time=15.36..)
-> Index Scan using idx_documents_embedding_hnsw on documents (cost=...) (actual time=15.34..)
Order By: (embedding <=> '[...]')
Planning Time: ...
Execution Time: 15.8 ms
The key line is Index Scan using idx_documents_embedding_hnsw. This confirms the HNSW index is being used. If you see a Sequential Scan, your index is not being used, and performance will be abysmal. The execution time should be in the low double-digit milliseconds for a well-tuned system.
Tuning hnsw.ef_search: This is your primary lever. A higher ef_search increases the search breadth within the graph, improving recall at the cost of latency. To find the optimal value, you need a validation set of queries with known relevant documents. Benchmark recall vs. p99 latency for different ef_search values (e.g., 32, 64, 128, 256) and choose the value that meets your application's requirements.
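One low-tech way to run that sweep is a throwaway transaction per setting, recording the reported execution time and which known-relevant documents appear (the embedding literal is elided here, as above):

-- One step of an ef_search sweep; repeat for 32, 64, 128, 256 and for each
-- validation query, then compare recall against the labeled relevant documents.
BEGIN;
SET LOCAL hnsw.ef_search = 32;
EXPLAIN ANALYZE
SELECT id FROM documents ORDER BY embedding <=> '[...]' LIMIT 10;
ROLLBACK;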
Analyzing the FTS Query
EXPLAIN ANALYZE SELECT id, ts_rank(fts_document, websearch_to_tsquery('english', '...')) AS rank
FROM documents
WHERE fts_document @@ websearch_to_tsquery('english', '...')
ORDER BY rank DESC
LIMIT 50;
What to look for in the output:
Limit (cost=...) (actual time=8.45..)
-> Sort (cost=...) (actual time=8.42..)
Sort Key: (ts_rank(..))
-> Bitmap Heap Scan on documents (cost=...) (actual time=2.15..)
Recheck Cond: (fts_document @@ ..)
-> Bitmap Index Scan on idx_documents_fts (cost=...) (actual time=1.5..)
Index Cond: (fts_document @@ ..)
Planning Time: ...
Execution Time: 8.9 ms
Here, the critical lines are Bitmap Index Scan on idx_documents_fts followed by a Bitmap Heap Scan. This is the canonical plan for an efficient GIN index lookup. The database first uses the index to create a bitmap of matching pages in memory, then visits only those pages to fetch the rows.
Tuning work_mem: The Sort operation in the FTS query plan is sensitive to work_mem. If the result set being sorted is larger than work_mem, PostgreSQL will spill to disk, causing a massive performance penalty. For applications with heavy FTS usage, monitoring for temporary file writes (log_temp_files parameter) and increasing work_mem at the session or user level can provide significant speedups.
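For reference, the session-level setting and the logging parameter mentioned above look like this (the 64MB figure is illustrative, not a recommendation):

-- Raise work_mem for the current session only.
SET work_mem = '64MB';
-- Log every temporary file so sort spills are visible in the server logs
-- (0 = log all temp files; requires superuser and a configuration reload).
ALTER SYSTEM SET log_temp_files = 0;
SELECT pg_reload_conf();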
Handling Advanced Edge Cases
- Multi-tenancy: add a tenant_id column to the documents table and ensure every query includes WHERE tenant_id = $.... To keep FTS fast, include tenant_id in the GIN index via the btree_gin extension (a plain composite index cannot combine a scalar column with a tsvector). pgvector's HNSW indexes are single-column, so tenant filtering happens after the index scan; per-tenant partial indexes, partitioning, or other architectural patterns can mitigate this at scale, as sketched below.
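A sketch of that tenant-aware indexing, assuming a UUID tenant_id column; the index names and the tenant UUID below are hypothetical:

-- Assumed multi-tenancy column.
ALTER TABLE documents ADD COLUMN tenant_id UUID;
-- btree_gin lets the scalar tenant_id sit alongside fts_document in one GIN index.
CREATE EXTENSION IF NOT EXISTS btree_gin;
CREATE INDEX idx_documents_tenant_fts ON documents USING GIN (tenant_id, fts_document);
-- HNSW indexes are single-column; a partial index per high-volume tenant is
-- one workaround when post-scan filtering is too slow.
CREATE INDEX idx_documents_embedding_tenant_a ON documents
USING hnsw (embedding vector_cosine_ops)
WHERE tenant_id = 'a1b2c3d4-0000-4000-8000-000000000001';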
Conclusion

By moving beyond simplistic single-mode retrieval, you can build a significantly more powerful and accurate RAG system. This hybrid search pattern, implemented entirely within PostgreSQL, provides a robust, scalable, and operationally simple solution. It avoids the complexity and cost of maintaining a separate search index like Elasticsearch alongside your primary database. The key takeaways for a production-grade implementation are:
- Pre-compute and store tsvector representations alongside your vector embeddings.
- Index the embedding column with HNSW for the best recall-to-latency trade-off.
- Run the FTS and vector queries concurrently and fuse their rankings with Reciprocal Rank Fusion in the application layer.
- Use EXPLAIN ANALYZE to verify index usage, and tune parameters like hnsw.ef_search and work_mem to meet your specific latency and recall targets.

This approach combines the semantic richness of vector search with the keyword precision of FTS, delivering a retrieval mechanism that provides a far more relevant and comprehensive context to your language models.