Hybrid Search in Postgres: Fusing pgvector and FTS with RRF

Goh Ling Yong

The Semantic Gap: Why Pure Vector Search Fails in Production

As engineering teams rush to integrate Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), the default approach to information retrieval has become semantic search powered by vector embeddings. While powerful for understanding user intent and context, a pure vector-based approach exhibits a critical flaw in production systems: a surprising inability to handle specific, keyword-driven queries.

Consider a user searching an internal documentation system for a specific function name like calculate_amortization_schedule. A pure semantic search might return documents about "financial planning" or "loan payment calculations" because they are semantically similar. However, it may fail to rank the exact document containing the literal string calculate_amortization_schedule at the top if the surrounding context isn't a perfect semantic match. This is a frustrating and common user experience failure.

Conversely, traditional Full-Text Search (FTS), based on lexical matching algorithms like BM25, excels at these keyword-specific queries but fails to grasp semantic nuance. A search for "how to figure out monthly loan costs" would likely miss the document titled "Amortization Schedule Calculations."

This is the dilemma senior engineers face: choosing between semantic understanding and lexical precision. The solution is not to choose one, but to combine them. This article presents a production-grade pattern for implementing a sophisticated hybrid search engine entirely within PostgreSQL, leveraging the pgvector extension for dense vector retrieval and the battle-tested native Full-Text Search for sparse, lexical retrieval. We will fuse their results using a superior, rank-based merging strategy: Reciprocal Rank Fusion (RRF).

This is not a primer. We assume you understand what vector embeddings are and have a working knowledge of PostgreSQL. We will dive straight into schema design, advanced indexing, query optimization, and the complex SQL required to make this pattern performant at scale.


Architectural Blueprint: The Unified Document Schema

To build our hybrid search system, we must first design a schema that accommodates both sparse and dense representations of our data. Our goal is to co-locate the text, its tsvector representation for FTS, and its embedding vector in a single table. This avoids costly joins during query time.

Here is our foundational documents table:

sql
-- Ensure required extensions are enabled
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS vector;

-- The core table for our hybrid search system
CREATE TABLE documents (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    tenant_id VARCHAR(255) NOT NULL, -- Crucial for multi-tenant applications
    content TEXT NOT NULL,
    metadata JSONB, -- For storing arbitrary structured data
    embedding VECTOR(768), -- Storing a 768-dim embedding (e.g., from 'all-mpnet-base-v2')
    content_tsv TSVECTOR -- The generated tsvector for Full-Text Search
        GENERATED ALWAYS AS (to_tsvector('english', content)) STORED,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- A trigger to automatically update the 'updated_at' timestamp
CREATE OR REPLACE FUNCTION trigger_set_timestamp()
RETURNS TRIGGER AS $$
BEGIN
  NEW.updated_at = NOW();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER set_timestamp
BEFORE UPDATE ON documents
FOR EACH ROW
EXECUTE PROCEDURE trigger_set_timestamp();

Key Design Decisions:

  • GENERATED ALWAYS AS ... STORED: We use a generated column for content_tsv. This modern PostgreSQL feature (v12+) is cleaner and less error-prone than maintaining the tsvector with a trigger. The STORED keyword materializes the value on disk, which is what makes it indexable; PostgreSQL requires its generated columns to be STORED, and a virtual column computed on read could not back a GIN index anyway.
  • VECTOR(768): The dimension of the vector is fixed at table creation. It's critical to match this to the output dimension of your chosen embedding model; mismatched dimensions will cause insertion errors. We've chosen 768 to match 'all-mpnet-base-v2'. Check your model before creating the table: the popular 'all-MiniLM-L6-v2', for instance, emits 384-dimensional vectors.
  • tenant_id: We include this from the start because most real-world applications are multi-tenant. This column will be a cornerstone of our indexing and query strategy to ensure data isolation and performance.

    Component 1: The Sparse Retriever (Full-Text Search)

    Our sparse retriever provides the lexical matching capability. PostgreSQL's FTS is incredibly powerful and, crucially, can be indexed for high performance.

    Indexing for FTS Performance

    A sequential scan over the documents table performing FTS is a recipe for disaster on any non-trivial dataset. We need an index on our content_tsv column. The standard choice for tsvector columns is a Generalized Inverted Index (GIN).

    sql
    -- Create a GIN index on the tsvector column.
    CREATE INDEX idx_documents_tsv_gin ON documents USING GIN(content_tsv);
    
    -- For multi-tenant filtering, pair it with a B-tree index on tenant_id;
    -- the planner can combine the two via a BitmapAnd.
    CREATE INDEX idx_documents_tenant ON documents (tenant_id);
    
    -- A multicolumn GIN over (tenant_id, content_tsv) is also possible, but it
    -- requires the btree_gin extension (GIN has no native operator class for
    -- varchar), and it rarely beats the simpler pair of indexes above:
    -- CREATE EXTENSION IF NOT EXISTS btree_gin;
    -- CREATE INDEX idx_documents_tenant_tsv_gin ON documents
    --     USING GIN (tenant_id, content_tsv);

    Why GIN? A GIN index creates an entry for each unique lexeme (word) and points to all the rows containing that lexeme. This is highly efficient for queries that search for specific terms, as the database can quickly look up the lexemes in the index and retrieve the corresponding row IDs.

    The FTS Query with Ranking

    We don't just want to find documents; we want to rank them by relevance. ts_rank_cd is a robust function for this, considering both term frequency and proximity.

    Here's a standalone FTS query that returns ranked results:

    sql
    -- Example FTS query for a specific tenant
    SELECT 
        id,
        content,
        ts_rank_cd(content_tsv, query) AS fts_score
    FROM 
        documents, 
        websearch_to_tsquery('english', 'loan payment calculations') AS query
    WHERE 
        tenant_id = 'tenant-123'
        AND content_tsv @@ query
    ORDER BY 
        fts_score DESC
    LIMIT 20;

    This query forms the first half of our hybrid search. It's fast, efficient (thanks to the GIN index), and provides a lexically-ranked list of documents.

    Component 2: The Dense Retriever (pgvector)

    Now for the semantic component. This involves generating embeddings for your query and using pgvector to find the most similar documents based on cosine distance.

    Generating and Storing Embeddings

    This happens at the application layer. When a document is created or updated, you must generate its embedding and store it in the embedding column. Here's a conceptual Python example using the sentence-transformers library:

    python
    from sentence_transformers import SentenceTransformer
    import psycopg2
    import numpy as np
    
    # Load a pre-trained model
    # This should be initialized once and reused.
    model = SentenceTransformer('all-mpnet-base-v2')
    
    def get_embedding(text: str) -> np.ndarray:
        """Generates a 768-dimension embedding for a given text."""
        return model.encode(text)
    
    def insert_document(conn, tenant_id: str, content: str):
        """Generates embedding and inserts a new document into the database."""
        embedding = get_embedding(content)
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO documents (tenant_id, content, embedding) VALUES (%s, %s, %s)",
                (tenant_id, content, embedding.tolist()) # psycopg2 sends the list as an array; pgvector casts it to vector on assignment
            )
        conn.commit()
    
    # --- Usage Example ---
    # conn = psycopg2.connect(database="mydb", user="user", password="pass", host="localhost", port="5432")
    # insert_document(conn, 'tenant-123', 'This document discusses advanced PostgreSQL indexing strategies.')
    # conn.close()

    The Indexing Dilemma: IVFFlat vs. HNSW

    This is where many teams stumble. A sequential scan for vector similarity is an O(N) operation and is unacceptably slow for more than a few thousand records. pgvector offers two primary index types for Approximate Nearest Neighbor (ANN) search: IVFFlat and HNSW.

    1. IVFFlat (Inverted File with Flat Compression)

    * How it works: It partitions your vectors into lists (clusters). At query time, it searches only a subset of these lists (probes) closest to your query vector.

    * Pros: Fast to build, low memory usage during build.

    * Cons: Search performance/recall is highly dependent on the lists and probes parameters. Finding the right balance is a tuning exercise. Recall can be lower than HNSW for the same speed.

    sql
    -- Creating an IVFFlat index
    -- A good starting point: lists = rows/1000 for up to 1M rows, sqrt(rows) above that.
    -- Assuming we have 1M documents:
    CREATE INDEX idx_documents_embedding_ivfflat ON documents USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 1000);
    
    -- When querying, you must set the 'probes' parameter.
    -- Higher probes = better recall, slower query.
    SET ivfflat.probes = 10;
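
    To turn the lists heuristic into code, here is a minimal Python helper (a sketch; the thresholds mirror the guidance in the comments above, and the function name is ours):

    python
    import math
    
    def recommended_ivfflat_lists(row_count: int) -> int:
        """Starting point for IVFFlat 'lists': rows/1000 up to 1M rows, sqrt(rows) beyond."""
        if row_count <= 1_000_000:
            return max(1, row_count // 1000)
        return round(math.sqrt(row_count))
    
    # recommended_ivfflat_lists(1_000_000)   -> 1000
    # recommended_ivfflat_lists(100_000_000) -> 10000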

    2. HNSW (Hierarchical Navigable Small Worlds)

    * How it works: It builds a multi-layered graph of vectors where upper layers have longer links and lower layers have shorter links. It navigates this graph to find the nearest neighbors efficiently.

    * Pros: Generally provides higher recall for a given performance target compared to IVFFlat. Search performance is excellent.

    * Cons: Much slower to build and consumes significantly more memory during index construction. Updates/deletes can be more costly.

    sql
    -- Creating an HNSW index
    -- m: max connections per layer (default 16)
    -- ef_construction: size of the dynamic candidate list during build (default 64)
    CREATE INDEX idx_documents_embedding_hnsw ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
    
    -- When querying, you set 'ef_search'.
    -- Higher ef_search = better recall, slower query.
    SET hnsw.ef_search = 40;

    Production Recommendation: For most modern applications where high recall and low latency are critical, HNSW is the superior choice, provided you can afford the longer build times and higher memory usage during indexing. Start with HNSW. Only consider IVFFlat if your build-time constraints are extremely tight or if you have a very high churn of data requiring frequent re-indexing.

    The Dense Vector Query

    With our HNSW index in place, the vector search query is straightforward. The <=> operator computes the cosine distance (0 = identical, 2 = diametrically opposite).

    sql
    -- Assume :query_embedding is a placeholder for your query vector
    -- e.g., '[0.1, 0.2, ..., 0.9]'
    
    SET hnsw.ef_search = 40; -- Tune this for your recall/performance needs
    
    SELECT 
        id,
        content,
        1 - (embedding <=> :query_embedding) AS vector_score -- Convert distance to cosine similarity (range -1 to 1)
    FROM 
        documents
    WHERE
        tenant_id = 'tenant-123'
    ORDER BY 
        embedding <=> :query_embedding
    LIMIT 20;
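
    At the application layer, the query embedding has to reach Postgres as a pgvector value. Below is a minimal psycopg2 sketch reusing get_embedding from earlier; to_vector_literal and vector_search are our own illustrative names. The explicit ::vector cast avoids depending on a client-side adapter (the pgvector Python package also ships one):

    python
    def to_vector_literal(embedding) -> str:
        """Render an array of floats as a pgvector literal like '[0.1,0.2,...]'."""
        return "[" + ",".join(str(float(x)) for x in embedding) + "]"
    
    def vector_search(conn, tenant_id: str, query_text: str, limit: int = 20):
        """Run the dense retrieval query for one tenant."""
        query_embedding = to_vector_literal(get_embedding(query_text))
        with conn.cursor() as cur:
            cur.execute("SET hnsw.ef_search = 40;")  # tune for your recall needs
            cur.execute(
                """
                SELECT id, content,
                       1 - (embedding <=> %(qe)s::vector) AS vector_score
                FROM documents
                WHERE tenant_id = %(tenant)s
                ORDER BY embedding <=> %(qe)s::vector
                LIMIT %(limit)s
                """,
                {"qe": query_embedding, "tenant": tenant_id, "limit": limit},
            )
            return cur.fetchall()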

    The Fusion: Implementing Reciprocal Rank Fusion (RRF)

    We now have two ranked lists of documents: one from FTS, one from pgvector. How do we merge them? A naive approach might be to normalize the scores and add them up. This is flawed because the scores are not comparable. FTS scores (ts_rank_cd) and vector similarity scores are on different scales and have different distributions.

    Reciprocal Rank Fusion (RRF) provides an elegant, score-agnostic solution. It uses the rank of each document in the result lists, not its score. The formula for the RRF score of a document d is:

    RRF_Score(d) = Σ (1 / (k + rank_i(d)))

    where the sum runs over all result lists i, rank_i(d) is the 1-based rank of document d in list i, and k is a smoothing constant (typically 60) that dampens the outsized influence a single list's top ranks would otherwise have.
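
    To see the formula in action before we write the SQL, here is a tiny Python sketch of the fusion step (illustrative only; the production path is the SQL query below):

    python
    def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> dict[str, float]:
        """Fuse ranked lists of document ids using Reciprocal Rank Fusion."""
        scores: dict[str, float] = {}
        for results in result_lists:
            for rank, doc_id in enumerate(results, start=1):  # ranks are 1-based
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return scores
    
    fts_top = ["doc_a", "doc_b", "doc_c"]  # ranked by ts_rank_cd
    vec_top = ["doc_b", "doc_d", "doc_a"]  # ranked by cosine distance
    # doc_b: 1/62 + 1/61 ≈ 0.0325 (near the top of both lists -> wins)
    # doc_a: 1/61 + 1/63 ≈ 0.0323
    # doc_c: 1/63          ≈ 0.0159 (appears in only one list)
    print(sorted(rrf_fuse([fts_top, vec_top]).items(), key=lambda kv: -kv[1]))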

    The RRF SQL Master Query

    We can implement RRF in a single, powerful SQL query using Common Table Expressions (CTEs). This query is the heart of our system.

    sql
    -- Full Hybrid Search Query with RRF
    -- :query_text = 'how to figure out monthly loan costs'
    -- :query_embedding = '[...]' (embedding of the query_text)
    -- :tenant = 'tenant-123'
    
    WITH fts_results AS (
        -- 1. Get ranked results from Full-Text Search
        SELECT 
            id,
            ROW_NUMBER() OVER (ORDER BY ts_rank_cd(content_tsv, query) DESC) as rank
        FROM 
            documents, 
            websearch_to_tsquery('english', :query_text) AS query
        WHERE 
            tenant_id = :tenant
            AND content_tsv @@ query
        ORDER BY rank
        LIMIT 100 -- Limit the input size for fusion
    ), 
    vector_results AS (
        -- 2. Get ranked results from Vector Search
        SELECT 
            id,
            ROW_NUMBER() OVER (ORDER BY embedding <=> :query_embedding) as rank
        FROM 
            documents
        WHERE
            tenant_id = :tenant
        -- Optional: Add a pre-filter with FTS for performance on huge datasets.
        -- This is a common pattern called "sparse-first retrieval".
        -- AND content_tsv @@ websearch_to_tsquery('english', :query_text) 
        ORDER BY 
            embedding <=> :query_embedding
        LIMIT 100 -- Limit the input size for fusion
    ), 
    rrf_scores AS (
        -- 3. Calculate RRF scores
        SELECT
            id,
            -- The RRF formula: 1 / (k + rank)
            -- We use a default k of 60
            SUM(1.0 / (60 + rank)) AS rrf_score
        FROM (
            SELECT * FROM fts_results
            UNION ALL
            SELECT * FROM vector_results
        ) AS combined_results
        GROUP BY
            id
    )
    -- 4. Final Selection and Ranking
    SELECT 
        d.id,
        d.content,
        d.metadata,
        rs.rrf_score
    FROM 
        documents d
    JOIN 
        rrf_scores rs ON d.id = rs.id
    WHERE
        d.tenant_id = :tenant
    ORDER BY 
        rs.rrf_score DESC
    LIMIT 20;

    Dissecting the RRF Query:

  • fts_results CTE: Executes the FTS query and uses the ROW_NUMBER() window function to assign a rank to each result.
  • vector_results CTE: Executes the vector search query and similarly assigns a rank.
  • LIMIT 100: It's crucial to limit the number of results from each retriever. RRF is most effective on the top-k results. Fetching thousands of results from each would be slow and inefficient. 100 is a reasonable starting point.
  • rrf_scores CTE: This is where the magic happens. We UNION ALL the results from both retrievers. This creates a list of (id, rank) pairs. We then GROUP BY id and apply the RRF formula SUM(1.0 / (60 + rank)). A document appearing in both lists will have its score contribution summed, naturally boosting its final rank.
  • Final JOIN: We join back to the documents table to retrieve the full document content for the final, RRF-ranked results. A sketch of calling this query from application code follows this list.
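
    To make the wiring concrete, here is a sketch of executing the master query from Python with psycopg2, reusing get_embedding and to_vector_literal from the earlier snippets (our own helper names, not a library API). The only change to the SQL is swapping the :param placeholders for psycopg2's %(name)s style:

    python
    HYBRID_SEARCH_SQL = """
    WITH fts_results AS (
        SELECT id,
               ROW_NUMBER() OVER (ORDER BY ts_rank_cd(content_tsv, query) DESC) AS rank
        FROM documents, websearch_to_tsquery('english', %(query_text)s) AS query
        WHERE tenant_id = %(tenant)s AND content_tsv @@ query
        ORDER BY rank LIMIT 100
    ),
    vector_results AS (
        SELECT id,
               ROW_NUMBER() OVER (ORDER BY embedding <=> %(query_embedding)s::vector) AS rank
        FROM documents
        WHERE tenant_id = %(tenant)s
        ORDER BY embedding <=> %(query_embedding)s::vector LIMIT 100
    ),
    rrf_scores AS (
        SELECT id, SUM(1.0 / (60 + rank)) AS rrf_score
        FROM (SELECT * FROM fts_results UNION ALL SELECT * FROM vector_results) AS combined
        GROUP BY id
    )
    SELECT d.id, d.content, d.metadata, rs.rrf_score
    FROM documents d
    JOIN rrf_scores rs ON d.id = rs.id
    WHERE d.tenant_id = %(tenant)s
    ORDER BY rs.rrf_score DESC
    LIMIT 20;
    """
    
    def hybrid_search(conn, tenant_id: str, query_text: str):
        """Run the full RRF hybrid search for one tenant."""
        params = {
            "tenant": tenant_id,
            "query_text": query_text,
            "query_embedding": to_vector_literal(get_embedding(query_text)),
        }
        with conn.cursor() as cur:
            cur.execute(HYBRID_SEARCH_SQL, params)
            return cur.fetchall()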

    Production-Grade Performance Tuning

    Writing the query is half the battle. Ensuring it runs in under 100ms is the other half.

    `EXPLAIN ANALYZE` is Your Best Friend

    Always analyze your query plan. For the above query, you should see:

  • In fts_results: A Bitmap Index Scan on your GIN index (idx_documents_tsv_gin) feeding a Bitmap Heap Scan on documents. This is what you want to see. If you see a Sequential Scan, your index isn't being used.
  • In vector_results: An Index Scan on documents using your HNSW index (idx_documents_embedding_hnsw). This indicates the ANN search is working correctly.
  • The rest of the plan will involve hashing and aggregation for the CTEs and the final join. The key is to ensure the two retrieval CTEs are fast.

    sql
    -- To analyze, you'll need to run it with actual values
    EXPLAIN ANALYZE
    WITH fts_results AS (
        -- ... (with a real query string)
    ), 
    vector_results AS (
        -- ... (with a real vector)
    ), 
    -- ... rest of query

    Parameter Tuning: The Levers You Can Pull

  • hnsw.ef_search (for HNSW): This is the most important knob for tuning vector search. The default is often too low for high-recall applications. On a 1M vector dataset, you might test values from 40 to 100. Benchmark the trade-off: ef_search=40 might give 95% recall at 50ms, while ef_search=100 might give 99% recall at 120ms. Choose based on your product requirements. A benchmark harness sketch follows this list.
  • ivfflat.probes (for IVFFlat): Similar to ef_search. The default of 1 is for testing only. Production values are often in the 10-40 range. Again, benchmark recall vs. latency.
  • RRF k constant: The default of k=60 is from the original paper and works well. If you find that one retriever's ranks are systematically overpowering the other, you could experiment with a lower k (more weight to top ranks) or higher k (flatter distribution), but 60 is a robust default.
  • Input LIMIT: The LIMIT 100 in the CTEs is another performance lever. Reducing it to 50 will speed up the fusion step but might reduce the chance of finding relevant documents ranked lower by one of the retrievers. Increasing it might improve recall at the cost of performance.
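
    Here is a minimal harness sketch for the ef_search trade-off. It assumes a held-out sample of query embeddings (as pgvector literals) and uses an exact, index-disabled scan as ground truth for recall@20; top_ids and benchmark_ef_search are illustrative names:

    python
    import time
    
    TOP_K = 20
    SEARCH_SQL = """
        SELECT id FROM documents
        WHERE tenant_id = %(t)s
        ORDER BY embedding <=> %(qe)s::vector
        LIMIT %(k)s
    """
    
    def top_ids(cur, tenant, qe, exact=False):
        # With index scans disabled, pgvector falls back to an exact scan,
        # which we treat as the ground truth for recall.
        cur.execute("SELECT set_config('enable_indexscan', %s, false);",
                    ("off" if exact else "on",))
        cur.execute(SEARCH_SQL, {"t": tenant, "qe": qe, "k": TOP_K})
        return {row[0] for row in cur.fetchall()}
    
    def benchmark_ef_search(conn, tenant, query_embeddings):
        with conn.cursor() as cur:
            truth = [top_ids(cur, tenant, qe, exact=True) for qe in query_embeddings]
            for ef in (40, 64, 100):
                cur.execute("SELECT set_config('hnsw.ef_search', %s, false);", (str(ef),))
                start = time.perf_counter()
                hits = [top_ids(cur, tenant, qe) for qe in query_embeddings]
                avg_ms = (time.perf_counter() - start) * 1000 / len(query_embeddings)
                recall = sum(len(h & t) / TOP_K for h, t in zip(hits, truth)) / len(truth)
                print(f"ef_search={ef}: recall@{TOP_K}={recall:.2f}, avg={avg_ms:.1f}ms")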

    Edge Cases and Advanced Scenarios

  • Zero Results: The UNION ALL approach handles this gracefully. If one CTE returns zero rows, the rrf_scores CTE will simply calculate scores based on the results from the other.
  • Pagination: Standard LIMIT / OFFSET pagination on this complex query can be inefficient, as the entire RRF calculation must be redone for each page. For high-performance pagination, use keyset pagination. The ORDER BY clause becomes ORDER BY rs.rrf_score DESC, d.id ASC. Your application would then pass the rrf_score and id of the last item on the previous page as a WHERE clause condition to fetch the next page, avoiding the need for OFFSET.
    sql
        -- Keyset pagination example for the next page.
        -- Because the sort mixes directions (score DESC, id ASC), a single
        -- row-value comparison is incorrect; spell the condition out:
        -- ... final SELECT ...
        WHERE 
            d.tenant_id = :tenant AND
            (rs.rrf_score < :last_rrf_score
             OR (rs.rrf_score = :last_rrf_score AND d.id > :last_id))
        ORDER BY
            rs.rrf_score DESC, d.id ASC
        LIMIT 20;
  • Sparse-First Filtering: On extremely large datasets (100M+ rows), even the HNSW index scan can become a bottleneck. A common optimization is to use FTS as a fast, preliminary filter for the vector search. You would add a WHERE content_tsv @@ ... clause to the vector_results CTE. This reduces the search space for the more expensive vector similarity calculation. The trade-off is that you might miss documents that are semantically relevant but don't match any keywords.

    Conclusion: A Unified, Powerful Search Stack

    By embracing the native capabilities of PostgreSQL and the power of extensions like pgvector, we have engineered a hybrid search system that is more than the sum of its parts. It marries the lexical precision required for specific queries with the deep semantic understanding needed for broad, intent-based searches.

    The Reciprocal Rank Fusion pattern implemented with CTEs provides a robust, production-ready method for merging these distinct search modalities. This approach eliminates the need for a complex, multi-system architecture involving separate services like Elasticsearch and a dedicated vector database (e.g., Pinecone, Weaviate). By keeping everything within PostgreSQL, you simplify your operational overhead, reduce data synchronization issues, and leverage a single, powerful transaction model.

    This is not a theoretical exercise. This pattern is being deployed in production to power RAG applications, e-commerce search, and internal knowledge bases, delivering a state-of-the-art search experience with a surprisingly lean and manageable tech stack.
