Hybrid Search in Postgres: Fusing pgvector and BM25 with RRF

October 4, 2025

20 min read

Goh Ling Yong

Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Dichotomy of Modern Search: Precision vs. Semantics

In contemporary application development, search is no longer a solved problem. The rise of Large Language Models (LLMs) and accessible embedding APIs has shifted user expectations towards semantic understanding. Users expect a search engine to find "documents about reducing cloud infrastructure costs" even if the literal phrase doesn't appear. This is the domain of vector search.

However, the power of semantic search often comes at the cost of precision. A vector search might fail to surface a document that contains the exact, critical keyword "AWS Graviton3 cost-benefit analysis" if the query's semantic vector is too broad. This is where traditional, lexical search, powered by algorithms like Okapi BM25, excels.

The engineering dilemma is clear: how do we build a system that offers the best of both worlds without the operational overhead of managing two disparate systems, like Elasticsearch for keyword search and a dedicated vector database like Pinecone or Weaviate?

This article presents a production-grade solution within a single, familiar datastore: PostgreSQL. By leveraging the power of the pgvector extension for vector search and PostgreSQL's mature full-text search (FTS) capabilities, we can construct a sophisticated hybrid search engine. The core of our approach will be a robust technique for merging these two distinct result sets: Reciprocal Rank Fusion (RRF).

This is not an introduction. We assume you are familiar with PostgreSQL, SQL, and the fundamental concepts of vector embeddings and full-text search. We will focus on the intricate details of schema design, advanced indexing, query construction, result fusion algorithms, and performance tuning necessary for a production environment.

Part 1: The Foundation - Schema Design and Data Ingestion

Our first step is to design a table that can efficiently store textual content, its corresponding full-text search vector, and its semantic embedding vector. We'll also include a metadata column, which is common practice for filtering and faceting in real-world applications.

Table Schema

We'll need the pgvector extension enabled. If you haven't already, run CREATE EXTENSION vector;.

sql

CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    metadata JSONB,
    -- Column for Full-Text Search (FTS)
    content_tsv TSVECTOR 
        GENERATED ALWAYS AS (to_tsvector('english', content)) STORED,
    -- Column for Semantic Search (Vector)
    embedding VECTOR(768) -- Dimension depends on your embedding model
);

Key Design Decisions:

id BIGSERIAL: Using BIGSERIAL is crucial for tables expected to grow beyond 2 billion records. Standard SERIAL (an integer) can be a silent time bomb.

content_tsv TSVECTOR GENERATED ALWAYS AS (...) STORED: This is a powerful PostgreSQL 12+ feature. The tsvector is automatically generated and kept in sync with the content column. STORED materializes the column, which is essential for indexing. Without STORED, the tsvector would be computed on-the-fly for every query, destroying performance.

embedding VECTOR(768): The dimension (here, 768) is dictated entirely by the embedding model you choose. For example, all-MiniLM-L6-v2 produces 384-dimension vectors, while text-embedding-ada-002 from OpenAI produces 1536. Mismatching this dimension will cause errors.

Indexing Strategy

With the table defined, we must create indexes to support fast querying for both search paradigms.

sql

-- Index for Full-Text Search
CREATE INDEX idx_documents_content_tsv ON documents USING GIN(content_tsv);

-- Index for Vector Search (using HNSW)
-- The parameters M and ef_construction are critical for performance and recall.
CREATE INDEX idx_documents_embedding_hnsw ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

Advanced Indexing Considerations:

* GIN for FTS: A GIN (Generalized Inverted Index) is the standard and most performant choice for tsvector data. It maps lexemes to the locations where they appear, making keyword lookups extremely fast.

* HNSW for Vectors: pgvector supports multiple index types. We've chosen HNSW (Hierarchical Navigable Small World) over the older IVFFlat for several reasons:

* No probes tuning: HNSW search performance is controlled by the ef_search parameter at query time, not by a fixed probes value at index time, offering more flexibility.

* Better Recall/Performance Trade-off: HNSW generally provides superior performance for a given level of recall compared to IVFFlat, especially on high-dimensional data and large datasets.

* m and ef_construction: These are build-time parameters. m defines the maximum number of connections per layer in the graph (defaults to 16). ef_construction controls the size of the dynamic candidate list during index build (defaults to 64). Increasing these can improve recall at the cost of slower index build times and a larger index size. For a production system, you must benchmark these values against your specific dataset and recall requirements.

Data Ingestion Pipeline (Python Example)

Here is a production-style Python script for populating our table. It uses sentence-transformers for local embedding generation and psycopg2 for robust database interaction.

python

import psycopg2
from psycopg2.extras import execute_batch
from sentence_transformers import SentenceTransformer
import json

# --- Configuration ---
# It's crucial to use a model that aligns with your vector dimension (768 in our schema)
MODEL_NAME = 'msmarco-distilbert-base-v4'
DB_CONNECTION_STRING = "postgresql://user:password@host:port/dbname"

# --- Sample Data ---
sample_docs = [
    {
        "content": "The new XYZ framework provides significant performance improvements for server-side rendering.",
        "metadata": {"source": "engineering_blog", "year": 2023}
    },
    {
        "content": "Optimizing cloud costs is a key priority. We can achieve this by leveraging ARM-based instances like AWS Graviton.",
        "metadata": {"source": "finance_report", "year": 2022}
    },
    {
        "content": "Our guide to server performance tuning covers caching, database indexing, and network latency.",
        "metadata": {"source": "dev_docs", "year": 2021}
    },
    {
        "content": "A cost-benefit analysis of migrating to Graviton processors for our primary compute workloads.",
        "metadata": {"source": "architecture_rfc", "year": 2023}
    }
]

def ingest_data():
    """Connects to PostgreSQL, generates embeddings, and ingests data."""
    print(f"Loading sentence transformer model: {MODEL_NAME}")
    model = SentenceTransformer(MODEL_NAME)

    print("Generating embeddings for documents...")
    contents = [doc['content'] for doc in sample_docs]
    embeddings = model.encode(contents, show_progress_bar=True)

    records_to_insert = []
    for i, doc in enumerate(sample_docs):
        records_to_insert.append((
            doc['content'], 
            json.dumps(doc['metadata']), 
            embeddings[i].tolist() # Convert numpy array to list for psycopg2
        ))

    try:
        print("Connecting to the database...")
        with psycopg2.connect(DB_CONNECTION_STRING) as conn:
            with conn.cursor() as cur:
                print("Ingesting data using execute_batch...")
                execute_batch(
                    cur,
                    "INSERT INTO documents (content, metadata, embedding) VALUES (%s, %s, %s);",
                    records_to_insert,
                    page_size=100 # Adjust batch size based on record size and memory
                )
                print(f"Successfully inserted {cur.rowcount} records.")
        conn.commit()
    except psycopg2.Error as e:
        print(f"Database error: {e}")

if __name__ == "__main__":
    ingest_data()

This script demonstrates batch insertion with execute_batch, which is significantly more performant than single INSERT statements in a loop.

Part 2: The Building Blocks - Independent Querying

Before we can fuse results, we must master querying each search system independently. This allows us to establish a baseline and understand the behavior of each component.

Keyword Search with BM25

PostgreSQL's FTS ranking function, ts_rank_cd, is a well-regarded implementation that closely follows the principles of the Okapi BM25 algorithm. It considers term frequency (TF) and inverse document frequency (IDF) and normalizes for document length.

Let's search for documents related to "server performance":

sql

-- Create a query variable for clarity
WITH query AS (SELECT to_tsquery('english', 'server & performance') AS q)
SELECT 
    d.id,
    d.content,
    ts_rank_cd(d.content_tsv, query.q) AS bm25_score
FROM 
    documents d, query
WHERE 
    d.content_tsv @@ query.q
ORDER BY 
    bm25_score DESC
LIMIT 10;

EXPLAIN ANALYZE Output Analysis:

text

Limit  (cost=...)
  ->  Sort  (cost=...)
        Sort Key: (ts_rank_cd(content_tsv, ...))
        ->  Bitmap Heap Scan on documents d  (cost=...)
              Recheck Cond: (content_tsv @@ ...)
              ->  Bitmap Index Scan on idx_documents_content_tsv  (cost=...)
                    Index Cond: (content_tsv @@ ...)

The key takeaway is the Bitmap Index Scan on our GIN index (idx_documents_content_tsv). This confirms the index is being used effectively, preventing a costly sequential scan over the entire table.

Semantic Search with pgvector and HNSW

For semantic search, we first generate an embedding for our query string and then use a distance operator to find the nearest neighbors in our database.

The <=> operator in pgvector calculates the cosine distance (1 - cosine similarity). A smaller distance means a closer semantic match.

Here's the Python and SQL flow:

python

# In your application code
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('msmarco-distilbert-base-v4')
query_text = "how to improve server speed"
query_embedding = model.encode(query_text).tolist()

# Now, pass `query_embedding` as a parameter to your SQL query

sql

-- In your database client, assuming :query_embedding is the parameter
-- NOTE: HNSW search performance is tuned with ef_search
SET hnsw.ef_search = 100;

SELECT 
    id,
    content,
    1 - (embedding <=> :query_embedding) AS similarity_score -- Convert distance to similarity
FROM 
    documents
ORDER BY 
    embedding <=> :query_embedding
LIMIT 10;

Advanced HNSW Tuning (ef_search):

* hnsw.ef_search is a run-time parameter that controls the size of the dynamic candidate list during the search phase. It is the single most important knob for tuning the trade-off between speed and accuracy (recall).

* Low ef_search (e.g., 20): Very fast searches, but may miss relevant neighbors (lower recall).

* High ef_search (e.g., 200): Slower searches, but explores more of the graph, leading to higher recall.

* Production Strategy: You must benchmark this value. Run a set of known queries and measure both latency (p95, p99) and recall against a ground-truth set (e.g., results from an exhaustive k-NN search) to find the optimal value for your application's SLOs.

Part 3: The Core Challenge - Reciprocal Rank Fusion (RRF)

Now we have two ordered lists of document IDs, one from BM25 and one from vector search. The challenge is that their scores are on completely different scales and distributions. A BM25 score of 0.8 has no mathematical relationship to a cosine similarity of 0.8. Simply adding them or averaging them is statistically invalid and will produce poor results.

This is where Reciprocal Rank Fusion (RRF) comes in. RRF is a simple yet remarkably effective algorithm that disregards the raw scores and instead uses the rank of each document in its respective result list.

The formula is:

RRF_Score(d) = Σ (1 / (k + rank_i(d)))

Where:

* d is a document.

* rank_i(d) is the rank of document d in result set i.

* k is a constant to mitigate the effect of high ranks (i.e., documents ranked 1st or 2nd) having disproportionately large scores. A common value for k is 60.

Let's implement this in SQL using Common Table Expressions (CTEs).

sql

-- Full Hybrid Search Query with RRF
-- :query_text = 'cost optimization for graviton servers'
-- :query_embedding = [0.1, 0.2, ...]

WITH fts AS (
    -- 1. Get ranked results from Full-Text Search
    SELECT 
        id,
        ts_rank_cd(content_tsv, websearch_to_tsquery('english', :query_text)) AS score,
        ROW_NUMBER() OVER (ORDER BY ts_rank_cd(content_tsv, websearch_to_tsquery('english', :query_text)) DESC) as rank
    FROM documents
    WHERE content_tsv @@ websearch_to_tsquery('english', :query_text)
    ORDER BY score DESC
    LIMIT 100 -- Limit the candidate set from each search
), 
vector AS (
    -- 2. Get ranked results from Vector Search
    SET LOCAL hnsw.ef_search = 100;
    SELECT 
        id,
        1 - (embedding <=> :query_embedding) as score,
        ROW_NUMBER() OVER (ORDER BY embedding <=> :query_embedding ASC) as rank
    FROM documents
    ORDER BY embedding <=> :query_embedding ASC
    LIMIT 100 -- Limit the candidate set from each search
), 
rrf_fusion AS (
    -- 3. Fuse the results using Reciprocal Rank Fusion (RRF)
    SELECT
        COALESCE(fts.id, vector.id) as id,
        -- Calculate RRF score. k=60 is a common constant.
        (CASE WHEN fts.rank IS NULL THEN 0.0 ELSE 1.0 / (60 + fts.rank) END) +
        (CASE WHEN vector.rank IS NULL THEN 0.0 ELSE 1.0 / (60 + vector.rank) END) as rrf_score
    FROM fts
    FULL OUTER JOIN vector ON fts.id = vector.id
)
SELECT 
    rf.id,
    d.content,
    rf.rrf_score
FROM rrf_fusion rf
JOIN documents d ON rf.id = d.id
ORDER BY rf.rrf_score DESC
LIMIT 20;

Breakdown of the Fusion Logic:

fts CTE: Executes the BM25 search. Critically, it uses ROW_NUMBER() to assign a rank to each result.

vector CTE: Executes the vector search and also assigns a rank.

LIMIT 100: This is a vital performance optimization. We only consider the top 100 candidates from each search paradigm for fusion. Fusing the entire dataset would be prohibitively slow. The choice of 100 is arbitrary and should be tuned based on your recall requirements.

rrf_fusion CTE: This is the heart of the algorithm.

* FULL OUTER JOIN: This is essential for handling edge cases where a document appears in one result set but not the other. An INNER JOIN would discard these valuable unique results.

* COALESCE(fts.id, vector.id): Gets the ID regardless of which result set it came from.

* CASE WHEN ... IS NULL THEN 0.0 ...: If a document is not found in a result set (e.g., fts.rank is NULL), its contribution to the RRF score from that set is 0. Otherwise, we calculate 1 / (60 + rank).

Final SELECT: Joins back to the documents table to retrieve the content and presents the final, re-ranked list to the user.

Part 4: Production-Ready Implementation - A Pluggable PL/pgSQL Function

Exposing the complex query above directly to your application layer is a poor practice. It's brittle, hard to maintain, and requires the application to know too much about the database's internal logic. A much cleaner approach is to encapsulate this logic within a PostgreSQL function.

This provides a stable, versioned API for your application to call.

sql

CREATE OR REPLACE FUNCTION hybrid_search(
    p_query_text TEXT,
    p_query_embedding VECTOR(768),
    p_match_limit INT DEFAULT 20,
    p_candidate_limit INT DEFAULT 100,
    p_rrf_k INT DEFAULT 60
) 
RETURNS TABLE (id BIGINT, content TEXT, rrf_score DOUBLE PRECISION) AS $$
DECLARE
    -- Use websearch_to_tsquery for more lenient parsing of user input
    v_tsquery TSQUERY := websearch_to_tsquery('english', p_query_text);
BEGIN
    -- Set HNSW parameters for the duration of this transaction
    -- Using SET LOCAL ensures it doesn't affect other sessions
    SET LOCAL hnsw.ef_search = 100;

    RETURN QUERY
    WITH fts AS (
        SELECT 
            d.id,
            ROW_NUMBER() OVER (ORDER BY ts_rank_cd(d.content_tsv, v_tsquery) DESC) as rank
        FROM documents d
        WHERE d.content_tsv @@ v_tsquery
        ORDER BY rank
        LIMIT p_candidate_limit
    ), 
    vector AS (
        SELECT 
            d.id,
            ROW_NUMBER() OVER (ORDER BY d.embedding <=> p_query_embedding ASC) as rank
        FROM documents d
        ORDER BY rank
        LIMIT p_candidate_limit
    ), 
    rrf_fusion AS (
        SELECT
            COALESCE(fts.id, vector.id) as doc_id,
            (CASE WHEN fts.rank IS NULL THEN 0.0 ELSE 1.0 / (p_rrf_k + fts.rank) END) +
            (CASE WHEN vector.rank IS NULL THEN 0.0 ELSE 1.0 / (p_rrf_k + vector.rank) END) as score
        FROM fts
        FULL OUTER JOIN vector ON fts.id = vector.id
    )
    SELECT 
        rf.doc_id,
        d.content,
        rf.score
    FROM rrf_fusion rf
    JOIN documents d ON rf.doc_id = d.id
    ORDER BY rf.score DESC
    LIMIT p_match_limit;

END; $$ LANGUAGE plpgsql;

Advantages of this Functional Approach:

Abstraction: The application only needs to call SELECT FROM hybrid_search('my query', ARRAY[...]);. The complexity is hidden.

* Maintainability: You can update the fusion logic, tune parameters, or even swap out the ranking algorithm within the function without changing any application code.

* Configuration: We've parameterized key values like match_limit, candidate_limit, and the RRF k constant, allowing for flexible tuning from the application if needed.

* SET LOCAL: This is a critical detail for production. It ensures that our hnsw.ef_search setting only applies to the current transaction, preventing it from leaking and affecting other queries running in the same session.

Part 5: Advanced Topics and Edge Cases

Weighted Fusion

While RRF's parameter-free nature is a strength, sometimes you may want to explicitly boost one search type over the other. You can introduce weights into the RRF formula:

Weighted_RRF = (w_fts (1 / (k + rank_fts))) + (w_vec (1 / (k + rank_vec)))

To implement this, simply add weight parameters (p_fts_weight, p_vector_weight) to the function and multiply the components of the score calculation.

sql

-- Inside the rrf_fusion CTE of the function
...
(p_fts_weight * (CASE WHEN fts.rank IS NULL THEN 0.0 ELSE 1.0 / (p_rrf_k + fts.rank) END)) +
(p_vector_weight * (CASE WHEN vector.rank IS NULL THEN 0.0 ELSE 1.0 / (p_rrf_k + vector.rank) END)) as score
...

This allows you to dynamically adjust the search behavior, perhaps based on user preferences or the type of query.

Handling Metadata Filtering

Real-world search almost always involves filtering (e.g., WHERE metadata->>'year' = '2023'). This poses a significant challenge for vector search indexes like HNSW.

The Problem: HNSW is not designed for post-filtering. The index traversal finds the N nearest neighbors in the entire dataset first, and then the WHERE clause is applied. If your filter is highly selective (e.g., it removes 99% of the documents), it's possible that all of the top N results from the index scan will be filtered out, leaving you with zero results, even though matching documents exist.

Solutions and Trade-offs:

Over-fetch and Re-rank (The Common Approach): Increase the LIMIT in the vector CTE significantly (e.g., from 100 to 1000 or more). This fetches more candidates from the index, increasing the probability that enough will survive the filter. This is the simplest approach but can degrade performance.

sql

    -- In the vector CTE
    ...
    FROM documents d
    WHERE d.metadata->>'source' = 'engineering_blog'
    ORDER BY embedding <=> p_query_embedding ASC
    LIMIT 1000; -- Over-fetch to compensate for filtering

Multi-tenancy via Table Partitioning: For strict data separation (e.g., multi-tenant applications), partition the documents table by tenant_id. This allows the HNSW index to operate on a much smaller, pre-filtered subset of the data, which is far more efficient.

Emerging Index Capabilities: The vector database space is evolving rapidly. Keep an eye on pgvector and other extensions for improvements in metadata filtering capabilities, as this is an area of active research.

Index Maintenance and VACUUM

Both GIN and HNSW indexes require maintenance.

* GIN: The GIN index can have a "pending list" of new entries that are not yet merged into the main index structure. Frequent VACUUM operations are necessary to keep query performance optimal, especially on tables with high write throughput.

* HNSW: Deletes in pgvector are currently logical. The space is not reclaimed until the index is rebuilt. For tables with very high churn (frequent DELETEs or UPDATEs), you may need a strategy to periodically REINDEX the HNSW index during a maintenance window.

Conclusion: A Unified, Powerful Search System

By thoughtfully combining PostgreSQL's built-in full-text search with the pgvector extension, we have architected a sophisticated hybrid search system that rivals the capabilities of complex, multi-system setups. The use of Reciprocal Rank Fusion provides a principled, robust method for merging disparate result sets, while encapsulation within a PL/pgSQL function creates a clean, maintainable API for our applications.

This pattern is not a silver bullet, but it represents a powerful and operationally simple alternative for teams already invested in the PostgreSQL ecosystem. It avoids the added complexity, cost, and latency of managing and synchronizing data between separate keyword and vector search clusters. For a vast range of applications, building an advanced search experience directly within your primary database is no longer a compromise—it's a strategic advantage.