Production RAG: Hybrid Search with pgvector, BM25, and RRF

October 17, 2025

19 min read

Goh Ling Yong

Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Achilles' Heel of Pure Vector Search in Production RAG

As senior engineers building Retrieval-Augmented Generation (RAG) systems, we've all witnessed the magic of semantic search. A user asks about "cloud billing anomalies," and our system, powered by pgvector and a high-quality embedding model, correctly surfaces documents discussing "unforeseen GCP expenditures." This is a powerful capability. However, we've also seen its critical failure mode in production: a query for a specific error code like GKE-1128-B or a product SKU XJ-481-Z returns vaguely related concepts instead of the one document that contains the exact identifier.

This is the fundamental limitation of dense retrieval. It excels at understanding semantic meaning but is inherently lossy regarding lexical precision. For production systems where factual accuracy is paramount, this isn't just an inconvenience; it's a critical flaw that erodes user trust. The solution isn't to abandon vector search, but to augment it.

This article details an advanced, production-ready pattern for implementing hybrid search directly within PostgreSQL. We will combine dense vector retrieval (pgvector) with sparse, keyword-based retrieval (PostgreSQL's Full-Text Search, a powerful proxy for BM25) and fuse the results using a sophisticated, parameter-free ranking algorithm: Reciprocal Rank Fusion (RRF). This architecture provides the best of both worlds—semantic understanding and lexical precision—within a single, operationally simple database, eliminating the complexity of managing separate vector and keyword search indices (like Elasticsearch).

Architectural Blueprint: Dense + Sparse Retrieval in a Single Schema

Our goal is to execute two distinct types of searches in parallel and intelligently merge their results.

Dense Retrieval: This is the semantic search component, handled by pgvector. It finds documents that are conceptually similar to the query, even if they don't share keywords.

Sparse Retrieval: This is the keyword search component. We'll use PostgreSQL's built-in Full-Text Search (FTS), which provides ranking capabilities similar to the classic Okapi BM25 algorithm. It excels at finding documents containing exact query terms.

To support this, we need a carefully designed table schema and corresponding indices.

Database Schema for Hybrid Search

Let's define a documents table that stores our content, its embedding, and a pre-calculated tsvector for full-text search.

sql

-- Ensure the vector extension is enabled
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    content TEXT NOT NULL,
    embedding VECTOR(1536) NOT NULL, -- Example: OpenAI ada-002 embedding size
    content_tsv TSVECTOR,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

The key element here is the content_tsv column of type TSVECTOR. This column will store a processed version of the content field, optimized for text search—it's stripped of stop words, and words are reduced to their root form (stemming). Storing this pre-calculated vector is far more performant than computing it at query time.

Automating `tsvector` Generation with a Trigger

Manually keeping content_tsv in sync with content is error-prone. A database trigger is the canonical solution, ensuring atomicity and data integrity. We'll create a function that updates the content_tsv column whenever a document's content is inserted or updated.

sql

-- Create a function to update the tsvector column
CREATE OR REPLACE FUNCTION update_content_tsv()
RETURNS TRIGGER AS $$
BEGIN
    -- 'english' is a predefined text search configuration.
    -- It handles stemming and stop words for the English language.
    NEW.content_tsv := to_tsvector('english', NEW.content);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Create a trigger that calls the function before insert or update
CREATE TRIGGER documents_tsv_update
BEFORE INSERT OR UPDATE ON documents
FOR EACH ROW
EXECUTE FUNCTION update_content_tsv();

Now, any INSERT or UPDATE operation on the documents table will automatically and performantly populate the content_tsv column. This is a robust, set-and-forget pattern for production systems.

High-Performance Indexing Strategy

Without proper indexing, our hybrid search query will result in sequential scans, which is unacceptable for any non-trivial dataset. We need two distinct types of indices.

HNSW Index for Vector Search: For Approximate Nearest Neighbor (ANN) search on our embeddings, an HNSW (Hierarchical Navigable Small World) index is the state-of-the-art. It provides an excellent balance of speed and accuracy.

sql

    -- HNSW index for fast ANN search on embeddings
    -- The cosine operator (<=>) is used for distance calculation
    CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

Production Considerations for HNSW:

* m: The maximum number of connections per layer. Higher values increase accuracy (recall) but also increase index size and build time. 16 is a solid default.

* ef_construction: The size of the dynamic candidate list during index construction. A higher value leads to a better quality index and higher accuracy, at the cost of longer build times. 64 is a reasonable starting point.

GIN Index for Full-Text Search: For our tsvector column, a GIN (Generalized Inverted Index) is the ideal choice. It's highly optimized for queries that check for the existence of terms within the tsvector.

sql

    -- GIN index for fast full-text search
    CREATE INDEX ON documents USING gin (content_tsv);

Why GIN over GiST? While both can index tsvector, GIN is generally faster for lookups because it stores a mapping from each lexeme (word) to the documents containing it. GiST uses a more complex tree structure that can be slower for the common text search queries we'll be running.

With this schema and indexing strategy, our database is primed for high-performance hybrid queries.

The Fusion Problem: Merging Incomparable Scores

We can now get two ranked lists of documents:

From pgvector, ranked by cosine distance (a value from 0 to 2).

From FTS, ranked by ts_rank_cd (an arbitrary positive float, typically between 0 and 1, but can be higher).

These scores are on completely different, un-normalized scales. A simple weighted average (0.5 vector_score + 0.5 fts_score) is a naive approach that rarely works well. The relative magnitudes of the scores are unpredictable, and finding the right weights is a dark art that changes with the dataset and query patterns.

The Solution: Reciprocal Rank Fusion (RRF)

Reciprocal Rank Fusion (RRF) is an elegant, powerful, and—best of all—parameter-free algorithm for combining ranked lists. It disregards the raw scores entirely and uses only the rank of each item in the result lists.

The formula is simple: for each document, its RRF score is the sum of the reciprocal of its rank in each list.

RRF_Score(doc) = Σ (1 / (k + rank_i(doc)))

* rank_i(doc) is the rank of the document in result list i.

* k is a constant used to mitigate the impact of high ranks (i.e., documents ranked 1st or 2nd) having disproportionately large scores. A k value of 60 is a widely used and effective default.

Why RRF is superior:

* No Normalization Needed: It completely sidesteps the problem of incomparable scores.

* Robustness: It rewards documents that appear in multiple result lists, regardless of their raw scores.

* Simplicity: It requires no tuning or domain-specific weighting, making it a general-purpose and resilient choice.

The Master Query: Implementing Hybrid Search with RRF in SQL

We will now construct a single, powerful SQL query that performs both searches, ranks the results, and computes the RRF score. We'll use Common Table Expressions (CTEs) to keep the logic clean and readable.

Let's assume our input is a user query string and its corresponding embedding vector.

sql

-- Placeholders:
-- :query_embedding -> The 1536-dimensional vector for the user's query
-- :query_text      -> The user's query string, formatted for to_tsquery, e.g., 'cloud & billing & anomalies'
-- :k_rrf           -> The RRF constant, e.g., 60
-- :top_k_vector    -> How many results to fetch from vector search, e.g., 50
-- :top_k_keyword   -> How many results to fetch from keyword search, e.g., 50
-- :final_limit     -> The final number of results to return, e.g., 10

WITH vector_search AS (
    -- 1. Perform dense vector search
    SELECT 
        id,
        -- Rank is calculated based on the cosine distance ordering
        ROW_NUMBER() OVER (ORDER BY embedding <=> :query_embedding) AS rank
    FROM documents
    ORDER BY embedding <=> :query_embedding
    LIMIT :top_k_vector
), keyword_search AS (
    -- 2. Perform sparse keyword search
    SELECT 
        id,
        -- Rank is calculated based on the FTS score
        ROW_NUMBER() OVER (ORDER BY ts_rank_cd(content_tsv, to_tsquery('english', :query_text)) DESC) AS rank
    FROM documents
    WHERE content_tsv @@ to_tsquery('english', :query_text)
    ORDER BY ts_rank_cd(content_tsv, to_tsquery('english', :query_text)) DESC
    LIMIT :top_k_keyword
), combined_results AS (
    -- 3. Combine results from both searches
    SELECT id, rank FROM vector_search
    UNION ALL
    SELECT id, rank FROM keyword_search
) 
-- 4. Calculate RRF score and aggregate
SELECT 
    cr.id,
    SUM(1.0 / (:k_rrf + cr.rank)) AS rrf_score,
    d.content -- Also fetch the content for the final result
FROM combined_results cr
JOIN documents d ON cr.id = d.id
GROUP BY cr.id, d.content
ORDER BY rrf_score DESC
LIMIT :final_limit;

Deconstructing the Query

vector_search CTE: This performs the ANN search using the <=> (cosine distance) operator. We use ROW_NUMBER() to assign a rank to each result based on its distance. Critically, we LIMIT the number of results to a reasonable number (:top_k_vector, e.g., 50) to keep the query fast.

keyword_search CTE: This performs the FTS search. The @@ operator matches the tsvector against a tsquery. ts_rank_cd calculates the relevance score, which we use for ordering. Again, we use ROW_NUMBER() to assign a rank and LIMIT the results.

combined_results CTE: We UNION ALL the two result sets. A document that appears in both searches will now have two rows in this CTE, one with its vector rank and one with its keyword rank.

Final Aggregation: We group by the document id, and for each document, we calculate its final rrf_score by summing the 1 / (k + rank) values. The results are then ordered by this new, unified score to produce our final, relevance-fused ranking.

This single query is the heart of our hybrid search engine. It's declarative, efficient, and leverages the full power of PostgreSQL's query planner.

Production Implementation: Python Application Logic

Let's wrap this SQL logic in a Python function using the asyncpg library for high-performance, asynchronous database access.

python

import asyncio
import asyncpg
import numpy as np
from typing import List, Dict, Any

# Assume you have an embedding function
# from some_embedding_service import get_embedding

# Placeholder for a real embedding function
def get_embedding(text: str) -> List[float]:
    # In a real application, this would call an API like OpenAI, Cohere, or a local model.
    # Using a random 1536-dim vector for demonstration.
    return np.random.rand(1536).tolist()

def format_tsquery(query: str) -> str:
    """Formats a user query string into a tsquery-compatible string.
    Example: 'cloud billing anomalies' -> 'cloud & billing & anomalies'
    """
    return " & ".join(query.strip().split())

class HybridSearcher:
    def __init__(self, dsn: str):
        self.dsn = dsn
        self.pool = None

    async def connect(self):
        self.pool = await asyncpg.create_pool(self.dsn)

    async def close(self):
        if self.pool:
            await self.pool.close()

    async def search(self, 
                     query_text: str, 
                     top_k_vector: int = 50, 
                     top_k_keyword: int = 50, 
                     final_limit: int = 10, 
                     k_rrf: int = 60) -> List[Dict[str, Any]]:
        if not self.pool:
            raise ConnectionError("Database connection pool is not initialized. Call connect() first.")

        query_embedding = get_embedding(query_text)
        ts_query_text = format_tsquery(query_text)

        sql_query = """
        WITH vector_search AS (
            SELECT 
                id,
                ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rank
            FROM documents
            ORDER BY embedding <=> $1
            LIMIT $2
        ), keyword_search AS (
            SELECT 
                id,
                ROW_NUMBER() OVER (ORDER BY ts_rank_cd(content_tsv, to_tsquery('english', $3)) DESC) AS rank
            FROM documents
            WHERE content_tsv @@ to_tsquery('english', $3)
            ORDER BY ts_rank_cd(content_tsv, to_tsquery('english', $3)) DESC
            LIMIT $4
        ), combined_results AS (
            SELECT id, rank FROM vector_search
            UNION ALL
            SELECT id, rank FROM keyword_search
        ) 
        SELECT 
            cr.id,
            SUM(1.0 / ($5 + cr.rank)) AS rrf_score,
            d.content
        FROM combined_results cr
        JOIN documents d ON cr.id = d.id
        GROUP BY cr.id, d.content
        ORDER BY rrf_score DESC
        LIMIT $6;
        """

        async with self.pool.acquire() as connection:
            results = await connection.fetch(
                sql_query,
                query_embedding,
                top_k_vector,
                ts_query_text,
                top_k_keyword,
                k_rrf,
                final_limit
            )
        
        return [dict(row) for row in results]

# Example Usage
async def main():
    DATABASE_URL = "postgresql://user:password@host:port/dbname"
    searcher = HybridSearcher(dsn=DATABASE_URL)
    await searcher.connect()

    try:
        # Scenario 1: Semantic search
        results_semantic = await searcher.search("unexpected charges on my cloud bill")
        print("--- Semantic Search Results ---")
        for res in results_semantic:
            print(f"ID: {res['id']}, Score: {res['rrf_score']:.4f}, Content: {res['content'][:100]}...")

        print("\n")

        # Scenario 2: Keyword-heavy search
        results_keyword = await searcher.search("GKE-1128-B autoscaler error")
        print("--- Keyword Search Results ---")
        for res in results_keyword:
            print(f"ID: {res['id']}, Score: {res['rrf_score']:.4f}, Content: {res['content'][:100]}...")

    finally:
        await searcher.close()

if __name__ == "__main__":
    asyncio.run(main())

This Python code provides a clean, reusable class for interacting with our backend. It correctly parameterizes the query to prevent SQL injection and handles the logic of getting embeddings and formatting the text for tsquery.

Performance Analysis and Edge Case Handling

A complex query like this requires scrutiny to ensure it performs under load.

Analyzing the Query Plan

Running EXPLAIN ANALYZE on the query is non-negotiable. Here’s what you should look for in the query plan:

text

-> Limit (cost=...)
  -> Sort (cost=... sort_key: (sum(...)))
    -> HashAggregate (cost=...)
      -> Append (cost=...)
        -> Subquery Scan on vector_search (cost=...)
          -> Limit (cost=...)
            -> Index Scan using documents_embedding_idx on documents (cost=...)
        -> Subquery Scan on keyword_search (cost=...)
          -> Limit (cost=...)
            -> Sort (cost=... sort_key: ...)
              -> Bitmap Heap Scan on documents (cost=...)
                -> Bitmap Index Scan using documents_content_tsv_idx (cost=...)

Key things to verify:

Index Scan on documents_embedding_idx: The plan for vector_search must show an HNSW index scan. If it shows a sequential scan, your index is not being used, and performance will be abysmal.

Bitmap Index Scan on documents_content_tsv_idx: The plan for keyword_search should leverage the GIN index. A Bitmap scan is a common and highly efficient pattern for GIN indices.

Low Row Counts in Scans: The LIMIT clauses should be pushed down effectively, meaning the database should only retrieve 50 rows (or whatever your limit is) from each index scan, not the entire table.

The `:top_k` Tuning Dilemma

The values for :top_k_vector and :top_k_keyword are critical tuning parameters.

* Too small: You risk missing relevant documents that might have been ranked higher after RRF fusion. For example, a document ranked #51 in vector search and #1 in keyword search would be missed entirely if top_k_vector is 50.

* Too large: You increase the amount of data being processed in the final aggregation step, which can slow down the query.

A good starting point is between 50 and 100. The optimal value depends on your data distribution and can be found by running benchmarks with a representative query set.

Edge Case: Advanced Text Search Configuration

The default 'english' text search configuration is a good start, but for specialized domains (e.g., medical, legal, or technical documentation), it can be insufficient. It might incorrectly stem technical terms or remove important stop words (like 'a' in 'Vitamin A').

You can create custom configurations to handle these cases.

sql

-- Example: Create a new configuration that uses a simple dictionary (no stemming or stop words)
CREATE TEXT SEARCH CONFIGURATION public.technical (
    PARSER = default
);
ALTER TEXT SEARCH CONFIGURATION public.technical
    ADD MAPPING FOR asciiword WITH simple;

-- Then, use it in your trigger and queries:
-- to_tsvector('public.technical', NEW.content)
-- to_tsquery('public.technical', :query_text)

This gives you fine-grained control over how your text is tokenized, which is crucial for maximizing the effectiveness of the keyword search component.

Scaling Beyond a Single Node

This single-instance PostgreSQL pattern is remarkably scalable and can comfortably handle millions to tens of millions of documents on appropriate hardware (especially with high RAM to cache indices). When you outgrow a single vertical node:

* Read Replicas: The most straightforward scaling path for read-heavy workloads. Direct search queries to a pool of read replicas.

* Sharding (e.g., Citus): For truly massive datasets, you can use extensions like Citus to shard your documents table across multiple nodes. Both pgvector and FTS are compatible with Citus, but this introduces significant operational complexity.

* Dedicated Systems: At a certain scale, it may become prudent to migrate the search component to a dedicated system like Vespa or a combination of Elasticsearch and a dedicated vector database. However, the pattern described here will carry you an incredibly long way, offering an unmatched combination of power and operational simplicity for a vast range of production RAG applications.

Conclusion

By moving beyond naive vector-only retrieval, we have architected a sophisticated, production-ready hybrid search system. This approach directly addresses the lexical precision failures of pure semantic search by integrating the strengths of keyword-based FTS. By using Reciprocal Rank Fusion, we fuse the results in a principled, parameter-free manner, achieving a final ranking superior to either method alone.

Implementing this entire pipeline within PostgreSQL—combining pgvector, FTS, and advanced SQL—delivers a powerful, consolidated, and maintainable retrieval engine. It's a testament to the extensibility of modern relational databases and a pattern that enables senior engineers to build truly robust and trustworthy RAG systems.