Postgres Hybrid Search: Combining pgvector and FTS with RRF

19 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Semantic Gap: Why Your Database Search Falls Short

For senior engineers building data-intensive applications, the limitations of traditional database search are a familiar pain point. A standard LIKE query or even a well-configured Full-Text Search (FTS) struggles with semantic nuance. A user searching for "ways to make my app faster" will miss a critical document titled "optimizing application performance," because the keywords don't overlap. This is the semantic gap.

Vector search, powered by dense embeddings from models like Sentence-BERT, brilliantly solves this problem by searching in a conceptual space. However, it introduces its own set of challenges. Pure vector search can be too abstract, failing to retrieve results where a specific, rare keyword (like a product SKU, a function name, or a person's name) is the most important signal. It can miss the trees for the forest.

This is where hybrid search enters the picture. It's not a compromise; it's a synthesis. By combining the keyword-matching precision of traditional text search with the semantic understanding of vector search, we can build a system that delivers relevance far exceeding what either method can achieve alone. The common approach is to bolt on a dedicated search engine like Elasticsearch or a specialized vector database like Pinecone. This, however, introduces significant operational complexity: data synchronization pipelines, another system to monitor and scale, and increased infrastructure costs.

This article presents a more elegant, operationally streamlined solution: implementing a sophisticated hybrid search engine entirely within PostgreSQL. We will leverage two powerful extensions, pgvector and pg_trgm, to execute parallel semantic and keyword searches, and then fuse the results using a robust algorithm called Reciprocal Rank Fusion (RRF). This is not a theoretical overview; it's a production-focused guide with complete, runnable code, performance tuning details, and a frank discussion of the edge cases you'll encounter in a real-world deployment.

Our Technical Stack

  • PostgreSQL (15+): The core of our system.
  • pgvector extension: For efficient, indexed similarity search on high-dimensional vectors. We'll specifically focus on the HNSW index type.
  • pg_trgm extension: For efficient trigram-based text search, offering a powerful balance of speed and fuzzy matching capabilities.
  • Python 3.10+: As the application layer to orchestrate the process, generate embeddings, and perform result fusion.
  • sentence-transformers: A Python library to generate high-quality text embeddings.
  • psycopg (v3): The modern, async-capable PostgreSQL driver for Python.

    Section 1: Database Schema and Advanced Indexing Strategy

    Our foundation is a well-designed table and, more importantly, a carefully considered indexing strategy. A naive approach will lead to poor performance and scalability issues. Let's design for production from the start.

    1.1. Table Schema and Extensions

    First, we need to enable the necessary extensions in our database. Then, we define the table that will store our documents and their corresponding vector embeddings.

    sql
    -- Ensure the extensions are available in your PostgreSQL instance.
    -- You may need to run `apt-get install postgresql-15-pgvector` or similar on your server.
    CREATE EXTENSION IF NOT EXISTS vector;
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    
    -- The main table for our searchable documents
    CREATE TABLE documents (
        id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        content TEXT NOT NULL,
        embedding VECTOR(384) NOT NULL, -- Dimension matches 'all-MiniLM-L6-v2'
        created_at TIMESTAMPTZ DEFAULT NOW()
    );

    A key decision here is the vector dimension. We've chosen 384, which corresponds to the popular and efficient all-MiniLM-L6-v2 sentence transformer model. Using a fixed-size vector is crucial for pgvector's performance.
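
    Because a dimension mismatch between the model and the VECTOR(384) column only surfaces as an error at insert time, a startup assertion is cheap insurance; a minimal sketch using sentence-transformers:

    python
    from sentence_transformers import SentenceTransformer

    EXPECTED_DIM = 384  # must match VECTOR(384) on documents.embedding

    model = SentenceTransformer('all-MiniLM-L6-v2')
    assert model.get_sentence_embedding_dimension() == EXPECTED_DIM, \
        "Embedding model dimension does not match the database column"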

    1.2. The Indexing Deep Dive: HNSW vs. IVFFlat and GIN vs. GiST

    This is the most critical part of the setup. The right indexes are the difference between a system that flies and one that crawls.

    Vector Indexing with HNSW

    pgvector offers two main index types for approximate nearest neighbor (ANN) search: IVFFlat and HNSW. While IVFFlat was the original standard, HNSW (Hierarchical Navigable Small World) is now the superior choice for the vast majority of use cases.

    * Why HNSW? HNSW builds a graph-like structure that allows for highly efficient searching. It generally provides a better trade-off between recall (accuracy) and query speed than IVFFlat. Crucially, HNSW index performance degrades more gracefully as you add metadata filters, a common production requirement we'll discuss later.

    Let's create the HNSW index. The parameters m and ef_construction are critical for tuning.

    sql
    -- Create an HNSW index on the embedding column
    CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

    * vector_cosine_ops: We specify the operator class. For normalized embeddings (as produced by most sentence transformers), cosine distance and Euclidean distance yield the same nearest-neighbor ranking, and cosine is often the more intuitive measure. Using this operator class ensures the index is built to optimize cosine distance calculations (<=>).

    * m: The maximum number of connections per node in the graph's layers. A higher m (e.g., 24, 32) creates a denser, more accurate graph at the cost of higher index build time, more memory usage, and slightly slower queries. m = 16 is a solid default.

    * ef_construction: The size of the dynamic candidate list during index construction. A higher value leads to a better-quality index (and better recall) but makes index creation significantly slower. ef_construction = 64 is a reasonable starting point.

    Text Indexing with Trigrams and GIN

    For our keyword search component, we could use PostgreSQL's standard FTS. However, pg_trgm often provides a more flexible solution for fuzzy matching, as it doesn't rely on stemming and lexemes. It breaks text down into three-character chunks (trigrams), which is robust against minor misspellings and variations.
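
    To see exactly what gets indexed, pg_trgm's show_trgm() function exposes the trigram set for any string. A quick sketch via psycopg (the connection string assumes the Docker setup shown later in this article):

    python
    import psycopg

    with psycopg.connect("postgresql://user:password@localhost:5432/hybrid_search_db") as conn:
        # show_trgm() reveals the trigrams pg_trgm extracts, including the
        # space-padded boundary trigrams that anchor word starts and ends.
        row = conn.execute("SELECT show_trgm('speed')").fetchone()
        print(row[0])  # ['  s', ' sp', 'ed ', 'eed', 'pee', 'spe']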

    We need an index to accelerate trigram searches. The choice is between GIN and GiST.

    * Why GIN? A GIN (Generalized Inverted Index) is generally the better choice for trigram indexing on static or infrequently updated text. It is faster to query than GiST because it stores a list of locations for each trigram. GiST (Generalized Search Tree) can be faster to build and update, but for a search-heavy workload, GIN's query performance is superior.

    sql
    -- Create a GIN index on the content column using trigram operators
    CREATE INDEX ON documents
    USING gin (content gin_trgm_ops);

    With our schema and, more importantly, our specialized indexes in place, we're ready to tackle the query layer.


    Section 2: The Dual-Query Execution Pattern

    The strategy is to run two independent queries against our documents table: one for semantic similarity and one for keyword similarity. We will execute these queries in parallel from our application layer to minimize latency.

    Let's assume we have a user query: "optimizing database query speed".

    2.1. Step 1: Generate Query Embedding

    Our application first needs to convert this text query into a vector using the same model we used for the documents.

    python
    # app/embedding.py
    from sentence_transformers import SentenceTransformer
    
    # It's crucial to load the model once and reuse it.
    # In a real app, this would be a singleton or part of a dependency injection system.
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    def get_embedding(text: str) -> list[float]:
        """Generates a 384-dimension embedding for a given text."""
        embedding = model.encode(text, convert_to_tensor=False)
        return embedding.tolist()
    
    # Example usage
    query_text = "optimizing database query speed"
    query_embedding = get_embedding(query_text)
    # query_embedding is now a list of 384 floats

    2.2. Query A: Vector Similarity Search

    This query finds documents that are semantically closest to the user's query. We use the cosine distance operator (<=>), which pgvector provides. The distance ranges from 0 (identical) to 2 (opposite).

    sql
    -- Vector Search SQL
    -- $1: query_embedding (as a string like '[0.1, 0.2, ...]')
    -- $2: limit (e.g., 50)
    SELECT
        id,
        content,
        1 - (embedding <=> $1) AS similarity_score
    FROM
        documents
    ORDER BY
        embedding <=> $1
    LIMIT $2;

    Performance Tuning: The speed and accuracy of this query can be tuned at runtime using the hnsw.ef_search parameter. This controls the size of the candidate list during the search phase. A higher value increases accuracy (recall) at the cost of speed.

    sql
    -- Example of tuning the search
    BEGIN;
    SET LOCAL hnsw.ef_search = 100; -- Higher value for better recall
    SELECT id, 1 - (embedding <=> $1) AS score FROM documents ORDER BY embedding <=> $1 LIMIT 50;
    COMMIT;

    The SET LOCAL must run inside a transaction (outside of one it has no effect); the setting automatically reverts at COMMIT, so it applies only to the current query.

    2.3. Query B: Trigram Keyword Search

    This query finds documents that have high lexical overlap with the user's query. We use the word_similarity function from pg_trgm, which measures how well the query matches the best-matching span of words inside the document; for short queries against long documents this is generally more effective than the whole-string similarity operator (%). Note the argument order: the query string comes first.

    sql
    -- Trigram Search SQL
    -- $1: query_text
    -- $2: limit (e.g., 50)
    SELECT
        id,
        content,
        word_similarity($1, content) AS similarity_score
    FROM
        documents
    WHERE
        -- Pre-filter on the word-similarity threshold. The '<%' operator
        -- can use the GIN trigram index effectively, pruning candidates
        -- before word_similarity() is computed for ordering.
        $1 <% content
    ORDER BY
        similarity_score DESC, -- The primary sort key
        ts_rank(to_tsvector('english', content), plainto_tsquery('english', $1)) DESC -- Optional secondary boost
    LIMIT $2;

    Performance Tuning: The pg_trgm.word_similarity_threshold setting (default 0.6) controls the sensitivity of the <% filter; lower it to admit fuzzier matches. The WHERE $1 <% content clause is critical; it acts as an efficient index filter before the more expensive word_similarity function is calculated for ordering.
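
    The threshold can be set per session or, wrapped in a transaction with SET LOCAL, per query. A minimal sketch of loosening it before a keyword search (the literal '%' in the operator is doubled so psycopg does not mistake it for a placeholder):

    python
    import psycopg

    with psycopg.connect("postgresql://user:password@localhost:5432/hybrid_search_db") as conn:
        # Lower the word-similarity threshold from its 0.6 default so the
        # '<%' filter admits fuzzier matches for this session.
        conn.execute("SET pg_trgm.word_similarity_threshold = 0.4")
        rows = conn.execute(
            "SELECT id, word_similarity(%s, content) AS score "
            "FROM documents WHERE %s <%% content ORDER BY score DESC LIMIT 10",
            ("optimizing database query speed", "optimizing database query speed"),
        ).fetchall()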

    We now have two ranked lists of document IDs, each representing a different aspect of relevance. The next step is to intelligently merge them.


    Section 3: Fusion Logic with Reciprocal Rank Fusion (RRF)

    How do we combine our two result sets? A naive approach might be to normalize the scores from both queries (e.g., to a 0-1 range) and add them up. This is deeply flawed. The similarity scores from vector search and trigram search are on completely different scales and have different distributions. Normalizing them is non-trivial and often arbitrary.

    We need a better, rank-based approach. Reciprocal Rank Fusion (RRF) is a simple yet powerful technique that sidesteps the score normalization problem entirely. It combines lists by looking only at the rank of an item in each list, not its score.

    The RRF score for a document is calculated as:

    RRF_Score(doc) = Σ (1 / (k + rank(doc))), summed over every result list in which the document appears.

    * rank(doc) is the position of the document in a given result list (starting from 1).

    * k is a constant that dampens the influence of high ranks. A common value for k is 60.

    Let's say a document doc_A is ranked #2 in the vector results and #10 in the keyword results. Its RRF score would be (1 / (60 + 2)) + (1 / (60 + 10)) ≈ 0.0161 + 0.0143 ≈ 0.0304. We calculate this for every unique document across all lists and then sort the documents by their final RRF score.

    Python Implementation of RRF

    Here is a clean, efficient implementation of the RRF algorithm.

    python
    # app/fusion.py
    from collections import defaultdict
    
    def reciprocal_rank_fusion(
        results: list[list[dict]], 
        k: int = 60
    ) -> list[dict]:
        """
        Performs Reciprocal Rank Fusion on a list of search result lists.
    
        Args:
            results: A list where each element is a ranked list of documents.
                     Each document is a dict with at least an 'id' key.
            k: The constant used in the RRF formula.
    
        Returns:
            A single, re-ranked list of documents.
        """
        fused_scores = defaultdict(float)
    
        # Each 'result_set' is a list of dicts, e.g., from vector or keyword search
        for result_set in results:
            for rank, doc in enumerate(result_set, 1):
                doc_id = doc['id']
                fused_scores[doc_id] += 1 / (k + rank)
    
        # Create a unified list of all unique documents involved
        all_docs = {doc['id']: doc for result_set in results for doc in result_set}
    
        # Sort the documents based on their fused RRF score
        reranked_results = sorted(
            all_docs.values(), 
            key=lambda doc: fused_scores[doc['id']], 
            reverse=True
        )
    
        # Add the RRF score to each document for inspection
        for doc in reranked_results:
            doc['rrf_score'] = fused_scores[doc['id']]
    
        return reranked_results
    

    This function takes our two lists of results, calculates the RRF score for each unique document, and returns a single, definitive ranking.
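
    A quick sanity check on toy data shows the behavior; the IDs and contents below are invented purely for illustration:

    python
    from fusion import reciprocal_rank_fusion

    # 'doc-b' appears in both lists, so fusion should float it to the top.
    vector_hits = [
        {'id': 'doc-a', 'content': 'optimizing application performance'},
        {'id': 'doc-b', 'content': 'database query tuning guide'},
    ]
    keyword_hits = [
        {'id': 'doc-b', 'content': 'database query tuning guide'},
        {'id': 'doc-c', 'content': 'query planner internals'},
    ]

    fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
    for doc in fused:
        print(doc['id'], round(doc['rrf_score'], 4))
    # doc-b 0.0325  (1/61 + 1/62: ranked in both lists)
    # doc-a 0.0164  (1/61: rank 1 in the vector list only)
    # doc-c 0.0161  (1/62: rank 2 in the keyword list only)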


    Section 4: A Production-Grade Implementation

    Let's tie everything together into a cohesive application. We'll use FastAPI for the web layer and asyncio to run our database queries concurrently.

    4.1. Project Setup and Docker

    A docker-compose.yml ensures a reproducible environment.

    yaml
    # docker-compose.yml
    version: '3.8'
    services:
      db:
        image: pgvector/pgvector:pg15
        ports:
          - "5432:5432"
        environment:
          - POSTGRES_USER=user
          - POSTGRES_PASSWORD=password
          - POSTGRES_DB=hybrid_search_db
        volumes:
          - postgres_data:/var/lib/postgresql/data
    
      app:
        build: .
        ports:
          - "8000:8000"
        depends_on:
          - db
        environment:
          - DATABASE_URL=postgresql://user:password@db:5432/hybrid_search_db
    
    volumes:
      postgres_data:

    And a Dockerfile for our Python application:

    dockerfile
    # Dockerfile
    FROM python:3.11-slim
    
    WORKDIR /app
    
    RUN pip install --no-cache-dir fastapi uvicorn "psycopg[binary]" sentence-transformers
    
    COPY ./app /app
    
    CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

    4.2. The FastAPI Application

    This application exposes a /search endpoint that orchestrates the entire hybrid search process.

    python
    # app/main.py
    import os
    import asyncio
    from typing import List
    from uuid import UUID
    
    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel, Field
    import psycopg
    from psycopg.rows import dict_row
    
    # Absolute imports: main.py runs as a top-level module under uvicorn
    from embedding import get_embedding
    from fusion import reciprocal_rank_fusion
    
    # --- Pydantic Models ---
    class Document(BaseModel):
        id: UUID
        content: str
        similarity_score: float = Field(..., alias='score')
    
    class RerankedDocument(Document):
        rrf_score: float
    
    # --- Database Connection ---
    DATABASE_URL = os.getenv("DATABASE_URL")
    
    async def get_db_conn():
        return await psycopg.AsyncConnection.connect(DATABASE_URL, row_factory=dict_row)
    
    app = FastAPI()
    
    # --- Search Logic ---
    async def vector_search(conn, query_embedding: List[float], limit: int) -> List[dict]:
        async with conn.cursor() as cur:
            # SET LOCAL scopes ef_search to the transaction psycopg opens
            # implicitly on first execute; it resets at commit.
            await cur.execute("SET LOCAL hnsw.ef_search = 100")
            await cur.execute(
                """SELECT id, content, 1 - (embedding <=> %s) AS score 
                   FROM documents ORDER BY embedding <=> %s LIMIT %s""",
                # str() renders the list as '[...]', the text format pgvector accepts
                (str(query_embedding), str(query_embedding), limit)
            )
            return await cur.fetchall()
    
    async def keyword_search(conn, query_text: str, limit: int) -> List[dict]:
        async with conn.cursor() as cur:
            # '<%' is the index-supported word-similarity filter; its literal
            # '%' is doubled so psycopg doesn't treat it as a placeholder.
            await cur.execute(
                """SELECT id, content, word_similarity(%s, content) AS score
                   FROM documents WHERE %s <%% content
                   ORDER BY score DESC LIMIT %s""",
                (query_text, query_text, limit)
            )
            return await cur.fetchall()
    
    @app.post("/search", response_model=List[RerankedDocument])
    async def search(query: str, k_vector: int = 50, k_keyword: int = 50):
        if not query:
            raise HTTPException(status_code=400, detail="Query cannot be empty")
    
        query_embedding = get_embedding(query)
    
        # A psycopg connection processes one query at a time, so each
        # concurrent search gets its own connection (a pool would amortize
        # the connection overhead; see the notes after this listing).
        vector_conn, keyword_conn = await asyncio.gather(get_db_conn(), get_db_conn())
        try:
            # Execute both searches concurrently
            vector_results, keyword_results = await asyncio.gather(
                vector_search(vector_conn, query_embedding, k_vector),
                keyword_search(keyword_conn, query, k_keyword)
            )
        finally:
            await asyncio.gather(vector_conn.close(), keyword_conn.close())
    
        # Fuse the results using RRF
        reranked_results = reciprocal_rank_fusion([vector_results, keyword_results])
    
        return reranked_results
    
    # Add a simple endpoint to add documents for testing
    @app.post("/documents")
    async def add_document(content: str):
        embedding = get_embedding(content)
        conn = await get_db_conn()
        async with conn:
            async with conn.cursor() as cur:
                await cur.execute(
                    "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
                    (content, str(embedding))
                )
        return {"status": "success"}

    This implementation demonstrates several production best practices:

  • Concurrency: asyncio.gather runs the vector and keyword queries simultaneously, minimizing I/O wait time. Because a psycopg connection processes one query at a time, each query gets its own connection.
  • Connection Management: Connections are opened per request here for clarity; in production, prefer a connection pool such as psycopg_pool (a minimal sketch follows this list).
  • Clear Separation of Concerns: Embedding, fusion, and web logic are in separate modules.
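
    A minimal pool-based sketch, assuming the separate psycopg_pool package (pip install psycopg_pool); the pool sizes and module layout are illustrative choices:

    python
    # app/db.py (hypothetical module)
    import os

    from psycopg.rows import dict_row
    from psycopg_pool import AsyncConnectionPool

    pool = AsyncConnectionPool(
        os.getenv("DATABASE_URL"),
        min_size=2,
        max_size=10,
        kwargs={"row_factory": dict_row},  # applied to every pooled connection
        open=False,  # open explicitly during application startup
    )

    # Open the pool with the FastAPI lifespan (`await pool.open()` on startup,
    # `await pool.close()` on shutdown); each search task then checks out its
    # own connection with: async with pool.connection() as conn: ...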

    Section 5: Advanced Considerations and Production Edge Cases

    Implementing the core logic is only half the battle. Here are the complex issues you will face in a real-world system.

    5.1. The Challenge of Metadata Filtering

    Your search will almost certainly need to be filtered. For example, in a multi-tenant system, you must filter by tenant_id. Or perhaps by a date range or a status flag.

    SELECT ... FROM documents WHERE tenant_id = 'abc' ORDER BY embedding <=> $1

    This is where ANN indexes like HNSW show their complexity. An HNSW index is not like a B-tree. The filter (WHERE tenant_id = 'abc') is applied after the approximate search traverses the graph. This can have two negative consequences:

  • Reduced Recall: The HNSW search might find the top 100 nearest neighbors in the entire dataset, but if only 5 of them match your tenant_id filter, you'll only get 5 results back, even if other, perfectly valid documents for that tenant were slightly further away.
  • Performance Degradation: The database has to fetch many more candidates from the index to satisfy the LIMIT clause after filtering, which can slow down the query significantly.

    Solutions and Mitigations:

    * Partitioning: The most effective solution is to partition your documents table by the filter column (e.g., tenant_id). This way, the HNSW search is performed only on the relevant partition, making it both fast and accurate. This is the gold standard for multi-tenancy.

    * Over-fetching: In your query, fetch a much larger number of results from the vector search (LIMIT 200 instead of LIMIT 50) before applying the filter and final limit in a subquery or CTE (see the sketch after this list). This improves recall at the cost of performance.

    * Embrace Post-filtering: For low-cardinality filters, accept that the index provides candidates and the WHERE clause provides exactness. Monitor performance closely.
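
    A minimal sketch of the over-fetching pattern; note that the tenant_id column and the 4x over-fetch factor are illustrative assumptions, not part of the schema defined earlier:

    python
    # app/filtered_search.py (hypothetical module)
    OVERFETCH_FACTOR = 4  # tune against observed per-tenant selectivity

    async def filtered_vector_search(conn, query_embedding, tenant_id, limit):
        async with conn.cursor() as cur:
            await cur.execute(
                """
                WITH candidates AS (
                    -- Pull far more neighbors than needed, because the
                    -- tenant filter below will discard many of them.
                    SELECT id, content, tenant_id, embedding <=> %s AS distance
                    FROM documents
                    ORDER BY embedding <=> %s
                    LIMIT %s
                )
                SELECT id, content, 1 - distance AS score
                FROM candidates
                WHERE tenant_id = %s
                ORDER BY distance
                LIMIT %s
                """,
                (str(query_embedding), str(query_embedding),
                 limit * OVERFETCH_FACTOR, tenant_id, limit),
            )
            return await cur.fetchall()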

    5.2. Scalable Pagination of Fused Results

    How do you implement pagination on the final, fused result set? A simple LIMIT/OFFSET on the database query won't work because the fusion happens in the application layer.

    The naive approach is to:

    • Fetch a large number of results from both vector search (e.g., 200) and keyword search (e.g., 200).
    • Fuse them in the application.
    • Hold the full, fused list of up to ~400 results in memory.
    • Return page 1 ([0:20]), page 2 ([20:40]), etc., from this in-memory list.

    This works for the first few pages but breaks down completely for deep pagination. Requesting page 50 still requires fetching and fusing the entire 400-document set. Worse, if the truly relevant result sits at position 201 in the vector ranking, it is missed entirely.

    Advanced Solutions:

    * Stateful Caching: For a given search query, cache the fused list of document IDs on the server (e.g., in Redis) with a short TTL. Subsequent page requests for the same query can read from this cached list. This is the most common and practical solution (a sketch follows this list).

    * Cursor-based Pagination: Instead of page numbers, use a cursor. The cursor would contain the RRF score of the last item on the previous page. The next query would need to fetch all items with a score greater than the cursor, which is difficult to translate back to the underlying database queries.

    * Re-evaluating the Need: For most search use cases, users rarely go past the first few pages. The naive over-fetch-and-paginate-in-memory approach is often sufficient and pragmatic.
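
    A minimal sketch of the stateful-caching approach, assuming a Redis instance and the redis-py asyncio client; the key prefix, page size, and TTL are illustrative choices:

    python
    # app/pagination.py (hypothetical module)
    import hashlib
    import json

    import redis.asyncio as redis

    r = redis.Redis()
    PAGE_SIZE = 20
    CACHE_TTL_SECONDS = 300

    async def get_page(query: str, page: int, run_hybrid_search) -> list[dict]:
        """Serve one page of fused results, caching the fused list per query."""
        cache_key = "hybrid:" + hashlib.sha256(query.encode()).hexdigest()
        cached = await r.get(cache_key)
        if cached is None:
            # First request for this query: run the hybrid search once and
            # cache the fused ranking for subsequent pages.
            results = await run_hybrid_search(query)
            await r.set(cache_key, json.dumps(results, default=str), ex=CACHE_TTL_SECONDS)
        else:
            results = json.loads(cached)
        start = (page - 1) * PAGE_SIZE
        return results[start:start + PAGE_SIZE]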

    5.3. Relative Weighting of Semantic vs. Keyword Search

    RRF is great because it's parameter-free, but sometimes you want to favor one search type over another. For example, if a query is in quotes ("exact phrase"), you might want to heavily boost the keyword search results.

    Solution: Weighted RRF

    You can introduce a simple weighting factor (alpha, from 0 to 1) to control the balance.

    python
    # app/fusion.py (weighted variant)
    from collections import defaultdict

    # alpha = 1.0 means vector-only ranking; alpha = 0.0 means keyword-only.
    def weighted_rrf(
        vector_results: list[dict],
        keyword_results: list[dict],
        alpha: float = 0.5,
        k: int = 60,
    ) -> dict:
        """Returns fused RRF scores, with alpha shifting weight between lists."""
        fused_scores = defaultdict(float)
        for rank, doc in enumerate(vector_results, 1):
            fused_scores[doc['id']] += alpha * (1 / (k + rank))
        for rank, doc in enumerate(keyword_results, 1):
            fused_scores[doc['id']] += (1 - alpha) * (1 / (k + rank))
        return fused_scores

    By adjusting alpha based on query heuristics (e.g., presence of quotes, query length), you can dynamically control the search behavior to better match user intent.
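
    A possible heuristic for choosing alpha per query; the thresholds below are illustrative assumptions, not tuned values:

    python
    def choose_alpha(query: str) -> float:
        """Pick a vector-vs-keyword weight from simple query heuristics."""
        if '"' in query:
            return 0.2  # quoted phrase: favor exact keyword matching
        if len(query.split()) >= 6:
            return 0.8  # long natural-language query: favor semantics
        return 0.5  # balanced default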

    Conclusion: A Unified and Powerful Approach

    By leveraging the extensibility of PostgreSQL with pgvector and pg_trgm, we have constructed a hybrid search system that rivals the relevance of dedicated, external search engines. This approach offers a compelling advantage: operational simplicity. Your search index lives with your primary data, participating in the same transactions, backup schedules, and replication streams. There is no complex data synchronization pipeline to build and maintain.

    We've seen that success lies in the details: choosing HNSW over IVFFlat for superior performance, using a GIN index for fast trigram queries, and applying a robust, rank-based fusion algorithm like RRF to merge results intelligently. Most importantly, we've confronted the hard production problems—metadata filtering, pagination, and dynamic weighting—that separate a proof-of-concept from a scalable, production-ready system.

    While dedicated systems may still offer advantages at massive scale or for highly specialized use cases, this Postgres-native approach provides a powerful, cost-effective, and operationally sane solution for a vast number of applications. It empowers senior engineers to deliver sophisticated search experiences without expanding their infrastructure footprint, a truly powerful combination.
