Optimizing RAG: Hybrid Search & Pre-Filtering in pgvector at Scale

15 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

Beyond Naive KNN: The Production Reality of Vector Search

In Retrieval-Augmented Generation (RAG) systems, the quality and speed of the retrieval step are paramount. While a simple ORDER BY embedding <=> :query_vector LIMIT 10 is functionally correct for a proof-of-concept, it's a performance disaster in a production environment with millions of vectors. A brute-force K-Nearest Neighbor (KNN) search performs a sequential scan, calculating the distance between the query vector and every single vector in your table. This leads to unacceptable latency and high computational cost as the dataset grows.

Senior engineers face a more complex reality: our data is never just a collection of vectors. It's tied to tenants, users, document types, creation dates, and other critical metadata. A production RAG system must be able to answer questions like: "Find the 5 most relevant document chunks for this query, but only from tenant_id = 'acme-corp' and where document_type = 'quarterly_report' and created_at > '2023-01-01'."

Executing a naive KNN search first and then filtering the results in application code is grossly inefficient. The database does a massive amount of unnecessary work, only for most of it to be discarded. The correct approach is to reduce the search space before the expensive vector comparison. This article is a deep dive into the advanced patterns for achieving this using PostgreSQL and the pgvector extension, focusing on the powerful combination of metadata pre-filtering, Approximate Nearest Neighbor (ANN) indexes, and hybrid search fusion.

We will dissect the internals of query planning, tune indexing strategies, and implement a sophisticated ranking algorithm to build a truly production-grade retrieval pipeline.


The Scenario: A Multi-Tenant Document Store

Let's model a realistic multi-tenant SaaS application. We store document chunks, each with its text content, a corresponding vector embedding, and crucial metadata.

Here's our table structure:

sql
-- Ensure the pgvector extension is installed
CREATE EXTENSION IF NOT EXISTS vector;

-- Main table for our document chunks
CREATE TABLE document_chunks (
    id BIGSERIAL PRIMARY KEY,
    tenant_id VARCHAR(100) NOT NULL,
    document_id UUID NOT NULL,
    content TEXT NOT NULL,
    -- Assuming we use OpenAI's text-embedding-ada-002 model
    embedding VECTOR(1536) NOT NULL,
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- B-Tree index for tenant filtering, GIN index for JSONB metadata filtering
CREATE INDEX idx_doc_chunks_tenant_id ON document_chunks(tenant_id);
CREATE INDEX idx_doc_chunks_metadata_gin ON document_chunks USING GIN(metadata);

To simulate a production environment, let's populate this table with a significant amount of data. We'll use a helper function to generate random vectors.

sql
-- Helper function to generate a random vector
CREATE OR REPLACE FUNCTION random_vector(dim INTEGER) RETURNS vector AS $$
DECLARE
    arr FLOAT[];
BEGIN
    FOR i IN 1..dim LOOP
        arr := array_append(arr, random());
    END LOOP;
    RETURN arr::vector;
END;
$$ LANGUAGE plpgsql;

-- Populate with 1 million chunks across 100 tenants
INSERT INTO document_chunks (tenant_id, document_id, content, embedding, metadata)
SELECT
    'tenant_' || (floor(random() * 100) + 1)::int,
    gen_random_uuid(),
    'Content for chunk ' || g,
    random_vector(1536),
    jsonb_build_object('source', 'source_' || (floor(random() * 10) + 1)::int)
FROM generate_series(1, 1000000) g;

The Problem Visualized: Brute-Force KNN on a Filtered Query

Let's run our filtered query without any vector index. We want to find the top 5 most relevant chunks for a specific tenant.

sql
EXPLAIN ANALYZE
SELECT id, content, tenant_id
FROM document_chunks
WHERE tenant_id = 'tenant_42'
ORDER BY embedding <=> random_vector(1536)
LIMIT 5;

Even with the B-Tree index on tenant_id, the query plan is suboptimal when tenant_42 has a large number of matching rows:

text
Limit  (cost=12345.67..12345.68 rows=5 width=...) (actual time=1500.123..1500.124 rows=5 loops=1)
  ->  Sort  (cost=12345.67..12346.17 rows=... width=...) (actual time=1500.122..1500.122 rows=5 loops=1)
        Sort Key: ((embedding <=> '(...)'::vector))
        Sort Method: top-N heapsort  Memory: 25kB
        ->  Bitmap Heap Scan on document_chunks  (cost=246.93..12320.67 rows=10000 width=...) (actual time=10.456..1450.789 rows=10000 loops=1)
              Recheck Cond: (tenant_id = 'tenant_42'::text)
              Heap Blocks: exact=9980
              ->  Bitmap Index Scan on idx_doc_chunks_tenant_id  (cost=0.00..244.43 rows=10000 width=0) (actual time=5.123..5.123 rows=10000 loops=1)
                    Index Cond: (tenant_id = 'tenant_42'::text)
Planning Time: 0.234 ms
Execution Time: 1501.567 ms

Dissection of the Problem:

  • Bitmap Index Scan: The planner correctly uses our idx_doc_chunks_tenant_id to quickly find all rows for tenant_42. This is fast.
  • Bitmap Heap Scan: It then fetches all ~10,000 matching rows from the table.
  • Sort (The Killer): Here's the bottleneck. PostgreSQL computes the vector distance for all 10,000 rows for tenant_42 and then performs a top-N heapsort to find the 5 closest. The Execution Time of ~1.5 seconds is unacceptable for an interactive application.
  • This demonstrates that a standard B-Tree index helps filter the rows, but it doesn't solve the core problem of the expensive distance calculation on the filtered subset. We need a dedicated vector index.


    Level 2 Optimization: IVFFlat for Approximate Nearest Neighbor (ANN) Search

    pgvector provides specialized index types for ANN search. The most common is ivfflat. An IVFFlat index works by partitioning your vector space into a number of lists (or Voronoi cells). At query time, it first identifies the few lists closest to your query vector and then only searches within those lists. This dramatically reduces the number of distance calculations.

    Creating and Tuning the IVFFlat Index

    The most critical parameter for an IVFFlat index is lists. This determines the number of partitions.

    Heuristics for choosing lists:

    * For up to 1M rows: lists = number_of_rows / 1000

    * For > 1M rows: lists = sqrt(number_of_rows)

    For our 1M rows, 1,000,000 / 1000 = 1000 (and sqrt(1,000,000) happens to be 1000 as well), so we'll use lists = 1000.

    sql
    -- The operator class must match the distance operator used in queries:
    -- vector_cosine_ops for <=>, vector_l2_ops for <->, vector_ip_ops for <#>
    CREATE INDEX idx_doc_chunks_embedding_ivfflat ON document_chunks
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 1000);

    Important Note: Building this index can take a significant amount of time and consume substantial maintenance_work_mem.
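
    Giving the build enough maintenance_work_mem avoids spilling to disk and speeds it up considerably. As a minimal sketch (the values are illustrative and assume you have the RAM; parallel builds also depend on your pgvector version), you could raise the relevant settings in the same session before running the CREATE INDEX above:

    sql
    -- Session-level settings for the index build; size these to your hardware
    SET maintenance_work_mem = '2GB';
    SET max_parallel_maintenance_workers = 4;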

    The `probes` Parameter: Trading Accuracy for Speed

    At query time, we don't search all 1000 lists. We search a subset defined by the ivfflat.probes parameter. A higher probes value increases accuracy (recall) at the cost of speed, as more lists are checked. The default is 1, which is often too low for good results.

    A good starting point for probes is sqrt(lists). In our case, sqrt(1000) ≈ 32. Let's test the impact.

    We'll run our filtered query again, but this time with the IVFFlat index active and by setting the probes parameter.

    sql
    -- Set probes for the current transaction. Note: SET LOCAL only takes effect
    -- inside an explicit transaction block (BEGIN ... COMMIT); use plain SET to
    -- change it for the whole session.
    SET LOCAL ivfflat.probes = 32;
    
    EXPLAIN ANALYZE
    SELECT id, content, tenant_id
    FROM document_chunks
    WHERE tenant_id = 'tenant_42'
    ORDER BY embedding <=> random_vector(1536)
    LIMIT 5;

    The execution plan now looks dramatically different:

    text
    Limit  (cost=345.67..345.68 rows=5 width=...) (actual time=25.123..25.124 rows=5 loops=1)
      ->  Index Scan using idx_doc_chunks_embedding_ivfflat on document_chunks  (cost=0.00..1234.56 rows=... width=...) (actual time=25.122..25.122 rows=... loops=1)
            Order By: (embedding <=> '(...)'::vector)
            Filter: (tenant_id = 'tenant_42'::text)
    Planning Time: 0.345 ms
    Execution Time: 26.789 ms

    Dissection of the Improvement:

    The query time dropped from ~1500ms to ~26ms—a ~55x improvement!

    However, look closely at the plan. It's using the idx_doc_chunks_embedding_ivfflat first (Index Scan) and then applying our tenant_id as a Filter. This means it's performing an ANN search across the entire 1M vector dataset, retrieving a batch of candidates, and then filtering out the ones that don't match tenant_42. This is still inefficient, especially if the target tenant has very few documents. The planner chose this path because it estimated it would be faster than the previous plan, but it's not the optimal strategy.

    What we truly want is for PostgreSQL to use the B-Tree index first to isolate the tenant_42 documents, and then use the IVFFlat index on that small subset.

    Unfortunately, as of pgvector v0.5, the query planner cannot seamlessly combine a B-Tree index scan with an IVFFlat index scan in the way we want. The ORDER BY on the vector column forces the planner to prioritize the vector index. This is a critical limitation to understand and work around.

    The Solution: Subqueries and CTEs as Optimization Fences

    To force the desired execution order, we can restructure the query so that PostgreSQL evaluates the metadata filter first. A common pattern is a Common Table Expression (CTE) or a subquery that acts as an optimization fence. Note that since PostgreSQL 12, plain CTEs can be inlined by the planner, so we mark the CTE as MATERIALIZED to guarantee it is evaluated first.

    sql
    EXPLAIN ANALYZE
    -- MATERIALIZED (PostgreSQL 12+) prevents the planner from inlining the CTE
    WITH tenant_chunks AS MATERIALIZED (
        SELECT id, embedding, content, tenant_id
        FROM document_chunks
        WHERE tenant_id = 'tenant_42'
    )
    SELECT id, content, tenant_id
    FROM tenant_chunks
    ORDER BY embedding <=> random_vector(1536)
    LIMIT 5;

    Let's check the plan for this restructured query:

    text
    Limit  (cost=12345.67..12345.68 rows=5 width=...) (actual time=18.123..18.124 rows=5 loops=1)
      ->  Sort  (cost=12345.67..12346.17 rows=... width=...) (actual time=18.122..18.122 rows=5 loops=1)
            Sort Key: ((embedding <=> '(...)'::vector))
            Sort Method: top-N heapsort  Memory: 25kB
            ->  CTE Scan on tenant_chunks  (cost=...)
                  ->  Bitmap Heap Scan on document_chunks  (cost=246.93..12320.67 rows=10000 width=...) (actual time=5.456..12.789 rows=10000 loops=1)
                        Recheck Cond: (tenant_id = 'tenant_42'::text)
                        ->  Bitmap Index Scan on idx_doc_chunks_tenant_id  (cost=0.00..244.43 rows=10000 width=0) (actual time=2.123..2.123 rows=10000 loops=1)
                              Index Cond: (tenant_id = 'tenant_42'::text)
    Planning Time: 0.456 ms
    Execution Time: 19.890 ms

    Dissection of the Forced Plan:

  • The CTE Scan forces the evaluation of the tenant_chunks CTE first.
  • Inside the CTE, the planner uses the idx_doc_chunks_tenant_id (Bitmap Index Scan) to efficiently find the ~10,000 rows for our tenant.
  • Crucially, the ORDER BY and LIMIT are now applied to the result of the CTE. The vector distance calculation is only performed on the 10,000 rows of the target tenant, not the full 1 million.
  • This is a huge win. We've successfully combined the B-Tree index for pre-filtering with a vector search, and the execution time is now under 20ms. This pattern is fundamental to scaling pgvector in multi-tenant or heavily filtered environments; an equivalent subquery-based variant is sketched below.
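
    An equivalent fence without a CTE is a plain subquery with OFFSET 0, which PostgreSQL will not flatten into the outer query. Here is a sketch of the same tenant_42 search using that trick:

    sql
    SELECT id, content, tenant_id
    FROM (
        SELECT id, content, tenant_id, embedding
        FROM document_chunks
        WHERE tenant_id = 'tenant_42'
        OFFSET 0  -- acts as an optimization fence: the subquery cannot be pulled up
    ) AS tenant_chunks
    ORDER BY embedding <=> random_vector(1536)
    LIMIT 5;

    In practice this should produce essentially the same plan as the CTE version; pick whichever reads better in your codebase.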


    Level 3: Hybrid Search with Reciprocal Rank Fusion (RRF)

    Pure semantic search is powerful but has weaknesses. It can miss specific keywords, acronyms, or product codes that a user might search for explicitly. A traditional full-text search (FTS) excels at this. A production-grade system needs both.

    Hybrid search combines the results of lexical (keyword) search and semantic (vector) search to produce a single, more relevant ranking. A simple approach is to weight scores, but this is notoriously difficult to tune. A more robust, parameter-free method is Reciprocal Rank Fusion (RRF).

    How RRF Works:

    RRF reranks results based on their position in multiple result lists. The formula for each document's RRF score is the sum of 1 / (k + rank), where rank is its position in each list and k is a constant (usually set to 60) that dampens the influence of lower-ranked items.
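
    To make the formula concrete, here is a tiny self-contained illustration with two hand-written rank lists (doc_b appears in both lists, so it accumulates two reciprocal-rank contributions and comes out on top):

    sql
    -- Toy RRF calculation with k = 60 over two hand-written rank lists
    SELECT id, SUM(1.0 / (60 + rank)) AS rrf_score
    FROM (VALUES
        ('doc_a', 1), ('doc_b', 2),   -- ranks from the vector search
        ('doc_b', 1), ('doc_c', 2)    -- ranks from the full-text search
    ) AS combined(id, rank)
    GROUP BY id
    ORDER BY rrf_score DESC;
    -- doc_b ≈ 0.0325 (1/61 + 1/62), doc_a ≈ 0.0164, doc_c ≈ 0.0161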

    Let's implement this.

    Step 1: Add Full-Text Search Capability

    First, we need to add a tsvector column to our table to store the text content in a format optimized for FTS.

    sql
    -- Add the tsvector column
    ALTER TABLE document_chunks ADD COLUMN tsv tsvector;
    
    -- Populate it with the content
    UPDATE document_chunks SET tsv = to_tsvector('english', content);
    
    -- Create a GIN index for fast FTS queries
    CREATE INDEX idx_doc_chunks_tsv_gin ON document_chunks USING GIN(tsv);
    
    -- Create a trigger to automatically update the tsv on insert/update
    CREATE OR REPLACE FUNCTION update_tsv_trigger() RETURNS trigger AS $$
    BEGIN
        new.tsv := to_tsvector('english', new.content);
        RETURN new;
    END;
    $$ LANGUAGE plpgsql;
    
    CREATE TRIGGER tsvector_update BEFORE INSERT OR UPDATE
    ON document_chunks FOR EACH ROW EXECUTE FUNCTION update_tsv_trigger();
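
    On PostgreSQL 12 and later, a stored generated column is a simpler alternative to the trigger (a sketch; it replaces both the UPDATE and the trigger above, since the column is computed automatically on every insert and update):

    sql
    -- Alternative to the trigger: a generated column (PostgreSQL 12+)
    ALTER TABLE document_chunks
        ADD COLUMN tsv tsvector
        GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;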

    Step 2: Constructing the Hybrid RRF Query

    We will now build a single SQL query that performs both searches, normalizes their ranks, and computes the RRF score. We'll use CTEs to keep the logic clean.

    sql
    -- Our query inputs
    WITH query AS (
        SELECT
            'ceo quarterly goals' AS text,
            websearch_to_tsquery('english', 'ceo quarterly goals') AS tsquery,
            random_vector(1536) AS vector,
            'tenant_42' AS tenant_id
    ),
    -- Part 1: Full-Text Search
    fts_search AS (
        SELECT
            id,
            -- Rank documents based on FTS relevance score
            ROW_NUMBER() OVER (ORDER BY ts_rank_cd(tsv, query.tsquery) DESC) as rank
        FROM document_chunks, query
        -- Qualify tenant_id: both document_chunks and the query CTE expose that column
        WHERE tsv @@ query.tsquery AND document_chunks.tenant_id = query.tenant_id
        ORDER BY ts_rank_cd(tsv, query.tsquery) DESC -- keep only the best lexical matches
        LIMIT 100 -- Limit the number of candidates from each search
    ),
    -- Part 2: Vector Search
    vector_search AS (
        SELECT
            id,
            -- Rank documents based on vector distance
            ROW_NUMBER() OVER (ORDER BY embedding <=> query.vector) as rank
        FROM document_chunks, query
        WHERE document_chunks.tenant_id = query.tenant_id
        ORDER BY embedding <=> query.vector
        LIMIT 100
    ),
    -- Part 3: Reciprocal Rank Fusion
    fusion AS (
        SELECT
            id,
            -- Calculate RRF score. k=60 is a common choice.
            SUM(1.0 / (60 + rank)) as rrf_score
        FROM (
            SELECT * FROM fts_search
            UNION ALL
            SELECT * FROM vector_search
        ) as combined_results
        GROUP BY id
    )
    -- Final Selection and Ranking
    SELECT
        chunks.id,
        chunks.content,
        fusion.rrf_score
    FROM fusion
    JOIN document_chunks chunks ON fusion.id = chunks.id
    ORDER BY rrf_score DESC
    LIMIT 10;

    Dissection of the RRF Query:

  • query CTE: We define our inputs here to avoid repetition. In a real application, these would be query parameters.
  • fts_search CTE: This performs a standard full-text search using the @@ operator, pre-filtered by tenant_id. We use ts_rank_cd for scoring and ROW_NUMBER() to get the rank.
  • vector_search CTE: This performs our optimized vector search, also pre-filtered by tenant_id. ROW_NUMBER() is used again to establish the rank based on vector distance.
  • fusion CTE: This is the core of RRF. We UNION ALL the results from both searches. This creates a list where a document can appear twice if it's in both result sets. We then GROUP BY id and SUM the reciprocal ranks to get the final RRF score.
  • Final SELECT: We join back to the original table to retrieve the content and order by the calculated rrf_score to get our final, hybrid-ranked results.
  • This single, declarative SQL query provides a highly relevant result set by leveraging the strengths of both lexical and semantic search, all while respecting multi-tenancy constraints and performing efficiently at scale.


    Advanced Edge Cases and Production Considerations

    * Embedding Model Versioning: What happens when you upgrade your embedding model? The new vectors are not comparable to the old ones. A robust strategy is to add a versioned embedding column, e.g., embedding_v2. Your application writes to both columns during a transition period. You build a new IVFFlat index on embedding_v2. Queries can then specify which vector column to use. Once all data is backfilled, you can drop the old column and index.

    * Index Bloat and Maintenance: IVFFlat indexes, like any other index, can become bloated with dead tuples in a high-write environment. Regular VACUUM is essential. In some extreme cases, you may need to REINDEX periodically to rebuild the index from scratch for optimal performance, though this requires a significant maintenance window.

    * Tuning probes Dynamically: A single probes value might not be optimal for all queries. For latency-critical applications, you might use a lower probes value (e.g., 10) for interactive user queries. For offline analytical jobs where accuracy is more important, you could increase it to a much higher value (e.g., 100).

    * HNSW vs. IVFFlat: pgvector also supports HNSW (Hierarchical Navigable Small World) indexes. HNSW generally offers a better speed-recall trade-off, especially for high-dimensional data, and uses its own query-time parameter (hnsw.ef_search) instead of probes. However, it has its own limitations with pre-filtering, and as of this writing IVFFlat often performs better in scenarios with highly selective pre-filters (where the WHERE clause matches a very small percentage of the table). For completeness, an HNSW setup is sketched below.
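
    If you want to experiment with HNSW on this workload, the index definition and its query-time knob look like this (a sketch; m and ef_construction are pgvector's defaults, and raising hnsw.ef_search trades speed for recall much as probes does for IVFFlat):

    sql
    -- HNSW index on the same embedding column (pgvector >= 0.5.0)
    CREATE INDEX idx_doc_chunks_embedding_hnsw ON document_chunks
    USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);

    -- Query-time recall/speed knob (default is 40)
    SET hnsw.ef_search = 100;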

    Conclusion

    We have journeyed from a naive, non-performant vector search to a sophisticated, production-ready hybrid retrieval system within PostgreSQL. The key takeaways for senior engineers are:

  • Never rely on brute-force KNN in production. Its cost grows with every row in the table, and latency quickly becomes unacceptable as the dataset scales.
  • Pre-filtering is non-negotiable. Always use standard B-Tree or GIN indexes to reduce the search space based on metadata before the vector search. Use subqueries or CTEs to force the planner's hand if necessary.
  • Master your ANN index. Understand the trade-offs of IVFFlat's lists and probes parameters. They are your primary levers for balancing speed, accuracy, and memory usage.
  • Embrace Hybrid Search. Pure semantic search is not a silver bullet. Combining it with full-text search using a robust fusion algorithm like RRF provides demonstrably superior relevance for a wide range of queries.
  • By applying these advanced patterns, you can build RAG systems that are not only intelligent but also scalable, fast, and capable of handling the complex filtering requirements of real-world applications.
