PostgreSQL pgvector: HNSW Index Tuning for Filtered Vector Search

Goh Ling Yong

The Production Bottleneck: When Vector Search Meets Metadata Filters

You've successfully integrated pgvector into your stack. Your initial proof-of-concept for pure semantic search was blazing fast, returning sub-50ms results from millions of vectors. The team celebrated. Now, the real work begins: integrating it into your multi-tenant application. The first production query looks simple enough:

sql
SELECT id, content
FROM documents
WHERE tenant_id = 'tenant-b4f2-4a51-a3e6'
ORDER BY embedding <-> :query_vector
LIMIT 10;

Suddenly, the response times jump to multiple seconds. Your HNSW index, which performed so well in isolation, seems to falter under the pressure of a simple WHERE clause. This isn't a bug; it's a fundamental challenge in approximate nearest neighbor (ANN) search. You're facing the post-filtering problem: the index efficiently finds candidate vectors, but most of them are subsequently discarded by the WHERE clause, forcing the database to scan a vast number of nodes to satisfy your LIMIT 10 for the correct tenant_id.

This article is a deep dive into solving this exact problem. We will skip the basics of pgvector and HNSW and focus entirely on advanced, production-grade patterns for optimizing filtered vector search. We'll analyze three core strategies:

  • Aggressive ef_search Tuning: The simplest approach, but with critical trade-offs between latency and recall.
  • Physical Isolation via Partitioning: A powerful database-native solution for common filtering keys like tenant_id.
  • Two-Stage Re-ranking Queries: A flexible application-level pattern for complex or high-cardinality filters.

Let's start by establishing a baseline to see the problem in action.

    Baseline Schema and Performance

    We'll use a documents table representing a multi-tenant system. The embeddings could be from a model like OpenAI's text-embedding-ada-002 (1536 dimensions).

    sql
    -- Ensure the pgvector extension is installed
    CREATE EXTENSION IF NOT EXISTS vector;
    
    -- Document table for a multi-tenant application
    CREATE TABLE documents (
        id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
        tenant_id UUID NOT NULL,
        content TEXT,
        embedding VECTOR(1536),
        created_at TIMESTAMPTZ DEFAULT NOW()
    );
    
    -- Let's populate it with some data
    -- In a real scenario, this would be millions of rows.
    -- For this example, we'll simulate 1,000,000 rows with 1,000 tenants.
    INSERT INTO documents (tenant_id, embedding)
    SELECT
        ('00000000-0000-0000-0000-' || LPAD(floor(random() * 1000)::int::text, 12, '0'))::uuid,
        ARRAY(SELECT random() FROM generate_series(1, 1536))::vector
    FROM generate_series(1, 1000000);
    
    -- Create a standard HNSW index
    CREATE INDEX ON documents USING hnsw (embedding vector_l2_ops);
    
    -- Create a B-tree index on our filter column
    CREATE INDEX ON documents (tenant_id);
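
    Each tenant should end up with roughly 1,000 documents. It's worth a quick sanity check of that distribution, since filter selectivity drives everything that follows:

    sql
    -- Confirm the synthetic data is spread roughly evenly across tenants
    SELECT tenant_id, COUNT(*) AS docs
    FROM documents
    GROUP BY tenant_id
    ORDER BY docs DESC
    LIMIT 5;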

    Now, let's run our problematic query with EXPLAIN ANALYZE.

    sql
    -- :query_vector stands in for the 1536-dimension query embedding, which a
    -- real application passes as a bind parameter. For ad-hoc testing in psql,
    -- you can generate one with:
    --   SELECT ARRAY(SELECT random() FROM generate_series(1, 1536))::vector;
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT id
    FROM documents
    WHERE tenant_id = '00000000-0000-0000-0000-000000000001'
    ORDER BY embedding <-> :query_vector
    LIMIT 10;

    The output will vary, but you'll see something like this:

    text
    Limit  (cost=... rows=10 width=8) (actual time=1532.416..1532.419 rows=10 loops=1)
      Buffers: shared hit=123456
      ->  Index Scan using documents_embedding_idx on documents  (cost=... rows=4321 width=8) (actual time=1532.413..1532.416 rows=10 loops=1)
            Order By: (embedding <-> '...'::vector)
            Filter: (tenant_id = '00000000-0000-0000-0000-000000000001'::uuid)
            Rows Removed by Filter: 9876
            Buffers: shared hit=123456
    Planning Time: 0.150 ms
    Execution Time: 1532.500 ms

    The key lines are Rows Removed by Filter and Execution Time. The HNSW index scan visited thousands of nodes (9876 in this example) that were irrelevant to our target tenant, just to find 10 that matched. The database did the work, but it was incredibly inefficient. The B-tree index on tenant_id was ignored because the planner chose to satisfy the ORDER BY clause first.
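
    If you want to see the road not taken, you can steer the planner away from index scans for a single transaction and re-run the EXPLAIN; you'll typically see the tenant_id index used via a bitmap scan instead, feeding an exact distance sort over just that tenant's rows:

    sql
    -- Temporarily rule out the HNSW ordered index scan to compare plans
    BEGIN;
    SET LOCAL enable_indexscan = off;
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT id
    FROM documents
    WHERE tenant_id = '00000000-0000-0000-0000-000000000001'
    ORDER BY embedding <-> :query_vector
    LIMIT 10;
    ROLLBACK;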

    Pattern 1: Tuning `hnsw.ef_search` for Higher Recall on Filtered Sets

    The ef_search parameter controls the size of the dynamic list of candidates during the HNSW graph traversal. A higher value means a more exhaustive search, which increases accuracy (recall) at the cost of latency. In a filtered search context, it has a secondary effect: a larger candidate pool increases the probability of finding k items that also satisfy the WHERE clause.

    This is our first tuning knob. The default ef_search is often too low for filtered queries.
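
    You can inspect and raise it per session, or per transaction with SET LOCAL:

    sql
    SHOW hnsw.ef_search;       -- 40 by default
    SET hnsw.ef_search = 100;  -- applies to the current session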

    Benchmarking Methodology

    To find the optimal ef_search value, you must benchmark. We'll test a range of values, measuring both latency and recall. Recall is crucial; a fast query is useless if it returns suboptimal results. To measure recall, we need the "ground truth"—the actual nearest neighbors. We can get this by running an exact search, which is slow but 100% accurate.

    Step 1: Get Ground Truth

    sql
    -- Force an exact search so the results are the true nearest neighbors:
    -- disable index scans for this transaction only.
    BEGIN;
    SET LOCAL enable_indexscan = off;
    
    -- This will be slow; run it once to capture the baseline IDs
    CREATE TEMP TABLE ground_truth AS
    SELECT id
    FROM documents
    WHERE tenant_id = '00000000-0000-0000-0000-000000000001'
    ORDER BY embedding <-> :query_vector
    LIMIT 10;
    
    COMMIT;

    Step 2: Run Benchmark with Varying ef_search

    Now, we'll write a script (e.g., in Python with psycopg2 or a psql script) to execute the query with different ef_search settings and calculate recall.

    sql
    -- Example of one run within a larger script
    SET hnsw.ef_search = 100;  -- session-level; SET LOCAL also works inside an explicit transaction
    
    CREATE TEMP TABLE results_ef100 AS
    SELECT id
    FROM documents
    WHERE tenant_id = '00000000-0000-0000-0000-000000000001'
    ORDER BY embedding <-> :query_vector
    LIMIT 10;
    
    -- Calculate recall
    SELECT
        (SELECT COUNT(*) FROM results_ef100 INNER JOIN ground_truth USING (id)) / 10.0 AS recall_at_10;
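
    If you'd rather drive the whole sweep from inside PostgreSQL instead of an external script, a minimal PL/pgSQL sketch is below. It assumes the ground_truth temp table from Step 1 exists, and that the test vector was stashed in a hypothetical one-row temp table query_vec(v vector(1536)) so the ground truth and every benchmark run use the same vector.

    sql
    DO $$
    DECLARE
        ef INT;
        hits INT;
        started TIMESTAMPTZ;
        elapsed_ms NUMERIC;
        q vector(1536);
    BEGIN
        -- query_vec is a hypothetical one-row temp table holding the test vector
        SELECT v INTO q FROM query_vec LIMIT 1;

        FOREACH ef IN ARRAY ARRAY[40, 100, 200, 300, 400] LOOP
            -- Scoped to the surrounding transaction, like SET LOCAL
            PERFORM set_config('hnsw.ef_search', ef::text, true);

            started := clock_timestamp();
            SELECT COUNT(*) INTO hits
            FROM (
                SELECT id
                FROM documents
                WHERE tenant_id = '00000000-0000-0000-0000-000000000001'
                ORDER BY embedding <-> q
                LIMIT 10
            ) approx
            JOIN ground_truth USING (id);
            elapsed_ms := EXTRACT(EPOCH FROM clock_timestamp() - started) * 1000;

            RAISE NOTICE 'ef_search=%: latency=% ms, recall@10=%',
                ef, round(elapsed_ms, 1), hits / 10.0;
        END LOOP;
    END $$;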

    Results and Analysis

    After running this for ef_search values from 40 (default) to 400, you'd get a table like this:

    hnsw.ef_search | Avg Latency (ms) | Recall@10
    ---------------+------------------+----------
    40             | 1532             | 0.6
    100            | 1850             | 0.8
    200            | 2400             | 0.9
    300            | 3100             | 0.9
    400            | 3800             | 1.0

    Observations:

    * The default ef_search=40 yielded poor recall (60%). The search was fast but inaccurate because it gave up too early.

    * Latency increases linearly with ef_search, which is expected.

    * Recall improves significantly up to ef_search=200 and then plateaus. Pushing to ef_search=400 gives perfect recall but at a much higher latency cost.

    Conclusion: For this data distribution, an ef_search value between 200 and 300 offers the best trade-off. This is a per-query setting, so you can tune it dynamically.

    sql
    -- In your application code:
    BEGIN;
    SET LOCAL hnsw.ef_search = 200;
    SELECT ... -- your query
    COMMIT;

    When to use this pattern: Use ef_search tuning when your filters are moderately selective and you can tolerate a modest increase in latency for a significant gain in recall. It's the least invasive change but may not be sufficient for highly selective filters.

    Pattern 2: Physical Isolation with Table Partitioning

    For filters on low-cardinality columns that are central to your data model—like tenant_id—partitioning is the most powerful solution. Instead of one massive index, you create a separate HNSW index for each partition. When you query with a WHERE tenant_id = '...' clause, PostgreSQL's query planner is smart enough to scan only the index for that specific partition.

    This transforms the problem from searching 1M vectors and filtering to searching just 1,000 vectors (assuming even distribution).

    Implementation

    Let's re-architect our documents table to be partitioned by tenant_id.

    sql
    -- 1. Create a new partitioned table
    CREATE TABLE documents_partitioned (
        id BIGINT GENERATED ALWAYS AS IDENTITY,
        tenant_id UUID NOT NULL,
        content TEXT,
        embedding VECTOR(1536),
        created_at TIMESTAMPTZ DEFAULT NOW()
    ) PARTITION BY LIST (tenant_id);
    
    -- 2. Create partitions for each tenant. This can be automated.
    -- Note: In a real system with many tenants, you'd manage this with a script
    -- or use a hash partition if the tenant list is dynamic and large.
    CREATE TABLE documents_t1 PARTITION OF documents_partitioned
        FOR VALUES IN ('00000000-0000-0000-0000-000000000001');
    CREATE TABLE documents_t2 PARTITION OF documents_partitioned
        FOR VALUES IN ('00000000-0000-0000-0000-000000000002');
    -- ... and so on for all 1000 tenants.
    
    -- 3. Create indexes on the PARTITIONED table. PostgreSQL propagates it to all partitions.
    -- We need a primary key on the partitioned table, which must include the partition key.
    ALTER TABLE documents_partitioned ADD PRIMARY KEY (id, tenant_id);
    
    -- Create the HNSW index. This will create an index on each partition.
    CREATE INDEX ON documents_partitioned USING hnsw (embedding vector_l2_ops);
    
    -- (Migrate data from the old table to the new one here)
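
    Creating one partition per tenant by hand doesn't scale, so the partition DDL from step 2 is usually wrapped in a helper that runs during tenant onboarding. A minimal sketch (the function name and partition naming scheme are illustrative):

    sql
    -- Hypothetical onboarding helper: creates the tenant's partition on demand.
    -- Indexes defined on the parent (HNSW, primary key) are created on the new
    -- partition automatically.
    CREATE OR REPLACE FUNCTION create_tenant_partition(p_tenant UUID)
    RETURNS void
    LANGUAGE plpgsql AS $$
    BEGIN
        EXECUTE format(
            'CREATE TABLE IF NOT EXISTS documents_t_%s PARTITION OF documents_partitioned
                 FOR VALUES IN (%L)',
            replace(p_tenant::text, '-', '_'),
            p_tenant
        );
    END $$;

    -- Call during tenant onboarding:
    SELECT create_tenant_partition('00000000-0000-0000-0000-000000000003');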

    Performance Comparison

    Now, let's run the same filtered query against our new partitioned table.

    sql
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT id
    FROM documents_partitioned
    WHERE tenant_id = '00000000-0000-0000-0000-000000000001'
    ORDER BY embedding <-> :query_vector
    LIMIT 10;

    The EXPLAIN plan is dramatically different:

    text
    Limit  (cost=... rows=10 width=12) (actual time=25.123..25.125 rows=10 loops=1)
      Buffers: shared hit=1234
      ->  Index Scan using documents_t1_embedding_idx on documents_t1  (cost=... rows=1000 width=12) (actual time=25.121..25.123 rows=10 loops=1)
            Order By: (embedding <-> '...'::vector)
            Buffers: shared hit=1234
    Planning Time: 0.850 ms
    Execution Time: 25.200 ms

    Analysis:

    * Execution Time: Dropped from ~1500ms to ~25ms. This is a 60x performance improvement.

    * Index Scan: The planner correctly identified the target partition (documents_t1) and used its specific HNSW index (documents_t1_embedding_idx).

    * No Filtering Step: There is no Filter: line in the scan, and Rows Removed by Filter is zero. The filtering was handled by partition elimination before the scan even began.

    Caveats and Considerations:

    * Partition Key: This pattern only works when filtering by the partition key. A query filtering on a different column will not benefit.

    * Operational Overhead: You must manage the partitions. Creating a new partition for each new tenant adds a step to your tenant onboarding process.

    * Cardinality: Best for low-to-medium cardinality keys. Partitioning by a per-document UUID would be an anti-pattern.
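
    If per-tenant list partitions are too much operational overhead, one middle ground is HASH partitioning on tenant_id (the option mentioned in step 2's comment): a fixed number of partitions, no DDL at onboarding time, and equality filters still prune to a single partition. Each partition holds a slice of tenants rather than exactly one, so the WHERE clause still filters inside it, but against a much smaller HNSW index. A sketch, with documents_hashed as an illustrative name:

    sql
    -- Hash-partitioned variant: 32 fixed partitions (~31,250 rows each in this example)
    CREATE TABLE documents_hashed (
        id BIGINT GENERATED ALWAYS AS IDENTITY,
        tenant_id UUID NOT NULL,
        content TEXT,
        embedding VECTOR(1536),
        created_at TIMESTAMPTZ DEFAULT NOW(),
        PRIMARY KEY (id, tenant_id)
    ) PARTITION BY HASH (tenant_id);

    CREATE TABLE documents_hashed_p0 PARTITION OF documents_hashed
        FOR VALUES WITH (MODULUS 32, REMAINDER 0);
    -- ... repeat (or generate) for remainders 1 through 31.

    CREATE INDEX ON documents_hashed USING hnsw (embedding vector_l2_ops);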

    Pattern 3: The Two-Stage Re-ranking Strategy

    What if you can't partition? Perhaps your filter is on a high-cardinality column (e.g., author_id), or you need to combine multiple filters (status = 'published' AND category IN ('A', 'B')). In these cases, neither of the previous patterns is a good fit. The two-stage re-ranking strategy provides a flexible alternative.

    The logic is:

  • Stage 1 (Approximate Fetch): Perform a broad, unfiltered (or lightly filtered) ANN search to retrieve a large set of candidate documents. For example, LIMIT 200 instead of LIMIT 10.
  • Stage 2 (Exact Filter & Re-rank): Apply the precise, complex filter to this much smaller candidate set in a subsequent step, and then re-sort the survivors by vector distance to find the true top k.

    Implementation with CTEs

    Common Table Expressions (CTEs) in SQL are perfect for this.

    sql
    WITH approximate_neighbors AS (
      -- Stage 1: Fetch a large candidate pool using the index.
      SELECT id, tenant_id, embedding
      FROM documents
      ORDER BY embedding <-> :query_vector
      LIMIT 200 -- Widen the search significantly
    )
    -- Stage 2: Apply the precise filter and re-rank.
    SELECT id, tenant_id
    FROM approximate_neighbors
    WHERE tenant_id = '00000000-0000-0000-0000-000000000001'
    ORDER BY embedding <-> :query_vector -- Re-ranking is crucial!
    LIMIT 10;
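
    One optional refinement: compute the distance once inside the CTE and reuse it for the re-rank, which avoids evaluating the distance operator a second time in the outer ORDER BY and avoids dragging the full 1536-dimension embedding column through the outer query:

    sql
    WITH approximate_neighbors AS (
      SELECT id, tenant_id, embedding <-> :query_vector AS distance
      FROM documents
      ORDER BY embedding <-> :query_vector
      LIMIT 200
    )
    SELECT id, tenant_id
    FROM approximate_neighbors
    WHERE tenant_id = '00000000-0000-0000-0000-000000000001'
    ORDER BY distance
    LIMIT 10;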

    Performance Analysis

    Let's analyze the EXPLAIN plan for this query.

    text
    Limit  (cost=... rows=10 width=...) (actual time=180.567..180.569 rows=10 loops=1)
      ->  Sort  (cost=... rows=50 width=...) (actual time=180.566..180.567 rows=10 loops=1)
            Sort Key: ((embedding <-> '...'::vector))
            ->  Subquery Scan on approximate_neighbors  (actual time=...)
                  Filter: (tenant_id = '...'::uuid)
                  Rows Removed by Filter: 180
                  ->  Limit  (cost=... rows=200 width=...) (actual time=150.123..155.456 rows=200 loops=1)
                        ->  Index Scan using documents_embedding_idx on documents  (actual time=...)
                              Order By: (embedding <-> '...'::vector)

    * Stage 1 (Inner Limit): The Index Scan fetches 200 approximate neighbors. This took ~155ms. It's slower than fetching 10, but still leverages the index efficiently because there's no WHERE clause.

    * Stage 2 (Outer Filter and Sort): The outer query takes these 200 rows, filters them down (let's say 20 matching rows were found), and then performs a very fast in-memory sort on that small set to find the final top 10.

    * Total Time: The total execution time is ~180ms. This is a huge improvement over the naive 1500ms query, although not as fast as the 25ms partitioning solution.

    Tuning the Candidate Set Size (LIMIT):

    The LIMIT in the CTE (e.g., 200) is a critical tuning parameter. It controls the trade-off between the performance of Stage 1 and the recall of the final result.

    * Too small: You might not fetch enough candidates that satisfy the filter, leading to poor recall or fewer than k results.

    * Too large: Stage 1 becomes slow, diminishing the benefits of the pattern.

    You need to estimate the selectivity of your filter. If your filter typically matches 5% of the data, and you want a final LIMIT 10, you need to fetch at least 10 / 0.05 = 200 candidates in Stage 1 to have a good chance of finding 10 valid ones.
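
    One way to put a number on that selectivity, and to derive a Stage 1 LIMIT with some headroom (the 2x factor is illustrative), is to measure it directly:

    sql
    -- Measure filter selectivity and suggest a Stage 1 candidate LIMIT for k = 10.
    -- (For very large tables, pg_class.reltuples gives a cheap approximate row count.)
    WITH sel AS (
        SELECT
            (SELECT COUNT(*) FROM documents
              WHERE tenant_id = '00000000-0000-0000-0000-000000000001')::numeric
            / (SELECT COUNT(*) FROM documents) AS selectivity
    )
    SELECT
        selectivity,
        CEIL(10 / NULLIF(selectivity, 0) * 2) AS suggested_stage1_limit
    FROM sel;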

    Production Considerations and Advanced Topics

    Optimizing queries is only part of the story. Here are critical operational factors for running pgvector with HNSW at scale.

    * Index Build Time and Resources: Building an HNSW index is CPU and memory-intensive. For very large tables, use CREATE INDEX CONCURRENTLY to avoid locking out writes. Crucially, increase maintenance_work_mem significantly before building the index (e.g., to several gigabytes) to speed up the build process.

    * Memory Usage: HNSW performance relies on the graph structure fitting into memory. The index size can be estimated: for a 1M-row table with default m=16, the link structure alone is roughly 1,000,000 * 16 * 2 * 4 bytes (plus overhead), or about 128MB, on top of the vector data the index also stores. Monitor your cache hit rates to ensure the index isn't constantly being read from disk.

    * The UPDATE/DELETE Problem (Index Bloat): The HNSW implementation in pgvector does not immediately remove nodes from the graph on DELETE or UPDATE; they are first marked as dead. Under heavy churn the index grows and query performance degrades as graph traversal wastes time on dead nodes, and VACUUM alone does not fully restore graph quality. The current best practice is to periodically re-index the table (REINDEX INDEX CONCURRENTLY ...). For tables with high churn, treat this as a mandatory maintenance task.
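
    The commands behind these three points are short enough to collect in one place; the specific values are illustrative and should be sized to your instance:

    sql
    -- Build: give the build plenty of memory and avoid blocking writes.
    -- (If the baseline index already exists under this default name, drop it first.)
    SET maintenance_work_mem = '8GB';
    SET max_parallel_maintenance_workers = 7;  -- parallel HNSW builds need pgvector 0.6+
    CREATE INDEX CONCURRENTLY documents_embedding_idx
        ON documents USING hnsw (embedding vector_l2_ops)
        WITH (m = 16, ef_construction = 64);

    -- Memory: check the on-disk index size and how often it is served from cache.
    SELECT pg_size_pretty(pg_relation_size('documents_embedding_idx'));

    SELECT indexrelname,
           round(idx_blks_hit::numeric / NULLIF(idx_blks_hit + idx_blks_read, 0), 3) AS cache_hit_ratio
    FROM pg_statio_user_indexes
    WHERE indexrelname = 'documents_embedding_idx';

    -- Bloat: periodically rebuild the index on high-churn tables (PostgreSQL 12+).
    REINDEX INDEX CONCURRENTLY documents_embedding_idx;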

    Decision Framework: Which Pattern to Use?

    There is no single best solution. Your choice depends on your specific query patterns and data architecture.

  • Start with ef_search Tuning:

    * When? For all filtered queries, this is your first and easiest step.

    * Pros: No schema changes, can be set per-transaction.

    * Cons: Limited effectiveness for highly selective filters; can significantly increase latency.

  • Use Partitioning for Core Filters:

    * When? Your queries almost always filter on a known, low-to-medium cardinality key (e.g., tenant_id, region, product_category).

    * Pros: Drastic, order-of-magnitude performance improvement. The most robust solution.

    * Cons: Requires schema changes and operational overhead for partition management.

  • Implement Two-Stage Re-ranking for Complex/Ad-hoc Filters:

    * When? You need to filter on high-cardinality columns or combine multiple, dynamic filters where partitioning is not feasible.

    * Pros: Highly flexible, significant performance gain over the naive approach.

    * Cons: Requires query restructuring, and tuning the candidate set size is crucial for balancing performance and recall.

    By methodically applying these advanced patterns and understanding their underlying trade-offs, you can move beyond basic pgvector usage and build a truly scalable, high-performance semantic search system that thrives under the complex filtering demands of production applications.
