Optimizing pg_vector: HNSW vs. IVFFlat for Metadata-Rich Searches


The Filter-Then-Search vs. Search-Then-Filter Dilemma

In modern AI applications, vector search is rarely performed in a vacuum. Users expect to find semantically similar items within a specific context—products in a certain category, documents owned by a specific user, or images uploaded within a date range. In PostgreSQL with pg_vector, this translates to a query that combines a vector distance operator (<=> for cosine distance, <-> for L2 distance, or <#> for negative inner product) with a standard WHERE clause.
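
For reference, here is what the three operators return on toy three-dimensional vectors. Note that <#> returns the negative inner product, so for all three operators smaller means "closer":

sql
-- The three pgvector distance operators on toy 3-d vectors:
SELECT '[1,2,3]'::vector <-> '[4,5,6]'::vector AS l2_distance,            -- ~5.196
       '[1,2,3]'::vector <#> '[4,5,6]'::vector AS negative_inner_product, -- -32
       '[1,2,3]'::vector <=> '[4,5,6]'::vector AS cosine_distance;        -- ~0.0254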

Consider a canonical e-commerce scenario: a table of millions of products, each with a vector embedding representing its visual and textual features, and numerous metadata columns.

sql
CREATE TABLE products (
    id BIGSERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    category_id INT NOT NULL,
    tenant_id UUID NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    embedding VECTOR(768) -- From a model like sentence-transformers/all-mpnet-base-v2
);

-- Standard metadata indexes
CREATE INDEX ON products (category_id);
CREATE INDEX ON products (tenant_id);

-- Vector index (we'll explore types later)
CREATE INDEX ON products USING hnsw (embedding vector_l2_ops);

The user wants to find products visually similar to a query image, but only within 'category_id = 123' and for their specific 'tenant_id'. The intuitive query looks like this:

sql
-- The 'Naive' Query
SELECT id, name
FROM products
WHERE category_id = 123 AND tenant_id = 'a1b2c3d4-e5f6-7890-1234-567890abcdef'
ORDER BY embedding <-> '[0.1, 0.2, ..., 0.9]'
LIMIT 10;

This query presents a fundamental challenge to the PostgreSQL query planner. The planner must decide between two primary execution strategies:

  • Filter-Then-Search (Bitmap Heap Scan): First, use the B-tree indexes on category_id and tenant_id to find all rows matching the WHERE clause. Then, perform an exhaustive, exact nearest neighbor search (a sequential scan and sort) on that resulting subset of rows. This is efficient if the filter is highly selective (e.g., returns only a few hundred rows), but disastrous if it returns tens of thousands of rows, as the vector search portion becomes a costly, un-indexed operation.
  • Search-Then-Filter (Index Scan on Vector Index): First, use the ANN vector index to find the top N nearest neighbors to the query vector from the entire table. Then, filter this small result set against the WHERE clause. This is fast if the filter is not very selective, but it suffers from a critical accuracy problem. If the true 10 nearest neighbors that match the filter are not within the top, say, 1000 candidates returned by the index scan, they will be missed entirely. The planner retrieves a fixed number of candidates from the ANN index, and if your desired items aren't in that initial batch, they're gone.
The PostgreSQL planner uses statistics to estimate the cost of each path. However, for ANN indexes, these statistics can be misleading: the planner has no deep understanding of the vector space's data distribution, nor of how the WHERE clause correlates with the ORDER BY clause. This often leads to suboptimal plan selection, resulting in either high latency or poor recall. One way to see both strategies in action is to nudge the planner with session-local settings, as sketched below.
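
As a diagnostic (not something to run in production), you can temporarily disable individual plan node types to push the planner down one path or the other and compare. These are standard PostgreSQL settings; the query vector is elided as in the examples below:

sql
-- Force the filter-then-search path by discouraging the ordered vector index scan:
BEGIN;
SET LOCAL enable_indexscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, name
FROM products
WHERE category_id = 123 AND tenant_id = 'a1b2c3d4-e5f6-7890-1234-567890abcdef'
ORDER BY embedding <-> '[0.1, 0.2, ..., 0.9]'
LIMIT 10;
ROLLBACK;  -- SET LOCAL reverts automatically when the transaction ends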

    Let's examine how the two primary pg_vector index types, IVFFlat and HNSW, handle this challenge internally.

    Deep Dive: IVFFlat Internals and Metadata Filtering

    IVFFlat (Inverted File with Flat Compression) works by partitioning the vector space. During index creation, it uses k-means clustering to compute a number of centroids equal to the lists build parameter. Each vector in your table is then assigned to the nearest centroid's list.

    sql
    -- Creating an IVFFlat index
    CREATE INDEX ON products USING ivfflat (embedding vector_l2_ops) WITH (lists = 1000);

    At query time, pg_vector identifies the probes centroids closest to your query vector (probes is a query-time setting) and only searches the vectors within those corresponding lists. This dramatically narrows the search space. For a table with 10 million vectors and lists = 1000, a query with probes = 10 will only scan roughly (10/1000) * 10M = 100,000 vectors instead of the full 10 million.

    How Filtering Interacts with IVFFlat:

    When you add a WHERE clause, the filtering happens after the list selection. The planner fetches all vectors from the probed lists and then applies the metadata filter. This means IVFFlat is fundamentally a Search-Then-Filter mechanism.

    Code Example 1: IVFFlat with a High-Cardinality Filter

    Let's simulate a scenario with 1 million products across 10,000 tenants. Each tenant has 100 products.

    sql
    -- Setup
    CREATE EXTENSION IF NOT EXISTS vector;
    DROP TABLE IF EXISTS products;
    
    CREATE TABLE products (
        id BIGSERIAL PRIMARY KEY,
        tenant_id INT NOT NULL,
        embedding VECTOR(128)
    );
    
    -- Insert 1M products for 10k tenants
    INSERT INTO products (tenant_id, embedding)
    SELECT
        (random() * 9999)::int,
        ARRAY(SELECT random() FROM generate_series(1, 128))::vector
    FROM generate_series(1, 1000000);
    
    CREATE INDEX ON products (tenant_id);
    CREATE INDEX products_embedding_ivfflat_idx ON products USING ivfflat (embedding vector_l2_ops) WITH (lists = 1000);
    
    -- Set query parameters
    SET ivfflat.probes = 10;
    
    -- Generate a random query vector
    DO $$
    DECLARE
        query_embedding vector(128);
        plan_line text;
    BEGIN
        query_embedding := ARRAY(SELECT random() FROM generate_series(1, 128))::vector;
    
        RAISE NOTICE 'EXPLAIN ANALYZE for IVFFlat:';
        -- PL/pgSQL cannot run EXPLAIN directly, so EXECUTE it and print each plan line.
        FOR plan_line IN EXECUTE format(
            'EXPLAIN (ANALYZE, BUFFERS)
             SELECT id FROM products
             WHERE tenant_id = 1234
             ORDER BY embedding <-> %L::vector
             LIMIT 10', query_embedding)
        LOOP
            RAISE NOTICE '%', plan_line;
        END LOOP;
    END$$;

    EXPLAIN ANALYZE Output (Illustrative):

    text
    Limit  (cost=134.58..134.60 rows=10 width=8) (actual time=150.345..150.348 rows=10 loops=1)
      Buffers: shared hit=2845
      ->  Index Scan using products_embedding_ivfflat_idx on products  (cost=0.00..134.02 rows=100 width=8) (actual time=0.879..150.341 rows=10 loops=1)
            Order By: (embedding <-> '[...]'::vector)
            Filter: (tenant_id = 1234)
            Rows Removed by Filter: 9000
            Buffers: shared hit=2845
    Planning Time: 0.213 ms
    Execution Time: 150.489 ms

    Analysis:

    * The planner chose the IVFFlat index (products_embedding_ivfflat_idx).

    * The index scan streamed candidates from the 10 probed lists in order of distance. Crucially, the Filter: (tenant_id = 1234) was applied only *after* each candidate was fetched: 9,000 rows were fetched and thrown away before the 10 matching rows needed by the LIMIT were found.

    * The scan can only ever return rows that happen to live in the probed lists. Had those lists contained fewer than 10 rows for this tenant, the query would have silently returned fewer than 10 results.

    The Edge Case and Performance Trap:

    The major risk here is low recall. The candidates retrieved from the probed lists are the closest neighbors overall, not the closest neighbors for our tenant. Our target tenant's 100 vectors may be sparsely distributed across the entire vector space, so it is highly probable that the true nearest neighbors for tenant_id = 1234 sit in lists that were never probed. We get a fast but potentially useless result.

    Increasing probes improves recall but linearly increases latency. To guarantee finding all 100 of the tenant's products, you might need to set probes to a very high value, negating the performance benefit of the index.
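
    A minimal sketch of that trade-off; with lists = 1000, setting probes = 1000 degenerates into scanning every list, i.e., an exact but slow search:

    sql
    -- Sweep ivfflat.probes to trade recall against latency (session-local):
    SET ivfflat.probes = 1;    -- fastest, worst recall for filtered queries
    SET ivfflat.probes = 32;   -- a common middle ground
    SET ivfflat.probes = 1000; -- probes every list: exact results, but the index buys nothing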

    Conclusion for IVFFlat: It's generally a poor choice for queries with highly selective, high-cardinality filters because its search-then-filter nature leads to a severe trade-off between performance and accuracy.

    Deep Dive: HNSW Internals and its Filtering Advantage

    HNSW (Hierarchical Navigable Small World) builds a multi-layered graph of vectors. The top layers contain long-range connections, and the bottom layers have short-range connections. A search starts at an entry point in the top layer, greedily traversing the graph to find the nearest neighbor in that layer. It then drops to the layer below and repeats the process until it reaches the bottom layer, where it performs a refined local search.

    sql
    -- Creating an HNSW index
    -- m: max connections per node
    -- ef_construction: size of the dynamic candidate list during index build
    CREATE INDEX ON products USING hnsw (embedding vector_l2_ops) WITH (m = 16, ef_construction = 64);

    At query time, the ef_search parameter controls the size of the candidate list used during the graph traversal. A higher ef_search increases accuracy at the cost of latency.

    How Filtering Interacts with HNSW:

    Unlike IVFFlat, HNSW's graph traversal offers more flexibility. Candidates stream out of the graph in distance order, so a row that fails the WHERE clause can be discarded as it surfaces while the traversal continues on to other neighbors, instead of a fixed batch being fetched and post-filtered in bulk. The result is a form of on-the-fly filtering that is much more efficient than IVFFlat's post-filtering.
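
    If you are on pgvector 0.8.0 or newer (worth verifying against your installed version), this behavior is exposed explicitly as iterative index scans:

    sql
    -- Keep traversing the graph until the LIMIT is satisfied or a cap is reached:
    SET hnsw.iterative_scan = relaxed_order; -- or strict_order to preserve exact distance order
    SET hnsw.max_scan_tuples = 20000;        -- safety cap on how far the scan may wander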

    Code Example 2: HNSW with the Same High-Cardinality Filter

    Let's drop the IVFFlat index and create an HNSW index on the same table.

    sql
    DROP INDEX products_embedding_ivfflat_idx;
    CREATE INDEX products_embedding_hnsw_idx ON products USING hnsw (embedding vector_l2_ops) WITH (m = 16, ef_construction = 64);
    
    -- Set query parameters
    SET hnsw.ef_search = 40;
    
    -- Rerun the same query
    DO $$
    DECLARE
        query_embedding vector(128);
        plan_line text;
    BEGIN
        query_embedding := ARRAY(SELECT random() FROM generate_series(1, 128))::vector;
    
        RAISE NOTICE 'EXPLAIN ANALYZE for HNSW:';
        FOR plan_line IN EXECUTE format(
            'EXPLAIN (ANALYZE, BUFFERS)
             SELECT id FROM products
             WHERE tenant_id = 1234
             ORDER BY embedding <-> %L::vector
             LIMIT 10', query_embedding)
        LOOP
            RAISE NOTICE '%', plan_line;
        END LOOP;
    END$$;

    EXPLAIN ANALYZE Output (Illustrative):

    text
    Limit  (cost=0.15..32.83 rows=10 width=8) (actual time=45.123..45.126 rows=10 loops=1)
      Buffers: shared hit=1532
      ->  Index Scan using products_embedding_hnsw_idx on products  (cost=0.15..325.86 rows=100 width=8) (actual time=0.076..45.120 rows=10 loops=1)
            Order By: (embedding <-> '[...]'::vector)
            Filter: (tenant_id = 1234)
            Rows Removed by Filter: 4210
            Buffers: shared hit=1532
    Planning Time: 0.189 ms
    Execution Time: 45.234 ms

    Analysis:

    The plan looks similar, but the execution is fundamentally different and much faster (45 ms vs. 150 ms). The key is that the Filter: (tenant_id = 1234) is applied as candidates stream out of the HNSW graph: the index scan emits only rows that pass the tenant filter, and the Limit node halts the traversal after pulling just 10 of them, rather than materializing a large, unfiltered super-set. This results in much better performance and recall than IVFFlat for this use case.

    Performance Considerations:

    * ef_search is critical: This parameter is your main lever for balancing speed and recall (the default is 40). For filtered searches, you often need to increase ef_search beyond the default to ensure the algorithm explores enough of the graph to find 10 candidates that also satisfy your filter; a minimal tuning sketch follows this list.

    * Index Build Parameters (m, ef_construction): A well-constructed graph is essential. Higher m and ef_construction values create a higher-quality graph, which can improve search accuracy and even speed (by allowing for more direct paths), but at the cost of much slower index build times and a larger index on disk.
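
    A minimal tuning sketch; the value below is an illustrative starting point, not a recommendation:

    sql
    -- Raise ef_search for filtered searches so the traversal explores enough of
    -- the graph to find LIMIT rows that also pass the WHERE clause:
    SET hnsw.ef_search = 200;  -- higher = better recall, more latency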

    While HNSW is a significant improvement, it can still be suboptimal if the filter is extremely selective (e.g., tenant_id returns only 5 rows out of 10 million). The planner might still struggle. For these cases, we need a more explicit strategy.

    Production Pattern 1: The Two-Step Query for High-Selectivity Filters

    Instead of asking the planner to magically figure out the best approach, we can force its hand. This pattern explicitly tells PostgreSQL: "First, efficiently find the small set of rows that match my metadata filter, and then perform a vector search only on that subset."

    We achieve this using a Common Table Expression (CTE), marked MATERIALIZED so that PostgreSQL 12 and later cannot inline it away, to separate the two stages.

    Code Example 3: Implementing the Two-Step Query

    This query pattern is robust and often the best solution for high-cardinality, high-selectivity filters.

    sql
    -- The same HNSW index and ef_search settings are used.
    DO $$
    DECLARE
        query_embedding vector(128);
        plan_line text;
    BEGIN
        query_embedding := ARRAY(SELECT random() FROM generate_series(1, 128))::vector;
    
        RAISE NOTICE 'EXPLAIN ANALYZE for Two-Step Query:';
        FOR plan_line IN EXECUTE format(
            'EXPLAIN (ANALYZE, BUFFERS)
             WITH filtered_products AS MATERIALIZED (
                 -- Step 1: use the B-tree index to fetch this tenant''s rows.
                 -- MATERIALIZED (PostgreSQL 12+) stops the planner from inlining
                 -- the CTE and reaching for the vector index anyway.
                 SELECT id, embedding
                 FROM products
                 WHERE tenant_id = 1234
             )
             -- Step 2: perform the vector search ONLY on the filtered set.
             SELECT id
             FROM filtered_products
             ORDER BY embedding <-> %L::vector
             LIMIT 10', query_embedding)
        LOOP
            RAISE NOTICE '%', plan_line;
        END LOOP;
    END$$;

    EXPLAIN ANALYZE Output (Illustrative):

    text
    Limit  (cost=20.55..20.57 rows=10 width=8) (actual time=8.101..8.103 rows=10 loops=1)
      Buffers: shared hit=105
      ->  Sort  (cost=20.55..20.80 rows=100 width=140) (actual time=8.100..8.101 rows=10 loops=1)
            Sort Key: ((filtered_products.embedding <-> '[...]'::vector))
            Sort Method: top-N heapsort  Memory: 26kB
            Buffers: shared hit=105
            ->  CTE Scan on filtered_products  (cost=19.00..19.20 rows=100 width=140) (actual time=0.038..8.012 rows=100 loops=1)
                  Buffers: shared hit=104
                  CTE filtered_products
                    ->  Index Scan using products_tenant_id_idx on products  (cost=0.00..19.00 rows=100 width=140) (actual time=0.031..7.953 rows=100 loops=1)
                          Index Cond: (tenant_id = 1234)
                          Buffers: shared hit=104
    Planning Time: 0.350 ms
    Execution Time: 8.250 ms

    Analysis of the Breakthrough:

    * The planner is no longer confused. The CTE filtered_products is evaluated first.

    * It performs an Index Scan using products_tenant_id_idx—the fast B-tree index—to find the 100 products for our tenant.

    * The outer query then performs a Sort (an exact, brute-force vector search) on only these 100 rows. A brute-force search on 100 items is trivial and extremely fast.

    * The HNSW index is not even used here, and that's the key! We've traded an approximate search on millions of rows for an exact search on a tiny subset.

    Benchmark Comparison (Illustrative):

    Method                 | Latency (p95) | Recall | Key Characteristic
    -----------------------|---------------|--------|-----------------------------------------------------
    Naive Query (IVFFlat)  | ~150 ms       | Low    | Prone to missing results due to list probing.
    Naive Query (HNSW)     | ~45 ms        | High   | Good, but performance depends heavily on ef_search.
    Two-Step Query (CTE)   | ~8 ms         | 100%   | Guarantees perfect accuracy on the filtered subset.

    This pattern is exceptionally powerful when your filter reduces the candidate set to a manageable size (e.g., < 50,000 rows), where a sequential scan and sort is faster than an ANN index scan.
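
    If the filtered set size varies wildly between tenants, you can pick the strategy at runtime. Below is a minimal sketch of that idea; the function name search_products and the 50,000-row threshold are illustrative assumptions, not part of pg_vector:

    sql
    -- Hypothetical adaptive search: exact scan for small tenants, HNSW for large ones.
    CREATE OR REPLACE FUNCTION search_products(
        p_tenant int,
        p_query  vector(128),
        p_limit  int DEFAULT 10
    ) RETURNS TABLE (id bigint)
    LANGUAGE plpgsql AS $$
    DECLARE
        n bigint;
    BEGIN
        -- Cheap selectivity check via the B-tree index on tenant_id.
        SELECT count(*) INTO n FROM products p WHERE p.tenant_id = p_tenant;
    
        IF n < 50000 THEN
            -- Small filtered set: exact search via the materialized CTE pattern.
            RETURN QUERY
            WITH filtered AS MATERIALIZED (
                SELECT p.id, p.embedding FROM products p WHERE p.tenant_id = p_tenant
            )
            SELECT f.id FROM filtered f
            ORDER BY f.embedding <-> p_query
            LIMIT p_limit;
        ELSE
            -- Large filtered set: let the HNSW index scan do the work.
            RETURN QUERY
            SELECT p.id FROM products p
            WHERE p.tenant_id = p_tenant
            ORDER BY p.embedding <-> p_query
            LIMIT p_limit;
        END IF;
    END$$;
    
    -- Usage: SELECT * FROM search_products(1234, '[...]'::vector, 10);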

    Production Pattern 2: Partial Indexes and Partitioning for Multi-Tenancy

    The Two-Step Query is great, but what if the filter isn't on a high-cardinality column like tenant_id? What if it's on a lower-cardinality field like status IN ('active', 'archived') or region IN ('US', 'EU', 'APAC')?

    In these scenarios, especially in multi-tenant systems where data is naturally siloed, creating separate, smaller indexes can provide a massive performance boost. PostgreSQL offers two excellent mechanisms for this: Partial Indexes and Table Partitioning.

    Partial Indexes

    A partial index is an index built over a subset of a table; the subset is defined by a WHERE clause in the CREATE INDEX statement. The planner will only consider using this index if the query's WHERE clause matches the index's WHERE clause.

    Code Example 4: Partial HNSW Indexes

    Imagine a system where a few enterprise tenants generate the vast majority of queries.

    sql
    -- Create a smaller, dedicated index just for tenant_id = 1234
    CREATE INDEX products_tenant_1234_embedding_hnsw_idx 
    ON products USING hnsw (embedding vector_l2_ops) 
    WHERE tenant_id = 1234;
    
    -- Create another for a different high-value tenant
    CREATE INDEX products_tenant_5678_embedding_hnsw_idx 
    ON products USING hnsw (embedding vector_l2_ops) 
    WHERE tenant_id = 5678;
    
    -- Now, run the original naive query for tenant 1234
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT id FROM products
    WHERE tenant_id = 1234
    ORDER BY embedding <-> '[...]'::vector
    LIMIT 10;

    EXPLAIN ANALYZE Output (Illustrative):

    text
    Limit ... (actual time=5.123..5.125 rows=10 loops=1)
      ->  Index Scan using products_tenant_1234_embedding_hnsw_idx on products ...
            Order By: (embedding <-> '[...]'::vector)

    Analysis:

    The planner intelligently selected the tiny partial index that only contains data for tenant 1234. The search is now happening on a much smaller, more concentrated HNSW graph, leading to extremely low latency. This avoids the overhead of traversing a massive graph that contains data for all tenants.
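
    One caveat: the planner considers a partial index only when the query's WHERE clause provably implies the index's predicate. A query for any tenant without a dedicated partial index falls back to the global HNSW index (or a sequential scan):

    sql
    -- Tenant 9999 has no partial index, so the tiny per-tenant indexes are ignored:
    EXPLAIN
    SELECT id FROM products
    WHERE tenant_id = 9999
    ORDER BY embedding <-> '[...]'::vector
    LIMIT 10;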

    Table Partitioning

    For an even cleaner separation, especially when the number of tenants is large but manageable, you can use declarative partitioning.

    sql
    -- Create a partitioned table
    CREATE TABLE products_partitioned (
        id BIGSERIAL,
        tenant_id INT NOT NULL,
        embedding VECTOR(128)
    ) PARTITION BY LIST (tenant_id);
    
    -- Create partitions for each tenant
    CREATE TABLE products_tenant_1 PARTITION OF products_partitioned FOR VALUES IN (1);
    CREATE TABLE products_tenant_2 PARTITION OF products_partitioned FOR VALUES IN (2);
    -- ... and so on
    
    -- Create an HNSW index on EACH partition
    CREATE INDEX ON products_tenant_1 USING hnsw (embedding vector_l2_ops);
    CREATE INDEX ON products_tenant_2 USING hnsw (embedding vector_l2_ops);

    When you query the parent table products_partitioned with WHERE tenant_id = 1, the query planner performs partition pruning. It knows it only needs to scan the products_tenant_1 table and its corresponding index. This is operationally more complex to manage (requires automation for new tenants) but provides the ultimate in query performance and data isolation.
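
    You can confirm the pruning with a plain EXPLAIN; only the matching partition and its index should appear in the plan:

    sql
    -- Only products_tenant_1 (and its HNSW index) should show up here:
    EXPLAIN (COSTS OFF)
    SELECT id FROM products_partitioned
    WHERE tenant_id = 1
    ORDER BY embedding <-> '[...]'::vector
    LIMIT 10;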

    Advanced Considerations and Final Recommendations

    * Re-indexing and VACUUM: HNSW indexes in pg_vector grow incrementally on INSERT (and on the insert half of an UPDATE), but DELETEs only mark tuples dead. Dead tuples remain in the graph until a VACUUM removes them, and even then the graph is repaired locally rather than rebuilt. For tables with high churn (frequent updates/deletes), the graph's quality can degrade over time, impacting performance. You may need a periodic REINDEX strategy during a maintenance window; see the sketch after this list.

    * Quantization: For extremely large datasets where the vector index size is a concern (HNSW indexes can be large), look into quantization. pg_vector supports half-precision vectors (halfvec, reducing float4 to float2) and binary quantization (bit), both of which dramatically shrink the index size and memory footprint. This comes at the cost of a slight reduction in accuracy, a trade-off that is often acceptable. A minimal indexing sketch follows this list.

    * EXPLAIN ANALYZE is Your Ground Truth: Never assume a query plan. Always test your specific queries against a realistic dataset. Use EXPLAIN (ANALYZE, BUFFERS) to see the actual execution plan, timings, and buffer hits. This is the only way to truly understand how your chosen indexing and query patterns are performing.
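
    Minimal sketches for the first two points above. The halfvec expression-index pattern assumes pgvector 0.7.0 or newer; verify it against your installed version:

    sql
    -- 1. Rebuild a degraded HNSW graph without blocking writes (PostgreSQL 12+):
    REINDEX INDEX CONCURRENTLY products_embedding_hnsw_idx;
    
    -- 2. Half-precision (float2) expression index to shrink the index footprint:
    CREATE INDEX products_embedding_halfvec_idx
    ON products USING hnsw ((embedding::halfvec(128)) halfvec_l2_ops);
    
    -- Queries must use the same cast to hit this index, e.g.:
    -- ORDER BY embedding::halfvec(128) <-> '[...]'::halfvec(128)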

    Conclusion

    Moving from basic vector search to high-performance, filtered vector search requires a deep understanding of index internals and query execution. Here is the decision framework for senior engineers:

  • For high-cardinality, highly selective filters (e.g., user_id, document_id): The Two-Step Query pattern using a CTE is almost always the optimal choice. It provides perfect recall on the filtered set and leverages fast B-tree indexes for the initial filtering step, resulting in predictable, low latency.
  • For low-to-medium cardinality filters that define clear data silos (e.g., region, status, multi-tenancy): Partial Indexes or Table Partitioning are superior. They create smaller, more efficient vector indexes that the planner can target directly, leading to the best possible performance, albeit with higher operational overhead.
  • For low-selectivity filters where accuracy is not paramount: A naive query on a well-tuned HNSW index can be sufficient. Be prepared to tune ef_search carefully and monitor recall.
  • Avoid IVFFlat for any serious filtered search workload.

    By deliberately guiding the PostgreSQL query planner with these advanced patterns, you can build robust, scalable, and highly performant AI applications that seamlessly merge the power of semantic search with the precision of structured metadata filtering.
