pgvector HNSW: Advanced Filtering for High-Cardinality Metadata

Goh Ling Yong

The Illusion of Simplicity: Why `WHERE` Clauses Kill HNSW Performance

In the rapidly evolving landscape of AI-powered applications, pgvector has emerged as a pragmatic choice for adding vector similarity search capabilities to PostgreSQL. The Hierarchical Navigable Small World (HNSW) index, introduced in pgvector 0.5.0, promises roughly logarithmic search complexity for Approximate Nearest Neighbor (ANN) queries, a significant leap over a full sequential scan or the recall-limited IVFFlat index. Senior engineers often start with a query that looks deceptively simple:

sql
SELECT id, content
FROM documents
WHERE tenant_id = 'some-specific-tenant'
ORDER BY embedding <=> '[...your_query_vector...]'
LIMIT 10;

This query, which combines a metadata filter with a vector similarity search, is the bedrock of multi-tenant RAG systems, e-commerce recommendations, and personalized search. However, under the hood, this query can be a performance disaster, especially when tenant_id is a high-cardinality column and the HNSW index is in use.

The core problem is that the HNSW index graph is built solely based on vector proximity. It has no intrinsic knowledge of the metadata in other columns. Consequently, PostgreSQL's query planner, in most scenarios, adopts a post-filtering strategy:

  • HNSW Index Scan: The planner uses the HNSW index to find the k nearest neighbors to the query vector from the entire table. To satisfy a LIMIT 10 with a restrictive WHERE clause, it might need to retrieve hundreds or thousands of candidates (k >> 10).
  • Filter Application: These candidate rows are then joined back to the heap, and the WHERE tenant_id = '...' clause is applied.
  • Discard: The vast majority of these candidates are discarded, having wasted significant I/O and compute, until 10 rows matching the filter are found.

This article bypasses introductory concepts. We assume you understand vector embeddings, ANN search, and basic pgvector usage. We will dissect, implement, and benchmark advanced, production-grade patterns to solve the high-cardinality filtering problem with HNSW, moving from architectural solutions to query-level optimizations.

    Baseline Scenario: Quantifying the Performance Cliff

    To analyze the problem, let's establish a concrete, reproducible test case. We'll create a table with 1 million documents, each with a 768-dimension embedding, a tenant_id (10,000 unique tenants), and a user_id (100,000 unique users).

    1. Table Setup and Data Generation

    First, ensure you have the vector extension enabled:

    sql
    CREATE EXTENSION IF NOT EXISTS vector;

    Now, the table structure:

    sql
    CREATE TABLE documents (
        id BIGSERIAL PRIMARY KEY,
        tenant_id INT NOT NULL,
        user_id INT NOT NULL,
        content TEXT,
        embedding VECTOR(768)
    );

    We'll use a Python script with psycopg2 and numpy to populate the table with realistic data.

    python
    import psycopg2
    import numpy as np
    from psycopg2.extras import execute_values
    
    # --- Configuration ---
    DB_CONNECT_STRING = "postgresql://user:password@host:port/dbname"
    NUM_DOCUMENTS = 1_000_000
    NUM_TENANTS = 10_000
    NUM_USERS = 100_000
    EMBEDDING_DIM = 768
    BATCH_SIZE = 1000
    
    def generate_batch(batch_size):
        data = []
        for _ in range(batch_size):
            tenant = np.random.randint(1, NUM_TENANTS + 1)
            user = np.random.randint(1, NUM_USERS + 1)
            embedding = np.random.rand(EMBEDDING_DIM).astype(np.float32)
            # In a real scenario, content would be actual text
            content = f"Document for tenant {tenant} and user {user}"
            # Format the embedding using pgvector's text representation '[x1,x2,...]'
            # so the INSERT does not depend on array-to-vector casts.
            data.append((tenant, user, content, '[' + ','.join(map(str, embedding)) + ']'))
        return data
    
    conn = psycopg2.connect(DB_CONNECT_STRING)
    cur = conn.cursor()
    
    print("Populating documents table...")
    for i in range(0, NUM_DOCUMENTS, BATCH_SIZE):
        batch_data = generate_batch(BATCH_SIZE)
        execute_values(
            cur,
            "INSERT INTO documents (tenant_id, user_id, content, embedding) VALUES %s",
            batch_data
        )
        conn.commit()
        if (i + BATCH_SIZE) % 10000 == 0:
            print(f"Inserted {(i + BATCH_SIZE):,} documents...")
    
    print("Population complete.")
    cur.close()
    conn.close()

    2. Index Creation

    Let's create a standard B-Tree index for our metadata and an HNSW index for the embeddings. HNSW tuning parameters (m and ef_construction) are critical for the trade-off between index build time, index size, and recall. For this scale, m=32 and ef_construction=128 are robust starting points.
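
    Index build speed matters at this scale as well. The snippet below is an optional sketch of standard PostgreSQL settings worth raising in your session before running the CREATE INDEX statements that follow, plus a quick size check for afterwards; the values are illustrative, and how much parallel workers help depends on your pgvector version.

    sql
    -- HNSW builds are memory- and CPU-hungry; give the build room to work.
    SET maintenance_work_mem = '2GB';
    SET max_parallel_maintenance_workers = 4;

    -- After the build completes, check how much disk the graph consumes.
    SELECT pg_size_pretty(pg_relation_size('idx_documents_embedding_hnsw'));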

    sql
    -- Index for metadata filtering
    CREATE INDEX idx_documents_tenant_user ON documents (tenant_id, user_id);
    
    -- HNSW index for vector search
    -- This will take some time to build on 1M rows
    -- vector_cosine_ops matches the <=> (cosine distance) operator used in our queries;
    -- vector_l2_ops would only serve <-> ordering and the index would be ignored.
    CREATE INDEX idx_documents_embedding_hnsw ON documents USING hnsw (embedding vector_cosine_ops) 
    WITH (m = 32, ef_construction = 128);

    3. Baseline Benchmark

    Let's run EXPLAIN ANALYZE on three queries: an unfiltered KNN search, a filtered search on a common tenant, and a filtered search on a rare tenant. We'll set ef_search, the query-time parameter controlling search quality/speed, to a reasonable value.
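
    The benchmark queries below call a random_vector(768) helper to stand in for a real query embedding; neither PostgreSQL nor pgvector ships such a function. A minimal sketch of one way to define it follows. Because it is volatile, you may prefer to generate a vector once and paste it in as a literal (or bind it as a parameter) when you want reproducible, index-friendly benchmarks.

    sql
    -- Hypothetical benchmarking helper: build a random vector of the given
    -- dimension as pgvector's text representation '[x1,x2,...]'.
    CREATE OR REPLACE FUNCTION random_vector(dim INT)
    RETURNS vector AS $$
        SELECT ('[' || string_agg(random()::text, ',') || ']')::vector
        FROM generate_series(1, dim);
    $$ LANGUAGE sql VOLATILE;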

    sql
    -- Set query-time HNSW parameter
    SET hnsw.ef_search = 64;
    
    -- Query 1: Unfiltered ANN search
    EXPLAIN ANALYZE SELECT id FROM documents ORDER BY embedding <=> (random_vector(768)) LIMIT 10;
    
    -- Query 2: Filtered search (common tenant, ~100 docs)
    EXPLAIN ANALYZE SELECT id FROM documents WHERE tenant_id = 101 ORDER BY embedding <=> (random_vector(768)) LIMIT 10;
    
    -- Query 3: Filtered search (rare tenant, assume it has exactly 10 docs)
    -- We need to find a tenant with few documents to simulate this
    EXPLAIN ANALYZE SELECT id FROM documents WHERE tenant_id = (SELECT tenant_id FROM documents GROUP BY tenant_id HAVING count(*) <= 15 LIMIT 1) ORDER BY embedding <=> (random_vector(768)) LIMIT 10;
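
    One caveat on Query 3: with 1 million uniformly generated documents across 10,000 tenants, every tenant ends up with roughly 100 rows, so the HAVING count(*) <= 15 subquery will likely find nothing. A small sketch of one way to manufacture a rare tenant for this test (tenant 42 is an arbitrary choice):

    sql
    -- Thin one tenant down to ~10 documents so Query 3 has something to hit.
    DELETE FROM documents
    WHERE tenant_id = 42
      AND id NOT IN (SELECT id FROM documents WHERE tenant_id = 42 LIMIT 10);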

    Expected Results & Analysis:

    * Query 1 (Unfiltered): This will be extremely fast. The plan will show an Index Scan on idx_documents_embedding_hnsw and return in single-digit milliseconds.

    text
        Limit  (cost=... rows=10 ...) (actual time=2.50..2.51 rows=10 loops=1)
          ->  Index Scan using idx_documents_embedding_hnsw on documents ...
                Order By: (embedding <=> ...)
        Planning Time: ...
        Execution Time: 2.60 ms

    * Query 2 (Filtered, Common Tenant): This will be significantly slower. The planner will still use the HNSW index but will need to fetch many more rows than the LIMIT to satisfy the WHERE clause.

    text
        Limit  (cost=... rows=10 ...) (actual time=45.10..45.12 rows=10 loops=1)
          ->  Index Scan using idx_documents_embedding_hnsw on documents ...
                Order By: (embedding <=> ...)
                Filter: (tenant_id = 101)
                Rows Removed by Filter: 990
        Planning Time: ...
        Execution Time: 45.80 ms

    The key insight here is Rows Removed by Filter: 990: the scan had to pull and distance-check 1,000 vectors from the index just to find 10 that matched the tenant_id. Note also that an HNSW scan only surfaces up to hnsw.ef_search candidates in total, so with a restrictive filter and a modest ef_search the query can even return fewer rows than the LIMIT asks for.
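
    The only lever the naive approach offers is to raise that candidate budget, which trades latency for more matches; a quick sketch (pgvector caps hnsw.ef_search at 1000, so this only goes so far for very selective filters):

    sql
    SET hnsw.ef_search = 400;  -- default is 40; larger values surface more candidates, slower

    EXPLAIN ANALYZE
    SELECT id
    FROM documents
    WHERE tenant_id = 101
    ORDER BY embedding <=> (random_vector(768))
    LIMIT 10;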

    * Query 3 (Filtered, Rare Tenant): This is the worst case for result quality. The HNSW scan exhausts its ef_search candidate budget long before it stumbles on one of the tenant's handful of rows and silently returns fewer than 10 results, often zero, unless the planner gives up on the vector index entirely and falls back to filtering the tenant's rows and sorting them exactly.

    This benchmark clearly establishes the problem. Now, let's architect real solutions.

    Strategy 1: Declarative Partitioning for Tenant Isolation

    If your most common and critical filter is on a single, known column with low-to-moderate cardinality (like tenant_id), PostgreSQL's native partitioning is an exceptionally powerful solution. The idea is to physically separate data by tenant_id, so each partition has its own smaller, independent HNSW index.

    1. Implementation

    We'll redefine our table as a partitioned table. This requires planning upfront and is more difficult to apply to existing, large tables.

    sql
    -- Drop the old table first if it exists
    DROP TABLE IF EXISTS documents;
    
    CREATE TABLE documents_partitioned (
        id BIGSERIAL,
        tenant_id INT NOT NULL,
        user_id INT NOT NULL,
        content TEXT,
        embedding VECTOR(768),
        PRIMARY KEY (id, tenant_id) -- Partition key must be part of the primary key
    ) PARTITION BY LIST (tenant_id);
    
    -- Create a partition for a specific tenant
    CREATE TABLE documents_t1 PARTITION OF documents_partitioned FOR VALUES IN (1);
    CREATE TABLE documents_t2 PARTITION OF documents_partitioned FOR VALUES IN (2);
    -- ... and so on for each tenant. This can be automated.
    
    -- Create indexes on a partition. THIS IS KEY.
    CREATE INDEX idx_documents_t1_embedding_hnsw ON documents_t1 USING hnsw (embedding vector_cosine_ops) WITH (m = 32, ef_construction = 128);
    CREATE INDEX idx_documents_t2_embedding_hnsw ON documents_t2 USING hnsw (embedding vector_cosine_ops) WITH (m = 32, ef_construction = 128);
    
    -- A B-tree index for other filters can be created on the parent table
    CREATE INDEX idx_documents_partitioned_user ON documents_partitioned (user_id);

    2. How It Solves the Problem

    When you query the parent table documents_partitioned with a WHERE tenant_id = 1 clause, the query planner performs Partition Pruning. It knows that only the documents_t1 partition can possibly contain matching rows. Therefore, it routes the entire query, including the vector search, to that small partition and its dedicated HNSW index. It doesn't even consider the other partitions.

    3. Benchmark and EXPLAIN ANALYZE

    Let's repopulate data for a few tenants and run the filtered query again.

    sql
    EXPLAIN ANALYZE SELECT id FROM documents_partitioned WHERE tenant_id = 1 ORDER BY embedding <=> (random_vector(768)) LIMIT 10;

    Expected Result:

    The performance will be nearly as fast as the original unfiltered query on the non-partitioned table. The EXPLAIN plan will explicitly show it is only scanning the documents_t1 partition.

    text
    Limit  (cost=... rows=10 ...) (actual time=3.15..3.16 rows=10 loops=1)
      ->  Index Scan using idx_documents_t1_embedding_hnsw on documents_t1 ...
            Order By: (embedding <=> ...)
    Planning Time: ...
    Execution Time: 3.50 ms

    4. Edge Cases and Production Considerations

    * Partition Management: You need a robust process for creating new partitions as new tenants sign up; this can be done with a trigger or a periodic maintenance job (a minimal sketch follows this list). Similarly, you need a strategy for detaching or dropping partitions when tenants churn.

    * Cross-Tenant Queries: Queries that do not include the partition key (tenant_id) will be slow, as they have to scan every partition. This architecture heavily optimizes for tenant-isolated queries.

    * Uneven Data Distribution: If one tenant has 100x more data than others, its partition and index will be much larger, leading to slower queries for that specific tenant (a "noisy neighbor" problem at the partition level).

    * Cardinality: This pattern is ideal for hundreds or thousands of tenants. It becomes unwieldy for millions of distinct values (e.g., partitioning by user_id).
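
    A minimal sketch of such a partition-management helper, assuming the documents_partitioned schema and the documents_t<tenant_id> naming convention used in this section; call it from your tenant-onboarding path or a maintenance job:

    sql
    CREATE OR REPLACE FUNCTION create_tenant_partition(p_tenant_id INT)
    RETURNS void AS $$
    BEGIN
        -- Create the tenant's partition if it does not already exist.
        EXECUTE format(
            'CREATE TABLE IF NOT EXISTS documents_t%s PARTITION OF documents_partitioned FOR VALUES IN (%s)',
            p_tenant_id, p_tenant_id
        );
        -- Give the new partition its own HNSW index.
        EXECUTE format(
            'CREATE INDEX IF NOT EXISTS idx_documents_t%s_embedding_hnsw
             ON documents_t%s USING hnsw (embedding vector_cosine_ops)
             WITH (m = 32, ef_construction = 128)',
            p_tenant_id, p_tenant_id
        );
    END;
    $$ LANGUAGE plpgsql;

    -- Usage:
    SELECT create_tenant_partition(42);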

    Strategy 2: The Two-Phase Re-ranking Pattern

    Partitioning is too rigid for many use cases. What if you need to filter on user_id, or a combination of tenant_id and a date range, or some other arbitrary metadata? For this, we need a query-level pattern. The goal is to avoid the full-table HNSW scan followed by a filter.

    The Two-Phase Re-ranking pattern makes the trade-off explicit and bounded: we first fetch a fixed, larger set of approximate candidates using only the HNSW index, and then apply the filter and final ordering to that much smaller set.

    1. Implementation

    This pattern relies on a subquery or Common Table Expression (CTE) to fetch a candidate set larger than our final LIMIT. The size of this candidate set (the "recall set") is a critical tuning parameter.

    sql
    -- recall_set_size is the number of approximate candidates fetched in Phase 1.
    -- It must be larger than the final LIMIT; a good starting point is 10-20x that
    -- limit. We hard-code 200 here; in application code it would be a bound parameter,
    -- as would the query vector (shown as a placeholder literal below).
    WITH candidates AS (
        -- Phase 1: Approximate Nearest Neighbor Search (fast)
        -- Get the top 200 candidates from the entire table using only the HNSW index,
        -- keeping their exact distances for the re-ranking step.
        SELECT id, embedding <=> '[...your_query_vector...]' AS distance
        FROM documents
        ORDER BY embedding <=> '[...your_query_vector...]'
        LIMIT 200
    )
    -- Phase 2: Exact Filtering and Re-ranking (on a small set)
    -- Join the 200 candidates back to the main table to get their metadata,
    -- apply the strict filter, and re-order by the distances computed in Phase 1.
    SELECT
        d.id,
        d.content
    FROM
        documents d
    JOIN
        candidates c ON d.id = c.id
    WHERE
        d.tenant_id = 101 AND d.user_id = 12345
    ORDER BY
        c.distance -- re-ranking by distance keeps the final 10 correctly ordered
    LIMIT 10;

    2. Performance Analysis

    * Phase 1 is extremely fast. It's an unfiltered HNSW index scan with a LIMIT, which we've already shown takes single-digit milliseconds.

    * Phase 2 involves a join and filter on a tiny set of rows (200 in this case). This is also very fast, as the planner can use a Nested Loop Join against the small candidate set.

    The overall query time is dramatically lower than the naive post-filtering approach for selective filters.

    3. The Critical Trade-off: Recall vs. Precision

    This pattern is not perfect; it is an approximate filtering method. It is possible that the true nearest neighbor that also matches the WHERE clause is ranked #201 or lower in the global vector space and is therefore never fetched in Phase 1.

    * Increasing recall_set_size: Improves the probability of finding the true nearest neighbors (higher recall) at the cost of making Phase 2 slower.

    * Decreasing recall_set_size: Makes the query faster but increases the risk of missing relevant results.

    The correct value for recall_set_size is application-dependent and should be determined by empirical testing against your specific data distribution and accuracy requirements.
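
    A sketch of what that empirical testing can look like for a single query: compute the exact, filtered top 10 with the vector index disabled, compute the re-ranked top 10 for the same filter and query vector, and measure the overlap. The '[...your_query_vector...]' placeholder again stands in for a real vector literal or bound parameter.

    sql
    DROP TABLE IF EXISTS exact_top10, approx_top10;

    -- 1. Ground truth: exact top 10 for the filter, forcing a non-HNSW plan.
    BEGIN;
    SET LOCAL enable_indexscan = off;
    CREATE TEMP TABLE exact_top10 AS
    SELECT id FROM documents
    WHERE tenant_id = 101
    ORDER BY embedding <=> '[...your_query_vector...]'
    LIMIT 10;
    COMMIT;

    -- 2. Approximate top 10 from the re-ranking pattern (recall_set_size = 200).
    CREATE TEMP TABLE approx_top10 AS
    WITH candidates AS (
        SELECT id, embedding <=> '[...your_query_vector...]' AS distance
        FROM documents
        ORDER BY embedding <=> '[...your_query_vector...]'
        LIMIT 200
    )
    SELECT d.id
    FROM documents d
    JOIN candidates c ON d.id = c.id
    WHERE d.tenant_id = 101
    ORDER BY c.distance
    LIMIT 10;

    -- 3. Recall@10 for this query = overlap / 10. Average over many queries in practice.
    SELECT count(*) / 10.0 AS recall_at_10
    FROM exact_top10 e
    JOIN approx_top10 a USING (id);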

    4. Benchmark Comparison

    Let's compare the naive filter vs. the re-ranking pattern for a highly selective filter (e.g., a specific user_id).

    | Strategy | Query Time (ms) | Recall | Notes |
    |---|---|---|---|
    | Naive post-filter | 1500+ | 100% (exact) | Slow, unpredictable; cost scales with filter selectivity |
    | Re-ranking (recall_set = 200) | 25-35 | ~95-99% | Fast, predictable; requires tuning recall_set |
    | Re-ranking (recall_set = 1000) | 60-80 | ~98-99.9% | Slower, higher recall, more robust |

    (Note: Recall percentages are illustrative and depend heavily on data clustering.)

    Strategy 3: The Future - Native HNSW Filtering in pgvector

    The pgvector maintainers are aware of this fundamental limitation, and how filters interact with the HNSW index scan is an active area of development; the direction of travel is toward pre-filtering during the index scan itself rather than discarding rows afterwards.

    This is a paradigm shift. Instead of post-filtering, the index traversal logic itself can be made to ignore nodes in the graph that do not match the metadata criteria.

    1. How it Works (Conceptual)

    Conceptually, such a feature changes the on-disk format of the HNSW index: it stores a reference to, or a compressed copy of, the filtered column's values alongside the vector data within the index pages. During graph traversal, when considering a new candidate node, the engine can first check whether its metadata matches the WHERE clause before spending compute on the distance calculation or adding the node to the candidate priority queue.

    2. Current State and Limitations (as of late 2023/early 2024)

    * Experimental: This is an emerging capability; treat any published performance numbers as preliminary.

    * Limited Operator Support: In-index filtering typically starts with simple equality (=) or IN-list predicates on integer-like columns; complex filters (>, LIKE, date ranges) are unlikely to be supported at first.

    * Performance: The goal is to outperform post-filtering significantly without the recall trade-off of the re-ranking pattern. It will still likely be slower than a pure, unfiltered HNSW scan because of the extra checks during traversal.

    * Availability: Check the release notes of your pgvector version and verify that the capability actually exists, and how it is enabled, before designing around it.

    3. Hypothetical EXPLAIN Plan

    If this feature is active and the planner chooses to use it, you would see a different kind of EXPLAIN plan. Instead of a Filter step that removes many rows, the Index Scan itself would report a much lower number of rows returned, indicating the filtering happened inside the index access method.

    text
    -- Hypothetical plan with native HNSW filtering
    Limit  (cost=... rows=10 ...) (actual time=15.20..15.21 rows=10 loops=1)
      ->  Index Scan using idx_documents_embedding_hnsw on documents ...
            Order By: (embedding <=> ...)
            Index Cond: (tenant_id = 101) -- Filter is pushed down into the index scan
    Planning Time: ...
    Execution Time: 15.50 ms

    This is the holy grail for filtered ANN search, combining the strengths of B-Tree indexes (filtering) and HNSW indexes (vector search). Keep a close watch on pgvector release notes for this evolving capability.

    Conclusion: A Decision Framework for Production

    There is no single "best" way to combine HNSW search with high-cardinality metadata filters in pgvector. The optimal strategy is a function of your architecture, query patterns, and recall requirements.

    Here is a decision framework for senior engineers:

  • Is your primary filter always on a single, categorical column like tenant_id?
    * Yes: Use Declarative Partitioning. It offers the best performance, perfect recall, and strong data isolation. The operational overhead of managing partitions is a worthwhile trade-off for the performance gains.

  • Do you need to filter on arbitrary, complex, or multiple high-cardinality columns?
    * Yes: The Two-Phase Re-ranking Pattern is your most robust and flexible option. Start with a recall_set_size of 10-20x your final LIMIT and tune it based on performance and offline accuracy evaluations. This is the pragmatic choice for most complex, real-world systems today.

  • Are you starting a new project on the latest infrastructure and can tolerate some experimental risk for a simpler query model?
    * Yes: Investigate the Native HNSW Filtering capabilities of the latest pgvector version. It may offer a cleaner, exact solution for simple equality filters, but be prepared for a rapidly evolving feature set and potential performance quirks.

    Never trust a naive WHERE clause with an HNSW index. By understanding the underlying mechanics and applying these advanced patterns, you can build scalable, high-performance semantic search systems that meet the stringent demands of production environments.
