Advanced HNSW Tuning in pgvector for High-Cardinality Filtering
The Dichotomy of Vector Search and Relational Filtering
In modern data platforms, the fusion of unstructured semantic search with structured relational data is paramount. pgvector has emerged as a powerful tool, enabling developers to perform vector similarity searches directly within PostgreSQL. The introduction of HNSW (Hierarchical Navigable Small World) indexing provides millisecond-scale approximate nearest-neighbor queries on multi-million row tables. However, a significant production challenge arises when these high-speed vector searches must be constrained by traditional, high-cardinality relational filters.
Consider a common multi-tenant SaaS application scenario: a table of user-generated documents, each with a vector embedding for semantic search, but queries must be strictly isolated to a specific tenant_id or user_id.
Let's establish our baseline schema. We'll use a documents table containing embeddings alongside high-cardinality metadata.
-- Ensure the pgvector extension is installed
CREATE EXTENSION IF NOT EXISTS vector;
-- A table representing documents in a multi-tenant system
CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    tenant_id UUID NOT NULL,
    user_id UUID NOT NULL,
    content TEXT,
    embedding VECTOR(768) -- Assuming a 768-dimension embedding model
);
-- Populate with sample data (in a real scenario, this would be millions of rows)
-- For demonstration, let's assume 1 million rows with 1000 tenants
INSERT INTO documents (tenant_id, user_id, content, embedding)
SELECT
    ('00000000-0000-0000-0000-' || LPAD((n % 1000)::TEXT, 12, '0'))::UUID,
    gen_random_uuid(),
    'Content for document ' || n,
    ARRAY(SELECT random() FROM generate_series(1, 768) WHERE s.n = s.n)::vector -- correlated with s.n so each row gets its own random embedding
FROM generate_series(1, 1000000) AS s(n);
-- Create standard B-Tree indexes for our metadata filters
CREATE INDEX idx_documents_tenant_id ON documents (tenant_id);
CREATE INDEX idx_documents_user_id ON documents (user_id);
-- Create the HNSW index on the embedding column
-- We'll start with default parameters
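-- (pgvector's defaults are m = 16 and ef_construction = 64; we revisit these later)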
CREATE INDEX idx_documents_embedding_hnsw ON documents USING hnsw (embedding vector_l2_ops);
-- Analyze the table to ensure the planner has up-to-date statistics
ANALYZE documents;
Now, let's execute a seemingly straightforward query: find the 10 documents most semantically similar to a given query vector, but only for tenant_id = '...'.
-- The naive query
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, tenant_id
FROM documents
WHERE tenant_id = '00000000-0000-0000-0000-000000000001' -- A specific tenant
ORDER BY embedding <-> (SELECT embedding FROM documents WHERE id = 12345) -- A random query vector
LIMIT 10;
The execution plan reveals the core of the problem:
Limit  (cost=1.29..15.34 rows=10 width=24) (actual time=153.488..153.493 rows=10 loops=1)
  Buffers: shared hit=4281
  ->  Index Scan using idx_documents_tenant_id on documents  (cost=1.29..1294.39 rows=1000 width=24) (actual time=153.485..153.490 rows=10 loops=1)
        Index Cond: (tenant_id = '00000000-0000-0000-0000-000000000001'::uuid)
        Order By: (embedding <-> '...'::vector)
        Buffers: shared hit=4281
Planning Time: 0.231 ms
Execution Time: 153.532 ms
Analysis of the Naive Plan:
*   idx_documents_tenant_id: The planner correctly chose the B-Tree index to find all rows matching the tenant_id. In our example, this is approximately 1,000 rows (1,000,000 rows / 1,000 tenants).
*   Order By: (embedding <-> '...'): This is the critical failure. The ORDER BY is applied after the WHERE clause has filtered the rows: the database retrieves all 1,000 documents for the tenant, computes the exact distance for each one, and sorts them in memory. The HNSW index is not used at all.
The PostgreSQL planner faces a dilemma: it can either use the HNSW index to efficiently find the nearest neighbors across the entire table and then filter them (post-filtering), or use the B-Tree index to efficiently find the rows for the tenant and then sort them by vector distance (the chosen plan). In this simple query structure, it cannot use both indexes to their full potential simultaneously.
Strategy 1: The Two-Step Query Refactor (CTE Approach)
A common and effective pattern to force the use of the HNSW index is to decouple the vector search from the metadata filtering. We can achieve this using a Common Table Expression (CTE).
The logic is as follows:
- First, perform an unfettered nearest neighbor search across the entire table to get a candidate set of, say, the top 1000 closest vectors. This step will use the HNSW index.
- Then, join this small candidate set back to the original table and apply the metadata filters.
-- The Two-Step CTE Query
EXPLAIN (ANALYZE, BUFFERS)
WITH vector_matches AS (
    SELECT id
    FROM documents
    ORDER BY embedding <-> (SELECT embedding FROM documents WHERE id = 12345)
    LIMIT 100 -- Fetching 100 candidates to find 10 matches for the tenant
)
SELECT
    d.id,
    d.tenant_id
FROM documents d
JOIN vector_matches vm ON d.id = vm.id
WHERE d.tenant_id = '00000000-0000-0000-0000-000000000001'
LIMIT 10;
Let's examine the new execution plan:
Limit  (cost=14.93..15.14 rows=10 width=24) (actual time=1.234..1.236 rows=1 loops=1)
  Buffers: shared hit=336
  ->  Nested Loop  (cost=14.93..15.14 rows=10 width=24) (actual time=1.233..1.235 rows=1 loops=1)
        Buffers: shared hit=336
        ->  CTE Scan on vector_matches vm  (cost=14.65..14.85 rows=100 width=8) (actual time=0.985..0.992 rows=100 loops=1)
              Buffers: shared hit=321
              CTE vector_matches
                ->  Limit  (cost=14.28..14.65 rows=100 width=16) (actual time=0.978..0.986 rows=100 loops=1)
                      Buffers: shared hit=321
                      ->  Index Scan using idx_documents_embedding_hnsw on documents  (cost=0.42..36688.42 rows=1000000 width=16) (actual time=0.038..0.941 rows=100 loops=1)
                            Order By: (embedding <-> '...'::vector)
                            Buffers: shared hit=321
        ->  Index Scan using documents_pkey on documents d  (cost=0.28..0.42 rows=1 width=24) (actual time=0.002..0.002 rows=0 loops=100)
              Index Cond: (id = vm.id)
              Filter: (tenant_id = '00000000-0000-0000-0000-000000000001'::uuid)
              Rows Removed by Filter: 1
              Buffers: shared hit=15
Planning Time: 0.312 ms
Execution Time: 1.288 ms
Analysis of the CTE Plan:
*   vector_matches: The planner executes the CTE first. The ORDER BY embedding <-> ... LIMIT 100 clause correctly triggers a fast Index Scan using idx_documents_embedding_hnsw. This is the key victory: it finds the 100 nearest neighbors across the entire dataset in under 1 ms.
*   Nested Loop: The planner then joins each candidate back to the documents table (using its primary key index) and applies the WHERE d.tenant_id = '...' filter.
*   Note the actual row counts, though: only one of the 100 candidates belonged to the target tenant, so the query returned 1 row rather than the requested 10. This is exactly the recall trade-off discussed below.
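One defensive refinement, assuming PostgreSQL 12 or later (where a single-reference CTE may be inlined into the outer query): marking the CTE as MATERIALIZED makes it explicit that the candidate search runs first. A minimal sketch of the same query:
-- Same two-step query, with the CTE explicitly materialized
WITH vector_matches AS MATERIALIZED (
    SELECT id
    FROM documents
    ORDER BY embedding <-> (SELECT embedding FROM documents WHERE id = 12345)
    LIMIT 100
)
SELECT d.id, d.tenant_id
FROM documents d
JOIN vector_matches vm ON d.id = vm.id
WHERE d.tenant_id = '00000000-0000-0000-0000-000000000001'
LIMIT 10;
In practice the LIMIT inside the CTE already prevents it from being merged into the outer query, so treat this as making intent explicit rather than a required fix.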
*   The "Magic Number" Problem: The LIMIT in the CTE (we used 100) is a crucial tuning parameter. It must be large enough to ensure you find at least 10 results for your target tenant after filtering. If your tenants have a very uneven distribution of documents, a fixed LIMIT is problematic.
    *   If LIMIT is too small, you may get fewer than 10 results, or even zero, even if the tenant has many relevant documents. This is a recall issue.
    *   If LIMIT is too large, you are doing unnecessary work in the initial vector search, increasing latency and I/O.
*   Cardinality Impact: The optimal CTE LIMIT is a function of the filter's selectivity. If a tenant owns 1% of the data, you might need a LIMIT of 10 / 0.01 = 1000 to have a high probability of finding 10 matches. A dynamic approach might be necessary, where the application logic adjusts the CTE LIMIT based on known tenant size or retries with a larger limit if the initial result set is too small.
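As one way to implement that dynamic behavior inside the database, here is a hedged sketch of a retry loop in PL/pgSQL. The function name, the starting pool size, the doubling factor, and the cap are all illustrative choices, not part of the schema above:
-- Illustrative helper: widens the candidate pool until enough rows survive the tenant filter
CREATE OR REPLACE FUNCTION search_tenant_documents(
    p_tenant_id UUID,
    p_query VECTOR(768),
    p_k INT DEFAULT 10
) RETURNS TABLE (doc_id BIGINT, dist DOUBLE PRECISION) AS $$
DECLARE
    v_limit INT := 200;        -- starting candidate pool (illustrative)
    v_max_limit INT := 10000;  -- hard cap so latency stays bounded (illustrative)
    v_found INT;
BEGIN
    LOOP
        -- How many of the current candidates belong to this tenant?
        SELECT count(*) INTO v_found
        FROM (
            SELECT c.id
            FROM documents c
            ORDER BY c.embedding <-> p_query
            LIMIT v_limit
        ) cand
        JOIN documents d ON d.id = cand.id
        WHERE d.tenant_id = p_tenant_id;

        EXIT WHEN v_found >= p_k OR v_limit >= v_max_limit;
        v_limit := v_limit * 2;  -- too few survivors: double the pool and retry
    END LOOP;

    -- Final pass with the pool size that produced enough matches
    RETURN QUERY
    SELECT d.id, d.embedding <-> p_query
    FROM (
        SELECT c.id
        FROM documents c
        ORDER BY c.embedding <-> p_query
        LIMIT v_limit
    ) cand
    JOIN documents d ON d.id = cand.id
    WHERE d.tenant_id = p_tenant_id
    ORDER BY d.embedding <-> p_query
    LIMIT p_k;
END;
$$ LANGUAGE plpgsql;

-- Example call: top 10 documents for a tenant, using an existing row's embedding as the query vector
SELECT * FROM search_tenant_documents(
    '00000000-0000-0000-0000-000000000001'::uuid,
    (SELECT embedding FROM documents WHERE id = 12345),
    10);
The same retry loop can equally live in application code; keeping it in the database simply avoids extra round trips per retry.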
Strategy 2: Table Partitioning for Metadata Isolation
For systems where queries are almost always scoped to a specific high-cardinality key like tenant_id, the CTE approach can feel like a workaround. A more architecturally robust solution is to use PostgreSQL's declarative partitioning. By partitioning the documents table by tenant_id, we physically co-locate data for each tenant, allowing the planner to operate on a much smaller dataset from the outset.
Let's re-architect our schema.
-- First, drop the old table
DROP TABLE documents;
-- Create a new partitioned table
CREATE TABLE documents (
    id BIGSERIAL,
    tenant_id UUID NOT NULL,
    user_id UUID NOT NULL,
    content TEXT,
    embedding VECTOR(768)
) PARTITION BY LIST (tenant_id);
-- We must also define the primary key and indexes on the partitioned table
-- Note: The partition key (tenant_id) must be part of the primary key.
ALTER TABLE documents ADD PRIMARY KEY (id, tenant_id);
CREATE INDEX idx_documents_user_id_partitioned ON documents (user_id);
CREATE INDEX idx_documents_embedding_hnsw_partitioned ON documents USING hnsw (embedding vector_l2_ops);
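-- Note: creating an index on the partitioned parent automatically creates a matching
-- HNSW index on every existing partition and on each new partition as it is created
-- (e.g., documents_tenant_1_embedding_idx in the plan shown below).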
-- Now, create partitions for each tenant. In a real system, this would be
-- managed dynamically as new tenants are onboarded.
CREATE TABLE documents_tenant_1 PARTITION OF documents FOR VALUES IN ('00000000-0000-0000-0000-000000000001');
CREATE TABLE documents_tenant_2 PARTITION OF documents FOR VALUES IN ('00000000-0000-0000-0000-000000000002');
-- ... and so on for all 1000 tenants.
-- Repopulate data (this will now be routed to the correct partitions)
INSERT INTO documents ... ;
ANALYZE documents;
With this structure in place, the original naive query now behaves completely differently.
-- The same "naive" query on the partitioned table
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, tenant_id
FROM documents
WHERE tenant_id = '00000000-0000-0000-0000-000000000001'
ORDER BY embedding <-> (SELECT embedding FROM documents WHERE id = 12345 LIMIT 1) -- Query vector from any partition
LIMIT 10;
The execution plan is a model of efficiency:
Limit  (cost=14.28..14.65 rows=10 width=24) (actual time=0.185..0.190 rows=10 loops=1)
  Buffers: shared hit=35
  ->  Append  (cost=14.28..40.38 rows=1000 width=24) (actual time=0.184..0.188 rows=10 loops=1)
        ->  Index Scan using documents_tenant_1_embedding_idx on documents_tenant_1  (cost=0.42..36.68 rows=1000 width=40) (actual time=0.183..0.187 rows=10 loops=1)
              Order By: (embedding <-> '...'::vector)
              Buffers: shared hit=35
Planning Time: 0.543 ms
Execution Time: 0.245 ms
Analysis of the Partitioned Plan:
*   Partition pruning: The planner sees WHERE tenant_id = '...' and immediately knows it only needs to scan the documents_tenant_1 partition. All other partitions are ignored. This is the single most important optimization.
*   A per-partition HNSW index: Within the documents_tenant_1 partition (which contains only 1,000 rows), the planner can now efficiently use the HNSW index on that partition (documents_tenant_1_embedding_idx) to satisfy the ORDER BY clause.
Advanced Considerations for Partitioning:
*   Operational Overhead: Managing thousands of partitions can be complex. You need robust automation for creating new partitions when tenants are added, and potentially for archiving or dropping them when tenants are removed (a sketch of an onboarding helper follows this list).
*   Query Patterns: This strategy is only effective if the vast majority of your queries include the partition key in the WHERE clause. A query without the tenant_id would force a scan across all partitions, which is often slower than querying a single large table.
*   Primary Key Constraint: As shown, the partition key must be part of the primary key and any other unique indexes. This can sometimes complicate schema design.
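A hedged sketch of the kind of automation mentioned above. The function name and the partition naming convention are illustrative; production code would also need to handle tenant offboarding, concurrent calls, and locking:
-- Illustrative onboarding helper: creates a per-tenant partition if it does not exist yet
CREATE OR REPLACE FUNCTION ensure_tenant_partition(p_tenant_id UUID)
RETURNS VOID AS $$
BEGIN
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS %I PARTITION OF documents FOR VALUES IN (%L)',
        'documents_tenant_' || replace(p_tenant_id::TEXT, '-', '_'),
        p_tenant_id
    );
END;
$$ LANGUAGE plpgsql;

-- Called from the tenant-onboarding path:
SELECT ensure_tenant_partition(gen_random_uuid());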
Advanced HNSW Parameter Tuning for Filtered Scenarios
Regardless of the query strategy you choose, the HNSW index parameters themselves are critical levers for performance. The defaults are a reasonable starting point, but for filtered search, we can make more informed trade-offs.
The key parameters during index creation are m and ef_construction. The key query-time parameter is ef_search.
-- Example of setting parameters at creation time
CREATE INDEX idx_documents_embedding_hnsw_tuned ON documents USING hnsw (embedding vector_l2_ops)
WITH (m = 32, ef_construction = 128);
-- Example of setting parameter at query time
SET hnsw.ef_search = 100;
SELECT ... FROM documents ORDER BY embedding <-> $1 LIMIT 10;
*   m: The maximum number of connections per node in each graph layer. A higher m creates a denser graph, improving recall at the cost of a larger index, more memory usage, and longer build times.
*   ef_construction: The size of the dynamic candidate list during index build. Higher values lead to a more accurate graph (better recall) but significantly increase index build time.
*   ef_search: The size of the dynamic candidate list during a search. This is the most critical query-time parameter. Higher ef_search increases recall accuracy at the direct cost of query latency.
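Build time is also governed by resources, not just ef_construction. A hedged example of session settings commonly raised before building a large HNSW index; the values and the index name are illustrative and should be sized to your hardware:
-- Let more of the graph construction happen in memory
SET maintenance_work_mem = '2GB';
-- Recent pgvector releases (0.6.0+) can build HNSW indexes with parallel workers
SET max_parallel_maintenance_workers = 4;
CREATE INDEX idx_documents_embedding_hnsw_build_demo
    ON documents USING hnsw (embedding vector_l2_ops)
    WITH (m = 32, ef_construction = 128);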
Tuning ef_search for the CTE Approach:
The CTE approach introduces a further tuning opportunity. Since we are fetching a large candidate set (e.g., LIMIT 1000) and then filtering, we can often tolerate slightly lower recall in the initial vector search; the final WHERE clause will discard non-matching candidates anyway. This means we can potentially lower ef_search to make the initial vector search faster without a significant impact on final result quality. One interaction worth verifying on your pgvector version: before iterative index scans were introduced (pgvector 0.8.0), a single HNSW scan returns at most hnsw.ef_search rows, so ef_search also caps the number of candidates the CTE can actually produce.
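When experimenting, it helps to scope the setting to a single transaction with SET LOCAL, so a lowered ef_search does not leak to other queries sharing a pooled connection. A minimal sketch reusing the two-step query, with the ORDER BY repeated in the outer query so the ranking is guaranteed after the join; the specific values are illustrative:
BEGIN;
SET LOCAL hnsw.ef_search = 40;  -- value under test; reverts automatically at COMMIT/ROLLBACK
WITH vector_matches AS (
    SELECT id, embedding <-> (SELECT embedding FROM documents WHERE id = 12345) AS distance
    FROM documents
    ORDER BY embedding <-> (SELECT embedding FROM documents WHERE id = 12345)
    LIMIT 200
)
SELECT d.id, d.tenant_id
FROM documents d
JOIN vector_matches vm ON d.id = vm.id
WHERE d.tenant_id = '00000000-0000-0000-0000-000000000001'
ORDER BY vm.distance        -- repeat the distance ordering explicitly
LIMIT 10;
COMMIT;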
Benchmark: Latency vs. Recall vs. ef_search with CTE
Let's simulate a scenario using the CTE approach. We will define "ground truth" as the results from a very high ef_search value (e.g., 400). Then we'll measure latency and recall (percentage of ground truth results found) for lower ef_search values.
Test Parameters:
*   Table: 1M rows, 1000 tenants
*   Query: CTE approach, LIMIT 200 in the CTE, final LIMIT 10.
*   Filter: WHERE tenant_id = '...' (a tenant with ~1000 documents).
| hnsw.ef_search | Avg. Latency (ms) | Recall@10 (%) | Notes | 
|---|---|---|---|
| 40 (default) | 1.8 | 99.8 | Baseline performance, excellent recall. | 
| 30 | 1.5 | 99.5 | 16% latency reduction for a negligible drop in recall. | 
| 20 | 1.1 | 98.2 | Significant latency win. The ~2% recall drop might be acceptable. | 
| 15 | 0.9 | 95.1 | Sub-millisecond latency, but the recall drop is becoming meaningful. | 
| 10 | 0.7 | 89.0 | Very fast, but potentially unacceptable recall for many use cases. | 
This benchmark demonstrates a clear trade-off. By lowering ef_search from 40 to 20, we can almost halve the query latency while maintaining over 98% recall. The optimal value depends entirely on your product requirements for accuracy vs. speed.
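As a rough illustration of how the recall measurement described above can be scripted for a single query vector (the temp-table names and ef_search values are illustrative; a real benchmark would average over many query vectors and tenants):
BEGIN;
SET LOCAL hnsw.ef_search = 400;  -- "ground truth" setting
CREATE TEMP TABLE ground_truth ON COMMIT DROP AS
SELECT d.id
FROM (SELECT id FROM documents
      ORDER BY embedding <-> (SELECT embedding FROM documents WHERE id = 12345)
      LIMIT 200) vm
JOIN documents d ON d.id = vm.id
WHERE d.tenant_id = '00000000-0000-0000-0000-000000000001'
ORDER BY d.embedding <-> (SELECT embedding FROM documents WHERE id = 12345)
LIMIT 10;

SET LOCAL hnsw.ef_search = 20;   -- setting under test
CREATE TEMP TABLE under_test ON COMMIT DROP AS
SELECT d.id
FROM (SELECT id FROM documents
      ORDER BY embedding <-> (SELECT embedding FROM documents WHERE id = 12345)
      LIMIT 200) vm
JOIN documents d ON d.id = vm.id
WHERE d.tenant_id = '00000000-0000-0000-0000-000000000001'
ORDER BY d.embedding <-> (SELECT embedding FROM documents WHERE id = 12345)
LIMIT 10;

-- Fraction of the ground-truth result set recovered by the setting under test
SELECT count(*)::numeric / NULLIF((SELECT count(*) FROM ground_truth), 0) AS recall_at_10
FROM ground_truth g
JOIN under_test t USING (id);
COMMIT;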
Production Gotchas and Edge Cases
*   Index Maintenance: UPDATEs or DELETEs will leave dead tuples, which can degrade performance. A proactive VACUUM strategy is essential. For tables with high churn, consider periodic index rebuilds (REINDEX INDEX ...;) during a maintenance window.
*   Memory Usage: HNSW indexes perform best when they fit entirely in memory (ideally within shared_buffers). You can estimate the index size, but monitoring is key. Use pg_relation_size() to track the index's disk footprint (a quick monitoring query follows this list) and ensure your shared_buffers are adequately provisioned. If the index exceeds available memory, performance will plummet as it thrashes to and from disk.
*   Index Build Quality: A higher ef_construction (e.g., 200+) will create a very high-quality index graph, which might allow you to use a lower ef_search at query time for the same recall. This is a classic build-time vs. query-time trade-off. If your data is static, invest in a long, high-quality build. If data is ingested continuously, a lower ef_construction might be necessary to keep the indexing process from falling behind.
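A hedged monitoring sketch using standard catalogs: it lists the on-disk size of every HNSW index (per partition, if you partitioned) next to the configured shared_buffers, so the two can be compared at a glance:
-- Size of each HNSW index vs. the configured buffer cache
SELECT i.indexrelid::regclass                         AS index_name,
       pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size,
       current_setting('shared_buffers')              AS shared_buffers
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_am a ON a.oid = c.relam
WHERE a.amname = 'hnsw'
ORDER BY pg_relation_size(i.indexrelid) DESC;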
A Decision Framework for Production
Choosing the right strategy depends on your specific access patterns, data distribution, and performance requirements.
Start with the Two-Step CTE approach (Strategy 1) as the default; it requires only a query change, not a schema change:
*   Establish a baseline for your application's recall requirements.
*   Tune the CTE LIMIT to be large enough to consistently return a full result set for your p95 tenant size.
*   Tune hnsw.ef_search downwards to find the sweet spot between latency and recall that meets your SLOs.
Move to table partitioning (Strategy 2) when:
*   Your primary filter key has extremely high cardinality (e.g., user_id in a B2C app).
*   Nearly 100% of your vector search queries are filtered by this key.
*   The operational overhead of managing partitions is acceptable and can be automated.
*   You require the absolute lowest possible latency (<1ms).
By moving beyond basic pgvector queries and embracing these advanced patterns, you can build highly performant, scalable, and accurate semantic search systems that successfully merge the worlds of relational and vector data within PostgreSQL.