pgvector HNSW Index Tuning for Production Vector Search
Beyond the Defaults: A Senior Engineer's Guide to pgvector HNSW Tuning
If you're reading this, you've already moved past the 'Hello, World' of vector search. You have pgvector installed, you're creating embeddings, and you've likely run your first <-> similarity search queries. You understand that HNSW (Hierarchical Navigable Small World) offers a powerful balance of speed and recall for approximate nearest neighbor (ANN) search. However, the default CREATE INDEX ... USING hnsw is a starting point, not a production-ready solution.
In production, you're balancing a complex equation: query latency, search accuracy (recall), index build time, memory consumption, and storage cost. Simply increasing parameters until performance feels right is a recipe for inefficient resource usage and unpredictable behavior under load. This article deconstructs the key tuning levers within pgvector's HNSW implementation, explores critical architectural patterns for real-world applications, and addresses the operational edge cases that separate a prototype from a resilient, high-performance system.
We will assume you are working with a dataset of embeddings, for example, from a text model like text-embedding-3-small (1536 dimensions).
-- Prerequisite table structure
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
    id bigserial PRIMARY KEY,
    tenant_id uuid NOT NULL,
    content text,
    embedding vector(1536) -- Assuming OpenAI's text-embedding-3-small
);
-- Populate with sample data (conceptual)
-- INSERT INTO documents (tenant_id, content, embedding) VALUES (...);1. Deconstructing HNSW Build-Time Parameters: `M` and `ef_construction`
The quality of your HNSW graph is determined at index creation time. Two parameters, M and ef_construction, are the primary controls for this process. They directly influence the trade-off between index build time, memory usage, and the maximum achievable recall during a search.
`M`: The Connectivity of the Graph
M defines the maximum number of bidirectional links (neighbors) each node in the graph can have. It is the most critical parameter for defining the density and quality of the HNSW graph.
    - Higher M: Creates a denser graph. This increases the chances of finding the true nearest neighbors (higher potential recall) and can sometimes reduce query latency by providing shorter paths. However, it significantly increases index size, memory consumption, and build time.
    - Lower M: Creates a sparser graph. This reduces memory/storage requirements and speeds up index creation, but it may cap the maximum achievable recall, as the search algorithm has fewer paths to explore.
    - The pgvector default is 16.
    - For most applications with datasets up to 1-2 million vectors, values between 16 and 48 are effective.
    - For very large or complex datasets where high recall is paramount, you might push M to 64 or even 96, but only after benchmarking to prove the benefit outweighs the substantial memory cost.
`ef_construction`: The Search for Good Neighbors
ef_construction controls the size of the dynamic candidate list used during the index build process. For each new point added to the graph, the algorithm performs a search to find its nearest neighbors; ef_construction determines how exhaustive this search is.
    - Higher ef_construction: Leads to a higher-quality graph. The algorithm spends more time finding better, more accurate neighbors for each node. This directly translates to better recall at query time. The cost is a significantly longer index build time.
    - Lower ef_construction: Speeds up index creation dramatically. However, the resulting graph may be suboptimal, with nodes connected to less-than-ideal neighbors. This can limit the effectiveness of the search algorithm later on.
    - The pgvector default is 64.
    - A good starting point is to set ef_construction to at least 4 * M.
    - For applications where index build time is less critical than query performance, values between 128 and 512 are common. Setting it too high yields diminishing returns for a massive increase in build time.
Example and Benchmark Analysis
Let's analyze the creation of an HNSW index on a hypothetical documents table with 1 million 1536-dimensional vectors.
-- Option 1: Fast Build, Lower Quality
CREATE INDEX ON documents USING hnsw (embedding vector_l2_ops)
WITH (m = 16, ef_construction = 64);
-- Option 2: Balanced Approach (Recommended Start)
CREATE INDEX ON documents USING hnsw (embedding vector_l2_ops)
WITH (m = 32, ef_construction = 128);
-- Option 3: High Recall, Slow Build
CREATE INDEX ON documents USING hnsw (embedding vector_l2_ops)
WITH (m = 48, ef_construction = 256);

| Configuration | M | ef_construction | Relative Build Time | Relative Index Size | Potential Recall Ceiling |
|---|---|---|---|---|---|
| Fast Build | 16 | 64 | 1.0x | 1.0x | ~95-97% | 
| Balanced | 32 | 128 | 2.5x | 1.8x | ~98-99% | 
| High Recall | 48 | 256 | 6.0x | 2.5x | >99.5% | 
These are illustrative numbers. Actuals depend heavily on hardware and data distribution.
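The heavier configurations can also be made considerably cheaper to build by giving the build more resources: pgvector builds the graph in memory when it fits in maintenance_work_mem, and version 0.6.0+ supports parallel HNSW builds. A minimal sketch with illustrative values; size them to your hardware:
-- Session-level settings for a faster index build (values are illustrative)
SET maintenance_work_mem = '8GB';           -- let the graph build happen in memory
SET max_parallel_maintenance_workers = 7;   -- plus the leader process

CREATE INDEX ON documents USING hnsw (embedding vector_l2_ops)
WITH (m = 32, ef_construction = 128);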
Key Takeaway: You cannot fix a poorly constructed graph at query time. If your M is too low or ef_construction was rushed, no amount of query-time tuning can recover the lost recall. Invest in a quality index build; it's a one-time cost for ongoing query performance.
2. The Query-Time Balancing Act: `hnsw.ef_search`
Once you have a well-built index, hnsw.ef_search is your primary knob for tuning the live query trade-off between latency and recall.
Like ef_construction, ef_search is the size of a dynamic candidate list, but one used during a search rather than a build. It dictates how widely the algorithm explores the graph from the entry point to find the nearest neighbors to your query vector.
    - Higher ef_search: The search is more exhaustive, exploring more potential paths and candidates. This directly increases the probability of finding the true nearest neighbors (higher recall). The trade-off is higher query latency and CPU usage.
    - Lower ef_search: The search is faster and less resource-intensive but more 'greedy'. It may settle on locally optimal neighbors without discovering the globally best matches, leading to lower recall.
    - The pgvector default is 40. Note that an HNSW index scan yields at most ef_search rows, so it must be at least as large as your LIMIT.
Tuning `ef_search` in Practice
hnsw.ef_search is a session-level parameter, which gives you incredible flexibility. You can adjust it on a per-query basis.
-- Set for the current session/transaction
BEGIN;
SET LOCAL hnsw.ef_search = 100;
SELECT id, content
FROM documents
ORDER BY embedding <-> '[...your_query_vector...]'
LIMIT 10;
COMMIT;

This allows for dynamic tuning. For instance, a background job might use a high ef_search for maximum accuracy, while a real-time user-facing endpoint might use a lower value to meet strict latency SLAs.
Benchmarking the Latency vs. Recall Curve
To properly tune ef_search, you must benchmark it against a ground truth dataset. The process is:
    - Find the ground truth for a representative sample of query vectors with an exact, non-indexed scan:
    -- Find the ground truth (slow!)
    SET enable_seqscan = on;
    SET enable_indexscan = off;
    SELECT id FROM documents ORDER BY embedding <-> '[...query_vector...]' LIMIT 100;
    - Run the same queries through the HNSW index at a range of hnsw.ef_search settings.
    - For each ef_search setting, calculate the recall: (number of true neighbors found) / K. For Recall@10, you check how many of the top 10 results from the HNSW search were also in the ground truth top 10.

Illustrative Benchmark Results (for an index with M=32, ef_construction=128):
| hnsw.ef_search | Avg. Latency (ms) | Recall@10 | 
|---|---|---|
| 20 | 3 ms | 92.1% | 
| 40 | 5 ms | 97.5% | 
| 80 | 9 ms | 99.2% | 
| 150 | 16 ms | 99.6% | 
| 300 | 30 ms | 99.7% | 
This curve clearly shows diminishing returns. The jump from ef=20 to ef=80 provides a huge 7.1% recall gain for only 6ms of latency. The jump from ef=80 to ef=150 only yields 0.4% more recall but costs an additional 7ms. For most applications, an ef_search of 80 would be the optimal balance.
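For completeness, here is how a Recall@10 column like the one above can be computed in SQL once the results are staged. A minimal sketch: the ground_truth and hnsw_results tables (one row per query_id and result id) are hypothetical benchmark scaffolding, not part of the schema above.
-- Average Recall@10 across all benchmark queries.
-- ground_truth: exact top 10 ids per query_id (from the seqscan step);
-- hnsw_results: HNSW top 10 per query_id at one ef_search setting.
SELECT avg(matches) / 10.0 AS avg_recall_at_10
FROM (
    SELECT g.query_id, count(h.id) AS matches
    FROM ground_truth g
    LEFT JOIN hnsw_results h USING (query_id, id)
    GROUP BY g.query_id
) per_query;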
3. Production Pattern: Pre-Filtering vs. Post-Filtering
This is one of the most common and critical challenges in real-world vector search applications. Your data is never just a vector; it's associated with metadata like user IDs, tenant IDs, timestamps, or security tags. You need to perform a vector search within a filtered subset of your data.
Consider a multi-tenant RAG application where you must only search documents belonging to a specific tenant_id.
The Naive Approach: Post-Filtering
An intuitive but flawed approach is to fetch a large number of vector neighbors and then apply the metadata filter.
-- ANTI-PATTERN: Post-filtering
SELECT id, content, (embedding <-> '[...query...]') as distance
FROM documents
WHERE tenant_id = 'a1b2c3d4-...' -- This filter is applied AFTER the vector search
ORDER BY embedding <-> '[...query...]'
LIMIT 10;

Why this fails: The ORDER BY ... LIMIT 10 is resolved by the HNSW index first. The index scan walks the graph and yields the nearest candidates from the entire table (at most hnsw.ef_search of them); only then is the WHERE tenant_id = ... clause applied to that small candidate set. If none of those global nearest neighbors happen to belong to the correct tenant, you get zero results. Even if a few match, you are not getting the true top 10 for that specific tenant.
To make this work, you'd have to fetch a massive number of neighbors (LIMIT 10000) and hope the top 10 for your tenant are in that set, which is wildly inefficient and unpredictable.
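Worth noting before the main fix: pgvector 0.8.0+ can rescue this query shape with iterative index scans, which keep walking the graph until enough rows survive the filter, at the cost of extra scanning. A brief sketch:
-- pgvector 0.8.0+: keep scanning until the LIMIT is satisfied
SET hnsw.iterative_scan = relaxed_order;  -- or strict_order to preserve exact distance order
SET hnsw.max_scan_tuples = 20000;         -- safety cap on how far a scan may go
The more robust, version-independent pattern, however, is the one below.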
The Advanced Solution: Pre-Filtering with Composite Indexes
The correct and performant solution is to structure your query so that PostgreSQL's planner can use a standard B-tree index on your metadata column to filter the rows before performing the vector search. This is often called pre-filtering.
First, ensure you have a standard index on your filter column:
CREATE INDEX ON documents (tenant_id);

The query remains the same, but the execution plan is radically different.
-- CORRECT PATTERN: Pre-filtering
EXPLAIN ANALYZE
SELECT id, content
FROM documents
WHERE tenant_id = 'a1b2c3d4-...'
ORDER BY embedding <-> '[...query...]'
LIMIT 10;

Analyzing the EXPLAIN Plan:
A modern pgvector on a recent PostgreSQL handles this efficiently. You will see an execution plan that looks something like this:
Limit  (cost=... rows=10 width=...)
  ->  Index Scan using documents_embedding_idx on documents
        Order By: (embedding <-> '[...query...]')
        Filter: (tenant_id = 'a1b2c3d4-...')

For highly selective filters (tenants with few documents), the planner might instead use the B-tree index first:
Limit  (cost=... rows=10 width=...)
  ->  Sort  (cost=...)
        Sort Key: ((embedding <-> '[...query...]'))
        ->  Bitmap Heap Scan on documents
              Recheck Cond: (tenant_id = 'a1b2c3d4-...')
              ->  Bitmap Index Scan on documents_tenant_id_idx
                    Index Cond: (tenant_id = 'a1b2c3d4-...')

In both cases, the planner correctly filters the dataset to the relevant subset before or during the expensive nearest neighbor search, guaranteeing accurate results with high performance. This is the single most important pattern for production multi-tenant vector search systems.
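If a small number of tenants dominate your traffic, you can go one step further with a standard PostgreSQL partial index, which guarantees the ANN search only ever walks that tenant's vectors. A sketch, reusing the placeholder tenant id from the examples above:
-- Partial HNSW index scoped to a single high-traffic tenant
CREATE INDEX documents_tenant_a_embedding_idx
ON documents USING hnsw (embedding vector_l2_ops)
WITH (m = 32, ef_construction = 128)
WHERE tenant_id = 'a1b2c3d4-...';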
4. Memory Management with Quantization
As your dataset grows into the tens or hundreds of millions of vectors, the memory footprint of a standard HNSW index can become prohibitive. An HNSW index on 100M 1536-dim float4 vectors can easily consume over 650 GB of RAM. This is where lossy compression, of which Product Quantization (PQ) is the best-known example, becomes a critical tool.
What it is (at a high level): PQ is a lossy compression technique. It works by:
- Splitting each vector into multiple sub-vectors.
- Running a clustering algorithm (like k-means) on the set of sub-vectors in each segment to find 256 representative centroids.
- Replacing each original sub-vector with the ID (a single byte) of its closest centroid.
This dramatically reduces the storage size of the vectors within the index, from 1536 * 4 bytes per vector to a much smaller, fixed number of bytes.
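Concretely: a 1536-dimensional float4 vector occupies 1536 × 4 = 6,144 bytes. Split into 96 segments of 16 dimensions each, with every segment replaced by a one-byte centroid ID, it shrinks to 96 bytes, a 64x reduction (plus a small shared codebook).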
Compressing the HNSW Index in `pgvector`
An important caveat: as of this writing, stock pgvector does not expose PQ; its HNSW index accepts only m and ef_construction as options (PQ-style compression is found in other PostgreSQL extensions). What pgvector (0.7.0+) does provide is scalar compression via the halfvec type, which stores each dimension as a 2-byte half-precision float. You can apply it inside the index with an expression, halving index memory without changing the table's column type.
-- Half-precision HNSW index: the column stays vector(1536),
-- but the index stores 2-byte floats instead of 4-byte floats.
CREATE INDEX ON documents USING hnsw ((embedding::halfvec(1536)) halfvec_l2_ops)
WITH (m = 32, ef_construction = 128);
-- Queries must use the same expression for the planner to pick this index:
-- ORDER BY embedding::halfvec(1536) <-> '[...query...]'::halfvec(1536)
The Trade-Offs of Compression
When to use compression:
- When your dataset is so large that the full-precision HNSW index does not fit into available RAM.
- When you can tolerate a slight drop in recall (e.g., from 99% to 96%) in exchange for a massive reduction in operational cost.
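When compression does cost recall, a common recovery pattern is to over-fetch through the compressed index and re-rank the candidates with full-precision distances. A sketch using the halfvec expression index from above:
-- Fetch 50 candidates via the compressed index, re-rank exactly, keep 10
SELECT id, content
FROM (
    SELECT id, content, embedding
    FROM documents
    ORDER BY embedding::halfvec(1536) <-> '[...query...]'::halfvec(1536)
    LIMIT 50
) candidates
ORDER BY embedding <-> '[...query...]'
LIMIT 10;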
5. Operational Edge Cases and Concerns
Index Bloat and VACUUM:
Like any index in PostgreSQL, HNSW indexes are subject to bloat from UPDATE and DELETE operations. Dead tuples are not immediately removed. Over time, this can degrade performance as the search algorithm traverses through now-empty nodes. Regular VACUUM and ANALYZE operations are critical. For tables with high churn, aggressive autovacuum settings for that specific table are recommended.
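As a sketch of what 'aggressive' can mean in practice (the thresholds are illustrative assumptions, not universal recommendations):
-- Trigger autovacuum after ~2% of rows change instead of the default 20%
ALTER TABLE documents SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_analyze_scale_factor = 0.01
);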
The Curse of High Dimensionality:
HNSW performance is excellent up to ~2000 dimensions; not coincidentally, pgvector will only build an index on vector columns of up to 2,000 dimensions (halfvec raises that ceiling to 4,000). Beyond that, the concept of 'distance' becomes less meaningful, and the performance of tree-based or graph-based ANN algorithms starts to degrade. If you are working with extremely high-dimensional vectors (>2048), you should first consider dimensionality reduction techniques (like PCA) before indexing. For these extreme cases, an IVFFlat index might even become competitive again, as its performance degrades more gracefully with dimensionality.
Hardware Matters: RAM is King:
The core assumption of HNSW is that the graph structure (the upper layers and connections) can be held in memory. When the index is larger than available RAM, performance falls off a cliff. The system will constantly be swapping index pages from disk to memory (I/O-bound), and query latencies will skyrocket from milliseconds to seconds. Always provision enough RAM to hold your entire HNSW index. Use pg_relation_size('your_index_name') to monitor its size and plan your hardware accordingly. PQ is your escape hatch when this is not feasible.
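A quick way to track this over time (the index name is whatever PostgreSQL assigned yours, e.g. documents_embedding_idx):
-- Human-readable index size; compare against available RAM
SELECT pg_size_pretty(pg_relation_size('documents_embedding_idx'));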
Conclusion: An Empirical Approach to Tuning
Optimizing pgvector's HNSW implementation is not about finding a single set of 'magic' parameters. It is an engineering discipline that requires a deep understanding of the trade-offs involved and a commitment to empirical validation.
Your tuning strategy should be:
    - Start with balanced M and ef_construction values (e.g., M=32, ef_construction=128) and invest the time to build a quality index.
    - Benchmark a range of hnsw.ef_search values to find the sweet spot on the latency vs. recall curve that meets your product's requirements.

By moving beyond the defaults and applying these advanced patterns, you can build vector search systems that are not only powerful but also predictable, scalable, and cost-effective in production.