Optimizing pgvector HNSW Indexes for High-Cardinality Search

Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Senior Engineer's Dilemma: Beyond `CREATE INDEX`

As vector search moves from experimental projects to core product features, the engineering challenge shifts from simple implementation to robust, scalable, and cost-effective operation. You've already integrated pgvector, embedded your data, and run a few successful ORDER BY embedding <-> $1 queries. But now, with a dataset of 10 million, 100 million, or even a billion vectors, the default settings are failing. Index build times are exploding, query latencies are unpredictable, and memory consumption is through the roof.

This is the critical juncture where senior engineering decisions make or break the system. The problem is that Approximate Nearest Neighbor (ANN) search, particularly with an algorithm like Hierarchical Navigable Small Worlds (HNSW), is not a black box. It's a complex system of trade-offs. Choosing the right parameters requires a deep understanding of the algorithm's mechanics and a rigorous, data-driven approach to tuning.

This article is a deep dive into the production-level optimization of HNSW indexes in pgvector. We will systematically dissect the key parameters, establish a robust benchmarking framework, and explore advanced architectural patterns for handling massive scale and complex query patterns. We assume you are familiar with vector embeddings and the basics of pgvector. Our focus is on the hard-won knowledge required to run it reliably in production.


Anatomy of HNSW Parameters: Your Tuning Levers

The HNSW algorithm constructs a multi-layered graph where long-range connections on upper layers facilitate rapid traversal, and shorter-range connections on lower, denser layers enable fine-grained search. In pgvector, you control the geometry and behavior of this graph primarily through three parameters:

  • m (Max Connections): This integer defines the maximum number of bidirectional links (neighbors) each node can have on any given layer of the graph.

    * Impact: A higher m creates a denser graph. This generally improves the accuracy (recall) of search results and makes the index more robust to local minima during traversal. However, it comes at a significant cost: increased index size (memory footprint) and longer index build times. The relationship between m and memory is roughly linear.

    * Production Guideline: Values typically range from 8 to 48. A common starting point is 16. For high-dimensional or complex data distributions, pushing m to 24 or 32 might be necessary to achieve the desired recall.

  • ef_construction (Construction-Time Search): This integer controls the size of the dynamic candidate list used when inserting a new element into the graph. For each new node, the algorithm performs a search to find its nearest neighbors to connect to. ef_construction defines the beam width of this search.

    * Impact: A larger ef_construction value leads to a higher-quality graph structure. The algorithm explores more potential neighbors for each new node, resulting in better-optimized connections. This translates to higher potential recall at query time. The trade-off is a dramatic increase in index build time. Doubling ef_construction can more than double the time it takes to build your index.

    * Production Guideline: Values often range from 64 to 512. If your data is static and you can afford a one-time, lengthy build process (e.g., a nightly batch job), a higher value like 256 or 512 is beneficial. For datasets with frequent writes, a lower value like 128 is a more practical compromise.

  • ef_search (Query-Time Search): This is your most important runtime parameter. Similar to ef_construction, it defines the size of the dynamic candidate list during a search query. It is a session-level parameter, meaning you can adjust it per query or per transaction.

    * Impact: This parameter directly controls the trade-off between search speed (latency) and accuracy (recall). A low ef_search value results in a very fast but less accurate search, as the algorithm explores a smaller portion of the graph. A high ef_search value forces a more exhaustive search, increasing accuracy at the cost of higher latency.

    * Production Guideline: This value is highly application-dependent. It must be greater than or equal to k (the number of neighbors you are requesting). A good starting point is k * 2, but values can go up to 200 or more. The optimal setting can only be found through empirical testing against your specific Service Level Objectives (SLOs) for latency and recall. (See the SQL sketch after this list for where each parameter is applied.)
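
To make these levers concrete, here is a minimal SQL sketch of where each one is applied (the table and vector literal are placeholders): m and ef_construction are fixed when the index is created, while ef_search is set per session or per transaction.

sql
-- Build-time parameters: baked into the index at creation
CREATE INDEX ON articles USING hnsw (embedding vector_l2_ops)
WITH (m = 16, ef_construction = 128);

-- Query-time parameter: adjustable at any time, per session or per transaction
SET hnsw.ef_search = 40;
SELECT id FROM articles ORDER BY embedding <-> '[...]' LIMIT 10;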

    A Rigorous Benchmarking Framework

Theory is insufficient. To optimize these parameters, we must measure their impact. Let's establish a reproducible benchmark. We'll simulate a production scenario with 10 million 768-dimensional vectors, typical of sentence-transformer models such as all-mpnet-base-v2 (all-MiniLM-L6-v2, by contrast, produces 384-dimensional embeddings).

    Step 1: Table Setup

    First, ensure you have the pgvector extension installed.

    sql
    CREATE EXTENSION IF NOT EXISTS vector;
    
    -- Drop the table if it exists to ensure a clean slate
    DROP TABLE IF EXISTS articles;
    
    CREATE TABLE articles (
        id BIGSERIAL PRIMARY KEY,
        content TEXT,
        published_date TIMESTAMPTZ,
        -- Assuming 768 dimensions from a common sentence transformer
        embedding VECTOR(768)
    );

    Step 2: Data Generation

In a real scenario, you'd populate this from your data source. For our benchmark, we'll use a Python script to generate random data. This isolates the performance of the vector index from any data generation bottlenecks. The script generates and inserts vectors batch by batch, since materializing the full 10 million x 768 float32 matrix up front would require roughly 30 GB of RAM before conversion to Python objects.

python
import time

import numpy as np
import psycopg2
from psycopg2.extras import execute_values

# --- Configuration ---
DB_CONNECT_STRING = "postgresql://user:password@host:port/dbname"
NUM_VECTORS = 10_000_000
DIMENSIONS = 768
BATCH_SIZE = 1000

def random_vector_batch(count, dims):
    """Generates one batch of random vectors as pgvector text literals."""
    batch = np.random.rand(count, dims).astype(np.float32)
    # pgvector accepts the text form '[x1,x2,...]', which avoids relying on
    # array-to-vector casts or an extra adapter library.
    return ["[" + ",".join(map(str, row)) + "]" for row in batch]

def insert_batch(conn, literals):
    """Inserts a batch of vector literals into the articles table."""
    with conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO articles (embedding) VALUES %s",
            [(v,) for v in literals],
        )

def main():
    print("Connecting to the database...")
    conn = psycopg2.connect(DB_CONNECT_STRING)
    conn.autocommit = False

    start_time = time.time()
    inserted = 0
    while inserted < NUM_VECTORS:
        count = min(BATCH_SIZE, NUM_VECTORS - inserted)
        # Generate each batch on the fly: holding all 10M x 768 float32 vectors
        # in memory at once would need ~30 GB before Python-object conversion.
        insert_batch(conn, random_vector_batch(count, DIMENSIONS))
        inserted += count
        if (inserted // BATCH_SIZE) % 100 == 0:
            conn.commit()  # Commit periodically to bound transaction size
            print(f"Inserted {inserted} vectors... ({inserted / NUM_VECTORS:.1%})")

    conn.commit()  # Final commit
    print(f"\nFinished inserting {inserted} vectors in {time.time() - start_time:.2f} seconds.")
    conn.close()

if __name__ == "__main__":
    main()

    Step 3: Defining Metrics

    We need to measure:

  • Index Build Time: How long CREATE INDEX takes.
  • Index Size: The on-disk size of the index (pg_relation_size).
  • Recall@10: For a set of test queries, what percentage of the true top 10 nearest neighbors (found via a perfect, slow scan) are captured by our fast ANN query?
  • P99 Query Latency: The 99th percentile query time, which is a better measure of worst-case user experience than average latency.
To measure recall, we first need the "ground truth." We'll take 100 random vectors from our dataset as query vectors and find their true nearest neighbors using an exact search. This is slow (a full sequential scan per query), which is why we only do it for the 100 benchmark queries, and we run it before the HNSW index exists so the planner cannot fall back to the approximate index.

    sql
    -- To get ground truth for a single query vector '[...]' (replace with actual vector)
    -- This will be SLOW. Run it for a sample of query vectors before building the index.
    SELECT id FROM articles ORDER BY embedding <-> '[...]' LIMIT 10;
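
To turn these definitions into numbers, a minimal benchmarking sketch is shown below. It assumes the ground-truth IDs have already been collected into a dict keyed by query, reuses the psycopg2 connection from the loader script, and passes query vectors as pgvector text literals; the helper name and data shapes are illustrative, not part of pgvector itself.

python
import time
import numpy as np

def run_benchmark(conn, queries, ground_truth, ef_search, k=10):
    """queries: {query_id: '[...]' literal}; ground_truth: {query_id: set of true top-k ids}."""
    latencies_ms, recalls = [], []
    with conn.cursor() as cur:
        # Session-level setting; applies to every query on this connection
        cur.execute("SET hnsw.ef_search = %s", (ef_search,))
        for qid, vec in queries.items():
            start = time.perf_counter()
            cur.execute(
                "SELECT id FROM articles ORDER BY embedding <-> %s LIMIT %s",
                (vec, k),
            )
            returned = {row[0] for row in cur.fetchall()}
            latencies_ms.append((time.perf_counter() - start) * 1000)
            recalls.append(len(returned & ground_truth[qid]) / k)
    return {
        "ef_search": ef_search,
        "recall_at_10": float(np.mean(recalls)),
        "p99_latency_ms": float(np.percentile(latencies_ms, 99)),
    }

Note that this measures end-to-end latency including the client round trip, which is usually what your latency SLO cares about anyway.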

    Systematic Tuning: A Practical Walkthrough

    Let's apply our framework. We'll tune m and ef_construction first, as they determine the static quality of our index. Then we'll tune ef_search for runtime performance.

    Phase 1: Tuning `m` and `ef_construction`

    The goal is to find the sweet spot that gives us a high-quality index without exorbitant build times or memory usage. We'll test a few combinations.

    Scenario A: Baseline (m=16, ef_construction=128)

    sql
    -- Use \timing in psql to measure execution time
    CREATE INDEX ON articles USING hnsw (embedding vector_l2_ops)
    WITH (m = 16, ef_construction = 128);
    
    -- Check index size
    SELECT pg_size_pretty(pg_relation_size('articles_embedding_idx'));

    * Expected Outcome: A reasonable starting point. Build time might be significant (potentially hours for 10M vectors, depending on hardware). Index size will be substantial.
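
Build time also depends heavily on server settings, not just m and ef_construction. Below is a hedged sketch of settings that commonly help, assuming a pgvector version with parallel HNSW builds (0.6.0 or later); the exact values are illustrative and must fit your instance's RAM and core count.

sql
-- Give the build enough memory to keep the graph in RAM if possible
SET maintenance_work_mem = '8GB';

-- Allow parallel workers for the index build
SET max_parallel_maintenance_workers = 7;

CREATE INDEX ON articles USING hnsw (embedding vector_l2_ops)
WITH (m = 16, ef_construction = 128);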

    Scenario B: Higher Quality (m=24, ef_construction=256)

    sql
    DROP INDEX articles_embedding_idx;
    CREATE INDEX ON articles USING hnsw (embedding vector_l2_ops)
    WITH (m = 24, ef_construction = 256);

* Expected Outcome: Build time will be considerably longer than Scenario A (perhaps 2-3x). Index size will increase by about 50% (because m increases from 16 to 24). The resulting index will have the potential for higher recall.

    Scenario C: Max Speed Build (m=16, ef_construction=64)

    sql
    DROP INDEX articles_embedding_idx;
    CREATE INDEX ON articles USING hnsw (embedding vector_l2_ops)
    WITH (m = 16, ef_construction = 64);

    * Expected Outcome: The fastest build time, but the index quality will be lower. This might be acceptable for applications where recall is less critical or where data is updated so frequently that long build times are prohibitive.

    After running these tests, you'll have a table of results:

| m  | ef_construction | Build Time (min) | Index Size (GB) |
|----|-----------------|------------------|-----------------|
| 16 | 64              | ~45              | ~7.5            |
| 16 | 128             | ~90              | ~7.5            |
| 24 | 256             | ~220             | ~11.2           |

    (Note: Timings and sizes are illustrative and highly dependent on hardware.)

    Decision: For our primary use case, let's assume we value quality and can afford the build time. We'll proceed with the index from Scenario B (m=24, ef_construction=256).

    Phase 2: Tuning `ef_search` for the Recall-Latency Trade-off

    Now we tune the runtime parameter. We'll execute a series of queries against our chosen index, adjusting ef_search for each run.

sql
-- SET LOCAL only takes effect inside a transaction block, so wrap each run in one.

-- Baseline search parameter
BEGIN;
SET LOCAL hnsw.ef_search = 40;
EXPLAIN ANALYZE SELECT id FROM articles ORDER BY embedding <-> '[...]' LIMIT 10;
COMMIT;

-- Increase the search parameter for higher accuracy
BEGIN;
SET LOCAL hnsw.ef_search = 80;
EXPLAIN ANALYZE SELECT id FROM articles ORDER BY embedding <-> '[...]' LIMIT 10;
COMMIT;

-- Go further for maximum accuracy
BEGIN;
SET LOCAL hnsw.ef_search = 160;
EXPLAIN ANALYZE SELECT id FROM articles ORDER BY embedding <-> '[...]' LIMIT 10;
COMMIT;

    By running these queries for our 100 test vectors and comparing the results to our ground truth, we can build a performance curve:

| ef_search | Recall@10 (%) | P99 Latency (ms) |
|-----------|---------------|------------------|
| 40        | 92.5          | 8                |
| 80        | 98.2          | 15               |
| 160       | 99.6          | 28               |
| 320       | 99.8          | 55               |

    This table is the most critical output of your tuning process. It allows you to make an informed, quantitative decision.

    * If your application requires sub-20ms latency, you might choose ef_search = 80 and accept a recall of 98.2%.

    * If accuracy is paramount and a 30ms latency is acceptable, ef_search = 160 is the better choice.

    Production Pattern: Dynamic ef_search

    Different parts of your application may have different requirements. You can expose this trade-off to the application layer. A user-facing, real-time search might use a lower ef_search, while an internal, high-precision analytics job could use a much higher value. This is easily accomplished by setting the parameter at the start of the transaction.

python
# Application code example. Assumes `conn` is an open psycopg2 connection with
# autocommit disabled (SET LOCAL needs a transaction) and `query_vector` is
# already a pgvector literal such as '[0.1,0.2,...]'.

def find_similar(query_vector, k, accuracy_tier='balanced'):
    ef_search_map = {
        'fast': k * 2,         # e.g., 20 for k=10
        'balanced': k * 8,     # e.g., 80
        'high_recall': k * 16  # e.g., 160
    }
    ef_search = int(ef_search_map.get(accuracy_tier, k * 8))

    with conn.cursor() as cur:
        # SET LOCAL scopes the setting to the current transaction only
        cur.execute("SET LOCAL hnsw.ef_search = %s", (ef_search,))
        cur.execute(
            "SELECT id FROM articles ORDER BY embedding <-> %s LIMIT %s",
            (query_vector, k)
        )
        results = cur.fetchall()
    return results
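
For example, the same helper can serve very different consumers (hypothetical calling code, with query_vec already formatted as a pgvector literal):

python
# Latency-sensitive, user-facing search
autocomplete_ids = find_similar(query_vec, k=10, accuracy_tier='fast')

# Offline analytics job where recall matters more than latency
report_ids = find_similar(query_vec, k=10, accuracy_tier='high_recall')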

    Advanced Patterns for Production Scale

    Beyond parameter tuning, real-world systems present more complex challenges.

    Edge Case: Handling `UPDATE`s, `DELETE`s, and Index Bloat

HNSW indexes in pgvector do not physically remove entries when rows are deleted or updated; the affected index tuples simply become dead. VACUUM does remove deleted vectors from an HNSW index, but vacuuming HNSW can be slow and does not restore the graph to freshly built quality. Over time, with a high churn rate, your index can become bloated with these dead entries. This has two negative consequences:

  • Performance Degradation: Search traversals may still land on dead nodes, forcing the algorithm to backtrack and find a live entry point, adding latency.
  • Increased Memory Usage: Dead nodes continue to consume memory until the index is rebuilt.
Solution: Periodic REINDEX

    The solution is to periodically rebuild the index using PostgreSQL's REINDEX command. This creates a new, clean index from scratch and swaps it in.

    sql
    REINDEX INDEX CONCURRENTLY articles_embedding_idx;

CRITICAL CAVEAT: REINDEX CONCURRENTLY allows reads and writes during the long build phase, but it is not free. It briefly takes stronger locks at the start and end of the operation, has to wait for in-flight transactions at several points, and temporarily needs disk space for a second copy of the index. Those short waits can still be disruptive in a high-throughput environment, so schedule the rebuild during a maintenance window and monitor for lock contention.
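
A lightweight way to decide when a rebuild is due is to track index size and dead-tuple counts over time. Here is a monitoring sketch using standard PostgreSQL catalog views; the thresholds you act on are your own.

sql
-- Current on-disk size of the HNSW index
SELECT pg_size_pretty(pg_relation_size('articles_embedding_idx'));

-- Dead tuples accumulated in the table since the last vacuum
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'articles';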

    The Hybrid Search Problem: Combining Vector and Scalar Filters

    A very common production pattern is to search for vectors within a specific subset of your data. For example: "Find articles similar to this one, but only those published in the last 30 days and belonging to tenant_id = 'acme'."

    sql
    -- The naive hybrid query
    SELECT id, content
    FROM articles
    WHERE 
        published_date > NOW() - INTERVAL '30 days'
        AND tenant_id = 'acme' -- (Assuming we added a tenant_id column)
    ORDER BY embedding <-> '[...]' 
    LIMIT 10;

    The PostgreSQL query planner faces a difficult choice: apply the scalar filters first, or perform the vector search first? In most cases, it will choose to use the HNSW index first to find the top k (e.g., 10) nearest neighbors from the entire dataset. It will then apply the WHERE clause to this small result set.

    This is catastrophically inefficient if the filters are highly selective. The planner might find 10 vectors, all of which get disqualified by the WHERE clause, resulting in an empty or incomplete result set. To get a full 10 results, the database might have to retrieve thousands or tens of thousands of candidates from the vector index, only to discard most of them. This is known as the "post-filtering" problem.

    Solution Pattern: Partitioning

    For datasets reaching hundreds of millions or billions of vectors, the most effective solution is to use PostgreSQL's native table partitioning. By partitioning your data on the columns you filter by most frequently (like tenant_id or a date range), you enable the planner to prune entire partitions before performing the vector search.

    Let's re-architect our table for a multi-tenant application:

    sql
    -- Re-create the table with partitioning
    CREATE TABLE articles (
        id BIGSERIAL,
        content TEXT,
        published_date TIMESTAMPTZ,
        embedding VECTOR(768),
        tenant_id TEXT NOT NULL
    ) PARTITION BY LIST (tenant_id);
    
    -- Create partitions for specific tenants
    CREATE TABLE articles_tenant_acme PARTITION OF articles FOR VALUES IN ('acme');
    CREATE TABLE articles_tenant_globex PARTITION OF articles FOR VALUES IN ('globex');
    -- ... and so on for each tenant
    
-- Each partition gets its own HNSW index. Creating the index on the parent
-- partitioned table also cascades to every partition (PostgreSQL 11+); here we
-- create them explicitly for clarity.
    CREATE INDEX ON articles_tenant_acme USING hnsw (embedding vector_l2_ops) WITH (m=24, ef_construction=256);
    CREATE INDEX ON articles_tenant_globex USING hnsw (embedding vector_l2_ops) WITH (m=24, ef_construction=256);

    Now, when you run the hybrid query, the planner immediately identifies that it only needs to look at the articles_tenant_acme partition. It performs the fast HNSW search on this much smaller, isolated index, leading to a massive performance improvement. This pre-filtering approach is the gold standard for scaling hybrid search in pgvector.
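
To confirm that pruning is actually happening, inspect the plan for the hybrid query (a sketch; the vector literal is a placeholder and the exact plan shape varies by PostgreSQL version):

sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, content
FROM articles
WHERE
    tenant_id = 'acme'
    AND published_date > NOW() - INTERVAL '30 days'
ORDER BY embedding <-> '[...]'
LIMIT 10;

-- The plan should touch only articles_tenant_acme and use its HNSW index;
-- the other partitions should not appear at all.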

    Conclusion: From Heuristics to Engineering

    Optimizing pgvector's HNSW indexes for high-cardinality search is a task of precision engineering, not guesswork. The default settings are a starting point, not a destination. By adopting a systematic benchmarking approach, you can move from heuristics to data-driven decisions.

    Key Takeaways for Production Systems:

  • Tune Systematically: Isolate the impact of m, ef_construction, and ef_search. First, optimize the index build (m, ef_construction) based on your tolerance for build time and memory. Then, optimize query time (ef_search) based on your application's latency/recall SLOs.
  • ef_search is Your Runtime Knob: Leverage the ability to set ef_search per query to offer different performance tiers within your application.
  • Plan for Maintenance: For write-heavy workloads, index bloat is inevitable. Schedule and monitor REINDEX CONCURRENTLY to maintain performance.
  • Partition for Hybrid Search: At scale, naive hybrid queries will fail. Proactively partition your tables on high-selectivity scalar columns (tenant_id, date, region) to enable partition pruning, which is the most effective way to combine vector and relational queries.
By moving beyond the basics and embracing these advanced patterns, you can build a vector search system in PostgreSQL that is not only powerful but also scalable, reliable, and predictable under the pressures of a production environment.
