Optimizing pgvector HNSW for Real-time Similarity Search

19 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

Beyond `CREATE INDEX`: A Production Deep Dive into pgvector's HNSW

For any team building AI-native applications, integrating vector similarity search directly into the primary operational database is a compelling proposition. It simplifies the stack, reduces data synchronization complexity, and leverages the robust, ACID-compliant environment of systems like PostgreSQL. The pgvector extension has emerged as a leader in this space, and its support for the HNSW (Hierarchical Navigable Small World) algorithm provides a powerful tool for Approximate Nearest Neighbor (ANN) search.

However, treating HNSW as a black box by simply running CREATE INDEX ... USING hnsw is a direct path to performance bottlenecks, unpredictable recall, and operational headaches in a production environment. The default parameters are a generic starting point, not a universal solution. Achieving low-latency, high-recall similarity search at scale requires a nuanced understanding of HNSW's internal mechanics and a deliberate strategy for tuning, benchmarking, and maintenance.

This article is for engineers who have already moved past the proof-of-concept stage. We will not cover what a vector is or the basics of a cosine distance. Instead, we will dissect the critical tuning parameters (m, ef_construction, ef_search), provide a concrete framework for benchmarking their impact, and address the most significant production challenge: managing a mutable dataset with an HNSW index.


The Core Trade-off Triangle: Recall, Latency, and Build Cost

Every decision in tuning an HNSW index revolves around balancing three competing factors:

  • Recall: The percentage of true nearest neighbors returned by the approximate search. A recall of 1.0 (or 100%) means the search is perfectly accurate, equivalent to an exhaustive k-NN search.
  • Query Latency: The time it takes to perform a similarity search for a given query vector. This is the critical metric for real-time applications.
  • Index Build Cost: The time and computational resources (CPU, memory) required to construct the index. For static datasets, this is a one-time cost. For dynamic datasets, this cost is incurred repeatedly during re-indexing.
  • Two index-time parameters, m and ef_construction, primarily influence the structure and quality of the HNSW graph, setting the stage for the trade-offs. One query-time parameter, ef_search, allows you to navigate this trade-off dynamically.

    1. `m`: The Connectivity of the Graph

    The m parameter defines the maximum number of bidirectional links (connections) each node in the graph can have. It is the single most important factor determining the memory footprint of your index.

    * Low m (e.g., 8-12): Results in a sparse graph. This reduces memory consumption significantly but can lead to lower recall, as the search may not have enough pathways to find the true nearest neighbors. It's suitable for datasets with very distinct clusters or when memory is the absolute primary constraint.

    * High m (e.g., 32-64): Creates a dense, highly connected graph. This drastically increases the probability of finding the optimal path during a search, leading to higher potential recall. However, the memory usage and index size scale non-linearly with m. A higher m can also sometimes slow down queries slightly due to the increased number of neighbors to evaluate at each step.

    Rule of Thumb: Start with m = 16 or m = 24. The optimal value is highly dependent on the intrinsic dimensionality of your data. For high-dimensional, complex data, a larger m may be necessary.

    Memory Impact Analysis:

    The memory required for the HNSW index is roughly (M D 4 + M M 2 4) num_vectors bytes, where M is m, D is the vector dimension, and 4 is the size of a float. The MM2*4 term represents the storage for links. This shows the quadratic impact of m on memory.

    sql
    -- Example of creating an index with a specific 'm'
    -- Let's assume we have a table 'products' with a 'embedding' column
    -- CREATE EXTENSION IF NOT EXISTS vector;
    -- CREATE TABLE products (id bigserial PRIMARY KEY, name text, embedding vector(768));
    -- ... populate data ...
    
    -- Creating an HNSW index with a higher connectivity
    CREATE INDEX ON products USING hnsw (embedding vector_l2_ops) WITH (m = 32);

    2. `ef_construction`: The Quality of the Build

    During index construction, for each new point added, the algorithm performs a search to find its nearest neighbors to connect to. ef_construction controls the size of the dynamic candidate list during this search. It's a build-time parameter that directly impacts the quality of the resulting graph.

    * Low ef_construction (e.g., 40-64): Leads to a faster index build. The resulting graph may be of lower quality, potentially creating local optima that trap the search algorithm later. This can cap the maximum achievable recall, regardless of how high you set the query-time search parameter (ef_search).

    * High ef_construction (e.g., 128-512): Significantly increases index build time but produces a much higher quality graph. The search for neighbors during construction is more exhaustive, leading to better connections and a higher potential for achieving near-perfect recall during queries.

    Rule of Thumb: A good starting point is ef_construction = 4 * m. If you are prioritizing query performance and recall over build time, be generous with this parameter. A one-time investment in a high-quality build pays dividends for every subsequent query.

    sql
    -- Creating an index with higher build quality
    CREATE INDEX ON products USING hnsw (embedding vector_l2_ops) WITH (m = 32, ef_construction = 128);

    3. `ef_search`: The Real-time Tuning Knob

    This is the most critical parameter for real-time applications. Unlike the previous two, ef_search is not set at index creation. It is a session-level parameter that you configure before executing a query. It controls the size of the dynamic candidate list during the search at query time.

    * Low ef_search (e.g., 20-40): Results in very low-latency queries. The search is 'greedier' and explores a smaller portion of the graph, which increases the risk of missing the true nearest neighbors (lower recall).

    * High ef_search (e.g., 100-400): Increases query latency but also improves recall. The search is more exhaustive, exploring more potential paths to find the best candidates.

    This parameter allows you to dynamically balance the recall/latency trade-off per query or per application use case. A background batch job might use a high ef_search for accuracy, while a user-facing real-time search might use a lower value to meet latency SLOs.

    sql
    -- Setting ef_search before running a query
    BEGIN;
    SET LOCAL hnsw.ef_search = 100;
    SELECT id, name FROM products ORDER BY embedding <-> '[0.1,0.2,...,0.n]' LIMIT 10;
    COMMIT;

    Using BEGIN; SET LOCAL ...; COMMIT; ensures the setting is scoped only to the current transaction, which is critical in a connection-pooled environment to prevent settings from one request from leaking into another.


    A Production-Grade Benchmarking Framework

    Theoretical understanding is insufficient. You must benchmark these parameters against your own data and production environment. Here's a Python-based framework using psycopg and numpy to systematically evaluate the performance of your pgvector setup.

    We'll use a synthetic dataset for this example, but you should substitute this with a representative sample of your production data.

    Prerequisites:

    * Python 3.8+

    * psycopg, numpy, pandas, tqdm, matplotlib

    Step 1: Setup and Data Generation

    First, let's set up our database table and generate some sample vector data.

    python
    import psycopg
    import numpy as np
    import time
    import pandas as pd
    from tqdm import tqdm
    
    # --- Configuration ---
    DB_CONN_STRING = "postgresql://user:password@host:port/dbname"
    
    # Vector configuration
    NUM_VECTORS = 100_000
    DIMENSIONS = 768
    
    # HNSW Index Parameters to test
    M = 32
    EF_CONSTRUCTION = 128
    
    # Query parameters
    NUM_QUERIES = 100
    K_NEIGHBORS = 10
    EF_SEARCH_VALUES = [20, 40, 60, 80, 100, 150, 200, 300, 400]
    
    # --- Database Setup ---
    def setup_database(conn):
        with conn.cursor() as cur:
            cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
            cur.execute("DROP TABLE IF EXISTS items;")
            cur.execute(f"CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector({DIMENSIONS}));")
            conn.commit()
    
        print("Generating random data...")
        embeddings = np.random.rand(NUM_VECTORS, DIMENSIONS).astype(np.float32)
        
        print("Inserting data...")
        with conn.cursor() as cur:
            with cur.copy("COPY items (embedding) FROM STDIN WITH (FORMAT BINARY)") as copy:
                copy.set_types([f'vector({DIMENSIONS})'])
                for row in tqdm(embeddings, desc="Copying data"):
                    copy.write_row([row])
        conn.commit()
        print("Data insertion complete.")
    

    Step 2: Ground Truth Calculation

    To measure recall, we need to know the actual nearest neighbors. This requires performing an exhaustive, exact search (which will be slow) and storing the results.

    python
    def calculate_ground_truth(conn, query_vectors):
        ground_truth = []
        print("Calculating ground truth (exact search)...")
        with conn.cursor() as cur:
            # Disable index scan to ensure exact search
            cur.execute("SET enable_seqscan = on;")
            cur.execute("SET enable_indexscan = off;")
            for query_vec in tqdm(query_vectors, desc="Ground Truth Queries"):
                cur.execute(
                    "SELECT id FROM items ORDER BY embedding <-> %s LIMIT %s",
                    (query_vec.tolist(), K_NEIGHBORS)
                )
                ground_truth.append(set([row[0] for row in cur.fetchall()]))
        return ground_truth

    Step 3: Benchmarking Loop

    Now, we create the HNSW index and then loop through our desired ef_search values, measuring latency and recall for each.

    python
    def run_benchmark(conn, query_vectors, ground_truth):
        results = []
        print(f"\nCreating HNSW index with m={M}, ef_construction={EF_CONSTRUCTION}...")
        start_time = time.time()
        with conn.cursor() as cur:
            cur.execute("SET maintenance_work_mem = '2GB';") # Allocate more memory for faster build
            cur.execute(f"CREATE INDEX ON items USING hnsw (embedding vector_l2_ops) WITH (m = {M}, ef_construction = {EF_CONSTRUCTION});")
            conn.commit()
        build_time = time.time() - start_time
        print(f"Index built in {build_time:.2f} seconds.")
    
        for ef_search in EF_SEARCH_VALUES:
            latencies = []
            recalls = []
            
            with conn.cursor() as cur:
                # Use SET LOCAL within a transaction for safety in pooled environments
                cur.execute(f"BEGIN; SET LOCAL hnsw.ef_search = {ef_search};")
    
                for i, query_vec in enumerate(tqdm(query_vectors, desc=f"Querying with ef_search={ef_search}")):
                    start_q_time = time.time()
                    cur.execute(
                        "SELECT id FROM items ORDER BY embedding <-> %s LIMIT %s",
                        (query_vec.tolist(), K_NEIGHBORS)
                    )
                    ann_results = set([row[0] for row in cur.fetchall()])
                    latencies.append((time.time() - start_q_time) * 1000) # milliseconds
    
                    # Calculate recall
                    intersection = len(ann_results.intersection(ground_truth[i]))
                    recalls.append(intersection / K_NEIGHBORS)
                
                cur.execute("COMMIT;")
    
            avg_latency = np.mean(latencies)
            avg_recall = np.mean(recalls)
            p95_latency = np.percentile(latencies, 95)
    
            results.append({
                'ef_search': ef_search,
                'avg_recall': avg_recall,
                'avg_latency_ms': avg_latency,
                'p95_latency_ms': p95_latency,
                'build_time_s': build_time,
                'm': M,
                'ef_construction': EF_CONSTRUCTION
            })
            print(f"ef_search={ef_search}: Recall={avg_recall:.4f}, Avg Latency={avg_latency:.2f}ms")
    
        return pd.DataFrame(results)
    
    # --- Main Execution ---
    if __name__ == "__main__":
        with psycopg.connect(DB_CONN_STRING) as conn:
            setup_database(conn)
            
            # Generate query vectors (should not be in the dataset)
            query_vectors = np.random.rand(NUM_QUERIES, DIMENSIONS).astype(np.float32)
            
            gt = calculate_ground_truth(conn, query_vectors)
            benchmark_results = run_benchmark(conn, query_vectors, gt)
    
            print("\n--- Benchmark Results ---")
            print(benchmark_results.to_string())
    
            # You can now plot these results to visualize the trade-off
            # import matplotlib.pyplot as plt
            # ... plotting code ...

    Running this script will produce a table showing the direct relationship between ef_search, recall, and latency for your specific index configuration. By running the entire script with different m and ef_construction values, you can map out the performance landscape and make an informed decision based on your application's SLOs for both recall and latency.


    The Elephant in the Room: Handling Data Mutability

    While HNSW provides excellent query performance, its graph structure is fundamentally static. In pgvector's implementation, the HNSW index does not directly support UPDATE or DELETE operations. When a row is deleted or its vector is updated, the entry in the HNSW graph is not removed. It becomes a "dead" entry.

    This has two severe consequences in production:

  • Performance Degradation: Over time, the index becomes bloated with dead tuples. Search queries will waste cycles traversing parts of the graph that point to these dead entries, increasing latency.
  • Incorrect Results: A search might return a pointer to a row that has since been deleted. Your application logic must be robust enough to handle this (e.g., by re-checking if the returned ID still exists).
  • Ignoring this will lead to a slow, inevitable decay of your search system's performance and accuracy. The solution is a deliberate re-indexing strategy.

    Production Pattern: Concurrent Re-indexing

    You cannot simply DROP INDEX and CREATE INDEX on a live production table, as this would lock the table and cause significant downtime for your search feature. The correct approach is to use CREATE INDEX CONCURRENTLY.

    This process builds a new, clean index in the background without taking heavy locks on the table. Once the new index is built, you can atomically swap it with the old one.

    Here is a robust SQL script to perform this operation, which can be run periodically via a cron job or a scheduled task runner.

    sql
    DO $$
    DECLARE
        index_name TEXT := 'items_embedding_hnsw_idx';
        new_index_name TEXT := 'items_embedding_hnsw_idx_new';
        old_index_name TEXT := 'items_embedding_hnsw_idx_old';
        m_val INT := 32;
        ef_construction_val INT := 128;
    BEGIN
        RAISE NOTICE 'Starting concurrent re-indexing for %', index_name;
    
        -- 1. Create the new index concurrently. This is the long-running part.
        -- It builds a fresh index from the current state of the table.
        RAISE NOTICE 'Building new index % ...', new_index_name;
        EXECUTE format(
            'CREATE INDEX CONCURRENTLY %I ON items USING hnsw (embedding vector_l2_ops) WITH (m = %s, ef_construction = %s);',
            new_index_name, m_val, ef_construction_val
        );
        RAISE NOTICE 'New index built successfully.';
    
        -- 2. Once the new index is ready, we perform the swap in a single transaction.
        BEGIN;
            -- Lock the table to prevent schema changes during the swap.
            -- An ACCESS EXCLUSIVE lock is heavy, but the transaction is very short.
            LOCK TABLE items IN ACCESS EXCLUSIVE MODE;
    
            -- 3. Rename the old index so we can drop it later.
            RAISE NOTICE 'Renaming old index % to %', index_name, old_index_name;
            EXECUTE format('ALTER INDEX IF EXISTS %I RENAME TO %I;', index_name, old_index_name);
    
            -- 4. Rename the new index to the original name.
            RAISE NOTICE 'Activating new index % by renaming to %', new_index_name, index_name;
            EXECUTE format('ALTER INDEX %I RENAME TO %I;', new_index_name, index_name);
        COMMIT;
    
        -- 5. Drop the old index. This can be done outside the transaction.
        RAISE NOTICE 'Dropping old index %', old_index_name;
        EXECUTE format('DROP INDEX CONCURRENTLY IF EXISTS %I;', old_index_name);
    
        RAISE NOTICE 'Re-indexing complete.';
    
    EXCEPTION
        WHEN OTHERS THEN
            RAISE WARNING 'An error occurred during re-indexing: %', SQLERRM;
            RAISE NOTICE 'Attempting to drop temporary new index if it exists.';
            EXECUTE format('DROP INDEX CONCURRENTLY IF EXISTS %I;', new_index_name);
            RAISE;
    END;
    $$;

    Operational Considerations for Re-indexing:

    * Frequency: The ideal frequency depends on your data churn rate. For a high-traffic system with many updates/deletes, you might need to run this daily or even more frequently. For a system with infrequent changes, weekly or monthly might suffice. Monitor index bloat and query latency to determine the right schedule.

    * Resource Usage: CREATE INDEX CONCURRENTLY is CPU and I/O intensive. Run it during off-peak hours. It will also temporarily double your index storage space on disk.

    * maintenance_work_mem: Ensure this PostgreSQL setting is sufficiently high for the session running the re-index script to speed up the build process significantly.


    Advanced Edge Cases and Final Tuning

    The MVCC and HNSW Interaction

    Even with re-indexing, it's crucial to understand how PostgreSQL's Multi-Version Concurrency Control (MVCC) interacts with pgvector. When you UPDATE a row, Postgres effectively creates a new version of that row and marks the old one as a dead tuple. VACUUM is responsible for cleaning up these dead tuples.

    However, the HNSW index still holds a pointer to the old, dead version of the row. A search can lead you to this dead tuple. If another query has already been VACUUMed, your application might fetch an ID that no longer exists. This is a race condition that must be handled.

    Solution: Always re-validate search results, especially in high-churn environments. Instead of a simple SELECT id ..., perform a JOIN or a subsequent SELECT to ensure the entity is still valid and hasn't been modified since the index was last traversed.

    sql
    -- A more robust query pattern
    WITH ann_results AS (
        SELECT id
        FROM items
        ORDER BY embedding <-> '[...]' 
        LIMIT 20 -- Fetch more than needed to account for dead tuples
    )
    SELECT p.id, p.name
    FROM products p
    JOIN ann_results ar ON p.id = ar.id
    -- Add any additional filtering here, e.g., p.is_active = true
    LIMIT 10;

    Leveraging Quantization for Memory Reduction

    For extremely large datasets where memory becomes the primary bottleneck (billions of vectors), pgvector 0.5.0+ introduced Scalar Quantization. This technique reduces the precision of the vectors, storing them as 8-bit integers (int8) instead of 32-bit floats (float4), resulting in a ~4x reduction in memory and disk footprint.

    This is a trade-off. You lose precision, which can impact recall. However, for many applications, this loss is acceptable in exchange for the massive scalability gains.

    sql
    -- Example of creating an index with Scalar Quantization
    CREATE INDEX ON items USING hnsw (embedding vector_l2_ops) 
    WITH (m = 32, ef_construction = 128, lists = 100); -- `lists` is required for SQ
    
    -- NOTE: The `lists` parameter is part of the IVFFlat algorithm, but its syntax is used
    -- by pgvector to enable quantization for HNSW. This is a quirk of the current API.

    This is an advanced optimization. You should only consider it after exhausting standard HNSW tuning and if your memory pressure is exceptionally high.

    Conclusion

    Using pgvector's HNSW index in production is a powerful strategy, but it demands an engineering discipline that goes far beyond the initial CREATE INDEX command. A successful implementation hinges on a deep, data-driven understanding of the m, ef_construction, and ef_search parameters, which you can only gain through rigorous benchmarking.

    Furthermore, the static nature of the HNSW graph requires a proactive, automated re-indexing strategy to manage data mutability and prevent performance degradation. By combining disciplined tuning with robust operational procedures like concurrent re-indexing, you can build a highly performant, scalable, and reliable real-time similarity search system directly within the comfort and power of the PostgreSQL ecosystem.

    Found this article helpful?

    Share it with others who might benefit from it.

    More Articles