Optimizing HNSW Indexing in pgvector for High-Recall Similarity Search

17 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Production Problem: Beyond Default pgvector HNSW Settings

You've successfully integrated pgvector into your PostgreSQL instance. Your application, whether a RAG pipeline, an e-commerce recommendation engine, or an image similarity search service, is generating embeddings and storing them. You've created an HNSW index with the default parameters, and initial tests look promising. Then, you hit production traffic.

Query latency becomes erratic. The p99 latency for a simple nearest neighbor search spikes from 50ms to 500ms under concurrent load. More alarmingly, your recall metrics, meticulously measured offline, drop in production; the system returns less relevant results than expected. The default HNSW index you created with CREATE INDEX ON items USING hnsw (embedding vector_l2_ops); is failing you.

This is a common scenario for teams scaling vector search workloads. The default parameters for HNSW in pgvector are a safe starting point, but they are rarely optimal for high-performance, high-recall production systems. Achieving consistent low latency and high accuracy requires a deep understanding of the trade-offs governed by three key parameters: m, ef_construction, and ef_search.

This article is not an introduction to HNSW. It assumes you understand the fundamentals of approximate nearest neighbor (ANN) search and the graph-based nature of HNSW. We will dive directly into the advanced tuning strategies, edge cases, and operational patterns required to run pgvector at scale.


HNSW Tuning Levers: A Trilemma of Recall, Latency, and Cost

The performance of an HNSW index is a constant balancing act between three competing factors:

  • Recall: The accuracy of your search. What percentage of the true nearest neighbors are returned?
  • Query Latency: The time it takes to perform a search.
  • Resource Cost: A combination of index build time, index size (disk/memory), and CPU usage during queries.
The parameters m, ef_construction, and ef_search are the primary levers you control to navigate this trilemma.

    1. `m`: The Graph's Connectivity (Build-Time)

  • What it is: m defines the maximum number of bidirectional links (neighbors) each node in the graph can have. It's set once at index creation time.
  • Technical Impact: A higher m creates a denser, more connected graph. During a search, this provides more potential pathways to the true nearest neighbor, directly improving the probability of finding it (i.e., increasing recall). However, this density comes at a cost:
    - Index Size: The index grows proportionally to m. A larger m means more edges stored per node.

    - Build Time: Constructing a denser graph is computationally more expensive, leading to significantly longer index creation times.

    - Query CPU: Traversing a denser graph can sometimes, though not always, increase CPU usage per query.

  • Production Strategy: The default m is 16. This is often too low for high-dimensional data (>768 dimensions) where high recall is critical.

  • For RAG/Question-Answering (High Recall is King): Start with m = 32 or even m = 48. The upfront cost in build time and index size is often worth the significant improvement in the quality of retrieved contexts.
  • For E-commerce Recommendations (Balance of Speed/Relevance): m = 24 is often a good sweet spot. You can tolerate slightly lower recall if it means faster responses to the user.
  • Implementation Example:

    sql
    -- For a high-recall document search system
    -- Note: This will be slower to build and larger than the default.
    CREATE INDEX ON documents USING hnsw (embedding vector_l2_ops) WITH (m = 32, ef_construction = 128);
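
    Builds with larger m and ef_construction values can run for a long time. If you want visibility into a long build, PostgreSQL 12+ exposes index-build progress in the pg_stat_progress_create_index view; a minimal check, run from a separate session while the CREATE INDEX statement is executing:

    sql
    -- Rough progress of an in-flight index build (run from another session)
    SELECT
        index_relid::regclass AS index_name,
        phase,
        round(100.0 * blocks_done / NULLIF(blocks_total, 0), 1) AS pct_blocks_done,
        round(100.0 * tuples_done / NULLIF(tuples_total, 0), 1) AS pct_tuples_done
    FROM pg_stat_progress_create_index;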

    2. `ef_construction`: The Graph's Quality (Build-Time)

  • What it is: ef_construction controls the size of the dynamic candidate list used during the index-building process. For each new node being inserted, the algorithm searches the graph for the ef_construction nearest neighbors. From this list, it selects the best m candidates to form permanent links.
  • Technical Impact: A higher ef_construction value allows the algorithm to explore more potential neighbors during insertion, resulting in a higher-quality, more globally optimal graph structure. This has a powerful, direct impact on the recall you can achieve at query time. The trade-off is stark: ef_construction is the single biggest factor influencing index build time.
  • Production Strategy: The default ef_construction is 64. This is often insufficient for achieving >98% recall.

  • Rule of Thumb: A good starting point is ef_construction = 4 * m. If you set m = 32, start with ef_construction = 128.
  • Offline vs. Online Indexing: If you build your index offline in a batch process, you can afford to be generous. Pushing ef_construction to 200 or even 400 can yield S-tier recall. If you are indexing data in near real-time, you'll need to benchmark the insertion latency and find a balance.
  • Benchmarking ef_construction vs. Build Time:

    Let's measure this with a Python script. Assume a table items_benchmark populated with 1 million 768-dimensional vectors.

    python
    import psycopg2
    import time
    import os
    
    # --- Configuration ---
    DB_CONN = os.environ.get("DB_CONN_STRING") # "postgresql://user:pass@host/db"
    VECTOR_DIM = 768
    TABLE_NAME = "items_benchmark"
    
    # --- Test Parameters ---
    M_VALUE = 32
    EF_CONSTRUCTION_VALUES = [64, 128, 256, 400]
    
    def run_benchmark():
        conn = psycopg2.connect(DB_CONN)
        conn.autocommit = True
        cur = conn.cursor()
    
        print("--- HNSW ef_construction Benchmark ---")
        for efc in EF_CONSTRUCTION_VALUES:
            index_name = f"idx_hnsw_m{M_VALUE}_efc{efc}"
            
            # Drop previous index if it exists
            cur.execute(f"DROP INDEX IF EXISTS {index_name};")
            print(f"Building index with m={M_VALUE}, ef_construction={efc}...")
    
            start_time = time.time()
            try:
                cur.execute(f"""
                    CREATE INDEX {index_name} 
                    ON {TABLE_NAME} 
                    USING hnsw (embedding vector_l2_ops) 
                    WITH (m = {M_VALUE}, ef_construction = {efc});
                """)
                end_time = time.time()
                build_time = end_time - start_time
                print(f"  -> Success! Build time: {build_time:.2f} seconds")
    
                # Get index size
                cur.execute("SELECT pg_size_pretty(pg_relation_size(%s));", (index_name,))
                index_size = cur.fetchone()[0]
                print(f"  -> Index size: {index_size}")
    
            except Exception as e:
                print(f"  -> FAILED: {e}")
    
        cur.close()
        conn.close()
    
    if __name__ == "__main__":
        # This assumes you have a table populated with random vectors.
        # See pgvector docs for populating sample data.
        run_benchmark()

    Expected Outcome: You will observe a non-linear increase in build time. Moving from ef_construction=64 to 128 might double the build time, while moving from 128 to 256 could triple it. This benchmark is critical for planning your indexing strategy and maintenance windows.
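
    Two server settings also have an outsized effect on build time. Recent pgvector versions (0.6.0 and later) can build HNSW indexes in parallel, and builds are considerably faster when the graph under construction fits in maintenance_work_mem. A sketch of session settings for a large offline build (the values and table name are illustrative; size the memory setting to your hardware):

    sql
    -- Illustrative session settings for a large offline HNSW build
    SET maintenance_work_mem = '8GB';             -- let more of the graph build in memory
    SET max_parallel_maintenance_workers = 7;     -- parallel workers in addition to the leader
    CREATE INDEX ON items USING hnsw (embedding vector_l2_ops) WITH (m = 32, ef_construction = 128);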

    3. `ef_search`: The Query-Time Precision Knob

  • What it is: ef_search is the query-time equivalent of ef_construction. It defines the size of the dynamic candidate list during a search. It is not an index property; it's a session-level parameter set via SET.
  • Technical Impact: This is your primary tool for balancing latency and recall at runtime. A higher ef_search forces the query to explore more of the graph, dramatically increasing the chance of finding the true nearest neighbors (improving recall) at the cost of higher latency and CPU usage.
  • Production Strategy: Never leave this at the default (40). You must tune it based on your application's specific requirements.

  • Establish a Recall Target: Define your acceptable recall (e.g., 99% recall@10). Use a ground-truth dataset (generated via an exact nearest neighbor search) to measure this.
  • Benchmark Latency vs. Recall: Run a series of queries against your production index, varying ef_search and measuring the resulting recall and p99 latency. A minimal benchmarking sketch follows the tier-based example below.
  • Implementation Example: Dynamic ef_search in a Multi-Tenant App

    Imagine a SaaS product where 'Premium' tier users get more accurate results than 'Free' tier users. You can implement this by setting ef_search dynamically per transaction.

    sql
    -- A PL/pgSQL function to wrap our vector search
    CREATE OR REPLACE FUNCTION get_similar_documents(
        query_embedding vector(768),
        user_tier TEXT,
        result_limit INT
    )
    RETURNS TABLE (id UUID, title TEXT, similarity REAL)
    AS $$
    DECLARE
        ef_search_value INT;
    BEGIN
        -- Set ef_search based on user tier
        IF user_tier = 'premium' THEN
            ef_search_value := 150; -- High recall for premium users
        ELSE
            ef_search_value := 75;  -- Good balance for free users
        END IF;
    
        -- Use SET LOCAL to scope the parameter to this transaction only
        EXECUTE format('SET LOCAL hnsw.ef_search = %s', ef_search_value);
    
        -- Perform the search
        RETURN QUERY
        SELECT
            d.id,
            d.title,
            (1 - (d.embedding <=> query_embedding))::real AS similarity -- cosine similarity; cast to match the REAL return column
        FROM
            documents d
        ORDER BY
            d.embedding <=> query_embedding -- assumes an HNSW index built with vector_cosine_ops on documents.embedding
        LIMIT
            result_limit;
    END;
    $$ LANGUAGE plpgsql;
    
    -- Usage from your application (substitute a real 768-dimension vector literal for '[...]'):
    -- For a premium user:
    SELECT * FROM get_similar_documents('[...]'::vector, 'premium', 10);
    
    -- For a free user:
    SELECT * FROM get_similar_documents('[...]'::vector, 'free', 10);

    This pattern provides fine-grained control over your performance SLAs without requiring separate indexes or database clusters.
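
    To make the recall target measurable, a small script can sweep ef_search values and compare approximate results against exact ground truth. A minimal sketch: the table and column names (items, id, embedding), the DB_CONN_STRING environment variable, and the reuse of stored embeddings as query vectors are all illustrative assumptions.

    python
    # Sketch: sweep hnsw.ef_search and measure recall@K plus rough query latency.
    import os
    import time
    import psycopg2
    
    DB_CONN = os.environ.get("DB_CONN_STRING")
    EF_SEARCH_VALUES = [40, 75, 100, 150, 200]
    K = 10
    NUM_QUERIES = 50
    
    def main():
        conn = psycopg2.connect(DB_CONN)
        conn.autocommit = True
        cur = conn.cursor()
    
        # Reuse stored embeddings as query vectors (good enough for a rough benchmark)
        cur.execute("SELECT embedding FROM items ORDER BY random() LIMIT %s;", (NUM_QUERIES,))
        queries = [row[0] for row in cur.fetchall()]
    
        # Ground truth: exact top-K per query, forcing a sequential scan + sort
        cur.execute("SET enable_indexscan = off;")
        truth = []
        for q in queries:
            cur.execute("SELECT id FROM items ORDER BY embedding <-> %s LIMIT %s;", (q, K))
            truth.append({row[0] for row in cur.fetchall()})
        cur.execute("SET enable_indexscan = on;")
    
        for ef in EF_SEARCH_VALUES:
            cur.execute(f"SET hnsw.ef_search = {ef};")
            hits, latencies = 0, []
            for q, expected in zip(queries, truth):
                start = time.time()
                cur.execute("SELECT id FROM items ORDER BY embedding <-> %s LIMIT %s;", (q, K))
                found = {row[0] for row in cur.fetchall()}
                latencies.append(time.time() - start)
                hits += len(found & expected)
            recall = hits / (K * len(queries))
            p99_ms = sorted(latencies)[int(len(latencies) * 0.99) - 1] * 1000  # rough p99
            print(f"ef_search={ef}: recall@{K}={recall:.3f}, ~p99={p99_ms:.1f} ms")
    
    if __name__ == "__main__":
        main()

    Plot the resulting recall and latency pairs to pick the smallest ef_search that meets your recall target.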


    Advanced Topic: Handling Index Bloat from `UPDATE` and `DELETE`

    A critical operational detail often missed by new pgvector users is how HNSW indexes handle data modification. HNSW index entries are not rewritten in place when the underlying rows change. When you UPDATE a vector or DELETE a row:

  • The old entry in the index is simply marked as "dead."
  • It is not immediately removed from the graph structure.
  • If you UPDATE, a completely new entry is inserted into the graph.

    The Consequence: Index Bloat and Performance Degradation

    Over time, with frequent updates or deletes, your HNSW index will accumulate a large number of these dead tuples. This has two severe impacts:

  • Wasted Memory/Disk: The index size grows unnecessarily, consuming valuable RAM in the shared buffer cache.
  • Increased Query Latency: During a search, the graph traversal will still visit these dead nodes. The system has to perform extra work to check the visibility of each node it encounters, discarding the dead ones and continuing the search. This adds significant latency.
    The Solution: Proactive Re-indexing

    Standard PostgreSQL VACUUM cleans up dead rows in the table (the heap) and will eventually remove dead entries from the HNSW graph as well, but vacuuming an HNSW index is slow, does not shrink the index on disk, and does not restore the graph to the quality of a freshly built one. The most reliable way to reclaim space and restore performance is to rebuild the index.

    REINDEX is the command, but a standard REINDEX takes a lock that blocks writes to the table (and any queries that need the index) for the duration of the rebuild. The production-safe solution is REINDEX CONCURRENTLY.

    Production-Grade Re-indexing Strategy:

  • Monitor Bloat: You need a way to detect when the index is becoming bloated. While there's no perfect metric for HNSW bloat, you can correlate a drop in query performance with the number of dead tuples in the underlying table.
        sql
        -- Check for dead tuples in your table
        SELECT 
            relname,
            n_live_tup,
            n_dead_tup,
            n_dead_tup * 100.0 / NULLIF(n_live_tup + n_dead_tup, 0) as dead_tup_pct
        FROM pg_stat_user_tables
        WHERE relname = 'your_vector_table';
  • Automate Re-indexing: Create a scheduled job (e.g., via pg_cron or an external scheduler) that runs REINDEX CONCURRENTLY during a low-traffic maintenance window when bloat exceeds a certain threshold (e.g., 20%). A pg_cron sketch follows the REINDEX example below.
        sql
        -- This command builds a new, clean index in the background.
        -- Once complete, it swaps it with the old one atomically.
        -- It requires more disk space temporarily but avoids locking.
        REINDEX INDEX CONCURRENTLY your_hnsw_index_name;
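
    If pg_cron is available, the rebuild can be scheduled in-database. A sketch using the same placeholder index name; it assumes the pg_cron extension is installed, and since REINDEX CONCURRENTLY cannot run inside a transaction block, verify that your pg_cron setup executes jobs as ordinary client sessions.

    sql
    -- Hypothetical weekly rebuild, Sundays at 03:00 (assumes the pg_cron extension is installed)
    SELECT cron.schedule(
        'reindex-hnsw-weekly',
        '0 3 * * 0',
        'REINDEX INDEX CONCURRENTLY your_hnsw_index_name'
    );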

    This operational procedure is non-negotiable for any pgvector deployment with a non-trivial rate of data modification.


    Architectural Decision: HNSW vs. IVFFlat

    pgvector offers another popular index type: IVFFlat. Senior engineers must understand when to choose one over the other.

    IVFFlat (Inverted File with Flat Compression):

  • How it works: It's a clustering-based approach. It first partitions your vectors into a fixed number of clusters (lists) using k-means. To search, it identifies the most promising clusters (probes) near your query vector and then performs an exhaustive, exact search only within those clusters.
  • Tuning Knobs: lists (number of clusters, set at build time), probes (clusters to search at query time, set per session). A short example follows the decision matrix below.
    Here is a decision matrix for production systems:

  • Dataset Size: HNSW is excellent for up to ~10M vectors; performance can degrade beyond that. IVFFlat scales better to very large datasets (10M to 1B+ vectors).
  • Recall: HNSW can achieve very high recall (>99%) with proper tuning. IVFFlat generally delivers lower recall than HNSW at the same latency.
  • Index Build Time: HNSW is very slow to build, especially with high ef_construction. IVFFlat builds significantly faster.
  • Index Size: HNSW indexes are larger because the graph structure is stored. IVFFlat indexes are more compact.
  • Data Updates: HNSW handles them poorly and requires periodic re-indexing to combat bloat. IVFFlat handles them better, since adding a new vector only affects one list.
  • Memory Usage: HNSW usage is high; the graph benefits greatly from being in memory. IVFFlat usage is lower; only the centroids and the probed lists need to be loaded.
  • Use Case: HNSW for systems where recall is paramount (RAG, legal search). IVFFlat for massive-scale systems where some recall trade-off is acceptable.
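
    For reference, a minimal IVFFlat setup looks like the sketch below. The table name reuses the earlier items example, and the lists and probes values follow pgvector's usual starting guidance (roughly rows / 1000 lists for up to ~1M rows, and about sqrt(lists) probes).

    sql
    -- IVFFlat builds its clusters from existing data, so create the index after the table is populated
    CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 1000);
    
    -- At query time, widen the search by probing more lists (higher recall, higher latency)
    SET ivfflat.probes = 32;
    SELECT id FROM items ORDER BY embedding <-> '[...]' LIMIT 10;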

    Pragmatic Choice: Start with HNSW. For most applications under 10 million vectors, its superior recall-latency curve is worth the operational overhead. Only consider IVFFlat if you are building an index on a truly massive dataset and your build times with HNSW are becoming operationally infeasible.


    Diagnosing Performance with `EXPLAIN (ANALYZE, BUFFERS)`

    When a vector query is slow, you must be able to diagnose it. EXPLAIN ANALYZE is your most powerful tool.

    Let's analyze a query:

    sql
    SET hnsw.ef_search = 100;
    EXPLAIN (ANALYZE, BUFFERS) 
    SELECT id FROM items ORDER BY embedding <-> '[...]' LIMIT 10;

    Sample Output and What to Look For:

    text
    Limit  (cost=... actual time=25.34..25.35 rows=10 loops=1)
      Buffers: shared hit=15320
      ->  Index Scan using items_embedding_idx on items (cost=... actual time=25.33..25.34 rows=10 loops=1)
            Order By: (embedding <-> '[...]'::vector)
            Buffers: shared hit=15320
    Planning Time: 0.150 ms
    Execution Time: 25.400 ms

    Key Insights from this Output:

  • Index Scan using items_embedding_idx: This is crucial. It confirms the query planner is using your HNSW index. If you see a Seq Scan instead, your index is not being used; common causes are an ORDER BY operator that does not match the index's operator class (for example, <=> against a vector_l2_ops index) or a missing index altogether.
  • actual time=25.33..25.34: This is your core execution time for the search. This is the number you are trying to optimize.
  • Buffers: shared hit=15320: This is the gold mine. It tells you how many 8kB memory pages were accessed from PostgreSQL's shared buffer cache to satisfy the query. hit means they were already in RAM. If you see shared read=..., it means Postgres had to fetch pages from disk, which is orders of magnitude slower. A high number of buffer hits directly correlates with latency and CPU usage. Watch how this number changes as you increase ef_search. This quantifies the "cost" of higher recall.
  • If your vector search is slow, check for shared read. If it's high, it means your index is not fitting into your shared_buffers, and you need to either increase PostgreSQL's memory allocation or provision a machine with more RAM. This is a common scaling bottleneck.
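
    Beyond per-query EXPLAIN output, the cumulative cache hit ratio for the index can be tracked via pg_statio_user_indexes; a minimal check (the index name is illustrative):

    sql
    -- Share of index page reads served from shared buffers since the last stats reset
    SELECT
        indexrelname,
        idx_blks_hit,
        idx_blks_read,
        round(100.0 * idx_blks_hit / NULLIF(idx_blks_hit + idx_blks_read, 0), 2) AS cache_hit_pct
    FROM pg_statio_user_indexes
    WHERE indexrelname = 'items_embedding_idx';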

    Final Recommendations for Production

  • Don't use defaults. Start with m=32, ef_construction=128, and benchmark ef_search to find your application's sweet spot.
  • Benchmark everything. Create a dedicated benchmarking environment with a representative data sample. Plot recall vs. latency curves for different ef_search values to make data-driven decisions.
  • Automate re-indexing. For any write-heavy workload, implement a REINDEX CONCURRENTLY job. Do not wait for performance to degrade before you act.
  • Monitor your buffers. Use EXPLAIN (ANALYZE, BUFFERS) and monitor pg_statio_user_indexes to ensure your HNSW index is being served from RAM. Cache misses are a performance killer.
  • Use SET LOCAL for dynamic tuning. This is a powerful pattern for providing different performance tiers or adapting to varying query complexity without infrastructure changes.
    By moving beyond the default configuration and embracing these advanced operational and tuning patterns, you can build pgvector-based systems that are not only powerful but also scalable, reliable, and performant under the pressures of production traffic.
