Optimizing HNSW Indexing in pgvector for Real-time Semantic Search

25 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Production Bottleneck: When IVFFlat Fails

For any team implementing semantic search or retrieval-augmented generation (RAG) at scale, PostgreSQL with the pgvector extension is an attractive, integrated solution. The initial implementation often uses an IVFFlat index, which is straightforward to create. However, as your dataset grows beyond a few hundred thousand vectors and your query distribution becomes more varied, the limitations of IVFFlat become a critical production issue.

IVFFlat works by partitioning the vector space into lists (Voronoi cells) and searching only a subset of these (probes) at query time. The core problems in a high-scale, dynamic environment are:

  • Centroid Saturation: As data is added, the initial centroids become less representative, leading to imbalanced partitions and degraded search quality.
  • Static Partitioning: The index is static. New vectors that fall far from any existing centroid can be difficult to find, requiring a full, costly re-index to re-cluster the data.
  • The probes Dilemma: Increasing probes improves recall but turns the search into a brute-force scan across a larger portion of the dataset, destroying latency guarantees.
This is where the Hierarchical Navigable Small World (HNSW) graph-based index becomes essential. HNSW offers a superior recall-latency trade-off, especially for high-dimensional, high-cardinality datasets. However, simply switching to CREATE INDEX ... USING hnsw is not a solution. HNSW introduces its own set of complex tuning parameters that, if misunderstood, can lead to poor performance, excessive resource consumption, and operational nightmares. This article is a deep dive into mastering those parameters for production systems.
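
For reference, here is how the two index types are declared in pgvector (table and column names are illustrative):

sql
-- IVFFlat: recall is tuned at query time via the probes setting
CREATE INDEX documents_ivfflat_idx ON documents
USING ivfflat (embedding vector_l2_ops) WITH (lists = 1000);
SET ivfflat.probes = 10;  -- higher = better recall, slower queries

-- HNSW: graph quality is fixed at build time by m and ef_construction
CREATE INDEX documents_hnsw_idx ON documents
USING hnsw (embedding vector_l2_ops) WITH (m = 16, ef_construction = 64);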


    Deep Dive: HNSW Construction Parameters (`m` & `ef_construction`)

    The quality of your HNSW index—and by extension, its search performance—is determined at build time. The two most critical parameters are m and ef_construction.

    `m`: The Connectivity of the Graph

    m defines the maximum number of bidirectional links (or "friends") each node in the graph can have. It directly controls the density and connectivity of the graph.

    * Low m (e.g., 8-12): Creates a sparse graph. This results in a smaller index on disk and in memory, and faster index construction. However, search performance can suffer. The search algorithm may get trapped in a local minimum of the graph, failing to find the true nearest neighbors, leading to lower recall.

    * High m (e.g., 32-64): Creates a dense, highly-connected graph. This significantly improves the chances of finding the optimal path to the nearest neighbors, resulting in higher recall. The costs are substantial: index build times are longer, and the index size can increase dramatically. The memory footprint for the index is roughly proportional to m.

    Production Trade-off: The choice of m is a fundamental trade-off between index quality/recall and resource cost (build time, disk space, RAM). A common starting point for high-dimensional data (768-1536 dimensions) is m=16 or m=24. For applications where recall is absolutely critical and you can afford the resources, pushing m to 32 or 48 might be necessary. You must benchmark this with your own data.

    `ef_construction`: The Search Quality During Build

    ef_construction controls the size of the dynamic candidate list used when inserting new nodes into the graph. For each new vector, the algorithm performs a search to find the m nearest neighbors to connect it to; ef_construction plays the same role for this internal search that ef_search (covered in a later section) plays for queries.

    * Low ef_construction (e.g., 64): Index construction is much faster. The search for neighbors during insertion is less exhaustive, which can lead to a sub-optimal graph structure. This can negatively impact the recall of subsequent searches.

    * High ef_construction (e.g., 256-512): Index construction is significantly slower. However, the algorithm performs a more thorough search for each new node's neighbors, resulting in a higher-quality graph with better connectivity. This almost always translates to better search recall for the same ef_search value at query time.

    Production Trade-off: ef_construction is a direct trade-off between index build time and index quality. A good rule of thumb is to set ef_construction to at least 4 * m. If your indexing process is offline and you can afford a longer build time, increasing ef_construction is one of the most effective ways to improve the baseline quality of your index.

    Code Example 1: Benchmarking Index Build Parameters

    Let's quantify these trade-offs. The following Python script uses psycopg2, the pgvector Python adapter, numpy, and sentence-transformers to generate a sample dataset, insert it into PostgreSQL, and build several HNSW indexes with different parameters, measuring build time and index size.

    Prerequisites:

    bash
    pip install psycopg2-binary pgvector numpy sentence-transformers tqdm
    python
    import psycopg2
    import psycopg2.extras  # required for execute_values below
    import numpy as np
    import time
    from pgvector.psycopg2 import register_vector  # adapts numpy arrays to the vector type
    from sentence_transformers import SentenceTransformer
    from tqdm import tqdm
    
    # --- Configuration ---
    DB_PARAMS = {
        'dbname': 'vector_db',
        'user': 'postgres',
        'password': 'password',
        'host': 'localhost',
        'port': '5432'
    }
    NUM_VECTORS = 100_000
    VECTOR_DIM = 384  # Using all-MiniLM-L6-v2 model
    TABLE_NAME = 'documents_hnsw_bench'
    
    # --- Sample Data Generation ---
    # In a real scenario, this would be your actual data.
    # We generate random sentences for demonstration.
    SAMPLE_SENTENCES = [
        "The quick brown fox jumps over the lazy dog.",
        "Artificial intelligence is transforming industries.",
        "PostgreSQL is a powerful open-source relational database.",
        "Vector databases are optimized for similarity search.",
        "The new CPU architecture offers significant performance gains.",
        "Climate change requires urgent global action."
    ]
    
    def get_db_connection():
        conn = psycopg2.connect(**DB_PARAMS)
        with conn.cursor() as cur:
            # The vector extension must exist before the adapter can be registered
            cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
        conn.commit()
        register_vector(conn)  # lets us pass numpy arrays directly as vector parameters
        return conn
    
    def setup_database(conn):
        with conn.cursor() as cur:
            cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
            cur.execute(f"DROP TABLE IF EXISTS {TABLE_NAME};")
            cur.execute(f"""
                CREATE TABLE {TABLE_NAME} (
                    id SERIAL PRIMARY KEY,
                    content TEXT,
                    embedding VECTOR({VECTOR_DIM})
                );
            """)
            conn.commit()
    
    def populate_data(conn):
        print("Loading sentence transformer model...")
        model = SentenceTransformer('all-MiniLM-L6-v2')
        print(f"Generating and inserting {NUM_VECTORS} vectors...")
        
        with conn.cursor() as cur:
            # Generate random text data
            docs = np.random.choice(SAMPLE_SENTENCES, NUM_VECTORS)
            
            # Generate embeddings in batches
            batch_size = 500
            for i in tqdm(range(0, len(docs), batch_size)):
                batch_docs = docs[i:i + batch_size].tolist()
                embeddings = model.encode(batch_docs)
                
                # Prepare data for insertion; register_vector adapts the numpy
                # embedding arrays straight into the vector column
                data_to_insert = list(zip(batch_docs, embeddings))
                
                # Use execute_values for efficient bulk insert
                psycopg2.extras.execute_values(
                    cur,
                    f"INSERT INTO {TABLE_NAME} (content, embedding) VALUES %s",
                    data_to_insert
                )
            conn.commit()
    
    def benchmark_index(conn, m, ef_construction):
        index_name = f'hnsw_idx_m{m}_efc{ef_construction}'
        print(f'\n--- Benchmarking Index: {index_name} ---')
        
        with conn.cursor() as cur:
            # Drop index if it exists from a previous run
            cur.execute(f"DROP INDEX IF EXISTS {index_name};")
            conn.commit()
    
            start_time = time.time()
            cur.execute("SET maintenance_work_mem = '2GB';") # Give more memory for index build
            cur.execute(f"""
                CREATE INDEX {index_name} ON {TABLE_NAME} USING hnsw (embedding vector_l2_ops)
                WITH (m = {m}, ef_construction = {ef_construction});
            """)
            conn.commit()
            end_time = time.time()
    
            build_time = end_time - start_time
    
            cur.execute(f"SELECT pg_size_pretty(pg_relation_size('{index_name}'));")
            index_size = cur.fetchone()[0]
    
            print(f"Build Time: {build_time:.2f} seconds")
            print(f"Index Size: {index_size}")
            
            return {
                'name': index_name,
                'm': m,
                'ef_construction': ef_construction,
                'build_time_s': build_time,
                'index_size': index_size
            }
    
    def main():
        conn = get_db_connection()
        setup_database(conn)
        populate_data(conn)
    
        results = []
        # Test configurations
        configs = [
            {'m': 16, 'ef_construction': 64},
            {'m': 16, 'ef_construction': 128},
            {'m': 32, 'ef_construction': 128},
            {'m': 32, 'ef_construction': 256},
            {'m': 48, 'ef_construction': 256},
        ]
    
        for config in configs:
            result = benchmark_index(conn, config['m'], config['ef_construction'])
            results.append(result)
    
        conn.close()
    
        print("\n--- Benchmark Summary ---")
        print("{:<30} | {:<5} | {:<10} | {:<15} | {:<10}".format(
            'Index Name', 'm', 'ef_const', 'Build Time (s)', 'Size'))
        print("-" * 80)
        for res in results:
            print("{:<30} | {:<5} | {:<10} | {:<15.2f} | {:<10}".format(
                res['name'], res['m'], res['ef_construction'], res['build_time_s'], res['index_size']))
    
    if __name__ == '__main__':
        main()

    Expected Output (will vary based on hardware):

    text
    --- Benchmark Summary ---
    Index Name                     | m     | ef_const   | Build Time (s)  | Size      
    --------------------------------------------------------------------------------
    hnsw_idx_m16_efc64             | 16    | 64         | 45.81           | 83 MB     
    hnsw_idx_m16_efc128            | 16    | 128        | 68.23           | 83 MB     
    hnsw_idx_m32_efc128            | 32    | 128        | 110.55          | 165 MB    
    hnsw_idx_m32_efc256            | 32    | 256        | 185.90          | 165 MB    
    hnsw_idx_m48_efc256            | 48    | 256        | 250.14          | 248 MB    

    Analysis:

    * Doubling m (16 to 32) roughly doubles the index size, as expected.

    * Increasing ef_construction for a fixed m does not change the index size but significantly increases build time (compare m=16/efc=64 vs m=16/efc=128). This is the cost of building a better-quality graph.

    * The combination of high m and high ef_construction results in the longest build times.


    Production Indexing Strategies for Live Data

    HNSW indexes are expensive to build. On a table with tens of millions of vectors, a CREATE INDEX operation can take hours. You cannot afford to re-index the entire table every time new data arrives. This is a critical operational challenge that requires a specific architectural pattern.

    Pattern: The Dual-Table Approach for Zero-Downtime Indexing

    This pattern is highly effective for systems that need to ingest and serve queries on new data in near real-time.

    The Architecture:

  • documents_main: The primary table containing the vast majority of your vectors. This table has a high-quality HNSW index built on it.
  • documents_recent: A smaller, secondary table that receives all new writes. This table has no index on its vector column.
    Query Flow:

    When a query arrives, your application logic must query both tables and merge the results.

    * Query 1 (ANN Search): Perform an approximate nearest neighbor search on documents_main using the HNSW index (the <-> L2 distance operator, matching the vector_l2_ops index). Limit the results (e.g., LIMIT 100).

    * Query 2 (Exact Search): Perform an exact, brute-force nearest neighbor search on documents_recent. Since this table is small (e.g., <100,000 vectors), a sequential scan is fast enough.

    * Merge & Re-rank: Your application code fetches both result sets and combines them. Both queries compute distances against the same query vector, so the scores are directly comparable; sort by distance and return the top K results.

    The Background Job: Merging and Re-indexing

    A periodic background job (e.g., a cron job or a scheduled task) is responsible for maintaining this system:

  • Threshold Check: The job runs periodically (e.g., every hour) and checks the size of documents_recent.
  • Merge: If the table exceeds a certain threshold (e.g., 50,000 rows), the job atomically moves all rows into the main table (conceptually INSERT INTO documents_main SELECT * FROM documents_recent; followed by TRUNCATE documents_recent; — see the maintenance script below for a race-free variant).
  • Re-index: The documents_main table now contains new, un-indexed data. The job triggers a non-blocking re-index: REINDEX INDEX CONCURRENTLY my_hnsw_index;

    REINDEX CONCURRENTLY (available in PostgreSQL 12 and later) is crucial. It builds a new, fresh index in the background without taking locks that would block reads or writes on the main table. Once complete, it atomically swaps the old index for the new one.
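
    Because a concurrent rebuild of a large HNSW index can run for hours, it is worth monitoring. A minimal check, assuming PostgreSQL 12+, where index builds (including REINDEX CONCURRENTLY) report progress in pg_stat_progress_create_index:

    sql
    -- Progress of any in-flight index builds on this server
    SELECT pid, relid::regclass AS table_name, phase, blocks_done, blocks_total
    FROM pg_stat_progress_create_index;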

    Code Example 2: Dual-Table Query and Maintenance Logic

    SQL DDL:

    sql
    -- Main table with the expensive HNSW index
    CREATE TABLE documents_main (
        id BIGINT PRIMARY KEY,
        content TEXT,
        embedding VECTOR(384)
    );
    CREATE INDEX hnsw_main_idx ON documents_main USING hnsw (embedding vector_l2_ops)
    WITH (m = 32, ef_construction = 128);
    
    -- Table for recent, unindexed writes
    CREATE TABLE documents_recent (
        id BIGINT PRIMARY KEY,
        content TEXT,
        embedding VECTOR(384)
    );

    Application Query Logic (Python/psycopg2):

    python
    import numpy as np
    
    def query_hybrid(conn, query_embedding, top_k=10):
        # query_embedding is a numpy array; register_vector (called in
        # get_db_connection above) adapts it to the vector type directly
    
        results = []
    
        with conn.cursor() as cur:
            # Query 1: ANN search on the main indexed table
            # We fetch more than top_k to account for potentially better matches in the recent table
            cur.execute("""
                SELECT id, content, embedding <-> %s AS distance
                FROM documents_main
                ORDER BY distance
                LIMIT %s;
            """, (query_vector_list, top_k * 2))
            main_results = cur.fetchall()
            results.extend(main_results)
    
            # Query 2: Exact search on the recent unindexed table
            cur.execute("""
                SELECT id, content, embedding <-> %s AS distance
                FROM documents_recent
                ORDER BY distance
                LIMIT %s;
            """, (query_vector_list, top_k))
            recent_results = cur.fetchall()
            results.extend(recent_results)
    
        # Merge and re-rank in the application
        # This is a simple sort. A more robust implementation might handle duplicate IDs.
        results.sort(key=lambda x: x[2]) # Sort by distance
    
        return results[:top_k]
    
    # --- Usage ---
    # conn = get_db_connection()
    # model = SentenceTransformer('all-MiniLM-L6-v2')
    # query_embedding = model.encode(["What are the latest advances in AI?"])[0]
    # top_results = query_hybrid(conn, query_embedding, top_k=10)
    # for res in top_results:
    #     print(f"ID: {res[0]}, Distance: {res[2]:.4f}, Content: {res[1]}")

    Maintenance Job (Pseudo-code/Bash):

    bash
    #!/bin/bash
    
    DB_USER="postgres"
    DB_NAME="vector_db"
    RECENT_TABLE="documents_recent"
    MAIN_TABLE="documents_main"
    INDEX_NAME="hnsw_main_idx"
    THRESHOLD=50000
    
    ROW_COUNT=$(psql -U $DB_USER -d $DB_NAME -t -A -c "SELECT count(*) FROM $RECENT_TABLE;")
    
    if [ "$ROW_COUNT" -gt "$THRESHOLD" ]; then
        echo "Threshold exceeded. Merging tables..."
        # Atomic move: rows written to the recent table mid-merge are not lost
        psql -U $DB_USER -d $DB_NAME -c "WITH moved AS (DELETE FROM $RECENT_TABLE RETURNING *) INSERT INTO $MAIN_TABLE SELECT * FROM moved;"
        
        echo "Merge complete. Starting concurrent reindex..."
        # Run in the background to not block the script
        nohup psql -U $DB_USER -d $DB_NAME -c "REINDEX INDEX CONCURRENTLY $INDEX_NAME;" & 
        echo "Reindex command issued."
    else
        echo "Row count is below threshold. No action needed."
    fi

    Tuning Query-Time Performance (`ef_search`)

    Once you have a high-quality index, ef_search is your primary lever for tuning the live query trade-off between latency and recall.

    ef_search defines the size of the dynamic candidate list during a search. It works similarly to ef_construction but at query time. A search starts at an entry point in the graph and greedily traverses it, always moving closer to the query vector. ef_search maintains a priority queue of the best candidates found so far. A larger ef_search means a more exhaustive search of the graph, increasing the probability of finding the true nearest neighbors.

    * Low ef_search (e.g., top_k to top_k + 20): Extremely fast queries. The search is very localized and may not find the best global matches. Latency will be low, but recall may be unacceptable. Note that the index returns at most ef_search rows, so ef_search must be at least as large as your LIMIT.

    * High ef_search (e.g., 100-400): Slower, more CPU-intensive queries. The search explores a much larger portion of the graph, significantly increasing recall. Latency will be higher.

    The Critical Production Pattern: Dynamic `ef_search`

    You should not hardcode ef_search. The optimal value can depend on the use case. A user-facing real-time search might have a strict P99 latency SLO of 100ms, while an offline analytics job might prioritize recall above all else.

    The solution is to set ef_search on a per-transaction basis using SET LOCAL.
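
    One gotcha: SET LOCAL only takes effect inside an explicit transaction block; outside one it emits a warning and has no effect. (psycopg2 opens a transaction implicitly, which is why the Python example below works.) In plain SQL, a sketch:

    sql
    BEGIN;
    SET LOCAL hnsw.ef_search = 100;  -- reverts automatically at COMMIT/ROLLBACK
    SELECT id FROM documents_main
    ORDER BY embedding <-> '[...]'   -- placeholder: a real 384-dim vector literal
    LIMIT 10;
    COMMIT;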

    Code Example 3: Dynamic `ef_search` and Latency/Recall Benchmarking

    This example shows how to wrap your query to set ef_search dynamically and includes a function to benchmark the impact on latency and recall.

    python
    # (Assuming setup from previous examples)
    
    def execute_query_with_dynamic_ef_search(conn, query_embedding, top_k, ef_search):
        with conn.cursor() as cur:
            # Set ef_search for the current transaction only (psycopg2 opens an
            # implicit transaction, which is what makes SET LOCAL effective here)
            cur.execute(f"SET LOCAL hnsw.ef_search = {ef_search};")
            
            start_time = time.time()
            cur.execute("""
                SELECT id, embedding <-> %s AS distance
                FROM documents_main
                ORDER BY distance
                LIMIT %s;
            """, (query_vector_list, top_k))
            results = cur.fetchall()
            end_time = time.time()
            
            latency_ms = (end_time - start_time) * 1000
            return results, latency_ms
    
    # For benchmarking recall, we need a ground truth. 
    # This is computationally expensive.
    def get_ground_truth(conn, query_embedding, top_k):
        print("Calculating ground truth (this may be slow)...")
        with conn.cursor() as cur:
            # Disable index scans to force an exact (sequential scan) search
            cur.execute("SET LOCAL enable_indexscan = off;")
            cur.execute("""
                SELECT id FROM documents_main ORDER BY embedding <-> %s LIMIT %s;
            """, (query_embedding, top_k))
            ids = {row[0] for row in cur.fetchall()}
        conn.commit()  # end the transaction so SET LOCAL does not leak into later queries
        return ids
    
    def benchmark_recall_latency(conn, query_embedding, top_k=10):
        ground_truth_ids = get_ground_truth(conn, query_embedding, top_k)
        
        ef_search_values = [10, 20, 40, 80, 160, 320]
        print("\n--- ef_search Benchmark ---")
        print("{:<15} | {:<15} | {:<10}".format('ef_search', 'P99 Latency (ms)', 'Recall@10'))
        print("-" * 45)
    
        for ef in ef_search_values:
            latencies = []
            retrieved_ids = set()
            
            # Run multiple queries to get a stable latency measure
            for _ in range(20):
                results, latency = execute_query_with_dynamic_ef_search(conn, query_embedding, top_k, ef)
                latencies.append(latency)
                if not retrieved_ids:
                    retrieved_ids = {row[0] for row in results}
            
            p99_latency = np.percentile(latencies, 99)
            recall = len(ground_truth_ids.intersection(retrieved_ids)) / len(ground_truth_ids)
            
            print("{:<15} | {:<15.2f} | {:<10.2f}".format(ef, p99_latency, recall))
    
    # --- Usage ---
    # conn = get_db_connection()
    # model = SentenceTransformer('all-MiniLM-L6-v2')
    # # Use a consistent query vector for benchmarking
    # bench_embedding = model.encode(["What is PostgreSQL?"])[0]
    # benchmark_recall_latency(conn, bench_embedding, top_k=10)

    Expected Benchmark Output:

    text
    --- ef_search Benchmark ---
    ef_search       | P99 Latency (ms) | Recall@10 
    ---------------------------------------------
    10              | 8.54             | 0.70        
    20              | 12.11            | 0.90        
    40              | 19.87            | 1.00        
    80              | 35.43            | 1.00        
    160             | 68.91            | 1.00        
    320             | 135.20           | 1.00        

    Analysis: This table is the key to making an informed decision. Here, we can see that ef_search=40 achieves perfect recall for this query, with a P99 latency under 20ms. Increasing it further only adds latency with no benefit to recall. For a real-time API with a 100ms SLO, ef_search=80 provides a comfortable margin. This data-driven approach is non-negotiable for production tuning.


    Advanced Operational Considerations & Edge Cases

    Memory and I/O

    HNSW's performance relies on the graph structure being in memory. If PostgreSQL has to fetch parts of the index from disk during a search, latency will be orders of magnitude worse. You must ensure your HNSW index fits within PostgreSQL's shared_buffers.

    Estimating Memory Usage:

    A rough formula for HNSW index size in memory is:

    size ≈ num_vectors * (vector_dim * 4 + 4) + num_vectors * m * 2 * 4 bytes

    The first term is the stored vectors (4 bytes per float dimension plus overhead); the second is the graph links (up to 2 * m neighbors per node at the base layer, 4 bytes per link).

    For 1 million 384-dim vectors with m=32:

    1,000,000 * (384 * 4 + 4) + 1,000,000 * 32 * 2 * 4 ≈ 1.54 GB + 256 MB ≈ 1.8 GB

    You need to have at least this much RAM available in shared_buffers and allocated to your PostgreSQL instance, plus overhead for the OS and other connections.
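
    A quick sanity check against a live instance (index name follows the earlier examples):

    sql
    SHOW shared_buffers;
    SELECT pg_size_pretty(pg_relation_size('hnsw_main_idx')) AS index_size;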

    Cache Warming: After a server restart or index creation, the index will not be in the page cache. The first few queries will be slow as they read the index from disk. You can implement a "cache warming" script that runs a few representative queries after a deployment or restart to pre-load the index into memory.
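
    Alternatively, the pg_prewarm contrib extension can load the index into the buffer cache directly, without hand-crafting representative queries; a minimal sketch:

    sql
    CREATE EXTENSION IF NOT EXISTS pg_prewarm;
    -- Read every block of the HNSW index into shared_buffers
    SELECT pg_prewarm('hnsw_main_idx');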

    The Silent Killer: Index Bloat from Deletes and Updates

    This is a critical, often overlooked issue. When you DELETE or UPDATE a row in a table indexed by pgvector's HNSW, the space is not reclaimed from the index. The old vector is marked as dead, but its node and connections remain in the graph structure, occupying space and potentially being traversed during searches (though they will be filtered out).

    Over time, with many deletes and updates, your index will become bloated with dead tuples. This increases its size on disk and in memory and can degrade search performance as the algorithm wastes time traversing dead parts of the graph.

    The Solution: There is no automatic VACUUM process for this. The only way to reclaim the space and fix the bloat is to periodically run REINDEX INDEX CONCURRENTLY your_index_name;. For tables with a high churn rate, this might need to be scheduled weekly or even daily during low-traffic periods. You must monitor index bloat using PostgreSQL's statistics views.
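
    A simple monitoring starting point, using the table and index names from the earlier examples: watch the table's dead-tuple count and the index's size relative to a stable row count.

    sql
    -- Dead tuples accumulated in the indexed table
    SELECT relname, n_live_tup, n_dead_tup
    FROM pg_stat_user_tables
    WHERE relname = 'documents_main';
    
    -- If this keeps growing while n_live_tup is flat, schedule a REINDEX
    SELECT pg_size_pretty(pg_relation_size('hnsw_main_idx'));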

    The Future: Quantization

    For hyper-scale datasets (hundreds of millions or billions of vectors), even an optimized HNSW index can become too large. The next frontier in pgvector and other vector databases is quantization. Techniques like Scalar Quantization (SQ) and Product Quantization (PQ) compress the vectors themselves, drastically reducing the memory and disk footprint at the cost of some precision. As of this writing, pgvector has experimental support for SQ, which is a promising development for reducing the operational cost of very large-scale semantic search systems.
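
    As a taste of the direction this is heading: recent pgvector releases (0.7.0 and later) ship a half-precision halfvec type that can back an HNSW expression index, roughly halving vector storage. A sketch, assuming a sufficiently recent pgvector:

    sql
    -- HNSW index over a half-precision cast of the embedding
    CREATE INDEX documents_halfvec_idx ON documents_main
    USING hnsw ((embedding::halfvec(384)) halfvec_l2_ops)
    WITH (m = 32, ef_construction = 128);
    
    -- Queries must use the same cast expression to hit the index
    SELECT id FROM documents_main
    ORDER BY embedding::halfvec(384) <-> '[...]'::halfvec(384)  -- placeholder literal
    LIMIT 10;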

    Conclusion

    Treating pgvector's HNSW index as a black box is a recipe for production failure. A principled, benchmark-driven approach to tuning is essential for building a scalable and reliable semantic search system. The key takeaways for senior engineers are:

  • Tune m and ef_construction at Build Time: This is your foundation. Use benchmarks to find the right balance between build cost and the baseline quality of your index.
  • Implement a Strategy for Live Data: Do not re-index your entire table on every write. The dual-table pattern with concurrent re-indexing is a robust, production-proven solution.
  • Use ef_search as Your Real-time Knob: Set it dynamically per query based on your application's specific latency and recall requirements. Never hardcode it.
  • Manage Your Resources: HNSW is memory-hungry. Provision your database servers accordingly and monitor index size and cache hit rates.
  • Plan for Maintenance: Index bloat from deletes is a real problem. Schedule periodic re-indexing to maintain performance over the long term.