Optimizing HNSW Indexes in pgvector for Real-time Semantic Search

Goh Ling Yong

The Production Challenge: Beyond Naive Vector Search

If you're reading this, you've likely already implemented a semantic search or RAG (Retrieval-Augmented Generation) system using pgvector. You've embedded your documents, stored them in a vector column, and are using the cosine distance operator (<=>) to find the nearest neighbors. The initial results were promising. But now, in a staging or production environment with millions of vectors and concurrent users, you're facing the difficult questions:

  • Why is our p99 query latency spiking above our 100ms SLA?
  • Our recall is hovering around 85%, but our product team is demanding 95%+. How do we get there without quadrupling our hardware costs?
  • Index build times are taking hours, blocking our CI/CD pipeline and slowing down data ingestion. Can we speed this up?
  • In our multi-tenant application, queries with a WHERE tenant_id = ? clause are sometimes slow and return fewer results than expected. Why?

    These are not beginner problems. They are the complex realities of running vector search at scale. The solution lies not in changing your model or your database, but in a deep, rigorous understanding of your chosen Approximate Nearest Neighbor (ANN) index. For pgvector, that often means HNSW (Hierarchical Navigable Small Worlds).

    This article is a comprehensive guide to mastering the HNSW index in pgvector. We will dissect its core tuning parameters—m, ef_construction, and ef_search—and provide a systematic methodology with production-grade code to benchmark and optimize your specific workload. We will move beyond the documentation and into the territory of edge cases, memory management, and architectural patterns that separate a proof-of-concept from a resilient, high-performance production system.


    Dissecting the HNSW Tuning Knobs

    HNSW builds a multi-layered graph of vectors where upper layers contain long-range connections and lower layers contain short-range connections. Search starts at the top layer and navigates greedily toward the target until it reaches the bottom layer, where a more exhaustive search is performed. The efficiency and accuracy of this process are governed by three critical parameters you set.
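
    To make the descent concrete, here is a minimal, illustrative Python sketch of the greedy walk performed on the upper layers. The neighbors and dist helpers are assumptions for illustration; pgvector's real implementation is C code that maintains a candidate list rather than a single walker.

    python
    def greedy_search_layer(entry, query, neighbors, dist):
        """Greedily walk one HNSW layer toward the query vector.

        'neighbors' (adjacency lookup) and 'dist' (e.g., cosine distance)
        are assumed helpers; this is a sketch, not pgvector's implementation.
        """
        current = entry
        current_dist = dist(current, query)
        improved = True
        while improved:
            improved = False
            for candidate in neighbors(current):
                d = dist(candidate, query)
                if d < current_dist:
                    current, current_dist = candidate, d
                    improved = True
        # The closest node found becomes the entry point for the next,
        # denser layer; the bottom layer then runs a wider candidate search.
        return current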

    1. `m`: The Graph's Foundation

  • What it is: m defines the maximum number of bidirectional connections (or degree) each node in the graph can have. It is set only at index creation time.
  • Technical Impact: m directly controls the density of the graph. A low m (e.g., 8) creates a sparse graph, while a high m (e.g., 48) creates a dense, highly interconnected one.
  • The Trade-off of a higher m:

    - Pro: Significantly improves recall. A denser graph provides more pathways to the true nearest neighbors, making the search more robust.

    - Con: Drastically increases index build time. The process of finding the best m neighbors for each new node is computationally expensive.

    - Con: Increases the index size and, critically, the memory footprint. A simple rule of thumb is that index size is proportional to d × m × 2 × 4 bytes per vector, where d is the vector dimension. For a 1536-dimension vector (like OpenAI's text-embedding-ada-002) and m = 32, this is 1536 × 32 × 2 × 4 bytes ≈ 393 KB per vector, on top of the base vector storage.

  • Production Guideline: Start with a default of m = 16. For applications requiring very high recall (>98%), you may need to increase it to 24 or 32. Values above 48 often yield diminishing returns and come with a severe penalty in build time and memory. Never change m lightly; it requires a full index rebuild.
    sql
    -- Example: Creating an HNSW index with a specific 'm'
    -- Assuming a table 'documents' with a vector column 'embedding' of 1536 dimensions
    CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops) WITH (m = 24);

    2. `ef_construction`: Building a Quality Index

  • What it is: During index construction, for each new vector being added, HNSW performs a search to find its nearest neighbors to connect to. ef_construction is the size of the dynamic candidate list used during this search.
  • Technical Impact: A larger ef_construction allows the construction algorithm to explore more potential neighbors, leading to a higher-quality graph structure. It's a build-time-only parameter.
  • The Trade-off of a higher ef_construction:

    - Pro: Improves the quality of the index graph, which indirectly leads to better recall at query time for a given ef_search value.

    - Con: Increases index build time. The relationship is often linear; doubling ef_construction can nearly double build time.

  • Production Guideline: A good starting point is ef_construction = 2 × m. If your build times are acceptable and you are struggling to hit your recall target, increasing ef_construction to 4 × m can provide a noticeable boost in index quality. For a table with 10 million vectors, this change could mean the difference between a 4-hour build and an 8-hour build. (Server settings that also affect build time are sketched after the SQL example below.)
    sql
    -- Example: Creating an HNSW index with both 'm' and 'ef_construction'
    CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops) WITH (m = 24, ef_construction = 64);
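
    Build time also depends heavily on PostgreSQL server settings, not just the index parameters. A hedged sketch, assuming pgvector 0.6.0+ (for parallel builds) and illustrative values you should size to your hardware:

    python
    import psycopg2

    def build_hnsw_index(conn, m=24, ef_construction=64):
        """Build an HNSW index with build-friendly session settings (values illustrative)."""
        conn.autocommit = True  # run DDL outside an open transaction
        with conn.cursor() as cur:
            cur.execute("SET maintenance_work_mem = '8GB';")          # keep the graph build in RAM
            cur.execute("SET max_parallel_maintenance_workers = 7;")  # workers in addition to the leader
            cur.execute(f"""
                CREATE INDEX ON documents
                USING hnsw (embedding vector_cosine_ops)
                WITH (m = {m}, ef_construction = {ef_construction});
            """)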

    3. `ef_search`: The Latency vs. Recall Lever

  • What it is: This is the query-time equivalent of ef_construction. It defines the size of the dynamic candidate list during a search operation. Unlike the others, this is not set at index creation. It is a runtime parameter.
  • Technical Impact: This is your primary tool for tuning the live performance of your application. A higher ef_search value makes the search more exhaustive, increasing the probability of finding the true nearest neighbors (higher recall) at the cost of higher query latency.
  • The Trade-off of a higher ef_search:

    - Pro: Directly increases recall.

    - Con: Directly increases query latency. The search algorithm has to perform more distance calculations and traverse more of the graph.

  • Production Guideline: This parameter must be tuned empirically. It can be set for the entire session or per transaction, which allows dynamic adjustment: a background job might use a high ef_search for maximum accuracy, while a user-facing API endpoint uses a lower value to meet latency SLAs. (A Python version of this pattern follows the SQL example below.)
    sql
    -- Setting ef_search for the current session
    SET hnsw.ef_search = 100;
    
    -- Run your query
    SELECT id, content FROM documents ORDER BY embedding <=> $1 LIMIT 10;
    
    -- It's often safer to set it within a transaction for a specific query
    BEGIN;
    SET LOCAL hnsw.ef_search = 100;
    SELECT id, content FROM documents ORDER BY embedding <=> $1 LIMIT 10;
    COMMIT;
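
    In application code, the SET LOCAL pattern maps naturally onto one transaction per request. A minimal psycopg2 sketch (the function name and defaults are ours):

    python
    def semantic_search(conn, query_vec, k=10, ef_search=100):
        """Run one search with a request-scoped ef_search.

        SET LOCAL expires at COMMIT, so pooled connections cannot leak
        settings across requests. ef_search also caps how many rows an
        HNSW scan can return, so keep it >= LIMIT.
        """
        with conn:  # opens a transaction, commits on successful exit
            with conn.cursor() as cur:
                cur.execute("SET LOCAL hnsw.ef_search = %s;", (ef_search,))
                cur.execute(
                    "SELECT id, content FROM documents "
                    "ORDER BY embedding <=> %s::vector LIMIT %s",
                    (query_vec, k),
                )
                return cur.fetchall()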

    A Systematic Methodology for Production Tuning

    Guessing these parameters is a recipe for failure. You need a rigorous, data-driven approach. Here is a step-by-step methodology using Python and psycopg2.

    Step 1: Establish Ground Truth

    You cannot measure recall without knowing the actual correct answers. The first step is to take a representative sample of your query vectors (e.g., 1,000 queries) and find their true k-nearest neighbors using an exact, full-scan search. This will be slow, but it only needs to be done once.

    python
    import psycopg2
    import numpy as np
    
    def get_ground_truth(conn, query_vectors, k=100):
        """Calculates the exact nearest neighbors for a set of query vectors."""
        ground_truth = {}
        with conn.cursor() as cur:
            # Temporarily disable index scans to ensure a full scan
            cur.execute("SET enable_seqscan = on;")
            cur.execute("SET enable_indexscan = off;")
            cur.execute("SET enable_bitmapscan = off;")
    
            for i, vec in enumerate(query_vectors):
                print(f"Calculating ground truth for query {i+1}/{len(query_vectors)}...")
                # Explicit ::vector cast so the adapted array literal
                # matches the vector <=> operator unambiguously.
                cur.execute(
                    "SELECT id FROM documents ORDER BY embedding <=> %s::vector LIMIT %s",
                    (vec.tolist(), k)
                )
                ground_truth[i] = {row[0] for row in cur.fetchall()}
    
        # Reset settings
        with conn.cursor() as cur:
            cur.execute("RESET enable_seqscan;")
            cur.execute("RESET enable_indexscan;")
            cur.execute("RESET enable_bitmapscan;")
            
        return ground_truth
    
    # Usage:
    # conn = psycopg2.connect(...)
    # sample_queries = ... # Load your N query vectors
    # true_neighbors = get_ground_truth(conn, sample_queries)

    Step 2: Benchmark Matrix for Index Build

    Next, create and time the builds for a matrix of m and ef_construction values. This will inform you about the operational cost (time and disk space) of your choices.

    python
    import time
    import psycopg2
    
    def benchmark_index_builds(conn, m_values, ef_construction_values):
        """Builds HNSW indexes with different parameters and records metrics."""
        results = []
        conn.autocommit = True  # commit each DROP/CREATE as it runs so the built index persists
        with conn.cursor() as cur:
            for m in m_values:
                for efc in ef_construction_values:
                    print(f"Building index with m={m}, ef_construction={efc}...")
                    cur.execute("DROP INDEX IF EXISTS documents_embedding_idx;")
                    
                    start_time = time.time()
                    cur.execute(f"""
                    CREATE INDEX documents_embedding_idx ON documents 
                    USING hnsw (embedding vector_cosine_ops) 
                    WITH (m = {m}, ef_construction = {efc});
                    """)
                    build_time = time.time() - start_time
    
                    cur.execute("SELECT pg_size_pretty(pg_relation_size('documents_embedding_idx'));")
                    index_size = cur.fetchone()[0]
    
                    results.append({
                        'm': m,
                        'ef_construction': efc,
                        'build_time_seconds': build_time,
                        'index_size': index_size
                    })
                    print(f"  -> Done in {build_time:.2f}s, Size: {index_size}")
        return results
    
    # Usage:
    # m_vals = [16, 24, 32]
    # efc_vals = [64, 96, 128]
    # build_metrics = benchmark_index_builds(conn, m_vals, efc_vals)
    # print(build_metrics)

    Step 3: Comprehensive Query Performance Benchmarking

    This is the most critical step. For each index you built, you will now run your sample queries with a range of ef_search values, measuring latency and recall against your ground truth.

    python
    import time
    import numpy as np
    import psycopg2
    from tqdm import tqdm
    
    def calculate_recall(retrieved_ids, ground_truth_ids):
        """Calculates recall for a single query."""
        return len(retrieved_ids.intersection(ground_truth_ids)) / len(ground_truth_ids)
    
    def benchmark_query_performance(conn, query_vectors, ground_truth, ef_search_values, k=10):
        """Runs queries with varying ef_search and measures latency and recall."""
        results = []
        with conn.cursor() as cur:
            for efs in ef_search_values:
                print(f"Benchmarking with ef_search = {efs}...")
                latencies = []
                recalls = []
                
                cur.execute(f"SET hnsw.ef_search = {efs};")
    
                for i, vec in enumerate(tqdm(query_vectors)):
                    start_time = time.time()
                    cur.execute(
                        "SELECT id FROM documents ORDER BY embedding <=> %s::vector LIMIT %s",
                        (vec.tolist(), k)  # explicit cast, as in the ground-truth query
                    )
                    retrieved_ids = {row[0] for row in cur.fetchall()}
                    latency = (time.time() - start_time) * 1000  # in ms
    
                    latencies.append(latency)
                    recalls.append(calculate_recall(retrieved_ids, ground_truth[i]))
    
                p99_latency = np.percentile(latencies, 99)
                avg_recall = np.mean(recalls)
    
                results.append({
                    'ef_search': efs,
                    'p99_latency_ms': p99_latency,
                    'avg_recall': avg_recall
                })
                print(f"  -> p99 Latency: {p99_latency:.2f}ms, Avg Recall: {avg_recall:.4f}")
        return results
    
    # Putting it all together:
    # Assume we are testing the index built with m=24, efc=64
    # efs_values = [20, 40, 60, 80, 100, 150, 200]
    # query_metrics = benchmark_query_performance(conn, sample_queries, true_neighbors, efs_values)
    # print(query_metrics)

    By combining these steps, you can generate a complete performance profile for each index configuration. You can plot Latency vs. Recall curves to visually determine the "knee" of the curve—the point where you get the best recall for an acceptable latency. This data allows you to make an informed decision based on your specific product requirements.
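
    As a sketch (assuming matplotlib is available), plotting one curve per index configuration makes the knee easy to spot:

    python
    import matplotlib.pyplot as plt

    def plot_tradeoff(query_metrics, label="m=24, efc=64"):
        """Plot the recall/latency curve produced by benchmark_query_performance."""
        recalls = [r['avg_recall'] for r in query_metrics]
        latencies = [r['p99_latency_ms'] for r in query_metrics]
        plt.plot(recalls, latencies, marker='o', label=label)
        for r in query_metrics:
            plt.annotate(f"efs={r['ef_search']}", (r['avg_recall'], r['p99_latency_ms']))
        plt.xlabel('Average recall@k')
        plt.ylabel('p99 latency (ms)')
        plt.legend()
        plt.show()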

    Advanced Edge Cases & Production Architecture

    The Multi-Tenancy Post-Filtering Trap

    A common pattern in SaaS applications is to partition data by tenant_id.

    A naive query looks like this:

    sql
    -- WARNING: This can be inefficient!
    BEGIN;
    SET LOCAL hnsw.ef_search = 100;
    SELECT id FROM documents
    WHERE tenant_id = 'some-tenant-uuid'
    ORDER BY embedding <=> $1
    LIMIT 10;
    COMMIT;

    Here's the problem: pgvector's HNSW implementation performs post-filtering. The database first uses the HNSW index to find the ef_search closest candidates from the entire table, regardless of tenant_id. Then it applies the WHERE tenant_id = 'some-tenant-uuid' filter to that small candidate set. Finally, it sorts the remaining candidates and returns the top 10. (Newer pgvector releases add iterative index scans that can mitigate this, but the behavior described here is what you get by default on widely deployed versions.)

    If a tenant's data is sparse or not clustered together in the vector space, the initial ef_search candidates might not contain any vectors from the target tenant, leading to empty or incomplete results. To get 10 results, you might need to dramatically increase ef_search to, say, 5000, which will destroy your latency.

    Solutions:

  • Table Partitioning: The most robust solution. Partition the documents table by tenant_id. This way, the HNSW index is built only on a single tenant's data, and the search is constrained to that partition. This is highly effective but adds operational complexity.
        sql
        -- Simplified example
        CREATE TABLE documents (id UUID, tenant_id UUID, embedding vector(1536)) PARTITION BY LIST (tenant_id);
        CREATE TABLE documents_tenant_A PARTITION OF documents FOR VALUES IN ('tenant-A-uuid');
        -- An index must be created on each partition
        CREATE INDEX ON documents_tenant_A USING hnsw (embedding vector_cosine_ops);
  • Increase ef_search and LIMIT: A less ideal but simpler solution is to fetch more results initially and filter in the application layer, as sketched after this list. This is a trade-off, increasing DB load to simplify the architecture.
        sql
        -- Fetch 100 candidates, hope to find 10 for our tenant
        SELECT id, tenant_id FROM documents ORDER BY embedding <=> $1 LIMIT 100;
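
    A minimal sketch of that application-layer fallback (the fetch_limit of 100 is a guess you must tune; a sparse tenant can still come up short):

    python
    def tenant_search(conn, query_vec, tenant_id, k=10, fetch_limit=100):
        """Over-fetch globally, then filter to one tenant in the application."""
        with conn:
            with conn.cursor() as cur:
                # ef_search caps how many rows the HNSW scan can return,
                # so it must be at least fetch_limit.
                cur.execute("SET LOCAL hnsw.ef_search = %s;", (fetch_limit,))
                cur.execute(
                    "SELECT id, tenant_id FROM documents "
                    "ORDER BY embedding <=> %s::vector LIMIT %s",
                    (query_vec, fetch_limit),
                )
                rows = [row for row in cur.fetchall() if row[1] == tenant_id]
                return rows[:k]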

    Memory, I/O, and Cache Performance

    HNSW performance is highly sensitive to whether the index can fit in memory. PostgreSQL relies on its shared_buffers and the OS page cache to keep hot data in RAM.

  • Index Size Matters: As calculated earlier, HNSW indexes can be massive. A 10 million vector table with d=1536 and m=32 can result in an index over 3TB. If this size exceeds your available RAM, the database will constantly perform disk I/O during graph traversal, leading to multi-second latencies.
  • Monitoring Cache Hit Rate: You must monitor your index cache hit rate.
        sql
        SELECT 
            relname,
            heap_blks_read,
            heap_blks_hit,
            idx_blks_read,
            idx_blks_hit,
            (idx_blks_hit * 100.0) / NULLIF(idx_blks_hit + idx_blks_read, 0) AS idx_hit_rate
        FROM pg_statio_user_tables
        WHERE relname = 'documents';

    If your idx_hit_rate is consistently below 99%, you are likely I/O bound. The solution is to either provision a machine with more RAM or shrink the index (e.g., reduce m, quantize to half-precision vectors where your pgvector version supports it, or move less critical data to a coarser index type such as IVFFlat).
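
    A quick, hedged sanity check (index name assumed) is to compare the index's on-disk size against shared_buffers directly; keep in mind the index also competes with table heap pages for that memory:

    python
    def index_fits_in_buffers(conn, index_name='documents_embedding_idx'):
        """Compare an index's size with shared_buffers (pg_settings reports 8 kB units)."""
        with conn.cursor() as cur:
            cur.execute("SELECT pg_relation_size(%s::regclass);", (index_name,))
            index_bytes = cur.fetchone()[0]
            cur.execute(
                "SELECT setting::bigint * 8192 FROM pg_settings WHERE name = 'shared_buffers';"
            )
            buffer_bytes = cur.fetchone()[0]
            print(f"index: {index_bytes / 1e9:.1f} GB, shared_buffers: {buffer_bytes / 1e9:.1f} GB")
            return index_bytes < buffer_bytes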

  • Storage Performance: When you are I/O bound, the performance of your underlying storage is paramount. Using high-IOPS NVMe drives (like AWS io2 EBS volumes) is non-negotiable for large-scale vector search workloads that don't fit entirely in memory.

    The Static Nature of HNSW and Re-indexing Strategy

    HNSW is a largely static structure: adding new vectors links them into the existing graph, but the graph is never restructured to optimize for the new data points. Index quality can therefore degrade over time as new data is inserted.

    For many applications, a periodic full re-indexing is necessary. In PostgreSQL, this can be done with REINDEX INDEX CONCURRENTLY, which rebuilds the index without taking a hard lock on the table, allowing reads and writes to continue.
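
    One driver-level gotcha: REINDEX ... CONCURRENTLY refuses to run inside a transaction block, and psycopg2 opens one implicitly. A minimal sketch (index name assumed):

    python
    def reindex_concurrently(conn, index_name='documents_embedding_idx'):
        """Rebuild the HNSW index while reads and writes continue."""
        conn.autocommit = True  # CONCURRENTLY cannot run inside a transaction block
        with conn.cursor() as cur:
            # identifiers cannot be parameterized; index_name must be trusted
            cur.execute(f"REINDEX INDEX CONCURRENTLY {index_name};")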

    A Production Re-indexing Pattern:

  • Never run REINDEX manually. Automate it.
  • Schedule it during off-peak hours. Even CONCURRENTLY adds significant load to the system.
  • Monitor index bloat and query performance. Trigger a re-index based on metrics, not just a fixed schedule: if you see recall dipping or latency creeping up after a month of heavy writes, it's time to re-index (a sketch follows this list).
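
    Wiring those rules together, a hedged sketch of a metric-driven trigger that reuses benchmark_query_performance and the reindex_concurrently helper above (the threshold is illustrative):

    python
    def maybe_reindex(conn, sample_queries, ground_truth, recall_floor=0.93):
        """Re-index only when measured recall on a fixed query sample degrades."""
        metrics = benchmark_query_performance(
            conn, sample_queries, ground_truth, ef_search_values=[100]
        )
        if metrics[0]['avg_recall'] < recall_floor:
            reindex_concurrently(conn)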

    Conclusion: From Heuristics to Engineering

    Optimizing pgvector's HNSW index is a microcosm of performance engineering. It requires moving away from default settings and anecdotal advice towards a rigorous, empirical process tailored to your specific data distribution, query patterns, and product requirements.

    The key takeaways for senior engineers are:

  • Embrace the Trade-offs: There is no single "best" configuration. Every choice is a balance between recall, latency, index build time, and memory/storage cost.
  • Build a Benchmarking Harness: The Python scripts provided are a template. Adapt them into your MLOps or DevOps pipeline to continuously evaluate performance as your data grows and changes.
  • Master the Parameters:

    - Use m to define the fundamental quality/cost of your graph. Choose it carefully, as it requires a full rebuild.

    - Use ef_construction to fine-tune build quality against build time.

    - Use ef_search as your live, runtime lever to meet your application's specific latency/recall SLA.

  • Architect for Scale: Be aware of the post-filtering limitation in multi-tenant scenarios and plan for it with partitioning or other strategies. Proactively manage memory and I/O, as they are the ultimate arbiters of performance at scale.

    By applying this systematic approach, you can transform your pgvector implementation from a functional prototype into a highly performant and reliable production system capable of delivering real-time semantic search with confidence.
