Tuning pgvector HNSW Indexes for Real-time Semantic Search

Goh Ling Yong

Beyond the Defaults: Production-Grade HNSW Tuning in pgvector

As engineering teams integrate vector search into production systems for Retrieval-Augmented Generation (RAG), semantic search, or recommendation engines, the initial proof-of-concept using pgvector often hits a performance wall. The default HNSW (Hierarchical Navigable Small World) index parameters provide a reasonable starting point, but they are rarely optimal for applications demanding both high recall and low query latency under load.

This article is not an introduction to pgvector or HNSW. It assumes you have already built an HNSW index and are now facing the critical task of tuning it for a production environment. We will dissect the complex interplay between index build parameters (m, ef_construction) and query-time parameters (ef_search), providing a robust framework for benchmarking and identifying the optimal configuration for your specific use case. We will also explore advanced, production-hardening topics like managing index bloat, the critical problem of metadata filtering, and dynamic parameter tuning.

Our goal is to move from an "it works" implementation to a highly performant, reliable system capable of serving real-time semantic search queries at scale.

The Core Trade-off Triangle: Recall, Latency, and Build Cost

Effective HNSW tuning is a balancing act between three competing factors:

  • Recall: The accuracy of your search, measured as the proportion of true nearest neighbors returned by the approximate nearest neighbor (ANN) search. A recall of 1.0 (or 100%) means the ANN search returned the exact same results as a brute-force exact search.
  • Query Latency: The time it takes to execute a nearest neighbor search query. For real-time applications, this is often the most critical metric, with SLOs typically in the sub-100ms range.
  • Build Cost: The combination of time and computational resources required to build the index, plus the final on-disk size of the index.
The primary levers pgvector gives us to navigate this triangle are:

  • m: The maximum number of connections per node in each layer of the graph. A higher m creates a denser, more complex graph. Impact: increases recall, significantly increases index build time and memory usage, and increases the final index size. It has a minor impact on query latency.
  • ef_construction: The size of the dynamic candidate list during index construction. A larger value means a more exhaustive search for neighbors when inserting new nodes. Impact: significantly increases index quality, leading to higher potential recall. It has a massive impact on index build time but no direct impact on index size or query latency.
  • ef_search: The size of the dynamic candidate list during a query; the runtime counterpart of ef_construction. Impact: the most direct lever for tuning the recall-latency trade-off at query time. A higher ef_search increases recall at the cost of higher query latency. It has no impact on the index itself.
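
For orientation, here is where each lever is actually set, using the items table we define in the next section (the values are illustrative, not recommendations):

sql
-- Build-time levers: fixed when the index is created
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops)
WITH (m = 24, ef_construction = 128);

-- Query-time lever: adjustable per session or per transaction
SET hnsw.ef_search = 60;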

    Understanding these relationships is foundational. Let's move from theory to practice by establishing a rigorous benchmarking environment.

    A Practical Benchmarking Framework

    To make informed tuning decisions, you must benchmark. We'll simulate a realistic production scenario with 1 million 768-dimensional embeddings, typical of models like all-mpnet-base-v2.

    1. Database Setup

    First, set up the table and extension in PostgreSQL.

    sql
    -- Ensure the extension is created
    CREATE EXTENSION IF NOT EXISTS vector;
    
    -- Create the table to hold our items and their embeddings
    CREATE TABLE items (
        id BIGSERIAL PRIMARY KEY,
        embedding VECTOR(768)
    );
    
    -- Optional: Add metadata for filtering tests later
    ALTER TABLE items ADD COLUMN category TEXT;
    ALTER TABLE items ADD COLUMN created_at TIMESTAMPTZ;

    2. Data Population and Ground Truth Generation

    We'll use a Python script to populate the database. A crucial step for measuring recall is to establish a "ground truth"—the actual k-nearest neighbors for a set of test queries. We achieve this by running a brute-force exact search before creating the HNSW index.

    python
    import pickle
    import time

    import numpy as np
    import psycopg2
    from psycopg2.extras import execute_values
    from pgvector.psycopg2 import register_vector
    
    # --- Configuration ---
    DB_CONNECTION_STRING = "postgresql://user:password@host:port/dbname"
    NUM_ITEMS = 1_000_000
    DIMENSIONS = 768
    NUM_TEST_QUERIES = 100
    K = 10 # Number of nearest neighbors to find
    
    # --- Database Connection ---
    def get_db_connection():
        conn = psycopg2.connect(DB_CONNECTION_STRING)
        register_vector(conn)
        return conn
    
    # --- Data Generation ---
    def populate_data(conn):
        print(f"Populating table with {NUM_ITEMS} random vectors...")
        cursor = conn.cursor()
        cursor.execute("TRUNCATE TABLE items RESTART IDENTITY;")
        
        # Generate data in batches
        batch_size = 10000
        for i in range(0, NUM_ITEMS, batch_size):
            print(f"Inserting batch {i // batch_size + 1}/{(NUM_ITEMS // batch_size)}")
            # Generate normalized vectors
            embeddings = np.random.rand(batch_size, DIMENSIONS).astype(np.float32)
            embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
            execute_values(cursor, "INSERT INTO items (embedding) VALUES %s", [(e,) for e in embeddings])
        conn.commit()
        cursor.close()
        print("Data population complete.")
    
    # --- Ground Truth Generation ---
    def get_ground_truth(conn, test_vectors):
        print("Generating ground truth for recall calculation...")
        ground_truth = []
        cursor = conn.cursor()
        
        # Ensure we are doing a full sequential scan for exact results
        cursor.execute("SET enable_seqscan = on;")
        cursor.execute("SET enable_indexscan = off;")
        
        for i, vector in enumerate(test_vectors):
            if (i+1) % 10 == 0:
                print(f"  Processing query {i+1}/{len(test_vectors)}")
            cursor.execute(
                "SELECT id FROM items ORDER BY embedding <-> %s LIMIT %s", 
                (vector, K)
            )
            result_ids = {row[0] for row in cursor.fetchall()}
            ground_truth.append(result_ids)
            
        cursor.close()
        print("Ground truth generated.")
        return ground_truth
    
    if __name__ == "__main__":
        conn = get_db_connection()
        
        # Step 1: Populate data if needed
        # populate_data(conn)
        
        # Step 2: Generate test vectors and ground truth
        test_vectors = np.random.rand(NUM_TEST_QUERIES, DIMENSIONS).astype(np.float32)
        test_vectors /= np.linalg.norm(test_vectors, axis=1, keepdims=True)
        
        ground_truth_ids = get_ground_truth(conn, test_vectors)
        
        # Save for later use
        np.save('test_vectors.npy', test_vectors)
        with open('ground_truth.pkl', 'wb') as f:
            pickle.dump(ground_truth_ids, f)
            
        conn.close()

    Note: Generating the ground truth is computationally expensive and should be done once. The script saves the test vectors and ground truth IDs for use in subsequent benchmark runs.

    Deep Dive: Tuning `m` and `ef_construction` for Index Quality

    These two parameters define the structure and quality of your HNSW graph. They are set at index creation and cannot be changed without a full REINDEX. The goal is to build the highest quality index you can afford in terms of build time.

    Let's create a benchmark script to test different combinations.

    python
    # (Continuing from the previous script context)
    
    def benchmark_index_build(conn, m, ef_construction):
        cursor = conn.cursor()
        index_name = f"items_embedding_idx_m{m}_efc{ef_construction}"
        print(f"\n--- Benchmarking M={m}, EF_CONSTRUCTION={ef_construction} ---")
        
        # Drop previous index if it exists
        cursor.execute("DROP INDEX IF EXISTS items_embedding_idx;")
        conn.commit()
    
        # Build the new index and measure time
        start_time = time.time()
        cursor.execute(f"""
            CREATE INDEX items_embedding_idx 
            ON items 
            USING hnsw (embedding vector_l2_ops) 
            WITH (m = {m}, ef_construction = {ef_construction});
        """)
        conn.commit()
        build_time = time.time() - start_time
        
        # Get index size
        cursor.execute("SELECT pg_size_pretty(pg_relation_size('items_embedding_idx'));")
        index_size = cursor.fetchone()[0]
        
        cursor.close()
        return build_time, index_size
    
    def calculate_recall(ann_results, ground_truth):
        total_recall = 0
        for i, ann_set in enumerate(ann_results):
            true_positives = len(ann_set.intersection(ground_truth[i]))
            total_recall += true_positives / K
        return total_recall / len(ann_results)
    
    def benchmark_query_performance(conn, test_vectors, ground_truth, ef_search):
        latencies = []
        ann_results = []
        cursor = conn.cursor()
        
        # Set query-time parameter
        cursor.execute(f"SET hnsw.ef_search = {ef_search};")
        
        for vector in test_vectors:
            start_time = time.time()
            cursor.execute("SELECT id FROM items ORDER BY embedding <-> %s LIMIT %s", (vector, K))
            result_ids = {row[0] for row in cursor.fetchall()}
            latencies.append((time.time() - start_time) * 1000) # milliseconds
            ann_results.append(result_ids)
            
        recall = calculate_recall(ann_results, ground_truth)
        p95_latency = np.percentile(latencies, 95)
        
        cursor.close()
        return recall, p95_latency
    
    if __name__ == '__main__':
        # Load ground truth data
        test_vectors = np.load('test_vectors.npy')
        with open('ground_truth.pkl', 'rb') as f:
            ground_truth = pickle.load(f)
            
        conn = get_db_connection()
    
        # --- Build Parameter Benchmarks ---
        build_configs = [
            {'m': 16, 'ef_construction': 64},
            {'m': 24, 'ef_construction': 96},
            {'m': 32, 'ef_construction': 128},
            {'m': 48, 'ef_construction': 192},
        ]
    
        print("| M  | ef_construction | Build Time (s) | Index Size | Recall @ ef_search=40 | p95 Latency (ms) |")
        print("|----|-----------------|----------------|------------|-----------------------|------------------|")
    
        for config in build_configs:
            m, efc = config['m'], config['ef_construction']
            build_time, index_size = benchmark_index_build(conn, m, efc)
            
            # Test with a fixed ef_search to evaluate index quality
            recall, p95 = benchmark_query_performance(conn, test_vectors, ground_truth, ef_search=40)
            
            print(f"| {m:<2} | {efc:<15} | {build_time:<14.2f} | {index_size:<10} | {recall:<21.4f} | {p95:<16.2f} |")
    
        conn.close()

    Expected Benchmark Results (Illustrative):

    | M  | ef_construction | Build Time (s) | Index Size | Recall @ ef_search=40 | p95 Latency (ms) |
    |----|-----------------|----------------|------------|-----------------------|------------------|
    | 16 | 64              | 650.12         | 780 MB     | 0.9750                | 12.51            |
    | 24 | 96              | 1105.45        | 1.1 GB     | 0.9890                | 14.88            |
    | 32 | 128             | 1850.91        | 1.5 GB     | 0.9940                | 18.03            |
    | 48 | 192             | 3521.60        | 2.2 GB     | 0.9960                | 25.40            |

    Analysis and Production Pattern:

    From these results, we can see clear diminishing returns. Moving from m=32 to m=48 almost doubles the build time and significantly increases the index size for a mere 0.2% gain in recall. The cost is not justified.

    Production Pattern: For most applications, an m value between 24 and 32 offers the best balance. The ef_construction parameter should be set as high as your build time budget allows, with a common starting point being 4×m to 8×m (e.g., 128-256 for m=32). Since index building is often a one-time or infrequent batch process, it's worth investing time here to create a high-quality graph that will yield better recall at lower query latencies.
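
    Build cost itself is tunable. HNSW builds go dramatically faster when the graph fits in memory and the build can use parallel workers. A minimal sketch, assuming your instance has the headroom (values are illustrative):

    sql
    -- Give the build enough memory to hold the graph
    SET maintenance_work_mem = '8GB';

    -- Recent pgvector versions can parallelize HNSW builds
    SET max_parallel_maintenance_workers = 7;

    CREATE INDEX items_embedding_idx
    ON items
    USING hnsw (embedding vector_l2_ops)
    WITH (m = 32, ef_construction = 128);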

    The Real-time Lever: Tuning `ef_search` for Latency vs. Recall

    Once you have a well-built index (e.g., using m=32, ef_construction=128), ef_search becomes your primary tool for runtime tuning. It directly controls the depth of the search at query time.

    Let's modify our script to test a single, well-built index against a range of ef_search values.

    python
    # (Assuming the index from m=32, ef_construction=128 is already built)
    if __name__ == '__main__':
        # Load ground truth data
        test_vectors = np.load('test_vectors.npy')
        with open('ground_truth.pkl', 'rb') as f:
            ground_truth = pickle.load(f)
            
        conn = get_db_connection()
    
        search_configs = [20, 30, 40, 60, 80, 120, 160]
        
        print("| ef_search | Recall   | p95 Latency (ms) |")
        print("|-----------|----------|------------------|")
    
        for ef_search in search_configs:
            recall, p95 = benchmark_query_performance(conn, test_vectors, ground_truth, ef_search)
            print(f"| {ef_search:<9} | {recall:<8.4f} | {p95:<16.2f} |")
    
        conn.close()

    Expected Benchmark Results (Illustrative):

    | ef_search | Recall | p95 Latency (ms) |
    |-----------|--------|------------------|
    | 20        | 0.9610 | 9.85             |
    | 30        | 0.9820 | 13.50            |
    | 40        | 0.9940 | 18.03            |
    | 60        | 0.9970 | 25.11            |
    | 80        | 0.9980 | 33.90            |
    | 120       | 0.9985 | 50.24            |
    | 160       | 0.9985 | 68.77            |

    Analysis and Production Pattern:

    This table is the key to mapping your business requirements to technical parameters.

  • Do you need >99% recall for a critical internal process, even at higher latency? An ef_search of 40 or 60 is appropriate.
  • Are you serving a public-facing API where a p95 latency under 15ms is paramount, and 98% recall is acceptable? An ef_search of 30 is your target.

    The relationship is clear: latency increases almost linearly with ef_search, while recall gains plateau quickly.

    Production Pattern: Dynamic ef_search Tuning

    A powerful feature of pgvector is that ef_search can be set on a per-transaction or per-session basis. This allows you to tailor performance for different parts of your application from the same database index.

    Consider an application with two endpoints:

  • /api/search/fast: A user-facing interactive search where low latency is key.
  • /api/search/accurate: An internal endpoint for a data science team that requires the highest possible recall.

    Your application code could look like this (using Python's psycopg2):

    python
    def fast_search(query_vector):
        with get_db_connection() as conn:
            with conn.cursor() as cursor:
                # Set a low ef_search for this transaction only
                cursor.execute("SET LOCAL hnsw.ef_search = 30;")
                cursor.execute("SELECT id FROM items ORDER BY embedding <-> %s LIMIT 10", (query_vector,))
                return [row[0] for row in cursor.fetchall()]
    
    def accurate_search(query_vector):
        with get_db_connection() as conn:
            with conn.cursor() as cursor:
                # Set a high ef_search for this transaction only
                cursor.execute("SET LOCAL hnsw.ef_search = 80;")
                cursor.execute("SELECT id FROM items ORDER BY embedding <-> %s LIMIT 10", (query_vector,))
                return [row[0] for row in cursor.fetchall()]

    Using SET LOCAL ensures the setting only applies to the current transaction, preventing side effects on other queries sharing the connection. (Note that SET LOCAL has no effect outside an explicit transaction; psycopg2 opens one implicitly on the first statement, so the pattern above works.) This dynamic approach is far superior to setting a single global value and is a hallmark of a well-architected vector search service.

    Edge Cases and Production Hardening

    Real-world systems present challenges beyond simple tuning.

    1. Index Bloat and VACUUM

    HNSW indexes in pgvector are particularly susceptible to bloat from UPDATE and DELETE operations. When a vector is deleted, it is merely marked as deleted within the index structure; the space is not reclaimed. Subsequent searches still traverse these dead nodes, increasing latency for no benefit.

    Problem: A table with heavy churn will see its HNSW query performance degrade over time, even if the total number of rows remains constant.
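
    A rough proxy for this kind of churn is the dead-tuple count the statistics collector keeps for the table:

    sql
    -- Dead tuples accumulated on the vector table, and when autovacuum last ran
    SELECT n_live_tup, n_dead_tup, last_autovacuum
    FROM pg_stat_user_tables
    WHERE relname = 'items';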

    Solution:

  • Aggressive Autovacuum: For your vector table, you may need to tune autovacuum parameters more aggressively than the database defaults. Consider lowering autovacuum_vacuum_scale_factor to a small value (e.g., 0.05) and autovacuum_vacuum_threshold to trigger vacuums more frequently; a per-table sketch follows this list.
  • Periodic REINDEX: For tables with extremely high churn, a periodic, scheduled REINDEX CONCURRENTLY during off-peak hours may be necessary to rebuild the index from scratch, completely eliminating bloat. This is a heavy operation and should be used judiciously.
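
    A minimal sketch of both remedies (the thresholds are illustrative; tune them to your churn rate):

    sql
    -- Per-table autovacuum overrides for a high-churn vector table
    ALTER TABLE items SET (
        autovacuum_vacuum_scale_factor = 0.05,
        autovacuum_vacuum_threshold = 1000
    );

    -- Scheduled off-peak rebuild to eliminate accumulated bloat
    REINDEX INDEX CONCURRENTLY items_embedding_idx;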
    2. The Critical Challenge: Metadata Filtering

    A purely semantic search is rare. Most production queries combine vector search with traditional metadata filters, like ... WHERE category = 'electronics' AND created_at > '2023-01-01'. The way HNSW handles this is a major performance consideration.

    The HNSW Post-Filtering Problem:

    pgvector's HNSW implementation performs post-filtering. The index scan first collects the nearest candidates from the entire vector space (a pool bounded by hnsw.ef_search, regardless of your WHERE clause) and only then applies the filter to that small result set.

    Example Query:

    sql
    SELECT id, (embedding <-> %s) as distance
    FROM items
    WHERE category = 'electronics'
    ORDER BY embedding <-> %s
    LIMIT 10;

    Execution Flow:

  1. Walk the graph to collect the candidates closest to the query vector from the entire 1 million item index (a pool bounded by hnsw.ef_search).
  2. Check which of those candidates have category = 'electronics'.
  3. Return up to 10 matching rows.

    The Catastrophic Recall Failure: If the true 10 nearest 'electronics' neighbors are not within that overall candidate set, you will get few or even zero results, even though plenty of matching rows exist in the table. This is especially problematic with highly selective filters (e.g., filtering by a specific user_id).

    Solutions & Patterns:

  • Increase the LIMIT: The simplest workaround is to retrieve more candidates from the HNSW index and filter in the application (a sketch of that application-side step follows this list). This increases the probability of finding matches but also increases database load and latency.

    sql
    -- Fetch 100 candidates, then filter in the app to get the top 10
    SELECT id, category, (embedding <-> %s) as distance
    FROM items
    ORDER BY embedding <-> %s
    LIMIT 100;
  • Consider IVFFlat for Pre-Filtering: pgvector also supports IVFFlat indexes. IVFFlat can perform pre-filtering (or more accurately, it can scan only the relevant partitions of the vector space), which is much more efficient for selective filters. However, IVFFlat has its own tuning complexity (choosing lists and probes) and often has lower recall than a well-tuned HNSW index for non-filtered queries.
  • Partitioning (Advanced): For a common, low-cardinality filter key (like tenant_id in a multi-tenant application), you can use table partitioning. You would partition the items table by category and build a separate HNSW index on each partition. Your queries would then target a specific partition, effectively scoping the search space.

    sql
    -- Querying a specific partition
    SELECT id FROM items_electronics ORDER BY embedding <-> %s LIMIT 10;

    This provides perfect isolation and the best performance but adds significant architectural complexity.
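
    For the over-fetch pattern above, the application-side filter-and-truncate step could look like this minimal sketch (the function name and parameters are illustrative):

    python
    def filtered_search(cursor, query_vector, category, k=10, candidates=100):
        # Over-fetch candidates from the HNSW index, ordered by distance
        cursor.execute(
            "SELECT id, category FROM items ORDER BY embedding <-> %s LIMIT %s",
            (query_vector, candidates),
        )
        # Keep the nearest rows that match the filter, truncated to k results
        return [row[0] for row in cursor.fetchall() if row[1] == category][:k]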

    Conclusion: A Synthesis of Strategy

    Tuning pgvector HNSW indexes is an iterative, data-driven process, not a one-time configuration. For senior engineers tasked with building robust AI systems, the path to production excellence involves:

  • Establish a Rigorous Benchmark: You cannot optimize what you cannot measure. Create a repeatable test suite with a stable ground truth dataset.
  • Invest in Build Quality: Use the highest m and ef_construction values your build-time budget can tolerate. A common starting point of m=24-32 and ef_construction=128 is robust. This is a one-time cost for long-term query performance.
  • Leverage Dynamic ef_search: This is your primary runtime control. Use SET LOCAL to tailor the recall/latency trade-off for different API endpoints or use cases, mapping SLOs directly to this parameter.
  • Plan for Churn: Actively monitor for index bloat. Implement an aggressive autovacuum strategy for vector tables and be prepared to REINDEX if necessary.
  • Master the Filtering Dilemma: Understand the limitations of HNSW's post-filtering. For highly selective queries, architect your solution around this by fetching a larger candidate set, considering IVFFlat, or implementing table partitioning for critical filter keys.
    By moving beyond the default settings and applying these advanced patterns, you can transform a functional pgvector implementation into a production-ready, highly performant semantic search engine capable of meeting demanding, real-time requirements.
