Tuning HNSW for Billion-Scale ANN: M & ef_construction Deep Dive

19 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

Beyond Defaults: Mastering the Core of High-Performance HNSW

In the realm of large-scale information retrieval and recommender systems, Approximate Nearest Neighbor (ANN) search over high-dimensional vectors has become a foundational technology. While many engineers are familiar with the concept of vector embeddings and the need for ANN, the transition from a proof-of-concept on a few million items to a production system serving billions of vectors reveals a chasm of complexity. At the heart of this challenge lies the Hierarchical Navigable Small World (HNSW) algorithm, the de facto industry standard for its exceptional performance.

However, treating HNSW as a black box with default parameters is a critical mistake that leads to excessive memory usage, slow build times, and poor recall-latency trade-offs. Senior engineers tasked with building these systems must understand that HNSW is not an algorithm to be used, but a complex data structure to be tuned.

This article bypasses introductory material entirely. We assume you understand what vector embeddings are, why cosine similarity or L2 distance is used, and why a brute-force O(N) search is intractable. Instead, we will focus exclusively on the two most critical build-time parameters that define the very structure and quality of your HNSW graph: M, the connectivity parameter, and ef_construction, the build-time search beam width. We will explore:

  • The fundamental interplay between M and ef_construction: How they collaboratively shape the graph's topology and its search efficiency.
  • A systematic benchmarking methodology: A production-oriented approach to find the optimal parameter set for your specific data distribution and performance requirements (QPS vs. Recall).
  • Advanced runtime tuning with ef_search: How to leverage the pre-built graph to dynamically adjust the performance curve without costly rebuilds.
  • Memory footprint analysis and scaling to a billion vectors: We'll break down the memory formula for HNSW and demonstrate why Product Quantization (PQ) is a non-negotiable component for massive-scale deployments.

Our analysis will use Python with industry-standard libraries like hnswlib and faiss, providing complete, runnable code examples that model a real-world tuning process.


    Section 1: The Core Architectural Trade-off: M vs. ef_construction

    The HNSW graph is a multi-layered structure where the top layers contain sparse long-range connections for coarse navigation, and the bottom layers contain dense short-range connections for fine-grained search. The quality of this graph is paramount, and it is almost entirely dictated by M and ef_construction during the initial indexing phase.

    M: Defining Graph Connectivity

    M specifies the maximum number of bi-directional links (neighbors) that any given node can have on a single layer of the graph (with the exception of layer 0, which can have up to 2M links). It directly controls the density of the graph.

    * Low M (e.g., 8-12): Creates a sparse graph.

    * Pros: Lower memory consumption per vector, faster index build times.

    * Cons: The navigational paths are less robust. The search might get trapped in a local minimum, failing to find the true nearest neighbors, thus leading to lower peak recall. A sparse graph requires a larger search-time beam (ef_search) to achieve acceptable recall, potentially increasing query latency.

    * High M (e.g., 48-96): Creates a dense, highly interconnected graph.

    * Pros: More robust and numerous paths to any given node, enabling higher recall ceilings even with a smaller ef_search.

    * Cons: Significantly increased memory footprint. The index size on disk and in RAM can explode. Index build times are much longer as more neighbor candidates must be considered and linked for each insertion.

    ef_construction: Defining Build-Time Quality

    While M sets the static limit on connections, ef_construction controls the dynamic quality of those connections during the build process. When a new vector is inserted, the algorithm traverses the graph from a top-level entry point to find its approximate location. ef_construction defines the size of the dynamic candidate list (a beam search or priority queue) used during this traversal.

    * Low ef_construction (e.g., 100): The insertion process performs a relatively shallow search for potential neighbors.

    * Pros: Dramatically faster index build times.

    * Cons: The algorithm may only find locally optimal neighbors for the new node, not the globally best ones. This leads to a lower-quality graph structure, which directly harms the maximum achievable recall, regardless of how high you set the search-time ef_search parameter.

    * High ef_construction (e.g., 500+): The insertion process performs a much more exhaustive search for neighbors.

    * Pros: The resulting graph has connections that are much closer to the true nearest neighbors. This creates a highly efficient navigational structure, leading to higher recall for a given ef_search.

    * Cons: Extremely slow index build times. The cost is often super-linear.

    The Critical Interplay:

    A common mistake is to tune these parameters in isolation. A high M is wasted if ef_construction is too low, as you'll be creating a dense graph with poorly chosen, suboptimal connections. Conversely, an extremely high ef_construction with a very low M is inefficient; you're spending immense computational effort to find, say, the 500 best candidate neighbors, only to discard most of them and keep just M=8.

    The goal is to find a balance: ef_construction should be high enough to populate the M connections with high-quality candidates. A common rule of thumb is ef_construction >= 2 * M, but this is highly data-dependent and requires empirical validation.
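
    To make these knobs concrete, here is a minimal sketch of where they appear in hnswlib's API (the faiss equivalents are used in the benchmarks below); the dataset size and parameter values are illustrative only:

    python
    import numpy as np
    import hnswlib

    dim = 128
    num_elements = 100_000
    data = np.random.rand(num_elements, dim).astype('float32')

    # M and ef_construction are fixed at build time; changing them means a rebuild.
    index = hnswlib.Index(space='l2', dim=dim)
    index.init_index(max_elements=num_elements, M=32, ef_construction=400)
    index.add_items(data, np.arange(num_elements))

    # ef (the search-time beam width) can be adjusted at any point after the build.
    index.set_ef(64)
    labels, distances = index.knn_query(data[:5], k=10)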


    Section 2: A Production-Grade Benchmarking Workflow

    Theoretical understanding is insufficient. To choose the right parameters, you must benchmark against your own data or a representative dataset. Let's design a systematic experiment.

    For this example, we'll use a 1 million vector subset of the SIFT1B dataset (128 dimensions) to simulate a realistic scenario while keeping runtimes manageable for an article. The ground truth (true nearest neighbors) is available for this dataset, which is crucial for calculating recall.

    Setup

    First, let's install the necessary libraries and prepare our data.

    bash
    # We'll use Faiss for its robust HNSW implementation and utility functions
    pip install faiss-cpu numpy pandas matplotlib
    
    # In a real scenario, you'd download and prepare the SIFT dataset.
    # For this example, we'll generate synthetic data with a similar structure.
    python
    import numpy as np
    import faiss
    import time
    
    # --- Data Generation (Simulating SIFT1M) ---
    def generate_data(num_vectors, dim, noise_seed=0):
        print(f"Generating {num_vectors} vectors of dimension {dim}...")
        # Generate clustered data to make search non-trivial.
        # Centroids use a fixed seed so queries follow the same distribution as
        # the database; a distinct noise_seed per call keeps the query set from
        # duplicating the first database vectors exactly.
        num_clusters = 100
        centroid_rng = np.random.default_rng(42)
        centroids = centroid_rng.random((num_clusters, dim), dtype='float32')
        noise_rng = np.random.default_rng(noise_seed)
        noise = noise_rng.normal(0, 0.1, (num_vectors, dim)).astype('float32')
        # Assign each vector to a cluster round-robin and perturb around its centroid
        data = centroids[np.arange(num_vectors) % num_clusters] + noise
        # L2 normalize the vectors, as is common practice
        faiss.normalize_L2(data)
        return data
    
    dimension = 128
    num_database_vectors = 1_000_000
    num_query_vectors = 1_000
    
    db_vectors = generate_data(num_database_vectors, dimension, noise_seed=0)
    query_vectors = generate_data(num_query_vectors, dimension, noise_seed=123)
    
    # --- Ground Truth Calculation ---
    # This is computationally expensive and why we do it once offline.
    print("Calculating ground truth...")
    ground_truth_k = 100
    index_flat = faiss.IndexFlatL2(dimension)
    index_flat.add(db_vectors)
    D_gt, I_gt = index_flat.search(query_vectors, ground_truth_k)
    
    print("Setup complete.")

    Code Example 1: Systematic Tuning Script

    Now, we'll create a loop to build and evaluate HNSW indexes with different parameter combinations.

    python
    def calculate_recall(I_ann, I_gt, k):
        """Calculates mean recall@k for the search results."""
        # I_ann: (num_queries, k) array of ANN results
        # I_gt:  (num_queries, >=k) array of ground truth results
        # For each query, count how many of its top-k ANN results appear in its
        # own top-k ground truth, then average over all queries.
        n_queries = I_ann.shape[0]
        found = sum(len(set(I_ann[i, :k]) & set(I_gt[i, :k])) for i in range(n_queries))
        return found / (n_queries * k)
    
    def benchmark_hnsw(m_val, ef_construction_val, ef_search_val):
        """Builds, benchmarks, and returns stats for an HNSW index."""
        print(f"\n--- Benchmarking M={m_val}, efC={ef_construction_val}, efS={ef_search_val} ---")
        
        # 1. Build the index
        index = faiss.IndexHNSWFlat(dimension, m_val)
        index.hnsw.efConstruction = ef_construction_val
        
        start_build_time = time.time()
        index.add(db_vectors)
        end_build_time = time.time()
        build_time = end_build_time - start_build_time
        
        # 2. Set search-time parameter
        index.hnsw.efSearch = ef_search_val
        
        # 3. Perform the search and measure latency
        start_search_time = time.time()
        D_ann, I_ann = index.search(query_vectors, k=10)
        end_search_time = time.time()
        
        search_time_ms = (end_search_time - start_search_time) * 1000
        qps = num_query_vectors / (search_time_ms / 1000)
        
        # 4. Calculate recall
        recall = calculate_recall(I_ann, I_gt, k=10)
    
        # 5. Measure memory (approximate)
        # In a real scenario, you'd use a more robust method to get RSS
        # For Faiss, we can estimate from the components.
        # VECTORS: num_vectors * dim * 4 bytes
        # GRAPH: num_vectors * M * 2 * 4 bytes (approx for bi-directional links)
        mem_vectors_mb = (db_vectors.nbytes) / (1024**2)
        mem_graph_mb = (num_database_vectors * m_val * 2 * 4) / (1024**2)
        total_mem_mb = mem_vectors_mb + mem_graph_mb
        
        print(f"Build Time: {build_time:.2f}s")
        print(f"QPS: {qps:.2f}")
        print(f"Recall@10: {recall:.4f}")
        print(f"Est. Memory: {total_mem_mb:.2f} MB")
        
        return {
            "M": m_val,
            "efConstruction": ef_construction_val,
            "efSearch": ef_search_val,
            "build_time_s": build_time,
            "qps": qps,
            "recall_at_10": recall,
            "memory_mb": total_mem_mb
        }
    
    # --- Run the experiment ---
    results = []
    M_values = [16, 32, 64]
    ef_construction_values = [200, 400]
    ef_search_values = [32, 64, 128] # We'll explore this more in the next section
    
    for m in M_values:
        for efc in ef_construction_values:
            # We only need to build the index once for each (M, efC) pair
            print(f"\nBuilding index for M={m}, efC={efc}...")
            index = faiss.IndexHNSWFlat(dimension, m)
            index.hnsw.efConstruction = efc
            build_start = time.time()
            index.add(db_vectors)
            build_time = time.time() - build_start
    
            for efs in ef_search_values:
                print(f"  Searching with efS={efs}...")
                index.hnsw.efSearch = efs
                
                search_start = time.time()
                D_ann, I_ann = index.search(query_vectors, k=10)
                search_time = time.time() - search_start
                
                qps = num_query_vectors / search_time
                recall = calculate_recall(I_ann, I_gt, k=10)
                
                results.append({
                    "M": m, "efC": efc, "efS": efs,
                    "build_time": build_time,
                    "qps": qps, "recall": recall
                })
    
    # --- Analyze Results ---
    import pandas as pd
    df = pd.DataFrame(results)
    print("\n--- BENCHMARK RESULTS ---")
    print(df.to_string())

    Analysis of Results (Hypothetical Output, abridged to the most informative rows)

    text
    --- BENCHMARK RESULTS ---
        M  efC  efS  build_time         qps    recall
    0  16  200   32   15.541245  15032.1234  0.9215
    1  16  200   64   15.541245   8123.4567  0.9532
    2  16  200  128   15.541245   4321.9876  0.9610
    3  16  400   32   25.123456  15543.8765  0.9450
    4  16  400   64   25.123456   8456.1234  0.9710
    5  16  400  128   25.123456   4567.5432  0.9785
    6  32  200   32   28.987654  14567.1234  0.9510  <-- Poor efC for M
    7  32  200   64   28.987654   7987.4321  0.9750
    8  32  200  128   28.987654   4123.8765  0.9810
    9  32  400   32   48.543210  15876.5432  0.9790
    10 32  400   64   48.543210   8876.1234  0.9910  <-- Sweet spot
    11 32  400  128   48.543210   4987.5432  0.9945
    12 64  400   32   85.123456  15999.1234  0.9850
    13 64  400   64   85.123456   8998.4321  0.9950
    14 64  400  128   85.123456   5123.8765  0.9972

    From this data, we can draw critical conclusions:

    * Higher ef_construction consistently yields better recall. Compare rows (0-2) with (3-5). For the same M=16, increasing efC from 200 to 400 gives a significant recall boost (e.g., 0.9532 -> 0.9710 at efS=64) for a moderate increase in build time.

    * The diminishing returns of M. Moving from M=32 to M=64 (rows 10 vs 13) provides a tiny recall increase (0.9910 -> 0.9950) but nearly doubles the build time and, critically, the memory required for the graph structure. For many applications, M=32 is a much better trade-off.

    * ef_search is the runtime knob. For a single built index (e.g., M=32, efC=400), we can see a clear trade-off curve: increasing efS from 32 to 128 improves recall from 0.9790 to 0.9945, but QPS drops by nearly 70%. This is the primary lever you will use in production (a selection helper is sketched after this list).
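
    This selection step can be automated: given a recall floor from the product requirements, pick the configuration that clears it with the highest throughput. A minimal sketch over the df DataFrame built above (the 0.99 target is illustrative):

    python
    recall_target = 0.99

    # Keep only configurations that meet the recall floor, then rank by QPS.
    viable = df[df["recall"] >= recall_target]
    if viable.empty:
        print("No configuration meets the recall target; increase efC, efS, or M.")
    else:
        best = viable.sort_values("qps", ascending=False).iloc[0]
        print("Highest-QPS configuration meeting the target:")
        print(best)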


    Section 3: Advanced Considerations and Edge Cases

    The Latency/Recall Curve and Dynamic `ef_search`

    The most powerful feature of HNSW is the ability to tune the ef_search parameter per query. This allows for a multi-tiered service level. For instance, a background batch job could use a high ef_search for maximum quality, while an interactive user-facing query could use a lower ef_search to meet strict latency SLAs.
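
    In Faiss, efSearch is a mutable attribute on the index's hnsw struct, so a tier-based policy can be implemented by setting it just before each request. A minimal sketch (the tier names and values are hypothetical, and the attribute is shared, so concurrent requests with different tiers on one index object need coordination):

    python
    # Hypothetical service tiers; the exact values come from the curve below.
    EF_SEARCH_BY_TIER = {"interactive": 48, "batch": 256}

    def search_with_tier(index, queries, tier, k=10):
        # efSearch is read at query time, so setting it here changes the
        # recall/latency trade-off for this request only.
        index.hnsw.efSearch = EF_SEARCH_BY_TIER[tier]
        return index.search(queries, k)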

    Your goal after the build-time tuning is to select a single (M, ef_construction) pair that produces the best potential graph. Then, you characterize its performance across a range of ef_search values to create a recall-vs-latency curve.

    Code Example 2: Plotting the Performance Curve

    python
    import matplotlib.pyplot as plt
    
    # Let's choose our best index from the benchmark: M=32, efC=400
    chosen_m = 32
    chosen_efc = 400
    
    print(f"Characterizing index M={chosen_m}, efC={chosen_efc}")
    index = faiss.IndexHNSWFlat(dimension, chosen_m)
    index.hnsw.efConstruction = chosen_efc
    index.add(db_vectors)
    
    search_ef_range = [16, 24, 32, 48, 64, 96, 128, 192, 256]
    curve_qps = []
    curve_recall = []
    
    for efs in search_ef_range:
        index.hnsw.efSearch = efs
        
        start = time.time()
        D, I = index.search(query_vectors, k=10)
        end = time.time()
        
        qps = num_query_vectors / (end - start)
        recall = calculate_recall(I, I_gt, k=10)
        
        curve_qps.append(qps)
        curve_recall.append(recall)
    
    # Plotting
    fig, ax1 = plt.subplots(figsize=(10, 6))
    
    color = 'tab:red'
    ax1.set_xlabel('Recall@10')
    ax1.set_ylabel('QPS (Queries per Second)', color=color)
    ax1.plot(curve_recall, curve_qps, color=color, marker='o', label='QPS')
    ax1.tick_params(axis='y', labelcolor=color)
    ax1.grid(True)
    
    ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
    color = 'tab:blue'
    ax2.set_ylabel('Latency (ms per query)', color=color)
    latencies_ms = [1000 / qps for qps in curve_qps]
    ax2.plot(curve_recall, latencies_ms, color=color, marker='x', label='Latency')
    ax2.tick_params(axis='y', labelcolor=color)
    
    fig.tight_layout() 
    plt.title(f'Performance Curve for HNSW (M={chosen_m}, efC={chosen_efc})')
    plt.show()

    This plot is the single most important artifact for communicating performance to stakeholders. It allows you to make data-driven decisions like, "We can achieve 99% recall with a p99 latency of 15ms, but achieving 99.5% recall would increase latency to 40ms. Is that trade-off acceptable?"
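
    One caveat: timing a 1,000-query batch yields mean latency, while SLAs are usually stated as tail percentiles. A minimal sketch for measuring per-query latency (single-threaded, one query per call, so the figures are pessimistic relative to batched serving):

    python
    index.hnsw.efSearch = 64  # the operating point under evaluation
    per_query_ms = []
    for q in query_vectors:
        t0 = time.perf_counter()
        index.search(q.reshape(1, -1), k=10)
        per_query_ms.append((time.perf_counter() - t0) * 1000)

    p50, p95, p99 = np.percentile(per_query_ms, [50, 95, 99])
    print(f"p50={p50:.2f} ms  p95={p95:.2f} ms  p99={p99:.2f} ms")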

    Handling Deletes and Updates

    Standard HNSW implementations (including the base ones in Faiss and hnswlib) do not efficiently support vector deletion or updates. Deleting a node would require relinking all of its neighbors, a complex and costly operation that can degrade the graph's global properties.

    The common production patterns are:

  • Tombstoning + Periodic Rebuilds: The most prevalent strategy. When a vector is deleted, its ID is added to a blacklist (a tombstone). During search, any results from the blacklist are filtered out. This is fast but leads to graph degradation as searches traverse through dead nodes. The entire index must be rebuilt from scratch periodically (e.g., daily or weekly) to purge the deleted vectors and restore performance (a minimal filter is sketched after this list).
  • Segmented Indexing: Similar to systems like Lucene, data is written to immutable, time-partitioned segments (smaller HNSW graphs). Deletions are handled with segment-level tombstones. Queries are sent to all relevant segments, and results are merged. Old segments with many deletions can be merged and rebuilt in the background. Vector databases like Milvus and Vespa employ variations of this strategy.
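
    A minimal sketch of the tombstoning pattern, assuming deletions are rare enough that over-fetching a small multiple of k still leaves enough live results; the deleted_ids set stands in for whatever delete API your service exposes:

    python
    deleted_ids = set()  # IDs tombstoned since the last full rebuild

    def search_with_tombstones(index, queries, k=10, overfetch=3):
        # Over-fetch, then drop tombstoned IDs. As deletions accumulate, the
        # overfetch factor must grow, which is why periodic rebuilds are needed.
        D, I = index.search(queries, k * overfetch)
        filtered = []
        for dists, ids in zip(D, I):
            live = [(d, i) for d, i in zip(dists, ids) if i not in deleted_ids]
            filtered.append(live[:k])
        return filtered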

    Section 4: The Billion-Scale Leap with Product Quantization (PQ)

    Our 1M vector example was manageable. Now, let's consider a 1 billion vector index.

    * Data: 1,000,000,000 vectors

    * Dimensions: 128

    * Data Type: float32 (4 bytes)

    Memory for Raw Vectors: 10^9 * 128 * 4 bytes = 512,000,000,000 bytes = 512 GB

    This is for the vectors alone. Now add the HNSW graph overhead with M=32:

    Memory for Graph: 10^9 * 32 * 2 * 4 bytes = 256,000,000,000 bytes = 256 GB

    Total RAM Required: 512 GB + 256 GB = 768 GB

    This is prohibitively expensive for most applications. The solution is to compress the vectors using quantization.
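
    The same back-of-the-envelope arithmetic is worth keeping as a small helper, so any proposed (N, dim, M) combination can be sanity-checked before committing to a multi-hour build. A minimal sketch using the approximation above (layer-0 links only, which dominate the graph cost):

    python
    def hnsw_flat_memory_gb(num_vectors, dim, M, bytes_per_float=4, bytes_per_link=4):
        # Raw vectors plus roughly 2*M links per node at layer 0; the upper
        # layers add only a few percent on top, so they are ignored here.
        vector_bytes = num_vectors * dim * bytes_per_float
        graph_bytes = num_vectors * M * 2 * bytes_per_link
        return (vector_bytes + graph_bytes) / 1e9

    print(f"{hnsw_flat_memory_gb(1_000_000_000, 128, 32):.0f} GB")  # -> 768 GB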

    HNSW + Product Quantization (PQ)

    Product Quantization is a vector compression technique that dramatically reduces memory footprint at the cost of some precision. At a high level, it works by:

  • Splitting each high-dimensional vector into m sub-vectors.
  • Running a separate k-means clustering algorithm on the sub-vectors within each subspace across the entire dataset. This generates m codebooks, each with k centroids (typically k=256).
  • To compress a vector, each of its sub-vectors is replaced by the ID of its nearest centroid in the corresponding codebook. If k=256, each ID can be stored in 8 bits (1 byte).

    Faiss provides a seamless integration via IndexHNSWPQ. This index type stores only the compressed PQ codes rather than the full-precision vectors; the HNSW graph is built and traversed over those codes, and at search time distances are computed asymmetrically, comparing the full-precision query vector against the stored codes of candidate neighbors.

    Code Example 3: Building and Searching IndexHNSWPQ

    python
    # Let's scale up our simulation to 10M vectors to see the impact
    num_database_vectors_large = 10_000_000
    db_vectors_large = generate_data(num_database_vectors_large, dimension)
    
    # --- PQ Parameters ---
    # m: number of sub-vectors. 128 is divisible by 8, 16, 32...
    m = 16  # Each 128-dim vector becomes 16 sub-vectors of 8-dim
    nbits = 8 # Each sub-vector is encoded with 8 bits (256 centroids)
    
    # This means each vector will be stored in m * nbits / 8 = 16 * 8 / 8 = 16 bytes.
    # This is a 128*4 / 16 = 32x compression ratio!
    
    # --- Build the Index ---
    # IndexHNSWPQ(d, pq_m, M) builds an HNSW graph with M links per node but
    # stores only PQ codes (m sub-quantizers, 8 bits each by default).
    hnsw_m = 32
    index_hnswpq = faiss.IndexHNSWPQ(dimension, m, hnsw_m)
    # HNSW build/search parameters live on the index's hnsw struct, as before
    index_hnswpq.hnsw.efConstruction = 400
    index_hnswpq.hnsw.efSearch = 64
    
    print("Training PQ...")
    # We need to train the PQ codebooks on a representative sample of the data
    # Faiss recommends training on at least 30-100x the number of centroids
    num_centroids = 1 << nbits # 256
    training_size = num_centroids * 50
    training_vectors = db_vectors_large[np.random.permutation(db_vectors_large.shape[0])[:training_size]]
    index_hnswpq.train(training_vectors)
    
    print("Adding vectors to IndexHNSWPQ...")
    index_hnswpq.add(db_vectors_large)
    
    # --- Memory Calculation ---
    mem_pq_vectors_mb = (num_database_vectors_large * m) / (1024**2)
    mem_pq_graph_mb = (num_database_vectors_large * 32 * 2 * 4) / (1024**2)
    total_pq_mem_mb = mem_pq_vectors_mb + mem_pq_graph_mb
    
    print(f"\nEstimated Memory for HNSWFlat (10M vecs): {((10e6*128*4)+(10e6*32*2*4))/(1024**2):.2f} MB")
    print(f"Estimated Memory for HNSWPQ (10M vecs): {total_pq_mem_mb:.2f} MB")
    
    # --- Benchmarking HNSWPQ ---
    # ... (search and recall calculation would be similar to before)
    # Note: Recall will be lower than the HNSWFlat index due to quantization error.

    Expected Output:

    text
    Estimated Memory for HNSWFlat (10M vecs): 7324.22 MB
    Estimated Memory for HNSWPQ (10M vecs): 2593.99 MB

    The vector storage shrinks by 32x (from roughly 4,883 MB to 153 MB); the remaining footprint is dominated by the graph links, which PQ does not compress. Extrapolated to a billion vectors, the raw vectors drop from 512 GB to 16 GB, bringing the total (with the 256 GB graph) within reach of a single large-memory server or a small shard set. Compression of this kind is what makes billion-scale ANN viable. The trade-off is a lower recall ceiling. The art of tuning then expands to include the PQ parameters (m, nbits) alongside the HNSW parameters, searching for the best balance of memory, latency, and accuracy that the business requirements can tolerate.
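
    For completeness, the elided search-and-recall step mirrors the earlier benchmark, with one difference: the ground truth must be recomputed against db_vectors_large (expensive, but done once offline). A minimal sketch:

    python
    # Recompute exact ground truth for the 10M set, then evaluate the PQ index.
    index_flat_large = faiss.IndexFlatL2(dimension)
    index_flat_large.add(db_vectors_large)
    _, I_gt_large = index_flat_large.search(query_vectors, ground_truth_k)

    index_hnswpq.hnsw.efSearch = 128
    _, I_pq = index_hnswpq.search(query_vectors, k=10)
    print(f"Recall@10 (HNSWPQ): {calculate_recall(I_pq, I_gt_large, k=10):.4f}")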


    Conclusion: From Practitioner to Expert

    Moving beyond default HNSW parameters is the defining step from being a user of ANN libraries to an architect of high-performance retrieval systems. The core principles are universal:

  • Build-time parameters (M, ef_construction) are an investment. You pay a high, one-time computational cost to build a high-quality graph structure. This investment dictates the maximum possible performance of your system. Skimping here will create a permanent bottleneck that no amount of search-time tuning can fix.
  • ef_search is your dynamic, runtime lever. It allows you to navigate the recall/latency trade-off curve defined by your index's build quality. Characterize this curve meticulously.
  • Benchmarking is non-negotiable. The optimal parameters are entirely dependent on your data's dimensionality, distribution, and size. The systematic process of building, measuring, and analyzing is the only path to a truly optimized system.
  • At massive scale, quantization is mandatory. Raw float vectors are a luxury reserved for small-to-medium-sized indexes. Mastering combined indexes like IndexHNSWPQ is essential for tackling billion-scale problems without exorbitant hardware costs.

    By internalizing these concepts and applying a rigorous tuning methodology, you can engineer vector search systems that are not only functional but also efficient, scalable, and precisely aligned with your production requirements.
