Tuning HNSW for Billion-Scale ANN: M & ef_construction Deep Dive

19 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

Beyond Defaults: Mastering the Core of High-Performance HNSW

In the realm of large-scale information retrieval and recommender systems, Approximate Nearest Neighbor (ANN) search over high-dimensional vectors has become a foundational technology. While many engineers are familiar with the concept of vector embeddings and the need for ANN, the transition from a proof-of-concept on a few million items to a production system serving billions of vectors reveals a chasm of complexity. At the heart of this challenge lies the Hierarchical Navigable Small World (HNSW) algorithm, the de facto industry standard for its exceptional performance.

However, treating HNSW as a black box with default parameters is a critical mistake that leads to excessive memory usage, slow build times, and poor recall-latency trade-offs. Senior engineers tasked with building these systems must understand that HNSW is not an algorithm to be used, but a complex data structure to be tuned.

This article bypasses introductory material entirely. We assume you understand what vector embeddings are, why cosine similarity or L2 distance is used, and why a brute-force O(N) search is intractable. Instead, we will focus exclusively on the two most critical build-time parameters that define the very structure and quality of your HNSW graph: M, the connectivity parameter, and ef_construction, the build-time search beam width. We will explore:

  • The fundamental interplay between M and ef_construction: How they collaboratively shape the graph's topology and its search efficiency.
  • A systematic benchmarking methodology: A production-oriented approach to find the optimal parameter set for your specific data distribution and performance requirements (QPS vs. Recall).
  • Advanced runtime tuning with ef_search: How to leverage the pre-built graph to dynamically adjust the performance curve without costly rebuilds.
  • Memory footprint analysis and scaling to a billion vectors: We'll break down the memory formula for HNSW and demonstrate why Product Quantization (PQ) is a non-negotiable component for massive-scale deployments.

Our analysis will use Python with industry-standard libraries like hnswlib and faiss, providing complete, runnable code examples that model a real-world tuning process.


    Section 1: The Core Architectural Trade-off: M vs. ef_construction

    The HNSW graph is a multi-layered structure where the top layers contain sparse long-range connections for coarse navigation, and the bottom layers contain dense short-range connections for fine-grained search. The quality of this graph is paramount, and it is almost entirely dictated by M and ef_construction during the initial indexing phase.

    M: Defining Graph Connectivity

    M specifies the maximum number of bi-directional links (neighbors) that any given node can have on a single layer of the graph (with the exception of layer 0, which can have up to 2M links). It directly controls the density of the graph.

    * Low M (e.g., 8-12): Creates a sparse graph.

    * Pros: Lower memory consumption per vector, faster index build times.

    * Cons: The navigational paths are less robust. The search might get trapped in a local minimum, failing to find the true nearest neighbors, thus leading to lower peak recall. A sparse graph requires a larger search-time beam (ef_search) to achieve acceptable recall, potentially increasing query latency.

    * High M (e.g., 48-96): Creates a dense, highly interconnected graph.

    * Pros: More robust and numerous paths to any given node, enabling higher recall ceilings even with a smaller ef_search.

    * Cons: Significantly increased memory footprint. The index size on disk and in RAM can explode. Index build times are much longer as more neighbor candidates must be considered and linked for each insertion.

    ef_construction: Defining Build-Time Quality

    While M sets the static limit on connections, ef_construction controls the dynamic quality of those connections during the build process. When a new vector is inserted, the algorithm traverses the graph from a top-level entry point to find its approximate location. ef_construction defines the size of the dynamic candidate list (a beam search or priority queue) used during this traversal.

    * Low ef_construction (e.g., 100): The insertion process performs a relatively shallow search for potential neighbors.

    * Pros: Dramatically faster index build times.

    * Cons: The algorithm may only find locally optimal neighbors for the new node, not the globally best ones. This leads to a lower-quality graph structure, which directly harms the maximum achievable recall, regardless of how high you set the search-time ef_search parameter.

    * High ef_construction (e.g., 500+): The insertion process performs a much more exhaustive search for neighbors.

    * Pros: The resulting graph has connections that are much closer to the true nearest neighbors. This creates a highly efficient navigational structure, leading to higher recall for a given ef_search.

    * Cons: Extremely slow index build times. The cost is often super-linear.

    The Critical Interplay:

    A common mistake is to tune these parameters in isolation. A high M is wasted if ef_construction is too low, as you'll be creating a dense graph with poorly chosen, suboptimal connections. Conversely, an extremely high ef_construction with a very low M is inefficient; you're spending immense computational effort to find, say, the 500 best candidate neighbors, only to discard most of them and keep just M=8.

    The goal is to find a balance: ef_construction should be high enough to populate the M connections with high-quality candidates. A common rule of thumb is ef_construction >= 2 * M, but this is highly data-dependent and requires empirical validation.
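
    To make these knobs concrete, here is a minimal sketch of where they appear in hnswlib's API (the faiss equivalents are used in the benchmarks below); the dataset size and parameter values are illustrative only:

    python
    import numpy as np
    import hnswlib

    dim = 128
    num_elements = 100_000
    data = np.random.rand(num_elements, dim).astype('float32')

    # M and ef_construction are fixed at build time; changing them means a rebuild.
    index = hnswlib.Index(space='l2', dim=dim)
    index.init_index(max_elements=num_elements, M=32, ef_construction=400)
    index.add_items(data, np.arange(num_elements))

    # ef (the search-time beam width) can be adjusted at any point after the build.
    index.set_ef(64)
    labels, distances = index.knn_query(data[:5], k=10)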


    Section 2: A Production-Grade Benchmarking Workflow

    Theoretical understanding is insufficient. To choose the right parameters, you must benchmark against your own data or a representative dataset. Let's design a systematic experiment.

    For this example, we'll use a 1 million vector subset of the SIFT1B dataset (128 dimensions) to simulate a realistic scenario while keeping runtimes manageable for an article. The ground truth (true nearest neighbors) is available for this dataset, which is crucial for calculating recall.

    Setup

    First, let's install the necessary libraries and prepare our data.

    bash
    # We'll use Faiss for its robust HNSW implementation and utility functions
    pip install faiss-cpu numpy pandas matplotlib
    
    # In a real scenario, you'd download and prepare the SIFT dataset.
    # For this example, we'll generate synthetic data with a similar structure.
    python
    import numpy as np
    import faiss
    import time
    
    # --- Data Generation (Simulating SIFT1M) ---
    def generate_data(num_vectors, dim, noise_seed=0):
        print(f"Generating {num_vectors} vectors of dimension {dim}...")
        # Generate clustered data to make search non-trivial.
        # Centroids use a fixed seed so queries follow the same distribution as
        # the database; a distinct noise_seed per call keeps the query set from
        # duplicating the first database vectors exactly.
        num_clusters = 100
        centroid_rng = np.random.default_rng(42)
        centroids = centroid_rng.random((num_clusters, dim), dtype='float32')
        noise_rng = np.random.default_rng(noise_seed)
        noise = noise_rng.normal(0, 0.1, (num_vectors, dim)).astype('float32')
        # Assign each vector to a cluster round-robin and perturb around its centroid
        data = centroids[np.arange(num_vectors) % num_clusters] + noise
        # L2 normalize the vectors, as is common practice
        faiss.normalize_L2(data)
        return data
    
    dimension = 128
    num_database_vectors = 1_000_000
    num_query_vectors = 1_000
    
    db_vectors = generate_data(num_database_vectors, dimension, noise_seed=0)
    query_vectors = generate_data(num_query_vectors, dimension, noise_seed=123)
    
    # --- Ground Truth Calculation ---
    # This is computationally expensive and why we do it once offline.
    print("Calculating ground truth...")
    ground_truth_k = 100
    index_flat = faiss.IndexFlatL2(dimension)
    index_flat.add(db_vectors)
    D_gt, I_gt = index_flat.search(query_vectors, ground_truth_k)
    
    print("Setup complete.")

    Code Example 1: Systematic Tuning Script

    Now, we'll create a loop to build and evaluate HNSW indexes with different parameter combinations.

    python
    def calculate_recall(I_ann, I_gt, k):
        """Calculates mean recall@k for the search results."""
        # I_ann: (num_queries, k) array of ANN results
        # I_gt:  (num_queries, >=k) array of ground truth results
        # For each query, count how many of its top-k ANN results appear in its
        # own top-k ground truth, then average over all queries.
        n_queries = I_ann.shape[0]
        found = sum(len(set(I_ann[i, :k]) & set(I_gt[i, :k])) for i in range(n_queries))
        return found / (n_queries * k)
    
    def benchmark_hnsw(m_val, ef_construction_val, ef_search_val):
        """Builds, benchmarks, and returns stats for an HNSW index."""
        print(f"\n--- Benchmarking M={m_val}, efC={ef_construction_val}, efS={ef_search_val} ---")
        
        # 1. Build the index
        index = faiss.IndexHNSWFlat(dimension, m_val)
        index.hnsw.efConstruction = ef_construction_val
        
        start_build_time = time.time()
        index.add(db_vectors)
        end_build_time = time.time()
        build_time = end_build_time - start_build_time
        
        # 2. Set search-time parameter
        index.hnsw.efSearch = ef_search_val
        
        # 3. Perform the search and measure latency
        start_search_time = time.time()
        D_ann, I_ann = index.search(query_vectors, k=10)
        end_search_time = time.time()
        
        search_time_ms = (end_search_time - start_search_time) * 1000
        qps = num_query_vectors / (search_time_ms / 1000)
        
        # 4. Calculate recall
        recall = calculate_recall(I_ann, I_gt, k=10)
    
        # 5. Measure memory (approximate)
        # In a real scenario, you'd use a more robust method to get RSS
        # For Faiss, we can estimate from the components.
        # VECTORS: num_vectors * dim * 4 bytes
        # GRAPH: num_vectors * M * 2 * 4 bytes (approx for bi-directional links)
        mem_vectors_mb = (db_vectors.nbytes) / (1024**2)
        mem_graph_mb = (num_database_vectors * m_val * 2 * 4) / (1024**2)
        total_mem_mb = mem_vectors_mb + mem_graph_mb
        
        print(f"Build Time: {build_time:.2f}s")
        print(f"QPS: {qps:.2f}")
        print(f"Recall@10: {recall:.4f}")
        print(f"Est. Memory: {total_mem_mb:.2f} MB")
        
        return {
            "M": m_val,
            "efConstruction": ef_construction_val,
            "efSearch": ef_search_val,
            "build_time_s": build_time,
            "qps": qps,
            "recall_at_10": recall,
            "memory_mb": total_mem_mb
        }
    
    # --- Run the experiment ---
    results = []
    M_values = [16, 32, 64]
    ef_construction_values = [200, 400]
    ef_search_values = [32, 64, 128] # We'll explore this more in the next section
    
    for m in M_values:
        for efc in ef_construction_values:
            # We only need to build the index once for each (M, efC) pair
            print(f"\nBuilding index for M={m}, efC={efc}...")
            index = faiss.IndexHNSWFlat(dimension, m)
            index.hnsw.efConstruction = efc
            build_start = time.time()
            index.add(db_vectors)
            build_time = time.time() - build_start
    
            for efs in ef_search_values:
                print(f"  Searching with efS={efs}...")
                index.hnsw.efSearch = efs
                
                search_start = time.time()
                D_ann, I_ann = index.search(query_vectors, k=10)
                search_time = time.time() - search_start
                
                qps = num_query_vectors / search_time
                recall = calculate_recall(I_ann, I_gt, k=10)
                
                results.append({
                    "M": m, "efC": efc, "efS": efs,
                    "build_time": build_time,
                    "qps": qps, "recall": recall
                })
    
    # --- Analyze Results ---
    import pandas as pd
    df = pd.DataFrame(results)
    print("\n--- BENCHMARK RESULTS ---")
    print(df.to_string())

    Analysis of Results (Hypothetical Output, abridged to the most informative rows)

    text
    --- BENCHMARK RESULTS ---
        M  efC  efS  build_time         qps    recall
    0  16  200   32   15.541245  15032.1234  0.9215
    1  16  200   64   15.541245   8123.4567  0.9532
    2  16  200  128   15.541245   4321.9876  0.9610
    3  16  400   32   25.123456  15543.8765  0.9450
    4  16  400   64   25.123456   8456.1234  0.9710
    5  16  400  128   25.123456   4567.5432  0.9785
    6  32  200   32   28.987654  14567.1234  0.9510  <-- Poor efC for M
    7  32  200   64   28.987654   7987.4321  0.9750
    8  32  200  128   28.987654   4123.8765  0.9810
    9  32  400   32   48.543210  15876.5432  0.9790
    10 32  400   64   48.543210   8876.1234  0.9910  <-- Sweet spot
    11 32  400  128   48.543210   4987.5432  0.9945
    12 64  400   32   85.123456  15999.1234  0.9850
    13 64  400   64   85.123456   8998.4321  0.9950
    14 64  400  128   85.123456   5123.8765  0.9972

    From this data, we can draw critical conclusions:

    * Higher ef_construction consistently yields better recall. Compare rows (0-2) with (3-5). For the same M=16, increasing efC from 200 to 400 gives a significant recall boost (e.g., 0.9532 -> 0.9710 at efS=64) for a moderate increase in build time.

    * The diminishing returns of M. Moving from M=32 to M=64 (rows 10 vs 13) provides a tiny recall increase (0.9910 -> 0.9950) but nearly doubles the build time and, critically, the memory required for the graph structure. For many applications, M=32 is a much better trade-off.

    * ef_search is the runtime knob. For a single built index (e.g., M=32, efC=400), we can see a clear trade-off curve: increasing efS from 32 to 128 improves recall from 0.9790 to 0.9945, but QPS drops by nearly 70%. This is the primary lever you will use in production (a selection helper is sketched after this list).
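
    This selection step can be automated: given a recall floor from the product requirements, pick the configuration that clears it with the highest throughput. A minimal sketch over the df DataFrame built above (the 0.99 target is illustrative):

    python
    recall_target = 0.99

    # Keep only configurations that meet the recall floor, then rank by QPS.
    viable = df[df["recall"] >= recall_target]
    if viable.empty:
        print("No configuration meets the recall target; increase efC, efS, or M.")
    else:
        best = viable.sort_values("qps", ascending=False).iloc[0]
        print("Highest-QPS configuration meeting the target:")
        print(best)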


    Section 3: Advanced Considerations and Edge Cases

    The Latency/Recall Curve and Dynamic `ef_search`

    The most powerful feature of HNSW is the ability to tune the ef_search parameter per query. This allows for a multi-tiered service level. For instance, a background batch job could use a high ef_search for maximum quality, while an interactive user-facing query could use a lower ef_search to meet strict latency SLAs.
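
    In Faiss, efSearch is a mutable attribute on the index's hnsw struct, so a tier-based policy can be implemented by setting it just before each request. A minimal sketch (the tier names and values are hypothetical, and the attribute is shared, so concurrent requests with different tiers on one index object need coordination):

    python
    # Hypothetical service tiers; the exact values come from the curve below.
    EF_SEARCH_BY_TIER = {"interactive": 48, "batch": 256}

    def search_with_tier(index, queries, tier, k=10):
        # efSearch is read at query time, so setting it here changes the
        # recall/latency trade-off for this request only.
        index.hnsw.efSearch = EF_SEARCH_BY_TIER[tier]
        return index.search(queries, k)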

    Your goal after the build-time tuning is to select a single (M, ef_construction) pair that produces the best potential graph. Then, you characterize its performance across a range of ef_search values to create a recall-vs-latency curve.

    Code Example 2: Plotting the Performance Curve

    python
    import matplotlib.pyplot as plt
    
    # Let's choose our best index from the benchmark: M=32, efC=400
    chosen_m = 32
    chosen_efc = 400
    
    print(f"Characterizing index M={chosen_m}, efC={chosen_efc}")
    index = faiss.IndexHNSWFlat(dimension, chosen_m)
    index.hnsw.efConstruction = chosen_efc
    index.add(db_vectors)
    
    search_ef_range = [16, 24, 32, 48, 64, 96, 128, 192, 256]
    curve_qps = []
    curve_recall = []
    
    for efs in search_ef_range:
        index.hnsw.efSearch = efs
        
        start = time.time()
        D, I = index.search(query_vectors, k=10)
        end = time.time()
        
        qps = num_query_vectors / (end - start)
        recall = calculate_recall(I, I_gt, k=10)
        
        curve_qps.append(qps)
        curve_recall.append(recall)
    
    # Plotting
    fig, ax1 = plt.subplots(figsize=(10, 6))
    
    color = 'tab:red'
    ax1.set_xlabel('Recall@10')
    ax1.set_ylabel('QPS (Queries per Second)', color=color)
    ax1.plot(curve_recall, curve_qps, color=color, marker='o', label='QPS')
    ax1.tick_params(axis='y', labelcolor=color)
    ax1.grid(True)
    
    ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
    color = 'tab:blue'
    ax2.set_ylabel('Latency (ms per query)', color=color)
    latencies_ms = [1000 / qps for qps in curve_qps]
    ax2.plot(curve_recall, latencies_ms, color=color, marker='x', label='Latency')
    ax2.tick_params(axis='y', labelcolor=color)
    
    fig.tight_layout() 
    plt.title(f'Performance Curve for HNSW (M={chosen_m}, efC={chosen_efc})')
    plt.show()

    This plot is the single most important artifact for communicating performance to stakeholders. It allows you to make data-driven decisions like, "We can achieve 99% recall with a p99 latency of 15ms, but achieving 99.5% recall would increase latency to 40ms. Is that trade-off acceptable?"
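
    One caveat: timing a 1,000-query batch yields mean latency, while SLAs are usually stated as tail percentiles. A minimal sketch for measuring per-query latency (single-threaded, one query per call, so the figures are pessimistic relative to batched serving):

    python
    index.hnsw.efSearch = 64  # the operating point under evaluation
    per_query_ms = []
    for q in query_vectors:
        t0 = time.perf_counter()
        index.search(q.reshape(1, -1), k=10)
        per_query_ms.append((time.perf_counter() - t0) * 1000)

    p50, p95, p99 = np.percentile(per_query_ms, [50, 95, 99])
    print(f"p50={p50:.2f} ms  p95={p95:.2f} ms  p99={p99:.2f} ms")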

    Handling Deletes and Updates

    Standard HNSW implementations (including the base ones in Faiss and hnswlib) do not efficiently support vector deletion or updates. Deleting a node would require relinking all of its neighbors, a complex and costly operation that can degrade the graph's global properties.

    The common production patterns are:

  • Tombstoning + Periodic Rebuilds: The most prevalent strategy. When a vector is deleted, its ID is added to a blacklist (a tombstone). During search, any results from the blacklist are filtered out. This is fast but leads to graph degradation as searches traverse through dead nodes. The entire index must be rebuilt from scratch periodically (e.g., daily or weekly) to purge the deleted vectors and restore performance (a minimal filter is sketched after this list).
  • Segmented Indexing: Similar to systems like Lucene, data is written to immutable, time-partitioned segments (smaller HNSW graphs). Deletions are handled with segment-level tombstones. Queries are sent to all relevant segments, and results are merged. Old segments with many deletions can be merged and rebuilt in the background. Vector databases like Milvus and Vespa employ variations of this strategy.
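
    A minimal sketch of the tombstoning pattern, assuming deletions are rare enough that over-fetching a small multiple of k still leaves enough live results; the deleted_ids set stands in for whatever delete API your service exposes:

    python
    deleted_ids = set()  # IDs tombstoned since the last full rebuild

    def search_with_tombstones(index, queries, k=10, overfetch=3):
        # Over-fetch, then drop tombstoned IDs. As deletions accumulate, the
        # overfetch factor must grow, which is why periodic rebuilds are needed.
        D, I = index.search(queries, k * overfetch)
        filtered = []
        for dists, ids in zip(D, I):
            live = [(d, i) for d, i in zip(dists, ids) if i not in deleted_ids]
            filtered.append(live[:k])
        return filtered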

    Section 4: The Billion-Scale Leap with Product Quantization (PQ)

    Our 1M vector example was manageable. Now, let's consider a 1 billion vector index.

    * Data: 1,000,000,000 vectors

    * Dimensions: 128

    * Data Type: float32 (4 bytes)

    Memory for Raw Vectors: 10^9 * 128 * 4 bytes = 512,000,000,000 bytes = 512 GB

    This is for the vectors alone. Now add the HNSW graph overhead with M=32:

    Memory for Graph: 10^9 * 32 * 2 * 4 bytes = 256,000,000,000 bytes = 256 GB

    Total RAM Required: 512 GB + 256 GB = 768 GB

    This is prohibitively expensive for most applications. The solution is to compress the vectors using quantization.
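
    The same back-of-the-envelope arithmetic is worth keeping as a small helper, so any proposed (N, dim, M) combination can be sanity-checked before committing to a multi-hour build. A minimal sketch using the approximation above (layer-0 links only, which dominate the graph cost):

    python
    def hnsw_flat_memory_gb(num_vectors, dim, M, bytes_per_float=4, bytes_per_link=4):
        # Raw vectors plus roughly 2*M links per node at layer 0; the upper
        # layers add only a few percent on top, so they are ignored here.
        vector_bytes = num_vectors * dim * bytes_per_float
        graph_bytes = num_vectors * M * 2 * bytes_per_link
        return (vector_bytes + graph_bytes) / 1e9

    print(f"{hnsw_flat_memory_gb(1_000_000_000, 128, 32):.0f} GB")  # -> 768 GB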

    HNSW + Product Quantization (PQ)

    Product Quantization is a vector compression technique that dramatically reduces memory footprint at the cost of some precision. At a high level, it works by:

  • Splitting each high-dimensional vector into m sub-vectors.
  • Running a separate k-means clustering algorithm on the sub-vectors within each subspace across the entire dataset. This generates m codebooks, each with k centroids (typically k=256).
  • To compress a vector, each of its sub-vectors is replaced by the ID of its nearest centroid in the corresponding codebook. If k=256, each ID can be stored in 8 bits (1 byte).

    Faiss provides a seamless integration via IndexHNSWPQ. This index type stores only the compressed PQ codes rather than the full-precision vectors; the HNSW graph is built and traversed over those codes, and at search time distances are computed asymmetrically, comparing the full-precision query vector against the stored codes of candidate neighbors.

    Code Example 3: Building and Searching IndexHNSWPQ

    python
    # Let's scale up our simulation to 10M vectors to see the impact
    num_database_vectors_large = 10_000_000
    db_vectors_large = generate_data(num_database_vectors_large, dimension)
    
    # --- PQ Parameters ---
    # m: number of sub-vectors. 128 is divisible by 8, 16, 32...
    m = 16  # Each 128-dim vector becomes 16 sub-vectors of 8-dim
    nbits = 8 # Each sub-vector is encoded with 8 bits (256 centroids)
    
    # This means each vector will be stored in m * nbits / 8 = 16 * 8 / 8 = 16 bytes.
    # This is a 128*4 / 16 = 32x compression ratio!
    
    # --- Build the Index ---
    # IndexHNSWPQ(d, pq_m, M) builds an HNSW graph with M links per node but
    # stores only PQ codes (m sub-quantizers, 8 bits each by default).
    hnsw_m = 32
    index_hnswpq = faiss.IndexHNSWPQ(dimension, m, hnsw_m)
    # HNSW build/search parameters live on the index's hnsw struct, as before
    index_hnswpq.hnsw.efConstruction = 400
    index_hnswpq.hnsw.efSearch = 64
    
    print("Training PQ...")
    # We need to train the PQ codebooks on a representative sample of the data
    # Faiss recommends training on at least 30-100x the number of centroids
    num_centroids = 1 << nbits # 256
    training_size = num_centroids * 50
    training_vectors = db_vectors_large[np.random.permutation(db_vectors_large.shape[0])[:training_size]]
    index_hnswpq.train(training_vectors)
    
    print("Adding vectors to IndexHNSWPQ...")
    index_hnswpq.add(db_vectors_large)
    
    # --- Memory Calculation ---
    mem_pq_vectors_mb = (num_database_vectors_large * m) / (1024**2)
    mem_pq_graph_mb = (num_database_vectors_large * 32 * 2 * 4) / (1024**2)
    total_pq_mem_mb = mem_pq_vectors_mb + mem_pq_graph_mb
    
    print(f"\nEstimated Memory for HNSWFlat (10M vecs): {((10e6*128*4)+(10e6*32*2*4))/(1024**2):.2f} MB")
    print(f"Estimated Memory for HNSWPQ (10M vecs): {total_pq_mem_mb:.2f} MB")
    
    # --- Benchmarking HNSWPQ ---
    # ... (search and recall calculation would be similar to before)
    # Note: Recall will be lower than the HNSWFlat index due to quantization error.

    Expected Output:

    text
    Estimated Memory for HNSWFlat (10M vecs): 7324.22 MB
    Estimated Memory for HNSWPQ (10M vecs): 2593.99 MB

    The vector storage shrinks by 32x (from roughly 4,883 MB to 153 MB); the remaining footprint is dominated by the graph links, which PQ does not compress. Extrapolated to a billion vectors, the raw vectors drop from 512 GB to 16 GB, bringing the total (with the 256 GB graph) within reach of a single large-memory server or a small shard set. Compression of this kind is what makes billion-scale ANN viable. The trade-off is a lower recall ceiling. The art of tuning then expands to include the PQ parameters (m, nbits) alongside the HNSW parameters, searching for the best balance of memory, latency, and accuracy that the business requirements can tolerate.
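
    For completeness, the elided search-and-recall step mirrors the earlier benchmark, with one difference: the ground truth must be recomputed against db_vectors_large (expensive, but done once offline). A minimal sketch:

    python
    # Recompute exact ground truth for the 10M set, then evaluate the PQ index.
    index_flat_large = faiss.IndexFlatL2(dimension)
    index_flat_large.add(db_vectors_large)
    _, I_gt_large = index_flat_large.search(query_vectors, ground_truth_k)

    index_hnswpq.hnsw.efSearch = 128
    _, I_pq = index_hnswpq.search(query_vectors, k=10)
    print(f"Recall@10 (HNSWPQ): {calculate_recall(I_pq, I_gt_large, k=10):.4f}")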


    Conclusion: From Practitioner to Expert

    Moving beyond default HNSW parameters is the defining step from being a user of ANN libraries to an architect of high-performance retrieval systems. The core principles are universal:

  • Build-time parameters (M, ef_construction) are an investment. You pay a high, one-time computational cost to build a high-quality graph structure. This investment dictates the maximum possible performance of your system. Skimping here will create a permanent bottleneck that no amount of search-time tuning can fix.
  • ef_search is your dynamic, runtime lever. It allows you to navigate the recall/latency trade-off curve defined by your index's build quality. Characterize this curve meticulously.
  • Benchmarking is non-negotiable. The optimal parameters are entirely dependent on your data's dimensionality, distribution, and size. The systematic process of building, measuring, and analyzing is the only path to a truly optimized system.
  • At massive scale, quantization is mandatory. Raw float vectors are a luxury reserved for small-to-medium-sized indexes. Mastering combined indexes like IndexHNSWPQ is essential for tackling billion-scale problems without exorbitant hardware costs.

    By internalizing these concepts and applying a rigorous tuning methodology, you can engineer vector search systems that are not only functional but also efficient, scalable, and precisely aligned with your production requirements.
