Optimizing HNSW Indexes for High-Recall Vector Search at Scale

Goh Ling Yong

Beyond Naive k-NN: Why HNSW Dominates Production Vector Search

As senior engineers building AI-powered features, we've moved past the novelty of vector embeddings and are now entrenched in the operational challenges of serving them. The textbook k-Nearest Neighbor (k-NN) search, with its O(N*D) complexity, is a non-starter for any dataset exceeding a few thousand vectors. This brings us to Approximate Nearest Neighbor (ANN) search, the bedrock of modern semantic search, recommendation engines, and RAG pipelines.

Among the pantheon of ANN algorithms—LSH, Annoy, ScaNN—one has emerged as the de facto industry standard for its remarkable balance of speed and accuracy: Hierarchical Navigable Small World (HNSW). Yet, treating HNSW as a black box is a common path to production pitfalls: runaway memory usage, unacceptable query latency, or disappointingly low recall.

This article is not an introduction to HNSW. It assumes you understand its layered graph structure and the greedy search traversal mechanism. Instead, we will dissect the critical tuning parameters and advanced implementation patterns required to deploy HNSW effectively at scale. We will explore the fundamental trade-offs, benchmark representative configurations, and work through real-world problems like memory optimization, filtered search, and data mutability.


The Core Tuning Levers: `M`, `ef_construction`, and `ef_search`

The performance of an HNSW index is not magic; it's a direct result of the graph's topology, which you control through three primary parameters during index construction and querying.

  • M (Max Connections): This parameter defines the maximum number of bidirectional links (neighbors) each node can have on any given layer of the graph. It is set once, during index initialization.

    * Impact: A higher M creates a denser, more connected graph. This increases the probability of finding the true nearest neighbors (higher recall) and makes the graph more robust. It comes at a significant cost: memory usage grows linearly with M (roughly M × sizeof(pointer) per node per layer), and index builds take longer because more candidate connections must be evaluated for each new point.

    * Typical Values: 8-64. A common starting point is M=16. For high-dimensional data, or when high recall is paramount, values like 32 or 48 are often used.

  • ef_construction (Effective Construction): This build-time parameter controls the size of the dynamic candidate list used during the insertion of a new element. The algorithm maintains a priority queue of the closest ef_construction nodes it has found so far while traversing the graph to find the insertion point and its neighbors.

    * Impact: A larger ef_construction value leads to a higher-quality index. The algorithm explores more potential neighbors for each new node, resulting in better graph connectivity and ultimately higher search recall. The trade-off is a significantly slower index build. This parameter has a more pronounced effect on recall than M.

    * Typical Values: 64-1024. A good default is often 200 or 400. If your index build time is not a constraint (e.g., a nightly batch job), pushing this value higher can yield a superior index.

  • ef_search (Effective Search): This query-time parameter is analogous to ef_construction but applies during search. It defines the size of the dynamic candidate list (priority queue) used by the greedy search to find the nearest neighbors.

    * Impact: This is the most critical parameter for tuning the recall-latency trade-off. A larger ef_search forces the algorithm to explore more paths in the graph, substantially increasing the chance of finding the true nearest neighbors (improving recall), at the direct cost of higher query latency. ef_search must be greater than or equal to k, the number of neighbors you want to retrieve.

    * Typical Values: k to 4096. If you need 10 neighbors (k=10), starting ef_search at 40 or 50 is reasonable. For high-recall applications, this can be pushed into the hundreds or thousands.

    Practical Benchmarking: The Recall-Latency Curve

    Talking about these trade-offs is abstract. Let's make it concrete. The single most important exercise when deploying HNSW is to generate a recall-vs-latency (or recall-vs-QPS) curve for your specific dataset and hardware. This allows you to make an informed, data-driven decision about your production configuration.

    Here is a Python script using the hnswlib library to demonstrate this process. We'll generate synthetic data, build an index, and then test query performance with varying ef_search values.

    python
    import hnswlib
    import numpy as np
    import time
    import matplotlib.pyplot as plt
    
    # 1. --- Data and Index Configuration ---
    dim = 128  # Vector dimension
    num_elements = 100_000 # Number of vectors in the dataset
    num_queries = 1_000 # Number of vectors to use for querying
    k = 10 # Number of nearest neighbors to retrieve
    
    # Generate random data
    data = np.float32(np.random.random((num_elements, dim)))
    query_data = np.float32(np.random.random((num_queries, dim)))
    
    # HNSW index parameters
    M = 16
    ef_construction = 200
    space = 'l2' # 'l2' for Euclidean distance, 'ip' for inner product, 'cosine' for cosine similarity
    
    # 2. --- Ground Truth Calculation ---
    # For benchmarking recall, we need the true nearest neighbors.
    # This is computationally expensive and done offline.
    print("Calculating ground truth (brute force). This may take a while...")
    start_time = time.time()
    
    # Using hnswlib's brute force index for simplicity
    ground_truth_index = hnswlib.BFIndex(space=space, dim=dim)
    ground_truth_index.init_index(max_elements=num_elements)
    ground_truth_index.add_items(data)
    
    ground_truth_labels, _ = ground_truth_index.knn_query(query_data, k=k)
    
    end_time = time.time()
    print(f"Ground truth calculation took {end_time - start_time:.2f} seconds.")
    
    # 3. --- Index Building ---
    print(f"Building HNSW index with M={M}, ef_construction={ef_construction}...")
    
    hnsw_index = hnswlib.Index(space=space, dim=dim)
    hnsw_index.init_index(max_elements=num_elements, ef_construction=ef_construction, M=M)
    
    start_time = time.time()
    hnsw_index.add_items(data)
    end_time = time.time()
    print(f"HNSW index build took {end_time - start_time:.2f} seconds.")
    
    # 4. --- Benchmarking Loop ---
    
    ef_search_values = [10, 20, 40, 80, 120, 200, 300, 400]
    recalls = []
    qps_values = []
    
    for ef in ef_search_values:
        print(f"\nBenchmarking with ef_search = {ef}...")
        hnsw_index.set_ef(ef)
    
        start_time = time.time()
        hnsw_labels, _ = hnsw_index.knn_query(query_data, k=k)
        end_time = time.time()
    
        # Calculate query performance
        total_time = end_time - start_time
        qps = num_queries / total_time
        qps_values.append(qps)
    
        # Calculate recall@k
        correct_predictions = 0
        for i in range(num_queries):
            # Intersection of ground truth and HNSW results
            common_neighbors = len(set(ground_truth_labels[i]) & set(hnsw_labels[i]))
            correct_predictions += common_neighbors
        
        # Recall is the fraction of true neighbors found across all queries
        recall = correct_predictions / (num_queries * k)
        recalls.append(recall)
    
        print(f"  Recall@{k}: {recall:.4f}")
        print(f"  QPS: {qps:.2f}")
    
    # 5. --- Plotting the Results ---
    
    fig, ax1 = plt.subplots(figsize=(10, 6))
    
    # Plot Recall
    color = 'tab:blue'
    ax1.set_xlabel('ef_search')
    ax1.set_ylabel('Recall@10', color=color)
    ax1.plot(ef_search_values, recalls, marker='o', color=color, label='Recall@10')
    ax1.tick_params(axis='y', labelcolor=color)
    ax1.grid(True)
    
    # Create a second y-axis for QPS
    ax2 = ax1.twinx()
    color = 'tab:red'
    ax2.set_ylabel('Queries Per Second (QPS)', color=color)
    ax2.plot(ef_search_values, qps_values, marker='x', linestyle='--', color=color, label='QPS')
    ax2.tick_params(axis='y', labelcolor=color)
    
    fig.tight_layout()
    plt.title('HNSW Performance: Recall vs. QPS for M=16')
    plt.show()
    

    Running this script will produce a plot showing that as ef_search increases, recall climbs towards 1.0 while QPS drops significantly (that is, per-query latency rises). This curve is the key artifact for your system design review. It lets you answer questions like: "To achieve 98% recall for our RAG pipeline, what latency can we expect?" or "If our latency budget is 50ms, what is the highest recall we can expect?"
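
    To turn this curve into a concrete configuration decision, you can post-process the benchmark output directly. The snippet below is a small sketch that reuses the ef_search_values, recalls, and qps_values lists from the script above; the 98% recall target and 50ms budget are illustrative numbers, not recommendations, and the latency figure is the average per-query time implied by QPS, not a p99.

    python
    # Assumes ef_search_values, recalls, and qps_values from the benchmark above.
    target_recall = 0.98        # illustrative recall target
    latency_budget_ms = 50.0    # illustrative per-query latency budget

    results = list(zip(ef_search_values, recalls, qps_values))

    # Smallest ef_search that reaches the recall target (the cheapest acceptable config)
    meeting_target = [r for r in results if r[1] >= target_recall]
    if meeting_target:
        ef, recall, qps = min(meeting_target, key=lambda r: r[0])
        print(f"ef_search={ef}: recall={recall:.4f}, ~{1000.0 / qps:.2f} ms/query")
    else:
        print("No tested ef_search reached the target; extend the sweep.")

    # Best recall achievable within the latency budget
    within_budget = [r for r in results if 1000.0 / r[2] <= latency_budget_ms]
    if within_budget:
        ef, recall, qps = max(within_budget, key=lambda r: r[1])
        print(f"Within {latency_budget_ms} ms: best recall={recall:.4f} at ef_search={ef}")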


    Production Pattern 1: Taming Memory with Product Quantization (PQ)

    As your dataset grows from 100,000 to 100 million vectors, the memory footprint of your index becomes a dominant operational cost. A dataset of 100M vectors with 768 dimensions using 32-bit floats (float32) requires:

    100,000,000 vectors × 768 dimensions × 4 bytes per float ≈ 307.2 GB

    This is just for the raw vectors, before accounting for the HNSW graph overhead (M pointers per node). This memory cost can be prohibitive.
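
    As a quick sanity check before provisioning hardware, a back-of-the-envelope estimate is usually enough. The helper below is a rough sketch, not an exact accounting: it assumes an hnswlib-style layout with roughly 2*M four-byte neighbor links per node on the base layer plus about M more spread across the sparse upper layers, and it ignores per-node bookkeeping.

    python
    def estimate_hnsw_memory_gb(num_vectors: int, dim: int, M: int,
                                bytes_per_float: int = 4) -> dict:
        """Rough estimate of HNSW memory: raw vectors plus graph links, in decimal GB."""
        raw_vectors = num_vectors * dim * bytes_per_float
        # ~2*M links on layer 0 plus ~M links across upper layers, 4 bytes per link id
        graph_links = num_vectors * (2 * M + M) * 4
        gb = 1e9
        return {
            "raw_vectors_gb": raw_vectors / gb,
            "graph_links_gb": graph_links / gb,
            "total_gb": (raw_vectors + graph_links) / gb,
        }

    # 100M 768-dim float32 vectors with M=32
    print(estimate_hnsw_memory_gb(100_000_000, 768, 32))
    # {'raw_vectors_gb': 307.2, 'graph_links_gb': 38.4, 'total_gb': 345.6}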

    Product Quantization (PQ) is a vector quantization technique that dramatically reduces this memory footprint by compressing the vectors themselves. It's often combined with HNSW for large-scale deployments.

    How PQ Works (Simplified):

  • Divide: Each high-dimensional vector is split into m sub-vectors. For a 768-dim vector, you might split it into m=96 sub-vectors of 8 dimensions each.
  • Cluster: For each sub-vector position, run a k-means clustering algorithm (typically with k=256 centroids) on all the sub-vectors from that position across the entire dataset.
  • Encode: This creates m separate "codebooks," each with 256 small (8-dim) centroid vectors. Now, to compress a full 768-dim vector, you replace each of its 96 sub-vectors with the ID of its closest centroid in the corresponding codebook. Since there are 256 centroids, each ID can be stored in just 8 bits (1 byte).
  • The Result: The original 768 × 4 = 3,072-byte vector is now represented by 96 × 1 = 96 bytes. That's a 32x reduction in memory!

    The HNSW graph is then built on these quantized vectors. During search, distances are approximated using the codebooks, which is much faster than full-precision distance calculations.
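
    To see the compression step in isolation, FAISS exposes the quantizer as a standalone object. The sketch below uses faiss.ProductQuantizer on synthetic data to reproduce the 768-dim, m=96 example from above; the reconstruction error it prints is precisely the precision loss that the re-ranking pattern described next is designed to compensate for.

    python
    import faiss
    import numpy as np

    dim, m, nbits = 768, 96, 8              # 96 sub-vectors, 2^8 = 256 centroids each
    train = np.float32(np.random.random((20_000, dim)))
    x = np.float32(np.random.random((5, dim)))

    pq = faiss.ProductQuantizer(dim, m, nbits)
    pq.train(train)                          # learns the 96 codebooks via k-means

    codes = pq.compute_codes(x)              # shape (5, 96), dtype uint8
    x_approx = pq.decode(codes)              # approximate vectors rebuilt from the codes

    print("bytes per compressed vector:", codes.shape[1])   # 96, vs 3072 for raw float32
    print("mean reconstruction error:",
          float(np.mean(np.linalg.norm(x - x_approx, axis=1))))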

    The Re-ranking Pattern: Best of Both Worlds

    The obvious downside of PQ is the loss of precision. The compressed vectors are approximations, so the recall of an HNSW+PQ index will be lower than one with full-precision vectors.

    We can mitigate this with a two-stage re-ranking pattern:

  • Stage 1 (Approximate Search): Query the HNSW+PQ index for a generous number of candidates, e.g., 100, rather than the final k=10.
  • Stage 2 (Exact Re-ranking): Fetch the full-precision vectors for those 100 candidates from a separate key-value store (like Redis, DynamoDB, or even a file on disk), then run an exact, brute-force k-NN search on this small subset to find the true top k=10 neighbors.

    This pattern leverages the speed and memory efficiency of the approximate index to narrow the search space, then uses the accuracy of the original vectors for the final ranking, delivering high-quality results with a manageable performance profile.

    Implementation with FAISS:

    Facebook AI's faiss library provides a powerful implementation of this pattern.

    python
    import faiss
    import numpy as np
    import time
    
    # --- Data Configuration ---
    dim = 128
    num_elements = 500_000
    data = np.float32(np.random.random((num_elements, dim)))
    
    # --- PQ Index Configuration ---
    # Number of sub-vectors for PQ
    m = 8 
    # Number of bits per sub-vector ID (2^8 = 256 centroids)
    bits = 8
    
    # Sanity check: dimension must be divisible by m
    assert dim % m == 0
    
    # HNSW parameters
    M = 32
    
    # 1. --- Build the HNSW+PQ Index ---
    # IndexHNSWPQ builds an HNSW graph (M=32) over PQ-compressed vectors,
    # using m sub-quantizers with 2^bits = 256 centroids each.
    # The metric defaults to L2 (Euclidean distance).
    index = faiss.IndexHNSWPQ(dim, m, M)  # PQ code size defaults to 8 bits, matching `bits`
    
    print("Training PQ codebooks...")
    # PQ training only needs a representative sample; we pass the full dataset
    # here for simplicity.
    index.train(data)
    
    print("Adding vectors to the index...")
    start_time = time.time()
    index.add(data)
    end_time = time.time()
    print(f"Index build took {end_time - start_time:.2f} seconds.")
    
    # Memory usage of the index can be estimated or checked via system tools
    
    # 2. --- Querying with Re-ranking (Conceptual) ---
    # In a real system, full vectors would be stored elsewhere
    full_precision_vectors = data # For this example
    
    query_vector = np.float32(np.random.random((1, dim)))
    
    # Stage 1: Approximate search
    num_candidates_to_fetch = 100
    
    # Set ef_search via the underlying HNSW structure
    index.hnsw.efSearch = 128
    
    start_time = time.time()
    # D_approx are approximate distances, I_approx are the indices (IDs)
    D_approx, I_approx = index.search(query_vector, num_candidates_to_fetch)
    end_time = time.time()
    print(f"Stage 1 (HNSW+PQ search) took {(end_time - start_time) * 1000:.4f} ms")
    
    # Stage 2: Re-ranking
    candidate_ids = I_approx[0]
    candidate_vectors = full_precision_vectors[candidate_ids]
    
    start_time = time.time()
    # Calculate exact L2 distances between the query and the candidates
    diff = candidate_vectors - query_vector
    exact_distances = np.linalg.norm(diff, axis=1)
    
    # Get the top 10 from the re-ranked candidates
    top_k = 10
    reranked_indices = np.argsort(exact_distances)[:top_k]
    
    final_ids = candidate_ids[reranked_indices]
    final_distances = exact_distances[reranked_indices]
    end_time = time.time()
    print(f"Stage 2 (Re-ranking {num_candidates_to_fetch} vectors) took {(end_time - start_time) * 1000:.4f} ms")
    
    print("\nFinal Top-10 IDs:", final_ids)

    Production Pattern 2: The Intricacy of Metadata Filtering

    A purely semantic search is rare in production. The business requirement is almost always: "Find products semantically similar to 'winter jacket' that are in stock, under $100, and have a rating above 4 stars."

    This is the problem of filtered ANN search, and it's notoriously difficult to implement efficiently.

    The Naive (and Flawed) Approaches:

  • Pre-filtering: Apply the metadata filter first, then search only the matching subset (typically by brute force, since the prebuilt HNSW graph spans the full dataset). Unless the filtered subset is small, this throws away the sub-linear behavior that made you reach for an ANN index in the first place; it is only viable when the subset is tiny or frequently reused.
  • Post-filtering: Perform the HNSW search to get the top k results, then filter them. This is fast but terrible for recall. If the 10 most semantically similar jackets are all out of stock, your search will return zero results, even if the 11th through 20th results were perfect matches for the filter.

    The Advanced Solution: Filtering During Traversal

    The correct approach, implemented by modern vector databases (Weaviate, Pinecone, Milvus, etc.), is to integrate the filtering logic directly into the HNSW graph traversal.

    During the search, as the algorithm navigates from node to node, it maintains its priority queue of candidates. Before adding a potential neighbor to this queue, it performs a crucial check:

    Does this node's metadata match the user's filter?

    If it doesn't, the node is discarded and not explored further. This ensures that only vectors matching the filter criteria are ever considered as candidates.
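
    Newer releases of hnswlib expose this kind of per-candidate predicate as a filter callback on knn_query, which makes for a compact illustration. The sketch below assumes a version of hnswlib that supports the filter argument; the in_stock and price arrays are stand-ins for whatever metadata store your system actually uses.

    python
    import hnswlib
    import numpy as np

    dim, num_elements = 64, 10_000
    data = np.float32(np.random.random((num_elements, dim)))

    # Illustrative metadata, keyed by vector label. In production this would be a
    # structure that supports fast per-id lookups (e.g., an inverted index).
    in_stock = np.random.rand(num_elements) > 0.5
    price = np.random.uniform(10, 500, num_elements)

    index = hnswlib.Index(space='l2', dim=dim)
    index.init_index(max_elements=num_elements, ef_construction=200, M=16)
    index.add_items(data, ids=np.arange(num_elements))
    index.set_ef(128)

    # Predicate evaluated for each candidate label during the search: only labels
    # that pass the check are eligible to enter the result set.
    def allowed(label: int) -> bool:
        return bool(in_stock[label]) and price[label] < 100

    query = np.float32(np.random.random((1, dim)))
    labels, distances = index.knn_query(query, k=10, filter=allowed)
    print(labels[0], distances[0])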

    Performance Implications and Edge Cases:

    * Latency Impact: This is not free. The metadata lookup for every considered node adds overhead. The performance impact depends heavily on the selectivity of your filter. A very broad filter (e.g., price < 1000) will have minimal impact, while a highly selective filter (e.g., product_id = 'abc-123') will force the algorithm to traverse many more nodes to find k valid matches, increasing latency.

    * Index Build Strategy: To make this work, the system needs an efficient way to map from a vector's ID in the HNSW graph to its metadata. This often involves a secondary index (e.g., an inverted index on metadata fields) that the search process can query rapidly.

    * The "Needle in a Haystack" Problem: If a filter is so selective that very few items in the entire dataset match it, the HNSW traversal can be highly inefficient. The algorithm might spend a long time navigating through dense regions of the graph where no nodes satisfy the filter. In these extreme cases, falling back to a pre-filter strategy might be the only option.
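
    One pragmatic way to handle that extreme is to estimate the filter's selectivity up front and route the query accordingly. The sketch below is a routing heuristic, not any library's API: matching_ids (produced by a secondary metadata index), data_store, and the 1% threshold are all assumptions about the surrounding system.

    python
    import numpy as np

    def filtered_search(index, query, k, matching_ids, data_store,
                        selectivity_threshold=0.01):
        """Route a filtered query based on how selective the filter is.

        matching_ids: set of ids that satisfy the metadata filter (from a secondary index).
        data_store:   callable returning full-precision vectors for an array of ids.
        """
        selectivity = len(matching_ids) / index.get_current_count()
        if selectivity < selectivity_threshold:
            # Needle in a haystack: score the small matching subset exactly.
            ids = np.fromiter(matching_ids, dtype=np.int64)
            dists = np.linalg.norm(data_store(ids) - query, axis=1)
            order = np.argsort(dists)[:k]
            return ids[order], dists[order]
        # Common case: let the graph traversal apply the predicate per candidate.
        labels, dists = index.knn_query(query.reshape(1, -1), k=k,
                                        filter=lambda label: label in matching_ids)
        return labels[0], dists[0]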

    While implementing this from scratch is a massive undertaking, understanding the mechanism is crucial for using managed vector databases effectively. When you write a query with a `where` clause, you are invoking this complex traversal logic.

    json
    // Example query to a managed vector database (conceptual)
    {
      "query": "winter jacket",
      "top_k": 10,
      "filter": {
        "operator": "And",
        "operands": [
          {
            "path": ["in_stock"],
            "operator": "Equal",
            "valueBoolean": true
          },
          {
            "path": ["price"],
            "operator": "LessThan",
            "valueNumber": 100
          },
          {
            "path": ["rating"],
            "operator": "GreaterThan",
            "valueNumber": 4.0
          }
        ]
      }
    }

    When designing your system, you must account for the increased latency of filtered queries and expose this to the user. A search with no filters might have a p99 latency of 50ms, while a complex one could be 250ms. This is an expected and necessary trade-off.


    Edge Case: Handling Deletes and Updates

    The HNSW graph is fundamentally a static data structure. Once built, its intricate web of connections is optimized for fast traversal. Deleting a node is problematic because it can sever critical paths, partitioning the graph and degrading search quality for all subsequent queries. Directly updating a vector is equivalent to a delete followed by an insert.

    The Production Pattern: Soft Deletes and Periodic Re-indexing

    Since direct deletion is infeasible in a high-throughput environment, the standard pattern is to use soft deletes.

  • Marking for Deletion: Maintain a separate data structure, typically a hash set or bitmap of deleted vector IDs (a Bloom filter also works, at the cost of occasional false positives that drop live results). When a delete request comes in, add the vector's ID to this set. The HNSW graph itself remains untouched.
  • Query-Time Filtering: During a search query, retrieve k + x candidates from the HNSW index, where x is a buffer (e.g., 50% of k). After retrieving the results, iterate through them and discard any whose IDs are present in the deletion set. Return the top k remaining results.
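
    A minimal version of this pattern is sketched below; the deleted_ids set and the 50% over-fetch factor are assumptions that mirror the description above. (hnswlib also ships mark_deleted / unmark_deleted, which implement the same soft-delete idea inside the index itself.)

    python
    import numpy as np

    deleted_ids = set()             # in practice, persisted alongside the index

    def soft_delete(vector_id: int) -> None:
        deleted_ids.add(vector_id)  # the HNSW graph itself is left untouched

    def search_with_soft_deletes(index, query, k: int, overfetch: float = 0.5):
        """Fetch k*(1+overfetch) candidates, drop soft-deleted ones, return the top k."""
        fetch_k = int(np.ceil(k * (1.0 + overfetch)))
        labels, distances = index.knn_query(query, k=fetch_k)
        keep = [i for i, label in enumerate(labels[0]) if label not in deleted_ids]
        keep = keep[:k]             # candidates arrive sorted by distance
        # If fewer than k survive, a real system would retry with a larger fetch_k.
        return labels[0][keep], distances[0][keep]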

    Consequences and Mitigation:

    * Index Bloat: The HNSW index will continue to grow in size, holding vectors that are no longer active.

    * Performance Degradation: As the percentage of deleted items increases, query performance suffers. You have to fetch more and more candidates (k+x) to find k valid ones. Recall can also drop as deleted nodes might still be part of optimal search paths for active nodes.

    This leads to the second part of the pattern: periodic re-indexing. On a schedule (e.g., nightly or weekly), build a completely new HNSW index from scratch using only the currently active vectors. Once the new index is built and validated, atomically swap it into production. This purges all the deleted data, compacts the index, and restores optimal performance.
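
    A sketch of the rebuild-and-swap step is shown below. It assumes the active (non-deleted) vectors can be pulled from a primary store (fetch_active_vectors is a placeholder for that lookup), that queries read the index through the single _live_index reference, and it reuses the deleted_ids set from the earlier sketch.

    python
    import threading
    import hnswlib
    import numpy as np

    _index_lock = threading.Lock()
    _live_index = None              # the query path reads this reference

    def rebuild_and_swap(fetch_active_vectors, dim: int, M: int = 16,
                         ef_construction: int = 200) -> None:
        """Build a fresh index from only the active vectors, validate it, then swap it in."""
        global _live_index
        ids, vectors = fetch_active_vectors()       # excludes soft-deleted items
        new_index = hnswlib.Index(space='l2', dim=dim)
        new_index.init_index(max_elements=len(ids),
                             ef_construction=ef_construction, M=M)
        new_index.add_items(np.float32(vectors), ids=np.asarray(ids))
        new_index.set_ef(128)
        # Validation (e.g., a recall spot-check against a held-out query set) goes here.
        with _index_lock:
            _live_index = new_index                 # readers pick up the compacted index
        # Tombstones are now baked out; a real system must also handle deletes
        # that arrived while the rebuild was running.
        deleted_ids.clear()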

    This strategy provides a robust and performant way to handle data mutability in a system built on a static index structure.

    Conclusion: HNSW is a Process, Not a Product

    Deploying HNSW at scale is a masterclass in engineering trade-offs. There is no single "best" configuration. The optimal solution is a function of your specific data distribution, hardware, and, most importantly, your product requirements.

    * For a recommendation engine where recall is king and latency can be higher, you'll favor a dense graph (high M) and deep search paths (high ef_search).

    * For a real-time search-as-you-type feature, you'll prioritize QPS, accepting slightly lower recall by using a smaller ef_search.

    * For massive datasets, you'll architect a re-ranking pipeline using PQ to manage memory costs.

    * For any real-world application, you'll need to understand the performance profile of filtered search and design your system accordingly.

    By moving beyond the default parameters and engaging deeply with these advanced patterns, you can transform HNSW from a powerful algorithm into a reliable, scalable, and cost-effective production system.
