Vector DB Indexing: HNSW vs. IVF for Metadata Filtering

Goh Ling Yong

The Hidden Cost of `WHERE` Clauses in Vector Search

In modern AI applications, particularly Retrieval-Augmented Generation (RAG), vector search is never just about finding the nearest neighbors to a query vector. Production systems demand a combination of semantic similarity and structured metadata filtering. A user might ask, "What were my recent support tickets about?", which translates to a query that must be semantically similar to "recent support tickets" and filtered by `user_id = 'u-123'`.

The deceptive simplicity of a vector DB API like `db.search(vector, filter={'user_id': 'u-123'})` masks a profound performance challenge. How does the database efficiently apply this filter without compromising the speed of its Approximate Nearest Neighbor (ANN) index? The answer determines whether your application scales or grinds to a halt.

The core problem lies in the conflict between the geometric structure of an ANN index (like HNSW or IVF) and the arbitrary nature of metadata filters. The index organizes vectors based on spatial proximity, not on arbitrary key-value pairs in their metadata. A naive implementation of filtering can lead to two disastrous scenarios:

  • Post-filtering: The ANN index retrieves the top k nearest neighbors, and then the metadata filter is applied. If none of the top k results match the filter, you return an empty set, even if relevant documents exist just beyond the initial k candidates. This severely damages recall and is often unacceptable.
  • Inefficient Pre-filtering: The index attempts to apply the filter during the search. In many index types, particularly graph-based ones like HNSW, this can force the traversal algorithm to visit a massive number of nodes that will ultimately be discarded, effectively negating the index's benefits and approaching a brute-force scan. A minimal sketch of both failure modes follows this list.
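
To make these failure modes concrete, here is a minimal, framework-agnostic sketch. The `ann_index.search` method, the in-memory `metadata` list, and the `allowed_ids` parameter are hypothetical stand-ins for whatever your vector store exposes, not a specific library's API:

    python
    # Hypothetical post-filtering: search first, filter afterwards.
    def post_filter_search(ann_index, metadata, query, k, predicate):
        # Over-fetch so that some results hopefully survive the filter.
        ids, scores = ann_index.search(query, k * 10)
        survivors = [(i, s) for i, s in zip(ids, scores) if predicate(metadata[i])]
        return survivors[:k]  # May be empty even though matching documents exist.

    # Hypothetical pre-filtering: restrict the candidate set, then search.
    def pre_filter_search(ann_index, metadata, query, k, predicate):
        allowed_ids = {i for i, m in enumerate(metadata) if predicate(m)}
        # The index must be able to constrain its traversal to allowed_ids;
        # if it cannot do so cheaply, the search degrades toward a brute-force scan.
        return ann_index.search(query, k, allowed_ids=allowed_ids)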
This article dissects the internal mechanics of two dominant indexing strategies—IVF (Inverted File) and HNSW (Hierarchical Navigable Small World)—specifically through the lens of high-cardinality metadata filtering. We will move beyond theoretical descriptions to implement and benchmark production-ready patterns that address this critical challenge head-on.

    Setting Up Our Testbed

    To demonstrate these concepts with concrete, runnable code, we'll use the faiss-cpu library from Meta AI, a foundational tool for vector search. We'll simulate a multi-tenant document system with 1,000,000 vectors, each with associated metadata for tenant_id and doc_type.

    python
    import numpy as np
    import faiss
    import time
    import pandas as pd
    
    # --- 1. Dataset Simulation ---
    def create_dataset(num_vectors=1_000_000, dim=128, num_tenants=1000):
        print(f"Generating {num_vectors} vectors of dimension {dim}...")
        # Generate random vectors
        vectors = np.float32(np.random.rand(num_vectors, dim))
        # Normalize vectors so L2 distance gives the same ranking as cosine similarity
        faiss.normalize_L2(vectors)
        
        # Generate metadata
        tenant_ids = np.random.randint(0, num_tenants, size=num_vectors)
        doc_types = np.random.choice(['type_a', 'type_b', 'type_c'], size=num_vectors, p=[0.5, 0.3, 0.2])
        
        # Store metadata separately. In a real DB, this would be indexed.
        metadata = pd.DataFrame({
            'tenant_id': tenant_ids,
            'doc_type': doc_types
        })
        
        print("Dataset created.")
        return vectors, metadata
    
    # --- 2. Evaluation Helper ---
    def evaluate_search(index, query_vectors, ground_truth_indices, k, filter_mask=None):
        start_time = time.time()
        # In FAISS, filtering is done by passing an IDSelector
        # For this simulation, we'll do post-filtering unless a specific pattern is used
        distances, labels = index.search(query_vectors, k * 10 if filter_mask is not None else k) # Fetch more to filter from
        end_time = time.time()
    
        latency = (end_time - start_time) * 1000 # in ms
        
        # Apply the filter if one is provided (simulating post-filtering)
        if filter_mask is not None:
            final_labels = []
            for i in range(query_vectors.shape[0]):
                query_labels = labels[i]
                # -1 indicates an empty result from FAISS
                valid_indices = query_labels[query_labels != -1]
                # Apply the filter mask
                mask = filter_mask[valid_indices]
                filtered_indices = valid_indices[mask]
                final_labels.append(filtered_indices[:k])
        else:
            final_labels = [row[row != -1] for row in labels]
    
        # Calculate recall
        recall_at_k = 0
        for i in range(query_vectors.shape[0]):
            retrieved = set(final_labels[i])
            truth = set(ground_truth_indices[i])
            if len(truth) > 0:
              recall_at_k += len(retrieved.intersection(truth)) / len(truth)
        
        recall = recall_at_k / query_vectors.shape[0]
        
        return latency, recall
    
    # --- Main Setup ---
    DIM = 128
    NUM_VECTORS = 1_000_000
    NUM_TENANTS = 1000
    K = 10
    
    vectors, metadata = create_dataset(NUM_VECTORS, DIM, NUM_TENANTS)
    
    # Create a query set
    query_vectors = np.float32(np.random.rand(100, DIM))
    faiss.normalize_L2(query_vectors)
    
    # For recall calculation, we need ground truth
    # This is a brute-force search, slow but accurate
    print("Calculating ground truth for queries...")
    index_flat = faiss.IndexFlatL2(DIM)
    index_flat.add(vectors)
    _, ground_truth_indices = index_flat.search(query_vectors, K)
    print("Setup complete.")

    This setup gives us a realistic dataset and a way to measure latency and recall, the two most important metrics for evaluating our filtering strategies.

    Strategy 1: IVF and Partition-Aware Filtering

    The Inverted File (IVF) index works by partitioning the vector space into cells using a clustering algorithm like k-means. Each vector is assigned to its nearest cluster centroid. A search then involves two steps: first, identifying the nprobe closest cluster centroids to the query vector, and second, performing an exhaustive search only on the vectors within those nprobe cells.

    This partitioning is IVF's greatest strength and its greatest weakness for filtering.

    The Strength: If your filter criteria align with the partitions, you can achieve incredibly efficient pre-filtering. You simply don't search the partitions that don't match the filter.

    The Weakness: If your filter is orthogonal to the partitions (e.g., filtering by a random doc_type), the database must still visit all nprobe cells and then apply the filter to every vector within them—a costly operation.
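
    To make that cost concrete, FAISS can apply a filter inside a single IVF index by passing an IDSelector through the per-query search parameters. Below is a minimal sketch reusing the testbed variables defined above; it assumes a FAISS release recent enough to accept per-query SearchParameters via the params keyword (older versions do not support it), and it rebuilds an IVF index similar to the one trained in the next section.

    python
    # In-index filtering on one large IVF index: every vector in the probed cells
    # is still visited; the selector merely discards the ones that don't match.
    ids_type_a = np.where((metadata['doc_type'] == 'type_a').values)[0]
    sel = faiss.IDSelectorBatch(ids_type_a)   # hash-based membership test

    quantizer_demo = faiss.IndexFlatL2(DIM)
    ivf_demo = faiss.IndexIVFFlat(quantizer_demo, DIM, 1024)
    ivf_demo.train(vectors)
    ivf_demo.add(vectors)

    params_ivf = faiss.SearchParametersIVF()
    params_ivf.nprobe = 10
    params_ivf.sel = sel
    D_demo, I_demo = ivf_demo.search(query_vectors, K, params=params_ivf)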

    Advanced Pattern: Multi-Index IVF for Tenant Isolation

    A powerful production pattern for multi-tenant systems is to create a separate IVF index for each tenant. This physically isolates tenant data, making filtering by tenant_id a zero-cost operation—you just select the correct index to query. This is the ultimate form of partition-aware filtering.

    Let's implement this and compare it to a naive, single large index.

    python
    # --- Naive IVF with Post-Filtering ---
    print("\n--- Testing Naive IVF with Post-Filtering ---")
    
    nlist = 1024  # Number of clusters/partitions
    quantizer = faiss.IndexFlatL2(DIM)
    index_ivf = faiss.IndexIVFFlat(quantizer, DIM, nlist)
    
    print("Training Naive IVF index...")
    index_ivf.train(vectors)
    index_ivf.add(vectors)
    index_ivf.nprobe = 10 # Set how many partitions to search
    
    # Simulate a query for a specific tenant
    target_tenant_id = 42
    filter_mask = (metadata['tenant_id'] == target_tenant_id).values
    
    # Calculate ground truth for this specific filter
    filtered_vectors = vectors[filter_mask]
    index_flat_filtered = faiss.IndexFlatL2(DIM)
    index_flat_filtered.add(filtered_vectors)
    _, gt_filtered_indices_relative = index_flat_filtered.search(query_vectors, K)
    
    # Map relative indices back to original indices
    original_indices = np.where(filter_mask)[0]
    gt_filtered_indices_absolute = [original_indices[inds[inds != -1]] for inds in gt_filtered_indices_relative]
    
    naive_latency, naive_recall = evaluate_search(index_ivf, query_vectors, gt_filtered_indices_absolute, K, filter_mask)
    
    print(f"Naive IVF Latency: {naive_latency:.2f} ms")
    print(f"Naive IVF Recall: {naive_recall:.4f}")
    
    # --- Partition-Aware Multi-Index IVF ---
    print("\n--- Testing Partition-Aware Multi-Index IVF ---")
    
    tenant_indexes = {}
    build_start_time = time.time()
    for tenant_id in range(NUM_TENANTS):
        tenant_mask = (metadata['tenant_id'] == tenant_id)
        tenant_vectors = vectors[tenant_mask]
        
        if len(tenant_vectors) > 0:
            # Each tenant gets their own small IVF index
            # For very small tenants, a Flat index might be better, but we use IVF for consistency
            nlist_tenant = min(128, max(4, len(tenant_vectors) // 40))
            quantizer_tenant = faiss.IndexFlatL2(DIM)
            index_tenant = faiss.IndexIVFFlat(quantizer_tenant, DIM, nlist_tenant)
            index_tenant.train(tenant_vectors)
            index_tenant.add(tenant_vectors)
            index_tenant.nprobe = 5
            tenant_indexes[tenant_id] = index_tenant
    
    build_end_time = time.time()
    print(f"Multi-Index Build Time: {(build_end_time - build_start_time):.2f} s")
    
    # Search using the dedicated index for the target tenant
    search_start_time = time.time()
    if target_tenant_id in tenant_indexes:
        target_index = tenant_indexes[target_tenant_id]
        # No filtering needed at search time!
        distances, labels_relative = target_index.search(query_vectors, K)
    else:
        labels_relative = np.full((query_vectors.shape[0], K), -1)
    search_end_time = time.time()
    
    # Evaluation requires mapping back to original indices
    original_tenant_indices = np.where(metadata['tenant_id'] == target_tenant_id)[0]
    final_labels_absolute = []
    for row in labels_relative:
        valid_indices = row[row != -1]
        absolute_indices = original_tenant_indices[valid_indices]
        final_labels_absolute.append(absolute_indices)
    
    # Calculate recall against the filtered ground truth
    recall_at_k_multi = 0
    for i in range(query_vectors.shape[0]):
        retrieved = set(final_labels_absolute[i])
        truth = set(gt_filtered_indices_absolute[i])
        if len(truth) > 0:
            recall_at_k_multi += len(retrieved.intersection(truth)) / len(truth)
    recall_multi = recall_at_k_multi / query_vectors.shape[0]
    
    latency_multi = (search_end_time - search_start_time) * 1000
    
    print(f"Multi-Index IVF Latency: {latency_multi:.2f} ms")
    print(f"Multi-Index IVF Recall: {recall_multi:.4f}")

    Analysis of Results:

    * Naive IVF Latency: Will be relatively high. The search probes 10 large partitions and then filters the results. If a tenant's data is spread across many partitions, this is inefficient.

    * Multi-Index IVF Latency: Will be significantly lower. We are searching a much smaller, dedicated index. The tenant_id filter is applied in O(1) time by simply choosing the right index.

    * Recall: The Multi-Index approach should have higher recall. The naive post-filtering approach might fetch k*10 results that are all for the wrong tenant, resulting in zero recall for that query.

    * Trade-offs: The Multi-Index pattern has higher operational complexity. You must manage thousands of indices, and the total memory footprint might be larger due to the per-index overhead. Indexing new data requires routing it to the correct tenant index (a sketch of such a routing helper follows). This pattern is ideal for SaaS applications where tenant_id is the primary filter in over 95% of queries.
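
    Here is a minimal sketch of that ingest routing, with hypothetical helper names, reusing the tenant_indexes dictionary built above (batching and ID bookkeeping are omitted):

    python
    # Route a new document's vector to the owning tenant's index.
    def add_document(tenant_indexes, tenant_id, vector, dim=DIM):
        vec = np.asarray(vector, dtype=np.float32).reshape(1, dim)
        faiss.normalize_L2(vec)
        index = tenant_indexes.get(tenant_id)
        if index is None:
            # Cold-start tenant: a flat index sidesteps IVF's training requirement
            # until the tenant has enough vectors to justify a trained index.
            index = faiss.IndexFlatL2(dim)
            tenant_indexes[tenant_id] = index
        index.add(vec)
        # In production you would also persist the mapping from this index's
        # internal position back to the global document ID.

    # Example: add_document(tenant_indexes, 42, np.random.rand(DIM))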

    Strategy 2: HNSW and the Two-Step Hybrid Search

    HNSW is a graph-based index. It builds a multi-layered graph where nodes are vectors and edges connect spatially close vectors. A search starts at a fixed entry point in the sparse top layer and greedily traverses the graph, getting closer to the query vector until it finds a local minimum. It then drops to a lower, denser layer and repeats the process.

    This structure provides excellent speed and recall for pure vector search, but it's fundamentally incompatible with arbitrary metadata filters. Applying a filter during graph traversal is an open research problem. If a node is filtered out, should the traversal stop? Should it backtrack? Any choice risks destroying the logarithmic time complexity of the search.

    Therefore, most systems that use HNSW with filters resort to post-filtering, which, as we've discussed, is often unacceptable due to poor recall.

    Advanced Pattern: The Two-Step Hybrid Search

    This production pattern decouples metadata filtering from vector search. It acknowledges that traditional databases are exceptionally good at filtering and retrieval based on structured attributes.

  • Step 1 (Metadata Filtering): Query a traditional database (e.g., PostgreSQL with a GIN index on metadata, or Elasticsearch) with the filter criteria (e.g., WHERE tenant_id = 42 AND doc_type = 'type_a'). This step returns a list of candidate vector IDs that match the filter. This list might contain thousands of IDs (a sketch of this step against a relational store follows this list).
  • Step 2 (Targeted Vector Search): Perform a vector search only on the subset of vectors identified in Step 1. FAISS's HNSW index doesn't expose a dedicated "search within these IDs" API, but we can achieve the same effect by building a temporary index over the candidates or, more efficiently, by passing an IDSelector to the search.
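
    In production, Step 1 would hit a real metadata store rather than an in-memory mask. Below is a hedged sketch against PostgreSQL, where the documents table, its columns, and the connection string are hypothetical; the benchmark code further down simulates this step with a NumPy mask instead.

    python
    import psycopg2  # hypothetical driver choice; any SQL client works

    def fetch_candidate_ids(tenant_id, doc_type):
        # Step 1: let the relational database do what it is good at --
        # selective filtering over indexed metadata columns.
        conn = psycopg2.connect("dbname=docs")  # hypothetical DSN
        with conn, conn.cursor() as cur:
            cur.execute(
                "SELECT vector_id FROM documents "
                "WHERE tenant_id = %s AND doc_type = %s",
                (tenant_id, doc_type),
            )
            rows = cur.fetchall()
        # FAISS ID selectors expect int64 vector IDs.
        return np.array([r[0] for r in rows], dtype=np.int64)

    # candidate_ids = fetch_candidate_ids(42, 'type_a')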
    Let's implement this pattern. For the benchmark, Step 1 is simulated in-process with a NumPy mask.

    python
    # --- HNSW with Naive Post-Filtering ---
    print("\n--- Testing HNSW with Naive Post-Filtering ---")
    M = 32 # Number of connections per node
    index_hnsw = faiss.IndexHNSWFlat(DIM, M)
    
    print("Building HNSW index...")
    index_hnsw.add(vectors)
    
    # Again, query for tenant 42
    filter_mask_hnsw = (metadata['tenant_id'] == target_tenant_id).values
    
    hnsw_naive_latency, hnsw_naive_recall = evaluate_search(index_hnsw, query_vectors, gt_filtered_indices_absolute, K, filter_mask_hnsw)
    
    print(f"Naive HNSW Latency: {hnsw_naive_latency:.2f} ms")
    print(f"Naive HNSW Recall: {hnsw_naive_recall:.4f}") # Expect this to be low!
    
    # --- HNSW with Two-Step Hybrid Search ---
    print("\n--- Testing HNSW with Two-Step Hybrid Search ---")
    
    # Step 1: Metadata Filtering (simulated)
    step1_start = time.time()
    candidate_ids = np.where(metadata['tenant_id'] == target_tenant_id)[0]
    step1_end = time.time()
    print(f"Step 1 (Metadata Filter) took: {(step1_end - step1_start)*1000:.2f} ms and found {len(candidate_ids)} candidates.")
    
    # Step 2: Targeted Vector Search using an IDSelector
    # An IDSelector tells FAISS to only return results from a specific set of IDs during search.
    # IDSelectorArray checks membership with a linear scan of the ID list, which is fine for
    # small candidate sets; for larger ones, the hash-based IDSelectorBatch is usually faster.
    if len(candidate_ids) > 0:
        id_selector = faiss.IDSelectorArray(candidate_ids)
        
        # The search parameters object allows passing the selector
        params = faiss.SearchParametersHNSW()
        params.sel = id_selector
        
        step2_start = time.time()
        # Note: we can now ask for just K, as filtering is done during the search
        distances, labels = index_hnsw.search(query_vectors, K, params=params)
        step2_end = time.time()
        
        hybrid_latency = ((step1_end - step1_start) + (step2_end - step2_start)) * 1000
    
        # Evaluate recall
        final_labels_hybrid = [row[row != -1] for row in labels]
        recall_at_k_hybrid = 0
        for i in range(query_vectors.shape[0]):
            retrieved = set(final_labels_hybrid[i])
            truth = set(gt_filtered_indices_absolute[i])
            if len(truth) > 0:
                recall_at_k_hybrid += len(retrieved.intersection(truth)) / len(truth)
        hybrid_recall = recall_at_k_hybrid / query_vectors.shape[0]
    else:
        hybrid_latency = ((step1_end - step1_start)) * 1000
        hybrid_recall = 1.0 # Or undefined, as there are no results to find
    
    print(f"Two-Step Hybrid HNSW Latency: {hybrid_latency:.2f} ms")
    print(f"Two-Step Hybrid HNSW Recall: {hybrid_recall:.4f}")

    Analysis of Results:

    * Naive HNSW Recall: This will likely be very poor. We over-fetch k*10 = 100 neighbors, but if the 100 closest vectors in the entire 1M dataset don't happen to belong to tenant_id=42, our recall will be zero, even if that tenant has thousands of documents.

    * Two-Step Hybrid Latency: The total latency is the sum of the metadata query and the targeted vector search. While the vector search part might be slightly slower than an unfiltered HNSW search (because the IDSelector adds checks), the overall approach is far more robust.

    * Two-Step Hybrid Recall: This will be near-perfect (close to 1.0). Because the search only returns results from the complete, correctly filtered candidate set, the true nearest neighbors that satisfy the filter are almost always found.

    * Edge Case: This pattern's performance depends heavily on the selectivity of the filter. If a filter like doc_type='type_a' returns 50% of the dataset (500,000 IDs), the IDSelector becomes less efficient, and the performance may degrade. This pattern shines when filters are highly selective (e.g., returning <1-5% of the total dataset). When the filter is extremely selective, raising HNSW's efSearch for the filtered query helps keep recall high, as sketched below.
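
    The efSearch parameter controls how wide the traversal's candidate beam is, so raising it for filtered queries trades latency for recall. A short sketch reusing index_hnsw and id_selector from above (assumes a FAISS build whose SearchParametersHNSW exposes efSearch, which recent releases do):

    python
    # Widen the HNSW search beam when the selector keeps only a tiny fraction of vectors.
    params_wide = faiss.SearchParametersHNSW()
    params_wide.sel = id_selector
    params_wide.efSearch = 256  # default is 16; larger values improve recall at higher latency
    D_wide, I_wide = index_hnsw.search(query_vectors, K, params=params_wide)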

    Strategy 3: The Composite Index (IVF + HNSW)

    What if you have a primary, low-cardinality filter (like tenant_id) but also need the high recall of HNSW for semantic search within that partition? This is where composite indexes come in. Some vector databases allow you to combine IVF and HNSW. The idea is to use IVF as a coarse-grained partitioning mechanism and then build an HNSW graph inside each IVF cell.

    A query first selects the nprobe partitions (which can be pre-filtered if the metadata aligns) and then performs an HNSW search only within the graphs of those selected partitions.

    Vanilla FAISS doesn't ship an index that builds an HNSW graph inside each IVF cell (its IVF_HNSW factory strings use HNSW only as the coarse quantizer), but the principle is the same as our Multi-Index IVF pattern, except that each tenant index is an IndexHNSWFlat instead of an IndexIVFFlat. This gives the best of both worlds for tenant-based searches: perfect isolation via partitioning and high-recall graph search within the partition.

    python
    # --- Composite Index (IVF partitions containing HNSW graphs) ---
    # We simulate this using our multi-index pattern, but with HNSW inside.
    print("\n--- Testing Composite 'Multi-Index HNSW' ---")
    
    tenant_hnsw_indexes = {}
    comp_build_start = time.time()
    for tenant_id in range(NUM_TENANTS):
        tenant_mask = (metadata['tenant_id'] == tenant_id)
        tenant_vectors = vectors[tenant_mask]
        
        if len(tenant_vectors) > 1: # HNSW needs some data
            # Each tenant gets their own HNSW index
            M_tenant = min(16, len(tenant_vectors) // 2) # Adjust M for smaller datasets
            if M_tenant < 4: M_tenant = 4
    
            index_tenant_hnsw = faiss.IndexHNSWFlat(DIM, M_tenant)
            index_tenant_hnsw.add(tenant_vectors)
            tenant_hnsw_indexes[tenant_id] = index_tenant_hnsw
    
    comp_build_end = time.time()
    print(f"Composite Index Build Time: {(comp_build_end - comp_build_start):.2f} s")
    
    # Search using the dedicated HNSW index for the target tenant
    search_start_time_comp = time.time()
    if target_tenant_id in tenant_hnsw_indexes:
        target_index_comp = tenant_hnsw_indexes[target_tenant_id]
        distances_comp, labels_relative_comp = target_index_comp.search(query_vectors, K)
    else:
        labels_relative_comp = np.full((query_vectors.shape[0], K), -1)
    search_end_time_comp = time.time()
    
    # Evaluation
    original_tenant_indices_comp = np.where(metadata['tenant_id'] == target_tenant_id)[0]
    final_labels_absolute_comp = []
    for row in labels_relative_comp:
        valid_indices = row[row != -1]
        absolute_indices = original_tenant_indices_comp[valid_indices]
        final_labels_absolute_comp.append(absolute_indices)
    
    recall_at_k_comp = 0
    for i in range(query_vectors.shape[0]):
        retrieved = set(final_labels_absolute_comp[i])
        truth = set(gt_filtered_indices_absolute[i])
        if len(truth) > 0:
            recall_at_k_comp += len(retrieved.intersection(truth)) / len(truth)
    recall_comp = recall_at_k_comp / query_vectors.shape[0]
    
    latency_comp = (search_end_time_comp - search_start_time_comp) * 1000
    
    print(f"Composite (Multi-Index HNSW) Latency: {latency_comp:.2f} ms")
    print(f"Composite (Multi-Index HNSW) Recall: {recall_comp:.4f}")

    This composite pattern delivers the low latency of partition selection and the high recall of HNSW, making it a formidable choice for multi-tenant architectures.

    Production Heuristics and Decision Framework

    There is no universally "best" indexing strategy. The optimal choice is a function of your data distribution, query patterns, and infrastructure constraints. Here is a decision framework for senior engineers:

    | Scenario | Primary Filter Characteristic | Recommended Strategy | Key Rationale |
    |---|---|---|---|
    | Multi-Tenant SaaS | tenant_id (low cardinality, always present) | Multi-Index IVF/HNSW | Provides perfect data isolation and pre-filtering. The choice of IVF vs. HNSW inside depends on per-tenant recall needs. |
    | E-commerce / Social Media | Arbitrary, high-cardinality combinations (tags, user IDs, categories) | Two-Step Hybrid Search (HNSW) | Decouples filtering to a scalable system (Postgres/ES). HNSW ensures high recall on the filtered candidate set. |
    | Internal Knowledge Base | Few or no filters; pure semantic search is key | Single Large HNSW Index | When filtering is not a concern, HNSW provides the best raw speed-recall trade-off. |
    | Logging/Telemetry Analysis | Primary filter on time-series data or data source | Partition-Aware IVF | Data can be naturally partitioned by date or source. nprobe can be tuned for desired recall vs. speed. |
    | Highly Constrained Memory/CPU (Edge Devices) | Any | IVF with Product Quantization (IVF_PQ) | IVF_PQ offers the best memory compression. Filtering remains a challenge and often requires the Two-Step pattern. |

    Final Performance Comparison (Illustrative)

    Running the code above will produce results that generally follow this pattern:

    | Strategy | Avg. Latency (ms) | Recall@10 | Notes |
    |---|---|---|---|
    | Naive IVF (Post-Filter) | 50-100 | ~0.75 | Low implementation effort, poor for sparse filters |
    | Naive HNSW (Post-Filter) | 20-40 | ~0.10 | Unacceptable recall, simple to implement |
    | Multi-Index IVF | 5-10 | ~0.95 | High operational overhead, best for tenant_id |
    | Two-Step Hybrid HNSW | 15-30 | ~0.99 | Requires external DB, robust for all filters |
    | Composite (Multi-Index HNSW) | 8-15 | ~0.99 | High overhead, best-in-class for tenants |

    Actual numbers will vary based on hardware, data distribution, and parameter tuning (nprobe, M, efSearch).
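
    For reference, this is where those knobs live on the FAISS indexes built above; the values are illustrative starting points, not recommendations:

    python
    # IVF: nlist is fixed at build time; nprobe is a per-search speed/recall dial.
    index_ivf.nprobe = 32                 # probe more cells: higher recall, higher latency

    # HNSW: M is fixed at construction; efConstruction and efSearch are tunable.
    index_hnsw.hnsw.efConstruction = 200  # only affects vectors added after it is set
    index_hnsw.hnsw.efSearch = 128        # per-search beam width (default 16)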

    Conclusion

    Metadata filtering is not an afterthought in vector search; it is a core architectural consideration. Relying on the naive filtering capabilities of a vector database can lead to severe performance and recall issues in production.

    For systems with a dominant, low-cardinality partitioning key like tenant_id, physically separating data into multiple indices (the Multi-Index IVF/HNSW pattern) offers unparalleled performance. For systems with complex, arbitrary, and high-cardinality filters, the Two-Step Hybrid Search pattern—leveraging a traditional database for filtering before a targeted HNSW search—provides the necessary recall and flexibility.

    By understanding the fundamental conflict between an ANN index's geometric structure and metadata's logical structure, you can design and implement robust, scalable, and accurate RAG systems that meet the demands of real-world applications.
