Advanced pgvector Indexing: HNSW vs. IVFFlat for Production AI

Goh Ling Yong

Beyond the Basics: Choosing Your Production pgvector Index

As an engineer building AI-powered features, you've likely moved past the initial excitement of vector embeddings and are now facing the harsh realities of production performance. You've chosen PostgreSQL and the pgvector extension for its operational simplicity and integration with your existing data model. The fundamental question is no longer if you can perform vector similarity search, but how you can do it at scale, with low latency, high accuracy, and within your infrastructure's memory and CPU budgets.

This is not a guide on what vector embeddings are. This is a deep analysis for senior engineers tasked with making a critical architectural decision: choosing between the IVFFlat and HNSW index types in pgvector. This choice has profound implications for index build time, query latency, memory consumption, recall accuracy, and your ability to handle dynamic data. We will dissect their internal mechanisms, run comprehensive benchmarks, and explore advanced patterns for complex scenarios like multi-tenancy and hybrid search.


The Contenders: A Tale of Two Algorithms

At the heart of pgvector's performance are two distinct Approximate Nearest Neighbor (ANN) search algorithms. Understanding their core mechanics is essential to predicting their behavior in production.

IVFFlat: The Quantized Grid

IVFFlat (Inverted File, Flat storage) operates on a simple, powerful principle: partitioning. It divides the vector space into cells, or lists, using a clustering algorithm, and stores each vector uncompressed ("flat") in its assigned list.

Internal Mechanics:

  • Training/Building: During CREATE INDEX, IVFFlat runs a k-means clustering algorithm on a sample of your data. The centers of these clusters become centroids. The number of centroids is determined by the lists parameter you set. Every vector in your table is then assigned to its nearest centroid's list.
  • Querying: When a query vector arrives, pgvector first finds the probes nearest centroids to the query. Instead of searching the entire dataset, it only scans the vectors in the lists belonging to those centroids. This dramatically reduces the search space.
  • Key Parameters:

    * lists (Index Build): The number of partitions. A higher number means more, smaller partitions. This is the most critical tuning parameter for IVFFlat. A common starting point is N / 1000 for up to 1M vectors, and sqrt(N) for larger datasets.

    * probes (Query Time): The number of lists to search at query time. A higher value increases accuracy (recall) at the cost of higher latency, as more vectors are scanned.
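
To turn these heuristics into a number, here is a minimal sketch (the items table name is a placeholder for your own vector table):

sql
-- Placeholder table name; picks the heuristic that matches the row count.
SELECT CASE
         WHEN count(*) <= 1000000 THEN count(*) / 1000   -- N / 1000
         ELSE ceil(sqrt(count(*)))::int                  -- sqrt(N)
       END AS suggested_lists
FROM items;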

    HNSW: The Guided Graph Traversal

    HNSW (Hierarchical Navigable Small World) takes a graph-based approach. It builds a multi-layered graph where links represent proximity. Upper layers have longer links for coarse searching, while lower layers have shorter links for fine-grained searching.

    Internal Mechanics:

  • Building: Unlike IVFFlat's batch process, HNSW builds its graph incrementally. When a new vector is inserted, it traverses the graph from a top-layer entry point, greedily moving closer to the new vector's position at each layer. At the bottom layer, it establishes connections to its nearest neighbors. This incremental nature is its key architectural advantage.
  • Querying: A query follows a similar path. It starts at the top-layer entry point, navigates to the closest point in that layer, then drops to the layer below and refines its search. This continues until it reaches the bottom layer, where it performs a detailed local search.
  • Key Parameters:

    * m (Index Build): The maximum number of connections per node in the graph. Higher m values create a denser, more accurate graph, increasing index size and build time.

    * ef_construction (Index Build): The size of the dynamic candidate list during index construction. A larger value leads to a better-quality graph (and thus better recall) but significantly slows down index creation.

    * ef_search (Query Time): The size of the dynamic candidate list during a search. This is the primary query-time tuning knob, analogous to probes in IVFFlat. Higher values improve recall but increase latency.
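
    For reference, pgvector's defaults are m = 16, ef_construction = 64, and hnsw.ef_search = 40. A minimal sketch (the items table is a placeholder) showing the build-time and query-time knobs side by side; ef_search can be scoped to a session or, with SET LOCAL, to a single transaction:

    sql
    -- Defaults made explicit; tune upward for higher recall.
    CREATE INDEX ON items
    USING hnsw (embedding vector_l2_ops)
    WITH (m = 16, ef_construction = 64);

    SET hnsw.ef_search = 40;         -- session-wide (the default)
    BEGIN;
    SET LOCAL hnsw.ef_search = 200;  -- only for this transaction
    -- ... run the similarity query here ...
    COMMIT;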


    Production Scenario 1: Multi-Tenant Document Search (Static Data)

    Imagine a SaaS application where each tenant uploads a knowledge base of documents. These documents are chunked, embedded, and stored for a RAG (Retrieval-Augmented Generation) system. Data is added in large batches, but updates are infrequent.

    IVFFlat Implementation

    IVFFlat is a strong candidate here. The static nature of the data means we can absorb the one-time cost of the k-means clustering during index build.

    Table Schema:

    sql
    CREATE EXTENSION IF NOT EXISTS vector;
    
    CREATE TABLE document_chunks (
        id BIGSERIAL PRIMARY KEY,
        tenant_id UUID NOT NULL,
        content TEXT,
        embedding VECTOR(768) -- Assuming a 768-dimension embedding model
    );
    
    -- A standard B-tree index for filtering by tenant is critical
    CREATE INDEX ON document_chunks (tenant_id);

    Choosing lists:

    Let's assume we have 10 million total chunks across all tenants. A good starting point for lists would be sqrt(10,000,000), which is approximately 3162. We can round this to a convenient number like 3200.

    Creating the Index:

    sql
    -- This can take a significant amount of time and resources.
    -- Consider running it during a maintenance window.
    CREATE INDEX ON document_chunks
    USING ivfflat (embedding vector_l2_ops)
    WITH (lists = 3200);

    Querying with probes:

    At query time, we tune probes to balance latency and recall. This can be set per-transaction.

    sql
    -- For a latency-sensitive, interactive request
    BEGIN;
    SET LOCAL ivfflat.probes = 10; -- Start with a low value
    SELECT id, content
    FROM document_chunks
    WHERE tenant_id = 'a1b2c3d4-...' -- CRITICAL: Filter first!
    ORDER BY embedding <-> '[...query_vector...]'
    LIMIT 5;
    COMMIT;
    
    -- For a background job where accuracy is paramount
    BEGIN;
    SET LOCAL ivfflat.probes = 40; -- Search more lists
    SELECT id, content
    FROM document_chunks
    WHERE tenant_id = 'a1b2c3d4-...' 
    ORDER BY embedding <-> '[...query_vector...]'
    LIMIT 5;
    COMMIT;

    IVFFlat Edge Cases:

    * The "Empty Cell" Problem: If your data distribution is highly skewed, the initial k-means clustering might create centroids in sparse areas, leading to underutilized or empty lists. This wastes space and can degrade performance.

    * The Re-indexing Imperative: IVFFlat's centroids are fixed after the index is built. If you add a significant amount of new data with a different distribution, the index quality will degrade. New vectors are added to the existing lists, but the centroids themselves don't move. Production systems often require a periodic REINDEX strategy for IVFFlat indexes on dynamic tables.
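
    A periodic rebuild can be done with REINDEX CONCURRENTLY (PostgreSQL 12+), which rebuilds the index without blocking reads or writes, at the cost of extra time and temporary disk space. A sketch, assuming the index was given an explicit name (the statements above rely on auto-generated names, so check pg_indexes for the actual one):

    sql
    -- Rebuild so the IVFFlat centroids reflect the current data distribution.
    -- Index name is illustrative; look it up in pg_indexes if it was auto-generated.
    REINDEX INDEX CONCURRENTLY document_chunks_embedding_idx;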


    Production Scenario 2: Real-Time Recommendation Engine (Dynamic Data)

    Consider an e-commerce platform that generates embeddings for user actions (clicks, purchases) in real-time. The goal is to find similar users or products based on this constantly evolving stream of events.

    HNSW Implementation

    HNSW's incremental build capability is its killer feature for this use case. There's no expensive, blocking k-means step. New vectors are integrated into the graph as they arrive.

    Table Schema:

    sql
    CREATE TABLE user_actions (
        id BIGSERIAL PRIMARY KEY,
        user_id UUID NOT NULL,
        action_type TEXT NOT NULL,
        created_at TIMESTAMPTZ DEFAULT NOW(),
        embedding VECTOR(256) -- A smaller embedding for a real-time use case
    );
    
    CREATE INDEX ON user_actions (user_id, created_at DESC);

    Choosing m and ef_construction:

    Both parameters trade build time and index size for graph quality. For a good balance of recall and build performance:

    * m: 16 to 48 is a typical range. Let's start with m = 24.

    * ef_construction: A higher value improves graph quality. Let's start with ef_construction = 100.

    Creating the Index:

    sql
    -- HNSW has no separate k-means training step, but graph construction is
    -- CPU-intensive; schedule large builds during low-traffic periods.
    CREATE INDEX ON user_actions
    USING hnsw (embedding vector_l2_ops)
    WITH (m = 24, ef_construction = 100);

    Querying with ef_search:

    Similar to probes, ef_search is our query-time tuning knob.

    sql
    -- API endpoint for real-time recommendations
    BEGIN;
    -- ef_search must be at least as large as LIMIT; higher values improve recall.
    -- A good starting point is often the ef_construction value or slightly higher.
    SET LOCAL hnsw.ef_search = 100;
    SELECT user_id
    FROM user_actions
    ORDER BY embedding <-> '[...query_vector...]'
    LIMIT 10;
    COMMIT;

    HNSW Edge Cases:

    * Memory Consumption: HNSW indexes are notoriously memory-hungry. The graph structure, with all its links, only performs well when it fits in RAM (shared_buffers plus the OS page cache); once traversal spills to disk, latency degrades sharply. A 1M-vector HNSW index can easily consume several gigabytes. This must be factored into your PostgreSQL shared_buffers configuration and hardware provisioning.

    * Deletes and VACUUM: When you DELETE a row, the corresponding node in the HNSW graph is marked as "dead" but not immediately removed. It remains in the graph structure until a VACUUM operation cleans it up. In high-churn tables, this can lead to bloated indexes and degraded performance as queries might traverse through dead nodes. Frequent and aggressive VACUUMing is non-negotiable for HNSW on tables with frequent deletes.
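
    Two operational checks worth automating are sketched below: confirming how the index size compares to available memory, and making autovacuum more aggressive on high-churn tables (the table and index names follow PostgreSQL's default naming and are illustrative):

    sql
    -- How big is the HNSW index, and how much memory is PostgreSQL configured to use?
    SELECT pg_size_pretty(pg_relation_size('user_actions_embedding_idx'));
    SHOW shared_buffers;

    -- Reclaim dead graph nodes sooner on high-churn tables.
    ALTER TABLE user_actions SET (
        autovacuum_vacuum_scale_factor = 0.02,
        autovacuum_vacuum_cost_delay = 0
    );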


    Head-to-Head Benchmark Analysis

    Theory is useful, but data is king. We'll benchmark these indexes on a synthetic dataset of 1 million 768-dimensional vectors (typical for models like text-embedding-ada-002).

    Test Harness (Python with psycopg2 and numpy):

    python
    import psycopg2
    import numpy as np
    import time
    
    # --- Configuration ---
    DB_CONN = "dbname=vector_test user=postgres password=... host=localhost"
    NUM_VECTORS = 1_000_000
    DIMENSIONS = 768
    
    # --- Setup --- 
    def setup_table(cursor):
        cursor.execute("CREATE EXTENSION IF NOT EXISTS vector;")
        cursor.execute("DROP TABLE IF EXISTS items;")
        cursor.execute(f"CREATE TABLE items (id serial primary key, embedding vector({DIMENSIONS}));")
        print("Table 'items' created.")
    
    def insert_data(conn):
        print(f"Generating and inserting {NUM_VECTORS} vectors...")
        with conn.cursor() as cursor:
            all_embeddings = np.random.rand(NUM_VECTORS, DIMENSIONS).astype(np.float32)
            for i in range(0, NUM_VECTORS, 1000):
                batch = all_embeddings[i:i+1000]
                # pgvector accepts the text form '[x1, x2, ...]', so serialize each
                # row as a list string and cast it to vector in the VALUES clause.
                args_str = ','.join(
                    cursor.mogrify("(%s::vector)", (str(embedding.tolist()),)).decode('utf-8')
                    for embedding in batch
                )
                cursor.execute("INSERT INTO items (embedding) VALUES " + args_str)
        conn.commit()
        print("Data insertion complete.")
    
    # --- Benchmarking Functions ---
    def benchmark_index_build(conn, index_type, params):
        with conn.cursor() as cursor:
            cursor.execute("DROP INDEX IF EXISTS items_embedding_idx;")
            conn.commit()
            
            param_str = ', '.join([f"{k} = {v}" for k, v in params.items()])
            sql = f"CREATE INDEX items_embedding_idx ON items USING {index_type} (embedding vector_l2_ops) WITH ({param_str});"
            
            start_time = time.time()
            print(f"Building {index_type} index with params: {params}...")
            cursor.execute(sql)
            conn.commit()
            end_time = time.time()
            
            cursor.execute("SELECT pg_size_pretty(pg_relation_size('items_embedding_idx'));")
            index_size = cursor.fetchone()[0]
            
            print(f"Build time: {end_time - start_time:.2f}s")
            print(f"Index size: {index_size}")
            return end_time - start_time, index_size
    
    def benchmark_query(conn, index_type, tune_param, tune_value, query_vector):
        # query_vector is expected in pgvector's text form, e.g. str(vec.tolist()).
        with conn.cursor() as cursor:
            # Force the planner to use the ANN index rather than a sequential scan.
            cursor.execute("SET enable_seqscan = off;")
            cursor.execute("SET enable_indexscan = on;")
            # Apply the query-time tuning knob for the index under test.
            if index_type == 'ivfflat':
                cursor.execute(f"SET ivfflat.probes = {tune_value};")
            elif index_type == 'hnsw':
                cursor.execute(f"SET hnsw.ef_search = {tune_value};")

            sql = "SELECT id FROM items ORDER BY embedding <-> %s::vector LIMIT 10;"

            latencies = []
            for _ in range(100):  # Run 100 queries for a stable latency measurement
                start_time = time.time()
                cursor.execute(sql, (query_vector,))
                _ = cursor.fetchall()
                end_time = time.time()
                latencies.append((end_time - start_time) * 1000)  # ms

            # For recall, compare these IDs against an exact sequential-scan result;
            # this function measures latency only (a SQL sketch for the exact
            # ground truth follows below).
            p95_latency = np.percentile(latencies, 95)
            return p95_latency
    
    # --- Main Execution --- 
    # ... (code to run the benchmarks and plot results)
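
    The recall half of the benchmark needs an exact ground truth. One way to obtain it, sketched below, is to force an exact scan for the same query vector and compare the returned IDs against the ANN result:

    sql
    -- Exact top-10 for the same query vector, bypassing the ANN index.
    BEGIN;
    SET LOCAL enable_indexscan = off;
    SELECT id FROM items ORDER BY embedding <-> '[...query_vector...]' LIMIT 10;
    COMMIT;
    -- Recall@10 = (number of exact IDs also returned by the ANN query) / 10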

    Results Summary

    Metric | IVFFlat (lists=1000) | HNSW (m=16, ef_construction=64) | Analysis
    Index Build Time | ~350 seconds | ~250 seconds | HNSW was faster to build for this dataset and configuration; it has no separate k-means training step.
    Index Size | ~2.5 GB | ~4.8 GB | HNSW's graph structure is nearly twice as large. This is a major consideration for memory and cost.
    Insert Throughput | Higher (append to an existing list) | Lower (must update the graph) | Both indexes cover new rows immediately; IVFFlat assigns each new vector to its nearest existing list cheaply, while HNSW pays the cost of wiring the new node into the graph.
    Query Latency | Highly dependent on probes | Highly dependent on ef_search | See graph below.
    Recall | Scales with probes | Scales with ef_search | See graph below.

    Latency vs. Recall@10 Graph (Conceptual)

    (Imagine a 2D plot here with Latency (ms) on the Y-axis and Recall@10 on the X-axis)

    * IVFFlat Curve: Starts at the bottom left (low probes, low recall, low latency). As probes increases, it moves up and to the right, showing diminishing returns. It might plateau around 95-97% recall, struggling to reach the highest recall levels without probing a huge number of lists.

    * HNSW Curve: Generally starts with higher recall for the same low latency. The curve is often steeper, meaning you can achieve very high recall (99%+) by increasing ef_search, albeit at a significant latency cost. HNSW often provides a better trade-off for applications requiring high recall.

    This benchmark reveals the core trade-off: IVFFlat is memory-efficient but can be harder to tune for very high recall and handles dynamic data poorly. HNSW offers superior recall/latency performance and excels with dynamic data, at the cost of significantly higher memory usage.


    Advanced Production Patterns

    Choosing an index is just the first step. Integrating it into a complex application requires more sophisticated patterns.

    Pattern 1: Hybrid Search with Metadata Filtering

    Pure vector search is rarely enough. Users need to search within a specific context: a tenant's documents, products in a certain category, and so on. The naive approach of retrieving the top K vectors and then filtering them in your application is fragile; after filtering you can end up with far fewer than K results, or none at all.

    The correct approach is to apply the metadata filter before the vector search.

    sql
    -- CORRECT: Express the tenant filter in the same query as the vector search.
    -- For selective filters, the planner can use the B-tree index on tenant_id and
    -- compute exact distances over just that tenant's rows; otherwise it applies the
    -- filter to the rows coming out of the ANN index scan.
    
    SELECT id, content
    FROM document_chunks
    WHERE tenant_id = 'a1b2c3d4-...' -- This filter is applied first
    ORDER BY embedding <-> '[...query_vector...]'
    LIMIT 10;

    This is a massive advantage of pgvector over dedicated vector databases: your metadata and vectors live together, enabling the query planner to create highly efficient, composite query plans.
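
    It is worth confirming which plan you are actually getting. A quick sketch (tenant ID and query vector are placeholders):

    sql
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT id, content
    FROM document_chunks
    WHERE tenant_id = 'a1b2c3d4-...'
    ORDER BY embedding <-> '[...query_vector...]'
    LIMIT 10;
    -- Expect either an Index Scan on the tenant_id B-tree (exact, filter-first)
    -- or an Index Scan on the ANN index with the tenant filter applied to its output.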

    Pattern 2: Multi-Tenancy with Table Partitioning

    When a single table grows to hundreds of millions or billions of rows, a single, monolithic ANN index becomes a performance and maintenance nightmare. Index builds take hours or days, and a query for one small tenant still has to traverse an index containing data for all tenants.

    PostgreSQL's declarative partitioning is the solution. We can partition the table by tenant_id and create local indexes on each partition.

    sql
    -- Create a partitioned table
    CREATE TABLE document_chunks_partitioned (
        id BIGSERIAL,
        tenant_id UUID NOT NULL,
        content TEXT,
        embedding VECTOR(768)
    ) PARTITION BY LIST (tenant_id);
    
    -- Create a partition for a specific tenant
    CREATE TABLE tenant_a_chunks PARTITION OF document_chunks_partitioned
    FOR VALUES IN ('a1b2c3d4-...');
    
    -- Create a LOCAL HNSW index only on this tenant's partition
    -- This index is small, fast to build, and isolated.
    CREATE INDEX ON tenant_a_chunks
    USING hnsw (embedding vector_l2_ops);
    
    -- Create another partition and index for another tenant
    CREATE TABLE tenant_b_chunks PARTITION OF document_chunks_partitioned
    FOR VALUES IN ('b5e6f7g8-...');
    
    CREATE INDEX ON tenant_b_chunks
    USING hnsw (embedding vector_l2_ops);

    Benefits:

  • Index Isolation: Each tenant's index is separate. A query for tenant_a will only ever touch the tenant_a_chunks partition and its small, dedicated index.
  • Maintenance: You can REINDEX or perform maintenance on one partition without affecting other tenants.
  • Scalability: Onboarding a new tenant is as simple as creating a new partition and its local index.
  • When you query the parent table document_chunks_partitioned with a WHERE tenant_id = '...' clause, the planner performs partition pruning and routes the query to the correct partition automatically.
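
    A sketch of such a query against the parent table (placeholder tenant ID and vector):

    sql
    -- Partition pruning routes this to tenant_a_chunks and its local HNSW index.
    SELECT id, content
    FROM document_chunks_partitioned
    WHERE tenant_id = 'a1b2c3d4-...'
    ORDER BY embedding <-> '[...query_vector...]'
    LIMIT 10;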


    The Final Decision Framework

    So, which index should you choose? There is no single right answer, only the right answer for your specific workload.

    Consideration | Favor IVFFlat | Favor HNSW
    Data Dynamism | Static or infrequent bulk updates. | High rate of inserts, updates, deletes.
    Memory Constraints | Strict memory budget. | Memory is plentiful.
    Recall Requirement | Moderate recall (e.g., 90-95%) is acceptable. | Highest possible recall (99%+) is critical.
    Build/Re-index Time | Can tolerate long, offline re-indexing periods. | Need new data to be queryable almost instantly.
    Query Latency Profile | Good for low probes, but latency grows fast. | Often better latency for a given high recall target.
    Dataset Size & Distribution | Works well on uniformly distributed data. | More robust to skewed or clustered data.

    Use this decision tree:

  • Is your data written frequently and needs to be searchable immediately?
    * Yes: Your primary candidate is HNSW. Its incremental build nature is designed for this. Start here.
    * No: Proceed to the next question.
  • Is your available RAM for the database server severely constrained?
    * Yes: IVFFlat is likely your only viable option. Its smaller memory footprint is a decisive advantage. Be prepared to implement a REINDEX strategy.
    * No: Proceed to the next question.
  • Is achieving the absolute highest recall (>98-99%) a hard product requirement?
    * Yes: HNSW is generally easier to tune for very high recall targets. The ef_search parameter gives you fine-grained control to push accuracy to its limits, provided you can pay the latency cost.
    * No (90-97% recall is sufficient): IVFFlat is a very strong contender. It can provide excellent performance and recall in this range with significantly lower memory usage.

    By systematically evaluating your application's specific constraints against the fundamental trade-offs of these two powerful indexing algorithms, you can make an informed, defensible architectural decision that will ensure your AI features are not just functional, but performant and scalable in production.
