pgvector Deep Dive: HNSW vs. IVFFlat for Production RAG Systems

Goh Ling Yong

The Senior Engineer's Dilemma: Choosing a pgvector Index

As organizations increasingly integrate Retrieval-Augmented Generation (RAG) into their production systems, the underlying vector database becomes a critical performance bottleneck. While dedicated vector databases offer specialized solutions, the operational simplicity and data consistency of keeping vectors within PostgreSQL via the pgvector extension are compelling. This decision, however, shifts the architectural burden from infrastructure management to deep database optimization.

You've already made the call to use pgvector. You understand vectors, embeddings, and the basics of a RAG pipeline. The question that keeps you up at night is no longer if you should use pgvector, but how to make it scale. The single most impactful decision you will make is your choice of index: hnsw or ivfflat.

This is not a simple choice of "faster" vs. "more accurate." It's a complex trade-off involving index build time, memory footprint, ingestion throughput, query latency, recall, and the nature of your data access patterns—particularly metadata filtering. This post provides a rigorous, production-focused analysis to guide this decision, complete with benchmarks and advanced query patterns that separate a proof-of-concept RAG from a scalable, enterprise-grade system.

Architectural Context: Why This Choice Matters

A typical RAG pipeline looks something like this:

  • Ingestion: Documents are chunked, passed through an embedding model (e.g., the 768-dimension BAAI/bge-base-en-v1.5), and the resulting vectors are stored alongside their source text and metadata in a PostgreSQL table.
  • Retrieval: A user query is embedded using the same model. A similarity search query is executed against the PostgreSQL table to find the top-K most relevant chunks.
  • Augmentation & Generation: The retrieved chunks are formatted as context and prepended to the user's query in a prompt for a Large Language Model (LLM), which then generates a response.

    The retrieval step is our focus. A sequential scan over millions of 768-dimensional vectors is a non-starter. An index is mandatory. pgvector offers two primary Approximate Nearest Neighbor (ANN) index types:

    * IVFFlat (Inverted File with Flat Compression): A clustering-based index. It partitions the vector space into lists (clusters), and at query time, searches only a subset of these clusters (probes) nearest to the query vector.

    * HNSW (Hierarchical Navigable Small World): A graph-based index. It builds a multi-layered graph where long-range connections on upper layers provide fast entry points, and shorter-range connections on lower layers allow for fine-grained searching.

    Let's dissect them from an implementation perspective.

    Section 1: Deep Dive into IVFFlat Indexing

    IVFFlat's core concept is space partitioning. Imagine a 2D map of all your vectors. IVFFlat runs a k-means clustering algorithm to find k (or lists) centroids. Each vector is then assigned to its nearest centroid. The index stores these assignments in an inverted file structure, mapping each centroid to a list of vectors it contains.

    When a query vector arrives, the system first identifies the n (probes) nearest centroids and then performs an exhaustive search only on the vectors within those lists. This drastically reduces the search space.
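
    To make the lists/probes mechanics concrete, here is a minimal, self-contained sketch of the idea in NumPy. It is purely illustrative: pgvector's implementation is in C, and its centroids come from k-means rather than the random stand-ins used here.

    python
    import numpy as np

    # Toy illustration of IVFFlat's lists/probes mechanics (not pgvector's actual code).
    rng = np.random.default_rng(0)
    vectors = rng.random((2000, 32), dtype=np.float32)   # toy dataset
    centroids = rng.random((50, 32), dtype=np.float32)   # stand-in for k-means centroids (lists = 50)

    # "Build": assign every vector to its nearest centroid -> the inverted lists
    nearest_centroid = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    inverted_lists = {c: np.where(nearest_centroid == c)[0] for c in range(len(centroids))}

    def ivf_search(query: np.ndarray, probes: int, k: int = 10) -> np.ndarray:
        """Search only the `probes` lists whose centroids are closest to the query."""
        probed = np.linalg.norm(centroids - query, axis=1).argsort()[:probes]
        candidate_ids = np.concatenate([inverted_lists[c] for c in probed])
        candidate_dists = np.linalg.norm(vectors[candidate_ids] - query, axis=1)
        return candidate_ids[candidate_dists.argsort()[:k]]  # exact search within the probed lists

    query = rng.random(32, dtype=np.float32)
    print(ivf_search(query, probes=1))   # fast, may miss true neighbours
    print(ivf_search(query, probes=10))  # slower, higher recall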

    Key Parameters and Their Impact

    * lists (Build-time): The number of clusters to partition the data into. This is the most critical parameter.

    * Low lists: Fewer, larger clusters. Faster index build, but queries may be slower as each probe searches a larger list.

    * High lists: More, smaller clusters. Slower, more memory-intensive index build, but queries can be faster if the right cluster is probed.

    Guideline: A common starting point is N / 1000 for up to 1M rows and sqrt(N) for larger datasets, where N is the number of rows (see the helper sketch after this list). You must benchmark this for your specific data distribution.

    * probes (Query-time): The number of clusters to search at query time.

    * Low probes (e.g., 1): Fastest query, lowest recall. You're only looking in the single best-guess cluster.

    * High probes: Slower query, higher recall. You expand the search to neighboring clusters, increasing the chance of finding the true nearest neighbors.
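
    The guideline above is easy to turn into a starting point. The helper below is only a seed for benchmarking, not a recommendation; the probes suggestion of roughly sqrt(lists) is a commonly cited starting value.

    python
    import math

    def suggest_ivfflat_params(num_rows: int) -> dict:
        """Rule-of-thumb starting points for IVFFlat; always confirm with benchmarks."""
        # lists: rows / 1000 up to 1M rows, sqrt(rows) beyond that
        if num_rows <= 1_000_000:
            lists = max(10, num_rows // 1000)
        else:
            lists = int(math.sqrt(num_rows))
        # probes: roughly sqrt(lists) is a common starting point
        probes = max(1, int(math.sqrt(lists)))
        return {"lists": lists, "probes": probes}

    print(suggest_ivfflat_params(1_000_000))   # {'lists': 1000, 'probes': 31}
    print(suggest_ivfflat_params(10_000_000))  # {'lists': 3162, 'probes': 56}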

    Production Implementation & Analysis

    Let's set up a test environment. We'll use a table for storing document chunks from a hypothetical knowledge base.

    sql
    -- Ensure the extension is created
    CREATE EXTENSION IF NOT EXISTS vector;
    
    -- Table to store document chunks and their embeddings
    -- Assuming a 768-dimension model like BAAI/bge-base-en-v1.5
    CREATE TABLE doc_chunks (
        id BIGSERIAL PRIMARY KEY,
        doc_id BIGINT NOT NULL,
        chunk_text TEXT NOT NULL,
        created_at TIMESTAMPTZ DEFAULT NOW(),
        embedding VECTOR(768)
    );
    
    -- Optional: Add a B-Tree index for metadata filtering
    CREATE INDEX ON doc_chunks (doc_id);

    Now, let's populate it with 1 million vectors. For reproducibility without a real model, we'll use random data.

    python
    # Python script: populate_db.py
    import numpy as np
    import psycopg2
    from psycopg2.extras import execute_values
    
    def populate_data(conn_str, num_records=1_000_000, batch_size=1000):
        """Populates the database with random vector data."""
        conn = psycopg2.connect(conn_str)
        cur = conn.cursor()
    
        print(f"Generating {num_records} records...")
        for i in range(0, num_records, batch_size):
            # Generate a batch of random data
            embeddings = np.random.rand(batch_size, 768).astype(np.float32)
            doc_ids = np.random.randint(1, 10000, size=batch_size)
            chunk_texts = [f"This is chunk {i+j} for doc {doc_ids[j]}." for j in range(batch_size)]
            
            # Prepare data for insertion. Passing the embedding as a Python list relies on
            # pgvector's array-to-vector assignment cast; alternatively, register the adapter
            # from the pgvector package (pgvector.psycopg2.register_vector) and pass numpy arrays.
            data_to_insert = []
            for j in range(batch_size):
                data_to_insert.append((int(doc_ids[j]), chunk_texts[j], embeddings[j].tolist()))
    
            # Use execute_values for efficient batch insertion
            execute_values(
                cur,
                "INSERT INTO doc_chunks (doc_id, chunk_text, embedding) VALUES %s",
                data_to_insert,
                template='(%s, %s, %s)'
            )
            conn.commit()
            if (i + batch_size) % 10000 == 0:
                print(f"Inserted {i + batch_size}/{num_records} records...")
    
        print("Population complete.")
        cur.close()
        conn.close()
    
    if __name__ == "__main__":
        # Replace with your actual connection string
        DB_CONNECTION = "postgresql://user:password@host:port/database"
        populate_data(DB_CONNECTION)

    With our data in place, let's build the IVFFlat index. A good starting point for lists with 1M records is sqrt(1,000,000) = 1000.

    sql
    -- Building the IVFFlat index
    -- For L2 distance, use vector_l2_ops. For cosine similarity, use vector_cosine_ops.
    SET maintenance_work_mem = '2GB'; -- Allocate more memory for faster index builds
    
    -- Let's time the index creation
    \timing
    CREATE INDEX ON doc_chunks USING ivfflat (embedding vector_l2_ops) WITH (lists = 1000);
    \timing

    On a reasonably powerful machine, this will take several minutes. Now, let's analyze query performance.

    sql
    -- In a real application, the query vector comes from your embedding model.
    -- For ad-hoc testing in psql, reuse a stored embedding and stash it in a psql variable:
    SELECT embedding AS a_query_vector FROM doc_chunks LIMIT 1 \gset

    -- Query with low probes (fast, low recall)
    SET ivfflat.probes = 1;
    EXPLAIN ANALYZE
    SELECT id, chunk_text FROM doc_chunks ORDER BY embedding <-> :'a_query_vector' LIMIT 10;

    -- Query with high probes (slower, high recall)
    SET ivfflat.probes = 10;
    EXPLAIN ANALYZE
    SELECT id, chunk_text FROM doc_chunks ORDER BY embedding <-> :'a_query_vector' LIMIT 10;

    You'll see a query plan like this:

    text
    Limit  (cost=13.31..13.34 rows=10 width=40) (actual time=1.581..1.583 rows=10 loops=1)
      ->  Index Scan using doc_chunks_embedding_idx on doc_chunks  (cost=13.31..283.56 rows=100000 width=40) (actual time=1.579..1.581 rows=10 loops=1)
            Order By: (embedding <-> '[-0.01,...]')
    Planning Time: 0.123 ms
    Execution Time: 1.621 ms

    With probes = 1, the execution time might be ~1-2ms. With probes = 10, it might jump to ~10-15ms. This linear scaling of latency with probes is a key characteristic of IVFFlat.
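
    To reproduce this scaling on your own hardware, a quick sweep over ivfflat.probes from Python is enough. A minimal sketch, assuming the doc_chunks table above and a placeholder connection string:

    python
    import time

    import numpy as np
    import psycopg2

    # Time the same top-10 query at increasing ivfflat.probes values.
    conn = psycopg2.connect("postgresql://user:password@host:port/database")  # placeholder
    cur = conn.cursor()
    query_vec = str(np.random.rand(768).tolist())  # pass the vector in pgvector's '[x, y, ...]' text form

    for probes in (1, 2, 5, 10, 20):
        cur.execute(f"SET ivfflat.probes = {probes};")
        start = time.perf_counter()
        cur.execute("SELECT id FROM doc_chunks ORDER BY embedding <-> %s LIMIT 10;", (query_vec,))
        cur.fetchall()
        print(f"probes={probes:>3}  latency={(time.perf_counter() - start) * 1000:.2f} ms")

    cur.close()
    conn.close()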

    IVFFlat Edge Cases & Production Gotchas

  • Static Nature: The killer drawback. An IVFFlat index is built on the data present at creation time. New data inserted after the index is built is not included in the index until it is rebuilt. This makes it unsuitable for highly dynamic datasets.
    * Production Pattern: For datasets with frequent appends, you must implement a periodic re-indexing strategy. This can be done with a tool like pg_repack for minimal downtime, or by building a new index concurrently and then swapping them:

    sql
        CREATE INDEX CONCURRENTLY doc_chunks_embedding_idx_new ON doc_chunks USING ivfflat (embedding vector_l2_ops) WITH (lists = 1000);
        -- Once complete:
        DROP INDEX CONCURRENTLY doc_chunks_embedding_idx;
        ALTER INDEX doc_chunks_embedding_idx_new RENAME TO doc_chunks_embedding_idx;
  • Parameter Tuning: lists is a build-time parameter. If you get it wrong, you have to rebuild the entire index. This necessitates a thorough benchmarking phase before going to production.

    Section 2: Deep Dive into HNSW Indexing

    HNSW avoids the rigid partitioning of IVFFlat. Instead, it builds a multi-layer graph of interconnected vectors. The top layers have sparse, long-range connections that allow the search to quickly traverse the vector space. As the search descends through the layers, the connections become denser and shorter-range, homing in on the query vector's true neighbors.

    This structure allows for a greedy search algorithm that is incredibly fast and efficient, often achieving better recall-to-latency ratios than IVFFlat.
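
    The following single-layer sketch shows why a navigable graph reaches the query's neighbourhood in a few hops. It is deliberately simplified: real HNSW keeps multiple layers, an ef_search-sized candidate list, and heuristic edge selection, none of which are modelled here.

    python
    import numpy as np

    # Greedy search on one layer: hop to whichever neighbour is closer to the query
    # until no neighbour improves on the current node.
    rng = np.random.default_rng(0)
    vectors = rng.random((500, 32), dtype=np.float32)

    # Toy graph: connect each node to its m nearest neighbours (brute force, demo only)
    m = 8
    all_dists = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=2)
    neighbours = all_dists.argsort(axis=1)[:, 1:m + 1]  # skip self at position 0

    def greedy_search(query: np.ndarray, entry_point: int = 0) -> int:
        current = entry_point
        current_dist = np.linalg.norm(vectors[current] - query)
        while True:
            cand = neighbours[current]
            cand_dists = np.linalg.norm(vectors[cand] - query, axis=1)
            best = cand_dists.argmin()
            if cand_dists[best] >= current_dist:
                return current  # local minimum: no neighbour is closer
            current, current_dist = cand[best], cand_dists[best]

    query = rng.random(32, dtype=np.float32)
    print("greedy:", greedy_search(query), "exact:", np.linalg.norm(vectors - query, axis=1).argmin())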

    Key Parameters and Their Impact

    * m (Build-time): The maximum number of connections per node on each layer.

    * Higher m: Denser graph, better recall, but larger index size, higher memory usage, and longer build time.

    * Typical values: 16-64.

    * ef_construction (Build-time): The size of the dynamic candidate list during index construction. It controls the quality of the graph.

    * Higher ef_construction: More thorough search for neighbors during the build, leading to a higher-quality index (better recall) at the cost of a much slower build time.

    * Typical values: 64-512.

    * ef_search (Query-time): The size of the dynamic candidate list during a search. This is the query-time knob analogous to probes.

    * Higher ef_search: More exhaustive search of the graph, higher recall, higher latency.
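
    Because hnsw.ef_search is an ordinary session setting, you can scope it to a single transaction with SET LOCAL, which is useful when different endpoints want different recall/latency trade-offs. A minimal psycopg2 sketch; the tier values and connection string are illustrative placeholders:

    python
    import numpy as np
    import psycopg2

    # Scope ef_search per transaction with SET LOCAL so pooled connections
    # don't leak one endpoint's setting into another's queries.
    EF_SEARCH_BY_TIER = {"fast": 40, "balanced": 100, "thorough": 200}  # illustrative values

    def search_chunks(conn, query_embedding_text: str, tier: str = "balanced", k: int = 10):
        with conn:  # opens a transaction; commits on success, rolls back on error
            with conn.cursor() as cur:
                cur.execute(f"SET LOCAL hnsw.ef_search = {EF_SEARCH_BY_TIER[tier]};")
                cur.execute(
                    "SELECT id, chunk_text FROM doc_chunks ORDER BY embedding <-> %s LIMIT %s;",
                    (query_embedding_text, k),
                )
                return cur.fetchall()

    conn = psycopg2.connect("postgresql://user:password@host:port/database")  # placeholder
    fake_embedding = str(np.random.rand(768).tolist())  # stand-in for a real model's output
    rows = search_chunks(conn, fake_embedding, tier="thorough")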

    Production Implementation & Analysis

    Let's drop our IVFFlat index and build an HNSW one on the same data.

    sql
    DROP INDEX doc_chunks_embedding_idx;
    
    -- Building the HNSW index
    SET maintenance_work_mem = '2GB';
    \timing
    CREATE INDEX ON doc_chunks USING hnsw (embedding vector_l2_ops) WITH (m = 32, ef_construction = 128);
    \timing

    Notice the build time. It will likely be significantly longer than the IVFFlat build. Now, let's query.

    sql
    -- Like ivfflat.probes, hnsw.ef_search is a session-level setting (a GUC).
    -- Use SET LOCAL inside a transaction to scope it to a single query.
    -- (:'a_query_vector' is the psql variable captured earlier.)

    -- Query with low search effort (fast, lower recall)
    SET hnsw.ef_search = 40; -- the default; keep it comfortably above your LIMIT
    EXPLAIN ANALYZE
    SELECT id, chunk_text FROM doc_chunks ORDER BY embedding <-> :'a_query_vector' LIMIT 10;

    -- Query with high search effort (slower, higher recall)
    SET hnsw.ef_search = 200;
    EXPLAIN ANALYZE
    SELECT id, chunk_text FROM doc_chunks ORDER BY embedding <-> :'a_query_vector' LIMIT 10;

    With ef_search = 40, you might see latency under 1ms. With ef_search = 200, it might increase to 3-5ms. The key observation is that for a similar high level of recall, HNSW is often faster than IVFFlat.

    HNSW Edge Cases & Production Gotchas

  • Memory Usage: HNSW's primary drawback. The graph structure is complex and must reside in memory for optimal performance. The index size on disk is a good proxy for RAM requirements. It can be 2-3x larger than an IVFFlat index for the same data.
  • Dynamic Data Handling: This is HNSW's killer feature. It's an incremental index. New data can be added, and it will be incorporated into the graph without a full rebuild. This makes it far superior for dynamic, real-time applications.
  • Deletion Overhead: While INSERTs are efficient, DELETEs are not. Deleting a vector in an HNSW graph simply marks it as deleted; it doesn't remove the node or restructure the graph. Over time, with many deletes, this can lead to performance degradation and index bloat, eventually requiring a REINDEX (a sketch follows below).
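
    When delete-driven bloat does accumulate, PostgreSQL 12+ can rebuild the index without blocking writes via REINDEX ... CONCURRENTLY. It cannot run inside a transaction block, so from psycopg2 you need autocommit. A sketch, assuming the index is named doc_chunks_embedding_idx as above:

    python
    import psycopg2

    # Rebuild a bloated HNSW index without blocking writes (PostgreSQL 12+).
    conn = psycopg2.connect("postgresql://user:password@host:port/database")  # placeholder
    conn.autocommit = True  # REINDEX ... CONCURRENTLY refuses to run inside a transaction
    with conn.cursor() as cur:
        cur.execute("REINDEX INDEX CONCURRENTLY doc_chunks_embedding_idx;")
    conn.close()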

    Section 3: Head-to-Head Benchmark

    Let's formalize the comparison with a proper benchmark. We need a ground truth to measure recall.

    Scenario: 1M 768-dim vectors. We'll find the true top-10 neighbors for 100 random query vectors using an exact sequential scan. Then, we'll run the same queries against our IVFFlat and HNSW indexes with varying parameters and measure latency and recall.

    Recall@10: For a given query, if the indexed search returns k of the true top-10 results, the recall is k / 10. We average this over all 100 test queries.
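
    The metric itself is a one-liner, averaged over the query set:

    python
    def recall_at_10(ground_truth_ids: set, retrieved_ids: set) -> float:
        """Fraction of the true top-10 that the indexed search actually returned."""
        return len(ground_truth_ids & retrieved_ids) / 10

    # Example: 8 of the 10 true neighbours were retrieved -> 0.8
    print(recall_at_10({1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, {1, 2, 3, 4, 5, 6, 7, 8, 42, 99}))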

    Here's a simplified Python script concept for benchmarking:

    python
    # Python script: benchmark.py
    # (Assumes a populated DB and a set of 100 test query vectors)
    import time

    import numpy as np
    import psycopg2


    def to_pgvector(query_vector: np.ndarray) -> str:
        """Render a numpy vector in pgvector's '[x, y, ...]' text format."""
        return str(query_vector.tolist())


    def get_ground_truth(cur, query_vector):
        """Exact top-10: temporarily disable index scans to force a sequential scan."""
        cur.execute("SET enable_indexscan = off;")
        cur.execute(
            "SELECT id FROM doc_chunks ORDER BY embedding <-> %s LIMIT 10;",
            (to_pgvector(query_vector),),
        )
        truth = {row[0] for row in cur.fetchall()}
        cur.execute("RESET enable_indexscan;")
        return truth


    def benchmark_ivfflat(cur, query_vector, probes):
        cur.execute(f"SET ivfflat.probes = {probes};")
        start_time = time.perf_counter()
        cur.execute(
            "SELECT id FROM doc_chunks ORDER BY embedding <-> %s LIMIT 10;",
            (to_pgvector(query_vector),),
        )
        end_time = time.perf_counter()
        results = {row[0] for row in cur.fetchall()}
        return end_time - start_time, results


    def benchmark_hnsw(cur, query_vector, ef_search):
        cur.execute(f"SET hnsw.ef_search = {ef_search};")
        start_time = time.perf_counter()
        cur.execute(
            "SELECT id FROM doc_chunks ORDER BY embedding <-> %s LIMIT 10;",
            (to_pgvector(query_vector),),
        )
        end_time = time.perf_counter()
        results = {row[0] for row in cur.fetchall()}
        return end_time - start_time, results

    # ... main loop to iterate through query vectors and parameters ...
    # ... calculate recall: len(ground_truth.intersection(results)) / 10 ...
    # ... aggregate and plot latency vs. recall ...
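
    A possible shape for the elided driver loop is sketched below. It continues benchmark.py, reusing get_ground_truth and benchmark_hnsw from above, and simply reuses 100 stored embeddings as stand-in test queries; the connection string is a placeholder.

    python
    import json

    import numpy as np
    import psycopg2

    # Driver sketch: measure average latency and recall@10 per ef_search setting.
    conn = psycopg2.connect("postgresql://user:password@host:port/database")  # placeholder
    cur = conn.cursor()

    cur.execute("SELECT embedding FROM doc_chunks ORDER BY id LIMIT 100;")
    test_queries = [np.array(json.loads(row[0])) for row in cur.fetchall()]
    ground_truths = [get_ground_truth(cur, q) for q in test_queries]  # exact top-10 per query

    for ef_search in (40, 80, 200):
        latencies, recalls = [], []
        for q, truth in zip(test_queries, ground_truths):
            elapsed, results = benchmark_hnsw(cur, q, ef_search)
            latencies.append(elapsed)
            recalls.append(len(truth & results) / 10)
        print(
            f"ef_search={ef_search}: "
            f"avg latency={1000 * sum(latencies) / len(latencies):.2f} ms, "
            f"avg recall@10={sum(recalls) / len(recalls):.3f}"
        )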

    Expected Results (Conceptual Graph):

    You would typically plot Latency (ms) on the Y-axis and Recall@10 on the X-axis. The resulting curves would show:

    * HNSW: A curve that starts at a very low latency and high recall, and then flattens out. You get diminishing returns in recall for large increases in latency (by increasing ef_search). It occupies the desirable "low latency, high recall" region of the graph.

    * IVFFlat: A curve that is shifted to the right (higher latency for a given recall). To achieve high recall, you need to increase probes, which linearly increases latency, making the curve steeper.

    Benchmark Summary Table:

    Metric                   | IVFFlat (lists=1000)        | HNSW (m=32, ef_construction=128)
    -------------------------|-----------------------------|---------------------------------
    Index Build Time         | ~5-10 minutes               | ~30-60 minutes
    Index Size on Disk       | ~3.5 GB                     | ~7 GB
    Ingestion (new rows)     | Not indexed until rebuild   | Incremental, handled automatically
    Latency for 95% Recall   | ~15-20 ms (probes ≈ 12)     | ~2-4 ms (ef_search ≈ 80)

    These are illustrative numbers; you must run this on your own hardware and data. The conclusion is clear: if you can afford the RAM and the longer build time, HNSW offers a superior latency/recall profile for dynamic data.

    Section 4: Advanced Production Patterns

    Choosing an index is only the beginning. Real-world RAG systems require more sophisticated querying.

    Pattern 1: Metadata Filtering

    This is pgvector's superpower. What if you need the most relevant chunks, but only for doc_id = 123?

    sql
    EXPLAIN ANALYZE
    SELECT id, chunk_text
    FROM doc_chunks
    WHERE doc_id = 123
    ORDER BY embedding <-> :'a_query_vector'
    LIMIT 10;

    How does PostgreSQL execute this? It depends. The query planner has to choose between two paths:

  • Use the B-Tree index on doc_id first: Find all rows for doc_id = 123, then perform an exact, in-memory sort on their vectors. This is efficient if the number of rows for doc_id = 123 is small (e.g., < 1000).
  • Use the ANN index on embedding first: Find the top-K nearest neighbors from the entire table, and then filter out the ones that don't match doc_id = 123. This is called "post-filtering." It's inefficient because you might retrieve 100 results from the ANN index only to find none of them match your WHERE clause.

    PostgreSQL's planner is generally smart, but it can struggle with ANN indexes, so you must verify the plan. For highly selective filters that match few rows (like a specific doc_id), the B-Tree scan followed by an exact sort is almost always better. For broad filters that match a large share of the table (e.g., created_at > '1 year ago'), the planner will typically use the ANN index first and post-filter. pgvector's HNSW implementation is improving its support for filtered search, but it is still an area of active development.
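
    Because the right plan depends on how selective the filter is, it is worth checking (or even asserting in an integration test) rather than assuming. A hedged sketch using EXPLAIN (FORMAT JSON); the doc_id value and connection string are placeholders:

    python
    import numpy as np
    import psycopg2

    # Inspect which access path the planner picks for a filtered vector query.
    conn = psycopg2.connect("postgresql://user:password@host:port/database")  # placeholder
    cur = conn.cursor()
    query_vec = str(np.random.rand(768).tolist())  # stand-in for a real query embedding

    cur.execute(
        """
        EXPLAIN (FORMAT JSON)
        SELECT id, chunk_text FROM doc_chunks
        WHERE doc_id = %s
        ORDER BY embedding <-> %s
        LIMIT 10;
        """,
        (123, query_vec),
    )
    plan = cur.fetchone()[0][0]["Plan"]  # psycopg2 parses the json column for us

    def node_types(node):
        yield node["Node Type"]
        for child in node.get("Plans", []):
            yield from node_types(child)

    print(list(node_types(plan)))
    # A plan rooted in the B-Tree on doc_id (plus a Sort) means exact search over the
    # filtered rows; a plan using the ANN index means post-filtering.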

    Pattern 2: Hybrid Search with Reciprocal Rank Fusion (RRF)

    Vector search is great for semantic relevance, but it can miss keyword matches. Full-text search (FTS) is great for keywords but misses semantic meaning. The state-of-the-art is to combine them.

    First, we need an FTS index:

    sql
    -- Add a tsvector column
    ALTER TABLE doc_chunks ADD COLUMN chunk_tsv tsvector;
    
    -- Create a trigger to auto-update it
    CREATE FUNCTION doc_chunks_tsv_trigger() RETURNS trigger AS $$
    begin
      new.chunk_tsv := to_tsvector('english', new.chunk_text);
      return new;
    end
    $$ LANGUAGE plpgsql;
    
    CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
    ON doc_chunks FOR EACH ROW EXECUTE PROCEDURE doc_chunks_tsv_trigger();
    
    -- Populate the column for existing data
    UPDATE doc_chunks SET chunk_tsv = to_tsvector('english', chunk_text);
    
    -- Create a GIN index for FTS
    CREATE INDEX doc_chunks_tsv_idx ON doc_chunks USING gin(chunk_tsv);

    Now, we can perform a hybrid search using Reciprocal Rank Fusion (RRF) to combine the results. RRF is a simple but powerful technique that scores documents based on the inverse of their rank in each result set, mitigating the need to normalize disparate scores (like BM25 from FTS and L2 distance from vector search).

    sql
    WITH vector_results AS (
        SELECT id, row_number() OVER (ORDER BY embedding <-> :'a_query_vector') AS rank
        FROM doc_chunks
        -- Optional pre-filtering can be added here
        ORDER BY embedding <-> :'a_query_vector'
        LIMIT 100
    ),
    fts_results AS (
        SELECT id, row_number() OVER (ORDER BY ts_rank_cd(chunk_tsv, websearch_to_tsquery('english', 'your search query')) DESC) AS rank
        FROM doc_chunks
        WHERE chunk_tsv @@ websearch_to_tsquery('english', 'your search query')
        ORDER BY rank
        LIMIT 100
    )
    SELECT
        COALESCE(v.id, f.id) AS id,
        -- The RRF formula: 1 / (k + rank), with k = 60 to dampen the influence of top ranks.
        -- A document missing from one result set contributes 0 from that set.
        COALESCE(1.0 / (60 + v.rank), 0.0) AS vector_rrf_score,
        COALESCE(1.0 / (60 + f.rank), 0.0) AS fts_rrf_score,
        COALESCE(1.0 / (60 + v.rank), 0.0) + COALESCE(1.0 / (60 + f.rank), 0.0) AS total_rrf_score
    -- ids are unique within each CTE, so the FULL OUTER JOIN needs no aggregation
    FROM vector_results v
    FULL OUTER JOIN fts_results f ON v.id = f.id
    ORDER BY total_rrf_score DESC
    LIMIT 10;

    This query is complex but represents a true production-grade retrieval strategy. It executes two separate indexed searches in parallel and then fuses their results in memory to provide a final ranking that benefits from both semantic and keyword relevance.
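
    The same fusion can also be done in application code: run the vector query and the FTS query separately (each LIMIT 100) and combine the two ranked ID lists with RRF in Python. A minimal, self-contained sketch of the fusion step, using the same k = 60 as the SQL above:

    python
    from collections import defaultdict

    def rrf_fuse(ranked_lists: list[list[int]], k: int = 60, top_n: int = 10) -> list[int]:
        """Reciprocal Rank Fusion: score = sum over result sets of 1 / (k + rank), rank starting at 1."""
        scores = defaultdict(float)
        for ranked in ranked_lists:
            for rank, doc_id in enumerate(ranked, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        return [doc_id for doc_id, _ in ordered[:top_n]]

    # Example: ids ranked by vector search vs. full-text search
    vector_ids = [42, 7, 99, 3]
    fts_ids = [7, 13, 42]
    print(rrf_fuse([vector_ids, fts_ids]))  # ids present in both lists (7, 42) float to the top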

    Conclusion: The Decision Framework

    Your choice between IVFFlat and HNSW is a critical architectural decision with long-term performance implications. Here is the definitive framework:

    Choose IVFFlat when:

  • Your dataset is static or updated infrequently (e.g., nightly batch jobs).
  • RAM is a significant constraint. IVFFlat has a much lower memory footprint.
  • Index build time is critical. IVFFlat is significantly faster to build.
  • You have a clear, predictable query pattern and can afford a periodic re-indexing maintenance window.

    Choose HNSW when:

  • Your dataset is dynamic, with real-time INSERTs. This is the most common scenario for modern applications.
  • Lowest possible query latency for high recall is the primary goal. HNSW consistently wins on the latency-recall curve.
  • You have sufficient RAM to hold the larger index in memory.
  • You are dealing with a moderate number of DELETEs and can schedule an occasional REINDEX to combat bloat.

    For the vast majority of RAG systems being built today, which involve user-generated content, continuous document ingestion, or other dynamic data sources, HNSW is the superior choice and should be your default. Its operational simplicity for handling writes and its elite query performance justify the higher resource costs. IVFFlat remains a powerful tool for large-scale, static, or cost-sensitive batch processing workloads, but it requires a more deliberate and managed operational approach.
