PostgreSQL pgvector: HNSW Indexing for Production Vector Search
The Performance Cliff of Naive Vector Search
For any senior engineer tasked with implementing semantic search, Retrieval-Augmented Generation (RAG), or recommendation systems, the initial foray into vector databases often starts with a simple, elegant query. Using pgvector in PostgreSQL, that query looks like this:
SELECT id, content
FROM documents
ORDER BY embedding <-> '[...query_vector...]'
LIMIT 10;
This uses the L2 distance operator (<->) to find the 10 documents whose embeddings are closest to a given query vector. It's clean, intuitive, and works flawlessly on a development dataset of a few thousand rows. The problem, as many discover when moving towards production, is that this query performs an exhaustive sequential scan. It computes the distance between the query vector and every single row in the table before sorting them to find the top K. The time complexity is O(N), where N is the number of rows. On a table with millions of vectors, this query's latency moves from milliseconds to tens of seconds or even minutes, rendering it useless for any real-time application.
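You can see this directly with EXPLAIN: without an ANN index, the planner has no option but a full scan followed by a top-N sort. A quick check (plan output abbreviated and illustrative):
EXPLAIN SELECT id, content
FROM documents
ORDER BY embedding <-> '[...query_vector...]'
LIMIT 10;
-- Expect a plan shaped like:
--   Limit
--     -> Sort
--          -> Seq Scan on documents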
This is the performance cliff that necessitates an Approximate Nearest Neighbor (ANN) index. ANN algorithms trade a small, often negligible, amount of accuracy (recall) for a massive gain in search speed, typically achieving sub-linear, often logarithmic, complexity. pgvector offers two primary ANN index types: ivfflat and hnsw. While ivfflat is powerful, hnsw (Hierarchical Navigable Small World) has become the de facto choice for applications demanding low latency and high recall. However, unlocking its true potential requires moving beyond the default settings and mastering its complex internal mechanics and tuning parameters.
This article is a deep dive into the HNSW index in pgvector. We will dissect its architecture, provide a rigorous framework for tuning its parameters, and explore production-level considerations for deployment, maintenance, and performance optimization.
A Conceptual Deep Dive into HNSW
To effectively tune HNSW, we must first understand how it works. HNSW organizes data points (vectors) into a multi-layered graph, resembling a series of nested proximity networks.
Imagine a road network. A naive search would be like visiting every house in a city to find the closest one to you. HNSW builds an expressway system over this local road network.
Two properties of this graph matter most:
- The layers: The dense bottom layer (Layer 0) contains every vector, while each layer above holds a progressively sparser subset of nodes, acting as the expressways.
- Connectivity (m): Within each layer, each node is connected to a fixed number of its nearest neighbors. This number of connections is a critical parameter, m, which defines the density of the graph.
A search begins at an entry point in the sparse top layer and proceeds as follows:
1. The algorithm greedily traverses the graph in the current layer, always moving to the connected node closest to the query vector.
2. Once it finds a local minimum (a node where all its neighbors are further away from the query vector than it is), it drops down to the next layer below.
3. The process repeats: greedy traversal on the new layer, finding a local minimum, and dropping down again.
4. This continues until the search is completed on the dense bottom layer (Layer 0).
The brilliance of this approach is that the upper layers allow the search to quickly traverse vast distances across the vector space, while the lower, denser layers enable fine-grained, accurate searching once the algorithm is in the correct neighborhood. This hierarchical navigation is what gives HNSW its speed.
Practical Implementation and Core Parameters
Let's move from theory to practice. Assume we have a table for storing document embeddings generated by a model like OpenAI's text-embedding-3-small (1536 dimensions).
-- Ensure the extension is enabled
CREATE EXTENSION IF NOT EXISTS vector;
-- Create the table to store documents and their embeddings
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
content TEXT NOT NULL,
embedding VECTOR(1536) -- Match the dimensionality of your model
);
-- (Assume this table is populated with millions of rows)
To build an HNSW index, the syntax is straightforward:
CREATE INDEX ON documents
USING hnsw (embedding vector_l2_ops)
WITH (m = 32, ef_construction = 128);
This command initiates the index build, a process that can be resource-intensive and time-consuming. The crucial part is the WITH clause, which contains the build-time tuning parameters.
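One caveat worth stating explicitly: the operator class must match the distance operator your queries use. pgvector also ships vector_cosine_ops (for the <=> cosine distance operator) and vector_ip_ops (for the <#> inner product operator); an L2 index will not accelerate a cosine query. A cosine variant of the same index would look like this:
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 128);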
Mastering the Tuning Knobs: `m`, `ef_construction`, and `ef_search`
The performance of your HNSW index is almost entirely dictated by three parameters. Understanding their interplay is non-negotiable for production success.
1. `m`: The Connectivity of the Graph
m defines the maximum number of bidirectional connections (edges) each node can have within a single layer of the graph. It directly controls the density of the graph.
- Low m (e.g., 8-12): Creates a sparse graph. This results in a smaller index size in memory and faster index build times. However, the search path might not be optimal, potentially leading to lower recall (the search gets stuck in a local minimum and misses the true nearest neighbors).
- High m (e.g., 32-64): Creates a dense graph with many redundant paths. This significantly increases the probability of finding the optimal route to the true nearest neighbors, improving recall. The costs are a larger memory footprint, longer build times, and slightly higher query latency as more paths are explored.
A sensible starting range for m is between 16 and 32 (pgvector's default is 16). For applications where recall is paramount and you can afford the memory, values up to 64 are reasonable. Values below 8 are rarely effective, and values above 96 often provide diminishing returns for a significant cost.
2. `ef_construction`: The Quality of the Index Build
ef_construction (effective construction) controls the quality of the graph being built. When adding a new node to the graph, the algorithm searches for its m nearest neighbors to connect to. ef_construction defines the size of the dynamic candidate list used during this search. A larger list means a more thorough search for neighbors, resulting in a higher-quality, more accurate graph.
- Low ef_construction (e.g., 64): Leads to a much faster index build. The resulting graph may be suboptimal, which can negatively impact query-time recall. The search might find "good enough" neighbors during the build, but not the absolute best ones.
- High ef_construction (e.g., 256-512): Significantly increases index build time. However, it produces a superior graph structure where nodes are connected to their true nearest neighbors. This investment at build time pays dividends at query time, enabling higher recall for a given query speed.
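Long build times can also be mitigated independently of these parameters. As of pgvector 0.6.0, HNSW builds can use parallel workers, and builds are dramatically faster when the graph fits in maintenance_work_mem. A session-level sketch; the values are illustrative, so size them to your hardware:
SET maintenance_work_mem = '8GB';          -- let the graph build in memory
SET max_parallel_maintenance_workers = 7;  -- plus the leader process
CREATE INDEX ON documents
USING hnsw (embedding vector_l2_ops)
WITH (m = 32, ef_construction = 128);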
Skimping on ef_construction is often a false economy. A robust starting point is to set ef_construction to at least 4 × m. For high-recall applications, values of 8 × m or even higher are common. For a typical m = 32, an ef_construction of 128 to 256 is a solid choice.
3. `ef_search`: The Speed vs. Accuracy Knob at Query Time
ef_search is the query-time equivalent of ef_construction. It determines the size of the dynamic candidate list used during the search for a query vector. It is arguably the most important parameter for balancing latency and recall.
- Low ef_search (e.g., 40): The search is very fast because the candidate list at each step is small. The query latency will be very low. However, the risk of terminating the search prematurely in a suboptimal part of the graph is high, leading to lower recall.
- High ef_search (e.g., 200): The search is more exhaustive, exploring a larger set of candidates at each step. This dramatically increases the probability of finding the true nearest neighbors (higher recall) but comes at the cost of higher query latency.
Unlike m and ef_construction, ef_search is not set at index creation. It's a runtime parameter configured per-session or per-transaction, giving you dynamic control.
-- Set ef_search for the current transaction
BEGIN;
SET LOCAL hnsw.ef_search = 100;
SELECT id FROM documents ORDER BY embedding <-> '[...]' LIMIT 10;
COMMIT;
This dynamic nature is incredibly powerful. You can have a high-quality base index (high m and ef_construction) and then adjust ef_search based on application needs. For an interactive user-facing search, you might use a lower ef_search (e.g., 50) for P99 latency under 50ms. For an offline batch processing job where accuracy is paramount, you can crank it up to 200 or more.
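Because hnsw.ef_search is an ordinary runtime setting, it can also be applied for the whole session rather than per transaction:
SET hnsw.ef_search = 100;  -- applies until the session ends or the value is reset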
A good starting point for ef_search is a small multiple of the number of results you want (k); note that an HNSW index scan returns at most ef_search rows, so it must be at least as large as your LIMIT. If you need LIMIT 10, start testing with ef_search around 40-50 and increase it until your recall target is met within your latency budget.
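Whichever value you settle on, verify that the planner is actually using the index; a silent fall-back to a sequential scan returns exact results at drastically higher latency:
EXPLAIN ANALYZE
SELECT id FROM documents ORDER BY embedding <-> '[...]' LIMIT 10;
-- Expect an "Index Scan using ... on documents" node rather than a Seq Scan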
A Production Benchmarking Strategy
Choosing these parameters shouldn't be guesswork. It requires a methodical, data-driven approach. Here is a step-by-step guide and a Python script for benchmarking HNSW performance on your own dataset.
The Goal: To generate a table of results showing how different combinations of m, ef_construction, and ef_search affect Index Build Time, Query Latency, and Recall.
Step 1: Establish Ground Truth
To measure recall, you first need to know the actual nearest neighbors. For a representative sample of query vectors (e.g., 100-1000 queries), run an exact search without an index and store the results.
-- This will be slow! Run it once and save the results.
SET enable_seqscan = on;
SET enable_indexscan = off;
SELECT id, (embedding <-> '[...query_vector_1...]') as distance
FROM documents
ORDER BY distance
LIMIT 100;
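Since this exact scan is expensive, a convenient pattern is to persist the results so you only pay the cost once. A sketch, assuming a hypothetical ground_truth table keyed by query number:
CREATE TABLE ground_truth AS
SELECT 1 AS query_id,  -- repeat for each sample query vector
       id,
       embedding <-> '[...query_vector_1...]' AS distance
FROM documents
ORDER BY distance
LIMIT 100;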
Step 2: The Benchmarking Script
The following Python script uses psycopg2, numpy, tqdm, and the pgvector Python package (to pass vectors as query parameters) to automate the process. It iterates through a grid of parameters, rebuilds the index for each combination, and measures performance.
import psycopg2
import numpy as np
import time
from tqdm import tqdm
from pgvector.psycopg2 import register_vector  # pip install pgvector
# --- Configuration ---
DB_PARAMS = {
'dbname': 'vector_db',
'user': 'postgres',
'password': 'password',
'host': 'localhost',
'port': '5432'
}
TABLE_NAME = 'documents'
VECTOR_DIMENSION = 1536
NUM_VECTORS = 100_000
NUM_QUERIES = 100
K = 10 # Number of nearest neighbors to retrieve
# --- Parameter Grids ---
M_VALUES = [16, 32, 48]
EF_CONSTRUCTION_VALUES = [64, 128, 256]
EF_SEARCH_VALUES = [20, 40, 60, 80, 100, 150]
# --- Helper Functions ---
def get_connection():
    conn = psycopg2.connect(**DB_PARAMS)
    register_vector(conn)  # adapt numpy arrays to the vector type on this connection
    return conn
def setup_database(conn):
with conn.cursor() as cur:
print("Setting up database...")
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute(f"DROP TABLE IF EXISTS {TABLE_NAME};")
cur.execute(f"""
CREATE TABLE {TABLE_NAME} (
id BIGSERIAL PRIMARY KEY,
embedding VECTOR({VECTOR_DIMENSION})
);
""")
print(f"Populating {TABLE_NAME} with {NUM_VECTORS} random vectors...")
random_vectors = np.random.rand(NUM_VECTORS, VECTOR_DIMENSION).astype(np.float32)
for vec in tqdm(random_vectors):
cur.execute(f"INSERT INTO {TABLE_NAME} (embedding) VALUES (%s);", (vec.tolist(),))
conn.commit()
print("Database setup complete.")
def get_ground_truth(conn, query_vectors):
print("Calculating ground truth...")
ground_truth = []
    with conn.cursor() as cur:
        # Force an exact scan so results are true nearest neighbors; set once, not per query
        cur.execute("SET enable_seqscan = on;")
        cur.execute("SET enable_indexscan = off;")
        for query_vec in tqdm(query_vectors):
            cur.execute(
                f"SELECT id FROM {TABLE_NAME} ORDER BY embedding <-> %s LIMIT %s;",
                (query_vec, K)
            )
ground_truth.append(set([row[0] for row in cur.fetchall()]))
return ground_truth
def calculate_recall(retrieved, truth):
return len(set(retrieved).intersection(truth)) / K
# --- Main Benchmarking Logic ---
def main():
conn = get_connection()
# setup_database(conn) # Run this once to populate data
query_vectors = np.random.rand(NUM_QUERIES, VECTOR_DIMENSION).astype(np.float32)
ground_truth_sets = get_ground_truth(conn, query_vectors)
results = []
for m in M_VALUES:
for ef_c in EF_CONSTRUCTION_VALUES:
with conn.cursor() as cur:
print(f"\n--- Building Index: m={m}, ef_construction={ef_c} ---")
cur.execute(f"DROP INDEX IF EXISTS hnsw_idx;")
conn.commit()
start_time = time.time()
cur.execute(f"""
CREATE INDEX hnsw_idx ON {TABLE_NAME}
USING hnsw (embedding vector_l2_ops)
WITH (m = {m}, ef_construction = {ef_c});
""")
conn.commit()
build_time = time.time() - start_time
print(f"Index built in {build_time:.2f} seconds.")
for ef_s in EF_SEARCH_VALUES:
latencies = []
recalls = []
                with conn.cursor() as cur:
                    cur.execute("SET enable_seqscan = off;")
                    cur.execute("SET enable_indexscan = on;")
                    # Plain SET persists for the session; SET LOCAL would be
                    # discarded at the next commit
                    cur.execute(f"SET hnsw.ef_search = {ef_s};")
for i, query_vec in enumerate(query_vectors):
start_query_time = time.time()
                        cur.execute(
                            f"SELECT id FROM {TABLE_NAME} ORDER BY embedding <-> %s LIMIT %s;",
                            (query_vec, K)
                        )
query_latency = time.time() - start_query_time
latencies.append(query_latency)
retrieved_ids = [row[0] for row in cur.fetchall()]
recall = calculate_recall(retrieved_ids, ground_truth_sets[i])
recalls.append(recall)
avg_latency_ms = np.mean(latencies) * 1000
avg_recall = np.mean(recalls)
result_row = {
'm': m,
'ef_construction': ef_c,
'ef_search': ef_s,
'build_time_s': round(build_time, 2),
'avg_latency_ms': round(avg_latency_ms, 2),
'avg_recall': round(avg_recall, 4)
}
results.append(result_row)
print(f"m={m}, ef_c={ef_c}, ef_s={ef_s} -> Latency: {avg_latency_ms:.2f}ms, Recall: {avg_recall:.4f}")
conn.close()
print("\n--- Final Results ---")
print("m\t| ef_c\t| ef_s\t| Build(s)\t| Latency(ms)\t| Recall")
print("-"*70)
for res in results:
print(f"{res['m']}\t| {res['ef_construction']}\t| {res['ef_search']}\t| {res['build_time_s']}\t\t| {res['avg_latency_ms']}\t\t| {res['avg_recall']}")
if __name__ == "__main__":
main()
Step 3: Analyze the Results
Running this script will produce a table that clearly illustrates the trade-offs. You will observe patterns like:
- For a fixed m and ef_construction, increasing ef_search will increase both latency and recall.
- For a fixed ef_search, a better-quality index (higher m and ef_construction) will yield higher recall for roughly the same latency.
- Build time grows steeply with higher m and ef_construction.
Your goal is to find the "knee" in the curve: the point where you achieve your target recall (e.g., 99%) within your latency budget (e.g., P99 < 50ms) without over-provisioning your index build process.
Advanced Considerations and Edge Cases
A production system is more than just a static index. Here are critical factors to consider.
Data Ingestion and Updates
- Inserts: When you INSERT a new vector, it is added to the graph incrementally, without requiring a full rebuild. The new node traverses the graph from the entry point to find its neighbors, a process similar to a search. This makes HNSW excellent for datasets with frequent additions.
- Deletes and updates: DELETE and UPDATE operations in PostgreSQL don't immediately remove data; they mark rows as "dead" to be cleaned up later by VACUUM. For HNSW, this means dead vectors remain in the graph structure until a VACUUM runs, which can degrade performance over time as the search algorithm wastes time visiting dead nodes. Frequent updates and deletes necessitate an aggressive autovacuum strategy for the indexed table. In extreme churn scenarios, periodic re-indexing (REINDEX) might be required to compact the graph and restore optimal performance, as sketched below.
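A minimal sketch of both measures, assuming the table and index names used earlier; the thresholds are illustrative, not recommendations:
-- Vacuum once ~2% of rows (or at least 1,000) are dead, instead of the 20% default
ALTER TABLE documents SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_vacuum_threshold = 1000
);
-- Rebuild the graph without blocking reads or writes (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY hnsw_idx;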
Memory Management
HNSW is a memory-hungry index. For optimal performance, the entire index should fit in PostgreSQL's shared buffers (RAM). If the index is larger than available RAM, PostgreSQL will have to read pages from disk during the graph traversal, causing a catastrophic drop in performance.
You can estimate the index size with the following formula:
Index Size ≈ num_rows × ((dimension × 4) + (m × 2 × 4 × 2))
- num_rows: Total number of vectors.
- dimension × 4: Bytes for storing the vector itself (assuming 4-byte float components).
- m × 2 × 4 × 2: Bytes for storing connections: m connections per layer, roughly 2 layers per node on average, 4 bytes per neighbor ID, times 2 for bidirectional links.
For 1 million 1536-dimensional vectors with m = 32, the estimated size would be:
1,000,000 × ((1536 × 4) + (32 × 2 × 4 × 2)) = 1,000,000 × (6144 + 512) ≈ 6.7 GB
Monitor pg_buffercache to ensure your index hit rate is high.
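A quick way to run both checks, assuming the index name hnsw_idx from the benchmarking script and the default 8 kB block size; compare the results against your shared_buffers setting:
-- Actual on-disk size of the index
SELECT pg_size_pretty(pg_relation_size('hnsw_idx'));
-- How much of the index is currently resident in shared_buffers?
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
SELECT pg_size_pretty(count(*) * 8192) AS cached
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = c.relfilenode
WHERE c.relname = 'hnsw_idx'
  AND b.reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database());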
HNSW vs. IVFFlat: A Strategic Choice
pgvector also offers the ivfflat index, which uses an inverted file system. It partitions the vector space into lists (clusters) and only searches within the most promising clusters identified by probes.
| Feature | HNSW | IVFFlat |
|---|---|---|
| Build Time | Slower, especially with high ef_construction. | Faster, primarily involves k-means clustering. |
| Query Latency | Generally lower and more consistent. | Can be higher due to the initial cluster probe step. |
| Recall | Typically higher for a given speed. | Good, but can struggle with vectors near cluster borders. |
| Memory Usage | Higher. Stores the full graph structure. | Lower. Stores cluster centroids and inverted lists. |
| Incremental Adds | Excellent. New vectors are added efficiently. | Poor. Index must be rebuilt periodically. |
| Best For | Low-latency, high-recall, dynamic data. | Extremely large, static datasets where memory is a key constraint. |
Rule of thumb: Start with HNSW. Only consider IVFFlat if you have a dataset in the hundreds of millions or billions of vectors where the HNSW memory footprint becomes prohibitive and your data is relatively static.
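For comparison, here is what the ivfflat equivalent looks like. The lists build parameter partitions the space, and the ivfflat.probes runtime setting controls how many partitions are searched per query; the values below follow the common heuristics from the pgvector docs (lists ≈ rows / 1000 up to about a million rows, probes ≈ sqrt(lists)):
CREATE INDEX ON documents
USING ivfflat (embedding vector_l2_ops)
WITH (lists = 1000);      -- heuristic: rows / 1000 for a ~1M-row table
SET ivfflat.probes = 32;  -- heuristic: ~sqrt(lists); higher = better recall, slower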
Conclusion: From Theory to Production-Ready Vector Search
Successfully deploying pgvector with HNSW at scale is an exercise in systems engineering. It requires moving beyond the simple ORDER BY clause and embracing the complexity of the underlying ANN algorithm. The path to a performant, reliable vector search engine in PostgreSQL runs through methodical benchmarking and a nuanced understanding of the trade-offs between m, ef_construction, and ef_search.
By treating these parameters not as magic numbers but as levers controlling the fundamental balance of graph quality, search depth, and resource consumption, you can build a system that meets stringent latency and recall requirements. Remember to establish ground truth, automate your benchmarking, and plan for the operational realities of data churn and memory management. With this approach, PostgreSQL transforms from a traditional relational database into a powerful, production-grade vector search platform.