PostgreSQL Partial Indexes on GIN-indexed JSONB Columns
The High Cost of Unconditional JSONB Indexing
In modern data modeling, the JSONB type in PostgreSQL is a powerful tool, offering schema flexibility without sacrificing indexing capabilities. The standard approach for making JSONB searchable is to apply a Generalized Inverted Index (GIN). While effective, a naive CREATE INDEX ... ON my_table USING GIN (jsonb_column) on a large table with millions or billions of rows introduces significant performance and operational challenges:
* Index size: A GIN index over a JSONB column can often be larger than the table data itself. This consumes expensive disk space and can strain memory when parts of the index are loaded for queries.
* Write amplification: Every INSERT or UPDATE to the indexed JSONB column requires a corresponding update to the large GIN index, leading to increased I/O, CPU usage, and potential write contention.
* Maintenance overhead: The VACUUM process must work harder to reclaim space, and maintenance operations take longer, impacting overall database health.
* Wasted work on selective queries: Many applications only ever query a subset of rows (e.g., WHERE status = 'active' or WHERE tenant_id = 'some_tenant'). A full GIN index forces the query planner to sift through indexed data for all rows, including irrelevant ones, even if it can eventually filter them out.

This article focuses on a precise, surgical solution to this problem: partial GIN indexes on JSONB columns. We assume you are already familiar with the fundamentals of GIN indexes and the @> (contains), ? (exists), and other JSONB operators. Our goal is to demonstrate how to apply a WHERE clause to the index definition itself to create smaller, faster, and more efficient indexes tailored to your application's critical query paths.
Baseline Scenario: The Unoptimized E-commerce Product Catalog
Let's establish a baseline to quantify the problem. We'll model a product catalog table where each product has a properties JSONB column to store a wide variety of attributes.
First, let's set up our test environment and table.
-- Ensure we have the necessary extensions
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
-- Create the product catalog table
CREATE TABLE products (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
sku TEXT NOT NULL UNIQUE,
product_status TEXT NOT NULL, -- e.g., 'draft', 'active', 'archived'
properties JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Create an index on the status for filtering
CREATE INDEX idx_products_status ON products(product_status);
Now, let's populate this table with a realistic dataset of 5 million products. We'll simulate a distribution where only 10% of products are active, 80% are archived, and 10% are drafts. This is a common pattern where the vast majority of data is historical or inactive.
-- Insert 5 million records
INSERT INTO products (sku, product_status, properties)
SELECT
'SKU-' || n,
CASE
WHEN r < 0.1 THEN 'active'
WHEN r < 0.2 THEN 'draft'
ELSE 'archived'
END,
jsonb_build_object(
'name', 'Product ' || n,
'tags', ARRAY['tag' || (n % 100), 'tag' || (n % 101)],
'specs', jsonb_build_object(
'weight_kg', n % 50,
'dimensions', jsonb_build_object('w', n % 100, 'h', n % 200, 'd', n % 50),
'material', CASE WHEN n % 3 = 0 THEN 'steel' WHEN n % 3 = 1 THEN 'plastic' ELSE 'wood' END
),
'inventory', jsonb_build_object(
'stock', n % 1000,
'warehouses', ARRAY['wh_a', 'wh_b']
),
-- Add some random data to increase JSONB complexity
'random_attr_' || (n % 20), 'value_' || n
)
FROM (
    -- random() must be computed per row; invoking it directly in the outer
    -- FROM clause would evaluate it once and give every row the same status
    SELECT n, random() AS r
    FROM generate_series(1, 5000000) AS n
) AS source;
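After the bulk load, it is worth a quick sanity check that the 10/10/80 split actually materialized (a sketch against the table populated above; exact counts will vary slightly because the split is random):

```sql
-- Confirm the simulated status distribution
SELECT product_status,
       count(*) AS row_count,
       round(100.0 * count(*) / sum(count(*)) OVER (), 1) AS pct
FROM products
GROUP BY product_status
ORDER BY row_count DESC;
```

Running ANALYZE after a load of this size is also advisable so the planner has fresh statistics for the experiments that follow.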
Now, let's create the standard, full GIN index on the properties column.
-- Create a full GIN index on the entire JSONB column
CREATE INDEX idx_gin_products_properties_full ON products USING GIN (properties);
Let's analyze its size:
-- Check the size of the table and the full GIN index
SELECT
pg_size_pretty(pg_total_relation_size('products')) as total_table_size,
pg_size_pretty(pg_relation_size('idx_gin_products_properties_full')) as full_gin_index_size;
On a standard PostgreSQL 15 instance, this yields results like:
| total_table_size | full_gin_index_size |
|---|---|
| 3800 MB | 1450 MB |
The GIN index alone is nearly 40% of the total relation size (note that pg_total_relation_size includes the table's indexes and TOAST data). This is our baseline.
Now, consider a common query: finding all active products with the tag tag50.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, properties ->> 'name'
FROM products
WHERE
product_status = 'active'
AND
properties @> '{"tags": ["tag50"]}';
The query plan will look something like this:
Bitmap Heap Scan on products (cost=1234.56..5678.90 rows=500 width=45) (actual time=15.123..25.456 rows=5000 loops=1)
Recheck Cond: (properties @> '{"tags": ["tag50"]}'::jsonb)
Filter: (product_status = 'active'::text)
Rows Removed by Filter: 45000
Heap Blocks: exact=48000
Buffers: shared hit=49000
-> Bitmap Index Scan on idx_gin_products_properties_full (cost=0.00..1234.00 rows=50000 width=0) (actual time=12.123..12.123 rows=50000 loops=1)
Index Cond: (properties @> '{"tags": ["tag50"]}'::jsonb)
Buffers: shared hit=1000
Planning Time: 0.250 ms
Execution Time: 26.000 ms
Key observations from the plan:
* The Bitmap Index Scan uses idx_gin_products_properties_full to find all 50,000 products that contain tag50.
* The Bitmap Heap Scan then fetches roughly 48,000 heap blocks to retrieve those rows (Heap Blocks: exact=48000).
* Only after fetching does the executor apply the product_status = 'active' filter, discarding 90% of the fetched rows (Rows Removed by Filter: 45000).

This is fundamentally inefficient. We are forcing the database to read and process 45,000 rows from disk that we know in advance we don't need. The GIN index contains pointers to all products, but our query only cares about the active ones.
The Solution: A Partial GIN Index
A partial index includes a WHERE clause in its definition, meaning it only indexes rows that satisfy the condition. This is a perfect match for our problem.
Let's create a GIN index that only includes active products.
-- Create a partial GIN index for active products only
CREATE INDEX idx_gin_products_properties_active_partial
ON products USING GIN (properties)
WHERE product_status = 'active';
First, let's compare the index size.
-- Check the size of the new partial index
SELECT pg_size_pretty(pg_relation_size('idx_gin_products_properties_active_partial')) as partial_gin_index_size;
| partial_gin_index_size |
|---|
| 145 MB |
This is a 10x reduction in index size. The partial index is only 145 MB compared to the full index's 1450 MB. This immediately translates to savings in disk space, backup time, and memory pressure.
Now, let's re-run our query. PostgreSQL's query planner is smart enough to match the query's WHERE clause with the partial index's definition.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, properties ->> 'name'
FROM products
WHERE
product_status = 'active'
AND
properties @> '{"tags": ["tag50"]}';
The new query plan:
Bitmap Heap Scan on products (cost=123.45..567.89 rows=500 width=45) (actual time=1.512..2.545 rows=5000 loops=1)
Recheck Cond: ((product_status = 'active'::text) AND (properties @> '{"tags": ["tag50"]}'::jsonb))
Heap Blocks: exact=4800
Buffers: shared hit=4900
-> Bitmap Index Scan on idx_gin_products_properties_active_partial (cost=0.00..123.00 rows=5000 width=0) (actual time=1.212..1.212 rows=5000 loops=1)
Index Cond: (properties @> '{"tags": ["tag50"]}'::jsonb)
Buffers: shared hit=100
Planning Time: 0.150 ms
Execution Time: 2.800 ms
Let's break down the massive improvements:
* The planner chose idx_gin_products_properties_active_partial. The Bitmap Index Scan is an order of magnitude faster (1.2ms vs 12ms) and required far fewer buffer hits because the index is much smaller and denser with relevant data.
* Heap block fetches dropped by 10x (exact=4800 vs exact=48000). Because the index only contains pointers to active products, we no longer fetch the 45,000 archived/draft products from the table.
* The Filter line is gone. The filtering on product_status is implicitly handled by the choice of index. The condition is still present in the Recheck Cond, but the expensive work of fetching and discarding rows is eliminated.

We can now safely drop the full GIN index if queries for non-active products are not a performance priority.
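Before dropping the full index, it is prudent to confirm it is no longer being chosen. The pg_stat_user_indexes view tracks how often each index has been scanned (a sketch; counts accumulate since the last statistics reset, so check over a representative window):

```sql
-- How often has each index on products actually been used?
SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE relname = 'products'
ORDER BY idx_scan;
```

If idx_gin_products_properties_full shows no new scans after the partial index was created, it is a safe candidate for removal.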
DROP INDEX idx_gin_products_properties_full;
Advanced Pattern 1: Multi-Tenant Isolation
Consider a SaaS application where a documents table stores user-generated content for multiple tenants. A tenant_id column segregates the data.
CREATE TABLE documents (
id UUID PRIMARY KEY,
tenant_id UUID NOT NULL,
is_public BOOLEAN NOT NULL DEFAULT FALSE,
content JSONB
);
CREATE INDEX idx_documents_tenant_id ON documents(tenant_id);
Some tenants might be on an enterprise plan and have extremely demanding query patterns on their content JSONB, while free-tier tenants may have simpler needs. A single GIN index on content would be bloated by the data from all tenants: every tenant's rows inflate the one index that serves everyone, a "noisy neighbor" effect in which even low-activity tenants degrade lookup performance for your most demanding ones.
We can use partial indexes to create performance isolation.
-- For a high-value enterprise tenant, create a dedicated partial GIN index
CREATE INDEX idx_gin_documents_content_tenant_enterprise_A
ON documents USING GIN (content)
WHERE tenant_id = 'a1b2c3d4-e5f6-a7b8-c9d0-e1f2a3b4c5d6';
-- For another critical tenant
CREATE INDEX idx_gin_documents_content_tenant_enterprise_B
ON documents USING GIN (content)
WHERE tenant_id = 'f1e2d3c4-b5a6-f7e8-d9c0-b1a2f3e4d5c6';
Now, a query from Enterprise Tenant A will be serviced by its own small, highly-efficient index:
EXPLAIN ANALYZE
SELECT id FROM documents
WHERE tenant_id = 'a1b2c3d4-e5f6-a7b8-c9d0-e1f2a3b4c5d6'
-- Use the containment operator: a GIN (jsonb_ops) index cannot serve an
-- extraction expression like content -> 'metadata' ->> 'status' = 'approved'
AND content @> '{"metadata": {"status": "approved"}}';
The planner will select idx_gin_documents_content_tenant_enterprise_A, completely ignoring the data from all other tenants at the index level.
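As the number of tenant-specific indexes grows, it helps to audit which ones exist and what predicates they carry. The pg_indexes catalog view exposes the full definition, including the WHERE clause (a sketch):

```sql
-- List all partial GIN indexes on documents with their predicates
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'documents'
  AND indexdef LIKE '%USING gin%'
  AND indexdef LIKE '%WHERE%';
```

This makes it straightforward to reconcile the set of dedicated indexes against the current list of enterprise tenants.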
Trade-offs and Considerations:
* Management Overhead: This pattern is not suitable for every tenant. It should be reserved for a small number of high-value tenants where performance is critical. Managing thousands of indexes is an operational burden.
* Dynamic Index Creation: You could build an operational process to create/drop these indexes as tenants upgrade or downgrade plans, but this requires careful automation and monitoring.
* Generic Fallback: You might still want a broader, less-performant partial index for non-enterprise tenants, for example, on publicly accessible documents.
-- A fallback index for all public documents, regardless of tenant
CREATE INDEX idx_gin_documents_content_public
ON documents USING GIN (content)
WHERE is_public = TRUE;
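A query restricted to public documents can then be served by this shared index regardless of tenant (a sketch; the "category" key in content is an assumed part of the document shape):

```sql
-- Eligible for idx_gin_documents_content_public: the is_public = TRUE
-- clause matches the index predicate, and @> is GIN-indexable
EXPLAIN
SELECT id
FROM documents
WHERE is_public = TRUE
  AND content @> '{"category": "press-release"}';
```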
Advanced Pattern 2: Indexing on JSONB Sub-documents and Expressions
Sometimes, indexing the entire JSONB document is wasteful. If your queries consistently target a specific nested object within the JSONB, you can create a partial index on an expression that extracts just that sub-object.
Let's go back to our products table. Assume we frequently run complex queries only within the specs object for active products.
// Example 'properties' field
{
"name": "...",
"tags": ["..."],
"specs": {
"weight_kg": 15,
"dimensions": {"w": 50, "h": 100, "d": 25},
"material": "steel"
}
}
Instead of indexing the whole properties blob, we can do this:
-- Create a partial GIN index on just the 'specs' object for active products
CREATE INDEX idx_gin_products_specs_active_partial
ON products USING GIN ((properties -> 'specs'))
WHERE product_status = 'active';
This index is even smaller and more efficient than our first partial index because it only contains keys from the specs object.
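As before, the reduction can be measured directly; the actual figure depends on how much of each document lives under specs (a sketch):

```sql
-- Compare the expression index against the earlier partial index
SELECT pg_size_pretty(pg_relation_size('idx_gin_products_specs_active_partial'))
       AS specs_partial_index_size;
```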
Let's test it with a query that searches for active products made of steel with a specific weight.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, sku
FROM products
WHERE product_status = 'active'
AND properties -> 'specs' @> '{"material": "steel", "weight_kg": 15}';
The query planner will use our new, highly-specialized index:
... -> Bitmap Index Scan on idx_gin_products_specs_active_partial ...
This is the pinnacle of this technique: we are indexing the smallest possible subset of data (only the specs object) for the smallest possible subset of rows (only active products), resulting in an extremely fast and compact index.
Edge Cases and Production Pitfalls
While powerful, this technique requires precision. Here are common issues senior engineers face.
1. Predicate Mismatches
The WHERE clause in your query must exactly match the WHERE clause of your partial index for the planner to consider it. Even a semantically identical but syntactically different clause will cause the planner to ignore the index.
Example: Our index is WHERE product_status = 'active'.
* WHERE product_status = 'active' AND ... -> WILL use the index.
* WHERE product_status IN ('active') AND ... -> WILL use the index.
* WHERE product_status <> 'archived' AND product_status <> 'draft' AND ... -> WILL NOT use the index, even though it's logically the same for our dataset. The planner does not perform this level of logical deduction.
* WHERE upper(product_status) = 'ACTIVE' AND ... -> WILL NOT use the index.
Solution: Enforce strict conventions in your application's data access layer. Ensure that queries are constructed consistently to match the available partial indexes. This often requires close collaboration between developers and database administrators.
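One cheap way to enforce this is to assert on the plan itself in a test suite: EXPLAIN (COSTS OFF) produces stable output that can be checked for the expected index name (a sketch):

```sql
-- Will the partial index be chosen? The plan should name it explicitly.
EXPLAIN (COSTS OFF)
SELECT id
FROM products
WHERE product_status = 'active'
  AND properties @> '{"tags": ["tag50"]}';
-- Expect a line of the form:
--   Bitmap Index Scan on idx_gin_products_properties_active_partial
```

Wiring such a check into CI catches regressions when a query builder change silently stops matching the index predicate.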
2. Write Performance vs. Read Performance
While a partial index is smaller and faster to update for rows it covers, having many partial indexes can degrade overall write performance. When a row is inserted or updated, PostgreSQL must check it against the WHERE clause of every partial index on that table to see if a new index entry is needed.
Consider a scenario with 5 statuses: draft, pending, active, rejected, archived. If you create a separate partial GIN index for each status:
CREATE INDEX ... WHERE product_status = 'draft';
CREATE INDEX ... WHERE product_status = 'pending';
CREATE INDEX ... WHERE product_status = 'active';
...
An INSERT statement will be fast, as it needs to update at most one of these indexes: the one whose predicate matches the new row's status, or none if no predicate matches. However, an UPDATE that changes the status from draft to pending will incur writes to two indexes: a deletion from the 'draft' index and an insertion into the 'pending' index.
Benchmarking is crucial. Before deploying a multi-index strategy, use a tool like pgbench with a custom script that mimics your application's write patterns (e.g., transitions between states) to measure the impact on transaction throughput.
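Such a benchmark might look like the following pgbench custom script, which simulates draft-to-pending transitions against the products table (a sketch; the file name, range, and connection settings are illustrative):

```sql
-- status_transition.sql: flip a random draft product to pending,
-- forcing a delete from one partial index and an insert into another
\set pid random(1, 5000000)
UPDATE products
SET product_status = 'pending',
    updated_at = NOW()
WHERE sku = 'SKU-' || :pid
  AND product_status = 'draft';
```

Run it with something like `pgbench -n -f status_transition.sql -c 8 -T 60 mydb`, once against a single full GIN index and once against the per-status partial indexes, and compare the reported TPS.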
3. NULLs in the WHERE Clause
A common use case for partial indexes is to exclude NULL values. For a JSONB column, you might want to index only those rows where a specific key exists.
Example: Index documents that have a review object.
-- This index only includes rows where the 'review' key is present in the JSONB
CREATE INDEX idx_gin_documents_content_with_review
ON documents USING GIN (content)
WHERE content ? 'review';
This is highly effective for indexing optional, but important, data structures within your JSONB. Rows that lack the key are kept out of the index entirely, so absent or sparse data cannot bloat it.
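One subtlety: for the planner to choose this index, the query must repeat the index's predicate verbatim. The planner cannot deduce that a containment test on a review object implies the review key exists (a sketch; the "rating" key is an assumed part of the review structure):

```sql
-- The explicit "content ? 'review'" clause is what lets the planner
-- match the partial index; the @> clause alone would not qualify
SELECT id
FROM documents
WHERE content ? 'review'
  AND content @> '{"review": {"rating": 5}}';
```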
4. Interaction with `jsonb_path_ops`
By default, GIN indexes on JSONB use the jsonb_ops operator class, which creates index entries for every key and value independently. PostgreSQL also offers jsonb_path_ops, which hashes each complete path-to-value chain into a single entry and is optimized for the @> containment operator. jsonb_path_ops indexes are typically smaller and more selective, but they do not support key-existence operators such as ? (exists), ?|, and ?&.
When creating a partial index, you must consider which operators your queries will use. If your partial index's corresponding queries only ever use @>, you can gain a further size reduction by specifying jsonb_path_ops.
-- A partial index using the more specialized operator class
CREATE INDEX idx_gin_products_properties_active_partial_path_ops
ON products USING GIN (properties jsonb_path_ops)
WHERE product_status = 'active';
This index would be even smaller than our 145 MB example, but it could not be used to satisfy a query like WHERE product_status = 'active' AND properties ? 'some_key'. The choice depends entirely on your query patterns.
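To make the distinction concrete: with the jsonb_path_ops index in place, the first query below can use it, while the second cannot and will fall back to another index or a heap scan (a sketch):

```sql
-- Supported by jsonb_path_ops: containment
SELECT id FROM products
WHERE product_status = 'active'
  AND properties @> '{"specs": {"material": "steel"}}';

-- NOT supported by jsonb_path_ops: key existence
SELECT id FROM products
WHERE product_status = 'active'
  AND properties ? 'inventory';
```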
Summary of Production Patterns and Anti-Patterns
✅ Use When:
* You have a low-cardinality state column (status, state) and the vast majority of performance-critical queries target one or two states (active, published).
* Rows are differentiated by a tenant or type column, and you need to index the JSONB payload differently for each type.
* A specific JSONB attribute is only present on a small fraction of rows.

❌ Avoid When:

* The index predicate is high-cardinality (e.g., user_id where there are millions of users); this is an anti-pattern that leads to an unmanageable number of indexes.
* The column in the WHERE clause of your partial index is updated very frequently; you will incur high overhead from deleting entries from one index and inserting into another.

Conclusion
Combining partial indexes with GIN on JSONB columns is an advanced optimization technique that moves beyond brute-force indexing. It requires a deep understanding of your application's query patterns and data distribution. By surgically creating indexes that cover only the hot, critical subsets of your data, you can achieve order-of-magnitude improvements in query performance, drastically reduce storage and memory overhead, and improve the overall health and maintainability of your PostgreSQL database. This is not a default strategy, but a powerful tool in the arsenal of a senior engineer tasked with scaling a data-intensive application.