Optimizing Nested JSONB Queries in Postgres with GIN & B-Tree
The High Cost of Naive JSONB Queries in Production
In modern application development, using PostgreSQL's JSONB type to store semi-structured data is a common and powerful pattern. It offers schema flexibility without sacrificing the transactional integrity of a relational database. However, this flexibility comes with a hidden performance cost. A JSONB column, without proper indexing, is an opaque blob to the query planner. Any query attempting to filter on its contents will inevitably lead to a full sequential scan, a performance killer for tables with millions of rows.
Consider a typical multi-tenant SaaS application managing user-generated documents. A simplified table might look like this:
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
data JSONB NOT NULL
);
-- Let's populate it with a significant amount of data
INSERT INTO documents (tenant_id, data)
SELECT
'a1b2c3d4-e5f6-a7b8-c9d0-e1f2a3b4c5d6'::uuid,
jsonb_build_object(
'type', CASE WHEN random() < 0.5 THEN 'invoice' ELSE 'report' END,
'status', 'processed',
'metadata', jsonb_build_object(
'source', 'api',
'version', 2.1,
'tags', ARRAY['internal', 'q3-2024']
),
'content', jsonb_build_object(
'customer_id', (random() * 10000)::int,
'line_items', 5
)
)
FROM generate_series(1, 1000000);
Now, imagine a frequent query: finding all documents for a specific customer within the nested content object.
EXPLAIN ANALYZE
SELECT id, data->'content'->'customer_id' as customer_id
FROM documents
WHERE data->'content'->>'customer_id' = '42';
The query plan will be grim:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..63605.53 rows=500 width=40) (actual time=2.919..385.127 rows=95 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on documents (cost=0.00..62555.53 rows=208 width=40) (actual time=1.841..378.895 rows=32 loops=3)
Filter: ((data -> 'content' ->> 'customer_id'::text) = '42'::text)
Rows Removed by Filter: 333301
Planning Time: 0.116 ms
Execution Time: 385.183 ms
A Parallel Sequential Scan. On a 1 million row table, this takes ~385ms. On a 100 million row table, this query becomes unusable. This is the baseline problem we will systematically solve using advanced indexing techniques.
Level 1: The Standard GIN Index (`jsonb_ops`)
The first line of defense is a Generalized Inverted Index (GIN). A standard GIN index on a JSONB column creates an index entry for every key and every value within the JSON document.
CREATE INDEX idx_gin_documents_data ON documents USING GIN (data);
This index is a powerhouse for operators like existence (?), key existence (?|, ?&), and containment (@>). The containment operator is particularly useful for finding documents that have a specific key-value pair at any level of nesting.
Let's rewrite our query to use the containment operator:
EXPLAIN ANALYZE
SELECT id, data->'content'->'customer_id' as customer_id
FROM documents
WHERE data @> '{"content": {"customer_id": 42}}';
Now, the query plan tells a different story:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on documents (cost=124.50..524.53 rows=1000 width=40) (actual time=0.291..0.435 rows=95 loops=1)
Recheck Cond: (data @> '{"content": {"customer_id": 42}}'::jsonb)
Heap Blocks: exact=95
-> Bitmap Index Scan on idx_gin_documents_data (cost=0.00..124.25 rows=1000 width=0) (actual time=0.274..0.274 rows=95 loops=1)
Index Cond: (data @> '{"content": {"customer_id": 42}}'::jsonb)
Planning Time: 0.158 ms
Execution Time: 0.467 ms
From 385ms to under 1ms. A massive improvement. The Bitmap Index Scan shows the GIN index is being used effectively.
Production Consideration: The default jsonb_ops GIN index is versatile but comes with a significant cost. It indexes everything. This leads to:
INSERT and UPDATE operations become more expensive because every key and value in the JSONB payload must be processed and added to the index. This can be a major bottleneck in write-heavy systems.For many use cases, especially those involving querying for the presence of data rather than specific values, this index is sufficient. But for performance-critical path-based lookups, we can do better.
Level 2: Path-Optimized GIN Index (`jsonb_path_ops`)
PostgreSQL offers an alternative GIN operator class: jsonb_path_ops. Unlike jsonb_ops, which indexes keys and values, jsonb_path_ops only indexes hashes of the values at specific paths. It does not index keys at all.
This has two primary effects:
The trade-off is that jsonb_path_ops only supports the containment (@>) operator. It cannot accelerate key existence operators like ?.
Let's build this index and compare.
-- First, drop the old index
DROP INDEX idx_gin_documents_data;
-- Create the path-optimized index
CREATE INDEX idx_gin_path_ops_documents_data ON documents USING GIN (data jsonb_path_ops);
Now, we run the exact same containment query:
EXPLAIN ANALYZE
SELECT id, data->'content'->'customer_id' as customer_id
FROM documents
WHERE data @> '{"content": {"customer_id": 42}}';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on documents (cost=64.50..464.53 rows=1000 width=40) (actual time=0.187..0.321 rows=95 loops=1)
Recheck Cond: (data @> '{"content": {"customer_id": 42}}'::jsonb)
Heap Blocks: exact=95
-> Bitmap Index Scan on idx_gin_path_ops_documents_data (cost=0.00..64.25 rows=1000 width=0) (actual time=0.173..0.173 rows=95 loops=1)
Index Cond: (data @> '{"content": {"customer_id": 42}}'::jsonb)
Planning Time: 0.121 ms
Execution Time: 0.350 ms
The performance is slightly better (~0.35ms vs ~0.46ms), primarily because the smaller index can be scanned faster. The real win is in index size and write performance.
Let's check the index sizes:
SELECT relname, pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes
WHERE relname IN ('idx_gin_documents_data', 'idx_gin_path_ops_documents_data');
-- Assuming you recreate the first index to compare
-- idx_gin_documents_data | 152 MB
-- idx_gin_path_ops_documents_data | 88 MB
The jsonb_path_ops index is nearly 45% smaller. In a production environment with billions of rows and frequent writes, this difference is monumental. It translates to lower storage costs, faster backups, and reduced write amplification.
When to use jsonb_path_ops: If your query patterns exclusively rely on checking if a JSON document contains a specific structure with specific values (@>), jsonb_path_ops is almost always the superior choice over the default jsonb_ops.
Level 3: B-Tree Indexes on Expressions for High Selectivity
GIN indexes are powerful, but they have a fundamental limitation: they are not ideal for operators that rely on ordering, such as range queries (>, <, BETWEEN) or even simple equality on specific, typed data. The query planner can use a GIN index for equality, as we saw, but it's a multi-step process (Bitmap Index Scan -> Bitmap Heap Scan) that is less efficient than a direct lookup.
This is where B-Tree indexes on expressions shine. If you consistently query a specific, strongly-typed field deep within your JSONB structure, you can create a B-Tree index directly on the expression that extracts that value.
Let's return to our original query that used the ->> operator, which extracts a value as text.
-- This expression extracts the customer_id and casts it to an integer
(data->'content'->>'customer_id')::int
We can create an index on this exact expression.
-- Drop the GIN index for a clean comparison
DROP INDEX idx_gin_path_ops_documents_data;
CREATE INDEX idx_btree_documents_customer_id ON documents (((data->'content'->>'customer_id')::int));
Note the double parentheses. The first set is for the expression, and the second is for the CREATE INDEX syntax.
Now, let's run the original query that performed so poorly.
EXPLAIN ANALYZE
SELECT id, data->'content'->'customer_id' as customer_id
FROM documents
WHERE (data->'content'->>'customer_id')::int = 42;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_btree_documents_customer_id on documents (cost=0.43..12.35 rows=95 width=40) (actual time=0.038..0.125 rows=95 loops=1)
Index Cond: (((data -> 'content' ->> 'customer_id'::text))::integer = 42)
Planning Time: 0.174 ms
Execution Time: 0.155 ms
The query plan now shows an Index Scan. This is the most efficient access method in PostgreSQL. The planner directly navigates the B-Tree structure to find the pointers to the matching rows. The execution time has dropped to ~0.15ms, more than twice as fast as our best GIN index performance.
Advantages of B-Tree on Expressions:
WHERE (data->'content'->>'customer_id')::int > 1000 AND (data->'content'->>'customer_id')::int < 1050.Disadvantages:
data->'content'->>'line_items' gets no benefit.customer_id key, the expression will evaluate to NULL. You may need to handle this in your index or queries (e.g., WHERE COALESCE(...)).Advanced Pattern: Partial Indexes for Heterogeneous Collections
Real-world JSONB columns are often used to store different types of documents within the same table. Our example table contains both invoice and report documents. It's likely that queries for invoices look at different fields than queries for reports.
Creating a global B-Tree index on (data->'content'->>'customer_id')::int is wasteful if only invoice documents contain this field. The index will be bloated with NULL entries for every report document.
The solution is a partial index.
Let's say we only ever search for the customer_id on documents of type invoice.
-- Drop the full index
DROP INDEX idx_btree_documents_customer_id;
-- Create a partial index that only includes 'invoice' documents
CREATE INDEX idx_partial_btree_documents_invoice_customer_id
ON documents (((data->'content'->>'customer_id')::int))
WHERE (data->>'type') = 'invoice';
The WHERE clause is the key. PostgreSQL will only add an entry to this index if the document satisfies the condition. This results in an index that is roughly half the size and faster to scan, as it doesn't have to sift through irrelevant entries.
Now, when we run a query that includes this condition, the planner will intelligently select the partial index:
EXPLAIN ANALYZE
SELECT id
FROM documents
WHERE (data->>'type') = 'invoice'
AND (data->'content'->>'customer_id')::int = 42;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_partial_btree_documents_invoice_customer_id on documents (cost=0.42..10.25 rows=47 width=16) (actual time=0.035..0.075 rows=48 loops=1)
Index Cond: (((data -> 'content' ->> 'customer_id'::text))::integer = 42)
Planning Time: 0.201 ms
Execution Time: 0.098 ms
The performance is now under 0.1ms. The query planner is smart enough to see that the WHERE clause in the query is a superset of the index's WHERE clause, and therefore uses the smaller, more efficient partial index.
This pattern is incredibly powerful for optimizing large, multi-type tables. You can create multiple, small, highly-specialized partial indexes for different document types, keeping overall index overhead low while providing lightning-fast lookups for specific query patterns.
Production Benchmarking and Decision Framework
To make an informed decision, you must benchmark against your own data and query patterns. Here is a summary of our findings on the 1M row table:
| Indexing Strategy | Query Type | Execution Time (ms) | Index Size (MB) | Pros | Cons |
|---|---|---|---|---|---|
| None | Equality (->>) | 385.183 | 0 | No write overhead | Unusable performance on large tables |
GIN (jsonb_ops) | Containment (@>) | 0.467 | ~152 | Versatile, supports existence & containment | Large index, slower writes |
GIN (jsonb_path_ops) | Containment (@>) | 0.350 | ~88 | Smaller, faster writes, great for @> | Only supports @>, no existence checks (?) |
| B-Tree on Expression | Typed Equality ((->>'key')::type = val) | 0.155 | ~25 | Fastest for selective queries, supports ranges | Inflexible, only works for the specific expression |
| Partial B-Tree on Expression | Typed Equality with WHERE clause | 0.098 | ~13 | Extremely fast & small for filtered queries | Even more specific, requires matching WHERE clause |
Decision Framework
When faced with a slow JSONB query, use this thought process:
If you need to find documents containing a set of keys or a nested structure (@>), start with a GIN index. Choose jsonb_path_ops if @> is your only* use case, for its smaller size and better write performance. Choose jsonb_ops if you also need key existence checks (?, ?|, ?&).
* If you are frequently running queries like WHERE (data->'path'->>'key')::int = 123 or ... > 100, a B-Tree index on that specific expression will almost certainly outperform a GIN index.
* If yes, and your queries target specific document types, always prefer a partial index. Create specialized B-Tree or GIN indexes with a WHERE (data->>'type') = 'your_type' clause. This prevents index bloat and dramatically improves performance by reducing the search space.
* Yes. It's common to have a general GIN (data jsonb_path_ops) index for ad-hoc exploration and several highly specific partial B-Tree indexes on expressions for performance-critical queries. PostgreSQL's query planner is capable of picking the best index for a given query, or even combining multiple indexes in some scenarios.
By moving beyond a one-size-fits-all GIN index and embracing these advanced, specific indexing strategies, you can unlock the full performance potential of JSONB in PostgreSQL, ensuring your application remains fast and scalable even as your data grows in volume and complexity.