PostgreSQL BRIN Indexes for Petabyte-Scale Time-Series Data
The Senior Engineer's Dilemma: Time-Series Data and B-Tree Index Bloat
As systems scale, one of the most common and challenging datasets to manage is time-series data. Whether it's application logs, IoT sensor readings, or financial transactions, these tables can grow to billions of rows, pushing standard database optimization techniques to their limits.
A senior engineer's first instinct for query performance is to add an index. For a time-series table, the created_at or event_time column is the obvious candidate. The default choice, a B-Tree index, is the workhorse of relational databases. It excels at providing low-latency access to individual rows or small ranges.
However, on a petabyte-scale, append-only workload, the very nature of a B-Tree becomes its Achilles' heel:
- Every INSERT into the table requires a corresponding insertion into the B-Tree index. This involves traversing the tree, potentially splitting pages, and rebalancing, adding significant overhead to your ingestion pipeline.
- The index grows in lockstep with the table. At billions of rows it runs to tens of gigabytes, competing with your hot data for cache and disk.

This is where a standard tool fails a non-standard problem. For massive, physically ordered datasets, we need a more intelligent, space-efficient indexing strategy. Enter the BRIN (Block Range INdex). This post is not an introduction to BRIN; it's a deep dive into its production implementation, tuning, and the edge cases you'll face when using it to manage truly large-scale time-series data.
Deep Dive: The Mechanics of BRIN and Data Correlation
To effectively use BRIN, you must understand that it operates on a fundamentally different principle than a B-Tree. A B-Tree maps a key value to a specific row's physical location (TID). In contrast, a BRIN index is a metadata structure that stores a summary for a range of physical table blocks.
For each block range (a configurable number of 8KB pages, by default 128), a BRIN index stores the minimum and maximum values of the indexed column(s) found within that range.
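To get a feel for how coarse each summary is on your own table, you can estimate how many rows a default-sized range covers from the planner statistics. A rough back-of-the-envelope sketch, using the placeholder table name from later in this post (run ANALYZE first so reltuples and relpages are populated):
-- Approximate rows covered by one default block range (128 pages)
SELECT 128 * (reltuples / GREATEST(relpages, 1)) AS approx_rows_per_range
FROM pg_class
WHERE relname = 'your_time_series_table';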
Consider a table of sensor readings ordered by event_time:
Physical Table Blocks (Heap):
Block 1-128: Contains rows with event_time from '2023-10-01 00:00:00' to '2023-10-01 08:00:00'
Block 129-256: Contains rows with event_time from '2023-10-01 08:00:01' to '2023-10-01 16:00:00'
Block 257-384: Contains rows with event_time from '2023-10-01 16:00:01' to '2023-10-02 00:00:00'
...and so on.
A BRIN index on event_time would look conceptually like this:
BRIN Index Entry 1: { Block Range: 1-128, Min: '2023-10-01 00:00:00', Max: '2023-10-01 08:00:00' }
BRIN Index Entry 2: { Block Range: 129-256, Min: '2023-10-01 08:00:01', Max: '2023-10-01 16:00:00' }
BRIN Index Entry 3: { Block Range: 257-384, Min: '2023-10-01 16:00:01', Max: '2023-10-02 00:00:00' }
When you execute a query like WHERE event_time > '2023-10-01 12:00:00', the query planner scans the tiny BRIN index. It sees:
- Entry 1's max value is less than the query value, so it can be definitively skipped. No I/O for blocks 1-128.
- Entry 2's max value is greater than the query value, so blocks 129-256 might contain matching rows and are marked for scanning.
- Entry 3's min value is greater than the query value, so every row in its range qualifies; it marks blocks 257-384 for scanning as well.
The result is a Bitmap Heap Scan. The BRIN index acts as a coarse filter, creating a bitmap of pages that could contain matching rows. The database then only reads those specific pages from disk, dramatically reducing I/O compared to a full table scan or a B-Tree scan over cold data.
The Critical Prerequisite: Physical Correlation
BRIN's effectiveness is entirely dependent on the physical ordering of data on disk. The values in the indexed column must be strongly correlated with their physical storage location. For append-only time-series data, this is usually the case by default—new rows with later timestamps are appended to the end of the table.
PostgreSQL provides a way to measure this: pg_stats.correlation.
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'your_time_series_table';
A correlation value close to 1.0 (or -1.0 for descending order) is ideal for a BRIN index. A value near 0 indicates no correlation, and a BRIN index will be useless as every block range will contain the full spectrum of values, causing the planner to scan the entire table anyway.
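If correlation on the time column has degraded (heavy UPDATEs, out-of-order backfills), you can restore it by physically rewriting the table in time order, as the later section on CLUSTER discusses for the multi-column case. A minimal sketch using the placeholder table name above (idx_tmp_event_time is a hypothetical index name):
-- One-time fix if correlation has degraded: rewrite the table in time order.
-- CLUSTER rewrites the whole table and takes an ACCESS EXCLUSIVE lock.
CREATE INDEX IF NOT EXISTS idx_tmp_event_time ON your_time_series_table (event_time);
CLUSTER your_time_series_table USING idx_tmp_event_time;
ANALYZE your_time_series_table;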
Production Implementation: A Real-World IoT Scenario
Let's move from theory to practice. We'll simulate an IoT device readings table and benchmark B-Tree vs. BRIN.
1. Schema and Data Generation
First, set up the table and a function to generate a large, realistic dataset.
-- Connect to your database
-- \c your_db
CREATE TABLE device_readings (
reading_id BIGSERIAL PRIMARY KEY,
device_id UUID NOT NULL,
event_time TIMESTAMPTZ NOT NULL,
payload JSONB
);
-- Function to generate realistic, ordered data
-- (gen_random_uuid() is built in since PostgreSQL 13; on older versions, CREATE EXTENSION pgcrypto first)
CREATE OR REPLACE FUNCTION generate_iot_data(num_rows INT) RETURNS void AS $$
DECLARE
i INT;
start_time TIMESTAMPTZ := '2023-01-01 00:00:00 UTC';
time_increment INTERVAL := '1 second';
BEGIN
FOR i IN 1..num_rows LOOP
INSERT INTO device_readings (device_id, event_time, payload)
VALUES (
gen_random_uuid(),
start_time + (i * time_increment),
jsonb_build_object('temperature', 20 + random() * 5, 'humidity', 50 + random() * 10)
);
END LOOP;
END;
$$ LANGUAGE plpgsql;
-- Generate 10 million rows. This will take a few minutes.
-- For a real test, aim for 100M+ rows.
SELECT generate_iot_data(10000000);
-- Let's ensure stats are up to date
ANALYZE device_readings;
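If the row-by-row loop is too slow for your taste, a single set-based INSERT built on generate_series produces the same shape of data much faster. A sketch equivalent to generate_iot_data(10000000), assuming you start from an empty table:
-- Optional: set-based alternative to the loop above
INSERT INTO device_readings (device_id, event_time, payload)
SELECT
    gen_random_uuid(),
    TIMESTAMPTZ '2023-01-01 00:00:00 UTC' + (g * INTERVAL '1 second'),
    jsonb_build_object('temperature', 20 + random() * 5, 'humidity', 50 + random() * 10)
FROM generate_series(1, 10000000) AS g;
ANALYZE device_readings;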
Now, check the correlation of event_time.
-- Check correlation
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'device_readings' AND attname = 'event_time';
-- Expected output:
-- attname | correlation
-- -----------+-------------
-- event_time | 1
-- (1 row)
A perfect correlation of 1.0. We're ready to benchmark.
2. Baseline: The B-Tree Approach
Let's create a standard B-Tree index and analyze its performance.
-- Create the B-Tree index
CREATE INDEX idx_readings_event_time_btree ON device_readings USING btree (event_time);
-- Check table and index sizes
SELECT pg_size_pretty(pg_total_relation_size('device_readings')) as total_size,
pg_size_pretty(pg_relation_size('device_readings')) as table_size,
pg_size_pretty(pg_indexes_size('device_readings')) as index_size;
-- Sample Output (sizes will vary based on architecture and PG version):
-- total_size | table_size | index_size
-- ------------+------------+------------
-- 1833 MB | 1528 MB | 305 MB
-- (1 row)
The index is already 305 MB for just 10 million rows. For 1 billion rows, this would be ~30 GB.
Now, let's run a typical analytical query: fetching one day's worth of data from a few months ago. EXPLAIN (ANALYZE, BUFFERS) will show exactly how many blocks each plan touches. For a truly cold-cache run you would restart PostgreSQL and drop the OS page cache; here we instead use pg_prewarm so that both benchmarks start from the same, known cache state.
-- Optional: track cumulative query timings across runs
-- CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- pg_prewarm ships with PostgreSQL as a contrib extension. Its 'read' mode pulls blocks
-- through the OS page cache without loading them into shared buffers.
-- (PostgreSQL has no built-in way to evict a single relation from cache; for a true
-- cold-cache test, restart the server and drop the OS page cache instead.)
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('device_readings', 'read');
SELECT pg_prewarm('idx_readings_event_time_btree', 'read');
-- Now, let's run the query and analyze the plan
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM device_readings
WHERE event_time >= '2023-02-01 00:00:00 UTC'
AND event_time < '2023-02-02 00:00:00 UTC';
Sample B-Tree EXPLAIN Output:
Finalize Aggregate (cost=34969.34..34969.35 rows=1 width=8) (actual time=245.391..245.392 rows=1 loops=1)
-> Gather (cost=34969.13..34969.34 rows=2 width=8) (actual time=245.319..245.385 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=33969.13..33969.14 rows=1 width=8) (actual time=237.936..237.937 rows=1 loops=3)
-> Index Scan using idx_readings_event_time_btree on device_readings (cost=0.56..33741.00 rows=91252 width=0) (actual time=0.063..231.115 rows=28800 loops=3)
Index Cond: ((event_time >= '2023-02-01 00:00:00+00'::timestamp with time zone) AND (event_time < '2023-02-02 00:00:00+00'::timestamp with time zone))
Buffers: shared hit=428 read=1057
Planning Time: 0.215 ms
Execution Time: 245.452 ms
Key takeaways:
- Execution time: 245.452 ms.
- Buffers: shared hit=428 read=1057. It had to read over 1000 blocks from disk.
- Plan: a parallel Index Scan on the B-Tree.

The BRIN Index Solution: Implementation and Benchmarking
Now, let's swap the B-Tree for a BRIN index and see the difference.
-- Drop the B-Tree index
DROP INDEX idx_readings_event_time_btree;
-- Create the BRIN index
CREATE INDEX idx_readings_event_time_brin ON device_readings USING brin (event_time);
-- Check the sizes again
SELECT pg_size_pretty(pg_total_relation_size('device_readings')) as total_size,
pg_size_pretty(pg_relation_size('device_readings')) as table_size,
pg_size_pretty(pg_indexes_size('device_readings')) as index_size;
-- Sample Output:
-- total_size | table_size | index_size
-- ------------+------------+------------
-- 1528 MB | 1528 MB | 72 kB
-- (1 row)
This is the first staggering result. The index size dropped from 305 MB to a mere 72 kB. That's over a 4000x reduction in storage. This is not a typo. The BRIN index is incredibly compact.
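If you want to see the min/max summaries rather than take them on faith, the pageinspect contrib extension can decode BRIN pages. A quick sketch (requires superuser; block 0 is the metapage and block 1 the range map, so the first regular summary page is typically block 2):
CREATE EXTENSION IF NOT EXISTS pageinspect;
-- Decode the summary tuples stored on the first regular page of the BRIN index
SELECT itemoffset, blknum, value
FROM brin_page_items(get_raw_page('idx_readings_event_time_brin', 2),
                     'idx_readings_event_time_brin')
ORDER BY blknum
LIMIT 5;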
Now, let's re-run the same benchmark query, prewarming the relations again so the cache state matches the previous run.
-- Prewarm as before so the comparison is fair
SELECT pg_prewarm('device_readings', 'read');
SELECT pg_prewarm('idx_readings_event_time_brin', 'read');
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM device_readings
WHERE event_time >= '2023-02-01 00:00:00 UTC'
AND event_time < '2023-02-02 00:00:00 UTC';
Sample BRIN EXPLAIN Output:
Finalize Aggregate (cost=10892.25..10892.26 rows=1 width=8) (actual time=48.283..48.284 rows=1 loops=1)
-> Gather (cost=10892.04..10892.25 rows=2 width=8) (actual time=48.225..48.279 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=9892.04..9892.05 rows=1 width=8) (actual time=41.341..41.342 rows=1 loops=3)
-> Bitmap Heap Scan on device_readings (cost=216.51..9663.91 rows=91252 width=0) (actual time=6.059..34.939 rows=28800 loops=3)
Recheck Cond: ((event_time >= '2023-02-01 00:00:00+00'::timestamp with time zone) AND (event_time < '2023-02-02 00:00:00+00'::timestamp with time zone))
Rows Removed by Index Recheck: 10214
Heap Blocks: exact=672
Buffers: shared hit=8 read=673
-> Bitmap Index Scan on idx_readings_event_time_brin (cost=0.00..193.70 rows=109503 width=0) (actual time=4.381..4.381 rows=81920 loops=1)
Index Cond: ((event_time >= '2023-02-01 00:00:00+00'::timestamp with time zone) AND (event_time < '2023-02-02 00:00:00+00'::timestamp with time zone))
Buffers: shared hit=4 read=1
Planning Time: 0.198 ms
Execution Time: 48.337 ms
The results are dramatic:
- Execution time: 48.337 ms. A 5x speedup over the B-Tree.
- Buffers: shared read=673. It read significantly fewer blocks from disk (673 vs 1057), and the tiny BRIN index itself was read with just 1 block of I/O.
- Plan: a Bitmap Heap Scan. The Bitmap Index Scan on the BRIN index first identifies the block ranges to check; the Bitmap Heap Scan then reads only those blocks. Rows Removed by Index Recheck shows that the BRIN index is not perfectly precise; some rows within the selected blocks did not match the condition, but this filtering is still vastly more efficient than the alternative.

Tuning `pages_per_range`
The pages_per_range storage parameter controls the size of the block range summarized by each BRIN index entry. The default is 128.
- A smaller pages_per_range creates a larger, more precise index. Each entry covers a smaller range of data, so the index can more accurately prune blocks. This reduces Rows Removed by Index Recheck but increases index size.
- A larger pages_per_range creates a smaller, less precise index. Good for maximizing space savings, but it may cause the planner to scan more heap pages than necessary.

Let's test this. We'll recreate the index with a smaller range.
DROP INDEX idx_readings_event_time_brin;
-- Create a more precise BRIN index
CREATE INDEX idx_readings_event_time_brin_32 ON device_readings USING brin (event_time) WITH (pages_per_range = 32);
-- Check size (it will be larger than the default, but still tiny)
-- For our 10M rows, it might be ~240 kB instead of 72 kB.
-- Re-run the EXPLAIN (ANALYZE, BUFFERS) query...
With pages_per_range = 32, you would likely see:
- A slight decrease in execution time.
- A lower Rows Removed by Index Recheck count.
- A slightly higher number of buffers read for the index scan itself, but a lower number for the heap scan.
The key takeaway is that pages_per_range is a knob to tune the trade-off between index size and query performance. The optimal value depends entirely on your data distribution and query patterns. Benchmark different values on your production data.
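One low-effort way to scope that benchmark is to build the candidates side by side under throwaway names, compare their footprints, and then re-run the EXPLAIN (ANALYZE, BUFFERS) query with each one in place. A sketch (the index names here are illustrative):
-- Candidate BRIN indexes at different granularities
CREATE INDEX idx_brin_ppr_128 ON device_readings USING brin (event_time) WITH (pages_per_range = 128);
CREATE INDEX idx_brin_ppr_32  ON device_readings USING brin (event_time) WITH (pages_per_range = 32);
CREATE INDEX idx_brin_ppr_8   ON device_readings USING brin (event_time) WITH (pages_per_range = 8);
-- Compare sizes, then benchmark with one candidate at a time
-- (drop the others so the planner's choice is unambiguous).
SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS index_size
FROM pg_class
WHERE relname LIKE 'idx_brin_ppr_%';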
Advanced Patterns and Edge Cases
Multi-column BRIN Indexes: A Common Pitfall
What if your queries filter on both event_time and device_id? A natural thought is a multi-column BRIN index.
-- Seemingly logical, but potentially useless index
CREATE INDEX idx_readings_device_time_brin ON device_readings USING brin (device_id, event_time);
This index will only be effective if the data is physically clustered on disk by device_id first, and then by event_time. In our current schema, device_id is a random UUID. The correlation is near zero, rendering the BRIN index on this column useless.
To make this work, you would need to physically re-order the table using the CLUSTER command.
-- First, create a B-Tree index to guide the clustering
CREATE INDEX idx_readings_device_time_btree ON device_readings (device_id, event_time);
-- CLUSTER re-writes the entire table and takes an ACCESS EXCLUSIVE lock!
-- Do this only during a maintenance window.
CLUSTER device_readings USING idx_readings_device_time_btree;
-- Now, the multi-column BRIN index would be effective.
CLUSTER is a heavy operation and not practical for tables with high-frequency writes. For most time-series use cases, a single-column BRIN on the time dimension combined with other strategies is more effective.
When BRIN Fails: The Anti-Pattern
Never use a BRIN index on a column with low or no physical correlation. Let's demonstrate this with our device_id (UUID) column.
-- Check correlation for device_id
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'device_readings' AND attname = 'device_id';
-- Expected output: a value very close to 0
-- attname | correlation
-- ----------+----------------------
-- device_id | -0.00018433729995166
-- Create a useless index
CREATE INDEX idx_readings_device_id_brin ON device_readings USING brin (device_id);
-- Try to use it
EXPLAIN (ANALYZE)
SELECT * FROM device_readings WHERE device_id = 'f47ac10b-58cc-4372-a567-0e02b2c3d479' LIMIT 1; -- any valid UUID will do
The query planner will be smart enough to ignore this index entirely and perform a sequential scan. A BRIN index on a low-correlation column is worse than no index at all; it consumes space and write overhead for zero benefit.
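If you want to see just how bad it is, you can briefly force the planner's hand. A demonstration sketch (enable_seqscan is a session-local planner setting, so other connections are unaffected):
-- Force the planner to use the useless BRIN index, for demonstration only
SET enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM device_readings
WHERE device_id = 'f47ac10b-58cc-4372-a567-0e02b2c3d479' LIMIT 1;
-- Expect a Bitmap Heap Scan that visits essentially every block range,
-- i.e. more work than the sequential scan it was supposed to beat.
RESET enable_seqscan;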
Maintenance and Summarization
BRIN indexes are not maintained the way B-Trees are. When a row is inserted into a block range that already has a summary, the min/max is widened in place, but new block ranges appended to the end of the table are not summarized automatically; that happens during VACUUM, via the autosummarize storage parameter, or by manually calling a summarization function. After a large data ingestion batch, the newest block ranges therefore remain unsummarized (and must always be scanned by queries) until one of these operations runs.
To ensure the index is up-to-date after a large batch load without waiting for autovacuum, you can manually summarize.
-- After inserting a million new rows...
-- This function scans the new pages and updates the BRIN summary.
-- It returns the number of page ranges inserted into the index.
SELECT brin_summarize_new_values('idx_readings_event_time_brin');
This is a lightweight operation and should be part of any robust batch ingestion pipeline that relies on BRIN indexes for query performance.
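A sketch of how that fits at the tail of a batch job (staging_device_readings is a hypothetical staging table; the summarization call at the end is the point):
-- Hypothetical end of a batch ingestion job
INSERT INTO device_readings (device_id, event_time, payload)
SELECT device_id, event_time, payload
FROM staging_device_readings;  -- hypothetical staging table
-- Summarize the freshly appended block ranges so queries can prune them immediately
SELECT brin_summarize_new_values('idx_readings_event_time_brin');
-- Alternatively, have autovacuum handle this after inserts by enabling autosummarize:
-- ALTER INDEX idx_readings_event_time_brin SET (autosummarize = on);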
Combining BRIN with Other Indexing Strategies
BRIN is not a replacement for B-Tree; it's a specialized tool. In a real production system, you'll often use both.
- A BRIN index on event_time: for fast analytical queries over large time ranges.
- A B-Tree index on device_id: for fast point lookups of a specific device's data.

Let's set up this hybrid scenario:
-- We already have our BRIN index on event_time
-- CREATE INDEX idx_readings_event_time_brin ON device_readings USING brin (event_time);
-- Add a B-Tree for device lookups
CREATE INDEX idx_readings_device_id_btree ON device_readings USING btree (device_id);
Now watch how the PostgreSQL planner intelligently chooses the best index for the job.
Query 1: Pure Time Range Scan
EXPLAIN SELECT COUNT(*) FROM device_readings
WHERE event_time > '2023-03-01' AND event_time < '2023-03-02';
-- Result: Uses the BRIN index via a Bitmap Heap Scan.
Query 2: Specific Device Lookup
-- Find a real device_id from your table to test
EXPLAIN SELECT * FROM device_readings
WHERE device_id = 'f47ac10b-58cc-4372-a567-0e02b2c3d479' LIMIT 100;
-- Result: Uses the B-Tree index via an Index Scan for fast, direct lookups.
Query 3: Combined Filter
EXPLAIN SELECT COUNT(*) FROM device_readings
WHERE device_id = 'f47ac10b-58cc-4372-a567-0e02b2c3d479'
AND event_time > '2023-03-01' AND event_time < '2023-03-10';
For this query, the planner has a choice. It will likely use the more selective index (the B-Tree on device_id) and then filter the results by event_time. If both filters were moderately selective, it might even use a BitmapAnd operation, creating bitmaps from both indexes and combining them before hitting the table heap.
This hybrid approach gives you the best of both worlds: extreme space efficiency and performance for time-based analytics, and high-performance point lookups for operational queries.
Conclusion: A Strategic Tool, Not a Silver Bullet
BRIN indexes are a powerful, specialized tool for managing petabyte-scale, well-correlated data. They are not a universal replacement for B-Trees.
As a senior engineer, your decision to use BRIN should be based on a clear understanding of its mechanics and trade-offs:
- Use BRIN when your data has a strong physical correlation with the indexed column (e.g., event_time in an append-only table) and your primary query pattern involves large range scans on this column.
- Benchmark different pages_per_range values to find the optimal balance between index size and filtering precision for your specific workload.
- Call brin_summarize_new_values() in your batch loading procedures to ensure index freshness.

By moving beyond default indexing and strategically applying BRIN, you can achieve orders-of-magnitude improvements in storage efficiency and query performance, enabling your PostgreSQL database to handle time-series data at a scale that would otherwise be unmanageable or prohibitively expensive.