PostgreSQL BRIN Indexes for Massive Time-Series Tables
The Problem at Scale: B-Tree Index Bloat on Append-Only Data
In systems handling time-series data—such as IoT sensor readings, financial tickers, or application event logs—tables can rapidly grow to terabytes. The primary access pattern is almost always a time-range query. The default, and often reflexive, solution for senior developers is to add a B-Tree index on the timestamp column:
CREATE TABLE sensor_readings (
    reading_id BIGSERIAL PRIMARY KEY,
    device_id UUID NOT NULL,
    ts TIMESTAMPTZ NOT NULL,
    temperature DOUBLE PRECISION,
    humidity DOUBLE PRECISION
);
-- The conventional approach
CREATE INDEX idx_sensor_readings_ts_btree ON sensor_readings (ts);
For small to medium-sized tables, this works flawlessly. However, on a table with billions of rows, this pattern introduces severe operational burdens. A B-Tree index stores a copy of the indexed value and a pointer (TID) for every single row in the table. For an 8-byte TIMESTAMPTZ and a 6-byte TID, plus overhead, each index entry consumes roughly 16-24 bytes. On a 100-billion-row table, the index alone can exceed 2TB.
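As a quick sanity check on that arithmetic, you can let PostgreSQL do the multiplication itself (the 24-byte figure is the upper end of the per-entry estimate above):
-- ~100 billion index entries at ~24 bytes each
SELECT pg_size_pretty(100000000000::bigint * 24) AS estimated_btree_size;  -- roughly 2.2 TB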
This manifests as three distinct production problems:
* Write amplification: every INSERT requires an update to the B-Tree index. As the tree deepens, this becomes a non-trivial operation, increasing transaction latency and reducing ingestion throughput.
* Storage cost: at tens of bytes per entry, the index itself grows into the terabyte range alongside the table.
* Cache pressure: an index that large cannot stay resident in memory, so both lookups and inserts increasingly hit disk.
Let's quantify this. Imagine we've loaded 1 billion rows into our sensor_readings table.
-- Connect to your database and run these queries
-- NOTE: This assumes you have populated the table with a large amount of data.
SELECT pg_size_pretty(pg_relation_size('sensor_readings')) AS table_size,
pg_size_pretty(pg_relation_size('idx_sensor_readings_ts_btree')) AS btree_index_size;
In a real-world test, you would see results like this:
| table_size | btree_index_size |
|---|---|
| 45 GB | 35 GB |
The index is nearly 78% of the size of the table. This is the problem we need to solve.
BRIN Indexes: A Primer on Block Range Indexing
A BRIN (Block Range Index) index operates on a fundamentally different principle. Instead of storing an entry for every row, it stores a summary for a contiguous range of physical table pages (a "block range"). This summary contains the minimum and maximum values of the indexed column within that range.
The core assumption and absolute requirement for BRIN to be effective is physical data correlation. This means that rows that are physically close to each other on disk should also have values that are close to each other. Append-only time-series data is the canonical example: new rows with ever-increasing timestamps are always appended to the end of the table, creating perfect natural correlation.
When a query with a WHERE ts BETWEEN ? AND ? clause is executed, the query planner consults the BRIN index. It scans the compact summary data and quickly identifies which block ranges could possibly contain matching rows. For any given block range, if the query's range [Q_min, Q_max] does not overlap with the summary's range [B_min, B_max], PostgreSQL knows it can skip reading all the pages in that block range entirely.
If the ranges do overlap, the block range is marked as a candidate. The database then creates a bitmap of all candidate pages and visits the actual table heap to check each row within those pages against the query condition. This second step is why BRIN is considered a "lossy" index—it can have false positives (pages that are read but contain no matching rows), but never false negatives.
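To make the pruning test concrete, here is the same overlap check expressed as plain SQL using range types. This is purely illustrative; the planner performs the equivalent comparison internally on the summary tuples:
-- Query range: 00:00-01:00 on Jan 2. Block-range summary: a slice of March 1.
-- The && operator tests overlap; no overlap means every page in that range can be skipped.
SELECT NOT (
    tstzrange('2023-01-02 00:00+00', '2023-01-02 01:00+00', '[]')    -- [Q_min, Q_max]
    && tstzrange('2023-03-01 00:00+00', '2023-03-01 00:05+00', '[]') -- [B_min, B_max]
) AS block_range_can_be_skipped;  -- returns true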
This design leads to a dramatic reduction in size. A single BRIN index entry might summarize 128 pages (1MB of data), resulting in an index that is orders of magnitude smaller than a B-Tree.
Practical Implementation and Benchmarking
Let's move from theory to a concrete, reproducible benchmark. We will set up a realistic scenario, generate a significant amount of data, and compare the performance of B-Tree vs. BRIN indexes.
Scenario Setup
First, we'll create our test table and a helper function to generate realistic data.
-- Drop existing objects if they exist
DROP TABLE IF EXISTS sensor_readings_test;
CREATE TABLE sensor_readings_test (
    reading_id BIGSERIAL,
    device_id INT NOT NULL, -- Using INT for simplicity in generation
    ts TIMESTAMPTZ NOT NULL,
    temperature DOUBLE PRECISION
);
-- Function to populate the table with ordered data
CREATE OR REPLACE FUNCTION generate_sensor_data(num_rows INT)
RETURNS VOID AS $$
DECLARE
    start_ts TIMESTAMPTZ := '2023-01-01 00:00:00 UTC';
BEGIN
    -- A single set-based INSERT; far faster than a row-by-row loop at this scale
    INSERT INTO sensor_readings_test (device_id, ts, temperature)
    SELECT
        (random() * 99)::INT + 1,               -- 100 devices, interleaved
        start_ts + (i * INTERVAL '1 second'),   -- strictly increasing timestamps
        20.0 + (random() * 10.0) - 5.0          -- 15.0 to 25.0
    FROM generate_series(1, num_rows) AS i;
END;
$$ LANGUAGE plpgsql;
-- Generate 100 million rows. This will take some time and disk space!
-- Ensure you have ~20GB of free space.
SELECT generate_sensor_data(100000000);
-- VACUUM sets the visibility map (needed for index-only scans later); ANALYZE refreshes planner statistics
VACUUM ANALYZE sensor_readings_test;
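Before creating any indexes, it is worth a quick sanity check that the load finished and seeing how big the bare heap is:
-- Row count and heap size after the load (the count itself takes a while over 100M rows)
SELECT count(*) AS rows_loaded,
       pg_size_pretty(pg_relation_size('sensor_readings_test')) AS heap_size
FROM sensor_readings_test;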
Indexing Strategies and Size Comparison
Now, let's create our two indexes and compare their sizes.
-- Strategy A: B-Tree Index
CREATE INDEX idx_test_ts_btree ON sensor_readings_test (ts);
-- Strategy B: BRIN Index (with default pages_per_range = 128)
CREATE INDEX idx_test_ts_brin ON sensor_readings_test USING BRIN (ts);
-- Compare the sizes
SELECT
relname AS index_name,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname IN ('idx_test_ts_btree', 'idx_test_ts_brin');
The output will be stark and immediately revealing:
| index_name | index_size |
|---|---|
| idx_test_ts_btree | 3.5 GB |
| idx_test_ts_brin | 1200 kB |
This is not a typo. The BRIN index is ~3000 times smaller than the B-Tree index. This alone is a monumental win for storage costs and memory usage.
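If you're curious what those 1200 kB actually contain, the contrib pageinspect extension can dump the per-range summaries. A minimal sketch; the exact page numbers depend on the index layout (page 0 is the metapage, page 1 the range map, regular summary pages follow), so treat the page number below as illustrative:
CREATE EXTENSION IF NOT EXISTS pageinspect;
-- BRIN metapage: confirms pages_per_range, among other things
SELECT * FROM brin_metapage_info(get_raw_page('idx_test_ts_brin', 0));
-- Per-range summary tuples: blknum is the first heap block of the range,
-- value holds the {min .. max} timestamps seen in that range
SELECT itemoffset, blknum, value
FROM brin_page_items(get_raw_page('idx_test_ts_brin', 2), 'idx_test_ts_brin')
LIMIT 10;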
Benchmarking Queries
Size is one thing; performance is another. Let's test a common time-range query. We will use EXPLAIN (ANALYZE, BUFFERS) to get detailed execution plans and performance metrics.
Query: Find all readings within a specific 1-hour window.
-- Query to benchmark
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM sensor_readings_test
WHERE ts BETWEEN '2023-01-02 00:00:00 UTC' AND '2023-01-02 01:00:00 UTC';
Analysis with B-Tree Index (idx_test_ts_btree)
Before running, you may need to drop the BRIN index or temporarily steer the planner toward the B-Tree, since PostgreSQL has no built-in query hints and may consider the BRIN plan cheaper.
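One low-ceremony way to do that without dropping anything is to disable bitmap scans for the session, which takes the BRIN path off the table entirely (BRIN indexes can only be used via a bitmap scan). This is a testing convenience, not something to ship:
-- Session-local: planner now prefers the B-Tree (or a seq scan)
SET enable_bitmapscan = off;
-- ... run the EXPLAIN (ANALYZE, BUFFERS) query above ...
RESET enable_bitmapscan;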
-- Typical B-Tree Plan
Finalize Aggregate (cost=155.83..155.84 rows=1 width=8) (actual time=1.841..1.845 rows=1 loops=1)
-> Gather (cost=155.61..155.82 rows=2 width=8) (actual time=1.826..1.838 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=154.61..154.62 rows=1 width=8) (actual time=1.458..1.459 rows=1 loops=3)
-> Index Only Scan using idx_test_ts_btree on sensor_readings_test ...
Index Cond: ((ts >= '2023-01-02 00:00:00+00'::timestamp with time zone) AND (ts <= '2023-01-02 01:00:00+00'::timestamp with time zone))
Heap Fetches: 0
Buffers: shared hit=18
Planning Time: 0.143 ms
Execution Time: 1.901 ms
* Execution Time: ~1.9 ms. Very fast.
* Plan: Index Only Scan. PostgreSQL satisfied the entire query just by reading the index, without touching the table heap (Heap Fetches: 0 is only possible because the visibility map is current after our VACUUM ANALYZE). This is extremely efficient.
* Buffers: Only 18 shared buffers were needed.
Analysis with BRIN Index (idx_test_ts_brin)
Now, drop the B-Tree index and re-run the same query.
DROP INDEX idx_test_ts_btree;
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM sensor_readings_test
WHERE ts BETWEEN '2023-01-02 00:00:00 UTC' AND '2023-01-02 01:00:00 UTC';
-- Typical BRIN Plan
Finalize Aggregate (cost=88.52..88.53 rows=1 width=8) (actual time=0.981..0.985 rows=1 loops=1)
-> Gather (cost=88.30..88.51 rows=2 width=8) (actual time=0.967..0.979 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=87.30..87.31 rows=1 width=8) (actual time=0.612..0.613 rows=1 loops=3)
-> Bitmap Heap Scan on sensor_readings_test ...
Recheck Cond: ((ts >= '2023-01-02 00:00:00+00'::timestamp with time zone) AND (ts <= '2023-01-02 01:00:00+00'::timestamp with time zone))
Rows Removed by Index Recheck: 1618
Heap Blocks: exact=30 lossy=0
Buffers: shared hit=33
-> Bitmap Index Scan on idx_test_ts_brin ...
Index Cond: ((ts >= '2023-01-02 00:00:00+00'::timestamp with time zone) AND (ts <= '2023-01-02 01:00:00+00'::timestamp with time zone))
Buffers: shared hit=3
Planning Time: 0.165 ms
Execution Time: 1.051 ms
* Execution Time: ~1.0 ms. Surprisingly, it's even slightly faster than the B-Tree in this case.
* Plan: Bitmap Heap Scan. The Bitmap Index Scan on the BRIN index quickly identifies the relevant block ranges (using only 3 buffer pages!). Then, the Bitmap Heap Scan visits the actual table pages (30 of them) to fetch the rows and re-check the condition.
* Buffers: A total of 33 shared buffers were used. Slightly more than the B-Tree, but still very low.
Benchmark Conclusion: For this specific, highly correlated workload, the BRIN index provides comparable, or even slightly better, query performance while being thousands of times smaller. This is a massive operational victory.
Advanced Tuning: The `pages_per_range` Parameter
The default pages_per_range for a BRIN index is 128. This means each summary entry in the index covers 128 * 8kB = 1MB of heap data. We can tune this parameter at index creation time.
This setting represents a critical trade-off:
* Smaller pages_per_range (e.g., 32): The index becomes more precise. Each summary covers a smaller range of data, reducing the number of false positives during a scan. This can speed up queries but will result in a larger index.
* Larger pages_per_range (e.g., 256): The index becomes smaller and less precise. Each summary covers a wider range of data, potentially increasing the number of heap pages that need to be re-checked. This can slow down queries but minimizes index size.
Let's test this. We'll create two more indexes and compare.
-- Create indexes with different pages_per_range values
CREATE INDEX idx_test_ts_brin_32 ON sensor_readings_test USING BRIN (ts) WITH (pages_per_range = 32);
CREATE INDEX idx_test_ts_brin_256 ON sensor_readings_test USING BRIN (ts) WITH (pages_per_range = 256);
-- Check their sizes
SELECT
relname AS index_name,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname LIKE 'idx_test_ts_brin%';
| index_name | pages_per_range | index_size |
|---|---|---|
| idx_test_ts_brin | 128 | 1200 kB |
| idx_test_ts_brin_32 | 32 | 4720 kB |
| idx_test_ts_brin_256 | 256 | 608 kB |
As expected, a smaller pages_per_range creates a larger index, and vice versa. Now, let's re-run our benchmark query. The results will show subtle but important differences in the Bitmap Heap Scan phase, particularly in the Heap Blocks reported.
* With pages_per_range = 32, the number of heap blocks to check might be slightly lower, but the initial index scan is larger.
* With pages_per_range = 256, the initial index scan is tiny, but you might see more heap blocks being checked.
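A quick way to put numbers on this trade-off is to compute how many block-range summaries each setting implies for the table's current page count, using the planner's relpages estimate (a rough sketch):
-- Fewer, wider ranges = smaller index but coarser pruning; more, narrower ranges = the reverse
SELECT relpages,
       ceil(relpages / 32.0)  AS ranges_at_ppr_32,
       ceil(relpages / 128.0) AS ranges_at_ppr_128,
       ceil(relpages / 256.0) AS ranges_at_ppr_256
FROM pg_class
WHERE relname = 'sensor_readings_test';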
Guidance: The optimal pages_per_range depends on your data distribution and query selectivity. A good starting point is the default of 128. If your queries are highly selective (targeting very small time ranges) and you notice many false positives (high Rows Removed by Index Recheck), consider decreasing pages_per_range. If index size is your absolute priority and your queries are less selective, you might increase it.
Edge Cases and Production Pitfalls
BRIN indexes are a specialized tool, not a silver bullet. Using them incorrectly can lead to performance that is worse than having no index at all. Senior engineers must understand these failure modes.
Pitfall 1: The Curse of Poor Correlation
The effectiveness of a BRIN index is directly proportional to the physical correlation of the data. What happens if this correlation is broken?
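It helps to know that PostgreSQL already measures this: ANALYZE stores a per-column correlation statistic, where values near 1.0 mean the heap order closely tracks the column and values near 0 mean BRIN summaries will be wide and nearly useless. A quick check on our test table:
-- After the ordered bulk load this should be at (or extremely close to) 1.0 for ts
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'sensor_readings_test'
  AND attname = 'ts';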
Let's simulate this by performing a scattered UPDATE on our table. Under MVCC, an UPDATE writes a new row version, and if the original page has no free space that version can land on a completely different page, breaking the natural order.
-- This update will shuffle rows around, breaking physical correlation.
-- It's a contrived example; in production, this could be caused by
-- backfills, out-of-order data ingestion, or frequent updates.
UPDATE sensor_readings_test
SET temperature = temperature + 0.1
WHERE reading_id % 1000 = 0;
ANALYZE sensor_readings_test;
Now, re-run our benchmark query using the default BRIN index:
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM sensor_readings_test
WHERE ts BETWEEN '2023-01-02 00:00:00 UTC' AND '2023-01-02 01:00:00 UTC';
The plan will look dramatically different and far worse:
-- A disastrous BRIN plan due to poor correlation
...
-> Bitmap Heap Scan on sensor_readings_test ...
Recheck Cond: ...
Rows Removed by Index Recheck: 4893012
Heap Blocks: exact=12938 lossy=129012
Buffers: shared hit=142010
-> Bitmap Index Scan on idx_test_ts_brin ...
Index Cond: ...
Buffers: shared hit=60
...
Execution Time: 8532.145 ms
* Execution Time: From ~1ms to over 8 seconds. A performance catastrophe.
* Heap Blocks: The lossy blocks count has exploded. The index is now so imprecise that it's flagging a massive portion of the table as potentially matching.
* Buffers: Over 142,000 buffers were read, indicating significant I/O.
Why it failed: The UPDATE scattered new row versions into whatever pages had free space. Now, a single 1MB block range might contain a timestamp from January and a timestamp from March. The min/max summary for that range becomes '2023-01-01' to '2023-03-01'. Any query for a time in between must now read the entire block range, even if only one row actually matches.
Remediation:
* CLUSTER: The most effective fix is to rewrite the table, physically re-ordering it by a chosen index. CLUSTER cannot use a BRIN index, so you must (re)create the B-Tree first (we dropped ours earlier): CLUSTER sensor_readings_test USING idx_test_ts_btree; (see the sketch after this list). Warning: CLUSTER holds an ACCESS EXCLUSIVE lock on the table for the entire rewrite, making it completely inaccessible. This is often an unacceptable downtime window for production systems; tools such as pg_repack can achieve a similar reorganization with only brief locks.
* brin_summarize_new_values(): This function summarizes any page ranges that do not yet have a summary (for example, pages appended since the last VACUUM). It keeps the index complete for new data, but it does nothing to repair ranges whose summaries have already been widened by the physical disorder.
* Prevention: The best strategy is to design your ingestion and data lifecycle to maintain natural order. Avoid large, scattered updates on columns indexed by BRIN.
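A rough sketch of the two remediation paths mentioned above (maintenance-window material, not something to run casually against a hot production table):
-- 1) Physically re-order the heap. CLUSTER needs a B-Tree, and we dropped ours earlier.
CREATE INDEX idx_test_ts_btree ON sensor_readings_test (ts);
CLUSTER sensor_readings_test USING idx_test_ts_btree;  -- ACCESS EXCLUSIVE lock for the full rewrite
VACUUM ANALYZE sensor_readings_test;                   -- refresh visibility map and statistics

-- 2) Summarize any page ranges the BRIN index has not covered yet
--    (completes the index after bulk appends; does not repair disordered ranges)
SELECT brin_summarize_new_values('idx_test_ts_brin');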
Pitfall 2: Multi-column BRIN Indexes
You can create a BRIN index on multiple columns, for example, to optimize queries like WHERE device_id = ? AND ts BETWEEN ? AND ?.
CREATE INDEX idx_test_device_ts_brin ON sensor_readings_test USING BRIN (device_id, ts);
For this to be effective, the physical data correlation must apply to the combination of columns. In our data generation script, we interleaved device_ids while keeping ts strictly increasing. This means that rows physically next to each other have very different device_ids. A multi-column BRIN index here would be almost useless for filtering on device_id.
This index would only be effective if the data were physically clustered first by device_id and then by ts. This is a much stricter requirement and often doesn't happen naturally. For such queries, a composite B-Tree index or partitioning the table by device_id are almost always superior architectural choices.
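For completeness, a sketch of those two alternatives. The composite B-Tree works on the existing table; the partitioned layout is a hypothetical new table (names are illustrative) in which each partition holds one device bucket and can carry its own BRIN index on ts, since time correlation still holds within a partition:
-- (a) Composite B-Tree: selective on device_id first, then range-scans ts
CREATE INDEX idx_test_device_ts_btree ON sensor_readings_test (device_id, ts);

-- (b) Hash-partition by device_id (hypothetical table), BRIN on ts per partition
CREATE TABLE sensor_readings_by_device (
    device_id   INT NOT NULL,
    ts          TIMESTAMPTZ NOT NULL,
    temperature DOUBLE PRECISION
) PARTITION BY HASH (device_id);

CREATE TABLE sensor_readings_by_device_0 PARTITION OF sensor_readings_by_device
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
-- ...create the remaining three partitions the same way, then:
CREATE INDEX ON sensor_readings_by_device USING BRIN (ts);  -- cascades to each partition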
Conclusion: A Production Decision Heuristic
BRIN indexes are not a replacement for B-Trees. They are a powerful, specialized tool for a specific set of problems. As a senior engineer, use this heuristic to determine if a BRIN index is the right choice for your workload:
* Is the table large, append-only (or nearly so), and physically correlated on the candidate column? Naturally increasing created_at or timestamp columns are the ideal candidates.
* Is the dominant access pattern a range scan (BETWEEN, >, <)? BRIN provides little to no benefit for equality checks.
* Can you live with a Bitmap Heap Scan plan, which involves re-checking rows on the heap, and understand its performance characteristics?
If you can confidently answer "yes" to all these questions, a BRIN index can provide an order-of-magnitude reduction in storage costs and write-amplification overhead, with query performance that is often on par with, or even better than, a B-Tree for typical time-series lookups. It is a hallmark of an advanced PostgreSQL user to know when and how to deploy this powerful feature.