Optimizing Multi-Tenant Time-Series Data with PostgreSQL Partial & BRIN Indexes
The Canonical Problem: The Unscalability of B-Tree Indexes in Multi-Tenant Time-Series Tables
As a senior engineer responsible for a multi-tenant SaaS platform, you've inevitably encountered the events, logs, or metrics table. It starts innocently, but quickly grows to billions of rows, becoming the gravitational center of your database's performance problems. The typical query pattern involves isolating a single tenant's data within a specific time range.
Let's model a realistic scenario. We're building an observability platform where tenants push metric data.
-- A simplified but representative time-series table
CREATE TABLE metrics (
metric_id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
tenant_id UUID NOT NULL,
metric_name VARCHAR(255) NOT NULL,
metric_value DOUBLE PRECISION NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT 'active', -- e.g., active, archived, processed
tags JSONB,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- The standard, go-to index for multi-tenant queries
CREATE INDEX idx_metrics_tenant_timestamp ON metrics (tenant_id, created_at DESC);
Initially, this idx_metrics_tenant_timestamp B-Tree index is effective. Queries like fetching the last hour of data for a specific tenant are fast.
-- A typical dashboard query
SELECT metric_name, metric_value, created_at
FROM metrics
WHERE tenant_id = 'a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d'
AND created_at >= NOW() - INTERVAL '1 hour'
ORDER BY created_at DESC;
However, as the table grows to 500 million, then 5 billion rows, severe problems emerge:
* Index size: A composite B-Tree over a UUID and a TIMESTAMPTZ can easily exceed 50-100GB. This consumes expensive disk space and, more importantly, valuable memory (shared buffers).
* Write amplification: Every INSERT now requires updating not just the table (heap) but also this enormous B-Tree index. This slows down data ingestion, a critical path for time-series applications.
* Low-selectivity filters: What about adding a status filter? A common query might be to find only active metrics.
SELECT metric_name, metric_value, created_at
FROM metrics
WHERE tenant_id = 'a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d'
AND status = 'active'
AND created_at >= NOW() - INTERVAL '24 hours';
A composite index on (tenant_id, status, created_at) seems logical. But if 99% of metrics are active, the selectivity of the status column is extremely low. The index becomes even larger for marginal benefit, as the planner still has to traverse a huge portion of the index for that tenant.
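Before adding status to a composite index, it's worth quantifying that skew. A minimal sketch using the built-in pg_stats view (run ANALYZE metrics first if the statistics might be stale):
-- Inspect the value distribution of status; if ~99% of rows are 'active',
-- the column contributes almost no selectivity to a composite B-Tree.
SELECT most_common_vals, most_common_freqs, n_distinct
FROM pg_stats
WHERE tablename = 'metrics' AND attname = 'status';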
This is a classic scaling bottleneck. Throwing more hardware at it provides diminishing returns. We need a more intelligent, surgical approach to indexing.
Deep Dive: Partial Indexes for High-Cardinality State Isolation
Partial indexes are a foundational tool for advanced PostgreSQL optimization. Instead of indexing every row, a partial index includes only rows that satisfy a WHERE clause specified during index creation. This allows us to create smaller, more targeted indexes for specific query patterns.
The Anti-Pattern: One Index Per Tenant
A naive interpretation might be to create a partial index for each tenant. This is a critical anti-pattern and will not scale. Managing thousands or millions of indexes is an operational nightmare for the database planner and for your engineering team.
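For clarity, here is a sketch of what that anti-pattern looks like in practice; the tenant UUIDs are placeholders:
-- ANTI-PATTERN: one partial index per tenant (placeholder UUIDs, do not do this)
CREATE INDEX idx_metrics_tenant_0001 ON metrics (created_at DESC)
WHERE tenant_id = '00000000-0000-4000-8000-000000000001';
CREATE INDEX idx_metrics_tenant_0002 ON metrics (created_at DESC)
WHERE tenant_id = '00000000-0000-4000-8000-000000000002';
-- ...repeated for every tenant: thousands of indexes to plan around, maintain on
-- every write, and keep in sync as tenants come and go.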
The Production Pattern: Indexing by State or Tenant Class
The correct approach is to create partial indexes based on low-cardinality columns that represent a meaningful, frequently queried state. In our metrics table, the status column is a prime candidate.
Imagine that over 95% of the data is in an active state, 4% is processed, and 1% is archived. The most frequent operational queries might be against the non-active states.
-- Create a partial index ONLY for 'processed' metrics
CREATE INDEX idx_metrics_processed_tenant_timestamp
ON metrics (tenant_id, created_at DESC)
WHERE status = 'processed';
-- Create another partial index ONLY for 'archived' metrics
CREATE INDEX idx_metrics_archived_tenant_timestamp
ON metrics (tenant_id, created_at DESC)
WHERE status = 'archived';
Analysis of the Impact:
* Size Reduction: These two indexes combined will be roughly 5% of the size of a single, non-partial index on (tenant_id, created_at). This is a massive win for storage and memory.
* Write Performance: INSERTs where status = 'active' (the vast majority) no longer need to update these two indexes, significantly reducing write overhead.
* Query Performance: Let's analyze a query that leverages our new index.
EXPLAIN ANALYZE
SELECT metric_id, metric_name
FROM metrics
WHERE tenant_id = 'a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d'
AND status = 'processed' -- This clause is key
AND created_at >= NOW() - INTERVAL '7 days';
Expected EXPLAIN Output (Simplified):
Index Scan using idx_metrics_processed_tenant_timestamp on metrics
Index Cond: ((tenant_id = 'a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d') AND (created_at >= '...'))
-- Note: no Filter on status is needed; the partial index's predicate guarantees every entry already satisfies status = 'processed'
Planning Time: 0.25ms
Execution Time: 5.8ms
The query planner is smart enough to see that the query's WHERE status = 'processed' clause matches the predicate of idx_metrics_processed_tenant_timestamp. It can use this much smaller, more cache-friendly index to satisfy the query extremely quickly.
Edge Cases and Caveats for Partial Indexes
* Predicate Matching is Literal: The query planner is not magic. The WHERE clause in your query must match the index's predicate. A query with WHERE status IN ('processed', 'archived') will not use either of our partial indexes. You must design your application logic and indexes in tandem.
* Parameterization Issues: Be cautious with query builders or ORMs. If they parameterize the status value, the planner may not be able to match the query to the partial index at planning time. You often need to ensure the literal value is in the query string for the planner to make the optimal choice (a short demonstration follows this list).
* Lifecycle Management: This strategy implies you have a clear understanding of your data's states. If new statuses are added, you need an operational process to evaluate whether they require their own partial indexes.
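To illustrate the parameterization caveat, here is a minimal sketch using PREPARE to simulate what a driver or ORM sends; the exact behavior depends on your PostgreSQL version and plan_cache_mode, so treat it as a starting point for your own EXPLAIN testing:
-- With the literal in the query text, the planner can prove the clause matches
-- the predicate of idx_metrics_processed_tenant_timestamp and use it.
EXPLAIN
SELECT metric_id FROM metrics
WHERE tenant_id = 'a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d'
  AND status = 'processed'
  AND created_at >= NOW() - INTERVAL '7 days';

-- With status as a parameter, a generic (cached) plan cannot assume $2 is always
-- 'processed', so the partial index may be passed over in favor of a larger index
-- or a sequential scan.
PREPARE recent_by_status (uuid, varchar) AS
SELECT metric_id FROM metrics
WHERE tenant_id = $1
  AND status = $2
  AND created_at >= NOW() - INTERVAL '7 days';

EXPLAIN EXECUTE recent_by_status ('a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d', 'processed');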
Introducing BRIN Indexes for Correlated Time-Series Data
While partial indexes help us manage logical data slices, Block Range INdexes (BRIN) help us manage physical data layout. A BRIN index is fundamentally different from a B-Tree. Instead of storing a pointer to every row, it stores the minimum and maximum value for a range of table blocks (by default, 128 blocks).
The Critical Prerequisite: Physical Correlation
A BRIN index is only effective if the values in the indexed column are well-correlated with their physical storage location in the table heap. Time-series data, which is almost always inserted in chronological order, is the canonical use case. As new data arrives, it's appended to the end of the table. This means that consecutive blocks on disk will contain records with consecutive timestamps.
Let's consider the created_at column. A BRIN index on this column would be incredibly small.
-- A BRIN index on our timestamp column
CREATE INDEX idx_metrics_created_at_brin ON metrics USING BRIN (created_at);
Performance & Size Comparison (on a 1 Billion Row Table):
* B-Tree Index on created_at:
* Size: ~30-40 GB
* Benefit: Very fast for point lookups (e.g., WHERE created_at = '...').
* BRIN Index on created_at:
* Size: ~10-15 MB (Yes, megabytes! Often 1000x smaller.)
* Benefit: Extremely fast for large range scans (e.g., WHERE created_at BETWEEN '...' AND '...').
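The exact figures will vary with row width, fill factor, and PostgreSQL version; you can measure them on your own table with pg_relation_size, as in this small sketch using the index names from this section:
-- Compare on-disk sizes of the B-Tree and BRIN indexes
SELECT pg_size_pretty(pg_relation_size('idx_metrics_tenant_timestamp')) AS btree_size,
       pg_size_pretty(pg_relation_size('idx_metrics_created_at_brin'))  AS brin_size;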
How a BRIN Scan Works:
When you run a query with a time range, the planner consults the BRIN index. For each block range, it checks if the min/max values in the index overlap with your query's time range.
* If there is no overlap, PostgreSQL knows with certainty that no matching rows exist in that entire range of (e.g., 128) blocks and skips reading them from disk entirely.
* If there is an overlap, it marks the range as a candidate. This is called a "lossy" result.
Finally, it performs a bitmap heap scan on only the candidate blocks. For a query that selects 1 day of data out of 3 years, a BRIN index can allow PostgreSQL to skip reading >99.9% of the table blocks.
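You can watch this block skipping happen with EXPLAIN (ANALYZE, BUFFERS). The sketch below shows the general plan shape to expect; the numbers are illustrative, not from a real run:
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM metrics
WHERE created_at BETWEEN '2023-10-26 00:00:00' AND '2023-10-27 00:00:00';

-- Illustrative plan shape:
-- Aggregate
--   ->  Bitmap Heap Scan on metrics
--         Recheck Cond: ((created_at >= '...') AND (created_at <= '...'))
--         Heap Blocks: lossy=9216   -- only the candidate block ranges are read
--         ->  Bitmap Index Scan on idx_metrics_created_at_brin
--               Index Cond: ((created_at >= '...') AND (created_at <= '...'))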
Tuning pages_per_range
The effectiveness of a BRIN index can be tuned with the pages_per_range storage parameter.
CREATE INDEX idx_metrics_created_at_brin_tuned
ON metrics USING BRIN (created_at)
WITH (pages_per_range = 32);
* pages_per_range (default 128): A larger value means a smaller index but it's less precise (more "lossy"). It might produce more false-positive blocks that need to be scanned.
* pages_per_range (e.g., 32 or 64): A smaller value means a slightly larger index but it's more precise. The min/max ranges are tighter, reducing the number of blocks that need to be visited in the heap scan.
For high-frequency time-series data, lowering pages_per_range can often yield better query performance at the cost of a marginally larger (but still tiny) index.
The Combined Strategy: Hyper-Optimizing with Partial BRIN Indexes
Now we combine these two powerful concepts. This is where we move from general optimization to surgical, application-aware performance engineering.
Our multi-tenant platform has a few "whale" tenants who generate 80% of the traffic and data. Their dashboards need to be lightning fast. The long tail of smaller tenants can tolerate slightly slower queries.
Our goal is to create hyper-specific indexes for a high-volume tenant's most critical data.
The Scenario: A query for a high-volume tenant, looking at their active metrics within a specific time range.
-- The query we need to make instantaneous
SELECT date_trunc('minute', created_at) as minute, avg(metric_value)
FROM metrics
WHERE tenant_id = 'big-whale-tenant-uuid-goes-here'
AND status = 'active'
AND created_at BETWEEN '2023-10-26 00:00:00' AND '2023-10-26 23:59:59'
GROUP BY 1
ORDER BY 1;
A standard B-Tree on (tenant_id, status, created_at) would work, but it would still be massive, containing entries for all tenants. We can do much better.
The Advanced Solution: A Partial BRIN Index
-- A hyper-optimized index for a specific, high-value tenant
CREATE INDEX idx_metrics_big_whale_tenant_active_brin
ON metrics USING BRIN (created_at)
WHERE tenant_id = 'big-whale-tenant-uuid-goes-here' AND status = 'active';
Let's break down why this is so effective:
* The partial predicate (the WHERE clause): The index only contains entries for this one specific tenant and only for their active metrics. If this tenant has 200 million rows out of a 2 billion row table, the index immediately ignores 90% of the table.
* The access method (USING BRIN): Within that subset of data, we are not building a giant B-Tree. We are building a tiny BRIN index on created_at. Because this tenant's data is also inserted chronologically, it will have excellent physical correlation.
Analyzing the EXPLAIN Output:
GroupAggregate
  ->  Sort
-> Bitmap Heap Scan on metrics
Recheck Cond: ((tenant_id = '...') AND (created_at >= '...') AND (created_at <= '...'))
Rows Removed by Index Recheck: 15000
Heap Blocks: exact=1200 lossy=8500
-> Bitmap Index Scan on idx_metrics_big_whale_tenant_active_brin
Index Cond: (created_at >= '...' AND created_at <= '...')
Planning Time: 0.4ms
Execution Time: 85.2ms
This plan is beautiful.
* It selects our tiny idx_metrics_big_whale_tenant_active_brin index.
* It performs a Bitmap Index Scan using only the created_at condition. This scan is incredibly fast as it just reads the tiny BRIN summary.
* It identifies a set of lossy blocks (8500 in this example) that *might* contain our data.
* The Bitmap Heap Scan then visits *only* those 8500 blocks, skipping the millions of other blocks in the table.
We have successfully answered a query on a multi-billion row table in under 100ms by reading only a tiny fraction of the data from disk.
Production Considerations & Maintenance Overhead
This advanced strategy is not a "set it and forget it" solution. It requires ongoing monitoring and maintenance.
Data Correlation Drift and `VACUUM`
The effectiveness of a BRIN index is entirely dependent on physical data correlation.
* Problem: If you run UPDATE statements that change timestamps or perform large DELETE operations, the physical ordering can be disrupted. VACUUM will reclaim space, but it doesn't reorder rows. Over time, the correlation can drift, making the BRIN index less effective (more "lossy" blocks).
* Monitoring: You can check the correlation of a column using pg_stats:
SELECT attname, correlation FROM pg_stats WHERE tablename = 'metrics' AND attname = 'created_at';
A value close to 1.0 or -1.0 is ideal. If it drops towards 0, your BRIN index performance will degrade.
* Solution:
1. VACUUM FULL / CLUSTER: The CLUSTER command physically rewrites the table in the order of a specified index. CLUSTER metrics USING idx_metrics_tenant_timestamp; would make each tenant's rows contiguous and time-ordered, which is ideal for the per-tenant partial BRIN indexes (clustering on a created_at-ordered index would instead favor the global BRIN index). However, CLUSTER takes an ACCESS EXCLUSIVE lock on the table, meaning it's completely inaccessible for the duration of the operation. This is often untenable for a live production system.
2. Periodic Data Rewriting: A more practical approach for 24/7 systems is to use a partitioning strategy or a background job that periodically copies recent, potentially unordered data into a new, perfectly-ordered table segment. This is complex but avoids long-running locks.
3. Tuning autovacuum: Ensure autovacuum runs frequently enough to update the visibility map and summarize new blocks for BRIN indexes. A sluggish autovacuum can leave your BRIN index out of date.
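For the autovacuum point, two concrete levers help keep a BRIN index summarized; a sketch assuming the global idx_metrics_created_at_brin index from earlier, with illustrative (not prescriptive) thresholds:
-- Summarize any block ranges that have not yet been covered; returns the
-- number of ranges it summarized.
SELECT brin_summarize_new_values('idx_metrics_created_at_brin');

-- Make autovacuum visit this append-heavy table more often so new ranges get
-- summarized soon after they fill (the insert scale factor requires PostgreSQL 13+).
ALTER TABLE metrics SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_insert_scale_factor = 0.01
);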
Managing a Dynamic Index Landscape
You cannot manually create partial BRIN indexes for every new high-volume tenant.
* Automated Identification: You need a background process that analyzes query performance (e.g., using pg_stat_statements) or tenant data volume to identify candidates for this hyper-optimization; a starting-point query is sketched after this list.
* Declarative Schema Management: Use a robust migration tool (like sqitch or a custom script) to manage the lifecycle of these indexes. Your process should be able to programmatically add a new partial BRIN index for a tenant that just crossed a certain volume threshold.
* Index Pruning: Just as important is a process for dropping these indexes if a tenant becomes inactive or their volume drops. Unused indexes still incur a small maintenance overhead.
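As a starting point for the identification step, the sketch below finds high-volume tenants straight from the table and emits candidate DDL for review. The 10-million-row threshold, the index naming scheme, and the decision to scan the table directly (expensive; run it off-peak or against a replica) are all assumptions; in practice you would drive this from pg_stat_statements or an external job and apply the DDL through your migration tooling:
-- Find tenants above an arbitrary row-count threshold and generate candidate
-- partial BRIN index statements for a human (or migration pipeline) to review.
SELECT tenant_id,
       count(*) AS row_count,
       format(
           'CREATE INDEX CONCURRENTLY idx_metrics_%s_active_brin '
           || 'ON metrics USING BRIN (created_at) '
           || 'WHERE tenant_id = %L AND status = %L;',
           replace(tenant_id::text, '-', ''), tenant_id, 'active'
       ) AS candidate_ddl
FROM metrics
GROUP BY tenant_id
HAVING count(*) > 10000000
ORDER BY row_count DESC;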
Monitoring Index Usage
After implementing these indexes, you must verify they are being used. The pg_stat_user_indexes view is your ground truth.
SELECT
relname AS table_name,
indexrelname AS index_name,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'metrics'
ORDER BY idx_scan DESC;
If you see your carefully crafted idx_metrics_big_whale_tenant_active_brin has idx_scan = 0 after a few days, it's a sign that either the query patterns don't match or the planner is choosing a different path. This requires immediate investigation.
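If investigation confirms an index really is dead weight, remove it without blocking writes. A minimal sketch (DROP INDEX CONCURRENTLY cannot run inside a transaction block):
-- Confirm the index is unused and note what it costs on disk before dropping it.
SELECT indexrelname,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE indexrelname = 'idx_metrics_big_whale_tenant_active_brin';

-- Drop it without taking a blocking lock on the metrics table.
DROP INDEX CONCURRENTLY idx_metrics_big_whale_tenant_active_brin;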
Conclusion: A Multi-Layered Framework for Indexing at Scale
Optimizing massive multi-tenant time-series tables in PostgreSQL is not about finding a single silver-bullet index. It's about building a multi-layered, intelligent strategy that reflects your application's specific access patterns.
* Keep the standard composite B-Tree index (e.g., on (tenant_id, created_at)) to serve the general long tail of queries. Accept its size and write overhead as a cost of doing business.
* Identify low-cardinality, frequently queried states (e.g., status = 'processed' or type = 'error'). Create small, efficient partial B-Tree indexes on these states to dramatically speed up operational queries.
* For the time dimension (created_at), which is physically correlated, create a global BRIN index. This will provide a massive performance boost for broad time-range queries at a near-zero cost in terms of size and write overhead.

By combining these techniques and implementing robust monitoring and maintenance processes, you can move beyond brute-force indexing and build a database architecture that is performant, cost-effective, and capable of scaling to meet the demands of a growing SaaS platform.