Partial & BRIN Indexes for Massive Multi-Tenant PostgreSQL Tables
The Inevitable Scaling Problem in Multi-Tenant Architectures
In any mature multi-tenant SaaS platform, certain tables grow to monstrous proportions. A canonical example is an events, audit_logs, or telemetry table. These tables often share a common schema:
* A tenant_id to enforce data isolation.
* A timestamp column (created_at, timestamp, etc.) for chronological ordering.
* A few low-cardinality classification columns (status, event_type, level).
* A semi-structured payload (data, payload).

As this table scales to billions or even trillions of rows, the indexing strategy becomes the primary determinant of application performance, database stability, and operational cost. A naive composite index, such as (tenant_id, created_at), which works perfectly at 10 million rows, becomes a liability at 10 billion. The index itself can be hundreds of gigabytes, making maintenance operations like VACUUM and re-indexing prohibitively slow and resource-intensive. Writes are penalized, and even seemingly simple queries suffer.
This article dissects an advanced indexing pattern specifically designed for this scenario. We will move beyond standard B-Tree indexes to strategically combine Partial B-Tree Indexes for 'hot' data partitions with Partial BRIN Indexes for 'cold', archival data. This hybrid approach dramatically reduces index size, accelerates common query patterns, and lowers maintenance overhead without requiring table partitioning (which introduces its own complexities).
Scenario: The `audit_logs` Table
Let's establish our working model: a multi-tenant audit_logs table.
Common Characteristics & Query Patterns:
* tenant_id: Thousands or tens of thousands of tenants.

Typical query patterns:

* Fetch recent, non-archived logs for a specific tenant.
* Search for specific events within a wide date range for a tenant (e.g., for an annual audit).
* Count active alerts for a tenant in the last 24 hours.
Here is the base schema we'll use for our analysis:
CREATE TABLE audit_logs (
id BIGSERIAL PRIMARY KEY,
tenant_id INT NOT NULL,
user_id INT,
log_level VARCHAR(20) NOT NULL, -- e.g., 'INFO', 'WARN', 'ERROR'
message TEXT,
payload JSONB,
is_archived BOOLEAN NOT NULL DEFAULT FALSE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Naive, conventional index
CREATE INDEX idx_audit_logs_tenant_id_created_at
ON audit_logs (tenant_id, created_at DESC);
To simulate a production environment, let's populate this table with a significant amount of data. We'll use 1000 tenants and generate 100 million rows, with 95% of the data marked as archived (is_archived = TRUE) and spread over a year, while the remaining 5% is recent and not archived.
-- Data Generation Script (for demonstration)
-- In a real scenario, this would be billions of rows.
INSERT INTO audit_logs (tenant_id, user_id, log_level, message, is_archived, created_at)
SELECT
(random() * 999 + 1)::int, -- 1000 tenants
(random() * 10000)::int, -- some user id
CASE (random() * 2)::int
WHEN 0 THEN 'INFO'
WHEN 1 THEN 'WARN'
ELSE 'ERROR'
END,
'Log message ' || g.i,
-- 95% of data is archived and old
CASE WHEN g.i <= 95000000 THEN TRUE ELSE FALSE END,
-- Spread old data over a year, new data over the last week
CASE
WHEN g.i <= 95000000 THEN NOW() - '1 year'::interval * random()
ELSE NOW() - '7 days'::interval * random()
END
FROM generate_series(1, 100000000) as g(i);
-- Ensure stats are up-to-date
ANALYZE audit_logs;
With our naive (tenant_id, created_at DESC) index on this 100M row table, the index size is already substantial. On a typical setup, this might be around 2.5-3 GB. At a billion rows, you're looking at a 25-30 GB index that must be traversed for every query.
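The exact figures depend on your PostgreSQL version, fillfactor, and bloat, but you can verify them directly with the built-in size functions:

-- Check index and table sizes
SELECT pg_size_pretty(pg_relation_size('idx_audit_logs_tenant_id_created_at')) AS index_size,
       pg_size_pretty(pg_table_size('audit_logs')) AS table_size;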
Step 1: Isolating Hot Data with Partial B-Tree Indexes
A vast majority of application queries target recent, 'active' data. In our schema, this corresponds to is_archived = FALSE. A partial index allows us to create a highly optimized, smaller index that only contains entries for this hot data subset.
The Strategy: Create a B-Tree index on (tenant_id, created_at) but restrict it to rows where is_archived is false.
-- Drop the naive index
DROP INDEX idx_audit_logs_tenant_id_created_at;
-- Create a partial index for hot data
CREATE INDEX idx_audit_logs_hot_data
ON audit_logs (tenant_id, created_at DESC)
WHERE is_archived = FALSE;
Analysis of the Partial Index
* Size: The new index, idx_audit_logs_hot_data, only contains entries for the 5% of our data that is not archived (5 million rows). Its size will be approximately 5% of the original naive index, a 20x reduction. This translates directly to less disk I/O, better cache utilization, and faster traversals.
* Query performance: Consider the primary hot-path query, fetching recent logs for a tenant:

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, created_at, log_level, message
FROM audit_logs
WHERE tenant_id = 123
AND is_archived = FALSE
AND created_at > NOW() - '1 day'::interval
ORDER BY created_at DESC
LIMIT 100;
Expected EXPLAIN ANALYZE Output (Conceptual):
Limit (cost=0.56..15.46 rows=100 width=64) (actual time=0.051..0.123 rows=100 loops=1)
Buffers: shared hit=15
-> Index Scan using idx_audit_logs_hot_data on audit_logs (cost=0.56..595.63 rows=4000 width=64) (actual time=0.050..0.115 rows=100 loops=1)
Index Cond: ((tenant_id = 123) AND (created_at > (now() - '1 day'::interval)))
Filter: (is_archived = FALSE) -- This filter is redundant but may appear
Buffers: shared hit=15
Planning Time: 0.215 ms
Execution Time: 0.138 ms
Key Observations:
* Planner Choice: The query planner correctly and enthusiastically chooses our small, efficient idx_audit_logs_hot_data.
* Index Condition: The tenant_id and created_at conditions are pushed down into the index scan, allowing the database to pinpoint the required rows very quickly.
* Buffer Hits: The number of shared buffers hit will be minimal because the entire relevant portion of the index likely fits in memory.
Compare this to the performance with the original, full-table index. The planner would have to traverse a much larger data structure, potentially leading to more I/O and higher latency, especially under concurrent load.
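The other hot-path pattern from our list, counting recent alerts for a tenant, is served by the same partial index. A sketch, assuming that ERROR-level rows represent alerts:

-- Count 'active alerts' for a tenant over the last 24 hours
-- (assumes ERROR-level rows represent alerts)
SELECT count(*)
FROM audit_logs
WHERE tenant_id = 123
AND is_archived = FALSE
AND log_level = 'ERROR'
AND created_at > NOW() - '24 hours'::interval;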
Edge Case: When the Planner Ignores a Partial Index
The PostgreSQL query planner will only use a partial index if it can prove that the query's WHERE clause implies the index's WHERE clause. For the index WHERE is_archived = FALSE, the planner knows it can be used for queries with WHERE is_archived = FALSE or WHERE is_archived = FALSE AND tenant_id = 123.
However, what about a query without that predicate?
-- This query CANNOT use our partial index
EXPLAIN ANALYZE
SELECT count(*)
FROM audit_logs
WHERE tenant_id = 123;
This query would result in a Sequential Scan or a scan on a different index (if one existed on tenant_id alone). The planner cannot use idx_audit_logs_hot_data because it cannot guarantee that the index contains all rows for tenant_id = 123; it's missing the archived ones.
This is not a flaw but a fundamental trade-off. You are optimizing for specific, high-frequency access patterns at the cost of others. Your application logic must be aware of these indexing strategies.
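One practical safeguard is to monitor whether your indexes are actually being used; the cumulative statistics views make this straightforward:

-- How often each index on audit_logs is being scanned
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'audit_logs'
ORDER BY idx_scan DESC;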
Step 2: Taming Cold Data with Partial BRIN Indexes
We've solved for the hot data, but what about the 95% of our table that is archived? These logs are queried infrequently, typically with wide date ranges, such as "find all ERROR logs for tenant 456 in Q3 of last year."
A B-Tree index on (tenant_id, created_at) for this massive dataset would be enormous and inefficient for range scans over millions of rows. This is where the BRIN (Block Range Index) shines.
What is a BRIN Index?
A BRIN index doesn't store an entry for every row. Instead, it stores the minimum and maximum value for a column (or columns) within a large range of physical table blocks (by default, 128 pages or 1MB). When you query a range, PostgreSQL checks the BRIN index to see which block ranges could possibly contain matching data, and then only scans those blocks. All other blocks are skipped entirely.
The Critical Prerequisite: Physical Correlation
BRIN indexes are only effective if the physical ordering of data on disk is highly correlated with the indexed column's values. In our append-only audit_logs table, new rows with increasing created_at values are appended to the end of the table. This creates a near-perfect natural correlation, making created_at an ideal candidate for a BRIN index.
The Strategy: Create a partial BRIN index specifically for the archived data.
-- Create a partial BRIN index for cold data
CREATE INDEX idx_audit_logs_cold_data_brin
ON audit_logs USING BRIN (created_at)
WHERE is_archived = TRUE;
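If you are curious what the BRIN index actually stores, the pageinspect extension can dump its summary tuples. This is purely illustrative (pageinspect is superuser-only by default, and page 2 is typically the first regular data page after the metapage and range map):

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Each row is one block range: blknum is the first heap block it covers,
-- and value shows the min .. max of created_at within that range
SELECT itemoffset, blknum, value
FROM brin_page_items(get_raw_page('idx_audit_logs_cold_data_brin', 2),
                     'idx_audit_logs_cold_data_brin')
LIMIT 5;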
Analysis of the Partial BRIN Index
* Size: A full B-Tree index on created_at for the archived rows might be several gigabytes. The BRIN index will likely be a few megabytes at most. This is a game-changing reduction in storage and memory overhead.
* Query performance: Consider the wide-range audit query against the cold data:

-- Note: We need another index for the tenant_id lookup.
-- Let's assume a simple one exists for this query.
CREATE INDEX idx_audit_logs_tenant_id ON audit_logs (tenant_id);
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, created_at, log_level, message
FROM audit_logs
WHERE tenant_id = 456
AND is_archived = TRUE
AND log_level = 'ERROR'
AND created_at BETWEEN '2023-07-01' AND '2023-09-30';
Expected EXPLAIN ANALYZE Output (Conceptual):
Bitmap Heap Scan on audit_logs (cost=1500.00..65000.00 rows=5000 width=64) (actual time=50.123..250.456 rows=4890 loops=1)
Recheck Cond: ((tenant_id = 456) AND (created_at >= '2023-07-01') AND (created_at <= '2023-09-30'))
Filter: (is_archived AND (log_level = 'ERROR'::text))
Rows Removed by Filter: 150000
Heap Blocks: lossy=12800
Buffers: shared hit=13500
-> BitmapAnd (cost=1500.00..1500.00 rows=5000 width=0) (actual time=45.123..45.123 rows=0 loops=1)
Buffers: shared hit=700
-> Bitmap Index Scan on idx_audit_logs_cold_data_brin (cost=0.00..500.00 rows=2000000 width=0) (actual time=20.567..20.567 rows=234567 loops=1)
Index Cond: ((created_at >= '2023-07-01') AND (created_at <= '2023-09-30'))
Buffers: shared hit=50
-> Bitmap Index Scan on idx_audit_logs_tenant_id (cost=0.00..1000.00 rows=100000 width=0) (actual time=24.432..24.432 rows=98765 loops=1)
Index Cond: (tenant_id = 456)
Buffers: shared hit=650
Planning Time: 0.543 ms
Execution Time: 255.876 ms
Key Observations:
* Bitmap Scan: PostgreSQL combines the results of our two indexes (idx_audit_logs_cold_data_brin and idx_audit_logs_tenant_id) using a BitmapAnd operation. This is highly efficient.
* BRIN Effectiveness: The Bitmap Index Scan on idx_audit_logs_cold_data_brin is the crucial part. The BRIN index rapidly identifies all the 1MB block ranges that contain logs from Q3 2023. It tells the planner to completely ignore the vast majority of the table's physical blocks (e.g., those from 2022 or Q1/Q2 2023).
* Lossy Blocks: The Heap Blocks: lossy=12800 line indicates that the BRIN scan identified 12,800 heap blocks that might contain matching rows. The database then scans these blocks and re-checks the condition. Without the BRIN index, this would have been a scan over millions of blocks.
The Combined Production Pattern
By implementing both strategies, we have created a sophisticated, two-tiered indexing system within a single table:
* A partial B-Tree index (idx_audit_logs_hot_data) for is_archived = FALSE. It serves low-latency dashboard and real-time queries with pinpoint precision.
* A partial BRIN index (idx_audit_logs_cold_data_brin) for is_archived = TRUE. It serves high-throughput analytical and audit queries over large date ranges by dramatically reducing I/O.
* A standard B-Tree index on (tenant_id) to efficiently combine with the BRIN index for cold data queries.

Final Indexing Schema:
-- Index for hot, active data (frequent, low-latency lookups)
CREATE INDEX idx_audit_logs_hot_data
ON audit_logs (tenant_id, created_at DESC)
WHERE is_archived = FALSE;
-- Index for cold, archived data (infrequent, large range scans)
CREATE INDEX idx_audit_logs_cold_data_brin
ON audit_logs USING BRIN (tenant_id, created_at)
WHERE is_archived = TRUE;
-- Standard B-Tree on tenant_id, combined with the BRIN index via bitmap AND for cold queries
CREATE INDEX idx_audit_logs_tenant_id
ON audit_logs (tenant_id);
-- Note: A composite BRIN can also work well if queries often filter by tenant AND a date range.
A composite BRIN on (tenant_id, created_at) stores min/max values for both columns per block range, so it can filter better when both appear in a query. It only helps, however, if a tenant's rows are physically clustered together; in a typical append-only table where tenants interleave, the per-range tenant_id summary spans nearly every tenant, in which case the single-column BRIN on created_at shown earlier, combined with the tenant_id B-Tree, performs just as well.
Advanced Considerations and Production Pitfalls
This pattern is powerful, but it requires careful management in a production environment.
1. Maintaining BRIN Correlation
The effectiveness of a BRIN index is entirely dependent on physical data correlation. While append-only tables have natural correlation, operations can disrupt it:
* Bulk Updates: An UPDATE command that touches millions of old rows can cause them to be moved, breaking the physical ordering.
* Delayed Data/Backfills: If a background job inserts data with old timestamps, it will be placed at the end of the table physically, creating a mismatch between the created_at value and its physical location.
You can check the correlation of a column using the system catalog:
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'audit_logs' AND attname = 'created_at';
A correlation close to 1.0 or -1.0 is ideal. If it drops significantly (e.g., below 0.8), the BRIN index's performance will degrade. In such cases, running CLUSTER audit_logs USING an appropriate index can physically re-sort the table, but CLUSTER requires a full (non-partial) B-Tree index to order by, takes an ACCESS EXCLUSIVE lock, and is a major operational event. It's better to design data ingestion pipelines to preserve the natural order.
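If re-clustering ever becomes unavoidable, a one-off sketch might look like the following; idx_audit_logs_created_at_full is a hypothetical full B-Tree created just for the operation, since CLUSTER cannot use partial or BRIN indexes:

-- One-off physical re-sort by created_at (takes an ACCESS EXCLUSIVE lock)
CREATE INDEX CONCURRENTLY idx_audit_logs_created_at_full
ON audit_logs (created_at);

CLUSTER audit_logs USING idx_audit_logs_created_at_full;
ANALYZE audit_logs; -- refresh correlation statistics afterwards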
2. Tuning BRIN `pages_per_range`
The pages_per_range storage parameter for a BRIN index controls how many table blocks are summarized by a single index entry. The default is 128.
* Smaller pages_per_range (e.g., 32): Creates a larger, more precise BRIN index. It will be better at filtering, returning fewer 'lossy' blocks, at the cost of a larger index. This is useful if your data isn't perfectly correlated.
* Larger pages_per_range (e.g., 256): Creates a smaller, less precise index. Ideal for perfectly sequential data where large chunks of the table can be skipped with confidence.
-- Create a BRIN index with a custom range
CREATE INDEX idx_audit_logs_cold_data_brin_tuned
ON audit_logs USING BRIN (created_at)
WITH (pages_per_range = 64)
WHERE is_archived = TRUE;
Tuning this parameter requires benchmarking against your specific query patterns and data distribution.
3. Concurrent Index Creation
On a multi-terabyte table, a standard CREATE INDEX command will lock the table against writes for an unacceptably long time. You must use CREATE INDEX CONCURRENTLY.
CREATE INDEX CONCURRENTLY idx_audit_logs_hot_data
ON audit_logs (tenant_id, created_at DESC)
WHERE is_archived = FALSE;
Trade-offs of CONCURRENTLY:
* Pros: It does not take a heavy lock, allowing INSERT, UPDATE, and DELETE operations to continue while the index is being built.
* Cons:
* It is much slower (can be 2-3x longer).
* It consumes more CPU and I/O resources.
* It requires two full scans of the table.
* It can fail, for example due to deadlocks or (for unique indexes) constraint violations. If it fails, it leaves behind an invalid index that must be manually dropped and rebuilt.
This is a non-negotiable operational practice for large, live tables.
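To spot any invalid leftovers from failed concurrent builds, query the catalog:

-- Find indexes left INVALID by a failed CREATE INDEX CONCURRENTLY
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;

-- Then drop and retry:
-- DROP INDEX CONCURRENTLY idx_audit_logs_hot_data;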
4. The Archiving Process
Our entire strategy hinges on the is_archived flag. The process that transitions data from 'hot' to 'cold' (UPDATE audit_logs SET is_archived = TRUE WHERE ...) is critical. This should be a well-behaved, throttled background process.
A naive UPDATE ... WHERE created_at < NOW() - '7 days' can cause a massive number of rows to be updated at once, leading to table bloat and high I/O. A better approach is to process records in smaller, manageable batches:
-- Batched archiving as a PL/pgSQL procedure; each batch commits in its own
-- transaction (PostgreSQL 11+), keeping locks and WAL usage per step small
CREATE OR REPLACE PROCEDURE archive_old_audit_logs()
LANGUAGE plpgsql AS $$
BEGIN
LOOP
    UPDATE audit_logs
    SET is_archived = TRUE
    WHERE id IN (
        SELECT id
        FROM audit_logs
        WHERE is_archived = FALSE
        AND created_at < NOW() - '7 days'::interval
        LIMIT 10000 -- Process in batches of 10k
        FOR UPDATE SKIP LOCKED
    );

    -- Exit if no rows were updated
    EXIT WHEN NOT FOUND;

    -- Commit the batch, then a small delay to reduce load
    COMMIT;
    PERFORM pg_sleep(1);
END LOOP;
END;
$$;

CALL archive_old_audit_logs();
This batching approach keeps each transaction short, avoids long-held locks and large WAL spikes, and reduces the impact on overall database performance.
Conclusion: Beyond One-Size-Fits-All Indexing
For engineers operating at scale, treating indexing as a one-size-fits-all problem is a recipe for performance degradation and operational pain. The naive composite B-Tree index, while a reasonable starting point, fails catastrophically on massive, multi-tenant tables with varied data access patterns.
By dissecting the query patterns into 'hot' transactional lookups and 'cold' analytical scans, we can deploy a sophisticated, hybrid indexing strategy. The combination of a partial B-Tree index for the small, frequently accessed subset of data and a partial BRIN index for the vast, chronologically correlated archive provides a solution that is orders of magnitude more efficient in terms of storage, query latency, and maintenance overhead.
This approach is not a magic bullet. It requires a deep understanding of your data's lifecycle, careful management of data correlation, and disciplined operational practices like concurrent index builds and batched updates. However, for SaaS platforms managing petabytes of data in PostgreSQL, moving beyond conventional indexing to embrace these advanced patterns is not just an optimization—it's a necessity for survival.