PostgreSQL Partial Indexing for High-Performance Multi-Tenant SaaS
The Inefficiency of Standard Indexes in Skewed Multi-Tenant Data
In a mature multi-tenant SaaS environment, certain tables inevitably become repositories for vast amounts of data with highly skewed distributions. Consider a canonical tasks table in a project management application. Over 95% of tasks are eventually marked as completed, and a significant portion might be archived. Queries from the application, however, overwhelmingly focus on the small subset of tasks that are active or pending.
Let's model this scenario. We have a table designed to hold tasks for millions of users across thousands of tenants.
CREATE TABLE tasks (
id BIGSERIAL PRIMARY KEY,
tenant_id INT NOT NULL,
project_id INT NOT NULL,
assignee_id INT,
title VARCHAR(255) NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT 'pending', -- e.g., 'pending', 'active', 'completed'
is_archived BOOLEAN NOT NULL DEFAULT FALSE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Populate with a realistic, skewed data distribution
-- (This is a simplified script for demonstration)
INSERT INTO tasks (tenant_id, project_id, title, status, is_archived)
SELECT
    (random() * 1000)::int + 1,   -- ~1,000 tenants
    (random() * 10000)::int + 1,  -- ~10,000 projects
    'Task ' || gs,                -- title is NOT NULL, so it must be supplied
    CASE
        WHEN random() < 0.02 THEN 'active'
        WHEN random() < 0.05 THEN 'pending'
        ELSE 'completed'
    END,
    random() < 0.3                -- ~30% are archived
FROM generate_series(1, 10000000) AS gs; -- 10 million tasks
-- Analyze the table to get statistics for the query planner
ANALYZE tasks;
A common query is to fetch all active, unarchived tasks for a specific tenant. The natural instinct is to create a composite B-tree index to support this lookup:
CREATE INDEX idx_tasks_tenant_status_archived ON tasks (tenant_id, status, is_archived);
At first glance, this seems optimal. The index covers all columns in the WHERE clause. However, let's examine the consequences in a system with 10 million rows where only 2% of tasks are 'active' and 70% of all tasks are unarchived.
This index will contain entries for every single row in the tasks table. It will be massive. Let's inspect its size:
-- On a sample run, this can be over 200 MB
SELECT pg_size_pretty(pg_relation_size('idx_tasks_tenant_status_archived'));
-- Result: ~245 MB
When we execute our query, PostgreSQL has to navigate this large index structure, even though the data we're seeking represents a tiny fraction of the indexed rows.
EXPLAIN ANALYZE
SELECT id, title
FROM tasks
WHERE tenant_id = 123
AND status = 'active'
AND is_archived = FALSE;
Here's a typical EXPLAIN ANALYZE output:
Index Scan using idx_tasks_tenant_status_archived on tasks (cost=0.56..25.61 rows=14 width=21) (actual time=0.085..0.124 rows=14 loops=1)
Index Cond: ((tenant_id = 123) AND ((status)::text = 'active'::text) AND (is_archived = false))
Planning Time: 0.215 ms
Execution Time: 0.145 ms
While this specific query is fast due to caching and the small dataset for a single tenant, the underlying problem is scale. The index itself is bloated. It slows down writes (INSERT, UPDATE), consumes significant memory and disk space, and makes maintenance operations like VACUUM and REINDEX more expensive. For queries that need to scan larger portions of the index, the cost becomes prohibitive.
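To make the skew concrete, a quick distribution check over the demo data shows how few of those ~245 MB worth of entries the hot query actually cares about (exact counts will vary from run to run):
-- Most of the index points at completed or archived tasks the app rarely reads
SELECT status, is_archived, count(*) AS row_count
FROM tasks
GROUP BY status, is_archived
ORDER BY row_count DESC;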
This is where partial indexes provide a far more elegant and performant solution.
The Surgical Precision of Partial Indexes
A partial index is an index built on a subset of a table's rows, defined by a WHERE clause provided at index creation time. It only indexes rows that satisfy the predicate. This is the key to solving our bloat problem.
Let's create a partial index specifically for the query we need to optimize:
CREATE INDEX idx_tasks_active_unarchived_for_tenant ON tasks (tenant_id)
WHERE status = 'active' AND is_archived = FALSE;
Notice a few critical things here:
* The WHERE clause: it precisely matches the most performance-critical, selective part of our query's WHERE clause.
* The index key (tenant_id): the status and is_archived columns are constant within the index due to the WHERE predicate, so including them in the index key itself is redundant and would only increase its size.
Now, let's compare the size of this new index to our original composite index:
-- Compare the bloated index with the surgical partial index
SELECT pg_size_pretty(pg_relation_size('idx_tasks_tenant_status_archived')) AS full_index_size,
pg_size_pretty(pg_relation_size('idx_tasks_active_unarchived_for_tenant')) AS partial_index_size;
| full_index_size | partial_index_size |
|---|---|
| 245 MB | 3584 kB |
The partial index is nearly 70 times smaller. This is a monumental improvement. It will fit more easily in memory, be faster to scan, and impose negligible overhead on write operations for rows that don't match its predicate (e.g., inserting or marking a task as completed).
Let's re-run our query. PostgreSQL's query planner is smart enough to see that the partial index is a perfect fit.
EXPLAIN ANALYZE
SELECT id, title
FROM tasks
WHERE tenant_id = 123
AND status = 'active'
AND is_archived = FALSE;
Index Scan using idx_tasks_active_unarchived_for_tenant on tasks (cost=0.29..8.40 rows=14 width=21) (actual time=0.025..0.027 rows=14 loops=1)
Index Cond: (tenant_id = 123)
Filter: ((status)::text = 'active'::text)
Planning Time: 0.250 ms
Execution Time: 0.045 ms
While the execution time improvement here is modest due to the simplicity of the query, the real win is the dramatic reduction in resource consumption and the guaranteed performance stability as the tasks table grows to hundreds of millions or billions of rows. The index size will only grow in proportion to the number of active, unarchived tasks, not the total number of tasks.
Advanced Production Patterns
Partial indexes are not just for simple status flags. They unlock several powerful patterns essential for robust SaaS application design.
Pattern 1: Enforcing Uniqueness on a Subset (The Soft-Delete Pattern)
A common requirement is to enforce email uniqueness for active users, but allow multiple users to exist with the same email if they have been "soft-deleted". A standard UNIQUE constraint on the email column makes this impossible.
A partial unique index is the canonical solution.
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL,
deleted_at TIMESTAMPTZ
);
-- This constraint ensures only ONE active user can have a given email.
CREATE UNIQUE INDEX idx_users_unique_active_email
ON users (email)
WHERE deleted_at IS NULL;
Let's test this behavior:
-- 1. Create an active user
INSERT INTO users (email) VALUES ('[email protected]');
-- SUCCESS
-- 2. Try to create another active user with the same email
INSERT INTO users (email) VALUES ('[email protected]');
-- ERROR: duplicate key value violates unique constraint "idx_users_unique_active_email"
-- Key (email)=([email protected]) already exists.
-- 3. Soft-delete the original user
UPDATE users SET deleted_at = NOW() WHERE email = '[email protected]';
-- SUCCESS
-- 4. Now, creating a new active user with that email is allowed
INSERT INTO users (email) VALUES ('[email protected]');
-- SUCCESS
-- 5. We can even have multiple soft-deleted users with the same email
-- (the original user is already soft-deleted from step 3)
INSERT INTO users (email, deleted_at) VALUES ('[email protected]', NOW());
-- SUCCESS
This pattern is incredibly powerful for maintaining data integrity in systems that use soft deletes, ensuring that business logic constraints only apply to the active, relevant subset of data.
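One operational detail worth knowing: to upsert against a partial unique index, the ON CONFLICT clause must restate the index predicate so PostgreSQL can infer which index arbitrates the conflict. A minimal sketch (the DO NOTHING action is purely illustrative):
-- The conflict target must include the partial index's predicate
INSERT INTO users (email)
VALUES ('[email protected]')
ON CONFLICT (email) WHERE deleted_at IS NULL
DO NOTHING;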
Pattern 2: The "To-Do List" Index
Consider a background job processing system with a jobs table. Jobs have statuses like queued, running, failed, and completed. Workers constantly poll the table for new work with a query like:
SELECT id, payload
FROM jobs
WHERE status = 'queued'
ORDER BY created_at
LIMIT 10
FOR UPDATE SKIP LOCKED;
Over time, the jobs table will be dominated by completed and failed jobs. A standard index on (status, created_at) would be inefficient. A partial index targeting only the small number of queued jobs is far superior.
CREATE INDEX idx_jobs_queued_for_processing ON jobs (created_at)
WHERE status = 'queued';
This index is tiny, containing only the jobs workers care about. It makes the polling query extremely fast and low-impact, which is critical for a high-throughput job queue.
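To show how the index interacts with the workers, here is a minimal claim step, assuming the jobs table has the columns referenced above (id, payload, status, created_at). Once a row's status changes to 'running', it no longer satisfies the index predicate:
-- Claim up to 10 queued jobs; rows updated to 'running' drop out of the
-- partial index, so other workers stop seeing them.
WITH claimed AS (
    SELECT id
    FROM jobs
    WHERE status = 'queued'
    ORDER BY created_at
    LIMIT 10
    FOR UPDATE SKIP LOCKED
)
UPDATE jobs
SET status = 'running'
FROM claimed
WHERE jobs.id = claimed.id
RETURNING jobs.id, jobs.payload;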
Pattern 3: Indexing Non-NULL Foreign Keys
Imagine a documents table where a document can optionally be assigned to a folder_id. Most documents might not be in a folder, leaving folder_id as NULL for a large percentage of rows.
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
tenant_id INT NOT NULL,
folder_id INT, -- Can be NULL
title TEXT
);
A query to find all documents within a specific folder for a tenant (WHERE tenant_id = ? AND folder_id = ?) would traditionally use an index on (tenant_id, folder_id).
However, if 90% of documents have a NULL folder_id, that index will be bloated with NULL entries. A more efficient approach is a partial index that completely ignores them.
CREATE INDEX idx_documents_in_folders ON documents (tenant_id, folder_id)
WHERE folder_id IS NOT NULL;
This creates a dense, efficient index containing only the rows relevant to folder-based lookups, again saving significant space and improving performance.
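Any lookup that supplies a concrete folder_id implies the folder_id IS NOT NULL predicate, so the planner can use the partial index (the values below are illustrative):
SELECT id, title
FROM documents
WHERE tenant_id = 42
  AND folder_id = 7;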
Critical Edge Cases and Gotchas
While powerful, partial indexes require a precise understanding of the query planner. Mismatches between your query and your index definition will cause the planner to ignore your carefully crafted index, leading to unexpected performance degradation.
Gotcha 1: The Predicate Matching Trap
The query planner will only consider a partial index if it can prove that the query's WHERE clause implies the index's WHERE predicate. In other words, the query's conditions must be at least as restrictive as the index's; and because PostgreSQL's implication prover is deliberately simple, predicates should stick to plain equality and comparison clauses.
Let's revisit our tasks example:
CREATE INDEX idx_tasks_active_unarchived_for_tenant ON tasks (tenant_id) WHERE status = 'active' AND is_archived = FALSE;
-- The query's WHERE clause is an exact match.
SELECT * FROM tasks WHERE tenant_id = 123 AND status = 'active' AND is_archived = FALSE;
-- The query's WHERE is a superset (more restrictive).
-- The planner can use the partial index and then apply the extra filter.
SELECT * FROM tasks WHERE tenant_id = 123 AND status = 'active' AND is_archived = FALSE AND project_id = 456;
-- The query's WHERE is a logical subset (less restrictive).
-- It asks for 'active' OR 'pending', but the index only contains 'active'.
-- The planner MUST perform a full table scan or use a different index.
SELECT * FROM tasks WHERE tenant_id = 123 AND status IN ('active', 'pending') AND is_archived = FALSE;
This is the most common pitfall. If you have queries for both 'active' and 'pending' tasks, you would need two separate partial indexes or a different indexing strategy altogether.
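If the application genuinely queries both statuses, one alternative to two separate partial indexes is a single index whose predicate covers both (the index name here is illustrative). The planner can generally prove that either the exact IN list or a single-status equality implies this predicate:
-- Covers queries for 'active', 'pending', or both
CREATE INDEX idx_tasks_open_for_tenant ON tasks (tenant_id, status)
WHERE status IN ('active', 'pending') AND is_archived = FALSE;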
Gotcha 2: Immutability of Functions
You can use functions in a partial index predicate, but they must be marked as IMMUTABLE. This is a promise to the database that the function will always return the same result for the same input, and its result does not depend on any database lookups or configuration settings.
-- ERROR: functions in index predicate must be marked IMMUTABLE
CREATE INDEX idx_recent_tasks ON tasks (created_at)
WHERE created_at > (NOW() - INTERVAL '30 days');
The function NOW() is STABLE, not IMMUTABLE. Its value changes with every transaction. PostgreSQL cannot create a static index based on a moving target.
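The usual workaround, sketched below, is to keep the moving cutoff in the query rather than in the index definition; an ordinary index on created_at serves the "recent rows" query fine. IMMUTABLE functions, by contrast, are allowed in predicates, as the next example shows.
-- The cutoff lives in the query, not the index, so nothing immutable is
-- required at index creation time.
CREATE INDEX idx_tasks_created_at ON tasks (created_at);

SELECT id, title
FROM tasks
WHERE created_at > NOW() - INTERVAL '30 days';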
-- lower() is an IMMUTABLE function.
CREATE INDEX idx_users_lower_case_email ON users (email)
WHERE lower(email) LIKE '%.gov';
This could be used to efficiently find all government-affiliated users, assuming such a query is common. The lower() function guarantees deterministic output.
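A query that simply repeats the predicate can then be answered from the small partial index instead of scanning the whole users table; a minimal sketch:
-- Matches the index predicate exactly, so the planner can satisfy it
-- from the partial index alone.
SELECT email
FROM users
WHERE lower(email) LIKE '%.gov';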
Gotcha 3: Type Casting and Parameterized Queries
Subtle differences in data types between your query and your index definition can prevent the planner from using the index. This is especially dangerous with ORMs and query builders that might abstract away the exact types.
If an index is on some_column_int but your application binds the parameter with an explicitly different type (for example numeric, which some drivers and ORMs use for numeric parameters), PostgreSQL has to cast the column to perform the comparison, and the cast expression no longer matches the index. Always ensure your application code sends parameters with the correct data type.
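A quick way to see the effect is to compare plans with and without the mismatched type; here numeric stands in for however a driver might have typed the bind parameter:
-- tenant_id is an int; comparing it to a numeric forces the column to be
-- cast, and the cast expression no longer matches the partial index.
EXPLAIN SELECT id, title FROM tasks
WHERE tenant_id = 123::numeric
  AND status = 'active'
  AND is_archived = FALSE;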
A Full Production Example: High-Throughput Notification System
Let's synthesize these concepts into a complete, production-grade example. A user notification system is a perfect candidate for partial indexing. A notifications table can grow to billions of rows, but users are almost exclusively interested in the tiny fraction of notifications that are unread.
Schema:
CREATE TABLE notifications (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT NOT NULL,
type VARCHAR(50) NOT NULL, -- e.g., 'new_comment', 'task_assigned'
message TEXT NOT NULL,
is_read BOOLEAN NOT NULL DEFAULT FALSE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Data Skew:
We'll simulate a system with 50 million notifications, where 99.5% are read.
INSERT INTO notifications (user_id, type, message, is_read)
SELECT
(random() * 100000)::int + 1, -- 100k users
'some_event',
'This is a notification message.',
random() > 0.005 -- 99.5% are read
FROM generate_series(1, 50000000);
ANALYZE notifications;
The Problem Query:
This is the query that powers the notification badge/dropdown in the UI. It must be instantaneous.
SELECT COUNT(*) FROM notifications WHERE user_id = 54321 AND is_read = FALSE;
Attempt 1: The Standard Composite Index
CREATE INDEX idx_notifications_user_read_status ON notifications (user_id, is_read);
Let's check its size and performance.
-- Size: Can easily be > 1 GB
SELECT pg_size_pretty(pg_relation_size('idx_notifications_user_read_status'));
-- Result: ~1.2 GB
-- Performance:
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM notifications WHERE user_id = 54321 AND is_read = FALSE;
Finalize Aggregate (cost=1234.56..1234.57 rows=1 width=8) (actual time=2.850..2.851 rows=1 loops=1)
-> Gather (cost=1234.34..1234.55 rows=2 width=8) (actual time=2.845..2.849 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=234.34..234.35 rows=1 width=8) (actual time=1.567..1.568 rows=1 loops=3)
-> Parallel Index Only Scan using idx_notifications_user_read_status on notifications (cost=0.57..233.05 rows=517 width=0) (actual time=0.045..1.543 rows=83 loops=3)
Index Cond: ((user_id = 54321) AND (is_read = false))
Heap Fetches: 0
Buffers: shared hit=45
Planning Time: 0.123 ms
Execution Time: 2.898 ms
An execution time of ~3ms might seem acceptable, but notice the complexity. The planner resorts to a parallel scan across a very large index. Under heavy load, with I/O contention and less favorable caching, this query's latency can easily spike to 10-20ms or more. The real cost is the 1.2 GB of memory and disk this index consumes permanently.
Attempt 2: The Partial Index Solution
CREATE INDEX idx_notifications_unread ON notifications (user_id)
WHERE is_read = FALSE;
Now, let's check the new index.
-- Size: A tiny fraction of the original
SELECT pg_size_pretty(pg_relation_size('idx_notifications_unread'));
-- Result: ~6144 kB
-- Performance:
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM notifications WHERE user_id = 54321 AND is_read = FALSE;
Aggregate (cost=12.19..12.20 rows=1 width=8) (actual time=0.039..0.040 rows=1 loops=1)
-> Index Only Scan using idx_notifications_unread on notifications (cost=0.42..11.94 rows=101 width=0) (actual time=0.021..0.032 rows=249 loops=1)
Index Cond: (user_id = 54321)
Heap Fetches: 0
Buffers: shared hit=4
Planning Time: 0.158 ms
Execution Time: 0.065 ms
The Results:
* Index Size: 1.2 GB vs 6 MB (a 99.5% reduction).
* Execution Time: 2.9 ms vs 0.065 ms (a 44x speedup).
* Planner Simplicity: A simple, direct Index Only Scan vs a parallelized scan.
* Resource Usage: shared hit=45 vs shared hit=4. It reads an order of magnitude fewer blocks from memory.
This is a definitive win. The partial index provides vastly superior performance with a fraction of the resource footprint, making it the unequivocally correct choice for this production workload.
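When rolling this out, it is worth confirming that the old composite index really has gone idle before removing it; a sketch of that check:
-- idx_scan counts how many times the planner has used the index since the
-- statistics were last reset.
SELECT idx_scan
FROM pg_stat_user_indexes
WHERE indexrelname = 'idx_notifications_user_read_status';

-- Once it stays at 0 over a representative period, drop it without blocking writes.
DROP INDEX CONCURRENTLY idx_notifications_user_read_status;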
Conclusion
Partial indexes are an indispensable tool in the arsenal of a senior engineer responsible for database performance at scale. By moving beyond generic composite indexes and applying surgical, predicate-driven indexing strategies, you can solve a whole class of performance problems related to skewed data distributions common in multi-tenant and high-volume systems.
The key is to deeply understand your application's query patterns and identify the small, critical subsets of data that are accessed most frequently. By creating indexes that cater specifically to these subsets, you achieve dramatic reductions in index size, leading to faster queries, lower write overhead, and more efficient use of system resources. However, this power demands precision. A thorough understanding of how the query planner matches predicates is non-negotiable to avoid the pitfall of creating an index that your queries can't actually use. When implemented correctly, partial indexes are a testament to the principle that in database optimization, less is often more.