Postgres Partitioning with `pg_partman` for Multi-Tenant Time-Series
The Inevitable Scaling Problem of Time-Series Data in Multi-Tenant Systems
In any mature SaaS application, tables storing time-series data—such as metrics, logs, events, or audit trails—are ticking time bombs. A single, monolithic events table that performs adequately with 100 tenants and 10 million rows will grind to a halt with 10,000 tenants and 10 billion rows. The reasons are familiar to any senior engineer who has fought this battle:
* AUTOVACUUM becomes a major operational burden. On a multi-terabyte table, vacuums run for hours or even days, consuming significant I/O and CPU, and often struggle to keep up with tuple churn.
* Enforcing retention with a bulk DELETE is incredibly slow, generates massive WAL traffic, and leaves behind dead tuples that exacerbate the VACUUM problem. It often requires careful, iterative batch-deleting to avoid locking contention and performance degradation.
PostgreSQL's native declarative partitioning is the textbook solution. It allows you to split one logical table into smaller, more manageable physical tables. The query planner can then perform partition pruning, ignoring partitions that don't contain relevant data based on the WHERE clause. While this is a powerful feature, managing a partitioned table with hundreds or thousands of partitions manually is an operational nightmare. Creating new partitions, enforcing retention policies, and managing indexes across the entire set is tedious and error-prone.
This is where pg_partman enters. It's not just a convenience wrapper; it's a sophisticated automation engine for partition management that enables production-grade strategies. This article will dissect a specific, advanced pattern: using pg_partman to manage a multi-tenant time-series table partitioned by time, while ensuring high performance for tenant-specific queries through a carefully designed primary key and indexing strategy.
The Core Strategy: Time-Partitioning with Tenant-Aware Indexing
Our goal is to build a system that can efficiently answer queries like:
SELECT * FROM metrics WHERE tenant_id = 'c6a9b4a4-8a4b-4f4b-8b4b-4b4b4b4b4b4b' AND created_at BETWEEN '2023-10-26 10:00:00' AND '2023-10-26 11:00:00';
To achieve this, we cannot partition by tenant_id. Partitioning by LIST (tenant_id) would create a separate table for each tenant, which is unmanageable with thousands of tenants. Instead, we partition by a time-based column (created_at) and make our indexes tenant-aware.
Here is the parent table structure we'll work with. Note the deliberate choice of the primary key.
CREATE TABLE public.metrics (
metric_id uuid NOT NULL DEFAULT gen_random_uuid(),
tenant_id uuid NOT NULL,
created_at timestamptz NOT NULL,
metric_name text NOT NULL,
metric_value double precision NOT NULL,
metadata jsonb
) PARTITION BY RANGE (created_at);
-- CRITICAL: The partition key (created_at) MUST be part of the primary key.
ALTER TABLE public.metrics
ADD CONSTRAINT metrics_pkey PRIMARY KEY (tenant_id, metric_id, created_at);
-- The index that will serve most of our queries.
CREATE INDEX metrics_tenant_id_created_at_idx ON public.metrics (tenant_id, created_at DESC);
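The CRITICAL comment above isn't just guidance; PostgreSQL enforces it. A minimal sketch of the failure you would hit if the partition key were left out of the primary key (error wording is approximate and varies slightly by version):
-- Rejected, because created_at (the partition key) is not included:
-- ALTER TABLE public.metrics ADD CONSTRAINT bad_pkey PRIMARY KEY (tenant_id, metric_id);
-- ERROR:  unique constraint on partitioned table must include all partitioning columns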
Why this structure is critical:
* PARTITION BY RANGE (created_at): This enables efficient time-based partition pruning. When a query specifies a created_at range, the planner will only scan the relevant daily or hourly partitions.
* PRIMARY KEY (tenant_id, metric_id, created_at): In PostgreSQL, any primary key or unique constraint on a partitioned table must include all columns from the partition key. We include tenant_id first to optimize for tenant-specific lookups. metric_id ensures uniqueness for each measurement.
* INDEX ON (tenant_id, created_at DESC): This is our workhorse index. After the planner prunes partitions down to a small subset (e.g., a few days' worth of tables), this index allows for an extremely fast index scan to locate the specific rows for a single tenant_id within that time range.
Now, let's automate the management of this structure with pg_partman.
Advanced `pg_partman` Configuration for Production
First, ensure the extension is installed; it is conventionally given its own schema (CREATE SCHEMA partman; CREATE EXTENSION pg_partman SCHEMA partman;). We'll configure pg_partman to create new partitions ahead of time and automatically drop old ones.
Creating the Partition Set
The create_parent function is the entry point (the call below follows the pg_partman 4.x signature). We will configure it for daily partitions, creating 4 days of partitions in advance and maintaining a 90-day retention period.
-- Install the extension if you haven't already
-- CREATE SCHEMA partman;
-- CREATE EXTENSION pg_partman SCHEMA partman;
-- Create the partition set configuration
SELECT partman.create_parent(
p_parent_table := 'public.metrics',
p_control := 'created_at', -- The column to partition by
p_type := 'native', -- Use native declarative partitioning
p_interval := '1 day', -- Create one partition per day
p_premake := 4, -- Create 4 future partitions (4 days)
p_start_partition := (now() - interval '3 days')::text -- Optional: Bootstrap with some past partitions
);
This command populates partman.part_config and creates the initial set of partitions. The p_premake value is crucial. It ensures that partitions are always available for incoming data, preventing insertion errors. A value of 4 for daily partitions means you have a 4-day buffer if your maintenance job fails to run.
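To sanity-check what create_parent produced, you can inspect both the system catalogs and pg_partman's configuration table (the part_config column names below follow pg_partman 4.x):
-- List the child partitions that now exist under the parent
SELECT inhrelid::regclass AS child_partition
FROM pg_inherits
WHERE inhparent = 'public.metrics'::regclass
ORDER BY inhrelid::regclass::text;
-- And the stored configuration for this partition set
SELECT parent_table, partition_interval, premake, retention
FROM partman.part_config
WHERE parent_table = 'public.metrics';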
Automating Maintenance and Retention
While you can call partman.run_maintenance() manually, this must be automated. You have two primary production-ready options:
pg_partman_bgw (Background Worker): This is a dedicated PostgreSQL background worker that wakes up at a configured interval to run maintenance. It's managed entirely within Postgres.
* Configuration in postgresql.conf:
shared_preload_libraries = 'pg_partman_bgw'
pg_partman_bgw.interval = 3600 # Run every hour
pg_partman_bgw.role = 'postgres' # The role to run maintenance as
pg_partman_bgw.dbname = 'your_db_name' # Database(s) to run maintenance on (comma-separated)
* Pros: Self-contained within the database, no external dependencies.
* Cons: Less observable than external job schedulers. Failures must be monitored via the Postgres logs or, if the optional pg_jobmon extension is installed, its job tables.
External job scheduler (e.g., cron or a Kubernetes CronJob): A scheduled job connects to the database and calls partman.run_maintenance_proc(), a stored procedure wrapper around run_maintenance() (the exact call is sketched after this list).
* Pros: Integrates with existing infrastructure monitoring and alerting (e.g., Prometheus, Datadog). Failures are immediately visible in your job scheduler's dashboard.
* Cons: Requires external infrastructure configuration.
For most production Kubernetes-based environments, the external CronJob is preferable due to superior observability.
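The database-side call the scheduler makes is a single statement. A minimal sketch (run_maintenance_proc() is the procedure form, intended for external callers so its work can be committed in batches rather than in one long transaction):
-- Run from cron / a Kubernetes CronJob, e.g.:
--   psql "$DATABASE_URL" -c "CALL partman.run_maintenance_proc();"
CALL partman.run_maintenance_proc();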
Configuring Retention
Now, let's configure the 90-day retention policy directly in pg_partman's configuration table.
UPDATE partman.part_config
SET
retention = '90 days',
retention_keep_table = false -- Drop expired child tables entirely; set true to detach them instead of dropping
WHERE parent_table = 'public.metrics';
When run_maintenance() is executed, it will check for any partitions whose date range is entirely older than 90 days from the current time and DROP them. Dropping a partition is a near-instantaneous, metadata-level operation, vastly superior to a bulk DELETE.
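To make the contrast concrete, this is roughly what retention does under the hood versus what the same cleanup costs on a monolithic table (the partition name below is illustrative):
-- Partitioned: an aged-out child is removed in one near-instant statement
-- DROP TABLE public.metrics_p2023_07_28;
-- Monolithic: the equivalent cleanup is a long-running, WAL-heavy, bloat-producing delete
-- DELETE FROM metrics_monolith WHERE created_at < now() - interval '90 days';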
Deep Dive: Query Performance Before and After
Let's prove the value with a concrete benchmark. We'll simulate a table with ~100 million rows, representing data for 1000 tenants over 100 days.
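As a rough sketch of how such a dataset could be generated for the partitioned table (an equivalent insert would populate metrics_monolith), assuming the partition set was bootstrapped far enough back, e.g. p_start_partition set ~100 days ago, so every generated row has a matching child; otherwise rows land in the default partition and pruning is defeated:
-- Hypothetical load: 1000 tenants x 100 days; scale the series bounds toward 100M rows as needed
INSERT INTO public.metrics (tenant_id, created_at, metric_name, metric_value)
SELECT
    md5(t::text)::uuid,                                         -- deterministic per-tenant UUID
    now() - (d * interval '1 day') - (random() * interval '1 day'),
    'cpu_utilization',
    random() * 100
FROM generate_series(1, 1000) AS t,
     generate_series(0, 99) AS d;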
Scenario 1: Single Monolithic Table
-- (Hypothetical EXPLAIN output for a large, unpartitioned table)
EXPLAIN ANALYZE SELECT count(*)
FROM metrics_monolith
WHERE tenant_id = 'c6a9b4a4-8a4b-4f4b-8b4b-4b4b4b4b4b4b'
AND created_at >= '2023-10-26 00:00:00'
AND created_at < '2023-10-27 00:00:00';
-- Likely Result:
-- Finalize Aggregate (cost=... rows=1 width=8) (actual time=850.123..850.124 rows=1 loops=1)
-- -> Gather (cost=... rows=2 width=8) (actual time=849.555..850.120 rows=3 loops=1)
-- Workers Planned: 2
-- Workers Launched: 2
-- -> Partial Aggregate (cost=... rows=1 width=8) (actual time=845.555..845.556 rows=1 loops=3)
-- -> Parallel Index Scan using metrics_mono_tenant_id_created_at_idx on metrics_monolith ...
-- Index Cond: (tenant_id = 'c6a9b4a4-8a4b-4f4b-8b4b-4b4b4b4b4b4b' AND created_at >= ... AND created_at < ...)
-- Buffers: shared hit=150000
-- Planning Time: 0.543 ms
-- Execution Time: 850.456 ms
The key takeaway here is the Execution Time and Buffers. Even with a perfectly good index, the database has to navigate a massive index and fetch data blocks that are interspersed with data from thousands of other tenants, leading to high buffer usage and slower execution.
Scenario 2: pg_partman Partitioned Table
Now, let's run the same query against our partitioned structure.
EXPLAIN ANALYZE SELECT count(*)
FROM metrics
WHERE tenant_id = 'c6a9b4a4-8a4b-4f4b-8b4b-4b4b4b4b4b4b'
AND created_at >= '2023-10-26 00:00:00'
AND created_at < '2023-10-27 00:00:00';
-- Result:
-- Finalize Aggregate (cost=... rows=1 width=8) (actual time=15.123..15.124 rows=1 loops=1)
-- -> Gather (cost=... rows=2 width=8) (actual time=14.555..15.120 rows=3 loops=1)
-- Workers Planned: 2
-- Workers Launched: 2
-- -> Partial Aggregate (cost=... rows=1 width=8) (actual time=10.555..10.556 rows=1 loops=3)
-- -> Parallel Index Scan using metrics_p2023_10_26_tenant_id_created_at_idx on metrics_p2023_10_26 ...
-- Index Cond: (tenant_id = 'c6a9b4a4-8a4b-4f4b-8b4b-4b4b4b4b4b4b' AND created_at >= ... AND created_at < ...)
-- Buffers: shared hit=1500
-- Planning Time: 1.234 ms (includes partition pruning)
-- Execution Time: 15.456 ms
Analysis of the Improvement:
* Partition pruning: The planner determined that only the metrics_p2023_10_26 partition could possibly contain the requested data. It completely ignored the other 99+ partitions. This is the single biggest performance gain.
* Reduced I/O: Buffer usage dropped by two orders of magnitude (shared hit=1500 vs 150000).
* Lower latency: Execution time fell from roughly 850ms to 15ms. This difference becomes even more pronounced as the total data volume grows into the terabytes.
Advanced Edge Cases and Production Patterns
Implementing partitioning is more than just setup; it's about handling the operational lifecycle.
Edge Case 1: Archiving Data Instead of Dropping It
Dropping data is not always acceptable. For compliance or analytics, you may need to archive it to cheaper storage like Amazon S3. pg_partman can facilitate this by detaching partitions instead of dropping them.
Step 1: Modify the retention policy.
pg_partman's retention_keep_table setting can detach partitions instead of dropping them, but the export step itself still has to be ours. Here we'll disable pg_partman's automatic retention entirely and handle detaching, archiving, and dropping in a custom procedure.
UPDATE partman.part_config
SET retention = NULL
WHERE parent_table = 'public.metrics';
Step 2: Create a custom archiving procedure.
This procedure will find old partitions, detach them, and export their data. Here, it only builds and prints the COPY command; in a real system, you would export with something like the aws_s3 extension's query_export_to_s3 (on RDS/Aurora) or a COPY to a file that gets shipped to object storage.
CREATE OR REPLACE PROCEDURE public.run_metrics_archive()
LANGUAGE plpgsql
AS $$
DECLARE
    v_partition_name text;
    v_archive_command text;
    v_detached_schema text := 'detached'; -- this schema must already exist (CREATE SCHEMA detached;)
BEGIN
    -- Find partitions older than 90 days (daily partitions are named like metrics_p2023_10_26)
    FOR v_partition_name IN
        SELECT partition_tablename
        FROM partman.show_partitions('public.metrics', 'DESC')
        WHERE to_date(substring(partition_tablename from '\d{4}_\d{2}_\d{2}$'), 'YYYY_MM_DD') < (now() - interval '90 days')
    LOOP
        RAISE NOTICE 'Archiving and detaching partition: %', v_partition_name;
        -- 1. Detach the partition. It becomes a standalone table.
        --    We move it to a 'detached' schema to keep things clean.
        EXECUTE format('ALTER TABLE public.metrics DETACH PARTITION public.%I', v_partition_name);
        EXECUTE format('ALTER TABLE public.%I SET SCHEMA %I', v_partition_name, v_detached_schema);
        -- 2. Archive the data. Note that COPY ... TO STDOUT only works from a client session;
        --    server-side you would COPY to a file or call aws_s3.query_export_to_s3 instead.
        v_archive_command := format(
            'COPY %I.%I TO STDOUT WITH (FORMAT CSV, HEADER)',
            v_detached_schema, v_partition_name
        );
        RAISE NOTICE 'Run archive command: %', v_archive_command;
        -- In a real system, you would perform the export here before dropping the table.
        -- 3. Drop the now-archived, detached table
        EXECUTE format('DROP TABLE %I.%I', v_detached_schema, v_partition_name);
        RAISE NOTICE 'Successfully archived and dropped %', v_partition_name;
    END LOOP;
END;
$$;
This procedure would then be called by your external CronJob after partman.run_maintenance() completes.
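Putting it together, the scheduler's nightly database calls reduce to two statements, run in order:
-- Nightly maintenance job:
CALL partman.run_maintenance_proc();
CALL public.run_metrics_archive();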
Edge Case 2: The `UPDATE` Partition Key Problem
What happens if you try to update a row's created_at value, moving it from one partition to another?
UPDATE metrics SET created_at = '2023-10-25 12:00:00' WHERE metric_id = '...';
In PostgreSQL, this operation is executed as a DELETE from the old partition and an INSERT into the new one. This has several negative implications:
* Performance: It's significantly slower than an in-place update.
* Concurrency: It can acquire stronger locks than a simple UPDATE.
* Trigger Complexity: Row-level triggers fire as a mix of UPDATE, DELETE, and INSERT triggers across the source and destination partitions rather than as a straightforward UPDATE.
Solution: The best solution is often architectural. Design your application so that the partition key is immutable. If a timestamp must be changed, model it as a logical DELETE (e.g., setting a deleted_at flag) and a new INSERT. If updates across partitions are unavoidable, ensure they are rare and performed during off-peak hours.
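A minimal sketch of that logical-delete-plus-insert pattern, assuming a nullable deleted_at column has been added to the table (it is not in the DDL above) and with '...' standing in for a real metric_id:
-- Wrap both statements in one transaction in application code.
-- 1. Soft-delete the row in its current partition (deleted_at is an assumed extra column)
UPDATE metrics
SET deleted_at = now()
WHERE metric_id = '...';
-- 2. Re-insert the measurement with the corrected timestamp; it routes to the correct partition
INSERT INTO metrics (metric_id, tenant_id, created_at, metric_name, metric_value, metadata)
SELECT gen_random_uuid(), tenant_id, '2023-10-25 12:00:00', metric_name, metric_value, metadata
FROM metrics
WHERE metric_id = '...';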
Edge Case 3: Monitoring and Failure Recovery
How do you know if pg_partman's maintenance is failing?
pg_partman reports maintenance errors to the standard PostgreSQL log. For in-database visibility it integrates with the optional pg_jobmon extension (toggled per partition set via the jobmon column in partman.part_config), which records each maintenance run in its job tables.
If you use pg_jobmon, your monitoring system should include a check on those tables.
-- A simple check for recent failures; assumes the pg_jobmon extension is installed and logging
-- pg_partman's runs (its check_job_status() reports failed or stalled jobs over the given window)
SELECT *
FROM jobmon.check_job_status('1 hour'::interval);
A query like this should be integrated into a monitoring tool (e.g., a Datadog agent check, a Prometheus exporter) to fire an alert if any errors are logged. Common failures include:
* Permissions issues: The role running maintenance doesn't have CREATE or DROP privileges.
* Lock contention: Maintenance stalls or times out trying to acquire a lock on the parent table, typically behind long-running queries or DDL. Lock-wait behaviour can be tuned (several pg_partman functions take lock-wait arguments), but it's better to ensure maintenance runs during low-traffic periods.
* Out of disk space: Partition creation fails because the tablespace is full.
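Beyond alerting on errors, a cheap safety net is to watch the premake buffer itself: if the newest child partition's range isn't comfortably in the future, maintenance has silently stalled. A sketch using the system catalogs:
-- The newest child should be named several days ahead of today (daily names like metrics_p2023_10_30)
SELECT max(c.relname::text) AS newest_child_partition
FROM pg_inherits i
JOIN pg_class c ON c.oid = i.inhrelid
WHERE i.inhparent = 'public.metrics'::regclass;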
Conclusion: From Liability to Asset
By combining PostgreSQL's native partitioning with the automation and advanced features of pg_partman, a massive, unwieldy time-series table can be transformed from a performance liability into a highly scalable and manageable asset. The key is not just to partition, but to do so with a strategy that aligns with your primary query patterns.
The multi-tenant pattern discussed here, partitioning by time while indexing for tenant-specific access, is a proven approach for building high-performance SaaS backends. It allows the query planner to work on small, relevant data subsets, drastically reducing query latency and I/O load. By automating partition creation and retention, pg_partman eliminates the operational overhead, freeing up engineering teams to focus on building features rather than manual database janitorial work. This is the level of robust engineering required to keep PostgreSQL performing well as time-series tables grow into the multi-terabyte range and beyond.