Automated Postgres Time-Series Partitioning at Scale with pg_partman
The Inevitable Scaling Problem with Time-Series Data
If you're running any significant logging, analytics, or IoT platform on PostgreSQL, you've encountered this scenario: a single, monolithic table, perhaps named events or metrics, grows to billions of rows. Initially, performance is acceptable. But over time, a creeping decay sets in. Queries for recent data, which should be fast, begin to slow down. Index sizes balloon, consuming gigabytes of RAM and leading to cache misses. The VACUUM process, once a swift background task, becomes a multi-hour ordeal, struggling to keep up with dead tuples. Maintenance operations like adding a column or rebuilding an index cause extended locks and production incidents.
This isn't a failure of PostgreSQL; it's a failure of data architecture. A single, massive B-tree cannot efficiently manage data when access patterns are heavily skewed toward a specific dimension, most commonly time.
PostgreSQL's native declarative partitioning is the textbook solution. It allows you to physically store data in smaller, more manageable child tables (partitions) while presenting them as a single logical table. However, native partitioning is a toolkit, not a fully automated solution. It leaves critical operational questions unanswered:
- How are new partitions for future data created automatically?
- How are old, irrelevant partitions dropped or archived without manual intervention?
- How do you ensure consistent indexing across all partitions, present and future?
Answering these questions with custom cron jobs and brittle shell scripts is a path to operational fragility. This is the precise problem pg_partman, a PostgreSQL extension, is built to solve. It provides a metadata-driven configuration layer on top of native partitioning, automating the entire lifecycle management of your partitions. This article is not an introduction to pg_partman. It is a deep dive into its advanced usage, focusing on production patterns, performance tuning, and edge cases you will encounter when operating at scale.
Section 1: Advanced Parent Table Configuration
The foundation of a pg_partman setup is the create_parent() function. A basic invocation gets you started, but production systems require a more nuanced configuration to ensure resilience and performance.
Let's consider a realistic schema for an IoT metrics table:
CREATE TABLE public.device_metrics (
    metric_id BIGINT GENERATED BY DEFAULT AS IDENTITY,
    device_uuid UUID NOT NULL,
    recorded_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    metric_type TEXT NOT NULL,
    payload JSONB NOT NULL
) PARTITION BY RANGE (recorded_at);
-- Note: native partitioning requires the parent to be declared PARTITION BY RANGE,
-- and any PRIMARY KEY on it must include the partition key, so metric_id alone
-- cannot be the primary key here. The primary key is addressed in Section 3.
-- A standard index for looking up metrics by device and time
CREATE INDEX idx_device_metrics_device_uuid_recorded_at
ON public.device_metrics (device_uuid, recorded_at DESC);
-- A GIN index for querying the JSONB payload
CREATE INDEX idx_device_metrics_payload_gin
ON public.device_metrics USING GIN (payload);
Our goal is to partition device_metrics by recorded_at on a daily basis. A naive create_parent call is insufficient. We need to control pre-creation, maintenance, and the starting partition.
Production-Grade `create_parent` Invocation
-- First, create a dedicated schema and install the extension into it
CREATE SCHEMA IF NOT EXISTS partman;
CREATE EXTENSION IF NOT EXISTS pg_partman SCHEMA partman;
-- Now, configure the partitioning
SELECT partman.create_parent(
p_parent_table := 'public.device_metrics',
p_control := 'recorded_at',
p_type := 'native',
p_interval := '1 day',
p_premake := 4,
p_start_partition := (NOW() - INTERVAL '3 days')::date::text,
p_automatic_maintenance := 'on',
p_jobmon := 'on'
);
Let's dissect the advanced parameters here:
- p_type := 'native': While pg_partman once supported trigger-based partitioning, you should exclusively use native partitioning on any modern PostgreSQL version (11+). It is vastly more performant.
- p_premake := 4: This is one of the most critical settings for production. It instructs pg_partman to always maintain 4 partitions for future data. If today is Wednesday, it ensures partitions for Thursday, Friday, Saturday, and Sunday already exist. Why is this crucial? If your maintenance job (run_maintenance_proc) fails to run for any reason (scheduler issue, lock contention), your application will not face a catastrophic failure trying to insert data for which no partition exists. This buffer is your primary line of defense against write interruptions.
- p_start_partition: By default, pg_partman creates the first partition around the current time. In a data migration scenario, you often need to create partitions for historical data you're about to load. This parameter gives you explicit control over the oldest partition to create upon setup.
- p_automatic_maintenance := 'on': A convenience that automatically adds this partition set to the partman.part_config table with maintenance enabled. We will still need to configure how maintenance runs, but this flags the set for management.
- p_jobmon := 'on': If you use the pg_jobmon extension for monitoring scheduled jobs, this integrates pg_partman's maintenance runs into it. This is highly recommended for visibility into job success, failure, and duration.

After running this, inspect the partman.part_config table. You'll see a detailed configuration entry for public.device_metrics, which becomes the single source of truth for its partitioning strategy.
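A quick way to confirm the setup is to read back the configuration row and list the children that were pre-created. This is a sketch against pg_partman 4.x; some column names vary between versions:

-- Read back the configuration recorded for this partition set
SELECT parent_table, control, partition_interval, premake,
       automatic_maintenance, template_table, retention
FROM partman.part_config
WHERE parent_table = 'public.device_metrics';

-- List the child partitions created so far, including the premade future ones
SELECT partition_schemaname, partition_tablename
FROM partman.show_partitions('public.device_metrics');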
Section 2: The Heartbeat: `run_maintenance_proc`
pg_partman is not a passive system. Its logic is executed by the partman.run_maintenance_proc() function (or run_maintenance() for older versions). This procedure is your partition management engine. It iterates through the partman.part_config table, creates new partitions based on the premake setting, and drops/detaches old ones based on retention policies. If this function does not run, your partitions will become stale.
Your responsibility is to schedule this procedure reliably. The most common and robust method within the database is using pg_cron.
-- Ensure pg_cron is enabled in postgresql.conf
-- shared_preload_libraries = 'pg_cron'
-- Schedule the maintenance to run every hour
SELECT cron.schedule(
'pgpartman-maintenance',
'0 * * * *', -- At the start of every hour
$$CALL partman.run_maintenance_proc()$$
-- Use the stored procedure version for PG11+
);
Scheduling Strategy and Frequency
- How often should it run? More frequently than your partition interval. For daily partitions, an hourly run is typical and ensures the premake buffer is always full.
- Why CALL ... run_maintenance_proc()? For PostgreSQL 11+, run_maintenance_proc() is a stored procedure that can handle its own transaction control. This is more robust than the older run_maintenance() function, especially when dealing with DDL and potential locks.

Monitoring and Failure Handling (Edge Case)
What happens when run_maintenance_proc fails? It might be due to a lock on the parent table, a transaction timeout, or a permissions issue. This is where your monitoring becomes critical.
- Check partman.part_config: The maintenance_last_run and maintenance_last_result columns are your first port of call. You should have an external monitoring system (e.g., Prometheus with a custom exporter, Datadog) that periodically queries this table and alerts if maintenance_last_run is older than your expected schedule interval or if maintenance_last_result indicates a failure.
- Watch the premake buffer: Periodically check for the existence of the partition that should cover NOW() + premake * interval (see the probe sketched below). If that partition doesn't exist, it's a critical alert that your premake buffer has been exhausted.
- Remember that p_premake is your safety net. If your maintenance job fails, premake gives you time (in this case, 4 days) to fix the underlying issue before your application starts throwing errors because a target partition does not exist.
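As a concrete example of the buffer check, a minimal probe (assuming daily partitions and the default device_metrics_pYYYY_MM_DD naming used elsewhere in this article) simply tests whether the partition four days out already exists:

-- Returns false when the premake buffer has been exhausted
SELECT to_regclass(
    'public.device_metrics_p' || to_char(now() + interval '4 days', 'YYYY_MM_DD')
) IS NOT NULL AS premake_buffer_ok;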
Section 3: Production-Grade Indexing with Template Tables
When pg_partman creates a new child partition, how does it know which indexes and table properties to apply? On PostgreSQL 11+, indexes declared on the partitioned parent are propagated to new children by the database itself, but properties the parent cannot carry (and, on older releases, the indexes themselves) are not. Adding those by hand to every new partition is tedious and error-prone. The pg_partman solution is the template table.
A template table is a non-partitioned, empty table that has the exact same schema, including all indexes, constraints, and foreign keys, as your desired child partitions. pg_partman will use this table as a model for all future partitions it creates.
This pattern decouples index management from partition creation, making your schema evolution much cleaner.
Implementation
CREATE TABLE public.device_metrics_template (
metric_id BIGINT,
device_uuid UUID,
recorded_at TIMESTAMPTZ,
metric_type TEXT,
payload JSONB
) WITH (autovacuum_enabled = 'false'); -- No data, so no vacuum needed
-- Replicate the exact indexes from the original parent
CREATE INDEX idx_template_device_uuid_recorded_at
ON public.device_metrics_template (device_uuid, recorded_at DESC);
CREATE INDEX idx_template_payload_gin
ON public.device_metrics_template USING GIN (payload);
-- Add any other constraints, foreign keys, etc.
Then, update the partman.part_config table to point to your new template:

UPDATE partman.part_config
SET
template_table = 'public.device_metrics_template',
inherit_privileges = true -- Ensure new partitions get permissions from the parent
WHERE parent_table = 'public.device_metrics';
Now, every time run_maintenance_proc creates a new daily partition, pg_partman applies the properties defined on public.device_metrics_template to it (in addition to whatever the parent itself propagates), keeping every partition consistent.
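To verify the result, inspect a freshly created child; the partition name below is a hypothetical example following the daily naming pattern used throughout this article:

-- Spot-check the indexes on a newly created child partition
SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'
  AND tablename = 'device_metrics_p2024_03_16'; -- hypothetical partition name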
The `UNIQUE` Index Conundrum (Advanced Edge Case)
There is a critical limitation in PostgreSQL's native partitioning: a UNIQUE or PRIMARY KEY constraint on a partitioned table must include the partition key.
In our device_metrics table, metric_id is the obvious candidate for a primary key. But because we partition by recorded_at, a PRIMARY KEY constraint on metric_id alone is invalid: the database cannot efficiently guarantee uniqueness across all partitions without scanning every single one.
To make it valid, the primary key would have to be (metric_id, recorded_at). This is often not what you want, as it changes the uniqueness guarantee.
Production Workarounds:
- Composite key: Use PRIMARY KEY (metric_id, recorded_at). This is the most performant and database-native solution.
- Application-level uniqueness: Guarantee that metric_id is unique before insertion (here it comes from a single identity column). This offloads the constraint from the database but adds complexity to your code.
- Exotic workarounds: Create a UNIQUE index on a materialized view that only contains the unique key column, or use a trigger-based approach to check for duplicates in a separate lookup table. These solutions add significant overhead and complexity and should be considered last resorts.

For our example, we will assume a composite primary key is acceptable. We would modify the parent and template tables:
-- On the parent table. If you are converting an existing table that already has a
-- single-column primary key, drop it first:
--   ALTER TABLE public.device_metrics DROP CONSTRAINT device_metrics_pkey;
ALTER TABLE public.device_metrics ADD PRIMARY KEY (metric_id, recorded_at);
-- On the template table
ALTER TABLE public.device_metrics_template ADD PRIMARY KEY (metric_id, recorded_at);
Section 4: Automated Retention and Archiving
Storing data forever is rarely feasible or desirable. pg_partman automates data retention, but a naive DROP policy can lead to data loss. A production-grade strategy involves detaching, archiving, and then dropping.
Configuring Retention
Update the configuration to retain 90 days of data:
UPDATE partman.part_config
SET
retention = '90 days',
retention_keep_table = true, -- Do not DROP the table immediately
retention_keep_index = true
WHERE parent_table = 'public.device_metrics';
- retention = '90 days': Partitions whose data range is entirely older than 90 days from the current time will be targeted.
- retention_keep_table = true: This is the crucial setting. Instead of DROPping the old partition, pg_partman will simply detach it from the parent table. The table, e.g., device_metrics_p2023_10_01, becomes a regular, standalone table. It is no longer part of the query plan for the parent device_metrics table.

The Archive-Then-Drop Pattern
Now that old partitions are detached instead of dropped, you can implement a robust archiving workflow.
1. Identify detached tables: find tables that still follow the pg_partman naming convention but are no longer attached to a parent.
2. Archive them: use pg_dump to export the detached table's data to a file. Upload this file to a cold storage solution like Amazon S3, Google Cloud Storage, or Azure Blob Storage. You can use an extension like aws_s3 to do this from within Postgres.
3. Only after the archive is verified, DROP the detached table.

Here is a conceptual PL/pgSQL procedure that could be run by a separate, less frequent cron job:
CREATE OR REPLACE PROCEDURE archive_and_drop_detached_partitions() LANGUAGE plpgsql AS $$
DECLARE
    detached_table_name TEXT;
    dump_command TEXT;
BEGIN
    FOR detached_table_name IN
        SELECT c.relname
        FROM pg_catalog.pg_class c
        JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
        LEFT JOIN pg_catalog.pg_inherits i ON i.inhrelid = c.oid
        WHERE n.nspname = 'public'
          AND c.relkind = 'r'                      -- plain tables only, not their indexes
          AND c.relname LIKE 'device_metrics_p%'
          AND i.inhrelid IS NULL                   -- no longer attached to a parent
    LOOP
        RAISE NOTICE 'Archiving partition: %', detached_table_name;
        -- Example: dump to a file and upload. Requires shell access outside the database;
        -- in a real scenario, use the aws_s3 extension or a dedicated backup tool.
        dump_command := format('pg_dump -t public.%I -f /backups/%s.sql your_db && aws s3 cp /backups/%s.sql s3://your-backup-bucket/',
                               detached_table_name, detached_table_name, detached_table_name);
        -- Conceptual only: core PostgreSQL has no system() function; running a shell
        -- command from SQL requires an untrusted language such as PL/sh or an external runner.
        -- PERFORM system(dump_command);
        RAISE NOTICE 'Archive for % complete. Dropping table.', detached_table_name;
        -- After a successful, verified backup, drop the table.
        -- EXECUTE 'DROP TABLE public.' || quote_ident(detached_table_name);
    END LOOP;
END;
$$;
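Following the same pg_cron pattern used for partition maintenance, the sweep itself can be scheduled; the job name and cadence below are arbitrary examples:

SELECT cron.schedule(
    'pgpartman-archive-sweep',
    '0 3 * * 0', -- 03:00 every Sunday
    $$CALL archive_and_drop_detached_partitions()$$
);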
This two-step detach-then-archive process transforms data retention from a risky DROP operation into a safe, auditable archiving pipeline.
Section 5: Query Performance and Partition Pruning
The entire purpose of partitioning is to improve query performance by enabling partition pruning. This is the process where the PostgreSQL query planner intelligently excludes partitions that cannot possibly contain the data requested by the WHERE clause.
Consider this query:
EXPLAIN ANALYZE
SELECT * FROM public.device_metrics
WHERE recorded_at >= NOW() - INTERVAL '1 hour';
Because the WHERE clause contains a constraint on the partition key (recorded_at), the planner is extremely effective. If you have years of data partitioned by day, the EXPLAIN plan will show that it only scans a single partition: today's.
-- Sample EXPLAIN output (simplified)
Append (cost=0.29..43.44 rows=12 width=104) (actual time=0.032..0.152 rows=312 loops=1)
-> Index Scan using device_metrics_p2024_03_15_pkey on device_metrics_p2024_03_15 ...
Index Cond: (recorded_at >= (now() - '01:00:00'::interval))
Planning Time: 0.315 ms
Execution Time: 0.189 ms
Contrast this with a query on an unpartitioned, billion-row table. Even with an index on recorded_at, the planner would have to traverse a massive B-tree, leading to significantly more I/O and slower execution times.
Pitfall: Preventing Partition Pruning
You can inadvertently write queries that disable the planner's ability to prune partitions. This typically happens when you obscure the partition key value within a function or a type cast.
- GOOD: WHERE recorded_at >= '2024-03-15 00:00:00'::timestamptz
- BAD: WHERE date_trunc('day', recorded_at) = '2024-03-15'
- BAD: WHERE recorded_at::date = '2024-03-15'

In the 'BAD' examples, the planner may not be smart enough to understand that the function's output can be mapped back to a specific partition. It might resort to scanning all partitions and applying the function to every row, defeating the purpose of partitioning. Always filter on the raw partition key column, comparing it against constants or stable expressions; NOW() is STABLE, so the planner can still prune at execution time.
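A quick way to confirm pruning behavior is to compare plans side by side. With this article's schema, the sargable predicate should touch a single child table, while the cast defeats pruning and produces an Append over every partition:

EXPLAIN (COSTS OFF)
SELECT count(*) FROM public.device_metrics
WHERE recorded_at >= '2024-03-15 00:00:00'::timestamptz; -- prunes to one partition

EXPLAIN (COSTS OFF)
SELECT count(*) FROM public.device_metrics
WHERE recorded_at::date = '2024-03-15'; -- scans every partition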
Benchmark Illustration
On a test system with a device_metrics table containing 500 million rows over 2 years:
| Query Scenario | Unpartitioned Table | Partitioned Table (Daily) | Performance Gain |
|---|---|---|---|
| Select 1 hour of data for one device | 12,500 ms | 8 ms | ~1500x |
| Select 1 day of data for all devices | 35,000 ms | 150 ms | ~230x |
| Aggregate (COUNT) over 1 week of data | 98,000 ms | 950 ms | ~100x |
These are illustrative benchmarks. Real-world performance depends on hardware, indexing, and data distribution. The takeaway is clear: for time-bound queries, partition pruning provides orders-of-magnitude performance improvements.
Section 6: Zero-Downtime Migration for Existing Tables
What if you already have a 2TB device_metrics table and want to partition it without taking your application offline? This is a complex but achievable migration.
The Strategy: Dual-Write and Backfill
1. Create the new table: Create a new, empty table, device_metrics_new, and configure it for partitioning with pg_partman as described above.
2. Dual-write new data: Create a trigger on the original device_metrics table. On every INSERT, the trigger function also inserts the same data into device_metrics_new. This ensures that all new, incoming data flows to both tables.

CREATE OR REPLACE FUNCTION duplicate_to_new_metrics()
RETURNS TRIGGER AS $$
BEGIN
INSERT INTO public.device_metrics_new VALUES (NEW.*);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trigger_dual_write_metrics
BEFORE INSERT ON public.device_metrics
FOR EACH ROW EXECUTE FUNCTION duplicate_to_new_metrics();
3. Backfill historical data: Copy older rows from device_metrics into device_metrics_new in batches, one time range at a time, until the new table has caught up.

-- Run this repeatedly for different time ranges
INSERT INTO device_metrics_new
SELECT * FROM device_metrics
WHERE recorded_at >= '2023-01-01 00:00:00' AND recorded_at < '2023-01-02 00:00:00';
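For a multi-terabyte table you will want each batch in its own transaction so the backfill can be resumed safely. Below is a minimal sketch, assuming the composite primary key from Section 3 exists on device_metrics_new, that its partitions were created far enough back (p_start_partition) to cover the history being copied, and PostgreSQL 11+ for COMMIT inside a DO block; the date boundaries are placeholders:

DO $$
DECLARE
    cur     TIMESTAMPTZ := '2023-01-01 00:00:00+00'; -- oldest data to copy (placeholder)
    stop_at TIMESTAMPTZ := '2024-03-01 00:00:00+00'; -- stay behind the dual-write window (placeholder)
BEGIN
    WHILE cur < stop_at LOOP
        INSERT INTO public.device_metrics_new
        SELECT * FROM public.device_metrics
        WHERE recorded_at >= cur AND recorded_at < cur + INTERVAL '1 day'
        ON CONFLICT (metric_id, recorded_at) DO NOTHING; -- makes re-running a batch safe

        RAISE NOTICE 'Backfilled % to %', cur, cur + INTERVAL '1 day';
        cur := cur + INTERVAL '1 day';
        COMMIT; -- one transaction per daily batch (PostgreSQL 11+)
    END LOOP;
END;
$$;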
4. Perform the atomic switchover: Once the backfill is complete, swap the two tables inside a single, short transaction.

BEGIN;
-- Acquire an exclusive lock to prevent any writes during the rename
LOCK TABLE public.device_metrics IN ACCESS EXCLUSIVE MODE;
LOCK TABLE public.device_metrics_new IN ACCESS EXCLUSIVE MODE;
-- Rename the tables
ALTER TABLE public.device_metrics RENAME TO device_metrics_old;
ALTER TABLE public.device_metrics_new RENAME TO device_metrics;
COMMIT;
This transaction is very fast, as a rename is just a metadata change. Your application will experience a brief lock, but not extended downtime.
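Before the final cleanup, it is worth sanity-checking that the dual-write path kept both tables in step. A simple sketch comparing recent row counts over any window the trigger was active for:

SELECT
    (SELECT count(*) FROM public.device_metrics
     WHERE recorded_at >= NOW() - INTERVAL '1 day') AS new_table_rows,
    (SELECT count(*) FROM public.device_metrics_old
     WHERE recorded_at >= NOW() - INTERVAL '1 day') AS old_table_rows;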
5. Clean up: Drop the dual-write trigger, and after a validation period, drop the device_metrics_old table.

Conclusion
PostgreSQL's native partitioning provides the core engine for managing large datasets, but pg_partman provides the essential automation and operational safety required for production systems. By moving beyond basic setup and embracing advanced patterns—proactive partition creation via premake, consistent schema management with template tables, and safe retention policies through a detach-and-archive workflow—you can transform unmanageable, monolithic tables into a highly performant and maintainable architecture. The key is to treat partitioning not as a one-time setup, but as a continuous, automated lifecycle managed by pg_partman and monitored by you.