Advanced PostgreSQL Partitioning: Scaling Time-Series with pg_partman
The Inevitable Scaling Problem with Time-Series Tables
Any system logging events, metrics, or sensor data eventually confronts the monolith table problem. A single events table, growing by millions or billions of rows, becomes a significant bottleneck. INSERT performance degrades due to index bloat, SELECT queries become agonizingly slow without precise time-range scoping, and maintenance operations like VACUUM and ALTER TABLE can lock up the system for hours.
PostgreSQL's native declarative partitioning is the foundational solution. It allows a large table to be physically split into smaller, more manageable child tables (partitions) while being logically queryable as a single entity. The query planner can then perform partition pruning, scanning only the relevant child tables based on the WHERE clause, leading to dramatic performance gains.
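To see the mechanism concretely, here is native range partitioning in its raw form, with a purely illustrative metrics table: every child partition must be created (and eventually dropped) by hand.
-- A minimal native range-partitioned table
CREATE TABLE metrics (
    recorded_at TIMESTAMPTZ NOT NULL,
    value DOUBLE PRECISION
) PARTITION BY RANGE (recorded_at);
-- Every day needs its own explicitly created child partition
CREATE TABLE metrics_2023_10_26 PARTITION OF metrics
    FOR VALUES FROM ('2023-10-26') TO ('2023-10-27');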
However, native partitioning alone is insufficient for a robust, automated production system. It provides the mechanism but not the management. Key operational questions remain:
* Who creates tomorrow's partition before the first row for it arrives?
* Who detaches or drops partitions once they age out of the retention window?
* How do new partitions reliably pick up the right indexes, constraints, and privileges?
This is where pg_partman enters. It's an extension that automates the creation and management of time-based and serial-based table partition sets. But simply installing and running create_parent() is just scratching the surface. This article focuses on the advanced patterns required to leverage pg_partman in demanding, high-throughput production environments.
Section 1: Beyond the Basics - A Production-Ready `pg_partman` Setup
Let's assume we're building a system to ingest high-frequency IoT device events. A naive table might look like this:
CREATE TABLE events (
event_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
device_id UUID NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
payload JSONB
);
CREATE INDEX idx_events_device_id ON events (device_id);
CREATE INDEX idx_events_created_at ON events (created_at DESC);
This schema will fail at scale. A UUID primary key on a multi-terabyte table is inefficient, and the indexes will become enormous. The first step is to design for partitioning from day one.
The Partition Key Constraint: Rethinking Primary Keys
A critical limitation of PostgreSQL partitioning is that any PRIMARY KEY or UNIQUE constraint on a partitioned table must include all columns of the partition key. Our partition key will be created_at. This means event_id alone can no longer be the primary key.
The standard production pattern is to create a composite primary key.
-- The parent table for our partitioned setup
CREATE TABLE events (
event_id UUID NOT NULL DEFAULT gen_random_uuid(),
device_id UUID NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
payload JSONB,
PRIMARY KEY (created_at, event_id) -- Partition key first is a common convention
) PARTITION BY RANGE (created_at);
-- We still need fast lookups by device_id
CREATE INDEX idx_events_device_id_created_at ON events (device_id, created_at DESC);
By including created_at in the primary key, we satisfy PostgreSQL's constraint. This has implications for your application logic. Uniqueness is now guaranteed on the combination of created_at and event_id. While collisions on this composite key are astronomically unlikely, it's a fundamental shift from a simple UUID primary key.
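To see the practical consequence, compare a lookup by event_id alone with one that also carries the timestamp; the UUID and dates below are placeholders. Without the partition key in the predicate, the planner has nothing to prune on and must check every partition.
-- No partition key in the predicate: every partition is checked.
EXPLAIN SELECT * FROM events
WHERE event_id = '00000000-0000-0000-0000-000000000000';
-- Carrying the event's timestamp (or at least its day) restores pruning.
EXPLAIN SELECT * FROM events
WHERE event_id = '00000000-0000-0000-0000-000000000000'
  AND created_at >= '2023-10-25' AND created_at < '2023-10-26';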
Code Example 1: Detailed `create_parent()` Configuration
Now, let's configure pg_partman to manage this table. We won't just set an interval; we'll configure it for a high-write production workload.
-- Ensure the extension is installed
CREATE EXTENSION IF NOT EXISTS pg_partman;
-- Configure pg_partman to manage the 'events' table
SELECT partman.create_parent(
p_parent_table := 'public.events',
p_control := 'created_at',
p_type := 'native',
p_interval := 'daily',
p_premake := 14, -- Create 14 days of partitions in advance
p_start_partition := (now() - interval '7 day')::text,
p_jobmon := 'on' -- Enable logging via the pg_jobmon extension
);
Let's break down the advanced considerations here:
* p_type := 'native': Always use this for modern PostgreSQL (11+). It leverages declarative partitioning, which is superior to the old trigger-based method.
* p_premake := 14: This is a critical tuning parameter. It tells pg_partman to maintain 14 future daily partitions at all times. For a high-throughput system, if your maintenance job fails to run for a day, you don't want writes to start failing because a partition for the current day doesn't exist. The premake value is your buffer. The trade-off is storage overhead for empty tables vs. write availability. A value between 7 and 30 is common, depending on your operational confidence and data velocity.
* p_start_partition: Explicitly setting a start partition is crucial if you are migrating existing data. If omitted, pg_partman might create partitions from the earliest data in the table, which could be years ago, creating thousands of unnecessary partitions.
* p_jobmon := 'on': In production, you *must* have insight into pg_partman's maintenance runs. pg_jobmon (another extension by the same author) provides detailed logging of run times, steps, and failures. Without this, you are flying blind.
After running this, pg_partman creates the necessary child partitions (events_p2023_10_26, events_p2023_10_27, etc.) and automatically attaches them to the parent events table. All indexes from the parent are automatically replicated to the children.
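One thing create_parent() does not do is schedule itself: run_maintenance_proc() must be called periodically to create the premade partitions and enforce retention. A minimal sketch, assuming the pg_cron extension is installed (a plain cron entry invoking psql works just as well):
-- Call pg_partman's maintenance hourly via pg_cron
SELECT cron.schedule(
    'partman-maintenance',
    '15 * * * *',  -- every hour at minute 15
    $$CALL partman.run_maintenance_proc()$$
);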
Section 2: Advanced Data Retention and Archiving Patterns
pg_partman provides a simple mechanism for data retention via the retention setting in its part_config table and the run_maintenance_proc() function. It will drop partitions older than the specified interval. In a production system with valuable data, automatically dropping data is extremely dangerous.
A network hiccup, a misconfiguration, or a bug could lead to permanent data loss. The professional pattern is to separate the detachment of a partition from its destruction.
Our strategy will be:
- Configure pg_partman to detach old partitions, but not drop them.
- Implement a separate, idempotent process to archive the detached partitions to cold storage (e.g., AWS S3).
- Only after successful archival does the process drop the table.
Step 1: Configure `pg_partman` for Detach-Only
Update the part_config table to disable automatic dropping.
UPDATE partman.part_config
SET
retention = '30 days',
retention_keep_table = 'true',
retention_keep_index = 'true'
WHERE parent_table = 'public.events';
* retention = '30 days': We still define the retention window.
* retention_keep_table = 'true': This is the key. When run_maintenance_proc() runs, it will detach any partition whose data is entirely older than 30 days, but it will leave the table (e.g., public.events_p2023_09_25) in the database as a standalone table.
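A quick sanity check after the UPDATE confirms what the next maintenance run will actually do:
-- Verify the retention configuration for the events partition set
SELECT parent_table, retention, retention_keep_table, retention_keep_index
FROM partman.part_config
WHERE parent_table = 'public.events';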
Code Example 2: A Robust Archival and Cleanup Script
This script would be run by an external scheduler (like cron or an Airflow DAG) after the pg_partman maintenance job.
#!/bin/bash
set -e
set -o pipefail
PG_HOST="your_db_host"
PG_USER="your_db_user"
PG_DB="your_db_name"
S3_BUCKET="s3://your-company-pg-archives"
# Find detached partitions. pg_partman leaves them in the parent's schema.
# We look for tables that are no longer in pg_inherits.
DETACHED_TABLES=$(psql -h $PG_HOST -U $PG_USER -d $PG_DB -t -c "
SELECT c.relname
FROM pg_class c
LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
AND n.nspname = 'public'
AND c.relname ~ '^events_p[0-9]{4}_[0-9]{2}_[0-9]{2}$'
AND NOT EXISTS (
SELECT 1 FROM pg_inherits WHERE inhrelid = c.oid
);
")
if [ -z "$DETACHED_TABLES" ]; then
echo "No detached partitions found to archive."
exit 0
fi
echo "Found detached partitions: $DETACHED_TABLES"
for TABLE_NAME in $DETACHED_TABLES; do
echo "Archiving table: $TABLE_NAME"
FILENAME="${TABLE_NAME}_$(date +%Y%m%d%H%M%S).sql.gz"
S3_PATH="${S3_BUCKET}/${PG_DB}/${TABLE_NAME}/${FILENAME}"
# 1. Archive the table to S3. Under 'set -e' a failed pipeline would abort the
#    script before a separate '$?' check could run, so test the pipeline
#    directly; 'set -o pipefail' makes a failure in pg_dump, gzip, or aws cp
#    fail the whole pipe.
if ! pg_dump -h $PG_HOST -U $PG_USER -d $PG_DB -t "public.${TABLE_NAME}" | gzip | aws s3 cp - "$S3_PATH"; then
echo "ERROR: Archival of $TABLE_NAME to $S3_PATH failed. Aborting drop."
exit 1
fi
echo "Successfully archived $TABLE_NAME to $S3_PATH"
# 2. Verify the archive exists (optional but recommended)
if ! aws s3 ls "$S3_PATH" > /dev/null; then
echo "ERROR: Verification of archive $S3_PATH failed. Aborting drop."
exit 1
fi
# 3. Drop the table only after successful archival and verification
echo "Dropping table $TABLE_NAME..."
psql -h $PG_HOST -U $PG_USER -d $PG_DB -c "DROP TABLE public.\"${TABLE_NAME}\";"
echo "Successfully dropped table $TABLE_NAME."
done
echo "Archival and cleanup process complete."
This script introduces a safety layer. Data is only dropped after it's securely backed up. It's idempotent; if the script fails midway (e.g., the S3 upload is interrupted), the next run will find the same detached table and try again. The detached table acts as a buffer or staging area for archival.
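The flow is also reversible. If an archived day ever needs to be queried again, restore the dump (which recreates the standalone table) and re-attach it to the parent; the partition name and bounds below are illustrative.
-- Re-attach a restored partition; PostgreSQL validates that all rows fall
-- within the declared bounds during ATTACH.
ALTER TABLE public.events
    ATTACH PARTITION public.events_p2023_09_25
    FOR VALUES FROM ('2023-09-25 00:00:00+00') TO ('2023-09-26 00:00:00+00');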
Section 3: Evolving Schemas and Indexes with Template Tables
As a system matures, its data access patterns change. Perhaps for recent data (last 7 days), you need fast key-value lookups on a specific field in the payload JSONB, but for older data, this index is just costly overhead.
pg_partman allows you to specify a template_table for newly created partitions. This provides a powerful mechanism for managing heterogeneous index strategies across your partition set.
Scenario: A GIN index for recent data only
Let's say we want a GIN index on payload for fast searching, but only on partitions for the next 7 days, as ad-hoc queries primarily target recent events.
Code Example 3: Implementing a Template Table
-- This table will never hold data, it's just a template for structure.
CREATE TABLE events_template (
event_id UUID NOT NULL,
device_id UUID NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
payload JSONB,
PRIMARY KEY (created_at, event_id)
);
-- Add the standard indexes that ALL partitions need
CREATE INDEX idx_events_template_device_id_created_at ON events_template (device_id, created_at DESC);
-- Add the special GIN index only for new partitions
CREATE INDEX idx_events_template_payload_gin ON events_template USING gin (payload jsonb_path_ops);
UPDATE partman.part_config
SET template_table = 'public.events_template'
WHERE parent_table = 'public.events';
Now, whenever pg_partman's maintenance job runs and creates new partitions, it will use events_template to define the structure, including the special GIN index. Existing partitions are unaffected.
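A quick way to confirm the rollout is to list which child partitions actually carry a GIN index; only partitions created after the template change should appear.
-- Which partitions of 'events' have a GIN index?
SELECT tablename, indexname
FROM pg_indexes
WHERE schemaname = 'public'
  AND tablename LIKE 'events_p%'
  AND indexdef ILIKE '%USING gin%'
ORDER BY tablename;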
Edge Case: Removing the index from older partitions
This setup only applies the index to new partitions. What about the partitions that are now older than 7 days but still have the expensive GIN index? You need a cleanup process.
One wrinkle: DROP INDEX CONCURRENTLY cannot be executed from inside a function or procedure, so rather than wrapping the logic in PL/pgSQL, it is cleaner to generate the DROP statements with a query and let psql execute each one at the top level via \gexec. The following script, run daily by a scheduler, does exactly that.
-- drop_old_gin_indexes.sql
-- Finds child partitions of 'events' older than the 7-day window that still
-- carry a GIN index, and emits one DROP INDEX CONCURRENTLY statement per index.
SELECT format('DROP INDEX CONCURRENTLY %I.%I', i.schemaname, i.indexname)
FROM pg_inherits inh
JOIN pg_class child  ON child.oid = inh.inhrelid
JOIN pg_class parent ON parent.oid = inh.inhparent
JOIN pg_namespace n  ON n.oid = child.relnamespace
JOIN pg_indexes i    ON i.schemaname = n.nspname AND i.tablename = child.relname
WHERE parent.relname = 'events'
  AND n.nspname = 'public'
  -- Only the GIN index that came from the template table; indexes inherited
  -- from the parent are attached to a partitioned index and cannot be dropped
  -- individually anyway.
  AND i.indexdef ILIKE '%USING gin%'
  -- Partitions whose date suffix is older than the cutoff
  AND to_date(substring(child.relname from 'events_p(\d{4}_\d{2}_\d{2})'), 'YYYY_MM_DD')
      < (now() - interval '7 days')::date
\gexec
-- Run daily from your scheduler, e.g.:
-- psql -h your_db_host -U your_db_user -d your_db_name -f drop_old_gin_indexes.sql
Performance Note: DROP INDEX CONCURRENTLY is used to avoid locking the partition while the index is being removed. This is critical in a production environment.
This combination of a template table for creation and a cleanup procedure for removal allows you to maintain a rolling window of specialized indexes, optimizing for both recent query performance and long-term storage costs.
Section 4: Performance Tuning and Verifying Partition Pruning
Partitioning is useless if the query planner doesn't use it. Partition pruning is the process where the planner intelligently excludes partitions from a query plan based on the WHERE clause. If this fails, the database may still scan all child tables, negating the benefits.
Common Pitfalls that Break Pruning
* Applying functions to the partition column in WHERE clauses:
  * Bad: WHERE date_trunc('day', created_at) = '2023-10-26'
  * Why: The planner cannot guarantee that the result of date_trunc will fall within a specific partition's range. It must check all partitions.
  * Good: WHERE created_at >= '2023-10-26 00:00:00' AND created_at < '2023-10-27 00:00:00'
  * Why: This provides a direct, analyzable range that the planner can map directly to one or more partitions.
* Comparing the partition column against expressions not resolved at plan time:
  * Bad: WHERE created_at > (now() - interval '1 hour') if now() is not treated as a stable value during planning.
  * Why: The planner might not evaluate now() at plan time, forcing a scan of all partitions that *could* match.
  * Good: Pre-calculating the timestamp in your application and passing it as a literal parameter.
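Before reading plans, it is also worth ruling out the trivial case: partition pruning can be disabled instance-wide by a configuration setting.
-- Pruning is on by default, but confirm it has not been switched off
SHOW enable_partition_pruning;  -- expected: on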
Code Example 4: Analyzing `EXPLAIN` Plans
Let's analyze the difference. Assume we have hundreds of daily partitions.
Query 1: Correct Pruning
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM events
WHERE created_at >= '2023-10-25 00:00:00' AND created_at < '2023-10-26 00:00:00';
The EXPLAIN output will look something like this:
Finalize Aggregate (cost=4350.93..4350.94 rows=1 width=8) (actual time=24.571..24.572 rows=1 loops=1)
Buffers: shared hit=1234
-> Gather (cost=4350.72..4350.93 rows=2 width=8) (actual time=24.542..24.568 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=1234
-> Partial Aggregate (cost=3350.72..3350.73 rows=1 width=8) (actual time=21.169..21.170 rows=1 loops=3)
Buffers: shared hit=1234
-> Parallel Append (cost=0.00..3210.50 rows=56088 width=0) (actual time=0.027..17.891 rows=45123 loops=3)
Buffers: shared hit=1234
-> Parallel Index Only Scan using events_p2023_10_25_pkey on events_p2023_10_25 ...
(cost=0.00..2890.00 rows=56088 width=0) (actual time=0.023..15.432 rows=45123 loops=3)
Buffers: shared hit=1234
The most important part is the last line: Parallel Index Only Scan using ... on events_p2023_10_25. The planner correctly identified that only one partition (events_p2023_10_25) needed to be scanned. This is a successful prune.
Query 2: Pruning Failure
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM events
WHERE date_trunc('day', created_at) = '2023-10-25'::date;
The EXPLAIN output will be drastically different and much longer:
... (Aggregate and Gather nodes) ...
-> Append (cost=0.00..543210.98 rows=1234567 width=8)
-> Index Scan using events_p2023_09_01_pkey on events_p2023_09_01 ...
Filter: (date_trunc('day'::text, created_at) = '2023-10-25 00:00:00'::timestamp with time zone)
-> Index Scan using events_p2023_09_02_pkey on events_p2023_09_02 ...
Filter: (date_trunc('day'::text, created_at) = '2023-10-25 00:00:00'::timestamp with time zone)
-> Index Scan using events_p2023_09_03_pkey on events_p2023_09_03 ...
... (and so on for every single partition) ...
The key indicator of failure is the Append node followed by scans on every single partition. The planner could not prune, so it must check each one, applying the expensive date_trunc function to every row. The performance difference between these two queries on a large dataset can be orders of magnitude.
Monitoring for Pruning Issues
* Enable auto_explain: For a subset of slow queries, auto_explain can log the execution plans, allowing you to proactively find queries that are failing to prune (a minimal session-level setup is sketched after this list).
* Review Query Logs: Regularly review your slow query logs (log_min_duration_statement). Look for queries against your partitioned tables that have unexpectedly high execution times. Manually run EXPLAIN on them to diagnose pruning issues.
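As a minimal session-level auto_explain setup (for server-wide logging, load the module via shared_preload_libraries in postgresql.conf instead; the thresholds here are illustrative):
-- Load auto_explain for this session and log plans of slow queries
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '1s';  -- log any query slower than 1 second
SET auto_explain.log_analyze = on;         -- include actual rows and timings
-- A failed prune now shows up in the log as an Append over every partition.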
Conclusion: From Mechanism to Strategy
PostgreSQL partitioning, supercharged with pg_partman, is a formidable solution for scaling time-series data. However, achieving genuine production resilience and performance requires moving beyond the initial setup and adopting a strategic approach.
Senior engineers must:
* Design the schema for partitioning from day one, accepting the composite primary key the partition key requires.
* Treat retention as an archival workflow, not an automatic drop; decoupling detachment from destruction with pg_partman's detach feature is the professional standard.
* Manage indexes as a rolling window, using template tables for new partitions and scheduled cleanup for old ones.
* Continuously verify EXPLAIN plans to ensure your queries are efficient and your database is not performing unnecessary work.

By embracing these advanced patterns, you can transform a potentially unmanageable, monolithic table into a highly performant, scalable, and operationally robust system capable of handling petabytes of data for years to come.