PostgreSQL Partitioning at Scale with `pg_partman` for Time-Series Data
The Inevitable Scaling Failure of Monolithic Time-Series Tables
As a senior engineer, you've likely watched this play out: a service launches with a simple events or metrics table in PostgreSQL. Initially, it performs beautifully. Months later, with billions of rows accumulated, the system grinds to a halt. INSERT latency becomes unpredictable, VACUUM runs for hours while consuming significant I/O, index maintenance is a high-risk operation, and even simple time-range queries take minutes instead of milliseconds. Deleting old data with DELETE FROM ... WHERE created_at < ... holds a long-running transaction, takes locks on every matching row, and generates massive WAL traffic, risking a production outage.
This isn't a failure of PostgreSQL; it's a failure of data modeling at scale. The solution is table partitioning. While PostgreSQL has offered native declarative partitioning since version 10, managing the lifecycle of partitions—creating new ones for upcoming data and archiving or dropping old ones—remains a manual, error-prone task.
This is the precise problem pg_partman solves. It's not a magical extension that changes how partitioning works; it's a sophisticated automation and management layer built on top of PostgreSQL's native capabilities. In this post, we will move beyond the basics and implement a production-grade partitioning strategy for a high-volume time-series table, focusing on the nuances and advanced patterns required for mission-critical systems.
Our scenario will be an IoT platform logging millions of events per hour from various devices.
Baseline: The Problem Schema and its Performance Degradation
Let's start with the problematic, unpartitioned table structure.
CREATE TABLE events (
id BIGSERIAL PRIMARY KEY,
device_id UUID NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
event_type VARCHAR(50) NOT NULL,
payload JSONB
);
CREATE INDEX idx_events_created_at ON events (created_at DESC);
CREATE INDEX idx_events_device_id_created_at ON events (device_id, created_at DESC);
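A quick way to watch this table's on-disk footprint (heap plus indexes and TOAST) as it grows:
SELECT pg_size_pretty(pg_total_relation_size('events')) AS total_size,
       pg_size_pretty(pg_relation_size('events'))       AS heap_only;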
After a year of operation, this table contains 5 billion rows and is approximately 2TB in size. Let's analyze a common query: fetching the last 24 hours of events for a specific device.
EXPLAIN ANALYZE
SELECT id, event_type, payload, created_at
FROM events
WHERE device_id = 'a1e8e5e8-0b5b-4c8f-a2b1-9b9b76d4a1c1'
AND created_at >= NOW() - INTERVAL '1 day'
ORDER BY created_at DESC;
Even with the compound index idx_events_device_id_created_at, the query plan will look something like this:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_events_device_id_created_at on events (cost=0.56..54321.23 rows=12345 width=150) (actual time=0.123..15432.123 rows=15000 loops=1)
Index Cond: ((device_id = 'a1e8e5e8-0b5b-4c8f-a2b1-9b9b76d4a1c1') AND (created_at >= (now() - '1 day'::interval)))
Planning Time: 0.234 ms
Execution Time: 15435.456 ms
A 15-second execution time is unacceptable. The core issue is that the B-tree index is enormous. Traversing this massive index to find the relatively few rows for the last day is inefficient. The database must navigate through terabytes of data, much of which is cold and irrelevant to this query, causing significant buffer cache churn.
Dropping old data is even more catastrophic:
DELETE FROM events WHERE created_at < NOW() - INTERVAL '90 days';
This command runs as one enormous transaction: it takes a row-level lock on every matching row, generates a huge volume of WAL, and leaves behind billions of dead tuples that demand an aggressive VACUUM, further degrading performance.
Production Implementation with `pg_partman`
We will now architect a solution using native partitioning automated by pg_partman.
Step 1: Environment Setup and Schema Redesign
First, ensure pg_partman is available. The core extension does not need to be preloaded; only its optional background worker (pg_partman_bgw) must be added to shared_preload_libraries, and since we will drive maintenance with pg_cron below, we don't need it. On most managed database services (like AWS RDS), installation is simply a matter of creating the extension, conventionally in a dedicated schema:
CREATE SCHEMA partman;
CREATE EXTENSION pg_partman SCHEMA partman;
The most critical change is to our table schema. With native partitioning, the partition key must be included in the primary key and in every unique constraint on the table. This is a non-negotiable constraint that often trips up teams migrating existing tables, and it means the database no longer enforces global uniqueness of id on its own, only of the (id, created_at) pair.
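To see the constraint in action, try keeping the old single-column primary key; PostgreSQL refuses it with an error along these lines (table name purely illustrative):
-- This fails: the PRIMARY KEY does not cover the partition key.
CREATE TABLE events_bad (
    id BIGSERIAL PRIMARY KEY,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
)
PARTITION BY RANGE (created_at);
-- ERROR:  unique constraint on partitioned table must include all partitioning columns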
Our new schema:
CREATE TABLE events (
id BIGSERIAL,
device_id UUID NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
event_type VARCHAR(50) NOT NULL,
payload JSONB,
PRIMARY KEY (id, created_at) -- Partition key is now part of the PK
)
PARTITION BY RANGE (created_at);
Notice the PARTITION BY RANGE (created_at) clause. This declares events as the partitioned table (the "parent"). It will not hold any data itself but will act as a proxy for its child partitions.
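For context, this is the kind of native DDL pg_partman will issue on our behalf for every child; a hand-written sketch of a single daily partition (name and bounds illustrative):
CREATE TABLE events_p2024_10_28
    PARTITION OF events
    FOR VALUES FROM ('2024-10-28 00:00:00+00') TO ('2024-10-29 00:00:00+00');
Writing and scheduling this statement by hand for every day, forever, is precisely the toil we are about to automate.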
Step 2: Advanced `create_parent` Configuration
This is where we instruct pg_partman on how to manage our events table. We'll go beyond a basic setup and use parameters crucial for high-performance environments.
-- Create a role for partman to run its background tasks if desired
-- CREATE ROLE partman_role;
-- Run the main configuration function.
-- Note: the template table referenced below is defined in Step 3 and must
-- exist before this call.
SELECT partman.create_parent(
p_parent_table := 'public.events',
p_control := 'created_at',
p_type := 'native',
p_interval := '1 day',
p_premake := 7,
p_start_partition := (NOW() - INTERVAL '3 days')::text,
p_template_table := 'public.events_template',
p_jobmon := 'on'
);
Let's dissect these parameters from a senior engineer's perspective:
* p_parent_table: public.events - The target partitioned table.
* p_control: created_at - The column to partition by.
* p_type: 'native' - CRITICAL. We are explicitly using PostgreSQL's declarative partitioning. The older, trigger-based method (partman) has significant overhead and should be avoided for new projects; pg_partman 5+ removes it entirely, so this example targets the widely deployed 4.x API.
* p_interval: '1 day' - The heart of the strategy. We're creating one new table per day.
* Trade-offs: A shorter interval (e.g., '1 hour') means smaller, more manageable tables and indexes, faster DROP operations, and quicker VACUUMs per partition. However, it can lead to a very high number of tables, which can slightly increase query planning time and strain metadata caches. A longer interval (e.g., '1 week') reduces table count but leads to larger partitions that may eventually exhibit the same problems as the original monolithic table. For high-volume IoT data, daily is a common and effective starting point.
* p_premake: 7 - Performance critical. This tells pg_partman to always maintain 7 partitions for future data. If today is Monday, it ensures partitions for Tuesday, Wednesday, ..., through next Monday already exist. This prevents INSERT statements from being the ones to trigger partition creation, which would introduce significant write latency. In a high-throughput system, this is non-negotiable.
* p_start_partition: We're telling it to create partitions starting from 3 days ago. This is useful if we have recent data to load immediately.
* p_template_table: public.events_template - The key to sane index and constraint management. We will define this next. It's a blueprint for all future partitions.
* p_jobmon: 'on' - Highly recommended. pg_partman integrates with the companion pg_jobmon extension, which tracks every run of its maintenance procedures, logging successes, failures, and timings. In production, this is invaluable for debugging and monitoring automation (it does require pg_jobmon to be installed).
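Once create_parent returns, it's worth confirming that the premade children actually exist. A plain catalog query does the job (child names follow pg_partman's daily suffix convention):
SELECT inhrelid::regclass AS partition_name
FROM pg_inherits
WHERE inhparent = 'public.events'::regclass
ORDER BY partition_name::text;
-- pg_partman also ships a helper with the same purpose, e.g. partman.show_partitions('public.events').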
Step 3: Indexing Strategy and the Template Table
On PostgreSQL 11 and later, ordinary indexes declared on the partitioned parent are propagated to every existing and future child automatically, so your regular query indexes belong directly on events. The template table covers the properties that the parent cannot carry and that pg_partman applies to each new child for you, most notably unique indexes that omit the partition key; without it, those would have to be added to every new partition by hand, which is unmanageable.
-- Create a template table with the EXACT same structure as the parent.
-- This table will never hold data.
CREATE TABLE events_template (
id BIGINT,
device_id UUID,
created_at TIMESTAMPTZ,
event_type VARCHAR(50),
payload JSONB
);
-- On PostgreSQL 11+, ordinary indexes go on the partitioned parent itself;
-- PostgreSQL cascades them to every existing and future child automatically.
CREATE INDEX idx_events_device_id_created_at ON events (device_id, created_at DESC);
-- Reserve the template for properties the parent cannot carry, e.g. a unique
-- index that does not include the partition key (enforced per partition):
-- CREATE UNIQUE INDEX events_template_id_idx ON events_template (id);
-- Optional: Add other constraints or properties
-- ALTER TABLE events_template ADD CONSTRAINT ...;
Now, when pg_partman creates events_p2024_10_28, it arrives with a child copy of idx_events_device_id_created_at (auto-named along the lines of events_p2024_10_28_device_id_created_at_idx) plus anything defined on the template. This pattern is fundamental to maintaining query performance over time.
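You can spot-check any freshly created child to confirm its indexes are in place (partition name illustrative):
SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'
  AND tablename = 'events_p2024_10_28';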
Step 4: Automating Maintenance and Retention
Partitions must be created and dropped on a schedule. We use partman.run_maintenance_proc() for this, and pg_cron to automate it.
First, let's configure the retention policy in pg_partman's configuration table.
UPDATE partman.part_config
SET
retention = '90 days',
retention_keep_table = false,
retention_keep_index = false
WHERE parent_table = 'public.events';
* retention = '90 days': Partitions containing data older than 90 days from the current date will be targeted for removal.
* retention_keep_table = false: CRITICAL. This tells pg_partman to DROP the table entirely. The alternative, true, detaches the partition from the parent and leaves it as a standalone table, which is what you'd want when archiving to cold storage (e.g., COPY the detached table out to S3, then drop it; a minimal sketch of that flow follows below). For simple data purging, false is what we want: dropping a partition is a near-instantaneous operation with none of the locking or WAL overhead of a mass DELETE.
* retention_keep_index = false: Only meaningful when retention_keep_table = true, where it controls whether a detached partition keeps its indexes. With tables being dropped outright it has no effect, but setting it explicitly documents intent.
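For completeness, here is a minimal sketch of that archive-first variant, assuming retention_keep_table = true, a psql session, and an illustrative partition name and file path:
-- After maintenance detaches the expired partition, it survives as a standalone table.
-- Export it somewhere durable, then drop it by hand.
\copy events_p2024_07_01 TO '/archive/events_p2024_07_01.csv' WITH (FORMAT csv)
DROP TABLE events_p2024_07_01;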
Now, let's schedule the maintenance job. This should run more frequently than your partition interval. For daily partitions, running it hourly is a robust choice.
-- Ensure pg_cron is enabled in your instance
-- CREATE EXTENSION pg_cron;
-- Schedule the maintenance procedure to run at the top of every hour.
-- This will create new partitions and drop old ones based on our config.
SELECT cron.schedule('hourly-partition-maintenance', '0 * * * *', 'CALL partman.run_maintenance_proc()');
The CALL syntax is required because run_maintenance_proc() is a procedure rather than a function; it commits as it works through each partition set, keeping individual maintenance transactions short. This single cron job now handles the entire lifecycle of our events table partitions.
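It's worth verifying that the job actually runs and succeeds. If your pg_cron version records run history (1.4+), a quick health check looks something like this:
SELECT jobid, status, return_message, start_time, end_time
FROM cron.job_run_details
ORDER BY start_time DESC
LIMIT 5;
If you enabled jobmon, it records pg_partman's own view of each maintenance run as well.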
Performance Analysis: The Payoff
Let's re-run our query from before on the newly partitioned table.
EXPLAIN ANALYZE
SELECT id, event_type, payload, created_at
FROM events
WHERE device_id = 'a1e8e5e8-0b5b-4c8f-a2b1-9b9b76d4a1c1'
AND created_at >= NOW() - INTERVAL '1 day'
ORDER BY created_at DESC;
The query plan is now radically different and dramatically more efficient.
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Append (cost=0.42..150.78 rows=15 width=150) (actual time=0.045..12.345 rows=15000 loops=1)
-> Index Scan using events_p2024_10_27_device_id_created_at_idx on events_p2024_10_27 (cost=0.42..75.34 rows=7 width=150) (actual time=0.030..5.123 rows=7500 loops=1)
Index Cond: ((device_id = 'a1e8e5e8-0b5b-4c8f-a2b1-9b9b76d4a1c1') AND (created_at >= (now() - '1 day'::interval)))
-> Index Scan using events_p2024_10_28_device_id_created_at_idx on events_p2024_10_28 (cost=0.42..75.34 rows=8 width=150) (actual time=0.015..7.222 rows=7500 loops=1)
Index Cond: ((device_id = 'a1e8e5e8-0b5b-4c8f-a2b1-9b9b76d4a1c1') AND (created_at >= (now() - '1 day'::interval)))
Planning Time: 0.876 ms
Execution Time: 13.123 ms
From 15 seconds to 13 milliseconds.
What happened? The PostgreSQL query planner performed Partition Pruning. Because our WHERE clause contains a constraint on the partition key (created_at), the planner knew that only two partitions could possibly contain the requested data: today's (events_p2024_10_28) and yesterday's (events_p2024_10_27). It completely ignored the other 88 partitions representing the rest of our 90-day retention window. The query now only has to scan two small, highly-cached indexes instead of one monolithic, 2TB index.
This is the fundamental benefit of partitioning for time-series data.
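The flip side is worth remembering: pruning only happens when the partition key is constrained. A query filtered on device_id alone must visit every partition, so keep a created_at predicate in your hot query paths:
-- No created_at predicate: the planner must include every partition in the Append.
EXPLAIN
SELECT count(*)
FROM events
WHERE device_id = 'a1e8e5e8-0b5b-4c8f-a2b1-9b9b76d4a1c1';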
Advanced Edge Case: Backfilling a Multi-Terabyte Live Table
This all sounds great for a new system, but what about the 2TB events table that's already on fire in production? You can't just take a multi-hour outage to migrate the data. This requires a zero-downtime migration strategy.
Here is a battle-tested pattern:
1. Create the new partitioned table (e.g., events_partitioned) and configure pg_partman for it as described above. Do not drop the old table (events_old).
2. Set up dual writes. Add a trigger on events_old that forwards every new INSERT to the new partitioned table. This ensures no new data is lost during the backfill.
CREATE OR REPLACE FUNCTION forward_events_to_partitioned()
RETURNS TRIGGER AS $$
BEGIN
INSERT INTO events_partitioned (id, device_id, created_at, event_type, payload)
VALUES (NEW.id, NEW.device_id, NEW.created_at, NEW.event_type, NEW.payload);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trigger_dual_write_events
BEFORE INSERT ON events_old
FOR EACH ROW EXECUTE FUNCTION forward_events_to_partitioned();
Your application continues to write to events_old. Reads can also continue from events_old for now.
3. Backfill historical data in batches. Copy data from events_old to events_partitioned in small, manageable chunks. This is critical to avoid long-running transactions and overwhelming the database.
-- Conceptual backfill script logic (to be run repeatedly)
DO $$
DECLARE
batch_start timestamptz;
batch_end timestamptz;
BEGIN
-- Find the last backfilled timestamp from a control table
SELECT last_processed_timestamp INTO batch_start FROM backfill_status WHERE table_name = 'events';
batch_end := batch_start - INTERVAL '1 hour';
-- Copy one hour of data
INSERT INTO events_partitioned (id, device_id, created_at, event_type, payload)
SELECT id, device_id, created_at, event_type, payload
FROM events_old
WHERE created_at BETWEEN batch_end AND batch_start
ON CONFLICT DO NOTHING; -- The trigger might have already inserted recent data
-- Update control table
UPDATE backfill_status SET last_processed_timestamp = batch_end WHERE table_name = 'events';
END;
$$;
This process is run in a loop, moving backward in time, until all data is copied. The ON CONFLICT DO NOTHING is a safeguard in case the backfill process overlaps with the dual-write trigger.
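The script above assumes a small control table to track progress. A minimal sketch, with column names matching the script and a seeding choice (the moment the dual-write trigger went live) that is one reasonable assumption:
CREATE TABLE backfill_status (
    table_name TEXT PRIMARY KEY,
    last_processed_timestamp TIMESTAMPTZ NOT NULL
);

-- Seed with the point in time at which the dual-write trigger was enabled,
-- so the batch loop walks backward in time from there.
INSERT INTO backfill_status (table_name, last_processed_timestamp)
VALUES ('events', NOW());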
4. Cut over. Once the backfill has caught up to the point where the dual-write trigger took over:
a. Put the application into a brief maintenance mode.
b. Rename the tables: ALTER TABLE events_old RENAME TO events_archive; then ALTER TABLE events_partitioned RENAME TO events;.
c. Remove the trigger.
d. Bring the application back online.
5. Clean up. Once you're confident the new partitioned events table is working correctly, you can drop events_archive.
This phased approach allows for the migration of enormous tables with minimal impact on live services.
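Before running that final DROP TABLE events_archive, a cheap spot check over the retention window adds confidence (a sketch; adapt to your own verification standards):
SELECT
    (SELECT count(*) FROM events_archive
      WHERE created_at >= NOW() - INTERVAL '90 days') AS rows_in_archive,
    (SELECT count(*) FROM events
      WHERE created_at >= NOW() - INTERVAL '90 days') AS rows_in_partitioned;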
Conclusion: From Reactive Firefighting to Proactive Architecture
Implementing pg_partman is more than a database tuning exercise; it's a fundamental architectural shift for handling time-series data in PostgreSQL. By automating the lifecycle of native partitions, you move from a reactive state of constantly fighting performance fires to a proactive, scalable architecture.
The key takeaways for production systems are:
* The Partition Key MUST be in the Primary/Unique Key: This is the most common blocker and requires careful schema design.
* Use Native Partitioning: p_type := 'native' is the modern, high-performance choice.
* Leverage p_premake: Pre-creating partitions is essential to avoid write latency on high-throughput systems.
* Automate Everything with pg_cron: Schedule run_maintenance_proc() to run reliably and frequently.
* Use Template Tables: This is the only sane way to manage indexes and constraints across dozens or hundreds of partitions.
* Plan for Backfilling: For existing systems, a zero-downtime backfill strategy is a project in itself and must be carefully planned and executed.
By adopting these advanced patterns, you can confidently scale PostgreSQL to handle terabytes of time-series data, achieving query performance and operational stability that rivals specialized time-series databases, all within the rich, familiar ecosystem of PostgreSQL.