Scaling Time-Series Data with PostgreSQL and pg_partman Extensions
The Inevitable Decay of Monolithic Time-Series Tables
As a senior engineer managing a high-throughput system, you've likely witnessed the predictable lifecycle of a monolithic time-series table in PostgreSQL. It begins innocently: a simple events or logs table. Performance is snappy. But as data accumulates at a rate of millions or billions of rows per month, a creeping decay sets in. Query latency for time-range scans climbs with every additional gigabyte of table and index. VACUUM processes become resource hogs, struggling to keep up with dead tuples. Index bloat becomes a constant battle, and the sheer size of the B-tree indexes on your created_at column starts to impact insert performance and memory usage.
This isn't a failure of PostgreSQL; it's a failure of data modeling at scale. The standard solution is table partitioning, a technique PostgreSQL natively supports. However, native partitioning is declarative for the what but procedural for the how. You must manually create new partitions, manage triggers or routing rules, and script the detachment and dropping of old partitions. This manual approach is brittle, error-prone, and adds significant operational overhead.
This is where we move beyond the fundamentals. We're not here to discuss if you should partition, but how to do it in a robust, automated, and performant way in a production environment. We will focus on pg_partman, a PostgreSQL extension that provides a configuration-driven framework for managing time and serial-based partition sets, effectively turning the procedural headache of partition maintenance into a declarative, fire-and-forget setup.
This article will dissect advanced strategies for implementing pg_partman, focusing on:
* Zero-downtime migration of an existing, multi-terabyte live table to a partitioned layout.
* Composite partitioning for multi-dimensional query patterns.
* Performance tuning with template tables and BRIN indexes.
* Operational gotchas: out-of-order data, retention policies, and locking behavior.
The Scenario: A High-Throughput IoT Metrics Platform
Imagine we're building a platform that ingests metrics from millions of IoT devices. Our core table, sensor_readings, receives a constant stream of data.
Here's our initial, unpartitioned schema:
CREATE TABLE sensor_readings (
reading_id BIGSERIAL PRIMARY KEY,
device_id UUID NOT NULL,
reading_time TIMESTAMPTZ NOT NULL DEFAULT NOW(),
metric_type VARCHAR(50) NOT NULL,
payload JSONB
);
CREATE INDEX idx_sensor_readings_device_time ON sensor_readings (device_id, reading_time DESC);
CREATE INDEX idx_sensor_readings_time ON sensor_readings (reading_time DESC);
At 10,000 readings per second, this table grows by over 860 million rows per day. Within a week, we're dealing with a multi-billion row table. A simple query to get the last hour of data for a specific device, which should be fast, starts to degrade:
EXPLAIN ANALYZE SELECT *
FROM sensor_readings
WHERE device_id = 'a1b2c3d4-...'
AND reading_time >= NOW() - INTERVAL '1 hour';
Initially, the planner can efficiently use the composite index. But as the table grows to terabytes, the index itself becomes massive. The planner might still perform an Index Scan, but the number of pages it has to fetch from disk or cache increases dramatically. The cost of traversing the enormous B-tree becomes non-trivial. Maintenance operations like VACUUM and REINDEX become epic, multi-hour (or multi-day) tasks.
Step 1: `pg_partman` Configuration for Declarative Partitioning
Let's solve this problem correctly from the start. We'll assume you have the pg_partman extension installed (CREATE EXTENSION pg_partman;).
Our goal is to partition sensor_readings by reading_time into daily partitions.
1.1. Create the Parent Table
The parent table will be identical to our original, but with the PARTITION BY RANGE clause. Crucially, it will contain no data itself; it acts as a template and a routing point.
CREATE TABLE sensor_readings_partitioned (
reading_id BIGSERIAL NOT NULL,
device_id UUID NOT NULL,
reading_time TIMESTAMPTZ NOT NULL DEFAULT NOW(),
metric_type VARCHAR(50) NOT NULL,
payload JSONB
) PARTITION BY RANGE (reading_time);
-- IMPORTANT: Indexes are created on the parent table and will be propagated to new partitions.
CREATE INDEX idx_srp_device_time ON sensor_readings_partitioned (device_id, reading_time DESC);
CREATE INDEX idx_srp_time ON sensor_readings_partitioned (reading_time DESC);
-- The primary key must include the partition key in PostgreSQL.
-- We will manage uniqueness differently, often at the application layer or with a composite primary key.
-- For this example, we'll drop the simple PRIMARY KEY constraint and handle uniqueness via other means.
Note on Primary Keys: A key limitation in PostgreSQL partitioning is that any PRIMARY KEY or UNIQUE constraint must include all columns from the partition key. Our original PRIMARY KEY (reading_id) is invalid. A common pattern is to use a composite primary key: PRIMARY KEY (reading_id, reading_time). This ensures uniqueness within the constraint of the partitioning scheme.
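For reference, here is a minimal sketch of that composite-key alternative (the constraint name is illustrative; apply it before data and partitions accumulate):
ALTER TABLE sensor_readings_partitioned
    ADD CONSTRAINT sensor_readings_partitioned_pkey
    PRIMARY KEY (reading_id, reading_time);
Because the key includes the partition key column, PostgreSQL will enforce it locally on every child partition.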
1.2. Configure `pg_partman`
Now, we tell pg_partman how to manage this table set. We use the create_parent function.
SELECT partman.create_parent(
p_parent_table := 'public.sensor_readings_partitioned',
p_control := 'reading_time',
p_type := 'native',
p_interval := 'daily',
p_premake := 4,
p_start_partition := (NOW() - INTERVAL '3 days')::text
);
Let's break down these critical parameters:
* p_parent_table: The table to manage.
* p_control: The partition key column.
* p_type := 'native': We're using PostgreSQL's native partitioning, which has been the standard since version 10. The alternative, 'trigger-based', is a legacy method.
* p_interval := 'daily': The heart of the configuration. pg_partman will create one new table per day.
* p_premake := 4: This is a crucial performance optimization. It tells pg_partman to always maintain 4 partitions for future data. If today is Wednesday, it ensures partitions for Thursday, Friday, Saturday, and Sunday already exist. Without it, the first insert after a day boundary would fail (or land in a default partition, if one exists) until maintenance created the new child table. p_premake ensures the table is always ready for incoming data.
* p_start_partition: Allows you to pre-create partitions for historical data if you're migrating.
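After create_parent runs, a quick sanity check against the core catalogs (nothing pg_partman-specific here) shows the premade children:
-- List the child partitions attached to the parent.
SELECT inhrelid::regclass::text AS partition_name
FROM pg_inherits
WHERE inhparent = 'public.sensor_readings_partitioned'::regclass
ORDER BY partition_name;
You should see one partition per historical day requested plus the premade future days.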
1.3. Automating Maintenance
pg_partman doesn't run itself. You need to periodically call its maintenance function, run_maintenance(). This function checks all configured parent tables, creates new partitions as needed (respecting premake), and drops old ones (based on retention policies we'll set later).
The best way to schedule this is with an extension like pg_cron.
-- Install pg_cron if you haven't already
-- Add to postgresql.conf: shared_preload_libraries = 'pg_cron'
-- Then run: CREATE EXTENSION pg_cron;
-- Schedule maintenance to run every hour.
SELECT cron.schedule('pg_partman_maintenance', '0 * * * *', 'CALL partman.run_maintenance_proc()');
Running it hourly is a safe default. The procedure is idempotent and lightweight if there's nothing to do.
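To confirm the job was registered (and, on recent pg_cron versions, to review run history in cron.job_run_details), you can query pg_cron's metadata; a minimal check:
-- The maintenance job should appear with the schedule and command configured above.
SELECT jobid, schedule, command, active
FROM cron.job;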
Advanced Pattern 1: Zero-Downtime Migration of a Live Table
This is the most common and challenging scenario. You have a 5TB sensor_readings table and you can't afford downtime to partition it.
Here's a production-tested, step-by-step strategy:
Phase 1: Setup and Dual-Writing
1. Create the new parent table, sensor_readings_partitioned, and configure it with pg_partman as in Step 1 above.
2. Create a trigger on the old table (sensor_readings) that copies every new INSERT to the new partitioned table.
CREATE OR REPLACE FUNCTION duplicate_to_partitioned_table()
RETURNS TRIGGER AS $$
BEGIN
INSERT INTO sensor_readings_partitioned (reading_id, device_id, reading_time, metric_type, payload)
VALUES (NEW.reading_id, NEW.device_id, NEW.reading_time, NEW.metric_type, NEW.payload);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trigger_dual_write
BEFORE INSERT ON sensor_readings
FOR EACH ROW EXECUTE FUNCTION duplicate_to_partitioned_table();
At this point, all new data is flowing to both tables. Your application is still reading from the old table.
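Before moving on, it's worth sanity-checking the dual-write path; a simple illustrative comparison (the 5-minute window is arbitrary):
-- Recent row counts should be roughly equal in both tables while dual-writing is active.
SELECT
    (SELECT COUNT(*) FROM sensor_readings
     WHERE reading_time >= NOW() - INTERVAL '5 minutes') AS old_table_rows,
    (SELECT COUNT(*) FROM sensor_readings_partitioned
     WHERE reading_time >= NOW() - INTERVAL '5 minutes') AS new_table_rows;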
Phase 2: The Backfill
Now copy the historical data from sensor_readings to sensor_readings_partitioned. Do not do this in a single transaction: it will hold locks for hours and flood your transaction logs. Instead, backfill in small, manageable chunks, stopping at the point where dual-writing began so rows the trigger has already copied aren't duplicated. Here's a sample script to run in a screen or tmux session. It copies one day of data at a time, with a small delay between batches to reduce load.
-- This is a conceptual script. You'd run this from a shell script or admin tool.
DO $$
DECLARE
    -- NOTE: current_date is a reserved word in PostgreSQL, so the loop variable gets a different name.
    chunk_start TIMESTAMPTZ := (SELECT MIN(reading_time) FROM sensor_readings);
    -- Ideally set this to the moment dual-writing began, so already-duplicated rows aren't copied again.
    backfill_end TIMESTAMPTZ := (SELECT MAX(reading_time) FROM sensor_readings);
BEGIN
    WHILE chunk_start < backfill_end LOOP
        RAISE NOTICE 'Backfilling data for %', chunk_start::date;
        INSERT INTO sensor_readings_partitioned (reading_id, device_id, reading_time, metric_type, payload)
        SELECT reading_id, device_id, reading_time, metric_type, payload
        FROM sensor_readings
        WHERE reading_time >= chunk_start AND reading_time < (chunk_start + INTERVAL '1 day');
        -- Commit each chunk (transaction control inside a DO block requires PostgreSQL 11+)
        COMMIT;
        chunk_start := chunk_start + INTERVAL '1 day';
        -- Be a good database citizen
        PERFORM pg_sleep(1);
END LOOP;
END;
$$;
Phase 3: The Switchover
Once the backfill has caught up, the cutover itself is a quick pair of renames inside a single transaction:
-- Start a transaction
BEGIN;
-- Acquire an exclusive lock on both tables to prevent any writes.
LOCK TABLE sensor_readings IN ACCESS EXCLUSIVE MODE;
LOCK TABLE sensor_readings_partitioned IN ACCESS EXCLUSIVE MODE;
-- Rename the tables.
ALTER TABLE sensor_readings RENAME TO sensor_readings_old;
ALTER TABLE sensor_readings_partitioned RENAME TO sensor_readings;
-- End the transaction. The lock is released.
COMMIT;
The entire transaction should take milliseconds. Your application code, which references sensor_readings, now transparently interacts with the new partitioned table.
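One detail the rename alone does not cover: the partitioned table's reading_id values come from its own BIGSERIAL sequence, which still sits near 1, so post-cutover inserts that omit reading_id could reuse IDs the backfill already copied. A minimal sketch of the fix, assuming the renames above have completed:
-- Advance the new table's sequence past the highest ID in the old table.
-- pg_get_serial_sequence resolves the sequence owned by the BIGSERIAL column.
SELECT setval(
    pg_get_serial_sequence('sensor_readings', 'reading_id'),
    (SELECT COALESCE(MAX(reading_id), 1) FROM sensor_readings_old)
);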
Phase 4: Cleanup
Once you've verified the new table is serving reads and writes correctly, remove the dual-write trigger and, when you're comfortable, the old table:
DROP TRIGGER trigger_dual_write ON sensor_readings_old;
DROP TABLE sensor_readings_old;
Advanced Pattern 2: Composite Partitioning for Multi-Dimensional Queries
Our daily partitioning by reading_time is great for time-based scans. But what if a common query pattern is to find all data for a specific device_id within a time range? With only time-based partitioning, the query planner still has to look inside each daily partition and scan the (device_id, reading_time) index.
We can do better with composite partitioning. Let's partition by reading_time (Range) first, and then sub-partition each day by device_id (List or Hash).
This is not directly supported by pg_partman in a single create_parent call, but we can achieve it by applying pg_partman recursively.
Step 1: Create the top-level time partition.
This is the same as before.
-- Already done
CREATE TABLE sensor_readings (...) PARTITION BY RANGE (reading_time);
SELECT partman.create_parent('public.sensor_readings', 'reading_time', 'native', 'daily');
Step 2: Create a trigger to apply sub-partitioning.
We need a mechanism that, whenever pg_partman creates a new daily partition (e.g., sensor_readings_p2024_10_27), it immediately configures that new partition to be a parent for device_id sub-partitions.
pg_partman doesn't have a direct hook for this, so an event trigger is a powerful, if complex, solution. A simpler, more operational approach is to modify the maintenance job.
-- A custom maintenance procedure
CREATE OR REPLACE PROCEDURE custom_run_maintenance()
LANGUAGE plpgsql AS $$
DECLARE
new_partitions TEXT[];
part TEXT;
BEGIN
    -- Run standard pg_partman maintenance first so any new daily partitions exist
    CALL partman.run_maintenance_proc();
    -- Find daily partitions that are not yet configured as sub-partition parents
    -- (the lookup against pg_partman's config tables is illustrative; column names vary by version)
    SELECT array_agg(relname) INTO new_partitions
    FROM pg_class
    WHERE relname LIKE 'sensor_readings_p%'
      AND relname NOT IN (SELECT partition_tablename FROM partman.part_config_sub);
    -- For each unconfigured daily partition, turn it into a sub-partition parent
FOREACH part IN ARRAY new_partitions
LOOP
-- Make the daily partition a parent for HASH partitions on device_id
-- We'll create 16 sub-partitions per day.
EXECUTE format('ALTER TABLE %I DETACH PARTITION ...'); -- This part is complex
-- You would need to alter the table to add the new partition key
-- and then call create_parent on it.
-- This shows the complexity; a better approach is designing it from the start.
END LOOP;
END;
$$;
The above shows the difficulty of retrofitting sub-partitioning onto a live partition set.
A more direct approach: let pg_partman keep managing the primary time-series axis, and design the device_id split in from the start, declaring it with plain native DDL when each daily partition is created. pg_partman itself only manages time- and serial-based partition sets, so the hash dimension stays in your hands, but designing it up front is a far cleaner pattern than bolting it on afterward. A sketch of the hand-built second level follows.
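As a hedged illustration only (the partition name, date range, and 4-way split are made up for the example; in the pg_partman-managed setup the daily partitions would normally be created by the extension):
-- A daily partition that is itself hash-partitioned on device_id.
CREATE TABLE sensor_readings_p2024_10_27
    PARTITION OF sensor_readings
    FOR VALUES FROM ('2024-10-27') TO ('2024-10-28')
    PARTITION BY HASH (device_id);
-- One of its hash children; repeat for REMAINDER 1 through 3.
CREATE TABLE sensor_readings_p2024_10_27_h0
    PARTITION OF sensor_readings_p2024_10_27
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);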
Advanced Pattern 3: Performance Tuning with Template Tables and BRIN Indexes
By default, pg_partman creates new partitions by inheriting indexes from the parent. This is good, but we can have more granular control using a template table.
This allows us to define a specific structure, including indexes and constraints, that is only used for new partitions, independent of the parent.
This is incredibly powerful for time-series data, where we can leverage specialized index types like BRIN (Block Range Index).
A BRIN index is much smaller than a B-tree and is ideal for columns that have a strong natural correlation with their physical storage order, like reading_time. For a 1TB partition, a B-tree index on reading_time could be 50-100GB, while a BRIN index might be a few megabytes.
Step 1: Create the template table.
It must have the exact same column structure as the parent.
CREATE TABLE sensor_readings_template (
reading_id BIGSERIAL NOT NULL,
device_id UUID NOT NULL,
reading_time TIMESTAMPTZ NOT NULL,
metric_type VARCHAR(50) NOT NULL,
payload JSONB
);
-- Define OPTIMIZED indexes for our partitions here
-- B-tree for high-selectivity device_id lookups
CREATE INDEX ON sensor_readings_template (device_id, reading_time DESC);
-- BRIN index for fast time-range scans on the whole partition
CREATE INDEX ON sensor_readings_template USING BRIN (reading_time);
Step 2: Link the template table in part_config.
You can hand this template table to pg_partman either when the partition set is first created (via p_template_table) or by updating its part_config entry afterwards. If you don't supply one at all, pg_partman creates and manages a template table of its own.
-- When calling create_parent initially
SELECT partman.create_parent(
...,
p_template_table := 'public.sensor_readings_template'
);
-- Or update the configuration for an existing setup
UPDATE partman.part_config
SET template_table = 'public.sensor_readings_template'
WHERE parent_table = 'public.sensor_readings_partitioned';
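To confirm the template is linked, read the setting back from the same configuration table:
SELECT parent_table, template_table
FROM partman.part_config
WHERE parent_table = 'public.sensor_readings_partitioned';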
Now, every new partition pg_partman creates will pick up the indexes defined on sensor_readings_template. Keep in mind that under native partitioning any index defined on the parent still propagates to new children automatically, so if you want the template to be the sole source of per-partition indexes, leave the parent unindexed. Done that way, each daily partition carries the B-tree on (device_id, reading_time) plus a tiny BRIN index on reading_time, while the parent itself holds no index at all, saving space and overhead.
Query Performance Analysis
Let's see the impact. A query for a 6-hour window on a multi-terabyte, unpartitioned table might look like this:
EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM sensor_readings
WHERE reading_time BETWEEN '2024-10-26 00:00:00' AND '2024-10-26 06:00:00';
-- Result on unpartitioned table:
-- Index Scan using idx_sensor_readings_time on sensor_readings (cost=0.57..5,834,192.44 rows=250,194,231 width=8)
-- Index Cond: ((reading_time >= '...') AND (reading_time <= '...'))
-- Buffers: shared hit=1234567, read=876543
-- Execution Time: 35,421.123 ms
Now, on the partitioned table:
-- Result on partitioned table:
-- Aggregate (cost=...)
-- -> Append (cost=...)
-- -> Bitmap Heap Scan on sensor_readings_p2024_10_26 (cost=...)
-- Recheck Cond: ((reading_time >= '...') AND (reading_time <= '...'))
-- Buffers: shared hit=54321, read=1234
-- -> Bitmap Index Scan using sensor_readings_p2024_10_26_reading_time_brin_idx (cost=...)
-- Execution Time: 210.456 ms
The key is partition pruning (the modern successor to constraint exclusion for declarative partitioning). The planner knows the query only targets a specific day, so it completely ignores all other partitions. It scans only the single, much smaller sensor_readings_p2024_10_26 table and its tiny BRIN index. The performance difference is often 100x or more.
Operational Gotchas and Edge Cases
1. Out-of-Order Data Ingestion
What happens if a device that was offline for a week suddenly sends its buffered data? Your system tries to insert a record with a reading_time from 7 days ago. If you've already dropped that partition as part of your retention policy, the insert will fail because no suitable partition exists.
* Solution 1 (Default): The insert fails. You must have robust error handling and a dead-letter queue in your application to handle these.
* Solution 2 (The default partition): You can create a DEFAULT partition to catch any data that doesn't fit into the defined partitions. This prevents insert failures but requires a separate process to periodically check the default partition and move data to the correct place (or discard it).
CREATE TABLE sensor_readings_default PARTITION OF sensor_readings_partitioned DEFAULT;
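Whichever option you choose, keep an eye on the default partition; rows landing there usually signal late-arriving or out-of-range data. An illustrative check:
-- Anything here arrived outside the defined partition ranges.
SELECT COUNT(*) AS stray_rows,
       MIN(reading_time) AS oldest,
       MAX(reading_time) AS newest
FROM sensor_readings_default;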
2. Retention Policy Management
pg_partman makes retention trivial. You configure it in the part_config table.
UPDATE partman.part_config SET
retention = '30 days',
retention_keep_table = true,
retention_keep_index = true
WHERE parent_table = 'public.sensor_readings_partitioned';
The next time run_maintenance() runs, it will detach any partition whose data is entirely older than 30 days. DETACH is a near-instant, metadata-only operation. The old table (sensor_readings_p2024_09_26) still exists. You can then back it up to cold storage (e.g., S3) and drop it later.
If you set retention_keep_table = false, pg_partman will DROP the table immediately. This is faster and reclaims disk space but is permanent.
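To see which children the next retention pass will consider, you can read each partition's bounds straight from the catalogs (illustrative; adjust the parent name if you renamed it during migration):
-- List child partitions with their range bounds.
SELECT c.relname AS partition_name,
       pg_get_expr(c.relpartbound, c.oid) AS bounds
FROM pg_inherits i
JOIN pg_class c ON c.oid = i.inhrelid
WHERE i.inhparent = 'public.sensor_readings_partitioned'::regclass
ORDER BY c.relname;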
3. Locking Behavior
run_maintenance() needs to take locks to create or detach partitions. These are generally brief ACCESS EXCLUSIVE locks on the parent table. With p_premake set appropriately, partition creation happens long before it's needed, so it's unlikely to conflict with live traffic.
Dropping partitions can be more intensive. The key is to schedule the maintenance job during low-traffic periods. The hourly schedule with pg_cron is usually fine, as the job does nothing most of the time and is very fast when it does act.
Conclusion: From Reactive DBA to Proactive Architect
By leveraging pg_partman, you elevate your role from a reactive database administrator fighting daily fires of index bloat and slow queries to a proactive architect designing a system that is scalable by default. This declarative approach to partitioning abstracts away the complex, error-prone maintenance scripts and provides a robust framework for managing the entire lifecycle of your time-series data.
The key takeaways for production implementation are:
* Automate Everything: Use pg_partman and pg_cron to fully automate partition creation and retention. Manual intervention should be the exception.
* Pre-create Partitions: Always use the p_premake setting to avoid insert latency and contention.
* Plan Your Migration: For existing large tables, a zero-downtime migration using a dual-write trigger and chunked backfill is a proven, safe strategy.
* Optimize Indexes: Use template tables to apply partition-specific indexes like BRIN, drastically reducing storage overhead and accelerating range scans.
* Monitor and Alert: Monitor for failures in the maintenance job and for data landing in a default partition, as these indicate potential data pipeline issues.
Adopting these advanced patterns transforms PostgreSQL from a simple relational database into a highly performant, scalable time-series engine capable of handling enterprise-level workloads.