PostgreSQL Partitioning at Scale with `pg_partman` for Time-Series Data
The Inevitable Scaling Failure of Monolithic Time-Series Tables
As a senior engineer, you've likely watched this play out: a service launches with a simple events or metrics table in PostgreSQL. Initially, it performs beautifully. Months later, with billions of rows accumulated, the system grinds to a halt. INSERT latency becomes unpredictable, VACUUM runs for hours while consuming significant I/O, index maintenance is a high-risk operation, and even simple time-range queries take minutes instead of milliseconds. Deleting old data with DELETE FROM ... WHERE created_at < ... holds a long-running transaction, takes locks on every matching row, and generates massive WAL traffic, risking a production outage.
This isn't a failure of PostgreSQL; it's a failure of data modeling at scale. The solution is table partitioning. While PostgreSQL has offered native declarative partitioning since version 10, managing the lifecycle of partitions—creating new ones for upcoming data and archiving or dropping old ones—remains a manual, error-prone task.
This is the precise problem pg_partman solves. It's not a magical extension that changes how partitioning works; it's a sophisticated automation and management layer built on top of PostgreSQL's native capabilities. In this post, we will move beyond the basics and implement a production-grade partitioning strategy for a high-volume time-series table, focusing on the nuances and advanced patterns required for mission-critical systems.
Our scenario will be an IoT platform logging millions of events per hour from various devices.
Baseline: The Problem Schema and its Performance Degradation
Let's start with the problematic, unpartitioned table structure.
CREATE TABLE events (
id BIGSERIAL PRIMARY KEY,
device_id UUID NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
event_type VARCHAR(50) NOT NULL,
payload JSONB
);
CREATE INDEX idx_events_created_at ON events (created_at DESC);
CREATE INDEX idx_events_device_id_created_at ON events (device_id, created_at DESC);
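A quick way to watch this table's on-disk footprint (heap plus indexes and TOAST) as it grows:
SELECT pg_size_pretty(pg_total_relation_size('events')) AS total_size,
       pg_size_pretty(pg_relation_size('events'))       AS heap_only;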
After a year of operation, this table contains 5 billion rows and is approximately 2TB in size. Let's analyze a common query: fetching the last 24 hours of events for a specific device.
EXPLAIN ANALYZE
SELECT id, event_type, payload, created_at
FROM events
WHERE device_id = 'a1e8e5e8-0b5b-4c8f-a2b1-9b9b76d4a1c1'
AND created_at >= NOW() - INTERVAL '1 day'
ORDER BY created_at DESC;
Even with the compound index idx_events_device_id_created_at, the query plan will look something like this:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_events_device_id_created_at on events (cost=0.56..54321.23 rows=12345 width=150) (actual time=0.123..15432.123 rows=15000 loops=1)
Index Cond: ((device_id = 'a1e8e5e8-0b5b-4c8f-a2b1-9b9b76d4a1c1') AND (created_at >= (now() - '1 day'::interval)))
Planning Time: 0.234 ms
Execution Time: 15435.456 ms
A 15-second execution time is unacceptable. The core issue is that the B-tree index is enormous. Traversing this massive index to find the relatively few rows for the last day is inefficient. The database must navigate through terabytes of data, much of which is cold and irrelevant to this query, causing significant buffer cache churn.
Dropping old data is even more catastrophic:
DELETE FROM events WHERE created_at < NOW() - INTERVAL '90 days';
This command runs as one enormous transaction: it takes a row-level lock on every matching row, generates a huge volume of WAL, and leaves behind billions of dead tuples that demand an aggressive VACUUM, further degrading performance.
Production Implementation with `pg_partman`
We will now architect a solution using native partitioning automated by pg_partman.
Step 1: Environment Setup and Schema Redesign
First, ensure pg_partman is available. The core extension does not need to be preloaded; only its optional background worker (pg_partman_bgw) must be added to shared_preload_libraries, and since we will drive maintenance with pg_cron below, we don't need it. On most managed database services (like AWS RDS), installation is simply a matter of creating the extension, conventionally in a dedicated schema:
CREATE SCHEMA partman;
CREATE EXTENSION pg_partman SCHEMA partman;
The most critical change is to our table schema. With native partitioning, the partition key must be included in the primary key and in every unique constraint on the table. This is a non-negotiable constraint that often trips up teams migrating existing tables, and it means the database no longer enforces global uniqueness of id on its own, only of the (id, created_at) pair.
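To see the constraint in action, try keeping the old single-column primary key; PostgreSQL refuses it with an error along these lines (table name purely illustrative):
-- This fails: the PRIMARY KEY does not cover the partition key.
CREATE TABLE events_bad (
    id BIGSERIAL PRIMARY KEY,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
)
PARTITION BY RANGE (created_at);
-- ERROR:  unique constraint on partitioned table must include all partitioning columns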
Our new schema:
CREATE TABLE events (
id BIGSERIAL,
device_id UUID NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
event_type VARCHAR(50) NOT NULL,
payload JSONB,
PRIMARY KEY (id, created_at) -- Partition key is now part of the PK
)
PARTITION BY RANGE (created_at);
Notice the PARTITION BY RANGE (created_at) clause. This declares events as the partitioned table (the "parent"). It will not hold any data itself but will act as a proxy for its child partitions.
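For context, this is the kind of native DDL pg_partman will issue on our behalf for every child; a hand-written sketch of a single daily partition (name and bounds illustrative):
CREATE TABLE events_p2024_10_28
    PARTITION OF events
    FOR VALUES FROM ('2024-10-28 00:00:00+00') TO ('2024-10-29 00:00:00+00');
Writing and scheduling this statement by hand for every day, forever, is precisely the toil we are about to automate.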
Step 2: Advanced `create_parent` Configuration
This is where we instruct pg_partman on how to manage our events table. We'll go beyond a basic setup and use parameters crucial for high-performance environments.
-- Create a role for partman to run its background tasks if desired
-- CREATE ROLE partman_role;
-- Run the main configuration function.
-- Note: the template table referenced below is defined in Step 3 and must
-- exist before this call.
SELECT partman.create_parent(
p_parent_table := 'public.events',
p_control := 'created_at',
p_type := 'native',
p_interval := '1 day',
p_premake := 7,
p_start_partition := (NOW() - INTERVAL '3 days')::text,
p_template_table := 'public.events_template',
p_jobmon := 'on'
);
Let's dissect these parameters from a senior engineer's perspective:
* p_parent_table: public.events - The target partitioned table.
* p_control: created_at - The column to partition by.
* p_type: 'native' - CRITICAL. We are explicitly using PostgreSQL's declarative partitioning. The older, trigger-based method (partman) has significant overhead and should be avoided for new projects; pg_partman 5+ removes it entirely, so this example targets the widely deployed 4.x API.
* p_interval: '1 day' - The heart of the strategy. We're creating one new table per day.
* Trade-offs: A shorter interval (e.g., '1 hour') means smaller, more manageable tables and indexes, faster DROP operations, and quicker VACUUMs per partition. However, it can lead to a very high number of tables, which can slightly increase query planning time and strain metadata caches. A longer interval (e.g., '1 week') reduces table count but leads to larger partitions that may eventually exhibit the same problems as the original monolithic table. For high-volume IoT data, daily is a common and effective starting point.
* p_premake: 7 - Performance critical. This tells pg_partman to always maintain 7 partitions for future data. If today is Monday, it ensures partitions for Tuesday, Wednesday, ..., through next Monday already exist. This prevents INSERT statements from being the ones to trigger partition creation, which would introduce significant write latency. In a high-throughput system, this is non-negotiable.
* p_start_partition: We're telling it to create partitions starting from 3 days ago. This is useful if we have recent data to load immediately.
* p_template_table: public.events_template - The key to sane index and constraint management. We will define this next. It's a blueprint for all future partitions.
* p_jobmon: 'on' - Highly recommended. pg_partman integrates with the companion pg_jobmon extension, which tracks every run of its maintenance procedures, logging successes, failures, and timings. In production, this is invaluable for debugging and monitoring automation (it does require pg_jobmon to be installed).
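Once create_parent returns, it's worth confirming that the premade children actually exist. A plain catalog query does the job (child names follow pg_partman's daily suffix convention):
SELECT inhrelid::regclass AS partition_name
FROM pg_inherits
WHERE inhparent = 'public.events'::regclass
ORDER BY partition_name::text;
-- pg_partman also ships a helper with the same purpose, e.g. partman.show_partitions('public.events').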
Step 3: Indexing Strategy and the Template Table
On PostgreSQL 11 and later, ordinary indexes declared on the partitioned parent are propagated to every existing and future child automatically, so your regular query indexes belong directly on events. The template table covers the properties that the parent cannot carry and that pg_partman applies to each new child for you, most notably unique indexes that omit the partition key; without it, those would have to be added to every new partition by hand, which is unmanageable.
-- Create a template table with the EXACT same structure as the parent.
-- This table will never hold data.
CREATE TABLE events_template (
id BIGINT,
device_id UUID,
created_at TIMESTAMPTZ,
event_type VARCHAR(50),
payload JSONB
);
-- On PostgreSQL 11+, ordinary indexes go on the partitioned parent itself;
-- PostgreSQL cascades them to every existing and future child automatically.
CREATE INDEX idx_events_device_id_created_at ON events (device_id, created_at DESC);
-- Reserve the template for properties the parent cannot carry, e.g. a unique
-- index that does not include the partition key (enforced per partition):
-- CREATE UNIQUE INDEX events_template_id_idx ON events_template (id);
-- Optional: Add other constraints or properties
-- ALTER TABLE events_template ADD CONSTRAINT ...;
Now, when pg_partman creates events_p2024_10_28, it arrives with a child copy of idx_events_device_id_created_at (auto-named along the lines of events_p2024_10_28_device_id_created_at_idx) plus anything defined on the template. This pattern is fundamental to maintaining query performance over time.
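You can spot-check any freshly created child to confirm its indexes are in place (partition name illustrative):
SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'
  AND tablename = 'events_p2024_10_28';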
Step 4: Automating Maintenance and Retention
Partitions must be created and dropped on a schedule. We use partman.run_maintenance_proc() for this, and pg_cron to automate it.
First, let's configure the retention policy in pg_partman's configuration table.
UPDATE partman.part_config
SET
retention = '90 days',
retention_keep_table = false,
retention_keep_index = false
WHERE parent_table = 'public.events';
* retention = '90 days': Partitions containing data older than 90 days from the current date will be targeted for removal.
* retention_keep_table = false: CRITICAL. This tells pg_partman to DROP the table entirely. The alternative, true, detaches the partition from the parent and leaves it as a standalone table, which is what you'd want when archiving to cold storage (e.g., COPY the detached table out to S3, then drop it; a minimal sketch of that flow follows below). For simple data purging, false is what we want: dropping a partition is a near-instantaneous operation with none of the locking or WAL overhead of a mass DELETE.
* retention_keep_index = false: Only meaningful when retention_keep_table = true, where it controls whether a detached partition keeps its indexes. With tables being dropped outright it has no effect, but setting it explicitly documents intent.
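For completeness, here is a minimal sketch of that archive-first variant, assuming retention_keep_table = true, a psql session, and an illustrative partition name and file path:
-- After maintenance detaches the expired partition, it survives as a standalone table.
-- Export it somewhere durable, then drop it by hand.
\copy events_p2024_07_01 TO '/archive/events_p2024_07_01.csv' WITH (FORMAT csv)
DROP TABLE events_p2024_07_01;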
Now, let's schedule the maintenance job. This should run more frequently than your partition interval. For daily partitions, running it hourly is a robust choice.
-- Ensure pg_cron is enabled in your instance
-- CREATE EXTENSION pg_cron;
-- Schedule the maintenance procedure to run at the top of every hour.
-- This will create new partitions and drop old ones based on our config.
SELECT cron.schedule('hourly-partition-maintenance', '0 * * * *', 'CALL partman.run_maintenance_proc()');
The CALL syntax is required because run_maintenance_proc() is a procedure rather than a function; it commits as it works through each partition set, keeping individual maintenance transactions short. This single cron job now handles the entire lifecycle of our events table partitions.
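It's worth verifying that the job actually runs and succeeds. If your pg_cron version records run history (1.4+), a quick health check looks something like this:
SELECT jobid, status, return_message, start_time, end_time
FROM cron.job_run_details
ORDER BY start_time DESC
LIMIT 5;
If you enabled jobmon, it records pg_partman's own view of each maintenance run as well.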
Performance Analysis: The Payoff
Let's re-run our query from before on the newly partitioned table.
EXPLAIN ANALYZE
SELECT id, event_type, payload, created_at
FROM events
WHERE device_id = 'a1e8e5e8-0b5b-4c8f-a2b1-9b9b76d4a1c1'
AND created_at >= NOW() - INTERVAL '1 day'
ORDER BY created_at DESC;
The query plan is now radically different and dramatically more efficient.
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Append (cost=0.42..150.78 rows=15 width=150) (actual time=0.045..12.345 rows=15000 loops=1)
-> Index Scan using events_p2024_10_27_device_id_created_at_idx on events_p2024_10_27 (cost=0.42..75.34 rows=7 width=150) (actual time=0.030..5.123 rows=7500 loops=1)
Index Cond: ((device_id = 'a1e8e5e8-0b5b-4c8f-a2b1-9b9b76d4a1c1') AND (created_at >= (now() - '1 day'::interval)))
-> Index Scan using events_p2024_10_28_device_id_created_at_idx on events_p2024_10_28 (cost=0.42..75.34 rows=8 width=150) (actual time=0.015..7.222 rows=7500 loops=1)
Index Cond: ((device_id = 'a1e8e5e8-0b5b-4c8f-a2b1-9b9b76d4a1c1') AND (created_at >= (now() - '1 day'::interval)))
Planning Time: 0.876 ms
Execution Time: 13.123 ms
From 15 seconds to 13 milliseconds.
What happened? The PostgreSQL query planner performed Partition Pruning. Because our WHERE clause contains a constraint on the partition key (created_at), the planner knew that only two partitions could possibly contain the requested data: today's (events_p2024_10_28) and yesterday's (events_p2024_10_27). It completely ignored the other 88 partitions representing the rest of our 90-day retention window. The query now only has to scan two small, highly-cached indexes instead of one monolithic, 2TB index.
This is the fundamental benefit of partitioning for time-series data.
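The flip side is worth remembering: pruning only happens when the partition key is constrained. A query filtered on device_id alone must visit every partition, so keep a created_at predicate in your hot query paths:
-- No created_at predicate: the planner must include every partition in the Append.
EXPLAIN
SELECT count(*)
FROM events
WHERE device_id = 'a1e8e5e8-0b5b-4c8f-a2b1-9b9b76d4a1c1';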
Advanced Edge Case: Backfilling a Multi-Terabyte Live Table
This all sounds great for a new system, but what about the 2TB events table that's already on fire in production? You can't just take a multi-hour outage to migrate the data. This requires a zero-downtime migration strategy.
Here is a battle-tested pattern:
1. Create the new partitioned table (e.g., events_partitioned) and configure pg_partman for it as described above. Do not drop the old table (events_old).
2. Set up dual writes. Add a trigger on events_old that forwards every new INSERT to the new partitioned table. This ensures no new data is lost during the backfill.
CREATE OR REPLACE FUNCTION forward_events_to_partitioned()
RETURNS TRIGGER AS $$
BEGIN
INSERT INTO events_partitioned (id, device_id, created_at, event_type, payload)
VALUES (NEW.id, NEW.device_id, NEW.created_at, NEW.event_type, NEW.payload);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trigger_dual_write_events
BEFORE INSERT ON events_old
FOR EACH ROW EXECUTE FUNCTION forward_events_to_partitioned();
Your application continues to write to events_old. Reads can also continue from events_old for now.
3. Backfill historical data in batches. Copy data from events_old to events_partitioned in small, manageable chunks. This is critical to avoid long-running transactions and overwhelming the database.
-- Conceptual backfill script logic (to be run repeatedly)
DO $$
DECLARE
batch_start timestamptz;
batch_end timestamptz;
BEGIN
-- Find the last backfilled timestamp from a control table
SELECT last_processed_timestamp INTO batch_start FROM backfill_status WHERE table_name = 'events';
batch_end := batch_start - INTERVAL '1 hour';
-- Copy one hour of data
INSERT INTO events_partitioned (id, device_id, created_at, event_type, payload)
SELECT id, device_id, created_at, event_type, payload
FROM events_old
WHERE created_at BETWEEN batch_end AND batch_start
ON CONFLICT DO NOTHING; -- The trigger might have already inserted recent data
-- Update control table
UPDATE backfill_status SET last_processed_timestamp = batch_end WHERE table_name = 'events';
END;
$$;
This process is run in a loop, moving backward in time, until all data is copied. The ON CONFLICT DO NOTHING is a safeguard in case the backfill process overlaps with the dual-write trigger.
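The script above assumes a small control table to track progress. A minimal sketch, with column names matching the script and a seeding choice (the moment the dual-write trigger went live) that is one reasonable assumption:
CREATE TABLE backfill_status (
    table_name TEXT PRIMARY KEY,
    last_processed_timestamp TIMESTAMPTZ NOT NULL
);

-- Seed with the point in time at which the dual-write trigger was enabled,
-- so the batch loop walks backward in time from there.
INSERT INTO backfill_status (table_name, last_processed_timestamp)
VALUES ('events', NOW());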
4. Cut over. Once the backfill has caught up to the point where the dual-write trigger took over:
a. Put the application into a brief maintenance mode.
b. Rename the tables: ALTER TABLE events_old RENAME TO events_archive; then ALTER TABLE events_partitioned RENAME TO events;.
c. Remove the trigger.
d. Bring the application back online.
5. Clean up. Once you're confident the new partitioned events table is working correctly, you can drop events_archive.
This phased approach allows for the migration of enormous tables with minimal impact on live services.
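Before running that final DROP TABLE events_archive, a cheap spot check over the retention window adds confidence (a sketch; adapt to your own verification standards):
SELECT
    (SELECT count(*) FROM events_archive
      WHERE created_at >= NOW() - INTERVAL '90 days') AS rows_in_archive,
    (SELECT count(*) FROM events
      WHERE created_at >= NOW() - INTERVAL '90 days') AS rows_in_partitioned;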
Conclusion: From Reactive Firefighting to Proactive Architecture
Implementing pg_partman is more than a database tuning exercise; it's a fundamental architectural shift for handling time-series data in PostgreSQL. By automating the lifecycle of native partitions, you move from a reactive state of constantly fighting performance fires to a proactive, scalable architecture.
The key takeaways for production systems are:
* The Partition Key MUST be in the Primary/Unique Key: This is the most common blocker and requires careful schema design.
* Use Native Partitioning: p_type := 'native' is the modern, high-performance choice.
* Leverage p_premake: Pre-creating partitions is essential to avoid write latency on high-throughput systems.
* Automate Everything with pg_cron: Schedule run_maintenance_proc() to run reliably and frequently.
* Use Template Tables: This is the only sane way to manage indexes and constraints across dozens or hundreds of partitions.
* Plan for Backfilling: For existing systems, a zero-downtime backfill strategy is a project in itself and must be carefully planned and executed.
By adopting these advanced patterns, you can confidently scale PostgreSQL to handle terabytes of time-series data, achieving query performance and operational stability that rivals specialized time-series databases, all within the rich, familiar ecosystem of PostgreSQL.