Scaling Time-Series Data with PostgreSQL and pg_partman Extensions


The Inevitable Decay of Monolithic Time-Series Tables

As a senior engineer managing a high-throughput system, you've likely witnessed the predictable lifecycle of a monolithic time-series table in PostgreSQL. It begins innocently: a simple events or logs table. Performance is snappy. But as data accumulates at a rate of millions or billions of rows per month, a creeping decay sets in. Query latency for time-range scans climbs steadily as the table and its indexes outgrow memory. VACUUM processes become resource hogs, struggling to keep up with dead tuples. Index bloat becomes a constant battle, and the sheer size of the B-tree indexes on your created_at column starts to impact insert performance and memory usage.

This isn't a failure of PostgreSQL; it's a failure of data modeling at scale. The standard solution is table partitioning, a technique PostgreSQL natively supports. However, native partitioning is declarative for the what but procedural for the how. You must manually create new partitions, manage triggers or routing rules, and script the detachment and dropping of old partitions. This manual approach is brittle, error-prone, and adds significant operational overhead.

This is where we move beyond the fundamentals. We're not here to discuss if you should partition, but how to do it in a robust, automated, and performant way in a production environment. We will focus on pg_partman, a PostgreSQL extension that provides a configuration-driven framework for managing time and serial-based partition sets, effectively turning the procedural headache of partition maintenance into a declarative, fire-and-forget setup.

This article will dissect advanced strategies for implementing pg_partman, focusing on:

  • Zero-Downtime Migration: A step-by-step guide for partitioning a live, multi-terabyte table without service interruption.
  • Composite Partitioning: Leveraging multi-level partitioning (e.g., by time and customer ID) to optimize complex query patterns.
  • Performance Tuning: Using template tables to manage partition-local indexes, including specialized index types like BRIN.
  • Operational Gotchas: Handling out-of-order data, managing retention policies, and understanding the locking behavior of maintenance jobs.

    The Scenario: A High-Throughput IoT Metrics Platform

    Imagine we're building a platform that ingests metrics from millions of IoT devices. Our core table, sensor_readings, receives a constant stream of data.

    Here's our initial, unpartitioned schema:

    sql
    CREATE TABLE sensor_readings (
        reading_id BIGSERIAL PRIMARY KEY,
        device_id UUID NOT NULL,
        reading_time TIMESTAMPTZ NOT NULL DEFAULT NOW(),
        metric_type VARCHAR(50) NOT NULL,
        payload JSONB
    );
    
    CREATE INDEX idx_sensor_readings_device_time ON sensor_readings (device_id, reading_time DESC);
    CREATE INDEX idx_sensor_readings_time ON sensor_readings (reading_time DESC);

    At 10,000 readings per second, this table grows by over 860 million rows per day. Within a week, we're dealing with a multi-billion row table. A simple query to get the last hour of data for a specific device, which should be fast, starts to degrade:

    sql
    EXPLAIN ANALYZE SELECT * 
    FROM sensor_readings
    WHERE device_id = 'a1b2c3d4-...' 
    AND reading_time >= NOW() - INTERVAL '1 hour';

    Initially, the planner can efficiently use the composite index. But as the table grows to terabytes, the index itself becomes massive. The planner might still perform an Index Scan, but the number of pages it has to fetch from disk or cache increases dramatically. The cost of traversing the enormous B-tree becomes non-trivial. Maintenance operations like VACUUM and REINDEX become epic, multi-hour (or multi-day) tasks.
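
    If you want to watch this decay in numbers, the built-in size functions make it easy to track how big the table and its indexes have grown (names from the schema above):

    sql
    -- Track heap, index, and total on-disk size for the table.
    SELECT pg_size_pretty(pg_table_size('sensor_readings'))          AS table_size,
           pg_size_pretty(pg_indexes_size('sensor_readings'))        AS index_size,
           pg_size_pretty(pg_total_relation_size('sensor_readings')) AS total_size;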

    Step 1: `pg_partman` Configuration for Declarative Partitioning

    Let's solve this problem correctly from the start. We'll assume you have the pg_partman extension installed (CREATE EXTENSION pg_partman;).

    Our goal is to partition sensor_readings by reading_time into daily partitions.

    1.1. Create the Parent Table

    The parent table will be identical to our original, but with the PARTITION BY RANGE clause. Crucially, it will contain no data itself; it acts as a template and a routing point.

    sql
    CREATE TABLE sensor_readings_partitioned (
        reading_id BIGSERIAL NOT NULL,
        device_id UUID NOT NULL,
        reading_time TIMESTAMPTZ NOT NULL DEFAULT NOW(),
        metric_type VARCHAR(50) NOT NULL,
        payload JSONB
    ) PARTITION BY RANGE (reading_time);
    
    -- IMPORTANT: Indexes are created on the parent table and will be propagated to new partitions.
    CREATE INDEX idx_srp_device_time ON sensor_readings_partitioned (device_id, reading_time DESC);
    CREATE INDEX idx_srp_time ON sensor_readings_partitioned (reading_time DESC);
    
    -- The primary key must include the partition key in PostgreSQL.
    -- We will manage uniqueness differently, often at the application layer or with a composite primary key.
    -- For this example, we'll drop the simple PRIMARY KEY constraint and handle uniqueness via other means.

    Note on Primary Keys: A key limitation in PostgreSQL partitioning is that any PRIMARY KEY or UNIQUE constraint must include all columns from the partition key. Our original PRIMARY KEY (reading_id) is invalid. A common pattern is to use a composite primary key: PRIMARY KEY (reading_id, reading_time). This ensures uniqueness within the constraint of the partitioning scheme.
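
    If you do want the database to enforce that uniqueness, the composite key can be added directly on the partitioned parent; a minimal sketch:

    sql
    -- Composite primary key that satisfies the partition-key requirement.
    ALTER TABLE sensor_readings_partitioned
        ADD PRIMARY KEY (reading_id, reading_time);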

    1.2. Configure `pg_partman`

    Now, we tell pg_partman how to manage this table set. We use the create_parent function.

    sql
    SELECT partman.create_parent(
        p_parent_table := 'public.sensor_readings_partitioned',
        p_control := 'reading_time',
        p_type := 'native',
        p_interval := 'daily',
        p_premake := 4,
        p_start_partition := (NOW() - INTERVAL '3 days')::text
    );

    Let's break down these critical parameters:

    * p_parent_table: The table to manage.

    * p_control: The partition key column.

    * p_type := 'native': We're using PostgreSQL's native declarative partitioning, available since version 10. The trigger-based alternative ('partman') is a legacy method, and recent pg_partman releases have dropped it entirely.

    * p_interval := 'daily': The heart of the configuration. pg_partman will create one new table per day.

    * p_premake := 4: This is a crucial performance optimization. It tells pg_partman to always maintain 4 partitions for future data. If today is Wednesday, it ensures partitions for Thursday, Friday, Saturday, and Sunday already exist. Without premade partitions, the first writes of a new day would fail (or land in a default partition) until the maintenance job created the new child table. p_premake ensures the table is always ready for incoming data.

    * p_start_partition: Allows you to pre-create partitions for historical data if you're migrating.
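
    After create_parent runs, you can confirm that the premade children actually exist by asking the catalog for the parent's partitions:

    sql
    -- List the child partitions currently attached to the parent.
    SELECT inhrelid::regclass::text AS partition_name
    FROM pg_inherits
    WHERE inhparent = 'public.sensor_readings_partitioned'::regclass
    ORDER BY 1;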

    1.3. Automating Maintenance

    pg_partman doesn't run itself. You need to periodically call its maintenance function, run_maintenance(). This function checks all configured parent tables, creates new partitions as needed (respecting premake), and drops old ones (based on retention policies we'll set later).

    The best way to schedule this is with an extension like pg_cron.

    sql
    -- Install pg_cron if you haven't already
    -- Add to postgresql.conf: shared_preload_libraries = 'pg_cron'
    -- Then run: CREATE EXTENSION pg_cron;
    
    -- Schedule maintenance to run every hour.
    SELECT cron.schedule('pg_partman_maintenance', '0 * * * *', 'CALL partman.run_maintenance_proc()');

    Running it hourly is a safe default. The procedure is idempotent and lightweight if there's nothing to do.
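
    It is also worth confirming the job is actually firing. Newer pg_cron releases record run history, so a quick sanity check looks like this (treat the history table as version-dependent):

    sql
    -- The registered job and its schedule.
    SELECT jobid, schedule, command FROM cron.job;

    -- Recent executions, if your pg_cron version provides cron.job_run_details.
    SELECT jobid, status, return_message, start_time, end_time
    FROM cron.job_run_details
    ORDER BY start_time DESC
    LIMIT 10;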

    Advanced Pattern 1: Zero-Downtime Migration of a Live Table

    This is the most common and challenging scenario. You have a 5TB sensor_readings table and you can't afford downtime to partition it.

    Here's a production-tested, step-by-step strategy:

    Phase 1: Setup and Dual-Writing

  • Create the new partitioned table: Follow the steps in section 1.1 and 1.2 to create sensor_readings_partitioned and configure it with pg_partman.
  • Create a dual-write trigger: Create a trigger on the old table (sensor_readings) that copies every new INSERT to the new partitioned table.
  • sql
        CREATE OR REPLACE FUNCTION duplicate_to_partitioned_table() 
        RETURNS TRIGGER AS $$
        BEGIN
            INSERT INTO sensor_readings_partitioned (reading_id, device_id, reading_time, metric_type, payload)
            VALUES (NEW.reading_id, NEW.device_id, NEW.reading_time, NEW.metric_type, NEW.payload);
            RETURN NEW;
        END;
        $$ LANGUAGE plpgsql;
    
        CREATE TRIGGER trigger_dual_write
        BEFORE INSERT ON sensor_readings
        FOR EACH ROW EXECUTE FUNCTION duplicate_to_partitioned_table();

    At this point, all new data is flowing to both tables. Your application is still reading from the old table.

    Phase 2: The Backfill

  • Backfill historical data: Now, you need to copy the existing 5TB of data from sensor_readings to sensor_readings_partitioned. Do not do this in a single transaction: one giant INSERT ... SELECT holds a long-running transaction open (which stalls vacuum cleanup) and generates an enormous burst of WAL. Instead, backfill in small, manageable chunks.
  • Here's a sample script to run in a screen or tmux session. It copies one day of data at a time, with a small delay between batches to reduce load.

    sql
        -- This is a conceptual script. You'd run this from a shell script or admin tool.
        -- NOTE: rows inserted after the dual-write trigger went live already exist in the
        -- target table, so cap the end date at the trigger's installation time (or plan to
        -- deduplicate afterwards) to avoid copying them twice.
        DO $$
        DECLARE
            start_ts TIMESTAMPTZ := (SELECT MIN(reading_time) FROM sensor_readings);
            end_ts   TIMESTAMPTZ := (SELECT MAX(reading_time) FROM sensor_readings);
            chunk_ts TIMESTAMPTZ := start_ts;  -- current_date is a reserved word, so use a different name
        BEGIN
            WHILE chunk_ts < end_ts LOOP
                RAISE NOTICE 'Backfilling data for %', chunk_ts::date;

                INSERT INTO sensor_readings_partitioned (reading_id, device_id, reading_time, metric_type, payload)
                SELECT reading_id, device_id, reading_time, metric_type, payload
                FROM sensor_readings
                WHERE reading_time >= chunk_ts AND reading_time < (chunk_ts + INTERVAL '1 day');

                -- Commit each chunk (allowed in a DO block since PostgreSQL 11,
                -- as long as it isn't run inside an outer transaction).
                COMMIT;

                chunk_ts := chunk_ts + INTERVAL '1 day';

                -- Be a good database citizen
                PERFORM pg_sleep(1);
            END LOOP;
        END;
        $$;

    Phase 3: The Switchover

  • Verification: Once the backfill is complete and the dual-write trigger has been running, the two tables should contain the same data. Run counts and checksums to verify this (a per-day count comparison is sketched just after the cutover below).
  • The Cutover (Minimal Downtime Window): This is the only part that requires a brief maintenance window.
  • sql
        -- Start a transaction
        BEGIN;

        -- Acquire an exclusive lock on both tables to prevent any writes.
        LOCK TABLE sensor_readings IN ACCESS EXCLUSIVE MODE;
        LOCK TABLE sensor_readings_partitioned IN ACCESS EXCLUSIVE MODE;

        -- Rename the tables.
        ALTER TABLE sensor_readings RENAME TO sensor_readings_old;
        ALTER TABLE sensor_readings_partitioned RENAME TO sensor_readings;

        -- Advance the new table's reading_id sequence past everything generated so far,
        -- so post-cutover inserts don't reuse IDs handed out by the old table.
        -- (sensor_readings_reading_id_seq is the old table's serial sequence; sequences
        -- keep their names through a table rename.)
        SELECT setval(pg_get_serial_sequence('sensor_readings', 'reading_id'),
                      (SELECT last_value FROM sensor_readings_reading_id_seq));

        -- End the transaction. The locks are released.
        COMMIT;

    The entire transaction should take milliseconds. Your application code, which references sensor_readings, now transparently interacts with the new partitioned table.
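
    Before dropping anything, re-run the verification against the renamed tables. A minimal sketch of the per-day count comparison mentioned above (checksums can be layered on the same shape):

    sql
    -- Any day where the old heap and the new partitioned table disagree on row count.
    SELECT day, o.old_count, n.new_count
    FROM (SELECT reading_time::date AS day, COUNT(*) AS old_count
          FROM sensor_readings_old GROUP BY 1) AS o
    FULL JOIN (SELECT reading_time::date AS day, COUNT(*) AS new_count
               FROM sensor_readings GROUP BY 1) AS n USING (day)
    WHERE o.old_count IS DISTINCT FROM n.new_count;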

    Phase 4: Cleanup

  • Drop the old resources: Once you are confident everything is working, you can drop the trigger and the old table.
  • sql
        DROP TRIGGER trigger_dual_write ON sensor_readings_old;
        DROP TABLE sensor_readings_old;

    Advanced Pattern 2: Composite Partitioning for Multi-Dimensional Queries

    Our daily partitioning by reading_time is great for time-based scans. But what if a common query pattern is to find all data for a specific device_id within a time range? With only time-based partitioning, the query planner still has to look inside each daily partition and scan the (device_id, reading_time) index.

    We can do better with composite partitioning. Let's partition by reading_time (Range) first, and then sub-partition each day by device_id (List or Hash).

    Hash sub-partitioning on device_id is not something a single create_parent call gives you. pg_partman does ship a create_sub_parent() function, but it manages the same time- and serial-based intervals as create_parent, so the device_id level has to be declared with native DDL ourselves.

    Step 1: Create the top-level time partition.

    This is the same as before.

    sql
    -- Already done
    CREATE TABLE sensor_readings (...) PARTITION BY RANGE (reading_time);
    SELECT partman.create_parent('public.sensor_readings', 'reading_time', 'native', 'daily');

    Step 2: Create a trigger to apply sub-partitioning.

    We need a mechanism that, whenever pg_partman creates a new daily partition (e.g., sensor_readings_p2024_10_27), it immediately configures that new partition to be a parent for device_id sub-partitions.

    pg_partman doesn't have a direct hook for this, so an event trigger is a powerful, if complex, solution. A simpler, more operational approach is to modify the maintenance job.

    sql
    -- A custom maintenance procedure (illustrative: it finds new daily partitions,
    -- but converting them into sub-partitioned parents is where the pain starts)
    CREATE OR REPLACE PROCEDURE custom_run_maintenance()
    LANGUAGE plpgsql AS $$
    DECLARE
        before_partitions TEXT[];
        part TEXT;
    BEGIN
        -- Snapshot the daily partitions that already exist
        SELECT array_agg(relname::text) INTO before_partitions
        FROM pg_class
        WHERE relname LIKE 'sensor_readings_p%'
          AND relkind IN ('r', 'p');

        -- Run standard pg_partman maintenance
        CALL partman.run_maintenance_proc();

        -- Anything matching the pattern that wasn't in the snapshot was just created
        FOR part IN
            SELECT relname FROM pg_class
            WHERE relname LIKE 'sensor_readings_p%'
              AND relkind IN ('r', 'p')
              AND relname::text <> ALL (COALESCE(before_partitions, '{}'::text[]))
        LOOP
            -- Here is the problem: a freshly created child is a plain table. To turn it into
            -- a parent for 16 HASH sub-partitions on device_id you would have to detach it,
            -- recreate it with PARTITION BY HASH (device_id), re-attach it, and build its
            -- hash children, all before any data lands in it.
            RAISE NOTICE 'New daily partition % still needs sub-partitioning', part;
        END LOOP;
    END;
    $$;

    The above shows the difficulty of retrofitting: a plain child table cannot simply be altered into a sub-partitioned parent after the fact.

    A more direct approach: let pg_partman own the time axis, and declare the device_id level by hand where the query pattern justifies it, creating each daily child as a hash-partitioned parent before data arrives, as sketched below.
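
    A minimal sketch of that hand-declared layout, assuming the daily child for 2024-10-27 is created by hand rather than premade by pg_partman (the names and the 16-way hash split are illustrative choices):

    sql
    -- The daily child is itself a hash-partitioned parent on device_id.
    CREATE TABLE sensor_readings_p2024_10_27
        PARTITION OF sensor_readings
        FOR VALUES FROM ('2024-10-27') TO ('2024-10-28')
        PARTITION BY HASH (device_id);

    -- Create the 16 hash children for that day.
    DO $$
    BEGIN
        FOR i IN 0..15 LOOP
            EXECUTE format(
                'CREATE TABLE %I PARTITION OF sensor_readings_p2024_10_27
                 FOR VALUES WITH (MODULUS 16, REMAINDER %s)',
                'sensor_readings_p2024_10_27_h' || i, i);
        END LOOP;
    END;
    $$;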

    Advanced Pattern 3: Performance Tuning with Template Tables and BRIN Indexes

    By default, pg_partman creates new partitions by inheriting indexes from the parent. This is good, but we can have more granular control using a template table.

    This allows us to define a specific structure, including indexes and constraints, that is only used for new partitions, independent of the parent.

    This is incredibly powerful for time-series data, where we can leverage specialized index types like BRIN (Block Range Index).

    A BRIN index is much smaller than a B-tree and is ideal for columns that have a strong natural correlation with their physical storage order, like reading_time. For a 1TB partition, a B-tree index on reading_time could be 50-100GB, while a BRIN index might be a few megabytes.

    Step 1: Create the template table.

    It must have the exact same column structure as the parent.

    sql
    CREATE TABLE sensor_readings_template (
        reading_id BIGSERIAL NOT NULL,
        device_id UUID NOT NULL,
        reading_time TIMESTAMPTZ NOT NULL,
        metric_type VARCHAR(50) NOT NULL,
        payload JSONB
    );
    
    -- Define OPTIMIZED indexes for our partitions here
    -- B-tree for high-selectivity device_id lookups
    CREATE INDEX ON sensor_readings_template (device_id, reading_time DESC);
    
    -- BRIN index for fast time-range scans on the whole partition
    CREATE INDEX ON sensor_readings_template USING BRIN (reading_time);

    Step 2: Link the template table in part_config.

    If you don't supply a template table, pg_partman will create and manage one for you. Since we've defined our own above, we point the configuration at it, either when calling create_parent or by updating the config afterwards.

    sql
    -- When calling create_parent initially
    SELECT partman.create_parent(
        ...,
        p_template_table := 'public.sensor_readings_template'
    );
    
    -- Or update the configuration for an existing setup
    UPDATE partman.part_config 
    SET template_table = 'public.sensor_readings_template' 
    WHERE parent_table = 'public.sensor_readings_partitioned';

    Now, every new partition pg_partman creates will pick up the structure and, most importantly, the indexes defined on sensor_readings_template. Note that indexes declared on the partitioned parent itself are still propagated natively by PostgreSQL, so if you want the template to be the sole source of per-partition indexes, leave the parent index-free. This lets each daily partition carry a tiny BRIN index on reading_time instead of a large per-partition B-tree, saving space and write overhead.
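
    To confirm which indexes actually ended up on a freshly created child, pg_indexes is enough (the partition name here is illustrative):

    sql
    -- Inspect the indexes present on one daily partition.
    SELECT indexname, indexdef
    FROM pg_indexes
    WHERE schemaname = 'public'
      AND tablename = 'sensor_readings_p2024_10_27';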

    Query Performance Analysis

    Let's see the impact. A query for a 6-hour window on a multi-terabyte, unpartitioned table might look like this:

    text
    EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM sensor_readings
    WHERE reading_time BETWEEN '2024-10-26 00:00:00' AND '2024-10-26 06:00:00';
    
    -- Result on unpartitioned table:
    -- Index Scan using idx_sensor_readings_time on sensor_readings (cost=0.57..5,834,192.44 rows=250,194,231 width=8)
    --   Index Cond: ((reading_time >= '...') AND (reading_time <= '...'))
    --   Buffers: shared hit=1234567, read=876543
    --   Execution Time: 35,421.123 ms

    Now, on the partitioned table:

    text
    -- Result on partitioned table:
    -- Aggregate (cost=...)
    --   -> Append (cost=...)
    --     -> Bitmap Heap Scan on sensor_readings_p2024_10_26 (cost=...)
    --          Recheck Cond: ((reading_time >= '...') AND (reading_time <= '...'))
    --          Buffers: shared hit=54321, read=1234
    --          -> Bitmap Index Scan using sensor_readings_p2024_10_26_reading_time_brin_idx (cost=...)
    --   Execution Time: 210.456 ms

    The key is partition pruning (the modern successor to constraint exclusion for declarative partitioning). The planner knows the query only targets a specific day, so it completely ignores all other partitions. It scans only the single, much smaller sensor_readings_p2024_10_26 table and its tiny BRIN index. The performance difference is often 100x or more.
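
    Pruning is enabled by default, and it is cheap to confirm: a plain EXPLAIN on a time-bounded query should show only the matching daily partition(s) in the plan.

    sql
    -- Pruning is governed by this setting (on by default).
    SHOW enable_partition_pruning;

    -- Only sensor_readings_p2024_10_26 should appear in the resulting plan.
    EXPLAIN SELECT COUNT(*) FROM sensor_readings
    WHERE reading_time >= '2024-10-26 00:00:00'
      AND reading_time <  '2024-10-26 06:00:00';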

    Operational Gotchas and Edge Cases

    1. Out-of-Order Data Ingestion

    What happens if a device that was offline for a week suddenly sends its buffered data? Your system tries to insert a record with a reading_time from 7 days ago. If you've already dropped that partition as part of your retention policy, the insert will fail because no suitable partition exists.

    * Solution 1 (Default): The insert fails. You must have robust error handling and a dead-letter queue in your application to handle these.

    * Solution 2 (The default partition): You can create a DEFAULT partition to catch any data that doesn't fit into the defined partitions. This prevents insert failures but requires a separate process to periodically check the default partition and move its rows to the correct place (or discard them). Be aware that PostgreSQL will refuse to create or attach a new partition whose range overlaps rows currently sitting in the default, so stranded rows must be moved out first.

    sql
        CREATE TABLE sensor_readings_default PARTITION OF sensor_readings_partitioned DEFAULT;
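
    pg_partman also ships helpers for exactly this cleanup. The calls below exist in recent releases, but signatures vary between versions, so treat this as a sketch and check the documentation for your installed version:

    sql
    -- Report rows that have accumulated in the default partitions of managed sets.
    SELECT * FROM partman.check_default();

    -- Move misrouted rows out of the default partition into the proper children, in batches.
    CALL partman.partition_data_proc('public.sensor_readings_partitioned');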

    2. Retention Policy Management

    pg_partman makes retention trivial. You configure it in the part_config table.

    sql
    UPDATE partman.part_config SET
        retention = '30 days',
        retention_keep_table = true,
        retention_keep_index = true
    WHERE parent_table = 'public.sensor_readings_partitioned';

    The next time run_maintenance() runs, it will detach any partition whose data is entirely older than 30 days. DETACH is a near-instant, metadata-only operation. The old table (sensor_readings_p2024_09_26) still exists. You can then back it up to cold storage (e.g., S3) and drop it later.

    If you set retention_keep_table = false, pg_partman will DROP the table immediately. This is faster and reclaims disk space but is permanent.
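
    With retention_keep_table = true, detached partitions linger as ordinary tables until you archive and drop them. A sketch of how to find them, matching on our naming convention (adjust the pattern for yours):

    sql
    -- Tables that match the partition naming pattern but are no longer attached to any parent.
    SELECT c.relname
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname = 'public'
      AND c.relkind = 'r'
      AND c.relname LIKE 'sensor_readings_p20%'
      AND NOT EXISTS (SELECT 1 FROM pg_inherits i WHERE i.inhrelid = c.oid);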

    3. Locking Behavior

    run_maintenance() needs to take locks to create or detach partitions. These are generally brief ACCESS EXCLUSIVE locks on the parent table. With p_premake set appropriately, partition creation happens long before it's needed, so it's unlikely to conflict with live traffic.

    Dropping partitions can be more intensive. The key is to schedule the maintenance job during low-traffic periods. The hourly schedule with pg_cron is usually fine, as the job does nothing most of the time and is very fast when it does act.
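
    If you want evidence rather than assumption, watch for sessions stuck behind locks while run_maintenance() is active; a quick check against pg_stat_activity:

    sql
    -- Sessions currently waiting on a lock (run while the maintenance job is active).
    SELECT pid, wait_event_type, wait_event, state, left(query, 80) AS query
    FROM pg_stat_activity
    WHERE wait_event_type = 'Lock';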

    Conclusion: From Reactive DBA to Proactive Architect

    By leveraging pg_partman, you elevate your role from a reactive database administrator fighting daily fires of index bloat and slow queries to a proactive architect designing a system that is scalable by default. This declarative approach to partitioning abstracts away the complex, error-prone maintenance scripts and provides a robust framework for managing the entire lifecycle of your time-series data.

    The key takeaways for production implementation are:

    * Automate Everything: Use pg_partman and pg_cron to fully automate partition creation and retention. Manual intervention should be the exception.

    * Pre-create Partitions: Always use the p_premake setting to avoid insert latency and contention.

    * Plan Your Migration: For existing large tables, a zero-downtime migration using a dual-write trigger and chunked backfill is a proven, safe strategy.

    * Optimize Indexes: Use template tables to apply partition-specific indexes like BRIN, drastically reducing storage overhead and accelerating range scans.

    * Monitor and Alert: Monitor for failures in the maintenance job and for data landing in a default partition, as these indicate potential data pipeline issues.

    Adopting these advanced patterns transforms PostgreSQL from a simple relational database into a highly performant, scalable time-series engine capable of handling enterprise-level workloads.
