Advanced PostgreSQL Partitioning: Scaling Time-Series with pg_partman

Goh Ling Yong

The Inevitable Scaling Problem with Time-Series Tables

Any system logging events, metrics, or sensor data eventually confronts the monolith table problem. A single events table, growing by millions or billions of rows, becomes a significant bottleneck. INSERT performance degrades due to index bloat, SELECT queries become agonizingly slow without precise time-range scoping, and maintenance operations like VACUUM and ALTER TABLE can lock up the system for hours.

PostgreSQL's native declarative partitioning is the foundational solution. It allows a large table to be physically split into smaller, more manageable child tables (partitions) while being logically queryable as a single entity. The query planner can then perform partition pruning, scanning only the relevant child tables based on the WHERE clause, leading to dramatic performance gains.
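
For readers who haven't used it, native partitioning in its raw form looks like the sketch below (the table and partition names are illustrative, not part of the events schema used later). Notice that every future partition is a separate CREATE TABLE that someone has to remember to run.

sql
-- A minimal sketch of native range partitioning on its own (illustrative names).
CREATE TABLE sensor_readings (
    reading_time TIMESTAMPTZ NOT NULL,
    value        DOUBLE PRECISION
) PARTITION BY RANGE (reading_time);

-- Without tooling, each future partition must be created by hand, in advance:
CREATE TABLE sensor_readings_2023_10_26 PARTITION OF sensor_readings
    FOR VALUES FROM ('2023-10-26') TO ('2023-10-27');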

However, native partitioning alone is insufficient for a robust, automated production system. It provides the mechanism but not the management. Key operational questions remain:

  • Partition Creation: How are new partitions for future data created automatically and ahead of time to prevent write failures?
  • Data Retention: How are old partitions dropped or archived to manage storage costs and maintain performance?
  • Maintenance: How is the entire lifecycle managed, monitored, and maintained without constant manual intervention?

This is where pg_partman enters. It's an extension that automates the creation and management of time-based and serial-based table partition sets. But simply installing and running create_parent() is just scratching the surface. This article focuses on the advanced patterns required to leverage pg_partman in demanding, high-throughput production environments.


    Section 1: Beyond the Basics - A Production-Ready `pg_partman` Setup

    Let's assume we're building a system to ingest high-frequency IoT device events. A naive table might look like this:

    sql
    CREATE TABLE events (
        event_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        device_id UUID NOT NULL,
        created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
        payload JSONB
    );
    
    CREATE INDEX idx_events_device_id ON events (device_id);
    CREATE INDEX idx_events_created_at ON events (created_at DESC);

    This schema will fail at scale. A UUID primary key on a multi-terabyte table is inefficient, and the indexes will become enormous. The first step is to design for partitioning from day one.

    The Partition Key Constraint: Rethinking Primary Keys

    A critical limitation of PostgreSQL partitioning is that any PRIMARY KEY or UNIQUE constraint on a partitioned table must include all columns of the partition key. Our partition key will be created_at. This means event_id alone can no longer be the primary key.

    The standard production pattern is to create a composite primary key.

    sql
    -- The parent table for our partitioned setup
    CREATE TABLE events (
        event_id UUID NOT NULL DEFAULT gen_random_uuid(),
        device_id UUID NOT NULL,
        created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
        payload JSONB,
        PRIMARY KEY (created_at, event_id) -- Partition key first is a common convention
    ) PARTITION BY RANGE (created_at);
    
    -- We still need fast lookups by device_id
    CREATE INDEX idx_events_device_id_created_at ON events (device_id, created_at DESC);

    By including created_at in the primary key, we satisfy PostgreSQL's constraint. This has implications for your application logic. Uniqueness is now guaranteed on the combination of created_at and event_id. While collisions on this composite key are astronomically unlikely, it's a fundamental shift from a simple UUID primary key.
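
    For reference, if you try to keep event_id alone as the primary key on the partitioned table, PostgreSQL rejects the DDL outright with an error along these lines (exact wording varies by version):

    sql
    -- This fails because the unique constraint does not cover the partition key:
    CREATE TABLE events_bad (
        event_id UUID PRIMARY KEY,
        created_at TIMESTAMPTZ NOT NULL DEFAULT now()
    ) PARTITION BY RANGE (created_at);
    -- ERROR:  unique constraint on partitioned table must include all partitioning columns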

    Code Example 1: Detailed `create_parent()` Configuration

    Now, let's configure pg_partman to manage this table. We won't just set an interval; we'll configure it for a high-write production workload.

    sql
    -- Ensure the extension is installed
    CREATE EXTENSION IF NOT EXISTS pg_partman;
    
    -- Configure pg_partman to manage the 'events' table
    SELECT partman.create_parent(
        p_parent_table := 'public.events',
        p_control := 'created_at',
        p_type := 'native',
        p_interval := 'daily',
        p_premake := 14,  -- Create 14 days of partitions in advance
        p_start_partition := (now() - interval '7 day')::text,
        p_jobmon := 'on' -- Enable logging via the pg_jobmon extension
    );

    Let's break down the advanced considerations here:

    * p_type := 'native': Always use this for modern PostgreSQL (11+). It leverages declarative partitioning, which is superior to the old trigger-based method.

    * p_premake := 14: This is a critical tuning parameter. It tells pg_partman to maintain 14 future daily partitions at all times. For a high-throughput system, if your maintenance job fails to run for a day, you don't want writes to start failing because a partition for the current day doesn't exist. The premake value is your buffer. The trade-off is storage overhead for empty tables vs. write availability. A value between 7 and 30 is common, depending on your operational confidence and data velocity.

    * p_start_partition: Explicitly setting a start partition is crucial if you are migrating existing data. If omitted, pg_partman might create partitions from the earliest data in the table, which could be years ago, creating thousands of unnecessary partitions.

    * p_jobmon := 'on': In production, you *must* have insight into pg_partman's maintenance runs. pg_jobmon (another extension by the same author) provides detailed logging of run times, steps, and failures. Without it, you are flying blind.

    After running this, pg_partman creates the necessary child partitions (events_p2023_10_26, events_p2023_10_27, etc.) and automatically attaches them to the parent events table. All indexes from the parent are automatically replicated to the children.
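
    A quick sanity check after the initial setup is to confirm that the premake buffer of future partitions actually exists. A minimal catalog query (a sketch; pg_partman also ships helper functions for inspecting a partition set):

    sql
    -- List the child partitions currently attached to public.events, newest first.
    -- With p_premake := 14 you should see roughly 14 future daily partitions here.
    SELECT child.relname
    FROM pg_inherits i
    JOIN pg_class child  ON child.oid = i.inhrelid
    JOIN pg_class parent ON parent.oid = i.inhparent
    JOIN pg_namespace n  ON n.oid = parent.relnamespace
    WHERE parent.relname = 'events' AND n.nspname = 'public'
    ORDER BY child.relname DESC;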


    Section 2: Advanced Data Retention and Archiving Patterns

    pg_partman provides a simple mechanism for data retention: the retention setting in its part_config table, applied whenever run_maintenance_proc() runs, drops partitions older than the specified interval. In a production system with valuable data, automatically dropping data is extremely dangerous.

    A network hiccup, a misconfiguration, or a bug could lead to permanent data loss. The professional pattern is to separate the detachment of a partition from its destruction.

    Our strategy will be:

  • Use pg_partman to detach old partitions, but not drop them.
  • Implement a separate, idempotent process to archive the detached partitions to cold storage (e.g., AWS S3).
  • Only after successful archival and verification does the process drop the table.

    Step 1: Configure `pg_partman` for Detach-Only

    Update the part_config table to disable automatic dropping.

    sql
    UPDATE partman.part_config
    SET 
        retention = '30 days',
        retention_keep_table = 'true',
        retention_keep_index = 'true'
    WHERE parent_table = 'public.events';

    * retention = '30 days': We still define the retention window.

    * retention_keep_table = 'true': This is the key. When run_maintenance_proc() runs, it will detach any partition whose data is entirely older than 30 days, but it will leave the table (e.g., public.events_p2023_09_25) in the database as a standalone table.
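
    One operational detail worth making explicit: detaching only happens when pg_partman's maintenance actually runs. The sketch below schedules it hourly with pg_cron; this assumes the pg_cron extension is available, and pg_partman's background worker is an equally valid alternative.

    sql
    -- Run pg_partman maintenance every hour (assumes pg_cron is installed).
    CREATE EXTENSION IF NOT EXISTS pg_cron;
    SELECT cron.schedule('0 * * * *', $$CALL partman.run_maintenance_proc()$$);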

    Code Example 2: A Robust Archival and Cleanup Script

    This script would be run by an external scheduler (like cron or an Airflow DAG) after the pg_partman maintenance job.

    bash
    #!/bin/bash
    
    set -e
    set -o pipefail
    
    PG_HOST="your_db_host"
    PG_USER="your_db_user"
    PG_DB="your_db_name"
    S3_BUCKET="s3://your-company-pg-archives"
    
    # Find detached partitions. pg_partman leaves them in the parent's schema.
    # We look for tables that are no longer in pg_inherits.
    DETACHED_TABLES=$(psql -h $PG_HOST -U $PG_USER -d $PG_DB -t -c "
        SELECT c.relname
        FROM pg_class c
        LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
        WHERE c.relkind = 'r'
          AND n.nspname = 'public'
          AND c.relname ~ '^events_p[0-9]{4}_[0-9]{2}_[0-9]{2}$'
          AND NOT EXISTS (
            SELECT 1 FROM pg_inherits WHERE inhrelid = c.oid
          );
    ")
    
    if [ -z "$DETACHED_TABLES" ]; then
        echo "No detached partitions found to archive."
        exit 0
    fi
    
    echo "Found detached partitions: $DETACHED_TABLES"
    
    for TABLE_NAME in $DETACHED_TABLES; do
        echo "Archiving table: $TABLE_NAME"
        
        FILENAME="${TABLE_NAME}_$(date +%Y%m%d%H%M%S).sql.gz"
        S3_PATH="${S3_BUCKET}/${PG_DB}/${TABLE_NAME}/${FILENAME}"
    
        # 1. Archive the table to S3.
        # Note: with `set -e` and `set -o pipefail`, a failed pipeline would exit the
        # script before a separate `$?` check could run, so test the pipeline directly.
        if ! pg_dump -h "$PG_HOST" -U "$PG_USER" -d "$PG_DB" -t "public.${TABLE_NAME}" | gzip | aws s3 cp - "$S3_PATH"; then
            echo "ERROR: Archival of $TABLE_NAME to $S3_PATH failed. Aborting drop."
            exit 1
        fi
    
        echo "Successfully archived $TABLE_NAME to $S3_PATH"
    
        # 2. Verify the archive exists (optional but recommended)
        if ! aws s3 ls "$S3_PATH" > /dev/null; then
            echo "ERROR: Verification of archive $S3_PATH failed. Aborting drop."
            exit 1
        fi
    
        # 3. Drop the table only after successful archival and verification
        echo "Dropping table $TABLE_NAME..."
        psql -h $PG_HOST -U $PG_USER -d $PG_DB -c "DROP TABLE public.\"${TABLE_NAME}\";"
    
        echo "Successfully dropped table $TABLE_NAME."
    
    done
    
    echo "Archival and cleanup process complete."

    This script introduces a safety layer. Data is only dropped after it's securely backed up. It's idempotent; if the script fails midway (e.g., the S3 upload is interrupted), the next run will find the same detached table and try again. The detached table acts as a buffer or staging area for archival.
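
    If you want an extra safety net before the drop step, it's cheap to sanity-check a detached table's contents against the date range its name implies; since it's now a standalone table, it can be queried directly. A minimal check, using the example partition name from above:

    sql
    -- The detached partition is a normal standalone table at this point.
    SELECT count(*)        AS row_count,
           min(created_at) AS oldest_event,
           max(created_at) AS newest_event
    FROM public.events_p2023_09_25;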


    Section 3: Evolving Schemas and Indexes with Template Tables

    As a system matures, its data access patterns change. Perhaps for recent data (last 7 days), you need fast key-value lookups on a specific field in the payload JSONB, but for older data, this index is just costly overhead.

    pg_partman allows you to specify a template_table for newly created partitions. This provides a powerful mechanism for managing heterogeneous index strategies across your partition set.

    Scenario: A GIN index for recent data only

    Let's say we want a GIN index on payload for fast searching, but only on partitions for the next 7 days, as ad-hoc queries primarily target recent events.

    Code Example 3: Implementing a Template Table

    1. Create the template table. It must have the exact same column structure as the parent, but it can have a different set of indexes.

    sql
    -- This table will never hold data; it's just a template for structure.
    CREATE TABLE events_template (
        event_id UUID NOT NULL,
        device_id UUID NOT NULL,
        created_at TIMESTAMPTZ NOT NULL,
        payload JSONB,
        PRIMARY KEY (created_at, event_id)
    );

    -- Add the standard indexes that ALL partitions need
    CREATE INDEX idx_events_template_device_id_created_at ON events_template (device_id, created_at DESC);

    -- Add the special GIN index only for new partitions
    CREATE INDEX idx_events_template_payload_gin ON events_template USING gin (payload jsonb_path_ops);

    2. Associate the template table with your partition set.

    sql
    UPDATE partman.part_config
    SET template_table = 'public.events_template'
    WHERE parent_table = 'public.events';

    Now, whenever pg_partman's maintenance job runs and creates new partitions, it will use events_template to define the structure, including the special GIN index. Existing partitions are unaffected.
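
    It's easy to verify which partitions actually carry the GIN index; the auto-generated index names on the children may differ from the template's, so matching on the index definition is more reliable:

    sql
    -- Which event partitions currently have a GIN index on payload?
    SELECT tablename, indexname
    FROM pg_indexes
    WHERE schemaname = 'public'
      AND tablename LIKE 'events_p%'
      AND indexdef ILIKE '%USING gin%'
    ORDER BY tablename;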

    Edge Case: Removing the index from older partitions

    This setup only applies the index to new partitions. What about the partitions that are now older than 7 days but still have the expensive GIN index? You need a cleanup process.

    This can be handled by a scheduled cleanup job. One important caveat: DROP INDEX CONCURRENTLY cannot be executed from inside a PL/pgSQL function or procedure, because it needs to manage its own transactions. A practical pattern is to generate the DROP statements with a query and let psql execute them at the top level via \gexec.

    sql
    -- drop_old_gin_indexes.sql -- run daily by a scheduler, e.g.:
    --   psql -h "$PG_HOST" -U "$PG_USER" -d "$PG_DB" -f drop_old_gin_indexes.sql
    -- The query emits one DROP INDEX CONCURRENTLY statement per GIN index that
    -- lives on an 'events' partition older than the 7-day cutoff; \gexec then
    -- executes each generated statement at the top level.
    SELECT format('DROP INDEX CONCURRENTLY IF EXISTS %I.%I', n.nspname, i.relname)
    FROM pg_inherits inh
    JOIN pg_class c      ON c.oid = inh.inhrelid        -- child partition
    JOIN pg_class parent ON parent.oid = inh.inhparent  -- parent table
    JOIN pg_namespace n  ON n.oid = c.relnamespace
    JOIN pg_index x      ON x.indrelid = c.oid          -- indexes on that partition
    JOIN pg_class i      ON i.oid = x.indexrelid
    JOIN pg_am am        ON am.oid = i.relam
    WHERE parent.relname = 'events'
      AND n.nspname = 'public'
      AND am.amname = 'gin'                             -- only the payload GIN indexes
      AND to_date(substring(c.relname from 'events_p(\d{4}_\d{2}_\d{2})'),
                  'YYYY_MM_DD') < (now() - interval '7 days')::date
    \gexec

    Performance Note: DROP INDEX CONCURRENTLY is used to avoid locking the partition while the index is being removed, which is critical in a production environment. It is also the reason the statements must be executed at the top level by psql rather than from inside a function or procedure.

    This combination of a template table for creation and a cleanup procedure for removal allows you to maintain a rolling window of specialized indexes, optimizing for both recent query performance and long-term storage costs.


    Section 4: Performance Tuning and Verifying Partition Pruning

    Partitioning is useless if the query planner doesn't use it. Partition pruning is the process where the planner intelligently excludes partitions from a query plan based on the WHERE clause. If this fails, the database may still scan all child tables, negating the benefits.

    Common Pitfalls that Break Pruning

  • Applying functions to the partition key in WHERE clauses:

    * Bad: WHERE date_trunc('day', created_at) = '2023-10-26'

    * Why: The planner cannot guarantee that the result of date_trunc will fall within a specific partition's range, so it must check all partitions.

    * Good: WHERE created_at >= '2023-10-26 00:00:00' AND created_at < '2023-10-27 00:00:00'

    * Why: This provides a direct, analyzable range that the planner can map to one or more partitions.

  • Using non-constant expressions or subqueries on the partition key:

    * Bad: WHERE created_at > (now() - interval '1 hour'), relying on the planner to resolve now() during planning.

    * Why: now() is not a constant at plan time, so the planner must keep every partition that *could* match in the plan. Newer PostgreSQL versions can often remove these partitions again at execution time, but plan-time pruning is the more predictable behavior to rely on.

    * Good: Pre-calculating the timestamp in your application and passing it as a literal parameter.
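
    When the boundary genuinely has to come from an expression like now(), it's worth knowing that recent PostgreSQL versions can still prune at executor startup. A quick way to check is to look for a "Subplans Removed" line under the Append node in the plan (a sketch; the exact output varies by version):

    sql
    EXPLAIN (ANALYZE)
    SELECT count(*)
    FROM events
    WHERE created_at > now() - interval '1 hour';
    -- In the output, "Subplans Removed: N" under the Append node indicates that
    -- partitions which could not be excluded at plan time were pruned at run time.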

    Code Example 4: Analyzing `EXPLAIN` Plans

    Let's analyze the difference. Assume we have hundreds of daily partitions.

    Query 1: Correct Pruning

    sql
    EXPLAIN (ANALYZE, BUFFERS) 
    SELECT count(*)
    FROM events
    WHERE created_at >= '2023-10-25 00:00:00' AND created_at < '2023-10-26 00:00:00';

    The EXPLAIN output will look something like this:

    text
    Finalize Aggregate  (cost=4350.93..4350.94 rows=1 width=8) (actual time=24.571..24.572 rows=1 loops=1)
      Buffers: shared hit=1234
      ->  Gather  (cost=4350.72..4350.93 rows=2 width=8) (actual time=24.542..24.568 rows=3 loops=1)
            Workers Planned: 2
            Workers Launched: 2
            Buffers: shared hit=1234
            ->  Partial Aggregate  (cost=3350.72..3350.73 rows=1 width=8) (actual time=21.169..21.170 rows=1 loops=3)
                  Buffers: shared hit=1234
                  ->  Parallel Append  (cost=0.00..3210.50 rows=56088 width=0) (actual time=0.027..17.891 rows=45123 loops=3)
                        Buffers: shared hit=1234
                        ->  Parallel Index Only Scan using events_p2023_10_25_pkey on events_p2023_10_25 ...
                              (cost=0.00..2890.00 rows=56088 width=0) (actual time=0.023..15.432 rows=45123 loops=3)
                              Buffers: shared hit=1234

    The most important part is the last line: Parallel Index Only Scan using ... on events_p2023_10_25. The planner correctly identified that only one partition (events_p2023_10_25) needed to be scanned. This is a successful prune.

    Query 2: Pruning Failure

    sql
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*)
    FROM events
    WHERE date_trunc('day', created_at) = '2023-10-25'::date;

    The EXPLAIN output will be drastically different and much longer:

    text
    ... (Aggregate and Gather nodes) ...
      ->  Append  (cost=0.00..543210.98 rows=1234567 width=8)
            ->  Index Scan using events_p2023_09_01_pkey on events_p2023_09_01 ...
                  Filter: (date_trunc('day'::text, created_at) = '2023-10-25 00:00:00'::timestamp with time zone)
            ->  Index Scan using events_p2023_09_02_pkey on events_p2023_09_02 ...
                  Filter: (date_trunc('day'::text, created_at) = '2023-10-25 00:00:00'::timestamp with time zone)
            ->  Index Scan using events_p2023_09_03_pkey on events_p2023_09_03 ...
            ... (and so on for every single partition) ...

    The key indicator of failure is the Append node followed by scans on every single partition. The planner could not prune, so it must check each one, applying the expensive date_trunc function to every row. The performance difference between these two queries on a large dataset can be orders of magnitude.

    Monitoring for Pruning Issues

    * Enable auto_explain: auto_explain logs the execution plans of statements that exceed a configured duration threshold, letting you proactively find queries that are failing to prune.

    * Review Query Logs: Regularly review your slow query logs (log_min_duration_statement). Look for queries against your partitioned tables that have unexpectedly high execution times. Manually run EXPLAIN on them to diagnose pruning issues.
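
    As a starting point, auto_explain can be enabled per session for experimentation (the thresholds below are illustrative; in production you would normally configure it in postgresql.conf and add it to shared_preload_libraries):

    sql
    -- Load auto_explain for this session and log plans for statements over 500 ms.
    LOAD 'auto_explain';
    SET auto_explain.log_min_duration = '500ms';
    SET auto_explain.log_analyze = on;
    SET auto_explain.log_buffers = on;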

    Conclusion: From Mechanism to Strategy

    PostgreSQL partitioning, supercharged with pg_partman, is a formidable solution for scaling time-series data. However, achieving genuine production resilience and performance requires moving beyond the initial setup and adopting a strategic approach.

    Senior engineers must:

  • Design for Partitioning from the Start: This includes adapting primary key strategies and understanding the constraints imposed by the partition key.
  • Implement Safe, Decoupled Retention Policies: Never rely on automated, destructive operations for valuable data. An archive-then-drop workflow using pg_partman's detach feature is the professional standard.
  • Use Template Tables for Evolving Needs: Your indexing strategy is not static. Use template tables to adapt to changing query patterns without downtime, applying expensive indexes only where they provide value.
  • Obsessively Verify Query Plans: The performance gains of partitioning are entirely dependent on the query planner's ability to prune. Continuously monitor and analyze EXPLAIN plans to ensure your queries are efficient and your database is not performing unnecessary work.

    By embracing these advanced patterns, you can transform a potentially unmanageable, monolithic table into a highly performant, scalable, and operationally robust system capable of handling petabytes of data for years to come.
