Postgres Partitioning with `pg_partman` for Multi-Tenant Time-Series

Goh Ling Yong

The Inevitable Scaling Problem of Time-Series Data in Multi-Tenant Systems

In any mature SaaS application, tables storing time-series data—such as metrics, logs, events, or audit trails—are ticking time bombs. A single, monolithic events table that performs adequately with 100 tenants and 10 million rows will grind to a halt with 10,000 tenants and 10 billion rows. The reasons are familiar to any senior engineer who has fought this battle:

  • Index Bloat and Inefficiency: B-tree indexes on a massive table become deep and unwieldy. A simple index scan for a single tenant's data over a short time range might still require traversing large portions of an index that contains data for thousands of other tenants.
  • VACUUM Overhead: AUTOVACUUM becomes a major operational burden. On a multi-terabyte table, vacuums run for hours or even days, consuming significant I/O and CPU, often struggling to keep up with tuple churn.
  • Slow Data Deletion: Implementing data retention policies (e.g., "delete data older than 90 days") becomes a nightmare. A large-scale DELETE operation is incredibly slow, generates massive WAL traffic, and leaves behind dead tuples that exacerbate the VACUUM problem. It often requires careful, iterative batch-deleting to avoid locking contention and performance degradation.
PostgreSQL's native declarative partitioning is the textbook solution. It lets you split one logical table into smaller, more manageable physical tables, and the query planner can then perform partition pruning, ignoring partitions that cannot contain relevant data based on the WHERE clause. Powerful as the feature is, managing a partitioned table with hundreds or thousands of partitions manually is an operational nightmare: creating new partitions by hand (as sketched below), enforcing retention policies, and managing indexes across the entire set is tedious and error-prone.
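
For a sense of the manual work involved, every new day on a hand-managed daily-partitioned table needs DDL like the following (a hypothetical events table; names and dates are illustrative), plus matching DROPs as old data ages out; this is exactly the churn pg_partman automates:

sql
-- One statement like this per day, forever, created ahead of the data that needs it.
CREATE TABLE public.events_p2023_10_27
    PARTITION OF public.events
    FOR VALUES FROM ('2023-10-27 00:00:00+00') TO ('2023-10-28 00:00:00+00');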

    This is where pg_partman comes in. It's not just a convenience wrapper; it's a sophisticated automation engine for partition management that enables production-grade strategies. This article will dissect a specific, advanced pattern: using pg_partman to manage a multi-tenant time-series table partitioned by time, while ensuring high performance for tenant-specific queries through a carefully designed primary key and indexing strategy.


    The Core Strategy: Time-Partitioning with Tenant-Aware Indexing

    Our goal is to build a system that can efficiently answer queries like:

    sql
    SELECT *
    FROM metrics
    WHERE tenant_id = 'c6a9b4a4-8a4b-4f4b-8b4b-4b4b4b4b4b4b'
      AND created_at BETWEEN '2023-10-26 10:00:00' AND '2023-10-26 11:00:00';

    To achieve this, we cannot partition by tenant_id. Partitioning by LIST (tenant_id) would create a separate table for each tenant, which is unmanageable with thousands of tenants. Instead, we partition by a time-based column (created_at) and make our indexes tenant-aware.

    Here is the parent table structure we'll work with. Note the deliberate choice of the primary key.

    sql
    CREATE TABLE public.metrics (
        metric_id uuid NOT NULL DEFAULT gen_random_uuid(),
        tenant_id uuid NOT NULL,
        created_at timestamptz NOT NULL,
        metric_name text NOT NULL,
        metric_value double precision NOT NULL,
        metadata jsonb
    ) PARTITION BY RANGE (created_at);
    
    -- CRITICAL: The partition key (created_at) MUST be part of the primary key.
    ALTER TABLE public.metrics 
    ADD CONSTRAINT metrics_pkey PRIMARY KEY (tenant_id, metric_id, created_at);
    
    -- The index that will serve most of our queries.
    CREATE INDEX metrics_tenant_id_created_at_idx ON public.metrics (tenant_id, created_at DESC);

    Why this structure is critical:

  • PARTITION BY RANGE (created_at): This enables efficient time-based partition pruning. When a query specifies a created_at range, the planner will only scan the relevant daily or hourly partitions.
  • PRIMARY KEY (tenant_id, metric_id, created_at): In PostgreSQL, any primary key or unique constraint on a partitioned table must include all columns from the partition key. We include tenant_id first to optimize for tenant-specific lookups. metric_id ensures uniqueness for each measurement.
  • INDEX ON (tenant_id, created_at DESC): This is our workhorse index. After the planner prunes partitions down to a small subset (e.g., a few days' worth of tables), this index allows for an extremely fast index scan to locate the specific rows for a single tenant_id within that time range.
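
    One practical consequence of this design is that the application keeps writing to the parent table; once child partitions exist (pg_partman creates them in the next section), PostgreSQL routes each row to the matching partition automatically. A minimal illustration with made-up values:

    sql
    -- Inserted against the parent; PostgreSQL routes the row to the child partition
    -- covering 2023-10-26 (and errors out if no partition covers that date).
    INSERT INTO public.metrics (tenant_id, created_at, metric_name, metric_value)
    VALUES ('c6a9b4a4-8a4b-4f4b-8b4b-4b4b4b4b4b4b', '2023-10-26 10:15:00+00', 'cpu_utilization', 42.5);
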
    Now, let's automate the management of this structure with pg_partman.


    Advanced `pg_partman` Configuration for Production

    First, ensure the extension is installed; pg_partman's documentation recommends giving it its own schema (CREATE SCHEMA partman; CREATE EXTENSION pg_partman SCHEMA partman;), which is what the partman-prefixed calls below assume. We'll configure pg_partman to create new partitions ahead of time and automatically drop old ones.

    Creating the Partition Set

    The create_parent function is the entry point. We will configure it for daily partitions, creating 4 days of partitions in advance and maintaining a 90-day retention period.

    sql
    -- Install the extension into its own schema if you haven't already
    -- CREATE SCHEMA partman;
    -- CREATE EXTENSION pg_partman SCHEMA partman;
    
    -- Create the partition set configuration
    SELECT partman.create_parent(
        p_parent_table := 'public.metrics',
        p_control := 'created_at',       -- The column to partition by
        p_type := 'native',               -- Use native declarative partitioning
        p_interval := '1 day',            -- Create one partition per day
        p_premake := 4,                   -- Create 4 future partitions (4 days)
        p_start_partition := (now() - interval '3 days')::text -- Optional: Bootstrap with some past partitions
    );

    This command populates partman.part_config and creates the initial set of partitions. The p_premake value is crucial. It ensures that partitions are always available for incoming data, preventing insertion errors. A value of 4 for daily partitions means you have a 4-day buffer if your maintenance job fails to run.
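
    You can sanity-check the stored configuration and the partitions that were created with queries like the following (column names follow pg_partman 4.x and may differ slightly between versions):

    sql
    -- The configuration row create_parent() stored for this partition set
    SELECT parent_table, control, partition_interval, premake, retention
    FROM partman.part_config
    WHERE parent_table = 'public.metrics';
    
    -- The child partitions that exist so far
    SELECT partition_schemaname, partition_tablename
    FROM partman.show_partitions('public.metrics');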

    Automating Maintenance and Retention

    While you can call partman.run_maintenance() manually, this must be automated. You have two primary production-ready options:

  • pg_partman_bgw (Background Worker): This is a dedicated PostgreSQL background worker that wakes up at a configured interval to run maintenance. It's managed entirely within Postgres.
    * Configuration in postgresql.conf:

    ini
            shared_preload_libraries = 'pg_partman_bgw'
            pg_partman_bgw.interval = 3600  # Run every hour
            pg_partman_bgw.role = 'postgres'  # The role to run maintenance as
            pg_partman_bgw.database = 'your_db_name'

    * Pros: Self-contained within the database, no external dependencies.

    * Cons: Less observable than external job schedulers. Failures must be monitored by querying Postgres logs or pg_partman's log tables.

  • External Cron Job (e.g., Kubernetes CronJob, systemd timer): This approach uses an external scheduler to call run_maintenance_proc(), a stored procedure wrapper.
    * Pros: Integrates with existing infrastructure monitoring and alerting (e.g., Prometheus, Datadog). Failures are immediately visible in your job scheduler's dashboard.

    * Cons: Requires external infrastructure configuration.

    For most production Kubernetes-based environments, the external CronJob is preferable due to superior observability.
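
    With the CronJob approach, the scheduled job simply connects (for example via psql) and calls the maintenance procedure, roughly like this (assuming pg_partman 4.x or later, where run_maintenance_proc() is available):

    sql
    -- run_maintenance_proc() wraps run_maintenance(); as a procedure it can commit
    -- as it goes instead of doing all partition sets in a single transaction.
    CALL partman.run_maintenance_proc();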

    Configuring Retention

    Now, let's configure the 90-day retention policy directly in pg_partman's configuration table.

    sql
    UPDATE partman.part_config
    SET 
        retention = '90 days',
        retention_keep_table = false  -- Actually drop expired child tables (true only detaches them and keeps them around)
    WHERE parent_table = 'public.metrics';

    When run_maintenance() is executed, it will check for any partitions whose date range is entirely older than 90 days from the current time and DROP them. Dropping a partition completes almost instantly regardless of how many rows it holds and leaves no dead tuples behind, which is vastly superior to removing the same rows with a bulk DELETE.


    Deep Dive: Query Performance Before and After

    Let's prove the value with a concrete benchmark. We'll simulate a table with ~100 million rows, representing data for 1000 tenants over 100 days.
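
    If you want to reproduce it, the test data can be generated with a sketch like the one below. It assumes child partitions already cover the last 100 days (for example by calling create_parent with p_start_partition set 100 days back); rows with no matching partition would otherwise fail to insert.

    sql
    -- 1,000 synthetic tenant IDs
    CREATE TEMP TABLE bench_tenants AS
    SELECT gen_random_uuid() AS tenant_id
    FROM generate_series(1, 1000);
    
    -- ~100 million rows spread uniformly over the last 100 days
    INSERT INTO public.metrics (tenant_id, created_at, metric_name, metric_value)
    SELECT
        t.tenant_id,
        now() - (random() * interval '100 days'),
        'cpu_utilization',
        random() * 100
    FROM bench_tenants t
    CROSS JOIN generate_series(1, 100000);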

    Scenario 1: Single Monolithic Table

    sql
    -- (Hypothetical EXPLAIN output for a large, unpartitioned table)
    EXPLAIN ANALYZE SELECT count(*)
    FROM metrics_monolith
    WHERE tenant_id = 'c6a9b4a4-8a4b-4f4b-8b4b-4b4b4b4b4b4b'
      AND created_at >= '2023-10-26 00:00:00'
      AND created_at < '2023-10-27 00:00:00';
    
    -- Likely Result:
    -- Finalize Aggregate  (cost=... rows=1 width=8) (actual time=850.123..850.124 rows=1 loops=1)
    --   ->  Gather  (cost=... rows=2 width=8) (actual time=849.555..850.120 rows=3 loops=1)
    --         Workers Planned: 2
    --         Workers Launched: 2
    --         ->  Partial Aggregate  (cost=... rows=1 width=8) (actual time=845.555..845.556 rows=1 loops=3)
    --               ->  Parallel Index Scan using metrics_mono_tenant_id_created_at_idx on metrics_monolith ...
    --                     Index Cond: (tenant_id = 'c6a9b4a4-8a4b-4f4b-8b4b-4b4b4b4b4b4b' AND created_at >= ... AND created_at < ...)
    --                     Buffers: shared hit=150000
    -- Planning Time: 0.543 ms
    -- Execution Time: 850.456 ms

    The key takeaway here is the Execution Time and Buffers. Even with a perfectly good index, the database has to navigate a massive index and fetch data blocks that are interspersed with data from thousands of other tenants, leading to high buffer usage and slower execution.

    Scenario 2: pg_partman Partitioned Table

    Now, let's run the same query against our partitioned structure.

    sql
    EXPLAIN ANALYZE SELECT count(*)
    FROM metrics
    WHERE tenant_id = 'c6a9b4a4-8a4b-4f4b-8b4b-4b4b4b4b4b4b'
      AND created_at >= '2023-10-26 00:00:00'
      AND created_at < '2023-10-27 00:00:00';
    
    -- Result:
    -- Finalize Aggregate  (cost=... rows=1 width=8) (actual time=15.123..15.124 rows=1 loops=1)
    --   ->  Gather  (cost=... rows=2 width=8) (actual time=14.555..15.120 rows=3 loops=1)
    --         Workers Planned: 2
    --         Workers Launched: 2
    --         ->  Partial Aggregate  (cost=... rows=1 width=8) (actual time=10.555..10.556 rows=1 loops=3)
    --               ->  Parallel Index Scan using metrics_p2023_10_26_tenant_id_created_at_idx on metrics_p2023_10_26 ...
    --                     Index Cond: (tenant_id = 'c6a9b4a4-8a4b-4f4b-8b4b-4b4b4b4b4b4b' AND created_at >= ... AND created_at < ...)
    --                     Buffers: shared hit=1500
    -- Planning Time: 1.234 ms (includes partition pruning)
    -- Execution Time: 15.456 ms

    Analysis of the Improvement:

  • Partition Pruning: The planner immediately identified that only the metrics_p2023_10_26 partition could possibly contain the requested data. It completely ignored the other 99+ partitions. This is the single biggest performance gain.
  • Smaller, Targeted Index: The index scan now operates on a much smaller, denser index that only contains data for that single day. This results in significantly fewer buffer hits (shared hit=1500 vs 150000).
  • Drastic Reduction in Execution Time: The query is over 50x faster (15ms vs 850ms). This difference becomes even more pronounced as the total data volume grows into the terabytes.


    Advanced Edge Cases and Production Patterns

    Implementing partitioning is more than just setup; it's about handling the operational lifecycle.

    Edge Case 1: Archiving Data Instead of Dropping It

    Dropping data is not always acceptable. For compliance or analytics, you may need to archive it to cheaper storage like Amazon S3. pg_partman can facilitate this by detaching partitions instead of dropping them.

    Step 1: Modify the retention policy.

    Recent pg_partman versions can detach old partitions instead of dropping them (retention_keep_table = true, optionally combined with retention_schema to move them aside), but exporting the data to external storage still requires custom code. For full control, we'll disable pg_partman's automatic retention and handle the whole lifecycle in our own procedure.

    sql
    UPDATE partman.part_config
    SET retention = NULL
    WHERE parent_table = 'public.metrics';
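
    (For reference: if you only need old partitions moved aside rather than exported, pg_partman can do the detach itself. A sketch, assuming the retention_keep_table and retention_schema columns of partman.part_config behave as documented for your pg_partman version:)

    sql
    -- Alternative: pg_partman detaches expired partitions and moves them
    -- into a holding schema instead of dropping them.
    UPDATE partman.part_config
    SET
        retention = '90 days',
        retention_keep_table = true,
        retention_schema = 'detached'
    WHERE parent_table = 'public.metrics';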

    Step 2: Create a custom archiving procedure.

    This procedure will find old partitions, detach them, and export their data. The sketch below only builds and logs a COPY command as a placeholder; in a real system you would export the data before dropping, for example with the aws_s3 extension's query_export_to_s3 on RDS/Aurora, or a server-side COPY to a file on self-managed Postgres.

    sql
    CREATE OR REPLACE PROCEDURE public.run_metrics_archive()
    LANGUAGE plpgsql
    AS $$
    DECLARE
        v_partition_name text;
        v_archive_command text;
        v_detached_schema text := 'detached';  -- must already exist (CREATE SCHEMA detached;)
    BEGIN
        -- Find partitions older than 90 days. pg_partman's default daily suffix
        -- means child tables end in _pYYYY_MM_DD, which we parse back into a date.
        FOR v_partition_name IN 
            SELECT partition_tablename
            FROM partman.show_partitions('public.metrics', 'DESC')
            WHERE to_date(substring(partition_tablename from '\d{4}_\d{2}_\d{2}$'), 'YYYY_MM_DD') < (now() - interval '90 days')
        LOOP
            RAISE NOTICE 'Archiving and detaching partition: %', v_partition_name;
    
            -- 1. Detach the partition so it becomes a standalone table, then move it
            --    to the 'detached' schema to keep things clean. (On PostgreSQL 14+,
            --    DETACH PARTITION ... CONCURRENTLY avoids the ACCESS EXCLUSIVE lock.)
            EXECUTE format('ALTER TABLE public.metrics DETACH PARTITION public.%I', v_partition_name);
            EXECUTE format('ALTER TABLE public.%I SET SCHEMA %I', v_partition_name, v_detached_schema);
    
            -- 2. Archive the data. COPY ... TO STDOUT cannot run from inside plpgsql,
            --    so in a real system replace this with aws_s3.query_export_to_s3 (RDS/Aurora)
            --    or a server-side COPY to a file, and only drop after verifying the export.
            v_archive_command := format(
                'COPY %I.%I TO STDOUT WITH (FORMAT CSV, HEADER)',
                v_detached_schema, v_partition_name
            );
            RAISE NOTICE 'Run archive command: %', v_archive_command;
    
            -- 3. Drop the now-archived, detached table
            EXECUTE format('DROP TABLE %I.%I', v_detached_schema, v_partition_name);
    
            RAISE NOTICE 'Successfully archived and dropped %', v_partition_name;
        END LOOP;
    END;
    $$;

    This procedure would then be called by your external CronJob after partman.run_maintenance() completes.
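
    Putting it together, the scheduled job for this pattern can be as simple as two CALLs (the detached schema and the procedure are the ones defined above; run_maintenance_proc() is pg_partman's procedure wrapper):

    sql
    -- One-time setup for the holding schema used by the archive procedure
    CREATE SCHEMA IF NOT EXISTS detached;
    
    -- What the nightly job executes
    CALL partman.run_maintenance_proc();
    CALL public.run_metrics_archive();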

    Edge Case 2: The `UPDATE` Partition Key Problem

    What happens if you try to update a row's created_at value, moving it from one partition to another?

    sql
    UPDATE metrics SET created_at = '2023-10-25 12:00:00' WHERE metric_id = '...';

    In PostgreSQL, this operation is executed as a DELETE from the old partition and an INSERT into the new one. This has several negative implications:

    * Performance: It's significantly slower than an in-place update.

    * Concurrency: Because the row is effectively deleted and re-inserted elsewhere, concurrent transactions that try to update or lock the same row can fail with errors instead of simply waiting for the lock.

    * Trigger Complexity: UPDATE triggers will fire in a non-obvious way.

    Solution: The best solution is often architectural. Design your application so that the partition key is immutable. If a timestamp must be changed, model it as a logical DELETE (e.g., setting a deleted_at flag) and a new INSERT. If updates across partitions are unavoidable, ensure they are rare and performed during off-peak hours.
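
    A sketch of that logical-delete pattern, here using the existing metadata column as the supersession flag (a dedicated deleted_at column would work just as well; the IDs and values are placeholders):

    sql
    BEGIN;
    
    -- Mark the existing row as superseded instead of moving it between partitions.
    -- Including created_at in the WHERE clause lets the planner prune to one partition.
    UPDATE metrics
    SET metadata = coalesce(metadata, '{}'::jsonb) || '{"superseded": true}'::jsonb
    WHERE metric_id = '...'
      AND created_at = '2023-10-26 12:00:00';
    
    -- Insert the corrected row with the new timestamp.
    INSERT INTO metrics (tenant_id, created_at, metric_name, metric_value)
    VALUES ('...', '2023-10-25 12:00:00', 'cpu_utilization', 42.5);
    
    COMMIT;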

    Edge Case 3: Monitoring and Failure Recovery

    How do you know if pg_partman's maintenance is failing?

    pg_partman writes errors to the standard PostgreSQL log. If the companion pg_jobmon extension is installed, each maintenance run is also recorded in pg_jobmon's job tables, which gives you something queryable to alert on.

    Your monitoring system should check those job records (or scrape the Postgres log) rather than assuming maintenance succeeded. For example, with pg_jobmon installed (table and column names per pg_jobmon; verify against your installed version):

    sql
    -- Any job logged in the last hour that did not finish with status OK
    SELECT job_name, status, start_time, end_time
    FROM jobmon.job_log
    WHERE start_time > now() - interval '1 hour'
      AND status IS DISTINCT FROM 'OK';

    A query like this should be integrated into a monitoring tool (e.g., a Datadog agent check, a Prometheus exporter) to fire an alert if any errors are logged. Common failures include:

    * Permissions issues: The role running maintenance doesn't have CREATE or DROP privileges.

    * Lock contention: Maintenance can block waiting for locks on the parent table (or fail outright if you set a lock_timeout for the maintenance session). Scheduling maintenance during low-traffic periods largely avoids this.

    * Out of disk space: Partition creation fails because the tablespace is full.

    Conclusion: From Liability to Asset

    By combining PostgreSQL's native partitioning with the automation and advanced features of pg_partman, a massive, unwieldy time-series table can be transformed from a performance liability into a highly scalable and manageable asset. The key is not just to partition, but to do so with a strategy that aligns with your primary query patterns.

    The multi-tenant pattern discussed here (partitioning by time while indexing for tenant-specific access) is a proven approach for building high-performance SaaS backends. It lets the query planner work on small, relevant data subsets, drastically reducing query latency and I/O load. By automating partition creation and retention, pg_partman removes the operational overhead, freeing engineering teams to build features rather than perform manual database housekeeping. This is the kind of robust engineering required to keep PostgreSQL fast and operable as time-series tables grow well into the multi-terabyte range.
