Advanced Time-Series Aggregation with TimescaleDB Continuous Aggregates


The Performance Cliff of Real-Time Time-Series Aggregation

As a senior engineer responsible for a system ingesting high-frequency time-series data—be it IoT metrics, financial ticks, or application performance monitoring—you inevitably hit a wall. Your primary data table, a PostgreSQL hypertable managed by TimescaleDB, has grown to several terabytes, containing billions, if not trillions, of rows. The business needs dashboards that display aggregated data: hourly averages, daily maximums, per-minute sums.

The initial, straightforward approach is a direct GROUP BY query on the raw data:

sql
-- The query that brings your production database to its knees.
SELECT 
    time_bucket('1 hour', ts) as hour,
    device_id,
    AVG(cpu_usage) as avg_cpu,
    MAX(memory_usage) as max_mem
FROM device_metrics
WHERE ts > NOW() - INTERVAL '7 days'
GROUP BY hour, device_id;

On a small dataset, this is instantaneous. On a 10TB hypertable, this query triggers a massive I/O operation, scanning millions or billions of rows across numerous chunks. It consumes significant CPU and memory resources, query latency skyrockets from milliseconds to minutes, and you risk starving the database of resources needed for its primary function: ingestion. This is not a scalable architecture.

The introductory solution is TimescaleDB's Continuous Aggregates (caggs). Most engineers stop after creating their first cagg. This article is for those who have done that and found it's not enough. We will go deeper, exploring the advanced, production-hardening patterns required to manage these systems at massive scale: Hierarchical Continuous Aggregates and Compression on Aggregates.

Prerequisite: A Scalable Hypertable Schema

Let's establish our working schema. We're monitoring a fleet of IoT devices, collecting CPU, memory, and disk I/O metrics every second.

sql
-- Ensure the TimescaleDB extension is installed
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- Metadata table first, so the foreign key below has something to reference
CREATE TABLE devices (
    id               UUID              PRIMARY KEY,
    location         TEXT              NOT NULL,
    firmware_version TEXT              NOT NULL
);

-- Main table for raw, high-cardinality time-series data
CREATE TABLE device_metrics (
    ts              TIMESTAMPTZ       NOT NULL,
    device_id       UUID              NOT NULL,
    cpu_usage       DOUBLE PRECISION  NOT NULL,
    memory_usage    DOUBLE PRECISION  NOT NULL,
    disk_io_ps      DOUBLE PRECISION  NOT NULL,
    -- Additional metadata lives in the devices table
    -- to keep this table lean and fast for writes.
    CONSTRAINT fk_device
      FOREIGN KEY (device_id)
        REFERENCES devices (id)
);

-- Convert the table to a hypertable, partitioned by time.
-- A chunk interval of 1 day is reasonable for high-volume data.
SELECT create_hypertable('device_metrics', 'ts', chunk_time_interval => INTERVAL '1 day');

-- Index for the most common query pattern: per-device time-range scans.
-- (create_hypertable already creates a default index on (ts DESC),
-- so a separate time-only index is not needed.)
CREATE INDEX ON device_metrics (device_id, ts DESC);

With this schema, ingesting millions of rows per minute is feasible. Querying it directly for aggregates, as we've established, is not.
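
If you want to experiment with the patterns below without a production firehose, you can generate synthetic readings with plain PostgreSQL. A minimal sketch, assuming a handful of rows already exist in the devices table (in production you would ingest via batched inserts or COPY):

sql
-- Generate one synthetic reading per device per second for the last hour.
-- Assumes the devices table is already populated.
INSERT INTO device_metrics (ts, device_id, cpu_usage, memory_usage, disk_io_ps)
SELECT
    ts,
    d.id,
    random() * 100,   -- fake CPU %
    random() * 100,   -- fake memory %
    random() * 500    -- fake disk operations per second
FROM generate_series(NOW() - INTERVAL '1 hour', NOW(), INTERVAL '1 second') AS ts,
     devices d;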

Level 1: The Standard Continuous Aggregate

First, let's create a standard hourly continuous aggregate to solve our initial problem. This is the baseline from which we will build.

sql
-- Create the continuous aggregate (materialized view)
CREATE MATERIALIZED VIEW device_metrics_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', ts) AS hour,
    device_id,
    AVG(cpu_usage) as avg_cpu,
    MAX(cpu_usage) as max_cpu,
    MIN(cpu_usage) as min_cpu,
    AVG(memory_usage) as avg_mem,
    MAX(memory_usage) as max_mem,
    MIN(memory_usage) as min_mem
FROM device_metrics
GROUP BY hour, device_id;

This DDL creates a materialized view that stores the aggregated data. However, it doesn't update itself. We need a refresh policy.

sql
-- Add a policy to refresh the aggregate automatically
SELECT add_continuous_aggregate_policy('device_metrics_hourly',
    start_offset => INTERVAL '3 hours',
    end_offset   => INTERVAL '1 hour',
    schedule_interval => INTERVAL '30 minutes');

Let's break down this policy. It's critical to understand these parameters:

  • schedule_interval => '30 minutes': The policy job runs every 30 minutes.
  • end_offset => '1 hour': The job will not materialize data from the last hour. This is a crucial buffer for late-arriving data. If a metric from 59 minutes ago arrives, it will be included in the next run.
  • start_offset => '3 hours': The job will refresh data up to 3 hours old. This means each run re-calculates the last 2 hours of aggregated data (3 hours - 1 hour). This helps correct for out-of-order data but adds computational overhead.
Now, a query against device_metrics_hourly is orders of magnitude faster. EXPLAIN ANALYZE would show a scan on the much smaller materialized view hypertable instead of the raw data table.
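
A typical dashboard query now targets the aggregate instead of the raw hypertable. Depending on your TimescaleDB version and the materialized_only setting, it may also transparently include the not-yet-materialized tail of raw data (real-time aggregation):

sql
-- Same question as the original query, but answered from the hourly aggregate.
SELECT hour, device_id, avg_cpu, max_mem
FROM device_metrics_hourly
WHERE hour > NOW() - INTERVAL '7 days'
ORDER BY hour, device_id;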

But what happens when you need daily rollups? Or weekly? Creating a daily aggregate from the raw device_metrics table is still a monstrously expensive operation, even if run only once a day. This is where the standard approach fails at scale.

Level 2: Hierarchical Continuous Aggregates

The most powerful optimization for multi-level aggregation is to build aggregates from other aggregates. Instead of Raw Data -> Daily, we create a chain: Raw Data -> Hourly -> Daily. Each step in the chain processes a dramatically smaller dataset than the last. (Hierarchical continuous aggregates require TimescaleDB 2.9 or later.)

Step 1: Create the Hourly Base Aggregate (as before)

We already have our device_metrics_hourly view. This will be the foundation of our hierarchy.

Step 2: Create a Daily Aggregate from the Hourly Aggregate

Notice the FROM clause. We are selecting from device_metrics_hourly, not device_metrics.

sql
-- NOTE: We are aggregating the pre-aggregated data.
-- We need to compute the daily AVG from the hourly data.
-- This requires a weighted average. For MIN/MAX, it's simpler.

-- To do this correctly, our base aggregate needs to store the COUNT and SUM
-- to calculate weighted averages later.

-- Let's DROP and RECREATE our hourly aggregate with the necessary components.
-- (Dropping discards previously materialized data; see the backfilling section below.)
DROP MATERIALIZED VIEW device_metrics_hourly;

CREATE MATERIALIZED VIEW device_metrics_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', ts) AS hour,
    device_id,
    COUNT(*) AS num_samples,
    SUM(cpu_usage) AS sum_cpu,
    MAX(cpu_usage) AS max_cpu,
    MIN(cpu_usage) AS min_cpu,
    SUM(memory_usage) AS sum_mem,
    MAX(memory_usage) AS max_mem,
    MIN(memory_usage) AS min_mem
FROM device_metrics
GROUP BY hour, device_id;

-- Re-add the policy
SELECT add_continuous_aggregate_policy('device_metrics_hourly',
    start_offset => INTERVAL '3 hours',
    end_offset   => INTERVAL '1 hour',
    schedule_interval => INTERVAL '30 minutes');

-- Now, create the daily aggregate FROM the hourly one.
CREATE MATERIALIZED VIEW device_metrics_daily
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 day', hour) AS day,
    device_id,
    SUM(num_samples) AS num_samples,            -- carry the count forward for further rollups
    SUM(sum_cpu) / SUM(num_samples) AS avg_cpu, -- correct weighted average
    MAX(max_cpu) AS max_cpu,
    MIN(min_cpu) AS min_cpu,
    SUM(sum_mem) / SUM(num_samples) AS avg_mem,
    MAX(max_mem) AS max_mem,
    MIN(min_mem) AS min_mem
FROM device_metrics_hourly
GROUP BY day, device_id;

This is a critical architectural detail. Simple AVG(avg_cpu) would be mathematically incorrect. You must carry forward the components (SUM, COUNT) to calculate accurate downstream averages.
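
A side effect of storing components instead of averages is that the hourly view no longer exposes an avg_cpu column. If dashboards still need hourly averages, derive them at query time from the stored SUM and COUNT:

sql
-- Hourly averages computed from the carried-forward components.
SELECT
    hour,
    device_id,
    sum_cpu / num_samples AS avg_cpu,
    sum_mem / num_samples AS avg_mem
FROM device_metrics_hourly
WHERE hour > NOW() - INTERVAL '24 hours';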

Now, let's add a policy for the daily aggregate.

sql
-- The refresh window (start_offset minus end_offset) must cover at least two
-- bucket widths, so use a 3-day start_offset for a 1-day bucket.
SELECT add_continuous_aggregate_policy('device_metrics_daily',
    start_offset => INTERVAL '3 days',
    end_offset   => INTERVAL '1 day',
    schedule_interval => INTERVAL '1 hour');

Performance Impact Analysis:

Let's assume each of 1,000 devices reports 1,000 metrics per second.

  • Raw Data Rate: 1,000,000 rows/second, or ~86.4 billion rows/day.
  • Non-Hierarchical Daily Aggregate Input: a daily materialization job built directly on the raw table would have to scan all 86.4 billion rows.
  • Hourly Aggregate Output: 1,000 devices * 24 hours = 24,000 rows/day.
  • Hierarchical Daily Aggregate Input: the daily materialization job now only needs to scan the 24,000 rows produced by the hourly aggregate.

This is a reduction in data processed by a factor of more than 3.5 million (86.4 billion / 24,000 ≈ 3.6 million). The resource consumption of the daily aggregation job drops from catastrophic to trivial.

Level 3: Compression Policies on Aggregates

Over years, even your aggregated data can become enormous. A daily aggregate for 10,000 devices over 5 years is 10,000 × 365 × 5 = 18,250,000 rows. While much smaller than the raw data, this still consumes expensive storage and can slow down queries that span long time ranges.

TimescaleDB's columnar compression is a perfect solution here. Most engineers apply it to the raw hypertable, but its true power is realized when applied to continuous aggregates.

Compressed data has trade-offs:

  • Pros: Massive storage reduction (often >90%). Faster queries due to less data being read from disk.
  • Cons: Data in compressed chunks cannot be modified (UPDATE, DELETE). INSERTs are still possible but may be slower.

This immutability constraint is often perfectly acceptable for aggregated data older than a few days or weeks.
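
If you ever do need to correct history that has already been compressed, the affected chunks must be decompressed first. A hedged sketch against the raw device_metrics hypertable, assuming compression has also been enabled on it and the data you need to fix lies within the last 90 days:

sql
-- Decompress any compressed chunks covering the window you need to modify.
-- The second argument (true) makes the call a no-op for chunks that are not compressed.
SELECT decompress_chunk(c, true)
FROM show_chunks('device_metrics', newer_than => INTERVAL '90 days') AS c;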

Implementing a Compression Policy

Let's apply compression to our device_metrics_daily aggregate. We want to compress data that is more than 7 days old.

sql
-- First, enable compression on the materialized hypertable
ALTER MATERIALIZED VIEW device_metrics_daily SET (
    timescaledb.compress = true,
    timescaledb.compress_segmentby = 'device_id'
);

-- Add a policy to compress data older than 7 days
SELECT add_compression_policy('device_metrics_daily', compress_after => INTERVAL '7 days');

Dissecting compress_segmentby:

This is the most important compression parameter. compress_segmentby = 'device_id' tells TimescaleDB to group all data for a single device_id together within a compressed chunk. This dramatically improves performance for queries that filter by device_id, which is our most common pattern:

sql
-- This query will be extremely fast because TimescaleDB can decompress
-- only the segments related to the specified device_id.
SELECT * FROM device_metrics_daily WHERE device_id = 'some-uuid' AND day > NOW() - INTERVAL '1 year';

If you omitted compress_segmentby, TimescaleDB would have to decompress significantly more data to find the rows for that specific device.

Benchmarking Storage Savings

Let's visualize the impact.

Data Type                  | Time Range | Uncompressed Size | Compressed Size | Reduction
Raw device_metrics         | 1 Year     | 300 TB            | 30 TB           | 90%
device_metrics_daily cagg  | 1 Year     | 500 GB            | 45 GB           | 91%

By compressing the aggregate, we've reduced its long-term storage footprint from 500GB to just 45GB, making year-over-year analysis queries faster and cheaper.
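
These numbers are illustrative; you can measure the real savings on your own system with TimescaleDB's stats functions. A sketch for TimescaleDB 2.x (the exact output columns of hypertable_compression_stats may differ slightly between versions):

sql
-- Find the materialization hypertable behind the cagg and check its compression stats.
SELECT
    ca.view_name,
    pg_size_pretty(stats.before_compression_total_bytes) AS uncompressed,
    pg_size_pretty(stats.after_compression_total_bytes)  AS compressed
FROM timescaledb_information.continuous_aggregates ca,
     LATERAL hypertable_compression_stats(
         format('%I.%I', ca.materialization_hypertable_schema,
                         ca.materialization_hypertable_name)::regclass) AS stats
WHERE ca.view_name = 'device_metrics_daily';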

Advanced Edge Cases and Production Patterns

Implementing these patterns in a live production environment requires navigating several complex scenarios.

1. Historical Backfilling

When you first create a continuous aggregate on a hypertable that already holds years of data, you have two options: let the CREATE statement materialize the entire history (a blocking operation that can run for hours), or create the aggregate WITH NO DATA and backfill it yourself. Note that the standard REFRESH MATERIALIZED VIEW command does not work on continuous aggregates, and a single refresh_continuous_aggregate call spanning the whole history can monopolize the database for hours.

The correct approach is a batched, manual refresh.

sql
-- When creating the cagg, add WITH NO DATA at the end to avoid an initial, blocking refresh
CREATE MATERIALIZED VIEW device_metrics_hourly
WITH (timescaledb.continuous) AS
...   -- same SELECT as before
WITH NO DATA;

-- Now, write a script or function to backfill in manageable windows.
-- This is a simplified example; a production script would have more robust error handling and logging.
-- Note: refresh_continuous_aggregate cannot run inside a transaction block. Run this from a
-- plain psql session (autocommit), not wrapped in BEGIN/COMMIT; if your TimescaleDB version
-- rejects the CALL inside a DO block, issue the same CALLs from a client-side loop instead.

DO $$
DECLARE
    start_ts timestamptz := '2022-01-01 00:00:00+00';
    end_ts   timestamptz := '2022-02-01 00:00:00+00';
    -- Find the true beginning of data instead of hard-coding it:
    -- SELECT min(ts) INTO start_ts FROM device_metrics;
BEGIN
    WHILE start_ts < NOW() - INTERVAL '1 day' LOOP
        RAISE NOTICE 'Refreshing window: % to %', start_ts, end_ts;

        CALL refresh_continuous_aggregate('device_metrics_hourly', start_ts, end_ts);

        -- Commit the transaction to release locks and save progress
        COMMIT;

        start_ts := end_ts;
        end_ts := end_ts + INTERVAL '1 month';
    END LOOP;
END $$;

Key considerations for backfilling:

  • Window Size: Choose a window (1 month in the example) that your database can handle without performance degradation. Monitor I/O and CPU during the first run to tune it.
  • Run During Off-Peak Hours: Schedule the backfill script when the system load is lowest.
  • Hierarchical Backfilling: You must backfill the hierarchy in order. First, backfill the entire hourly aggregate; only then can you begin backfilling the daily aggregate, as sketched below.
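
A sketch of that ordering for a single window (the dates are placeholders; in practice you would loop over windows exactly as in the backfill script above):

sql
-- 1. Materialize the hourly aggregate for the window first...
CALL refresh_continuous_aggregate('device_metrics_hourly', '2022-01-01', '2022-02-01');
-- 2. ...and only then the daily aggregate that reads from it.
CALL refresh_continuous_aggregate('device_metrics_daily',  '2022-01-01', '2022-02-01');
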
2. Schema Migrations and Altering Continuous Aggregates

You cannot simply run ALTER MATERIALIZED VIEW on a continuous aggregate to add a column. The only path is to build a replacement aggregate and cut over to it, which requires careful planning to avoid downtime.

Scenario: We need to add avg_disk_io to our hourly aggregate.

The Zero-Downtime Migration Strategy:

1. Create a New Cagg: Create a second, new continuous aggregate with the desired schema (device_metrics_hourly_v2).

sql
CREATE MATERIALIZED VIEW device_metrics_hourly_v2
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', ts) AS hour,
    device_id,
    -- ... all old columns ...
    AVG(disk_io_ps) as avg_disk_io -- The new column
FROM device_metrics
GROUP BY hour, device_id
WITH NO DATA;

2. Backfill the New Cagg: Use the batched backfilling script from the previous section to populate device_metrics_hourly_v2 with historical data.

3. Add a Policy to the New Cagg: Start the automated refresh policy on the new cagg. For a period, both aggregates will be refreshing in parallel.

sql
SELECT add_continuous_aggregate_policy('device_metrics_hourly_v2', ...);

4. Transition Application Logic: Deploy new application code that reads from device_metrics_hourly_v2. The old aggregate device_metrics_hourly is still available for old code versions, ensuring a smooth transition.

5. Decommission the Old Cagg: Once all traffic has migrated to the new aggregate and you've verified its correctness, you can safely decommission the old one. If other aggregates (such as device_metrics_daily) were built on top of the old view, migrate or rebuild them against the new view first; otherwise the DROP will fail because of the dependency.

sql
-- First, remove the policy to stop it from running
SELECT remove_continuous_aggregate_policy('device_metrics_hourly');
-- Then, drop the view
DROP MATERIALIZED VIEW device_metrics_hourly;

6. Rename (Optional): Finally, you can rename the new view to the old name for consistency.

sql
ALTER MATERIALIZED VIEW device_metrics_hourly_v2 RENAME TO device_metrics_hourly;

This process is complex but provides a robust, zero-downtime path for evolving your aggregates as business requirements change.

3. Monitoring and Debugging

Continuous aggregates are not fire-and-forget. You must monitor their health.

TimescaleDB provides several useful views for this:

  • timescaledb_information.continuous_aggregates: High-level information about your caggs.
  • timescaledb_information.jobs: Status of all TimescaleDB background jobs, including cagg policies.
  • timescaledb_information.job_stats: Detailed statistics for each job run, including successes, failures, and duration. This is your primary tool for debugging.

Example debugging query: find failing cagg jobs.

sql
-- job_stats does not store the error text; see timescaledb_information.job_errors
-- (shown below) for the actual messages.
SELECT
    js.job_id,
    j.application_name,
    js.last_run_started_at,
    js.last_run_status,
    js.last_run_duration,
    js.total_failures
FROM timescaledb_information.job_stats js
JOIN timescaledb_information.jobs j ON js.job_id = j.job_id
WHERE j.application_name LIKE '%Continuous Aggregate%'
  AND js.last_run_status = 'Failed'
ORDER BY js.last_run_started_at DESC
LIMIT 10;

Common failure modes include data type mismatches from underlying schema changes, insufficient disk space, or statement timeouts on very large materialization windows.
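
For the actual error text behind a failed run, TimescaleDB 2.8+ records per-run details in timescaledb_information.job_errors. A minimal sketch (column names may vary slightly by version):

sql
-- Most recent background-job failures with their error messages.
SELECT job_id, start_time, finish_time, sqlerrcode, err_message
FROM timescaledb_information.job_errors
ORDER BY start_time DESC
LIMIT 10;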

Conclusion: From Brute Force to Surgical Precision

Managing time-series data at scale is an exercise in moving from brute-force computation to surgical pre-computation. A simple GROUP BY is the former; a multi-level, compressed hierarchical continuous aggregate is the latter.

By layering your aggregates, you transform computationally impossible queries into trivial lookups on progressively smaller datasets. By applying compression, you solve the long-term storage cost and performance degradation associated with retaining years of aggregated metrics.

These patterns are not configuration tweaks; they represent a fundamental architectural shift in how you handle time-series analytics. Mastering them requires a deep understanding of the underlying mechanics of materialization, refresh policies, and compression. For senior engineers building robust, scalable data platforms, this level of understanding is not optional; it's the difference between a system that works and a system that thrives under extreme load.
