Hierarchical Continuous Aggregates in TimescaleDB for Multi-Resolution Analysis

Goh Ling Yong

The Performance Cliff of Long-Range Time-Series Queries

In any mature time-series application—be it IoT monitoring, financial tick data analysis, or application performance monitoring—a common performance bottleneck emerges: analytical queries over long time horizons. A dashboard attempting to render a year-long trend by querying raw, high-frequency data is doomed to fail. It will thrash disk I/O, consume excessive CPU, and likely time out. The standard solution is data rollups or downsampling.

TimescaleDB elegantly solves this with Continuous Aggregates (caggs), which are essentially materialized views that automatically and incrementally refresh in the background. A senior engineer will quickly implement a cagg to aggregate raw data into, say, hourly summaries. But what happens when the product requires views for daily, weekly, and monthly trends? The naive approach is to create three more caggs, each one sourcing its data from the base hypertable.

This is a subtle but critical performance trap. Each independent cagg re-scans the raw data, leading to:

  • Redundant Computation: The daily aggregate re-calculates averages and sums that were already computed by the hourly aggregate.
  • I/O Amplification: The database reads the same raw data chunks multiple times for different refresh policies, creating unnecessary I/O pressure.
  • Increased Refresh Latency: Refreshing a daily aggregate over a large hypertable can be a heavy operation, impacting overall system performance.
This article details a superior pattern: Hierarchical Continuous Aggregates. We will build aggregates on top of other aggregates, creating an efficient data processing pipeline within the database. We'll start with a production-grade schema and walk through the implementation, performance analysis, and critical edge cases you will encounter in a real-world system.


    Scenario Setup: IoT Device Metrics

    Let's model a common IoT scenario where thousands of devices report metrics every few seconds. Our goal is to build a dashboard that can efficiently display data over hourly, daily, and monthly resolutions.

    First, ensure you have the timescaledb extension enabled.

    sql
    CREATE EXTENSION IF NOT EXISTS timescaledb;

    Next, we define our hypertable for raw metrics.

    sql
    -- The main table for raw, high-frequency sensor data
    CREATE TABLE device_metrics (
        time TIMESTAMPTZ NOT NULL,
        device_id UUID NOT NULL,
        temperature DOUBLE PRECISION,
        humidity DOUBLE PRECISION,
        cpu_usage DOUBLE PRECISION,
        bytes_sent BIGINT
    );
    
    -- Convert it into a hypertable, partitioned by the 'time' column
    SELECT create_hypertable('device_metrics', 'time');
    
    -- Add indexes for efficient querying by device and time
    CREATE INDEX ON device_metrics (device_id, time DESC);
    CREATE INDEX ON device_metrics (time DESC);

    Now, let's populate it with a realistic dataset for demonstration: roughly 13 million rows, covering 100 devices each reporting once per minute for about 90 days.

    sql
    -- Generate sample data for 100 devices over ~90 days
    INSERT INTO device_metrics (time, device_id, temperature, humidity, cpu_usage, bytes_sent)
    SELECT
        t.time,
        -- Build a valid per-device UUID by padding the final group to 12 hex digits
        ('a0eebc99-9c0b-4ef8-bb6d-6bb9bd38' || lpad(d.device_num::TEXT, 4, '0'))::UUID,
        random() * 50 + 10, -- Temp between 10 and 60
        random() * 80 + 10, -- Humidity between 10 and 90
        random() * 100,     -- CPU usage %
        floor(random() * 1024 * 1024) -- Bytes sent
    FROM generate_series(
            NOW() - INTERVAL '90 days',
            NOW(),
            INTERVAL '1 minute'
        ) AS t(time)
    CROSS JOIN generate_series(0, 99) AS d(device_num);

    The Naive Approach: Independent Aggregates

    The straightforward, but inefficient, method is to create an hourly and a daily aggregate, both reading from device_metrics.

    Hourly Aggregate (from raw data):

    sql
    CREATE MATERIALIZED VIEW device_metrics_summary_hourly
    WITH (timescaledb.continuous)
    AS
    SELECT
        time_bucket('1 hour', time) AS bucket,
        device_id,
        COUNT(*) AS samples,
        AVG(temperature) AS avg_temp,
        MAX(temperature) AS max_temp,
        AVG(cpu_usage) AS avg_cpu,
        SUM(bytes_sent) AS total_bytes
    FROM
        device_metrics
    GROUP BY bucket, device_id
    WITH NO DATA;
    
    -- Add a policy to refresh this aggregate automatically
    SELECT add_continuous_aggregate_policy(
        'device_metrics_summary_hourly',
        start_offset => INTERVAL '3 hours',
        end_offset => INTERVAL '1 hour',
        schedule_interval => INTERVAL '30 minutes'
    );

    Daily Aggregate (also from raw data):

    sql
    CREATE MATERIALIZED VIEW device_metrics_summary_daily_naive
    WITH (timescaledb.continuous)
    AS
    SELECT
        time_bucket('1 day', time) AS bucket,
        device_id,
        COUNT(*) AS samples,
        AVG(temperature) AS avg_temp,
        MAX(temperature) AS max_temp,
        AVG(cpu_usage) AS avg_cpu,
        SUM(bytes_sent) AS total_bytes
    FROM
        device_metrics
    GROUP BY bucket, device_id
    WITH NO DATA;
    
    -- Add its own refresh policy
    SELECT add_continuous_aggregate_policy(
        'device_metrics_summary_daily_naive',
        start_offset => INTERVAL '2 days',
        end_offset => INTERVAL '1 day',
        schedule_interval => INTERVAL '1 hour'
    );

    Let's manually refresh them to populate data for our tests. Continuous aggregates cannot be populated with REFRESH MATERIALIZED VIEW; instead, we call refresh_continuous_aggregate, passing NULL bounds to cover each view's entire time range:

    sql
    CALL refresh_continuous_aggregate('device_metrics_summary_hourly', NULL, NULL);
    CALL refresh_continuous_aggregate('device_metrics_summary_daily_naive', NULL, NULL);

    The problem here is that when the policy for device_metrics_summary_daily_naive runs, it must scan the last day's worth of raw data from the device_metrics hypertable—data that has already been scanned and processed by the hourly aggregate's refresh policy. This is wasted work.
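    You can confirm that the two views materialize independently by inspecting TimescaleDB's informational catalog. A quick sanity check (not required for the pattern itself):

    sql
    -- Each cagg gets its own materialization hypertable, but both source device_metrics
    SELECT view_name, materialization_hypertable_name
    FROM timescaledb_information.continuous_aggregates;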


    The Advanced Pattern: Hierarchical Continuous Aggregates

    The performant solution is to build the daily aggregate from the hourly one, creating a dependency chain: raw data -> hourly cagg -> daily cagg. (Continuous aggregates defined on top of other continuous aggregates are supported natively since TimescaleDB 2.9.)

    Step 1: The Base Aggregate (Hourly)

    We already have our device_metrics_summary_hourly. This remains our base, reading directly from the raw hypertable.

    Step 2: The Hierarchical Aggregate (Daily from Hourly)

    Now, we define our new daily aggregate to source its data from device_metrics_summary_hourly, re-aggregating data that has already been aggregated once. For this to work correctly, your aggregate functions must be re-aggregatable. SUM, COUNT, MAX, and MIN are trivial. AVG is not: averaging the hourly averages would weight every hour equally regardless of how many samples each contains. Instead, we carry the SUM and COUNT through the hierarchy and compute the daily average as SUM(hourly sums) / SUM(hourly counts).

    sql
    -- To re-aggregate AVG correctly, the daily level needs the underlying SUM and COUNT,
    -- so we modify our hourly aggregate to store those components directly.
    
    -- Let's drop the old one and create a better one
    DROP MATERIALIZED VIEW device_metrics_summary_hourly;
    
    CREATE MATERIALIZED VIEW device_metrics_summary_hourly
    WITH (timescaledb.continuous)
    AS
    SELECT
        time_bucket('1 hour', time) AS bucket,
        device_id,
        COUNT(*) AS samples,
        -- To make AVG re-aggregatable, we store the sum and count
        SUM(temperature) AS sum_temp,
        MAX(temperature) AS max_temp,
        SUM(cpu_usage) AS sum_cpu,
        SUM(bytes_sent) AS total_bytes
    FROM
        device_metrics
    GROUP BY bucket, device_id
    WITH NO DATA;
    
    -- Re-add the policy and refresh
    SELECT add_continuous_aggregate_policy(
        'device_metrics_summary_hourly',
        start_offset => INTERVAL '3 hours',
        end_offset => INTERVAL '1 hour',
        schedule_interval => INTERVAL '30 minutes'
    );
    CALL refresh_continuous_aggregate('device_metrics_summary_hourly', NULL, NULL);
    
    -- Now, create the hierarchical daily aggregate
    CREATE MATERIALIZED VIEW device_metrics_summary_daily_hierarchical
    WITH (timescaledb.continuous)
    AS
    SELECT
        time_bucket('1 day', bucket) AS bucket,
        device_id,
        SUM(samples) AS samples,
        -- Re-calculate the average correctly from the partial sums
        SUM(sum_temp) / SUM(samples) AS avg_temp,
        MAX(max_temp) AS max_temp,
        SUM(sum_cpu) / SUM(samples) AS avg_cpu,
        SUM(total_bytes) AS total_bytes
    FROM
        device_metrics_summary_hourly
    GROUP BY bucket, device_id
    WITH NO DATA;
    
    SELECT add_continuous_aggregate_policy(
        'device_metrics_summary_daily_hierarchical',
        start_offset => INTERVAL '2 days',
        end_offset => INTERVAL '1 day',
        schedule_interval => INTERVAL '1 hour'
    );
    CALL refresh_continuous_aggregate('device_metrics_summary_daily_hierarchical', NULL, NULL);

    This is the core of the pattern. The device_metrics_summary_daily_hierarchical refresh policy now scans a much smaller dataset: the hourly rollups. Instead of scanning 24 * 60 = 1440 raw data points per device per day, it only needs to scan 24 rows from the hourly aggregate.
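    To get a feel for the difference, compare the input sizes a daily refresh must read for a single day of our sample data (a back-of-the-envelope check, not a formal benchmark):

    sql
    -- Rows the naive daily refresh reads for one day (raw data)
    SELECT count(*) FROM device_metrics
    WHERE time >= NOW() - INTERVAL '1 day';   -- ~144,000 (100 devices * 1,440 minutes)

    -- Rows the hierarchical daily refresh reads for the same day (hourly rollups)
    SELECT count(*) FROM device_metrics_summary_hourly
    WHERE bucket >= NOW() - INTERVAL '1 day'; -- ~2,400 (100 devices * 24 hours)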

    Performance Benchmark: Querying and Refreshing

    Let's prove the performance gains. We will use EXPLAIN (ANALYZE, BUFFERS) to compare querying for one device's average temperature over the last 80 days.

    Query 1: Against the Raw Hypertable (The Anti-Pattern)

    sql
    EXPLAIN (ANALYZE, BUFFERS) SELECT
        time_bucket('1 day', time) AS bucket,
        AVG(temperature)
    FROM device_metrics
    WHERE device_id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380001' -- An example device_id
      AND time > NOW() - INTERVAL '80 days'
    GROUP BY bucket
    ORDER BY bucket;

    Typical Result: The query plan will show a scan across every chunk of device_metrics in the 80-day window, reading on the order of a hundred thousand raw rows for a single device. It will be slow, with high buffer reads.

    text
    -- Planning Time: ~50ms
    -- Execution Time: ~3500ms
    -- Buffers: shared hit=... read=... (very high)

    Query 2: Against the Naive Daily Aggregate

    sql
    EXPLAIN (ANALYZE, BUFFERS) SELECT
        bucket,
        avg_temp
    FROM device_metrics_summary_daily_naive
    WHERE device_id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380001'
      AND bucket > NOW() - INTERVAL '80 days'
    ORDER BY bucket;

    Typical Result: This will be much faster. The plan will scan the cagg's internal materialized hypertable.

    text
    -- Planning Time: ~5ms
    -- Execution Time: ~50ms
    -- Buffers: shared hit=... (much lower)

    Query 3: Against the Hierarchical Daily Aggregate

    sql
    EXPLAIN (ANALYZE, BUFFERS) SELECT
        bucket,
        avg_temp
    FROM device_metrics_summary_daily_hierarchical
    WHERE device_id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380001'
      AND bucket > NOW() - INTERVAL '80 days'
    ORDER BY bucket;

    Typical Result: The query performance will be nearly identical to the naive cagg. The real win is not in query speed (both are fast), but in the refresh overhead.

    To analyze refresh performance, you would need to monitor the database during the policy execution. The refresh for device_metrics_summary_daily_naive would show significantly higher I/O and CPU usage compared to the refresh for device_metrics_summary_daily_hierarchical, as the latter is processing a tiny fraction of the data.
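    There is no EXPLAIN for a background refresh, but TimescaleDB's job views give a useful proxy. A minimal sketch comparing the refresh jobs' last run durations (assuming the policies were created as above):

    sql
    -- Compare refresh job runtimes across all continuous aggregate policies
    SELECT j.application_name, j.hypertable_name, s.last_run_duration, s.total_runs
    FROM timescaledb_information.jobs j
    JOIN timescaledb_information.job_stats s USING (job_id)
    WHERE j.proc_name = 'policy_refresh_continuous_aggregate';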


    Edge Case 1: Handling Late-Arriving Data

    In distributed systems, data can arrive out of order or late. How does our hierarchy handle this?

    TimescaleDB's cagg mechanism is designed for this. When a late data point is inserted into device_metrics, TimescaleDB identifies which time bucket(s) in the continuous aggregate(s) are now stale.

    Let's insert a point from two days ago:

    sql
    INSERT INTO device_metrics (time, device_id, temperature) VALUES
    (NOW() - INTERVAL '2 days', 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380001', 99.9);
  • The corresponding hourly bucket in device_metrics_summary_hourly is marked for re-computation.
  • The next time the refresh policy for the hourly cagg runs, it re-calculates that specific bucket, incorporating the late data.
  • Crucially, because the hourly cagg was updated, TimescaleDB's dependency tracking invalidates the corresponding daily bucket in device_metrics_summary_daily_hierarchical.
  • On its next scheduled run, the daily cagg's refresh policy re-reads the updated hourly bucket and updates its own state.

    This cascading invalidation and refresh happens automatically, with one important caveat: a refresh policy only re-materializes buckets that fall inside its refresh window (between start_offset and end_offset). Our hourly policy looks back just 3 hours, so a point arriving two days late stays stale until you widen start_offset or refresh the affected window manually.
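    A minimal sketch of forcing that cascade by hand over the stale window, refreshing bottom-up (hourly first, then daily):

    sql
    -- Re-materialize the window around the late point, hourly level first
    CALL refresh_continuous_aggregate('device_metrics_summary_hourly',
         NOW() - INTERVAL '3 days', NOW() - INTERVAL '1 day');
    CALL refresh_continuous_aggregate('device_metrics_summary_daily_hierarchical',
         NOW() - INTERVAL '3 days', NOW() - INTERVAL '1 day');

    -- The late 99.9 reading should now show up in the daily max_temp
    SELECT bucket, max_temp
    FROM device_metrics_summary_daily_hierarchical
    WHERE device_id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380001'
      AND bucket >= NOW() - INTERVAL '3 days';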

    Edge Case 2: Non-Re-aggregatable Functions (Percentiles)

    What if you need to calculate a median or a 95th percentile (P95)? These are not re-aggregatable with plain SQL: you cannot derive the P95 of a day from the P95 of each hour.

    This is where the timescaledb-toolkit extension becomes essential. It provides advanced aggregate functions, including percentile-approximation sketches such as tdigest and uddsketch, whose intermediate states can be merged across buckets.

    First, install the extension:

    sql
    CREATE EXTENSION IF NOT EXISTS timescaledb_toolkit;

    Now, let's modify our aggregates to track the P95 of CPU usage.

    Modified Hourly Aggregate with percentile_agg:

    sql
    -- This requires dropping and recreating the views
    DROP MATERIALIZED VIEW device_metrics_summary_daily_hierarchical;
    DROP MATERIALIZED VIEW device_metrics_summary_hourly;
    
    CREATE MATERIALIZED VIEW device_metrics_summary_hourly
    WITH (timescaledb.continuous)
    AS
    SELECT
        time_bucket('1 hour', time) AS bucket,
        device_id,
        COUNT(*) AS samples,
        SUM(temperature) AS sum_temp,
        MAX(temperature) AS max_temp,
        -- Use percentile_agg to create an intermediate aggregate state
        percentile_agg(cpu_usage) AS cpu_percentiles
    FROM
        device_metrics
    GROUP BY bucket, device_id
    WITH NO DATA;
    
    -- Re-add the refresh policy as before, then populate the view
    CALL refresh_continuous_aggregate('device_metrics_summary_hourly', NULL, NULL);

    The percentile_agg function builds a UddSketch, a compact statistical summary of the data distribution whose intermediate state can be merged with other sketches.

    Modified Hierarchical Aggregate that rolls up the percentiles:

    sql
    CREATE MATERIALIZED VIEW device_metrics_summary_daily_hierarchical
    WITH (timescaledb.continuous)
    AS
    SELECT
        time_bucket('1 day', bucket) AS bucket,
        device_id,
        SUM(samples) AS samples,
        SUM(sum_temp) / SUM(samples) AS avg_temp,
        MAX(max_temp) AS max_temp,
        -- Use rollup() to combine the percentile estimators from the hourly data
        rollup(cpu_percentiles) AS cpu_percentiles
    FROM
        device_metrics_summary_hourly
    GROUP BY bucket, device_id
    WITH NO DATA;
    
    CALL refresh_continuous_aggregate('device_metrics_summary_daily_hierarchical', NULL, NULL);

    Now, when you query the daily aggregate, you can extract the final percentile value using the approx_percentile function.

    sql
    -- Get the approximate P95 CPU usage for each day
    SELECT
        bucket,
        device_id,
        approx_percentile(0.95, cpu_percentiles) AS p95_cpu_usage
    FROM device_metrics_summary_daily_hierarchical
    WHERE device_id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380001'
    ORDER BY bucket DESC
    LIMIT 10;

    This powerful technique allows you to perform sophisticated statistical rollups across a hierarchy, which is impossible with standard SQL AVG/SUM logic for percentiles.

    Edge Case 3: Schema Evolution and Backfilling

    A dreaded production scenario: you need to add a new metric (e.g., AVG(humidity)) to your aggregates. A cagg's column definition cannot be altered in place, so the process is destructive and requires careful planning:

  • Drop Dependents: Drop the hierarchy from the top down (the daily cagg, then the hourly cagg).
  • Recreate Views: Re-create the views with the new column in their definitions.
  • Re-add Policies: Call add_continuous_aggregate_policy for each cagg.
  • Backfill: The most intensive step. Refreshing a large historical dataset in one shot can take hours and monopolize resources. A better strategy is to backfill in controlled batches: call refresh_continuous_aggregate with explicit window_start and window_end parameters in a script, processing historical data window by window during off-peak hours (see the sketch after this list).
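    A minimal backfill sketch for our 90-day dataset. Note that refresh_continuous_aggregate cannot run inside a transaction block, so issue each window as its own statement (for example, from a generated psql script) rather than wrapping the loop in a single DO block:

    sql
    -- Backfill the hourly cagg in 30-day windows, oldest first
    CALL refresh_continuous_aggregate('device_metrics_summary_hourly',
         NOW() - INTERVAL '90 days', NOW() - INTERVAL '60 days');
    CALL refresh_continuous_aggregate('device_metrics_summary_hourly',
         NOW() - INTERVAL '60 days', NOW() - INTERVAL '30 days');
    CALL refresh_continuous_aggregate('device_metrics_summary_hourly',
         NOW() - INTERVAL '30 days', NOW());

    -- Then roll the hierarchy forward from the freshly backfilled hourly data
    CALL refresh_continuous_aggregate('device_metrics_summary_daily_hierarchical', NULL, NULL);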
    This highlights the importance of designing your aggregates carefully upfront. For mission-critical systems, a blue-green deployment strategy for aggregates (create new ones, switch application queries over, then drop the old ones) may be necessary to avoid downtime or performance degradation.

    Conclusion

    Hierarchical continuous aggregates are not just a clever trick; they are a foundational pattern for building scalable, performant time-series analytical systems in TimescaleDB. By moving beyond naive, independent rollups and creating an efficient, tiered aggregation pipeline, you can:

  • Drastically reduce I/O and CPU load during data materialization.
  • Speed up refresh cycles, providing fresher data to end-users.
  • Enable complex, multi-resolution queries that would be infeasible on raw data.
  • Incorporate advanced statistics like approximate percentiles into long-term rollups.
    While the pattern requires more careful planning, especially around re-aggregatable function design and schema evolution, the performance benefits for any non-trivial time-series workload are immense. For senior engineers responsible for the stability and performance of data-intensive platforms, mastering this technique is a critical skill.
