Hierarchical Continuous Aggregates in TimescaleDB for Multi-Resolution Analysis
The Performance Cliff of Long-Range Time-Series Queries
In any mature time-series application—be it IoT monitoring, financial tick data analysis, or application performance monitoring—a common performance bottleneck emerges: analytical queries over long time horizons. A dashboard attempting to render a year-long trend by querying raw, high-frequency data is doomed to fail. It will thrash disk I/O, consume excessive CPU, and likely time out. The standard solution is data rollups or downsampling.
TimescaleDB elegantly solves this with Continuous Aggregates (caggs), which are essentially materialized views that automatically and incrementally refresh in the background. A senior engineer will quickly implement a cagg to aggregate raw data into, say, hourly summaries. But what happens when the product requires views for daily, weekly, and monthly trends? The naive approach is to create three more caggs, each one sourcing its data from the base hypertable.
This is a subtle but critical performance trap. Each independent cagg re-scans the raw data, leading to:
- Redundant I/O: the same raw chunks are read once per aggregate, multiplying disk and cache pressure.
- Duplicated CPU work: every refresh policy re-computes aggregations over data a finer-grained cagg has already processed.
- Compounding refresh cost: each additional resolution (weekly, monthly) adds another full scan of the base hypertable.
This article details a superior pattern: Hierarchical Continuous Aggregates. We will build aggregates on top of other aggregates, creating an efficient data processing pipeline within the database. We'll start with a production-grade schema and walk through the implementation, performance analysis, and critical edge cases you will encounter in a real-world system.
Scenario Setup: IoT Device Metrics
Let's model a common IoT scenario where thousands of devices report metrics every few seconds. Our goal is to build a dashboard that can efficiently display data over hourly, daily, and monthly resolutions.
First, ensure you have the timescaledb extension enabled.
CREATE EXTENSION IF NOT EXISTS timescaledb;
Next, we define our hypertable for raw metrics.
-- The main table for raw, high-frequency sensor data
CREATE TABLE device_metrics (
time TIMESTAMPTZ NOT NULL,
device_id UUID NOT NULL,
temperature DOUBLE PRECISION,
humidity DOUBLE PRECISION,
cpu_usage DOUBLE PRECISION,
bytes_sent BIGINT
);
-- Convert it into a hypertable, partitioned by the 'time' column
SELECT create_hypertable('device_metrics', 'time');
-- Add indexes for efficient querying by device and time
CREATE INDEX ON device_metrics (device_id, time DESC);
CREATE INDEX ON device_metrics (time DESC);
Now, let's populate it with a realistic dataset for demonstration. We'll insert roughly 13 million data points: one reading per minute for each of 100 devices over approximately 90 days.
-- Generate sample data: one reading per minute per device for ~90 days
INSERT INTO device_metrics (time, device_id, temperature, humidity, cpu_usage, bytes_sent)
SELECT
    t.time,
    -- Build 100 deterministic device UUIDs by varying the last four hex digits
    ('a0eebc99-9c0b-4ef8-bb6d-6bb9bd38' || lpad(d.device_num::TEXT, 4, '0'))::UUID,
    random() * 50 + 10,            -- Temp between 10 and 60
    random() * 80 + 10,            -- Humidity between 10 and 90
    random() * 100,                -- CPU usage %
    floor(random() * 1024 * 1024)  -- Bytes sent
FROM generate_series(
        NOW() - INTERVAL '90 days',
        NOW(),
        INTERVAL '1 minute'
     ) AS t(time)
CROSS JOIN generate_series(0, 99) AS d(device_num);
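As a quick sanity check on the generated volume (exact counts vary slightly with when you run the series):
-- Verify row count and device count
SELECT count(*) AS total_rows, count(DISTINCT device_id) AS devices
FROM device_metrics;
-- Expect roughly 13 million rows across 100 devices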
The Naive Approach: Independent Aggregates
The straightforward, but inefficient, method is to create an hourly and a daily aggregate, both reading from device_metrics.
Hourly Aggregate (from raw data):
CREATE MATERIALIZED VIEW device_metrics_summary_hourly
WITH (timescaledb.continuous)
AS
SELECT
time_bucket('1 hour', time) AS bucket,
device_id,
COUNT(*) AS samples,
AVG(temperature) AS avg_temp,
MAX(temperature) AS max_temp,
AVG(cpu_usage) AS avg_cpu,
SUM(bytes_sent) AS total_bytes
FROM
device_metrics
GROUP BY bucket, device_id
WITH NO DATA;
-- Add a policy to refresh this aggregate automatically
SELECT add_continuous_aggregate_policy(
'device_metrics_summary_hourly',
start_offset => INTERVAL '3 hours',
end_offset => INTERVAL '1 hour',
schedule_interval => INTERVAL '30 minutes'
);
Daily Aggregate (also from raw data):
CREATE MATERIALIZED VIEW device_metrics_summary_daily_naive
WITH (timescaledb.continuous)
AS
SELECT
time_bucket('1 day', time) AS bucket,
device_id,
COUNT(*) AS samples,
AVG(temperature) AS avg_temp,
MAX(temperature) AS max_temp,
AVG(cpu_usage) AS avg_cpu,
SUM(bytes_sent) AS total_bytes
FROM
device_metrics
GROUP BY bucket, device_id
WITH NO DATA;
-- Add its own refresh policy
SELECT add_continuous_aggregate_policy(
'device_metrics_summary_daily_naive',
start_offset => INTERVAL '2 days',
end_offset => INTERVAL '1 day',
schedule_interval => INTERVAL '1 hour'
);
Let's manually refresh them to populate data for our tests. Note that continuous aggregates cannot be refreshed with a plain REFRESH MATERIALIZED VIEW in TimescaleDB 2.x; use refresh_continuous_aggregate, where NULL bounds mean the entire time range:
CALL refresh_continuous_aggregate('device_metrics_summary_hourly', NULL, NULL);
CALL refresh_continuous_aggregate('device_metrics_summary_daily_naive', NULL, NULL);
The problem here is that when the policy for device_metrics_summary_daily_naive runs, it must scan the last day's worth of raw data from the device_metrics hypertable, even though that data has already been scanned and processed by the hourly aggregate's refresh policy. This is wasted work.
The Advanced Pattern: Hierarchical Continuous Aggregates
The performant solution is to build the daily aggregate from the hourly one. This creates a dependency chain: raw data -> hourly cagg -> daily cagg. (Creating a cagg on top of another cagg requires TimescaleDB 2.9 or later.)
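Once both levels exist (we build the daily one below), you can verify the setup through TimescaleDB's information views. A minimal sketch; each cagg is backed by its own internal materialization hypertable:
-- List all caggs and their backing materialization hypertables
SELECT view_name, materialization_hypertable_name
FROM timescaledb_information.continuous_aggregates;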
Step 1: The Base Aggregate (Hourly)
We already have our device_metrics_summary_hourly. This remains our base, reading directly from the raw hypertable.
Step 2: The Hierarchical Aggregate (Daily from Hourly)
Now, we define our new daily aggregate to source its data from device_metrics_summary_hourly. Notice how we re-aggregate the already aggregated data. For this to work correctly, your aggregate functions must be re-aggregatable: SUM, COUNT, MAX, and MIN are trivial; AVG is more complex, and we must compute it from the SUM and COUNT stored in the lower-level aggregate.
-- We need SUM(samples) and SUM(avg_temp * samples) to correctly calculate the new AVG
-- To do this, we modify our hourly aggregate to store the necessary components.
-- Let's drop the old one and create a better one
DROP MATERIALIZED VIEW device_metrics_summary_hourly;
CREATE MATERIALIZED VIEW device_metrics_summary_hourly
WITH (timescaledb.continuous)
AS
SELECT
time_bucket('1 hour', time) AS bucket,
device_id,
COUNT(*) AS samples,
-- To make AVG re-aggregatable, we store the sum and count
SUM(temperature) AS sum_temp,
MAX(temperature) AS max_temp,
SUM(cpu_usage) AS sum_cpu,
SUM(bytes_sent) AS total_bytes
FROM
device_metrics
GROUP BY bucket, device_id
WITH NO DATA;
-- Re-add the policy and refresh
SELECT add_continuous_aggregate_policy(
'device_metrics_summary_hourly',
start_offset => INTERVAL '3 hours',
end_offset => INTERVAL '1 hour',
schedule_interval => INTERVAL '30 minutes'
);
CALL refresh_continuous_aggregate('device_metrics_summary_hourly', NULL, NULL);
-- Now, create the hierarchical daily aggregate
CREATE MATERIALIZED VIEW device_metrics_summary_daily_hierarchical
WITH (timescaledb.continuous)
AS
SELECT
time_bucket('1 day', bucket) AS bucket,
device_id,
SUM(samples) AS samples,
-- Re-calculate the average correctly from the partial sums
SUM(sum_temp) / SUM(samples) AS avg_temp,
MAX(max_temp) AS max_temp,
SUM(sum_cpu) / SUM(samples) AS avg_cpu,
SUM(total_bytes) AS total_bytes
FROM
device_metrics_summary_hourly
GROUP BY 1, device_id -- group by the daily bucket by position; a bare 'bucket' here would resolve to the hourly input column
WITH NO DATA;
SELECT add_continuous_aggregate_policy(
'device_metrics_summary_daily_hierarchical',
start_offset => INTERVAL '2 days',
end_offset => INTERVAL '1 day',
schedule_interval => INTERVAL '1 hour'
);
CALL refresh_continuous_aggregate('device_metrics_summary_daily_hierarchical', NULL, NULL);
This is the core of the pattern. The device_metrics_summary_daily_hierarchical refresh policy now scans a much smaller dataset: the hourly rollups. Instead of scanning 24 * 60 = 1440 raw data points per device per day, it only needs to scan 24 rows from the hourly aggregate.
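To make the difference concrete, here is a quick comparison of what a daily refresh must read for one device-day (the device_id is one of the generated example IDs; counts are approximate):
SELECT
    (SELECT count(*) FROM device_metrics
     WHERE device_id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380001'
       AND time >= NOW() - INTERVAL '1 day') AS raw_rows,      -- roughly 1440
    (SELECT count(*) FROM device_metrics_summary_hourly
     WHERE device_id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380001'
       AND bucket >= NOW() - INTERVAL '1 day') AS hourly_rows; -- roughly 24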
Performance Benchmark: Querying and Refreshing
Let's prove the performance gains. We will use EXPLAIN (ANALYZE, BUFFERS) to compare querying for one device's average temperature over the last 80 days.
Query 1: Against the Raw Hypertable (The Anti-Pattern)
EXPLAIN (ANALYZE, BUFFERS) SELECT
time_bucket('1 day', time) AS bucket,
AVG(temperature)
FROM device_metrics
WHERE device_id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380001' -- An example device_id
AND time > NOW() - INTERVAL '80 days'
GROUP BY bucket
ORDER BY bucket;
Typical Result: The query plan will show an append across every chunk of device_metrics in the 80-day window; even filtered to one device, it must read on the order of 100,000 raw rows (80 days * 1440 samples per day). It will be slow, with high buffer reads.
-- Planning Time: ~50ms
-- Execution Time: ~3500ms
-- Buffers: shared hit=... read=... (very high)
Query 2: Against the Naive Daily Aggregate
EXPLAIN (ANALYZE, BUFFERS) SELECT
bucket,
avg_temp
FROM device_metrics_summary_daily_naive
WHERE device_id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380001'
AND bucket > NOW() - INTERVAL '80 days'
ORDER BY bucket;
Typical Result: This will be much faster. The plan will scan the cagg's internal materialized hypertable.
-- Planning Time: ~5ms
-- Execution Time: ~50ms
-- Buffers: shared hit=... (much lower)
Query 3: Against the Hierarchical Daily Aggregate
EXPLAIN (ANALYZE, BUFFERS) SELECT
bucket,
avg_temp
FROM device_metrics_summary_daily_hierarchical
WHERE device_id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380001'
AND bucket > NOW() - INTERVAL '80 days'
ORDER BY bucket;
Typical Result: The query performance will be nearly identical to the naive cagg. The real win is not in query speed (both are fast), but in the refresh overhead.
To analyze refresh performance, you would need to monitor the database during policy execution. The refresh for device_metrics_summary_daily_naive would show significantly higher I/O and CPU usage than the refresh for device_metrics_summary_daily_hierarchical, as the latter processes a tiny fraction of the data.
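One low-effort way to compare the two refresh jobs is TimescaleDB's job-statistics views. A minimal sketch (for cagg policies, hypertable_name refers to the cagg's internal materialization hypertable):
-- Compare run durations of the continuous aggregate refresh jobs
SELECT j.hypertable_name, s.last_run_duration, s.total_runs, s.total_failures
FROM timescaledb_information.jobs AS j
JOIN timescaledb_information.job_stats AS s USING (job_id)
WHERE j.proc_name = 'policy_refresh_continuous_aggregate';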
Edge Case 1: Handling Late-Arriving Data
In distributed systems, data can arrive out of order or late. How does our hierarchy handle this?
TimescaleDB's cagg mechanism is designed for this. When a late data point is inserted into device_metrics, TimescaleDB identifies which time bucket(s) in the continuous aggregate(s) are now stale.
Let's insert a point from two days ago:
INSERT INTO device_metrics (time, device_id, temperature) VALUES
(NOW() - INTERVAL '2 days', 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380001', 99.9);
1. The stale bucket in device_metrics_summary_hourly is marked for re-computation.
2. The next time the refresh policy for the hourly cagg runs, it will re-calculate that specific bucket, incorporating the late data.
3. That refresh, in turn, invalidates the corresponding daily bucket in device_metrics_summary_daily_hierarchical.
4. On its next scheduled run, the daily cagg's refresh policy will re-read the updated hourly bucket and update its own state.
This cascading invalidation and refresh happens automatically, ensuring eventual consistency throughout the hierarchy. One important caveat: a refresh policy only materializes buckets inside its [start_offset, end_offset) window. Our hourly policy's start_offset is 3 hours, so a point arriving two days late is recorded as stale but will not be re-materialized by the policy itself; you must either widen start_offset or run a manual refresh over the affected window.
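A minimal sketch of that manual repair, refreshing both levels over a window that covers the late point (each CALL must run outside an explicit transaction block):
CALL refresh_continuous_aggregate(
    'device_metrics_summary_hourly',
    NOW() - INTERVAL '3 days',
    NOW() - INTERVAL '1 day'
);
CALL refresh_continuous_aggregate(
    'device_metrics_summary_daily_hierarchical',
    NOW() - INTERVAL '3 days',
    NOW() - INTERVAL '1 day'
);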
Edge Case 2: Non-Re-aggregatable Functions (Percentiles)
What if you need to calculate a median or a 95th percentile (P95)? These are not naively re-aggregatable: you cannot calculate the P95 of a day from the P95 of each hour.
This is where the timescaledb-toolkit extension becomes essential. It provides advanced aggregate functions, including percentile-approximation algorithms like tdigest and uddsketch, whose partial states can be combined.
First, install the extension: CREATE EXTENSION IF NOT EXISTS timescaledb_toolkit;
Now, let's modify our aggregates to track the P95 of CPU usage.
Modified Hourly Aggregate with percentile_agg:
-- This requires dropping and recreating the views
DROP MATERIALIZED VIEW device_metrics_summary_daily_hierarchical;
DROP MATERIALIZED VIEW device_metrics_summary_hourly;
CREATE MATERIALIZED VIEW device_metrics_summary_hourly
WITH (timescaledb.continuous)
AS
SELECT
time_bucket('1 hour', time) AS bucket,
device_id,
COUNT(*) AS samples,
SUM(temperature) AS sum_temp,
MAX(temperature) AS max_temp,
-- Use percentile_agg to create an intermediate aggregate state
percentile_agg(cpu_usage) AS cpu_percentiles
FROM
device_metrics
GROUP BY bucket, device_id
WITH NO DATA;
CALL refresh_continuous_aggregate('device_metrics_summary_hourly', NULL, NULL);
The percentile_agg function produces a UddSketch: a compact statistical summary of the data distribution whose partial states can be merged.
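You can query the sketch at the hourly level directly as well; for example:
-- Extract an approximate P95 straight from the hourly sketches
SELECT bucket, device_id, approx_percentile(0.95, cpu_percentiles) AS p95_cpu
FROM device_metrics_summary_hourly
ORDER BY bucket DESC
LIMIT 5;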
Modified Hierarchical Aggregate that rolls up the percentiles:
CREATE MATERIALIZED VIEW device_metrics_summary_daily_hierarchical
WITH (timescaledb.continuous)
AS
SELECT
time_bucket('1 day', bucket) AS bucket,
device_id,
SUM(samples) AS samples,
SUM(sum_temp) / SUM(samples) AS avg_temp,
MAX(max_temp) AS max_temp,
-- Use rollup() to combine the percentile estimators from the hourly data
rollup(cpu_percentiles) AS cpu_percentiles
FROM
device_metrics_summary_hourly
GROUP BY 1, device_id -- again, group by the daily bucket by position to avoid the name clash with the hourly 'bucket' column
WITH NO DATA;
CALL refresh_continuous_aggregate('device_metrics_summary_daily_hierarchical', NULL, NULL);
Now, when you query the daily aggregate, you can extract the final percentile value using the approx_percentile function.
-- Get the approximate P95 CPU usage for each day
SELECT
bucket,
device_id,
approx_percentile(0.95, cpu_percentiles) AS p95_cpu_usage
FROM device_metrics_summary_daily_hierarchical
WHERE device_id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380001'
ORDER BY bucket DESC
LIMIT 10;
This powerful technique allows you to perform sophisticated statistical rollups across a hierarchy, which is impossible with standard SQL AVG/SUM logic for percentiles.
Edge Case 3: Schema Evolution and Backfilling
A dreaded production scenario: you need to add a new metric to your aggregates (e.g., AVG(humidity)). There is no ALTER MATERIALIZED VIEW for caggs. The process is destructive and requires careful planning:
1. Drop the caggs in reverse dependency order (the daily cagg first, then the hourly cagg).
2. Recreate each view with the new columns (for a re-aggregatable average, store SUM(humidity) alongside the existing sample count).
3. Re-run add_continuous_aggregate_policy for each cagg.
4. Backfill deliberately: a full refresh over a large, historical dataset can take hours or days and lock resources. A better strategy is to backfill in controlled batches by calling refresh_continuous_aggregate with explicit window_start and window_end parameters in a script, processing historical data chunk by chunk during off-peak hours (a sketch follows below).
This highlights the importance of designing your aggregates carefully upfront. For mission-critical systems, a blue-green deployment strategy for aggregates (create new ones, switch application queries, then drop the old ones) might be necessary to avoid downtime or performance degradation.
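A minimal sketch of that batched backfill; the dates are illustrative, and each CALL must be issued as its own statement outside an explicit transaction:
-- Backfill the hourly level one week at a time, then the daily level
CALL refresh_continuous_aggregate('device_metrics_summary_hourly',
    '2024-01-01', '2024-01-08');
CALL refresh_continuous_aggregate('device_metrics_summary_hourly',
    '2024-01-08', '2024-01-15');
-- ...continue week by week to the present...
CALL refresh_continuous_aggregate('device_metrics_summary_daily_hierarchical',
    '2024-01-01', '2024-01-15');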
Conclusion
Hierarchical continuous aggregates are not just a clever trick; they are a foundational pattern for building scalable, performant time-series analytical systems in TimescaleDB. By moving beyond naive, independent rollups and creating an efficient, tiered aggregation pipeline, you can:
- Slash refresh I/O and CPU, since each level reads the much smaller level below it instead of re-scanning raw data.
- Serve hourly, daily, and monthly dashboard queries with uniformly fast response times.
- Roll up advanced statistics such as percentiles across levels using timescaledb-toolkit sketches.
While the pattern requires more careful planning, especially regarding re-aggregatable function design and schema evolution, the performance benefits for any non-trivial time-series workload are immense. For senior engineers responsible for the stability and performance of data-intensive platforms, mastering this technique is a critical skill.