Advanced Time-Series Aggregation with TimescaleDB Continuous Aggregates
The Performance Cliff of Real-Time Time-Series Aggregation
As a senior engineer responsible for a system ingesting high-frequency time-series data—be it IoT metrics, financial ticks, or application performance monitoring—you inevitably hit a wall. Your primary data table, a PostgreSQL hypertable managed by TimescaleDB, has grown to several terabytes, containing billions, if not trillions, of rows. The business needs dashboards that display aggregated data: hourly averages, daily maximums, per-minute sums.
The initial, straightforward approach is a direct GROUP BY query on the raw data:
-- The query that brings your production database to its knees.
SELECT
time_bucket('1 hour', ts) as hour,
device_id,
AVG(cpu_usage) as avg_cpu,
MAX(memory_usage) as max_mem
FROM device_metrics
WHERE ts > NOW() - INTERVAL '7 days'
GROUP BY hour, device_id;
On a small dataset, this is instantaneous. On a 10TB hypertable, this query triggers a massive I/O operation, scanning millions or billions of rows across numerous chunks. It consumes significant CPU and memory resources, query latency skyrockets from milliseconds to minutes, and you risk starving the database of resources needed for its primary function: ingestion. This is not a scalable architecture.
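If you want to see where the time goes, running the same query under EXPLAIN makes the chunk-by-chunk scan explicit. An illustrative sketch; the exact plan shape depends on your TimescaleDB version, chunk count, and planner settings:

```sql
-- Expect scans (often parallel) across every chunk of device_metrics
-- that overlaps the 7-day window, followed by a large aggregation step.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    time_bucket('1 hour', ts) AS hour,
    device_id,
    AVG(cpu_usage) AS avg_cpu
FROM device_metrics
WHERE ts > NOW() - INTERVAL '7 days'
GROUP BY hour, device_id;
```

The buffer counts in the output are usually the smoking gun: the query reads nearly every page of every chunk in the window.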
The introductory solution is TimescaleDB's Continuous Aggregates (caggs). Most engineers stop after creating their first cagg. This article is for those who have done that and found it's not enough. We will go deeper, exploring the advanced, production-hardening patterns required to manage these systems at massive scale: Hierarchical Continuous Aggregates and Compression on Aggregates.
Prerequisite: A Scalable Hypertable Schema
Let's establish our working schema. We're monitoring a fleet of IoT devices, collecting CPU, memory, and disk I/O metrics every second.
-- Ensure the TimescaleDB extension is installed
CREATE EXTENSION IF NOT EXISTS timescaledb;
-- Supporting metadata table. Additional device attributes live here
-- to keep the metrics table lean and fast for writes. It must be
-- created first, or the foreign key below will fail.
CREATE TABLE devices (
    id UUID PRIMARY KEY,
    location TEXT NOT NULL,
    firmware_version TEXT NOT NULL
);
-- Main table for raw, high-cardinality time-series data
CREATE TABLE device_metrics (
    ts TIMESTAMPTZ NOT NULL,
    device_id UUID NOT NULL,
    cpu_usage DOUBLE PRECISION NOT NULL,
    memory_usage DOUBLE PRECISION NOT NULL,
    disk_io_ps DOUBLE PRECISION NOT NULL,
    CONSTRAINT fk_device
        FOREIGN KEY (device_id)
        REFERENCES devices (id)
);
-- Convert the table to a hypertable, partitioned by time.
-- A chunk interval of 1 day is reasonable for high-volume data.
SELECT create_hypertable('device_metrics', 'ts', chunk_time_interval => INTERVAL '1 day');
-- Create indexes for common query patterns
CREATE INDEX ON device_metrics (device_id, ts DESC);
CREATE INDEX ON device_metrics (ts DESC);
With this schema, ingesting millions of rows per minute is feasible. Querying it directly for aggregates, as we've established, is not.
Level 1: The Standard Continuous Aggregate
First, let's create a standard hourly continuous aggregate to solve our initial problem. This is the baseline from which we will build.
-- Create the continuous aggregate (materialized view)
CREATE MATERIALIZED VIEW device_metrics_hourly
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 hour', ts) AS hour,
device_id,
AVG(cpu_usage) as avg_cpu,
MAX(cpu_usage) as max_cpu,
MIN(cpu_usage) as min_cpu,
AVG(memory_usage) as avg_mem,
MAX(memory_usage) as max_mem,
MIN(memory_usage) as min_mem
FROM device_metrics
GROUP BY hour, device_id;
This DDL creates a materialized view that stores the aggregated data. However, it doesn't update itself. We need a refresh policy.
-- Add a policy to refresh the aggregate automatically
SELECT add_continuous_aggregate_policy('device_metrics_hourly',
start_offset => INTERVAL '3 hours',
end_offset => INTERVAL '1 hour',
schedule_interval => INTERVAL '30 minutes');
Let's break down this policy. It's critical to understand these parameters:
- schedule_interval => '30 minutes': the policy job runs every 30 minutes.
- end_offset => '1 hour': the job will not materialize data from the last hour. This is a crucial buffer for late-arriving data: if a metric from 59 minutes ago arrives, it will be included in the next run.
- start_offset => '3 hours': the job will refresh data up to 3 hours old. This means each run re-calculates the last 2 hours of aggregated data (3 hours minus 1 hour). This helps correct for out-of-order data but adds computational overhead.

Now, a query against device_metrics_hourly is orders of magnitude faster. EXPLAIN ANALYZE would show a scan on the much smaller materialized view hypertable instead of the raw data table.
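Concretely, the seven-day dashboard query from the introduction becomes a scan of the view (a sketch against the schema above; column names match the aggregate's definition):

```sql
-- Same question as before, but answered from pre-computed hourly rows
SELECT hour, device_id, avg_cpu, max_mem
FROM device_metrics_hourly
WHERE hour > NOW() - INTERVAL '7 days'
ORDER BY hour, device_id;
```

Instead of billions of raw rows, this touches at most (devices × 168 hours) materialized rows.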
But what happens when you need daily rollups? Or weekly? Creating a daily aggregate from the raw device_metrics table is still a monstrously expensive operation, even if run only once a day. This is where the standard approach fails at scale.
Level 2: Hierarchical Continuous Aggregates
The most powerful optimization for multi-level aggregation is to build aggregates from other aggregates. Instead of Raw Data -> Daily, we create a chain: Raw Data -> Hourly -> Daily. Each step in the chain processes a dramatically smaller dataset than the last. (Creating a continuous aggregate on top of another continuous aggregate requires TimescaleDB 2.9 or later.)
Step 1: Create the Hourly Base Aggregate (as before)
We already have our device_metrics_hourly view. This will be the foundation of our hierarchy.
Step 2: Create a Daily Aggregate from the Hourly Aggregate
Notice the FROM clause. We are selecting from device_metrics_hourly, not device_metrics.
-- NOTE: We are aggregating the pre-aggregated data.
-- We need to compute the daily AVG from the hourly AVGs.
-- This requires a weighted average. For MIN/MAX, it's simpler.
-- To do this correctly, our base aggregate needs to store the COUNT and SUM
-- to calculate weighted averages later.
-- Let's DROP and RECREATE our hourly aggregate with the necessary components.
DROP MATERIALIZED VIEW device_metrics_hourly;
CREATE MATERIALIZED VIEW device_metrics_hourly
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 hour', ts) AS hour,
device_id,
COUNT(*) as num_samples,
SUM(cpu_usage) as sum_cpu,
MAX(cpu_usage) as max_cpu,
MIN(cpu_usage) as min_cpu,
SUM(memory_usage) as sum_mem,
MAX(memory_usage) as max_mem,
MIN(memory_usage) as min_mem
FROM device_metrics
GROUP BY hour, device_id;
-- Re-add the policy
SELECT add_continuous_aggregate_policy('device_metrics_hourly',
start_offset => INTERVAL '3 hours',
end_offset => INTERVAL '1 hour',
schedule_interval => INTERVAL '30 minutes');
-- Now, create the daily aggregate FROM the hourly one.
CREATE MATERIALIZED VIEW device_metrics_daily
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 day', hour) AS day,
device_id,
SUM(sum_cpu) / SUM(num_samples) as avg_cpu, -- Correct weighted average
MAX(max_cpu) as max_cpu,
MIN(min_cpu) as min_cpu,
SUM(sum_mem) / SUM(num_samples) as avg_mem,
MAX(max_mem) as max_mem,
MIN(min_mem) as min_mem
FROM device_metrics_hourly
GROUP BY day, device_id;
This is a critical architectural detail. A simple AVG(avg_cpu) over the hourly rows would be mathematically incorrect whenever hours contain different numbers of samples. You must carry forward the components (SUM, COUNT) to calculate accurate downstream averages.
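To see why, consider two hours with different sample counts. This standalone sketch runs in plain PostgreSQL, using made-up values for illustration:

```sql
-- Hour 1: 10 samples summing to 900 (hourly avg 90).
-- Hour 2:  2 samples summing to  60 (hourly avg 30).
-- True average over all 12 samples: 960 / 12 = 80.
WITH hourly (num_samples, sum_cpu) AS (
    VALUES (10, 900.0), (2, 60.0)
)
SELECT
    SUM(sum_cpu) / SUM(num_samples) AS weighted_avg, -- 80: correct
    AVG(sum_cpu / num_samples)      AS avg_of_avgs   -- 60: wrong
FROM hourly;
```

The naive average of averages weights the 2-sample hour equally with the 10-sample hour, skewing the result.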
Now, let's add a policy for the daily aggregate.
SELECT add_continuous_aggregate_policy('device_metrics_daily',
start_offset => INTERVAL '2 days',
end_offset => INTERVAL '1 day',
schedule_interval => INTERVAL '1 hour');
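The first scheduled run will populate the view. If you want the daily aggregate available immediately rather than waiting for that run, you can trigger one refresh by hand (NULL bounds cover the aggregate's entire time range; run this outside an explicit transaction):

```sql
-- One-off manual refresh over the whole range of the underlying data
CALL refresh_continuous_aggregate('device_metrics_daily', NULL, NULL);
```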
Performance Impact Analysis:
Let's assume each of 1,000 devices reports 1,000 samples per second.

- Raw ingest: 1,000 devices × 1,000 samples/second × 86,400 seconds ≈ 86.4 billion rows/day.
- A daily aggregate built directly on the raw table would have to scan those 86.4 billion rows.
- The hourly aggregate contains only 1,000 devices × 24 hours = 24,000 rows/day.
- The hierarchical daily aggregate therefore reads just 24,000 rows from the hourly aggregate. This is a reduction in data processed by a factor of over 3.5 million. The resource consumption of the daily aggregation job drops from catastrophic to trivial.
Level 3: Compression Policies on Aggregates
Over years, even your aggregated data can become enormous. A daily aggregate for 10,000 devices over 5 years is 10,000 × 365 × 5 = 18,250,000 rows. While much smaller than the raw data, this still consumes expensive storage and can slow down queries that span long time ranges.
TimescaleDB's columnar compression is a perfect solution here. Most engineers apply it to the raw hypertable, but its true power is realized when applied to continuous aggregates.
Compressed data has trade-offs:

- Rows in compressed chunks are effectively immutable: direct UPDATEs and DELETEs are restricted. INSERTs are still possible but may be slower.

This immutability constraint is often perfectly acceptable for aggregated data older than a few days or weeks.
Implementing a Compression Policy
Let's apply compression to our device_metrics_daily aggregate. We want to compress data that is more than 7 days old.
-- First, enable compression on the materialized hypertable
ALTER MATERIALIZED VIEW device_metrics_daily SET (timescaledb.compress, timescaledb.compress_segmentby = 'device_id');
-- Add a policy to compress data older than 7 days
SELECT add_compression_policy('device_metrics_daily', compress_after => INTERVAL '7 days');
Dissecting compress_segmentby:
This is the most important compression parameter. compress_segmentby = 'device_id' tells TimescaleDB to group all data for a single device_id together within a compressed chunk. This dramatically improves query performance for queries that filter by device_id, which is our most common pattern:
-- This query will be extremely fast because TimescaleDB can decompress
-- only the segments related to the specified device_id.
SELECT * FROM device_metrics_daily WHERE device_id = 'some-uuid' AND day > NOW() - INTERVAL '1 year';
If you omitted compress_segmentby, TimescaleDB would have to decompress significantly more data to find the rows for that specific device.
Benchmarking Storage Savings
Let's visualize the impact.
| Data Type | Time Range | Uncompressed Size | Compressed Size | Reduction |
|---|---|---|---|---|
| Raw device_metrics | 1 Year | 300 TB | 30 TB | 90% |
| device_metrics_daily cagg | 1 Year | 500 GB | 45 GB | 91% |
By compressing the aggregate, we've reduced its long-term storage footprint from 500GB to just 45GB, making year-over-year analysis queries faster and cheaper.
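You can measure this on your own system rather than taking illustrative numbers on faith. A sketch, assuming TimescaleDB 2.x: for a continuous aggregate, the compression statistics belong to its materialization hypertable, whose name you look up first; the placeholder in the second query must be replaced with that name:

```sql
-- Step 1: find the materialization hypertable backing the cagg
SELECT materialization_hypertable_schema,
       materialization_hypertable_name
FROM timescaledb_information.continuous_aggregates
WHERE view_name = 'device_metrics_daily';

-- Step 2: report before/after sizes for that hypertable
-- (substitute the schema-qualified name returned by step 1)
SELECT pg_size_pretty(before_compression_total_bytes) AS before,
       pg_size_pretty(after_compression_total_bytes)  AS after
FROM hypertable_compression_stats('<materialization hypertable from step 1>');
```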
Advanced Edge Cases and Production Patterns
Implementing these patterns in a live production environment requires navigating several complex scenarios.
1. Historical Backfilling
When you first create a continuous aggregate on a hypertable with years of existing data, it is not automatically populated. You must backfill it manually. A naive, single refresh over the entire history will either fail or tie up your database for hours (and plain REFRESH MATERIALIZED VIEW is not supported on continuous aggregates at all).
The correct approach is a batched, manual refresh.
-- When creating the cagg, append WITH NO DATA to avoid an initial, blocking refresh
CREATE MATERIALIZED VIEW device_metrics_hourly
WITH (timescaledb.continuous) AS
...
WITH NO DATA;
-- Now, backfill in manageable windows.
-- This is a simplified sketch; a production script would add robust error
-- handling and logging. refresh_continuous_aggregate cannot run inside an
-- explicit transaction, so execute this in autocommit mode. If your
-- TimescaleDB/PostgreSQL version rejects the CALL from inside a DO block,
-- drive the same loop from a client-side script, one window per statement.
DO $$
DECLARE
    start_ts timestamptz := '2022-01-01 00:00:00+00';
    end_ts timestamptz := '2022-02-01 00:00:00+00';
    -- Alternatively, find the true beginning of the data:
    -- SELECT min(ts) INTO start_ts FROM device_metrics;
BEGIN
    WHILE start_ts < NOW() - INTERVAL '1 day' LOOP
        RAISE NOTICE 'Refreshing window: % to %', start_ts, end_ts;
        CALL refresh_continuous_aggregate('device_metrics_hourly', start_ts, end_ts);
        -- Commit between windows to release locks and save progress
        COMMIT;
        start_ts := end_ts;
        end_ts := end_ts + INTERVAL '1 month';
    END LOOP;
END $$;
Key considerations for backfilling:

- Choose a window size (1 month in the example) that your database can handle without performance degradation. Monitor I/O and CPU during the first run to tune it.

2. Schema Migrations and Altering Continuous Aggregates
You cannot simply run ALTER MATERIALIZED VIEW on a continuous aggregate to add a column. The process is destructive and requires careful planning to avoid downtime.
Scenario: We need to add avg_disk_io to our hourly aggregate.
The Zero-Downtime Migration Strategy:
1. Create a new aggregate with the new column under a versioned name (device_metrics_hourly_v2), deferring materialization with WITH NO DATA:

CREATE MATERIALIZED VIEW device_metrics_hourly_v2
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 hour', ts) AS hour,
device_id,
-- ... all old columns ...
AVG(disk_io_ps) as avg_disk_io -- The new column
FROM device_metrics
GROUP BY hour, device_id
WITH NO DATA;

2. Backfill device_metrics_hourly_v2 with historical data using the batched approach above, then attach a refresh policy:

SELECT add_continuous_aggregate_policy('device_metrics_hourly_v2', ...);

3. Cut application reads over to device_metrics_hourly_v2. The old aggregate device_metrics_hourly is still available for old code versions, ensuring a smooth transition.

4. Once no consumers remain on the old aggregate, decommission it:

-- First, remove the policy to stop it from running
SELECT remove_continuous_aggregate_policy('device_metrics_hourly');
-- Then, drop the view
DROP MATERIALIZED VIEW device_metrics_hourly;

5. Optionally, rename the new view to reclaim the original name:

ALTER MATERIALIZED VIEW device_metrics_hourly_v2 RENAME TO device_metrics_hourly;
This process is complex but provides a robust, zero-downtime path for evolving your aggregates as business requirements change.
3. Monitoring and Debugging
Continuous aggregates are not fire-and-forget. You must monitor their health.
TimescaleDB provides several useful views for this:
- timescaledb_information.continuous_aggregates: high-level information about your caggs.
- timescaledb_information.jobs: status of all TimescaleDB background jobs, including cagg policies.
- timescaledb_information.job_stats: detailed statistics for each job run, including successes, failures, and duration. This is your primary tool for debugging.

Example Debugging Query: Find failing cagg jobs.
SELECT
    js.job_id,
    j.application_name,
    js.last_run_started_at,
    js.last_run_status,
    js.last_run_duration,
    js.total_failures
FROM timescaledb_information.job_stats js
JOIN timescaledb_information.jobs j ON js.job_id = j.job_id
WHERE j.application_name LIKE 'Continuous Aggregate%'
  AND js.last_run_status = 'Failed'
ORDER BY js.last_run_started_at DESC
LIMIT 10;
-- Note: job_stats exposes run status and timing but not the error text itself;
-- on TimescaleDB 2.9+ the messages live in timescaledb_information.job_errors.
Common failure modes include data type mismatches from underlying schema changes, insufficient disk space, or statement timeouts on very large materialization windows.
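When a job does fail, you usually want the error message itself. On TimescaleDB 2.9 and later, per-run failures are recorded in a dedicated information view; a sketch (column names may vary slightly between releases):

```sql
-- Most recent background-job errors, newest first
SELECT job_id, proc_name, start_time, finish_time, sqlerrcode, err_message
FROM timescaledb_information.job_errors
ORDER BY start_time DESC
LIMIT 10;
```

Joining this on job_id with the failing-jobs query above pinpoints which cagg policy broke and why.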
Conclusion: From Brute Force to Surgical Precision
Managing time-series data at scale is an exercise in moving from brute-force computation to surgical pre-computation. A simple GROUP BY is the former; a multi-level, compressed hierarchical continuous aggregate is the latter.
By layering your aggregates, you transform computationally impossible queries into trivial lookups on progressively smaller datasets. By applying compression, you solve the long-term storage cost and performance degradation associated with retaining years of aggregated metrics.
These patterns are not configuration tweaks; they represent a fundamental architectural shift in how you handle time-series analytics. Mastering them requires a deep understanding of the underlying mechanics of materialization, refresh policies, and compression. For senior engineers building robust, scalable data platforms, this level of understanding is not optional—it's the difference between a system that works and a system that thrives under extreme load.