Optimizing High-Cardinality Ingestion in TimescaleDB

Goh Ling Yong

The Inevitable Cardinality Crisis in Time-Series Workloads

In any mature time-series system, whether for IoT device monitoring, application performance metrics, or financial data, the problem of high cardinality is not a matter of if, but when. Cardinality—the number of unique series in your dataset—is the silent performance killer. A series is typically defined by a unique combination of a metric name and its associated labels or tags (e.g., cpu_usage for {host='server-a', region='us-east-1', az='a'}). As systems scale, the number of devices, containers, user sessions, or financial instruments explodes, leading to millions or even billions of unique time series.

For senior engineers managing these systems, the symptoms are painfully familiar: ingestion rates plummet, query latencies for even simple aggregations skyrocket, and index sizes balloon, consuming vast amounts of memory and disk I/O. Standard PostgreSQL indexing strategies, while powerful, begin to break down under this pressure.

This article bypasses introductory concepts. We assume you understand what a hypertable is and have already implemented basic time-based indexing. We will dissect the why behind high-cardinality performance degradation at the storage and index level within TimescaleDB and then implement a multi-layered, production-grade solution involving strategic data layout, optimized compression, and query-aware aggregation. Our goal is to transform a struggling high-cardinality hypertable into a highly performant and storage-efficient system.

Anatomy of the Problem: Why High Cardinality Cripples B-Tree Indexes

TimescaleDB hypertables are partitioned into chunks based on time. Within each chunk, indexes are standard PostgreSQL B-trees. A typical index for a metrics table might be (device_id, time DESC).

When cardinality is low, this works beautifully. The device_id part of the index has few unique values, making the B-tree shallow and wide. Traversing it is fast. However, with millions of device_ids, the index structure itself becomes a massive dataset.

  • Index Bloat and Memory Pressure: Each unique device_id requires an entry in the B-tree. With high cardinality, the index can become larger than the data it points to. These index pages must be loaded into memory for efficient querying. A massive index means more pages to load, leading to constant cache misses and thrashing, forcing the database to read from disk far more often.
  • Ingestion Bottlenecks: During ingestion, for every new data point, PostgreSQL must traverse the (device_id, time DESC) index to find the correct leaf page to insert the new entry. With a massive, fragmented index, this traversal becomes a significant bottleneck, directly limiting write throughput.
  • Query Planning Overhead: When you run a query like SELECT avg(value) FROM metrics WHERE device_id = 'foo' AND time > now() - '1 day', the query planner can efficiently use the index. But for a query like SELECT device_id, avg(value) FROM metrics WHERE time > now() - '1 day' GROUP BY device_id, the planner might need to perform a massive index scan or a full sequential scan over recent chunks, both of which are disastrously slow with high cardinality.
Let's quantify this with a concrete, reproducible scenario.

    Baseline Schema and Performance Simulation

    We'll model an IoT sensor data workload. Each sensor has a unique UUID (device_id) and reports temperature and humidity. We'll simulate 1 million unique devices.

    1. Schema Setup

    sql
    -- Ensure the TimescaleDB extension is available
    CREATE EXTENSION IF NOT EXISTS timescaledb;
    
    -- Drop tables if they exist to ensure a clean slate
    DROP TABLE IF EXISTS device_metrics;
    
    -- Create the main table for our metrics
    CREATE TABLE device_metrics (
        time TIMESTAMPTZ NOT NULL,
        device_id UUID NOT NULL,
        temperature DOUBLE PRECISION,
        humidity DOUBLE PRECISION
    );
    
    -- Create the hypertable, partitioned by time
    -- This is a standard, unoptimized setup
    SELECT create_hypertable('device_metrics', 'time', chunk_time_interval => INTERVAL '1 day');
    
    -- Create a standard composite index for querying by device and time
    CREATE INDEX idx_device_metrics_device_id_time ON device_metrics (device_id, time DESC);

    2. Ingestion Simulation (Python)

    This script uses psycopg2 to simulate a high-cardinality ingest workload. We'll generate data for 1,000,000 unique devices over a 7-day period.

    python
    import psycopg2
    import uuid
    import random
    from datetime import datetime, timedelta
    
    # --- Configuration ---
    DB_NAME = "your_db"
    DB_USER = "your_user"
    DB_PASS = "your_pass"
    DB_HOST = "localhost"
    DB_PORT = "5432"
    
    NUM_DEVICES = 1_000_000
    DAYS_OF_DATA = 7
    RECORDS_PER_DAY_PER_DEVICE = 24 # Hourly readings
    BATCH_SIZE = 10000
    
    # --- Data Generation ---
    def generate_device_ids(n):
        print(f"Generating {n} device UUIDs...")
        return [uuid.uuid4() for _ in range(n)]
    
    def generate_data_batch(device_ids, start_time, num_records):
        data = []
        for _ in range(num_records):
            ts = start_time + timedelta(minutes=random.randint(0, 59))
            device_id = random.choice(device_ids)
            temp = 20.0 + random.uniform(-5.0, 5.0)
            hum = 50.0 + random.uniform(-10.0, 10.0)
            data.append((ts, device_id, temp, hum))
        return data
    
    # --- Main Ingestion Logic ---
    def main():
        conn_string = f"dbname='{DB_NAME}' user='{DB_USER}' password='{DB_PASS}' host='{DB_HOST}' port='{DB_PORT}'"
        try:
            conn = psycopg2.connect(conn_string)
            cursor = conn.cursor()
            print("Connection successful.")
    
            devices = generate_device_ids(NUM_DEVICES)
            total_records = NUM_DEVICES * DAYS_OF_DATA * RECORDS_PER_DAY_PER_DEVICE
            print(f"Preparing to ingest approximately {total_records:,} records...")
    
            start_date = datetime.utcnow() - timedelta(days=DAYS_OF_DATA)
            records_ingested = 0
    
            for day in range(DAYS_OF_DATA):
                current_day = start_date + timedelta(days=day)
                for hour in range(24):
                    current_hour = current_day + timedelta(hours=hour)
                    
                    # Simulate data for all devices for this hour
                    batch = []
                    for device_uuid in devices:
                        ts = current_hour + timedelta(minutes=random.randint(0, 59), seconds=random.randint(0, 59))
                        temp = 20.0 + random.uniform(-5.0, 5.0)
                        hum = 50.0 + random.uniform(-10.0, 10.0)
                        batch.append(f"('{ts.isoformat()}', '{device_uuid}', {temp}, {hum})")
    
                        if len(batch) >= BATCH_SIZE:
                            values = ",".join(batch)
                            insert_query = f"INSERT INTO device_metrics (time, device_id, temperature, humidity) VALUES {values};"
                            cursor.execute(insert_query)
                            conn.commit()
                            records_ingested += len(batch)
                            print(f"Ingested {records_ingested}/{total_records} records...", end='\r')
                            batch = []
                    
                    # Ingest any remaining records in the batch
                    if batch:
                        values = ",".join(batch)
                        insert_query = f"INSERT INTO device_metrics (time, device_id, temperature, humidity) VALUES {values};"
                        cursor.execute(insert_query)
                        conn.commit()
                        records_ingested += len(batch)
                        print(f"Ingested {records_ingested}/{total_records} records...", end='\r')
    
            print("\nIngestion complete.")
    
        except psycopg2.Error as e:
            print(f"Database error: {e}")
        finally:
            if 'cursor' in locals() and cursor:
                cursor.close()
            if 'conn' in locals() and conn:
                conn.close()
                print("Connection closed.")
    
    if __name__ == "__main__":
        main()
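
    A note on ingest speed: the string-assembled multi-row INSERT above works, but it is not the fastest path into PostgreSQL. If you are benchmarking ingestion, a batch helper built on psycopg2.extras.execute_values (or COPY) is usually noticeably faster and sidesteps manual value quoting. A minimal sketch, assuming the same connection and batches of (time, device_id, temperature, humidity) tuples:

    python
    from psycopg2.extras import execute_values, register_uuid

    # Allow passing uuid.UUID objects directly as query parameters
    register_uuid()

    def insert_batch(cursor, rows):
        """Insert a batch of (time, device_id, temperature, humidity) tuples.

        execute_values expands the VALUES list for us, so no manual string
        formatting (or quoting pitfalls) is needed.
        """
        execute_values(
            cursor,
            "INSERT INTO device_metrics (time, device_id, temperature, humidity) VALUES %s",
            rows,
            page_size=1000,
        )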

    After running this (which will take some time), we can analyze the baseline by finding the average temperature for the 10 noisiest devices over the last 3 days.

    sql
    EXPLAIN ANALYZE
    SELECT
        device_id,
        avg(temperature) AS avg_temp
    FROM
        device_metrics
    WHERE
        time > now() - INTERVAL '3 days'
    GROUP BY
        device_id
    ORDER BY
        count(*) DESC
    LIMIT 10;

    On a reasonably powerful machine, you'll see planning and execution times in the range of several seconds to over a minute. The EXPLAIN plan will likely show a slow GroupAggregate with a large number of groups, preceded by a parallel sequential scan or a very large bitmap index scan across multiple chunks. This is our unacceptable baseline.
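
    To see how much of that cost is the index itself, compare the index footprint to the heap. A quick check using TimescaleDB's informational functions (function names from the TimescaleDB 2.x API; the index name is the one created above):

    sql
    -- Heap, index, and total size for the hypertable
    SELECT * FROM hypertable_detailed_size('device_metrics');

    -- Size of the (device_id, time DESC) index on its own
    SELECT pg_size_pretty(hypertable_index_size('idx_device_metrics_device_id_time'));

    With a million distinct device_ids, the composite index often accounts for a large fraction of the total footprint, in line with the index-bloat problem described earlier.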

    Strategy 1: Continuous Aggregates for Cardinality Reduction

    Our first line of defense is not to operate on the raw, high-cardinality data for analytical queries. Instead, we can pre-aggregate the data into a lower-granularity, lower-cardinality materialized view. This is the primary use case for TimescaleDB's Continuous Aggregates (caggs).

    Let's create an hourly aggregate of our metrics. In a deployment recording at minutely or finer granularity, this cuts the rows an analytical query must touch by a factor of 60 or more; even with our hourly simulation, queries now hit a much smaller, pre-computed dataset instead of scanning raw chunks.

    sql
    -- Create the continuous aggregate for hourly rollups
    CREATE MATERIALIZED VIEW device_metrics_hourly
    WITH (timescaledb.continuous)
    AS
    SELECT
        time_bucket('1 hour', time) AS bucket,
        device_id,
        -- Keep track of key stats. avg() is calculated from sum/count
        COUNT(temperature) as temp_count,
        SUM(temperature) as temp_sum,
        MAX(temperature) as temp_max,
        MIN(temperature) as temp_min,
        -- Do the same for humidity
        COUNT(humidity) as hum_count,
        SUM(humidity) as hum_sum
    FROM
        device_metrics
    GROUP BY bucket, device_id;
    
    -- Add a policy to automatically refresh the aggregate
    -- This will keep it up-to-date as new data arrives
    SELECT add_continuous_aggregate_policy('device_metrics_hourly',
        start_offset => INTERVAL '3 hours',
        end_offset => INTERVAL '1 hour',
        schedule_interval => INTERVAL '30 minutes');
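
    One operational note: if you created the view WITH NO DATA (common for very large tables, to avoid a long blocking materialization at creation time), materialize the historical range once before benchmarking; otherwise creation has already populated it. The one-off refresh looks like this (window bounds simply match our 7-day dataset):

    sql
    -- Materialize the last 7 days once; the policy keeps it current afterwards
    CALL refresh_continuous_aggregate('device_metrics_hourly',
        now() - INTERVAL '7 days',
        now() - INTERVAL '1 hour');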

    Now, we rewrite our query to use this materialized view.

    sql
    EXPLAIN ANALYZE
    SELECT
        device_id,
        sum(temp_sum) / sum(temp_count) AS avg_temp
    FROM
        device_metrics_hourly
    WHERE
        bucket > now() - INTERVAL '3 days'
    GROUP BY
        device_id
    ORDER BY
        sum(temp_count) DESC
    LIMIT 10;

    Performance Impact:

    The query execution time should now be in the low hundreds of milliseconds. This is a 10-100x improvement. The EXPLAIN plan will show a scan on the much smaller materialized hypertable (_timescaledb_internal._materialized_hypertable_...).

    Advanced Consideration: timescaledb.materialized_only

    By default, continuous aggregates use real-time aggregation: a query against the cagg transparently combines the materialized buckets with not-yet-materialized raw data, which can reintroduce expensive scans over recent, high-cardinality chunks. Setting timescaledb.materialized_only = true on the view makes queries read only the materialized data, trading a little freshness for predictable latency.

    A separate production issue is that developers may forget to query the cagg and hit the raw hypertable directly. TimescaleDB does not reroute such queries automatically, so for strict performance guarantees enforce the separation with standard PostgreSQL permissions: grant SELECT only on the device_metrics_hourly view, not on the device_metrics table, for analytical roles.
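
    Both knobs are one-liners; the analytics_reader role here is illustrative:

    sql
    -- Serve only materialized buckets from the cagg (no real-time union with raw data)
    ALTER MATERIALIZED VIEW device_metrics_hourly SET (timescaledb.materialized_only = true);

    -- Lock analytical roles out of the raw hypertable entirely
    REVOKE SELECT ON device_metrics FROM analytics_reader;
    GRANT SELECT ON device_metrics_hourly TO analytics_reader;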

    While caggs are a powerful tool for query performance, they don't solve the core ingestion bottleneck or the storage bloat of the raw data. For that, we need to go deeper into physical storage layout.

    Strategy 2: Taming Indexes with Physical Data Co-location (Chunk Reordering)

    The default physical layout of data within a TimescaleDB chunk is insertion order, which for a live ingest stream is effectively time order. Data from different devices is therefore interleaved on disk as it arrives, and an index scan for a single device_id has to jump all over the chunk's pages, causing poor cache utilization.

    We can fundamentally change this layout with TimescaleDB's reorder_chunk function and its automated counterpart, add_reorder_policy. Reordering physically rewrites a chunk so that its rows are sorted according to an index, here (device_id, time DESC), much like PostgreSQL's CLUSTER but applied chunk by chunk. This co-locates all data points for a single device together within a chunk.

    Let's re-create our table with this optimization.

    sql
    -- This requires a fresh table
    DROP TABLE IF EXISTS device_metrics_optimized;

    CREATE TABLE device_metrics_optimized (
        time TIMESTAMPTZ NOT NULL,
        device_id UUID NOT NULL,
        temperature DOUBLE PRECISION,
        humidity DOUBLE PRECISION
    );

    SELECT create_hypertable('device_metrics_optimized', 'time',
        chunk_time_interval => INTERVAL '1 day',
        create_default_indexes => FALSE -- We will create indexes manually
    );

    -- The composite index used both for per-device queries and for reordering
    CREATE INDEX idx_opt_device_id_time ON device_metrics_optimized (device_id, time DESC);

    -- A supporting index for pure time-range scans
    CREATE INDEX idx_opt_time ON device_metrics_optimized (time DESC);

    -- Automatically rewrite completed chunks in (device_id, time DESC) order.
    -- The policy skips the newest, still-written chunk and reorders older chunks once.
    SELECT add_reorder_policy('device_metrics_optimized', 'idx_opt_device_id_time');

    -- Or reorder a specific chunk immediately (replace with your chunk name):
    -- SELECT reorder_chunk('_timescaledb_internal._hyper_2_3_chunk', 'idx_opt_device_id_time');

    After re-ingesting data into this new table and letting the reorder policy run (or reordering chunks manually), let's run a query that benefits directly from data locality: fetching all data for a single device.

    sql
    EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM device_metrics_optimized
    WHERE device_id = (SELECT device_id FROM device_metrics_optimized LIMIT 1) -- pick an arbitrary device
    AND time > now() - INTERVAL '7 days';

    Performance Impact:

    Compared to the same query on the original device_metrics table, this will be significantly faster. The EXPLAIN plan will show a much lower number of Buffers: shared hit=... read=.... This is because all the data for the target device_id is packed into a small, contiguous set of disk blocks, requiring far less I/O.

    This co-location strategy is the foundation for our next, and most impactful, optimization: compression.

    Strategy 3: Supercharging Compression with `segmentby`

    TimescaleDB's native compression is columnar, which is highly effective for time-series data. It uses specialized algorithms like Gorilla for floats, delta-of-delta for timestamps, and dictionary compression for repeating values. The effectiveness of these algorithms skyrockets when similar data is stored together.

    Our reordering strategy has already physically co-located data by device_id. Now we can tell the compression engine to treat each device_id as its own segment and to keep rows ordered by time within each segment. This is done with the compress_segmentby and compress_orderby options.

    sql
    -- Enable compression on our optimized table
    ALTER TABLE device_metrics_optimized SET (
        timescaledb.compress,
        timescaledb.compress_orderby = 'time DESC',
        timescaledb.compress_segmentby = 'device_id'
    );
    
    -- Add a policy to compress old chunks automatically
    -- We'll compress data older than 2 days
    SELECT add_compression_policy('device_metrics_optimized', compress_after => INTERVAL '2 days');

    Let's manually compress a chunk to see the effect immediately.

    sql
    -- Find a chunk to compress
    SELECT chunk_name FROM timescaledb_information.chunks 
    WHERE hypertable_name = 'device_metrics_optimized' LIMIT 1;
    
    -- Manually compress it (replace with your chunk name)
    -- SELECT compress_chunk('_timescaledb_internal._hyper_1_1_chunk');
    
    -- Check compression stats
    SELECT 
        hypertable_name, 
        before_compression_total_bytes,
        after_compression_total_bytes
    FROM hypertable_compression_stats('device_metrics_optimized');
    
    -- For more detail
    SELECT 
        chunk_name,
        compression_status,
        before_compression_total_bytes,
        after_compression_total_bytes
    FROM chunk_compression_stats('device_metrics_optimized');

    Performance and Storage Impact:

    You should see a dramatic compression ratio, often 10x to 20x or more. A table that was hundreds of gigabytes can shrink to tens of gigabytes.

    * Storage Savings: The most obvious benefit. Lower storage costs, smaller backup sizes.

    * Query Performance on Historical Data: This is the non-obvious win. When you query compressed data, TimescaleDB decompresses only the necessary columns and segments on the fly. Because we segmented by device_id, a query filtering by a specific device_id (WHERE device_id = 'foo') only needs to decompress the data for that single device. Furthermore, because the data is now so much smaller, more of it can fit in memory, leading to massive speedups for historical queries.
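
    For example, once older chunks are compressed, a per-device query like the following only has to decompress the segments belonging to that one device (the UUID is a placeholder):

    sql
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT time, temperature
    FROM device_metrics_optimized
    WHERE device_id = '00000000-0000-0000-0000-000000000000'  -- placeholder UUID
      AND time > now() - INTERVAL '30 days';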

    Edge Cases and Production Patterns

    Implementing these strategies in a real-world system requires handling several edge cases.

    1. Handling Out-of-Order and Late-Arriving Data

    This is the Achilles' heel of the optimized approach. The reordering and compression strategies assume data arrives in approximately time order. If a data point arrives for a chunk that is already compressed, the database effectively has to:

    a. Decompress the entire chunk.

    b. Insert the new data in its sorted position.

    c. Re-compress the entire chunk.

    This is a very expensive operation and can cripple ingestion if it happens frequently.
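
    If you control the backfill, you can pay this cost once per chunk rather than per insert by making the cycle explicit. A minimal sketch using TimescaleDB's chunk functions (the chunk name is a placeholder):

    sql
    -- Decompress the affected chunk, backfill, then compress it again
    SELECT decompress_chunk('_timescaledb_internal._hyper_2_10_chunk');

    -- ... run the late-data INSERTs for that time range here ...

    SELECT compress_chunk('_timescaledb_internal._hyper_2_10_chunk');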

    Solution Patterns:

    * Adjust Compression Policy: Don't compress data too aggressively. Set compress_after to a period longer than your expected data delay (e.g., INTERVAL '7 days' if data can be up to 3 days late).

    * Buffering Layer: Implement an upstream buffering layer (e.g., using Kafka, Pulsar, or a simple Redis queue) that holds late data and flushes it in batches. This minimizes the number of decompress-recompress cycles.

    * Separate Hypertables: For extreme cases, maintain two hypertables: one for real-time, uncompressed data and another for historical, compressed data. A periodic ETL job moves data from the former to the latter.
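
    A minimal sketch of that periodic move job, assuming a device_metrics_realtime and a device_metrics_history hypertable (names are illustrative):

    sql
    -- Atomically move rows older than 2 days into the historical table
    WITH moved AS (
        DELETE FROM device_metrics_realtime
        WHERE time < now() - INTERVAL '2 days'
        RETURNING *
    )
    INSERT INTO device_metrics_history
    SELECT * FROM moved;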

    2. Cardinality Management: JSONB vs. Dedicated Columns

    It's common to store tags in a JSONB column for flexibility. However, indexing a JSONB column for high-cardinality values is inefficient.

    * Low/Medium Cardinality Tags: For tags with a few dozen or hundred unique values (e.g., region, status), JSONB with a GIN index is acceptable.

    * High Cardinality Tags: Tags that define the series (device_id, container_id, trace_id) must be dedicated columns of an appropriate type (UUID, TEXT, BIGINT). This allows for efficient B-tree indexing, which is crucial for segmentby and orderby.
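
    A hybrid schema reflecting this split might look like the following sketch (table and column names are illustrative):

    sql
    CREATE TABLE app_metrics (
        time TIMESTAMPTZ NOT NULL,
        container_id UUID NOT NULL,  -- high-cardinality series key: dedicated column
        tags JSONB,                  -- low/medium-cardinality labels: region, status, ...
        value DOUBLE PRECISION
    );

    -- GIN index serves containment queries on the low-cardinality tags
    CREATE INDEX ON app_metrics USING GIN (tags);

    -- B-tree on the dedicated high-cardinality key for series lookups
    CREATE INDEX ON app_metrics (container_id, time DESC);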

    Pattern: Label Fingerprinting

    For metrics systems where a series is defined by a set of labels, storing many label columns can be unwieldy. A powerful pattern is to create a fingerprint of the label set at the application layer.

    python
    import hashlib
    
    def create_label_fingerprint(labels: dict) -> int:
        """Creates a stable 64-bit integer hash from a dictionary of labels."""
        sorted_labels = sorted(labels.items())
        label_string = "&".join([f"{k}={v}" for k, v in sorted_labels])
        # SHA-256 keeps the fingerprint stable across processes and languages;
        # a fast non-cryptographic hash (e.g., xxHash) also works if both ends agree.
        digest = hashlib.sha256(label_string.encode()).digest()
        # Interpret the first 8 bytes as a *signed* 64-bit value so it fits in a PostgreSQL BIGINT.
        return int.from_bytes(digest[:8], "big", signed=True)
    
    # Example
    labels = {'host': 'server-a', 'region': 'us-east-1', 'az': 'a'}
    fingerprint = create_label_fingerprint(labels)
    # Store the fingerprint as a BIGINT in the database

    Your hypertable would then have a label_fingerprint BIGINT column, which becomes your high-cardinality key for reordering and compress_segmentby. A separate lookup table maps fingerprints back to the full label set, as sketched below.
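
    A minimal version of that lookup table, upserting on ingest so repeated series are a no-op:

    sql
    CREATE TABLE label_sets (
        label_fingerprint BIGINT PRIMARY KEY,
        labels JSONB NOT NULL
    );

    -- On first sight of a series, register its labels
    -- (the two values are bind parameters supplied by the application, e.g. psycopg2 %s placeholders)
    INSERT INTO label_sets (label_fingerprint, labels)
    VALUES (%s, %s)
    ON CONFLICT (label_fingerprint) DO NOTHING;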

    Conclusion: A Multi-Layered Approach

    Tackling high cardinality in TimescaleDB is not about finding a single magic bullet. It's about a holistic approach that addresses the problem at every layer of the system:

  • Query Layer: Use Continuous Aggregates to serve most analytical queries from pre-computed, lower-cardinality rollups. This provides the single biggest boost to query performance.
  • Physical Storage Layer: Use chunk reordering (and compression's compress_orderby) to co-locate data for the same series on disk. This improves data locality, enhances cache efficiency, and is the critical enabler for effective compression.
  • Compression Layer: Implement an aggressive compression policy using compress_segmentby on your high-cardinality key. This dramatically reduces storage costs and boosts historical query performance by minimizing I/O.

    By combining these three advanced strategies, you can build a TimescaleDB deployment that not only withstands the pressure of millions of unique time series but thrives under it, delivering fast ingestion, low-latency queries, and efficient storage at scale.
