PostgreSQL LATERAL JOIN for Real-Time Top-N Per-Group Queries

Goh Ling Yong

The Ubiquitous Challenge: Efficient Top-N-Per-Group Retrieval

In any non-trivial data model, particularly those involving time-series, multi-tenancy, or event sourcing, the "top-N-per-group" query is a recurring and often performance-critical requirement. The business ask is simple: "Get the most recent sensor reading for every IoT device," or "Find the top 3 most expensive orders for each customer." The SQL implementation, however, is fraught with performance traps that can cripple an application at scale.

Senior engineers have typically encountered the standard textbook solutions: ROW_NUMBER() window functions or correlated subqueries. While functionally correct, these approaches often conceal catastrophic performance characteristics when applied to tables with millions or billions of rows. They frequently lead to full table scans, massive on-disk sorts, and query plans whose cost grows with the total number of rows rather than with the number of groups.

This article bypasses introductory explanations and dives directly into a production-grade, high-performance solution using a powerful and often underutilized feature in PostgreSQL: LATERAL JOIN. We will dissect why traditional methods fail under load and demonstrate, with concrete schemas, indexing strategies, and EXPLAIN ANALYZE outputs, how LATERAL JOIN provides a more efficient, scalable, and elegant solution.

Our Scenario: A Large-Scale IoT Sensor Data Platform

To ground our discussion in a realistic context, let's model a system that ingests sensor data from a large fleet of IoT devices. Our schema is straightforward but representative of many real-world applications:

sql
-- Table to store device metadata
CREATE TABLE devices (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    location TEXT NOT NULL,
    firmware_version TEXT NOT NULL,
    is_active BOOLEAN NOT NULL DEFAULT TRUE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Table to store time-series sensor readings
-- This table can grow to billions of rows
CREATE TABLE sensor_readings (
    id BIGSERIAL PRIMARY KEY,
    device_id UUID NOT NULL REFERENCES devices(id) ON DELETE CASCADE,
    timestamp TIMESTAMPTZ NOT NULL,
    temperature NUMERIC(5, 2),
    humidity NUMERIC(5, 2),
    -- other sensor metrics...
    metadata JSONB
);

-- CRITICAL: The index that will make or break performance
CREATE INDEX idx_sensor_readings_device_timestamp ON sensor_readings (device_id, timestamp DESC);

-- For a different query type, we might need another index
CREATE INDEX idx_sensor_readings_device_temperature ON sensor_readings (device_id, temperature DESC);

Our goal is to execute two common queries with sub-second latency, even with 10,000 devices and 500 million sensor_readings:

  • Get the single most recent reading for every active device.
  • Get the top 3 highest temperature readings for every active device in a specific location.

Let's first examine why the common approaches are inadequate.

    The Pitfalls of Traditional Approaches

    Before appreciating the elegance of LATERAL JOIN, we must understand the precise performance failures of its predecessors.

    Approach 1: The `ROW_NUMBER()` Window Function Trap

    A common solution found across developer forums is to use the ROW_NUMBER() window function. The logic is to partition the data by device_id, order it by timestamp, assign a rank, and then select only the rows where the rank is 1.

    sql
    -- DO NOT USE THIS QUERY ON LARGE-SCALE PRODUCTION SYSTEMS
    WITH ranked_readings AS (
        SELECT
            r.device_id,
            r.timestamp,
            r.temperature,
            ROW_NUMBER() OVER(PARTITION BY r.device_id ORDER BY r.timestamp DESC) as rn
        FROM
            sensor_readings r
    )
    SELECT
        d.id,
        d.location,
        rr.timestamp AS latest_timestamp,
        rr.temperature AS latest_temperature
    FROM
        devices d
    JOIN
        ranked_readings rr ON d.id = rr.device_id
    WHERE
        rr.rn = 1
        AND d.is_active = TRUE;

    Why this fails at scale:

The window function forces PostgreSQL to rank every row before any filtering can happen. Whether the Common Table Expression (CTE) is inlined (the default since PostgreSQL 12) or materialized, the planner must:

  • Scan the entire sensor_readings table (or a very large portion of it).
  • Perform a massive sort on (device_id, timestamp DESC) to satisfy the PARTITION BY and ORDER BY clauses of the window function.
  • Assign a row number to every single row in the result set.
  • Only then apply the WHERE rn = 1 filter.

On a table with 500 million rows, this translates to sorting 500 million tuples, an operation that will inevitably spill to disk and take minutes, if not hours, to complete. Even if the planner walks idx_sensor_readings_device_timestamp to avoid the explicit sort, it must still visit every index entry and number every row; it cannot simply jump to the first entry for each device.

    An EXPLAIN ANALYZE would reveal a plan with a WindowAgg node sitting on top of a Sort node, both with astronomical costs and execution times.

    Approach 2: The Deceptive Correlated Subquery

    Another common pattern is the correlated subquery, which feels more direct but hides a different performance demon.

    sql
    -- ALSO INEFFICIENT, AVOID THIS PATTERN
    SELECT
        d.id,
        d.location,
        (
            SELECT r.timestamp
            FROM sensor_readings r
            WHERE r.device_id = d.id
            ORDER BY r.timestamp DESC
            LIMIT 1
        ) AS latest_timestamp,
        (
            SELECT r.temperature
            FROM sensor_readings r
            WHERE r.device_id = d.id
            ORDER BY r.timestamp DESC
            LIMIT 1
        ) AS latest_temperature
    FROM
        devices d
    WHERE
        d.is_active = TRUE;

    Why this fails at scale:

    While this query can use the index for each subquery execution, it suffers from a classic N+1 problem. The outer query selects N devices. For each of these N devices, PostgreSQL must execute the subqueries. In our example, we have two subqueries, so it's actually a 2N+1 problem.

If you have 10,000 active devices, you are executing 20,001 plan executions: one outer scan plus 20,000 subplan invocations. While each individual lookup is fast thanks to the index, the overhead of executing thousands of subplans sequentially adds up significantly. Furthermore, the pattern becomes unwieldy and even less efficient as you retrieve more columns from sensor_readings, because each additional column requires its own subquery.
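A quick way to see this overhead on your own data is to inspect the plan. In EXPLAIN ANALYZE output, each correlated subquery appears as a SubPlan node whose loops count matches the number of outer rows — the N+1 pattern made visible:

```sql
-- Inspect the plan: each SubPlan node's "loops" count will match the
-- number of active devices, confirming one execution per outer row.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    d.id,
    (
        SELECT r.timestamp
        FROM sensor_readings r
        WHERE r.device_id = d.id
        ORDER BY r.timestamp DESC
        LIMIT 1
    ) AS latest_timestamp
FROM
    devices d
WHERE
    d.is_active = TRUE;
```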

    The `LATERAL JOIN` Solution: A Paradigm Shift

    LATERAL JOIN (often written as CROSS JOIN LATERAL) allows a subquery in the FROM clause to reference columns from preceding tables in the same FROM clause. This effectively creates a dependent, per-row join. Think of it as a foreach loop embedded within SQL, where the outer table provides the context for each iteration of the inner subquery.

    This capability is precisely what we need to solve our problem efficiently. We can iterate through each device and, for that specific device, run a highly optimized query to find its latest sensor reading.

    Implementation 1: Getting the Latest Reading Per Device

    Here is the correct, performant, and scalable way to get the latest reading for each active device:

    sql
    -- PRODUCTION-GRADE QUERY FOR LATEST-PER-GROUP
    SELECT
        d.id,
        d.location,
        latest_reading.timestamp,
        latest_reading.temperature
    FROM
        devices d
    CROSS JOIN LATERAL (
        SELECT
            r.timestamp,
            r.temperature
        FROM
            sensor_readings r
        WHERE
            r.device_id = d.id -- The 'lateral' reference is the key
        ORDER BY
            r.timestamp DESC
        LIMIT 1
    ) AS latest_reading
    WHERE
        d.is_active = TRUE;

    Dissecting the Query Plan and Performance:

    Let's analyze why this is so much faster. The PostgreSQL planner sees this as a Nested Loop join. For each row from the devices table that satisfies is_active = TRUE, it executes the inner subquery.

    Crucially, the inner subquery is WHERE r.device_id = d.id ... ORDER BY r.timestamp DESC LIMIT 1. With our composite index idx_sensor_readings_device_timestamp (device_id, timestamp DESC), PostgreSQL can satisfy this query with extreme efficiency:

  • It performs an index seek to the first key for the given device_id.
  • Because the index is already ordered by timestamp DESC, the very first tuple it finds is the one we want.
  • The LIMIT 1 clause lets the executor stop immediately after returning that first row.

This is an Index Only Scan or a very limited Index Scan: a single index entry read per device. Instead of scanning and sorting 500 million rows, we are performing 10,000 tiny, hyper-efficient index lookups. The cost is proportional to the number of devices, not the number of readings, which is the key to scalability.

    `EXPLAIN ANALYZE` Comparison

    Let's simulate the output for a moderately sized dataset.

    Window Function Plan (Simplified):

    text
    ->  Subquery Scan on rr
        ->  WindowAgg  (cost=5000000.00..6000000.00 rows=500000000) (actual time=120000.50..180000.90)
            ->  Sort  (cost=5000000.00..5100000.00 rows=500000000) (actual time=110000.10..150000.20)
                Sort Key: device_id, timestamp DESC
                ->  Seq Scan on sensor_readings  (cost=0.00..1000000.00 rows=500000000)

    Note the massive costs and the Seq Scan followed by a Sort on the entire table.

    LATERAL JOIN Plan (Simplified):

    text
    ->  Nested Loop  (cost=0.57..50000.00 rows=10000) (actual time=0.05..150.30)
        ->  Seq Scan on devices d  (cost=0.00..300.00 rows=10000) (Filter: is_active)
        ->  Limit  (cost=0.57..4.50 rows=1) (actual time=0.01..0.01)
            ->  Index Scan using idx_sensor_readings_device_timestamp on sensor_readings r  (cost=0.57..40.00 rows=10) (actual time=0.01..0.01)
                Index Cond: (device_id = d.id)

    Note the Nested Loop where the inner side is a fast Index Scan with a Limit. The total cost is orders of magnitude lower, and the actual time reflects sub-second performance.

    Implementation 2: Getting the Top-N (e.g., Top 3) Per Device

    The pattern extends beautifully to the general "Top-N" problem. To get the top 3 highest temperature readings for devices in 'us-east-1', we simply adjust the ORDER BY and LIMIT clauses inside the LATERAL subquery.

    sql
    -- PRODUCTION-GRADE QUERY FOR TOP-N-PER-GROUP
    SELECT
        d.id AS device_id,
        d.location,
        top_readings.timestamp,
        top_readings.temperature
    FROM
        devices d
    CROSS JOIN LATERAL (
        SELECT
            r.timestamp,
            r.temperature
        FROM
            sensor_readings r
        WHERE
            r.device_id = d.id
        ORDER BY
            r.temperature DESC
        LIMIT 3
    ) AS top_readings
    WHERE
        d.is_active = TRUE
        AND d.location = 'us-east-1';

    For this query to be performant, our other index, idx_sensor_readings_device_temperature (device_id, temperature DESC), becomes essential. The query planner will use this index to instantly find the top 3 temperature readings for each device, again avoiding a full scan and sort.

    This demonstrates a critical aspect of database engineering: indexing strategy must be co-designed with query patterns. Without the correct composite index, the LATERAL subquery would still be forced to scan all readings for a given device and sort them, negating much of the performance benefit.
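One refinement worth testing on your own workload: if the LATERAL subquery only returns columns that can be stored in the index, a covering index can upgrade the inner scan to an Index Only Scan. The sketch below uses the INCLUDE clause (PostgreSQL 11+); the index name is illustrative:

```sql
-- Hypothetical covering index: the INCLUDE column lets the planner
-- answer the LATERAL subquery from the index alone, avoiding heap
-- fetches for well-vacuumed tables (PostgreSQL 11+).
CREATE INDEX idx_sensor_readings_device_temp_inc
    ON sensor_readings (device_id, temperature DESC)
    INCLUDE (timestamp);
```

Note that Index Only Scans also depend on the visibility map, so vacuum a freshly bulk-loaded table before judging the plan.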

    Advanced Edge Cases and Production Patterns

    Real-world systems require handling more than just the happy path. LATERAL JOIN provides the flexibility to manage these scenarios gracefully.

    Edge Case 1: Devices with No Readings

CROSS JOIN LATERAL (or an inner JOIN LATERAL ... ON TRUE) behaves like an INNER JOIN. If a device has no entries in sensor_readings, it will be omitted from the final result set. This might be the desired behavior, but often we want to see all devices and simply show NULL for those without data.

    The solution is LEFT JOIN LATERAL, which combines the behavior of a LEFT OUTER JOIN with the power of LATERAL.

    sql
    -- Handling devices with no readings using LEFT JOIN LATERAL
    SELECT
        d.id,
        d.location,
        latest_reading.timestamp,
        latest_reading.temperature
    FROM
        devices d
    LEFT JOIN LATERAL (
        SELECT
            r.timestamp,
            r.temperature
        FROM
            sensor_readings r
        WHERE
            r.device_id = d.id
        ORDER BY
            r.timestamp DESC
        LIMIT 1
    ) AS latest_reading ON TRUE -- The ON TRUE clause is idiomatic and required
    WHERE
        d.is_active = TRUE;

    Now, active devices with no corresponding readings will appear in the result with NULL values for the timestamp and temperature columns. The ON TRUE clause is necessary because LEFT JOIN requires an ON condition, and since the correlation is already handled by the WHERE r.device_id = d.id clause inside the subquery, a universally true condition is used.
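The NULL-extended rows also give us a cheap way to find devices that have never reported. A NOT EXISTS anti-join expresses the same idea; this sketch keeps the LATERAL shape for symmetry with the queries above:

```sql
-- Active devices with no readings at all: the lateral subquery fetches
-- at most one reading; NULL-extended rows are the silent devices.
SELECT
    d.id,
    d.location
FROM
    devices d
LEFT JOIN LATERAL (
    SELECT r.id
    FROM sensor_readings r
    WHERE r.device_id = d.id
    LIMIT 1
) AS any_reading ON TRUE
WHERE
    d.is_active = TRUE
    AND any_reading.id IS NULL;
```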

    Edge Case 2: Complex Filtering and Date Ranges

    What if we need the latest reading, but only from the last 24 hours? Filters can be applied to both the outer and inner queries.

    sql
    SELECT
        d.id,
        d.location,
        latest_reading.timestamp,
        latest_reading.temperature
    FROM
        devices d
    CROSS JOIN LATERAL (
        SELECT
            r.timestamp,
            r.temperature
        FROM
            sensor_readings r
        WHERE
            r.device_id = d.id
            AND r.timestamp >= NOW() - INTERVAL '1 day' -- Inner query filter
        ORDER BY
            r.timestamp DESC
        LIMIT 1
    ) AS latest_reading
    WHERE
        d.is_active = TRUE -- Outer query filter
        AND d.location = 'eu-central-1';

This query remains highly performant. The outer WHERE clause reduces the number of iterations the Nested Loop performs, and the inner WHERE clause is folded into the index scan. Our idx_sensor_readings_device_timestamp is still highly effective: the planner can run a bounded range scan on the index for the given device_id and timestamp window.

    Pattern: Unnesting JSON with `LATERAL`

    While our main focus is top-N, the power of LATERAL extends to other complex relational operations. A common use case is de-normalizing JSON data into a relational format on the fly.

    Imagine our devices table had a JSONB column of supported_metrics:

    sql
    -- ALTER TABLE devices ADD COLUMN supported_metrics JSONB;
    -- UPDATE devices SET supported_metrics = '["temperature", "humidity"]' WHERE ...
    
    SELECT
        d.id,
        metric
    FROM
        devices d,
        LATERAL jsonb_array_elements_text(d.supported_metrics) AS metric;

This query produces one row per device per metric in its supported_metrics array. For set-returning functions in the FROM clause, the LATERAL keyword is actually optional (PostgreSQL treats function calls there as implicitly LATERAL), but being explicit with CROSS JOIN LATERAL is often clearer. This pattern is invaluable for querying and joining against semi-structured data within your relational database.
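When element order matters, WITH ORDINALITY adds a 1-based position column to the function's output. A small sketch (the column aliases are illustrative):

```sql
-- WITH ORDINALITY exposes each element's 1-based position in the array
SELECT
    d.id,
    m.metric,
    m.position
FROM
    devices d
CROSS JOIN LATERAL jsonb_array_elements_text(d.supported_metrics)
    WITH ORDINALITY AS m(metric, position);
```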

    Benchmarking: Quantifying the Performance Delta

    To drive the point home, let's establish a test harness and run a benchmark.

    Test Data Generation:

    sql
    -- Create 10,000 devices
    INSERT INTO devices (location, firmware_version, is_active)
    SELECT
        'location-' || (random() * 5)::int,
        'v' || (1 + (random() * 2))::int || '.' || (random() * 9)::int,
        TRUE
    FROM generate_series(1, 10000);
    
    -- Create 50 million sensor readings (5,000 per device)
    INSERT INTO sensor_readings (device_id, timestamp, temperature, humidity)
    SELECT
        d.id,
        NOW() - (random() * 365 * 2) * INTERVAL '1 day',
        10 + (random() * 25),
        30 + (random() * 60)
    FROM
        devices d,
        generate_series(1, 5000);

    Note: Run this on a sufficiently powerful machine. This insert will take time.
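Before timing the three approaches, refresh the planner statistics and the visibility map, and enable client-side timing in psql; statistics left stale after a bulk load can steer the planner toward a poor plan:

```sql
-- Refresh statistics and the visibility map after the bulk insert
VACUUM ANALYZE devices;
VACUUM ANALYZE sensor_readings;

-- In psql, print each query's wall-clock time
\timing on
```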

    Benchmark Results (Illustrative):

    Querying for the latest reading for all 10,000 devices on a system with the above data might yield results like this:

    | Method              | Execution Time | CPU Usage | I/O Activity              | Scalability Characteristic           |
    |---------------------|----------------|-----------|---------------------------|--------------------------------------|
    | ROW_NUMBER()        | 185.3 seconds  | High      | Heavy disk spill for sort | O(N log N) where N is total rows     |
    | Correlated Subquery | 12.1 seconds   | Medium    | Many small I/O operations | O(M) where M is outer rows           |
    | LATERAL JOIN        | 0.25 seconds   | Low       | Minimal index reads       | O(M log K) where K is rows per group |

    As the data shows, the LATERAL JOIN approach isn't just incrementally better; it represents a fundamental architectural advantage. It is 740x faster than the window function approach and 48x faster than the correlated subquery in this scenario. This is the difference between an interactive, real-time dashboard and a nightly batch report.

    Conclusion: `LATERAL JOIN` as a Primary Tool for Senior Engineers

    For experienced developers working with large-scale relational databases, mastering patterns beyond basic JOINs is non-negotiable. The "top-N-per-group" problem is a perfect case study for this principle. While window functions are powerful tools for analytics, they are often misused for retrieval tasks where their computational model (scan-sort-filter) is prohibitively expensive.

    LATERAL JOIN offers a more direct and vastly more performant execution model (loop-and-seek) for this class of query. By enabling a subquery to contextually reference preceding tables, it allows the query planner to leverage indexes to their full potential, executing thousands of tiny, targeted lookups instead of one monolithic, brute-force operation.

    When architecting systems that require real-time insights from large datasets, make LATERAL JOIN a primary consideration. Co-design your indexes with your LATERAL queries, use LEFT JOIN LATERAL to handle missing data gracefully, and build systems that scale with the number of groups, not the total volume of data. It is a technique that separates a functional data model from a truly high-performance one.
