Idempotency Keys: A Deep Dive with Redis and Atomic Lua Scripts


The Flaw in Naive Idempotency Checks

As senior engineers building distributed systems, we're all familiar with the concept of idempotency. We know that POST /charges should not create two charges if a client retries due to a network timeout. The standard solution is to require an Idempotency-Key header, check if we've seen it before, and if so, return the cached result.

A common first-pass implementation using a key-value store like Redis might look like this in pseudocode:

javascript
// DO NOT USE THIS IN PRODUCTION - FLAWED EXAMPLE
async function handleRequest(request) {
  const idempotencyKey = request.headers['idempotency-key'];

  if (!idempotencyKey) {
    // Proceed without idempotency
    return processBusinessLogic(request);
  }

  const cachedResponse = await redis.get(idempotencyKey);
  if (cachedResponse) {
    return JSON.parse(cachedResponse);
  }

  const result = await processBusinessLogic(request);

  // Cache the result for 1 day
  await redis.set(idempotencyKey, JSON.stringify(result), 'EX', 86400);

  return result;
}

This seems simple enough, but it harbors a critical race condition. Consider two identical requests arriving at nearly the same time:

  • Request A arrives. It checks Redis for the key idem-key-123. The key does not exist.
  • Request B arrives microseconds later. It also checks Redis for idem-key-123. The key still does not exist because Request A hasn't set it yet.
  • Request A proceeds to execute the business logic (e.g., charging a credit card).
  • Request B also proceeds to execute the same business logic.
  • The credit card is charged twice. Both requests eventually call redis.set, with one overwriting the other.

This is the very failure mode we sought to prevent. The root cause is that the GET and SET operations are not atomic. We need a mechanism to check for the key and immediately lock it, preventing any other process from acting on it.
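
The simplest atomic primitive Redis offers here is SET with the NX ("only if not exists") flag, which collapses the check and the write into a single command. A minimal sketch, assuming the same ioredis-style client as the example above:

javascript
// SET ... NX claims the key only if it does not already exist, so the
// existence check and the write happen as one atomic command.
const claimed = await redis.set(idempotencyKey, 'IN_PROGRESS', 'EX', 86400, 'NX');

if (claimed === 'OK') {
  // We won the race: safe to run the business logic.
} else {
  // Another request already claimed this key.
}

This closes the race, but a bare lock cannot distinguish "still processing" from "completed with this result" or "failed, retry allowed". That is why the next section models the record as a state machine instead.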

A Production-Grade Idempotency State Machine

To solve this robustly, we must treat the idempotency record not as a simple cached value, but as a state machine. Each key transitions through a defined lifecycle, which allows us to handle concurrency, failures, and retries gracefully.

Our state machine has three primary states:

  • IN_PROGRESS: The first request for a given key has arrived. We've locked the key and are currently processing the business logic. The lock must have a timeout (TTL) to prevent indefinite locks if the processing node dies.
  • COMPLETED: The business logic finished successfully. The record now stores the final HTTP status and body of the response. This record should also have a TTL, after which it's safe to purge.
  • FAILED: The business logic encountered an error. We store this state to indicate that a retry is permissible. A subsequent request with the same key can then transition the state back to IN_PROGRESS.

Here's a visualization of the state transitions:

mermaid
graph TD
    A[Start] -->|Request 1 Arrives| B(IN_PROGRESS)
    B -->|Processing Succeeds| C(COMPLETED)
    B -->|Processing Fails| D(FAILED)
    C -->|Request 2 Arrives| C
    D -->|Request 2 Arrives| B
    B -->|Concurrent Request 2 Arrives| E{Conflict}

This model elegantly handles the race condition. When Request B arrives while Request A is processing, it will find the key in the IN_PROGRESS state and can immediately return a 409 Conflict, signaling to the client that an operation is already underway.

The Atomic Core: Redis and Lua Scripting

To implement this state machine atomically, we cannot rely on separate Redis commands. Redis transactions (MULTI/EXEC) are insufficient because they don't allow for conditional logic based on the result of a command within the transaction. A WATCH/MULTI/EXEC block could work, but it becomes complex to manage and can suffer from high abort/retry rates under contention.
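
For comparison, here is a sketch of what that optimistic-locking route looks like with ioredis. exec() resolves to null whenever the watched key changed between WATCH and EXEC, so callers must loop, and under contention those retries pile up:

javascript
// Sketch of the WATCH/MULTI/EXEC alternative (optimistic locking).
// If another client touches the key between WATCH and EXEC, exec()
// resolves to null and the whole attempt must be retried.
async function checkAndLockOptimistic(redis, key, ttlSeconds, fingerprint) {
  await redis.watch(key);

  const existing = await redis.get(key);
  if (existing) {
    await redis.unwatch(); // release the watch; inspect the state in app code
    return existing;
  }

  const lock = JSON.stringify({ state: 'IN_PROGRESS', fingerprint });
  const result = await redis.multi().set(key, lock, 'EX', ttlSeconds).exec();

  return result === null ? 'RETRY' : 'PROCEED'; // null => watched key changed
}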

The ideal solution is a server-side Lua script. Redis guarantees that Lua scripts are executed atomically: no other command can run while a script is executing, making it the perfect tool for our check-and-lock operation.

We'll need two primary scripts: one to initiate the process and lock the key, and another to store the final result.

Code Example 1: The `check_and_lock` Lua Script

This script is the entry point for our idempotency middleware. It checks the current state of the key and decides the next action.

check_and_lock.lua:

lua
-- KEYS[1]: The idempotency key (e.g., 'idem:uuid-123')
-- ARGV[1]: The lock timeout in seconds (TTL for the IN_PROGRESS state)
-- ARGV[2]: A unique identifier for the current request attempt (e.g., a request ID or worker ID)

-- Get the current value associated with the idempotency key
local existing_value = redis.call('GET', KEYS[1])

if not existing_value then
    -- Key does not exist. This is the first time we've seen this request.
    -- Create a lock record in the 'IN_PROGRESS' state.
    -- The value is a serialized object containing the state and the request fingerprint.
    local new_value = cjson.encode({state = 'IN_PROGRESS', fingerprint = ARGV[2]})
    redis.call('SET', KEYS[1], new_value, 'EX', ARGV[1])
    -- Signal to the application to proceed with business logic.
    return 'PROCEED'
end

-- Key exists, so we need to inspect its state.
local record = cjson.decode(existing_value)

if record.state == 'IN_PROGRESS' then
    -- Another process is already working on this request.
    -- We return 'CONFLICT' to signal a 409 response.
    -- An advanced implementation could check the fingerprint (ARGV[2]) to see if it's the *same* worker retrying,
    -- but for simplicity, we'll treat any IN_PROGRESS as a conflict for external callers.
    return 'CONFLICT'
elseif record.state == 'COMPLETED' then
    -- The operation is already complete. Return the cached result.
    return existing_value
elseif record.state == 'FAILED' then
    -- The previous attempt failed. We can allow a retry.
    -- Transition the state back to IN_PROGRESS with the new fingerprint.
    local new_value = cjson.encode({state = 'IN_PROGRESS', fingerprint = ARGV[2]})
    redis.call('SET', KEYS[1], new_value, 'EX', ARGV[1])
    return 'PROCEED'
end

-- Fallback, should not be reached with the defined states.
return 'CONFLICT'

Key Details of the Script:

* Atomicity: All logic from redis.call('GET', ...) to the final return is executed as a single, indivisible operation.

* Stateful Value: We store a JSON string in Redis, not just a simple value. This allows us to encode the state (IN_PROGRESS, COMPLETED, or FAILED) and other metadata like a fingerprint.

* Lock TTL: The EX argument on SET is crucial. If the worker processing the request dies, the lock will automatically expire, allowing another request to eventually proceed.
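
You can exercise the script's return values in isolation with EVAL before wiring up any middleware. A quick sketch, assuming a local Redis and the script saved at ./scripts/check_and_lock.lua:

javascript
const Redis = require('ioredis');
const fs = require('fs');

const redis = new Redis();
const script = fs.readFileSync('./scripts/check_and_lock.lua', 'utf8');

async function demo() {
  // First caller: key is absent, so the script locks it and says PROCEED.
  console.log(await redis.eval(script, 1, 'idem:demo', 30, 'worker-1')); // 'PROCEED'

  // Second caller while the first is still working: the key is IN_PROGRESS.
  console.log(await redis.eval(script, 1, 'idem:demo', 30, 'worker-2')); // 'CONFLICT'

  redis.disconnect();
}

demo().catch(console.error);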

Code Example 2: The `store_result` Lua Script

Once the business logic is complete, this script atomically updates the key from IN_PROGRESS to COMPLETED, but only if it's the rightful owner of the lock.

store_result.lua:

lua
-- KEYS[1]: The idempotency key
-- ARGV[1]: The result to store (serialized JSON of {statusCode, body})
-- ARGV[2]: The result TTL in seconds
-- ARGV[3]: The unique fingerprint of the request that performed the work

local existing_value = redis.call('GET', KEYS[1])

if not existing_value then
    -- This should not happen in a correct flow, but as a safeguard, do nothing.
    -- The lock may have expired.
    return 0
end

local record = cjson.decode(existing_value)

-- Atomically update the record ONLY if the state is IN_PROGRESS and the fingerprint matches.
-- This prevents a slow, timed-out request from overwriting the result of a faster, subsequent retry.
if record.state == 'IN_PROGRESS' and record.fingerprint == ARGV[3] then
    local result_data = cjson.decode(ARGV[1])
    local new_value = cjson.encode({
        state = 'COMPLETED',
        statusCode = result_data.statusCode,
        body = result_data.body
    })
    redis.call('SET', KEYS[1], new_value, 'EX', ARGV[2])
    return 1 -- Success
else
    -- The lock was either lost, or another process took over.
    -- Do not store the result.
    return 0 -- Failure
end

Key Details of the Script:

* Fingerprint Check: The record.fingerprint == ARGV[3] check is vital. It prevents a race condition where: 1) Request A acquires the lock. 2) It takes too long, and the lock expires. 3) Request B acquires a new lock and completes quickly. 4) Request A finally finishes and tries to store its result. Without the fingerprint check, stale Request A would overwrite the correct result from Request B.
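
To make that failure concrete, here is a hypothetical replay of the sequence using the storeResult command we define on the ioredis client in the next section (the key and fingerprints are illustrative):

javascript
async function finishRequestA(redis) {
  // Request A locked 'idem:uuid-123' with fingerprint 'fp-A', then stalled past
  // its lock TTL. Request B has since re-locked the key as 'fp-B' and completed.
  const stored = await redis.storeResult(
    'idem:uuid-123',
    JSON.stringify({ statusCode: 201, body: '{"chargeId":"ch_A"}' }),
    86400,
    'fp-A' // stale fingerprint: the script returns 0 and writes nothing
  );

  if (stored === 0) {
    // Our lock was lost; discard the local result instead of clobbering B's.
  }
}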

Integrating into a Microservice (Node.js/Express Example)

Now let's integrate these scripts into a practical Express.js middleware. We'll use the ioredis library, which has excellent support for Lua scripting.

Setup

First, let's create a Redis client manager that loads our scripts.

redisClient.js:

javascript
const Redis = require('ioredis');
const fs = require('fs');
const path = require('path');

const redis = new Redis({
    // Your Redis connection options
});

// Load and define Lua scripts
redis.defineCommand('checkAndLock', {
    numberOfKeys: 1,
    lua: fs.readFileSync(path.join(__dirname, 'scripts/check_and_lock.lua'), 'utf8'),
});

redis.defineCommand('storeResult', {
    numberOfKeys: 1,
    lua: fs.readFileSync(path.join(__dirname, 'scripts/store_result.lua'), 'utf8'),
});
    
// Optional: a script to mark a request as failed
redis.defineCommand('markFailed', {
    numberOfKeys: 1,
    lua: `
        -- Guard against a missing key: GET returns false once the lock has
        -- expired, and cjson.decode(false) would raise a script error.
        local existing_value = redis.call('GET', KEYS[1])
        if not existing_value then
            return 0
        end
        local record = cjson.decode(existing_value)
        if record.state == 'IN_PROGRESS' and record.fingerprint == ARGV[2] then
            local new_value = cjson.encode({state = 'FAILED'})
            redis.call('SET', KEYS[1], new_value, 'EX', ARGV[1])
            return 1
        end
        return 0
    `,
});

module.exports = redis;

Code Example 3: The Idempotency Middleware

This middleware orchestrates the entire flow.

idempotencyMiddleware.js:

javascript
const redis = require('./redisClient');
const { randomUUID } = require('crypto');

const LOCK_TTL_SECONDS = 30; // How long to hold the 'IN_PROGRESS' lock
const RESULT_TTL_SECONDS = 86400; // 24 hours
const FAILED_TTL_SECONDS = 300; // 5 minutes
    
async function idempotencyMiddleware(req, res, next) {
    const idempotencyKey = req.headers['idempotency-key'];

    if (!idempotencyKey) {
        return next(); // No key, proceed without idempotency
    }

    const redisKey = `idem:${idempotencyKey}`;
    const requestFingerprint = randomUUID();

    try {
        const result = await redis.checkAndLock(redisKey, LOCK_TTL_SECONDS, requestFingerprint);

        if (result === 'PROCEED') {
            // Attach fingerprint to response locals to use it later
            res.locals.idempotency = { key: redisKey, fingerprint: requestFingerprint };

            // We need to intercept the response to store it
            const originalSend = res.send;
            res.send = function (body) {
                // Only store successful (2xx) responses. Note: a non-2xx response
                // that bypasses the error handler leaves the key IN_PROGRESS
                // until the lock TTL expires.
                if (res.statusCode >= 200 && res.statusCode < 300) {
                    const responseToCache = JSON.stringify({ statusCode: res.statusCode, body });
                    redis.storeResult(redisKey, responseToCache, RESULT_TTL_SECONDS, requestFingerprint)
                        .catch(err => console.error('Failed to store idempotency result:', err));
                }
                return originalSend.call(this, body);
            };

            return next(); // Proceed to business logic
        } else if (result === 'CONFLICT') {
            return res.status(409).json({ error: 'Request already in progress' });
        } else {
            // We have a cached result
            const cached = JSON.parse(result);
            return res.status(cached.statusCode).send(cached.body);
        }
    } catch (error) {
        console.error('Idempotency middleware error:', error);
        return next(error);
    }
}
    
// Error handler to mark failed requests
function idempotencyErrorHandler(err, req, res, next) {
    const { key, fingerprint } = res.locals.idempotency || {};

    if (key && fingerprint) {
        redis.markFailed(key, FAILED_TTL_SECONDS, fingerprint)
            .catch(e => console.error('Failed to mark idempotency key as FAILED:', e));
    }

    // Standard error response
    if (!res.headersSent) {
        res.status(500).json({ error: 'Internal Server Error' });
    }
}

module.exports = { idempotencyMiddleware, idempotencyErrorHandler };

Usage in an Express App:

javascript
const express = require('express');
const { idempotencyMiddleware, idempotencyErrorHandler } = require('./idempotencyMiddleware');

const app = express();
app.use(express.json());

app.post('/charge', idempotencyMiddleware, async (req, res, next) => {
    try {
        // Simulate complex business logic
        console.log('Processing charge for key:', req.headers['idempotency-key']);
        await new Promise(resolve => setTimeout(resolve, 2000));

        // Example of a conditional failure
        if (req.body.amount > 1000) {
            throw new Error('Amount exceeds limit');
        }

        res.status(201).json({ success: true, chargeId: `ch_${Date.now()}` });
    } catch (err) {
        // Express 4 does not forward errors thrown in async handlers;
        // pass them to the error-handling middleware explicitly.
        next(err);
    }
});

// IMPORTANT: The error handler must be placed after the routes
app.use(idempotencyErrorHandler);

app.listen(3000, () => console.log('Server running on port 3000'));

This implementation provides a complete, robust idempotency layer. It handles the happy path, concurrent requests, and server-side errors that should allow for a retry.
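
A quick way to verify the concurrency behavior by hand is to fire two identical requests with the same key at the running server. A sketch, assuming Node 18+ for the global fetch:

javascript
const { randomUUID } = require('crypto');

const key = randomUUID();
const fire = () =>
  fetch('http://localhost:3000/charge', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'Idempotency-Key': key },
    body: JSON.stringify({ amount: 100 }),
  }).then(res => res.status);

// One request should win the lock (201 after ~2s); the other should be
// rejected with 409 while the first is still IN_PROGRESS.
Promise.all([fire(), fire()]).then(statuses => console.log(statuses));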

Advanced Considerations and Edge Cases

A production system requires thinking beyond the core logic.

Choosing the Idempotency Key

The client should generate a unique key, typically a UUIDv4, and send it in the Idempotency-Key header. The client is responsible for persisting this key and reusing it for retries of the exact same logical operation. If the request parameters change, a new key must be generated.
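
A minimal client-side sketch of that contract (the chargeWithRetry helper and endpoint are illustrative; assumes Node 18+ for fetch):

javascript
const { randomUUID } = require('crypto');

// The key is generated once per logical operation and reused verbatim on retry.
async function chargeWithRetry(payload, maxAttempts = 3) {
  const idempotencyKey = randomUUID(); // persist this if retries can outlive the process

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const res = await fetch('http://localhost:3000/charge', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Idempotency-Key': idempotencyKey, // identical on every retry
        },
        body: JSON.stringify(payload),
      });
      // 409 means another attempt is still in flight; back off and retry.
      if (res.status !== 409) return res.json();
    } catch (err) {
      // Network failure: the server may or may not have processed the request,
      // which is precisely why the same key must be reused on the next attempt.
    }
    await new Promise(resolve => setTimeout(resolve, 500 * attempt)); // simple backoff
  }
  throw new Error('Charge did not complete after retries');
}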

Partial Failures: The Service-to-Redis Gap

What happens if your business logic commits a database transaction, but the subsequent call to redis.storeResult fails due to a network partition between your service and Redis? The system is now in an inconsistent state:

* Source of Truth (Database): The charge is complete.

* Idempotency Cache (Redis): The key is still IN_PROGRESS.

When the lock expires, a new request will be allowed to proceed, potentially causing a duplicate operation. This is a classic dual-write problem: two independent stores must agree, yet no transaction spans them both.

Solution Pattern: Asynchronous Reconciliation

  • Instrument Your Logic: When starting a job tied to an idempotency key, log the key and its status to a durable log or a database table (idempotency_jobs).
  • Reconciliation Worker: Run a background job that periodically scans for IN_PROGRESS keys in Redis that are near their TTL expiration.
  • Verify with Source of Truth: For each expiring key, the worker checks the idempotency_jobs table (or the primary business table, e.g., charges) to determine the true status of the operation.
  • Correct the Cache:
      * If the job was successful, the worker forcefully updates the Redis key to COMPLETED with the correct result.
      * If the job failed or is unknown, the worker can delete the key to allow a clean retry.

This adds complexity but closes the consistency gap, moving the system closer to a true exactly-once guarantee.
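
A sketch of such a worker, assuming ioredis; db.findJobByIdempotencyKey is a hypothetical accessor for the durable idempotency_jobs record described above, so adapt it to your schema:

javascript
async function reconcile(redis, db) {
  // SCAN iterates the keyspace incrementally, unlike KEYS, which would block Redis.
  const stream = redis.scanStream({ match: 'idem:*', count: 100 });

  for await (const keys of stream) {
    for (const key of keys) {
      const [value, ttl] = await Promise.all([redis.get(key), redis.ttl(key)]);
      if (!value || ttl > 60) continue; // only touch keys close to expiry

      const record = JSON.parse(value);
      if (record.state !== 'IN_PROGRESS') continue;

      const job = await db.findJobByIdempotencyKey(key); // hypothetical lookup
      if (job && job.status === 'succeeded') {
        // The work actually finished: promote the record to COMPLETED.
        const completed = JSON.stringify({
          state: 'COMPLETED',
          statusCode: 201,
          body: job.resultBody, // hypothetical column holding the cached response
        });
        await redis.set(key, completed, 'EX', 86400);
      } else {
        // Failed or unknown: drop the key so a clean retry can proceed.
        await redis.del(key);
      }
    }
  }
}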

Performance and Scalability

While Redis is incredibly fast, adding two Redis commands to every critical write path introduces latency. At scale, this matters.

* Lua Overhead: Executing Lua is slightly slower than native Redis commands, but it saves one or more network round trips, making it a net win for complex atomic operations.

* Redis Clustering: If you use Redis Cluster, you must ensure that any key manipulated by a Lua script resides on a single shard. Lua scripts cannot operate on keys across different shards. The standard way to solve this is with hash tags: by naming your key {idem:user123}:uuid-abc, you tell Redis to hash only the part within the curly braces (idem:user123), ensuring that all keys for that user land on the same shard (see the sketch after this list).

* Connection Pooling: Ensure your application uses a robust Redis client with proper connection pooling to handle high throughput without exhausting server resources.
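
A small illustration of the hash-tag convention (the per-user key shape is an assumption, not part of the earlier scripts):

javascript
// Only the text inside the braces is fed to the cluster slot-hashing function,
// so every idempotency record for a given user lands on the same shard.
function clusterSafeKey(userId, idempotencyKey) {
  return `{idem:${userId}}:${idempotencyKey}`; // e.g. '{idem:user123}:uuid-abc'
}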

Idempotency Beyond `POST`

While often associated with POST, this pattern is equally useful for non-idempotent PATCH operations, or for ensuring that a complex, multi-step PUT operation that isn't naturally atomic can be retried safely. The key is to protect any state-changing operation whose side effects are not easily reversible or repeatable.

Conclusion

Implementing a truly robust idempotency layer is a significant engineering task that goes far beyond a simple GET/SET check. By adopting a state machine model and leveraging the atomicity of Redis Lua scripts, we can build a system that is resilient to race conditions, client retries, and even certain classes of server failure.

The patterns discussed here—atomic locking, state transitions, fingerprinting to prevent stale writes, and planning for reconciliation—form the bedrock of reliable distributed systems. While the initial implementation is more complex, the consistency and safety it provides for critical business operations are indispensable at scale.
