Idempotency Keys in Distributed Systems using Redis and Lua Scripts

The Inevitability of Duplicate Requests in Distributed Architectures

In any non-trivial distributed system, duplicate requests are not a possibility; they are a certainty. Client-side retry logic, network hiccups, gateway timeouts (like a 504 Gateway Timeout), and complex service orchestration patterns all cause a single logical operation to manifest as multiple physical HTTP requests. For read operations (GET, HEAD, OPTIONS), this is a non-issue. For state-changing operations (POST, PUT, PATCH), a duplicate request can lead to severe data corruption: double-charging a customer, creating duplicate resources, or sending multiple notifications.

The common solution is to enforce idempotency. An operation is idempotent if making the same request multiple times produces the same result as making it once. This requires the server to recognize and correctly handle repeated requests. The standard mechanism for this is the Idempotency-Key header, a client-generated UUID that uniquely identifies a single logical operation.
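
The contract this implies for clients: generate the key once per logical operation and reuse it verbatim on every retry. Here is a minimal, hypothetical client-side sketch (assuming Node 18+ for the global fetch and the uuid package; the endpoint and payload are invented for illustration):

javascript
// Hypothetical client: the idempotency key is generated ONCE per logical
// operation and reused on every retry attempt.
const { v4: uuidv4 } = require('uuid');

async function createPaymentWithRetry(payload, maxAttempts = 3) {
  const idempotencyKey = uuidv4(); // stable across all attempts below
  let lastError;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const response = await fetch('https://api.example.com/api/payments', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Idempotency-Key': idempotencyKey, // same key on every attempt
        },
        body: JSON.stringify(payload),
      });
      if (response.ok) return response.json();
      lastError = new Error(`HTTP ${response.status}`);
    } catch (err) {
      // Network failure or timeout: safe to retry with the unchanged key.
      lastError = err;
    }
  }
  throw lastError;
}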

However, implementing the server-side logic for this is fraught with peril. A simple check-and-set operation is vulnerable to race conditions in a high-concurrency environment. Consider this naive, and fundamentally broken, approach:

javascript
// DO NOT USE THIS - FLAWED EXAMPLE
async function handleRequest(req, res) {
  const idempotencyKey = req.headers['idempotency-key'];
  const redisClient = getRedisClient();

  const keyExists = await redisClient.get(idempotencyKey);

  if (keyExists) {
    // Race condition: another in-flight request may still be processing.
    // Worse, if the stored value is the 'processing' sentinel set below,
    // JSON.parse will throw here.
    return res.status(200).json(JSON.parse(keyExists));
  }

  // And another race condition here! Two requests could pass the check above.
  await redisClient.set(idempotencyKey, 'processing', 'EX', 60);

  const result = await performExpensiveOperation();

  await redisClient.set(idempotencyKey, JSON.stringify(result), 'EX', 86400);
  res.status(201).json(result);
}

Between the get and set calls, another request with the same Idempotency-Key can slip through, leading to the very duplication we aimed to prevent. Using Redis transactions (MULTI/EXEC) with WATCH can help, but it introduces complexity around retries and doesn't elegantly solve the problem of signaling to a concurrent request that another is already in progress. The truly robust solution lies in guaranteeing atomicity at the data layer itself.
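
For comparison, here is a hedged sketch of that WATCH-based alternative (ioredis API assumed; note also that WATCH state is per-connection, so a real deployment would need a dedicated connection rather than ioredis's usual multiplexed one):

javascript
// Optimistic check-and-set with WATCH/MULTI. Note the retry loop, and note
// that a losing request learns nothing about whether the winner is still
// running or finished long ago.
async function acquireWithWatch(redisClient, idempotencyKey) {
  for (let attempt = 0; attempt < 5; attempt++) {
    await redisClient.watch(idempotencyKey);
    const existing = await redisClient.get(idempotencyKey);

    if (existing) {
      await redisClient.unwatch();
      return { acquired: false, existing }; // in progress or done? Unclear.
    }

    // exec() resolves to null if the watched key changed before EXEC.
    const result = await redisClient
      .multi()
      .set(idempotencyKey, 'processing', 'EX', 60)
      .exec();

    if (result !== null) return { acquired: true };
    // Lost the race; loop and re-check.
  }
  throw new Error('Could not settle idempotency state under contention');
}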

The Atomic Solution: Stateful Locking with Redis and Lua

Redis provides a powerful mechanism for ensuring atomicity: Lua scripting. A Lua script sent to Redis via the EVAL or EVALSHA command is guaranteed to execute atomically. No other command can run concurrently with the script, effectively turning a series of commands into a single, indivisible operation. This allows us to build a sophisticated, stateful locking mechanism that is immune to race conditions.

Our idempotency layer will manage three states for a given key:

  • Non-existent: No request with this key has been seen.
  • IN_PROGRESS: A request has been received and is currently being processed. Any concurrent requests with the same key should be rejected immediately to prevent redundant work.
  • COMPLETED: The original request finished successfully. Its response is cached. Any subsequent requests with the same key should receive this cached response without re-executing the business logic.
This state machine is managed entirely within a single Lua script, which provides the necessary atomic guarantees. Concretely, the value stored at the key in each state looks like the sketch below.
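
The field names here match the Lua script introduced next; the UUID and payload values are hypothetical.

javascript
// Hypothetical examples of the value stored at an idempotency key.
// State 1 - Non-existent: GET returns nil; nothing has been stored.

// State 2 - IN_PROGRESS: a lock record naming the owner that acquired it.
const inProgressValue = JSON.stringify({
  status: 'IN_PROGRESS',
  owner: 'f47ac10b-58cc-4372-a567-0e02b2c3d479', // a per-request UUID
});

// State 3 - COMPLETED: the cached response, ready to be replayed verbatim.
const completedValue = JSON.stringify({
  status: 'COMPLETED',
  statusCode: 201,
  responseBody: '{"id":"payment_123","status":"succeeded"}',
});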

    The Core Lua Script for Atomic Idempotency Checks

This script is the heart of our solution. It takes the idempotency key as KEYS[1], a timeout for the IN_PROGRESS state as ARGV[1], and a unique identifier for this processing attempt as ARGV[2], which is recorded as the lock's owner (useful for debugging and for releasing the lock safely).

    lua
    -- KEYS[1]: The idempotency key
    -- ARGV[1]: The lock timeout in seconds (for IN_PROGRESS state)
    -- ARGV[2]: A unique request/server identifier for locking
    
    local key = KEYS[1]
    local lock_ttl = ARGV[1]
    local lock_owner = ARGV[2]
    
    -- Check if the key already exists
    local existing_value = redis.call('GET', key)
    
    if existing_value then
        -- Key exists, parse the stored JSON
        local data = cjson.decode(existing_value)
        if data.status == 'IN_PROGRESS' then
            -- Another request is already processing, return a conflict
            -- Return 'LOCKED' and the owner of the lock
            return {'LOCKED', data.owner}
        elseif data.status == 'COMPLETED' then
            -- Request was already completed, return the stored response
            -- Return 'COMPLETED', status code, and response body
            return {'COMPLETED', data.statusCode, data.responseBody}
        end
    end
    
    -- Key does not exist, so this is the first time we've seen this request.
    -- Create the lock with IN_PROGRESS status.
    local new_value = cjson.encode({status = 'IN_PROGRESS', owner = lock_owner})
    redis.call('SET', key, new_value, 'EX', lock_ttl)
    
    -- Signal that the caller can proceed with the operation.
    return {'PROCEED'}

    Dissection of the Script:

  • redis.call('GET', key): We first attempt to retrieve the key.
  • if existing_value then: If the key exists, we must determine its state.
  • cjson.decode(existing_value): We store our state as a JSON string for flexibility, and the script decodes it. Note: Redis's embedded Lua environment has bundled the cjson library since scripting was introduced in Redis 2.6.
  • if data.status == 'IN_PROGRESS': If a request is already being processed, we immediately return a LOCKED status. This allows our application middleware to respond with an HTTP 429 Too Many Requests or HTTP 409 Conflict, preventing the caller from hammering the service while the first request is running.
  • if data.status == 'COMPLETED': If the request has already successfully completed, we return the COMPLETED status along with the stored HTTP status code and response body. The middleware can then replay this response to the client.
  • redis.call('SET', key, new_value, 'EX', lock_ttl): If the key does not exist, we are the first. We atomically set the key to a JSON payload indicating the IN_PROGRESS state and set a Time-To-Live (TTL). The TTL is critical; it acts as a dead-man's switch. If our server instance crashes mid-process, the lock will eventually expire, allowing a subsequent request to proceed.
  • return {'PROCEED'}: This signals to our middleware that it has successfully acquired the lock and can execute the business logic.
Production Implementation: Express.js Middleware

    Now, let's wrap this logic in a robust Node.js Express middleware. We'll use the ioredis library, which has excellent support for Lua scripting.

    1. Setup and Script Loading

    First, we define and load our Lua script. It's a best practice to load the script into Redis once using SCRIPT LOAD and then execute it using its SHA1 hash with EVALSHA. This reduces network overhead on subsequent calls.

    javascript
    // redisClient.js
    const IORedis = require('ioredis');
    const fs = require('fs');
    const path = require('path');
    
    const redisClient = new IORedis({
        // Your Redis connection options
        host: 'localhost',
        port: 6379,
        maxRetriesPerRequest: null, // null removes the retry cap: commands queue during reconnects instead of failing
    });
    
    const LUA_SCRIPTS = {};
    
    async function loadLuaScripts() {
        const scriptPath = path.join(__dirname, 'lua', 'idempotency.lua');
        const script = fs.readFileSync(scriptPath, 'utf8');
        LUA_SCRIPTS.idempotency = {
            script,
            sha: await redisClient.script('load', script),
        };
        console.log('Idempotency Lua script loaded.');
    }
    
    function getRedisClient() {
        return redisClient;
    }
    
    function getLuaScript(name) {
        return LUA_SCRIPTS[name];
    }
    
    module.exports = { getRedisClient, loadLuaScripts, getLuaScript };

    We would call loadLuaScripts() at application startup.
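
One wrinkle with EVALSHA: if Redis restarts or a cluster node fails over, the script cache is flushed and EVALSHA fails with a NOSCRIPT error. A minimal guard, as a sketch (the error-string check is an assumption that holds for ioredis, which surfaces the raw Redis error message):

javascript
// Run the script by SHA, but fall back to EVAL (which both executes and
// re-caches the script) if Redis has lost its script cache.
async function runScript(redisClient, luaScript, numKeys, ...keysAndArgs) {
  try {
    return await redisClient.evalsha(luaScript.sha, numKeys, ...keysAndArgs);
  } catch (err) {
    if (err.message && err.message.includes('NOSCRIPT')) {
      return await redisClient.eval(luaScript.script, numKeys, ...keysAndArgs);
    }
    throw err;
  }
}

Alternatively, ioredis's defineCommand registers a Lua script as a named command and performs this EVALSHA-to-EVAL fallback automatically.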

    2. The Idempotency Middleware

    This middleware will orchestrate the entire flow: checking the key, executing business logic, and saving the final result.

    javascript
    // idempotencyMiddleware.js
    const { getRedisClient, getLuaScript } = require('./redisClient');
    const { v4: uuidv4 } = require('uuid');
    
    const IDEMPOTENCY_KEY_HEADER = 'idempotency-key';
    const IN_PROGRESS_TTL_SECONDS = 30; // Short TTL for the lock
    const COMPLETED_TTL_SECONDS = 86400; // 24 hours for completed responses
    
    const idempotencyMiddleware = async (req, res, next) => {
        const idempotencyKey = req.header(IDEMPOTENCY_KEY_HEADER);
    
        if (!idempotencyKey) {
            // Forcing idempotency keys is a good practice for mutable endpoints
            return res.status(400).json({ error: `Header '${IDEMPOTENCY_KEY_HEADER}' is required.` });
        }
    
        const redisClient = getRedisClient();
        const idempotencyScript = getLuaScript('idempotency');
    const serverInstanceId = uuidv4(); // Unique owner ID for this request's lock
    
        try {
            const result = await redisClient.evalsha(
                idempotencyScript.sha,
                1, // Number of keys
                idempotencyKey,
                IN_PROGRESS_TTL_SECONDS,
                serverInstanceId
            );
    
            const status = result[0];
    
            if (status === 'LOCKED') {
                console.warn(`Idempotency key ${idempotencyKey} is locked by ${result[1]}`);
                // Another request is in progress.
                return res.status(429).json({ error: 'A request is already in progress for this key.' });
            }
    
            if (status === 'COMPLETED') {
                console.log(`Returning cached response for idempotency key ${idempotencyKey}`);
                const storedStatusCode = parseInt(result[1], 10);
                const storedResponseBody = JSON.parse(result[2]);
                // Replay the original response
                return res.status(storedStatusCode).json(storedResponseBody);
            }
    
            // Status is 'PROCEED'. We have the lock.
            // We need to capture the response to store it.
            const originalJson = res.json;
            const originalSend = res.send;
            let responseBody = null;
    
        // Monkey-patch res.json and res.send to intercept the response body.
        // Express's res.json calls res.send internally, so the send patch only
        // captures the body when res.json hasn't already recorded it; otherwise
        // we would overwrite the object with its stringified form.
        res.json = (body) => {
            responseBody = body;
            return originalJson.call(res, body);
        };
        res.send = (body) => {
            if (responseBody === null) {
                responseBody = body;
            }
            return originalSend.call(res, body);
        };
    
            res.on('finish', async () => {
                // The request has finished processing, success or failure.
                if (res.statusCode >= 200 && res.statusCode < 300) {
                    console.log(`Storing successful response for idempotency key ${idempotencyKey}`);
                    const dataToStore = {
                        status: 'COMPLETED',
                        statusCode: res.statusCode,
                        responseBody: JSON.stringify(responseBody || {}),
                    };
    
                    await redisClient.set(
                        idempotencyKey,
                        JSON.stringify(dataToStore),
                        'EX',
                        COMPLETED_TTL_SECONDS
                    );
            } else {
                // The operation failed. Release the lock so the client can retry.
                // Caveat: this unconditional DEL can delete a lock we no longer
                // own if our TTL already expired; an owner-checked release is
                // sketched after the implementation notes below.
                console.warn(`Operation failed for idempotency key ${idempotencyKey}. Releasing lock.`);
                await redisClient.del(idempotencyKey);
            }
            });
    
            next(); // Proceed to the actual controller logic
    
        } catch (error) {
            console.error('Idempotency middleware error:', error);
            // In case of a Redis error, we should fail open or closed depending on safety requirements.
            // Failing closed is safer for financial transactions.
            return res.status(500).json({ error: 'Internal server error during idempotency check.' });
        }
    };
    
    module.exports = { idempotencyMiddleware };

    Key Implementation Details:

  • Monkey-Patching res.json: This is a crucial, if slightly invasive, technique. To store the response, we need to capture it before it's sent to the client. We replace res.json and res.send with our own functions that store the body, then call the original functions.
  • res.on('finish', ...): We use the finish event on the response object to trigger our persistence logic. This ensures our code runs after the controller has done its work and the response is finalized.
  • Success vs. Failure: On a successful response (2xx status code), we update the key's state to COMPLETED and store the response body with a long TTL. If the request fails (e.g., validation error, downstream service failure), we DEL the key entirely. This is a critical design choice: deleting the key on failure allows the client to retry the operation with the same idempotency key, whereas storing a failure response would block all future attempts. A safer, owner-checked variant of the release is sketched below.
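
A subtlety with that DEL: if the operation outlived IN_PROGRESS_TTL_SECONDS, our lock may have already expired and been re-acquired by a retry, in which case an unconditional DEL releases someone else's lock. A hedged sketch of an owner-checked release, as a hypothetical companion Lua script that uses the owner field we already store:

javascript
// Delete the key only if it still holds OUR IN_PROGRESS lock.
// KEYS[1]: idempotency key; ARGV[1]: the owner id we set when locking.
const RELEASE_LOCK_SCRIPT = `
local existing = redis.call('GET', KEYS[1])
if existing then
    local data = cjson.decode(existing)
    if data.status == 'IN_PROGRESS' and data.owner == ARGV[1] then
        return redis.call('DEL', KEYS[1])
    end
end
return 0
`;

// In the failure branch of the 'finish' handler, instead of a bare DEL:
// await redisClient.eval(RELEASE_LOCK_SCRIPT, 1, idempotencyKey, serverInstanceId);
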
3. Usage in an Express Route

    Using the middleware is now straightforward. Let's imagine an endpoint for creating a payment.

    javascript
    // server.js
    const express = require('express');
    const { idempotencyMiddleware } = require('./idempotencyMiddleware');
    const { loadLuaScripts } = require('./redisClient');
    
    const app = express();
    app.use(express.json());
    
    // Example route for a critical operation
    app.post('/api/payments', idempotencyMiddleware, async (req, res) => {
        try {
            console.log('Executing payment creation logic...');
            // Simulate a slow, critical operation
            await new Promise(resolve => setTimeout(resolve, 2000));
    
            const payment = {
                id: `payment_${Date.now()}`,
                amount: req.body.amount,
                currency: req.body.currency,
                status: 'succeeded',
            };
    
            // The middleware will capture this response
            res.status(201).json(payment);
        } catch (error) {
            console.error('Payment creation failed:', error);
            res.status(500).json({ error: 'Payment processing failed.' });
        }
    });
    
    async function startServer() {
        await loadLuaScripts();
        app.listen(3000, () => {
            console.log('Server running on port 3000');
        });
    }
    
    startServer();

    Now, if you send two identical POST requests to /api/payments with the same Idempotency-Key header in quick succession, you'll observe the following:

  • The first request will enter the controller, print "Executing payment creation logic...", and take 2 seconds to complete.
  • The second request, arriving while the first is still running, will hit the LOCKED condition in the Lua script and immediately receive an HTTP 429 Too Many Requests.
  • Any subsequent request after the first has completed will hit the COMPLETED condition and immediately receive the cached 201 Created response without executing the payment logic again.
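
To reproduce this locally, here is a small hedged test script (assumes Node 18+ for the global fetch, and the example server above listening on port 3000):

javascript
// Fire two concurrent requests with the SAME key, then a third afterwards.
const { randomUUID } = require('crypto');

async function demo() {
  const key = randomUUID();
  const send = () =>
    fetch('http://localhost:3000/api/payments', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', 'Idempotency-Key': key },
      body: JSON.stringify({ amount: 1000, currency: 'USD' }),
    }).then(async (r) => ({ status: r.status, body: await r.json() }));

  // Concurrent pair: expect one 201 (the winner) and one 429 (LOCKED).
  const [a, b] = await Promise.all([send(), send()]);
  console.log(a.status, b.status);

  // After completion: expect an instant, cached 201 (COMPLETED replay).
  const c = await send();
  console.log(c.status, c.body);
}

demo().catch(console.error);
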
Advanced Edge Cases and Production Considerations

    A robust system must account for more than the happy path.

    1. Orphaned `IN_PROGRESS` Keys

    Problem: What happens if your server process crashes after acquiring the lock (setting the key to IN_PROGRESS) but before completing the operation and updating the key?

    Solution: This is precisely why the IN_PROGRESS state has a short TTL (IN_PROGRESS_TTL_SECONDS). This TTL should be configured to be slightly longer than the maximum expected execution time of your operation. If a process dies, the lock key will automatically expire after this duration, allowing a subsequent retry from the client to acquire a new lock and proceed. This prevents permanent deadlocks. Choosing this TTL is a trade-off: too short, and a long-running valid process might lose its lock; too long, and the system takes longer to recover from a crash.
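
If operations legitimately run long, one common mitigation (not part of the middleware above, and sketched here only as an illustration) is a heartbeat that keeps extending the lock's TTL while the work is in flight, so the TTL itself can stay short:

javascript
// Periodically refresh the IN_PROGRESS TTL while the operation runs.
// A production version should verify lock ownership before refreshing
// (e.g., with a small Lua script, like the owner-checked release earlier).
function startLockHeartbeat(redisClient, idempotencyKey, ttlSeconds) {
  const intervalMs = (ttlSeconds * 1000) / 2; // refresh at half the TTL
  const timer = setInterval(() => {
    redisClient.expire(idempotencyKey, ttlSeconds).catch(() => {
      // A failed refresh is non-fatal; the TTL simply keeps counting down.
    });
  }, intervalMs);
  return () => clearInterval(timer); // call this when the operation ends
}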

    2. Large Response Payloads

    Problem: Storing large JSON response bodies directly in Redis can consume significant memory, especially if you have millions of keys. A 1MB response body for 1 million keys is 1TB of Redis memory.

    Solution: For endpoints that return large payloads, modify the pattern to store only a reference. Instead of placing the full responseBody in Redis, upload the response to a blob store like Amazon S3 or Google Cloud Storage with a UUID as its key. Then, store this UUID in the Redis value. When replaying a COMPLETED request, the middleware would first fetch the reference from Redis and then stream the payload from the blob store.

    json
    // Example of a reference-based value in Redis
    {
      "status": "COMPLETED",
      "statusCode": 200,
      "responseRef": {
        "type": "S3",
        "bucket": "my-app-responses",
        "key": "idempotency/a1b2c3d4-e5f6-...."
      }
    }

    3. Garbage Collection and Key Eviction

    Problem: Even with a 24-hour TTL, a high-traffic API can accumulate a massive number of idempotency keys in Redis, consuming memory.

    Solution: The COMPLETED_TTL_SECONDS is your primary garbage collection mechanism. Its value should be chosen based on your business requirements—typically, it should align with the client's retry window. For most systems, 24 hours is a safe default. If memory pressure is a concern, you can implement a more aggressive eviction policy in Redis (e.g., allkeys-lru), but be aware this could cause completed idempotency keys to be evicted early, potentially allowing a duplicate request if it arrives after eviction but before the 24-hour window closes.

    4. Performance Tuning and Scalability

    Overhead: The overhead of this pattern is one EVALSHA command at the beginning of the request and one SET or DEL command at the end. For most applications, this sub-millisecond overhead is negligible compared to the business logic itself.

Connection Reuse: Create a single, long-lived Redis client at application startup and reuse it for every request. ioredis multiplexes commands over one persistent connection, so this avoids the latency of establishing a new TCP connection per request.

    Redis Cluster: This pattern works seamlessly with a Redis Cluster setup. Since all operations are on a single key (idempotencyKey), Redis Cluster will correctly route the commands to the appropriate shard managing that key's hash slot. No cross-slot operations are performed in our scripts.

    Conclusion: Beyond a Simple Lock

    Implementing idempotency is not merely about setting a lock; it's about managing the entire lifecycle of a request in a distributed, concurrent environment. By leveraging the atomicity of Redis Lua scripts, we can build a stateful, fault-tolerant idempotency layer that is both performant and safe. This pattern moves beyond naive check-then-set logic, providing clear and immediate feedback for concurrent requests (LOCKED) and efficient resolution for completed ones (COMPLETED). It gracefully handles server failures via TTLs on in-progress locks and provides a clear strategy for handling business logic failures. For any senior engineer building mission-critical APIs, mastering this atomic, stateful approach is an essential tool for creating truly resilient and predictable systems.
