Idempotency Key Management in Distributed Systems with Redis & Lua

Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The High Cost of Non-Idempotent Operations

In distributed systems, the guarantee of exactly-once message delivery is a fallacy. Network partitions, client-side retries, and reverse proxy timeouts conspire to create scenarios where a single user action results in multiple identical API requests. For read operations, this is a minor nuisance. For state-changing operations—creating a payment, provisioning a server, or submitting an order—it's a catastrophic failure. A simple POST /api/payments request, if duplicated, can lead to a double charge, eroding user trust and creating significant operational overhead.

The standard solution is to enforce idempotency at the API layer. The client generates a unique key (the Idempotency-Key) for each distinct operation and includes it in the request header. The server then uses this key to recognize and de-duplicate subsequent retries of the same operation.

However, a naive implementation of this check is fraught with peril. A simple CHECK-THEN-ACT sequence—checking if a key exists in a database and then processing the request—is a classic race condition waiting to happen. Two concurrent requests can both pass the initial check before either has a chance to record its completion, leading to the very duplication we sought to prevent.

This article details a production-grade, high-performance pattern for implementing an idempotency layer using Redis. We will leverage Redis's speed and, critically, its support for server-side Lua scripting to achieve the atomicity required to build a truly robust system.

Core Architecture: The Idempotency Key Lifecycle

Our system will model the idempotency key through a simple state machine with three primary states:

  • PENDING: A request with this key has been received and is currently being processed. No other request with the same key should be processed concurrently.
  • COMPLETED: A request with this key has been successfully processed. The final result (HTTP status, headers, and body) is cached. Any subsequent request with the same key should immediately return the cached response without re-executing the business logic.
  • FAILED: The business logic for a request with this key failed with a non-transient error. This state can be used to prevent retries on operations that are guaranteed to fail.

We will implement this logic as middleware in a Node.js/Express application, but the principles apply directly to any language or framework.

    The Flawed Approach: Simple GET/SET in Redis

    Before diving into the correct solution, let's illustrate why a simple approach fails. Consider this pseudo-code for an Express middleware:

    javascript
    // DO NOT USE THIS - FLAWED EXAMPLE
    app.post('/api/payments', async (req, res, next) => {
      const idempotencyKey = req.headers['idempotency-key'];
      if (!idempotencyKey) return next();
    
      const cachedResponse = await redisClient.get(idempotencyKey);
      if (cachedResponse) {
        const { statusCode, body } = JSON.parse(cachedResponse);
        return res.status(statusCode).send(body);
      }
    
      // Race condition window is here!
      await redisClient.set(idempotencyKey, 'pending', 'EX', 10); // Mark as pending
    
      // ... proceed to business logic ...
    });

    Imagine two identical requests, Req A and Req B, arriving milliseconds apart:

  • Req A executes redisClient.get(key). It returns null.
  • Req B executes redisClient.get(key). It also returns null.
  • Req A passes the check. While it awaits its next Redis call, the event loop (or another server instance) runs Req B.
  • Req B passes the check as well.
  • Req A executes redisClient.set(key, 'pending', ...).
  • Req B executes redisClient.set(key, 'pending', ...), silently overwriting Req A's value.

    Both requests now believe they have exclusive access and proceed to execute the payment logic, resulting in a double charge.

    This is unacceptable. The check for the key's existence and the creation of the lock (setting the pending state) must be an atomic operation. This is where Lua scripting in Redis becomes indispensable.
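
The interleaving above can be reproduced without Redis. The following is a minimal sketch, assuming a hypothetical in-memory `FakeStore` whose async methods stand in for Redis round trips. `FakeStore`, `checkThenAct`, `atomicAcquire`, and `countExecutions` are illustrative names, not part of any library.

```javascript
// Hypothetical in-memory stand-in for Redis. Each method awaits once to
// mimic a network round trip, which is where handlers can interleave.
class FakeStore {
  constructor() {
    this.data = new Map();
  }
  async get(key) {
    await null; // yield to other pending handlers
    return this.data.has(key) ? this.data.get(key) : null;
  }
  async set(key, value) {
    await null;
    this.data.set(key, value);
  }
  // Models SET key value NX: the check and the write happen in one step.
  async setIfAbsent(key, value) {
    await null;
    if (this.data.has(key)) return false;
    this.data.set(key, value);
    return true;
  }
}

// Flawed: separate GET and SET leave a window between check and write.
async function checkThenAct(store, key, runBusinessLogic) {
  const existing = await store.get(key);
  if (existing !== null) return 'duplicate';
  await store.set(key, 'pending');
  runBusinessLogic();
  return 'processed';
}

// Safe: a single atomic operation decides who owns the key.
async function atomicAcquire(store, key, runBusinessLogic) {
  const acquired = await store.setIfAbsent(key, 'pending');
  if (!acquired) return 'duplicate';
  runBusinessLogic();
  return 'processed';
}

// Fire two identical "requests" concurrently and count how many of them
// end up running the business logic.
async function countExecutions(handler) {
  const store = new FakeStore();
  let executions = 0;
  await Promise.all([
    handler(store, 'key-1', () => { executions += 1; }),
    handler(store, 'key-1', () => { executions += 1; }),
  ]);
  return executions;
}
```

Under these assumptions, `countExecutions(checkThenAct)` resolves to 2 (both requests run the business logic), while `countExecutions(atomicAcquire)` resolves to 1.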

    The Atomic Lock: An Idempotency-Acquire Lua Script

    Redis allows executing Lua scripts on the server. The entire script is guaranteed to execute atomically without interruption. We can use this to build a safe CHECK-AND-SET operation.

    Our first script, idempotency-acquire.lua, will handle the initial locking phase. It will attempt to create the key only if it doesn't already exist.

    idempotency-acquire.lua

    lua
    -- KEYS[1]: The idempotency key
    -- ARGV[1]: The initial value (e.g., a JSON string with status 'pending')
    -- ARGV[2]: The lock expiration time in seconds (a short TTL)
    
    -- redis.call('set', key, value, 'NX', 'EX', ttl) attempts to set the key
    -- only if it does not already exist ('NX') with an expiration ('EX').
    -- It returns 'OK' on success, or nil on failure (if the key already exists).
    local result = redis.call('set', KEYS[1], ARGV[1], 'NX', 'EX', ARGV[2])
    
    if result == 'OK' then
      -- Successfully acquired the lock. Return 1 to indicate success.
      return 1
    else
      -- The key already exists. Return the existing value.
      return redis.call('get', KEYS[1])
    end

    This script is the cornerstone of our solution.

  • SET ... NX is the atomic "set if not exist" command.
  • If the SET succeeds, we've acquired the lock. The script returns 1.
  • If the SET fails (returns nil), it means a key already exists. We then GET the existing value (which contains the current status) and return it to the application so it can decide how to proceed.

    Building the Production-Grade Middleware

    Let's integrate this into a robust Express.js middleware. We'll need a Redis client that can execute our Lua scripts. We'll use the ioredis library for its excellent support for this.

    1. Setup and Script Loading

    First, we configure our Redis client and register our Lua scripts. Registering scripts generates a SHA1 hash, allowing us to call them efficiently with EVALSHA instead of sending the full script body on every request.

    javascript
    // redisClient.js
    import Redis from 'ioredis';
    import fs from 'fs';
    import path from 'path';
    
    const redisClient = new Redis({
      // Your Redis connection options
    });
    
    // Define a command for our acquire script
    redisClient.defineCommand('acquireIdempotencyLock', {
      numberOfKeys: 1,
      lua: fs.readFileSync(path.join(__dirname, 'lua/idempotency-acquire.lua'), 'utf8'),
    });
    
    // Define a command for our release/save script (we'll create this later)
    redisClient.defineCommand('releaseIdempotencyLock', {
      numberOfKeys: 1,
      lua: fs.readFileSync(path.join(__dirname, 'lua/idempotency-release.lua'), 'utf8'),
    });
    
    export default redisClient;

    2. The Idempotency Middleware Logic

    Our middleware will orchestrate the entire flow.

    javascript
    // idempotencyMiddleware.js
    import redisClient from './redisClient';
    
    const LOCK_TTL_SECONDS = 30; // Max expected processing time
    const CACHE_TTL_SECONDS = 24 * 60 * 60; // 24 hours
    
    export const idempotencyMiddleware = async (req, res, next) => {
      const idempotencyKey = req.headers['idempotency-key'];
    
      // 1. Bypass if no key is provided (for non-idempotent endpoints)
      if (!idempotencyKey) {
        return next();
      }
    
      // 2. Attempt to acquire the lock
      try {
        const pendingState = JSON.stringify({ status: 'pending' });
        const lockResult = await redisClient.acquireIdempotencyLock(
          idempotencyKey,
          pendingState,
          LOCK_TTL_SECONDS
        );
    
        if (lockResult === 1) {
          // 3. Lock acquired successfully. Proceed to business logic.
          // We will monkey-patch the response object to capture the result.
          await handleNewRequest(req, res, next, idempotencyKey);
        } else {
          // 4. Lock was not acquired. Key already exists.
          handleDuplicateRequest(res, lockResult);
        }
      } catch (error) {
        console.error('Idempotency middleware error:', error);
        // Fail open or closed? Here we fail open, letting the request through.
        // In a critical financial system, you might fail closed with a 500.
        next();
      }
    };
    
    function handleDuplicateRequest(res, rawCachedValue) {
      try {
        const cachedValue = JSON.parse(rawCachedValue);
    
        if (cachedValue.status === 'pending') {
          // Another request is currently processing. Reject this one.
          res.status(409).json({ error: 'A request with this key is already in progress.' });
        } else if (cachedValue.status === 'completed') {
          // Request was already completed. Return the cached response.
          const { statusCode, headers, body } = cachedValue.response;
          res.set(headers).status(statusCode).send(body);
        } else {
          // Handle other states like 'failed' if implemented
          res.status(500).json({ error: 'An unexpected idempotency state was encountered.' });
        }
      } catch (e) {
        // The cached value is malformed. This is an exceptional case.
        res.status(500).json({ error: 'Failed to parse cached idempotency data.' });
      }
    }
    
    // ... implementation of handleNewRequest to follow ...

    In handleDuplicateRequest, we see the power of our state machine:

  • If the existing state is pending, we return a 409 Conflict. This signals to the client that it should wait and retry, as the original request may still be processing.
  • If the state is completed, we have a cached response. We deserialize it and send it back immediately. The business logic is never touched.

    3. Capturing the Response and Releasing the Lock

    When we successfully acquire a lock, we need to execute the business logic and then atomically update the idempotency key with the final result. A robust way to do this is to intercept the response before it's sent to the client.

    idempotency-release.lua

    lua
    -- KEYS[1]: The idempotency key
    -- ARGV[1]: The final value to store (e.g., a JSON string with status 'completed' and the response)
    -- ARGV[2]: The final cache expiration time in seconds (a long TTL)
    
    -- We use 'SET' here to overwrite the 'pending' state.
    -- The 'KEEPTTL' option is not used because we want to set a new, longer TTL.
    redis.call('set', KEYS[1], ARGV[1], 'EX', ARGV[2])
    return 1

    This script is simpler: it just overwrites the key with the final result and sets the long-term TTL.

    Now, let's implement handleNewRequest.

    javascript
    // ... continuation of idempotencyMiddleware.js
    
    async function handleNewRequest(req, res, next, idempotencyKey) {
      const originalSend = res.send;
      let capturedBody = '';
    
      // Override res.send to capture the response body. Express's res.json()
      // serializes the payload and then calls res.send(), so patching res.json
      // as well would record the body twice.
      res.send = function (chunk) {
        // res.send(object) re-enters res.send with the serialized string, so
        // only strings and Buffers are captured here.
        if (typeof chunk === 'string') {
          capturedBody = chunk;
        } else if (Buffer.isBuffer(chunk)) {
          capturedBody = chunk.toString('utf8');
        }
        return originalSend.apply(res, arguments);
      };
    
      // 'finish' fires once the response has been handed off for sending,
      // so nothing in this handler can change what the client receives.
      res.on('finish', async () => {
        try {
          // Only cache successful responses (2xx status codes)
          if (res.statusCode >= 200 && res.statusCode < 300) {
            const responseBody = capturedBody;
            const finalState = {
              status: 'completed',
              response: {
                statusCode: res.statusCode,
                headers: res.getHeaders(),
                body: responseBody,
              },
            };
    
            await redisClient.releaseIdempotencyLock(
              idempotencyKey,
              JSON.stringify(finalState),
              CACHE_TTL_SECONDS
            );
          } else {
            // For failed requests, we remove the key to allow a clean retry.
            // A more advanced implementation might store a 'failed' state.
            await redisClient.del(idempotencyKey);
          }
        } catch (error) {
          console.error('Failed to save idempotency result:', error);
          // The response has already been sent, so we can only log.
        }
      });
    
      // Finally, call the actual business logic
      next();
    }

    This implementation wraps the response's send path to capture the body and uses the response's finish event to persist the outcome. When the response is finished:

  • If it was successful (2xx), we construct a finalState object containing the status, headers, and body, serialize it, and use our releaseIdempotencyLock script to save it to Redis with a long TTL.
  • If it failed, we simply delete the key. This allows the client to attempt a completely new request with the same key. This is a crucial design decision: for transient server errors (5xx), this allows for a successful retry. For permanent client errors (4xx), the client would need to fix the request and generate a new key anyway. Note that deleting the key is only safe when a non-2xx response means the side effect did not happen; if a handler can return a 5xx after the charge has already gone through, a retry with the same key will repeat it.
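
The comment in the code above notes that a more advanced implementation might store a 'failed' state. One possible shape for that decision is sketched below as a pure function. It is not the article's implementation: the name `decideFinalAction`, the `FAILED_TTL_SECONDS` value, and the choice to treat 408 and 429 as retryable are all assumptions made for illustration.

```javascript
const CACHE_TTL_SECONDS = 24 * 60 * 60; // same long TTL used for completed responses
const FAILED_TTL_SECONDS = 60 * 60;     // assumed value; tune per endpoint

// Assumption: these client errors are transient, so the key is freed for a retry.
const RETRYABLE_CLIENT_ERRORS = new Set([408, 429]);

// Decide what to write to Redis once the response has finished.
// Returns { action: 'store', value, ttlSeconds } or { action: 'delete' }.
function decideFinalAction(statusCode, response) {
  if (statusCode >= 200 && statusCode < 300) {
    return {
      action: 'store',
      ttlSeconds: CACHE_TTL_SECONDS,
      value: JSON.stringify({ status: 'completed', response }),
    };
  }

  const isClientError = statusCode >= 400 && statusCode < 500;
  if (isClientError && !RETRYABLE_CLIENT_ERRORS.has(statusCode)) {
    // Treated as a non-transient failure: remember it so the same key
    // cannot trigger the business logic again.
    return {
      action: 'store',
      ttlSeconds: FAILED_TTL_SECONDS,
      value: JSON.stringify({ status: 'failed', response }),
    };
  }

  // Transient failure (5xx, 408, 429): free the key for a clean retry.
  return { action: 'delete' };
}
```

With something like this in place, the `else` branch of handleDuplicateRequest could replay the stored response for a 'failed' key instead of returning a generic 500.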

    Advanced Edge Cases and Production Considerations

    A robust system must account for failure modes.

    Edge Case 1: Server Crash During Processing

    What happens if the server acquires a lock but crashes before it can complete the request and release the lock?

    This is why the LOCK_TTL_SECONDS on the pending state is critical. It acts as a dead-man's switch. If the server crashes, the pending key will expire after LOCK_TTL_SECONDS (e.g., 30 seconds). Once expired, a new request from the client with the same idempotency key will be able to acquire a new lock and safely retry the operation.

    Choosing the LOCK_TTL: This value should be set to slightly longer than the maximum expected processing time for the endpoint. If your P99 latency for a payment endpoint is 5 seconds, a TTL of 15-30 seconds is a safe bet. Too short, and you risk premature lock expiration on a slow but valid request. Too long, and you increase the time a client must wait to retry after a server crash.
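
If the TTL is too short, a slow request can outlive its own lock, and its late write or delete then lands on a key that a newer request now owns. A common mitigation is to store an owner token in the pending value and have the release step compare it first. The sketch below models that idea in memory with an explicit clock. `FakeLockStore`, `releaseIfOwner`, and the token values are hypothetical, and in real Redis the comparison would have to happen inside a Lua script to stay atomic. It does not prevent the duplicate execution caused by the expiry itself. It only stops the stale request from overwriting the newer owner's state.

```javascript
// Hypothetical in-memory model of a single Redis key with a TTL. `now` is
// passed in explicitly (in seconds) so the timeline is easy to follow.
class FakeLockStore {
  constructor() {
    this.entry = null; // { value, expiresAt }
  }

  read(now) {
    if (this.entry && this.entry.expiresAt <= now) this.entry = null; // expired
    return this.entry ? this.entry.value : null;
  }

  // Models SET key value NX EX ttl.
  acquire(value, ttlSeconds, now) {
    if (this.read(now) !== null) return false;
    this.entry = { value, expiresAt: now + ttlSeconds };
    return true;
  }

  // Models the unconditional release: always overwrites the key.
  release(value, ttlSeconds, now) {
    this.entry = { value, expiresAt: now + ttlSeconds };
    return true;
  }

  // Models a release that only writes if the stored token matches.
  releaseIfOwner(token, value, ttlSeconds, now) {
    const current = this.read(now);
    if (current === null || JSON.parse(current).token !== token) return false;
    this.entry = { value, expiresAt: now + ttlSeconds };
    return true;
  }
}

const pending = (token) => JSON.stringify({ status: 'pending', token });
const completed = (token) => JSON.stringify({ status: 'completed', token });

function runScenario(useOwnerCheck) {
  const store = new FakeLockStore();
  store.acquire(pending('req-A'), 30, 0);  // t=0: request A takes the lock
  store.acquire(pending('req-B'), 30, 31); // t=31: A's lock has expired, B takes it
  // t=35: slow request A finally finishes and tries to save its result
  const saved = useOwnerCheck
    ? store.releaseIfOwner('req-A', completed('req-A'), 86400, 35)
    : store.release(completed('req-A'), 86400, 35);
  return { saved, stored: JSON.parse(store.read(35)) };
}
```

In this model, `runScenario(false)` leaves request A's result stored over request B's pending marker, whereas `runScenario(true)` rejects the stale write and keeps B's pending state intact.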

    Edge Case 2: Client Timeout and Retry

    Consider this sequence:

  • Client sends Req A with key K1.
  • Server acquires lock for K1 and starts processing.
  • Processing takes a long time. The client's HTTP library times out.
  • Client, assuming failure, retries with Req B using the same key K1.
  • Server receives Req B. It checks for key K1 and finds it in a pending state.
  • Server correctly returns 409 Conflict to Req B.
  • Meanwhile, the original processing for Req A finishes successfully. The result is cached.
  • The client, upon receiving the 409, waits (e.g., exponential backoff) and sends Req C with key K1.
  • Server receives Req C. It checks for key K1 and finds the completed state with the cached response.
  • Server immediately returns the cached successful response to the client.

    This demonstrates the system's resilience. The client eventually gets the correct successful response without causing a duplicate operation.
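
The client side of this sequence can be sketched as a small retry loop. This is an illustration under assumptions, not a prescribed client: `sendWithIdempotentRetry` and its options are hypothetical names, and `sendRequest` is a caller-supplied function that performs the HTTP call with the Idempotency-Key header and resolves to an object with a `status` property.

```javascript
// Resolves after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retries one logical operation, reusing the SAME idempotency key every time.
async function sendWithIdempotentRetry(sendRequest, idempotencyKey, options = {}) {
  const { maxAttempts = 5, baseDelayMs = 200 } = options;

  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    let response = null;
    try {
      response = await sendRequest(idempotencyKey);
    } catch (error) {
      // Network error or client-side timeout: the outcome is unknown,
      // so fall through and retry with the same key.
    }

    // Anything other than 409 (still in progress) or 5xx is a final answer.
    if (response && response.status !== 409 && response.status < 500) {
      return response;
    }

    if (attempt < maxAttempts) {
      await sleep(baseDelayMs * 2 ** (attempt - 1)); // exponential backoff
    }
  }

  throw new Error(`No final response after ${maxAttempts} attempts`);
}
```

The key itself would be generated once per user action, before the first attempt, for example with `crypto.randomUUID()` from Node's crypto module, and never regenerated between retries.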

    Performance and Scalability

  • Redis RTT: Each idempotent request involves at least one round trip to Redis. The acquire script combines the conditional SET (with NX) and the fallback GET into a single round trip, minimizing latency. ioredis invokes commands registered through defineCommand with EVALSHA, which avoids re-transmitting the script body on every call.
  • Redis as a Bottleneck: For extremely high-throughput systems, a single Redis instance can become a bottleneck. This pattern is compatible with standard Redis scaling strategies like Redis Cluster: each script touches exactly one key, and the cluster maps that key to a hash slot, so all operations for a given idempotency key land on the same node.
  • Payload Size: We are storing the entire response body in Redis. For APIs that return very large payloads, this can consume significant memory. Consider whether the full body needs to be cached. In some cases, caching just the status code and a resource identifier (like a payment_id) is sufficient. The client can then use this ID to fetch the full resource via a separate GET request.
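
One way to act on the payload-size concern is to cap what gets cached. The sketch below keeps small bodies as they are and swaps large ones for a short reference. `buildCachedResponse`, `MAX_CACHED_BODY_BYTES`, and the `truncated` field are hypothetical, and the approach assumes clients can tolerate a replayed response that differs from the original.

```javascript
const MAX_CACHED_BODY_BYTES = 16 * 1024; // assumed threshold; tune per API

// Build the value stored under the idempotency key. Small bodies are cached
// in full; large ones are replaced by a reference the client can follow up
// on with a separate GET request.
function buildCachedResponse(statusCode, headers, body, resourceId) {
  const bodyBytes = Buffer.byteLength(body, 'utf8');

  if (bodyBytes <= MAX_CACHED_BODY_BYTES) {
    return { status: 'completed', response: { statusCode, headers, body } };
  }

  return {
    status: 'completed',
    response: {
      statusCode,
      // The original headers described the full body, so they are not reused.
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify({ paymentId: resourceId, truncated: true }),
    },
  };
}
```

The returned object keeps the same `{ status, response }` shape used elsewhere in this article, so the replay path in handleDuplicateRequest would not need to change.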

    Complete Example: Integration with an Express Route

    Here is how you would use the middleware with an Express application.

    javascript
    // server.js
    import express from 'express';
    import { v4 as uuidv4 } from 'uuid';
    import { idempotencyMiddleware } from './idempotencyMiddleware';
    
    const app = express();
    app.use(express.json());
    
    // Apply the middleware to specific, state-changing routes
    app.post('/api/payments', idempotencyMiddleware, async (req, res) => {
      console.log(`[${new Date().toISOString()}] Processing payment for key: ${req.headers['idempotency-key']}`)
    
      // Simulate a slow, complex business logic operation
      try {
        await new Promise(resolve => setTimeout(resolve, 2000));
    
        // In a real app, you would interact with a payment gateway, database, etc.
        const paymentId = `payment_${uuidv4()}`;
        console.log(`[${new Date().toISOString()}] Payment processed: ${paymentId}`)
    
        res.status(201).json({ 
          status: 'success',
          message: 'Payment processed successfully.',
          paymentId: paymentId
        });
      } catch (error) {
        console.error('Payment processing failed:', error);
        res.status(500).json({ error: 'Internal Server Error' });
      }
    });
    
    app.listen(3000, () => {
      console.log('Server running on port 3000');
    });

    To test this:

  • First Request:

    bash
        curl -X POST http://localhost:3000/api/payments \
          -H "Content-Type: application/json" \
          -H "Idempotency-Key: a-unique-uuid-v4-key-1" \
          -d '{"amount": 100, "currency": "USD"}'

    This will take ~2 seconds and return a 201 Created with a new paymentId.

  • Second Request (after the first has completed, within 24 hours):

    bash
        curl -X POST http://localhost:3000/api/payments \
          -H "Content-Type: application/json" \
          -H "Idempotency-Key: a-unique-uuid-v4-key-1" \
          -d '{"amount": 100, "currency": "USD"}'

    This will return instantly with the exact same 201 Created response and the same paymentId as the first request. The server logs will show no "Processing payment..." message, proving the business logic was skipped.

  • Concurrent Requests: If you were to fire two identical requests simultaneously, one would proceed, and the other would receive a 409 Conflict.

    Conclusion

    Implementing a correct idempotency layer is a non-negotiable requirement for building reliable distributed systems that perform critical, state-changing operations. While the concept is simple, the implementation is riddled with subtle pitfalls, primarily race conditions.

    By leveraging the atomic, server-side execution of Lua scripts in Redis, we can bypass these pitfalls entirely. The pattern detailed here—using an atomic acquire script, managing a pending/completed lifecycle, and capturing the final response for caching—provides a robust, performant, and scalable blueprint. It transforms dangerous, non-idempotent operations into safe, repeatable transactions, providing the stability and predictability that both developers and users expect from a modern API.
