Atomic Idempotency Key Management in Distributed Systems using Redis & Lua

Goh Ling Yong

The Illusion of Idempotency with Simple Primitives

In any distributed system, the promise of exactly-once processing is a siren's call. Network partitions, client retries, and asynchronous message delivery conspire to turn single operations into duplicates. The standard defense is idempotency: designing operations so that receiving them multiple times has the same effect as receiving them once. A common first-pass implementation for this in a system using Redis is SETNX (SET if Not eXists), or its modern equivalent, the SET command with the NX option.

Consider a payment processing endpoint. A client sends a request with a unique Idempotency-Key header. The server logic looks like this:

javascript
// DO NOT USE IN PRODUCTION - FLAWED EXAMPLE
async function naiveProcessPayment(request) {
  // Node.js lower-cases incoming header names, so look the header up in lower case.
  const idempotencyKey = request.headers['idempotency-key'];
  if (!idempotencyKey) {
    throw new Error('Idempotency-Key header is required.');
  }

  const key = `idem:${idempotencyKey}`;
  const isNew = await redisClient.set(key, 'in_progress', 'NX', 'EX', 60);

  if (!isNew) {
    // This is where it gets tricky. Is it already done? Or just in progress?
    // This naive approach can't tell.
    return { status: 'conflict', message: 'Request already in progress or completed.' };
  }

  try {
    // --- CRITICAL SECTION ---
    const result = await paymentGateway.charge(request.body);
    // What if the server crashes right here? The key is 'in_progress' forever (or until TTL).

    await redisClient.set(key, JSON.stringify({ status: 'completed', data: result }), 'EX', 86400); // Keep result for 24h
    return { status: 'success', data: result };
  } catch (error) {
    // Clean up the key on failure so it can be retried.
    await redisClient.del(key);
    throw error;
  }
}

This seemingly plausible approach is riddled with critical flaws that manifest under real-world production load:

  • Indistinguishable States: If SETNX fails, we don't know why. Is another process actively working on it, or did a previous process complete it successfully? The client receives a generic conflict and doesn't know whether to retry or accept a previous success.
  • Process Failure Catastrophe: If the server process dies after the SETNX but before the final SET, the idempotency key is locked in the in_progress state until its 60-second TTL expires. For that minute, no other process can handle the request, effectively causing a temporary outage for that specific operation.
  • Race Conditions: The logic of checking a key, then acting on its value, is a classic race condition if not performed atomically. While SETNX is atomic, the subsequent logic to check for a completed result is not. A client might retry, find the key exists, and before it can fetch the result, the key expires and another process starts working on it.
To build a truly resilient system, we need an atomic, state-aware mechanism that can handle the full lifecycle of an idempotent request. This requires moving beyond simple Redis commands to the power of server-side Lua scripting.

    A State-Aware Idempotency Lifecycle

    We will model the idempotency key's lifecycle with three distinct states, stored as a JSON object within the Redis key:

  • IN_PROGRESS: A process has acquired the lock and is executing the business logic. The key should have a short TTL (e.g., 30-60 seconds) to act as a safety net against process death.
  • COMPLETED: The process finished successfully. The value will contain the final result of the operation. This key should have a long TTL (e.g., 24 hours) to serve cached responses to retrying clients.
  • FAILED: The process encountered a terminal error. This state prevents clients from endlessly retrying a doomed operation. This key might have a medium TTL (e.g., 1 hour).
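
    To manage transitions between these states without race conditions, we need to perform the check-and-set logic in a single, atomic operation. This is the perfect use case for Redis's EVAL command, which executes a Lua script on the server atomically: no other command runs while the script is executing. Unlike a database transaction, though, a script that errors partway through is not rolled back, so scripts should validate before they write.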
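
The three states and their TTLs can be captured in a small helper. This is a minimal sketch rather than part of the implementation that follows: the names `STATE_TTL_SECONDS` and `buildState` are hypothetical, and the TTLs are simply the example values from the list above.

```javascript
// Hypothetical helper: TTLs (in seconds) mirror the example values in the text.
const STATE_TTL_SECONDS = Object.freeze({
  IN_PROGRESS: 60,  // short safety net against process death
  COMPLETED: 86400, // 24 hours of cached responses
  FAILED: 3600,     // 1 hour
});

// Builds the JSON string stored under the idempotency key, plus its TTL.
function buildState(status, payload = {}) {
  const ttl = STATE_TTL_SECONDS[status];
  if (ttl === undefined) {
    throw new Error(`Unknown idempotency status: ${status}`);
  }
  return { value: JSON.stringify({ status, ...payload }), ttl };
}

const completed = buildState('COMPLETED', { data: { transactionId: 'txn_1' } });
console.log(completed.value); // {"status":"COMPLETED","data":{"transactionId":"txn_1"}}
console.log(completed.ttl);   // 86400
```

Keeping the status-to-TTL mapping in one place makes it harder to write a COMPLETED value with the short lock TTL by accident.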

    Atomicity with Lua: The `start_processing` Script

    Our first script will handle the initial locking and state checking. It will be the gatekeeper for our critical section.

    Lua Script: start_processing.lua

    lua
    -- KEYS[1]: The idempotency key (e.g., 'idem:user123:charge:req456')
    -- ARGV[1]: The TTL for the IN_PROGRESS lock (e.g., 60 seconds)
    -- ARGV[2]: The initial value for the key (e.g., '{"status":"IN_PROGRESS"}')
    
    local key = KEYS[1]
    local lock_ttl = ARGV[1]
    local in_progress_value = ARGV[2]
    
    -- Check if the key exists
    local existing_value = redis.call('GET', key)
    
    if existing_value then
      -- Key exists, return its current value so the client can decide what to do
      -- e.g., if status is 'COMPLETED', use the cached response.
      return existing_value
    else
      -- Key does not exist, this is the first time we've seen it.
      -- Atomically set it to IN_PROGRESS with a short TTL.
      redis.call('SET', key, in_progress_value, 'EX', lock_ttl)
      -- Return a special value to indicate we've acquired the lock
      return 'LOCK_ACQUIRED'
    end

    This script provides the atomicity we need. When executed:

  • If the key exists, it immediately returns the stored value (which will be a JSON string containing the state and potentially the result). The application layer can then inspect this.
  • If the key does not exist, it creates it, sets the IN_PROGRESS state, applies the short TTL, and returns a unique LOCK_ACQUIRED string.

    This completely eliminates the race condition of a separate GET and SET.
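
To make the script's behavior concrete, here is a minimal in-memory model of the same decision logic in JavaScript. It is an illustration only: `createFakeStore` and `startProcessing` are hypothetical names, the Map stands in for Redis, and the clock is injected so that TTL expiry can be shown without waiting.

```javascript
// In-memory stand-in for Redis, for illustration only.
// `now` is injected so TTL expiry can be exercised without real timers.
function createFakeStore(now = () => Date.now()) {
  const entries = new Map(); // key -> { value, expiresAt }
  return {
    get(key) {
      const entry = entries.get(key);
      if (!entry || entry.expiresAt <= now()) {
        entries.delete(key);
        return null;
      }
      return entry.value;
    },
    set(key, value, ttlSeconds) {
      entries.set(key, { value, expiresAt: now() + ttlSeconds * 1000 });
    },
  };
}

// Mirrors start_processing.lua: return the existing value, or claim the key.
function startProcessing(store, key, lockTtl, inProgressValue) {
  const existing = store.get(key);
  if (existing !== null) return existing;
  store.set(key, inProgressValue, lockTtl);
  return 'LOCK_ACQUIRED';
}

let clock = 0;
const store = createFakeStore(() => clock);
const inProgress = JSON.stringify({ status: 'IN_PROGRESS' });

console.log(startProcessing(store, 'idem:demo', 30, inProgress)); // LOCK_ACQUIRED
console.log(startProcessing(store, 'idem:demo', 30, inProgress)); // {"status":"IN_PROGRESS"}
clock += 31000; // the 30-second lock expires
console.log(startProcessing(store, 'idem:demo', 30, inProgress)); // LOCK_ACQUIRED
```

The model is single-threaded, so it cannot demonstrate atomicity itself; in Redis that guarantee comes from running the GET and SET inside one script.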

    Atomicity with Lua: The `complete_processing` Script

    Once the business logic is complete, we need another atomic operation to transition the key from IN_PROGRESS to COMPLETED or FAILED and set the final, longer TTL.

    Lua Script: complete_processing.lua

    lua
    -- KEYS[1]: The idempotency key
    -- ARGV[1]: The final value to set (e.g., '{"status":"COMPLETED", "data":{...}}')
    -- ARGV[2]: The final TTL for the completed key (e.g., 86400 seconds)
    
    local key = KEYS[1]
    local final_value = ARGV[1]
    local final_ttl = ARGV[2]
    
    -- We simply overwrite the key with the final result and a new TTL.
    -- This assumes the caller still holds the lock. If the IN_PROGRESS TTL expired and
    -- another worker acquired the key in the meantime, this write replaces that worker's state.
    -- An optional enhancement could be to pass a unique worker ID to ensure the completer is the original locker.
    redis.call('SET', key, final_value, 'EX', final_ttl)
    
    return 'OK'

    This script is simpler. It atomically updates the key with the final result and applies the long-term TTL. This ensures that a subsequent retry hitting the start_processing.lua script will receive the cached COMPLETED state.
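
The worker-ID enhancement mentioned in the script's comments could look roughly like the sketch below. It rests on assumptions that go beyond the scripts above: the IN_PROGRESS value is presumed to carry a hypothetical `owner` field written by the locking worker, and the Lua variant relies on the `cjson` library that Redis makes available to scripts. The Lua string is an untested illustration; the JavaScript function models the same check against a plain Map so the logic can be exercised without a Redis server.

```javascript
// Hypothetical guarded variant of complete_processing.lua.
// KEYS[1]: idempotency key; ARGV[1]: owner token; ARGV[2]: final value; ARGV[3]: final TTL.
const COMPLETE_IF_OWNER_LUA = `
local current = redis.call('GET', KEYS[1])
if not current then
  return 'LOCK_LOST'
end
local state = cjson.decode(current)
if state.status ~= 'IN_PROGRESS' or state.owner ~= ARGV[1] then
  return 'LOCK_LOST'
end
redis.call('SET', KEYS[1], ARGV[2], 'EX', ARGV[3])
return 'OK'
`;

// Plain-JavaScript model of the same check, using a Map in place of Redis.
function completeIfOwner(store, key, owner, finalValue) {
  const current = store.get(key);
  if (current === undefined) return 'LOCK_LOST'; // the lock expired
  const state = JSON.parse(current);
  if (state.status !== 'IN_PROGRESS' || state.owner !== owner) {
    return 'LOCK_LOST'; // another worker holds the key, or it is already final
  }
  store.set(key, finalValue);
  return 'OK';
}

const demoStore = new Map([
  ['idem:demo', JSON.stringify({ status: 'IN_PROGRESS', owner: 'worker-a' })],
]);
const done = JSON.stringify({ status: 'COMPLETED', data: { ok: true } });

console.log(completeIfOwner(demoStore, 'idem:demo', 'worker-b', done)); // LOCK_LOST
console.log(completeIfOwner(demoStore, 'idem:demo', 'worker-a', done)); // OK
console.log(completeIfOwner(demoStore, 'idem:demo', 'worker-a', done)); // LOCK_LOST
```

A caller that receives LOCK_LOST knows its result was not recorded and can log or alert instead of silently overwriting another worker's state.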

    Production Implementation with Node.js

    Let's integrate these scripts into a robust middleware for an Express.js application. This example uses the ioredis library.

    First, we'll create a module to manage our Redis connection and load the scripts. Loading scripts on application startup using SCRIPT LOAD is a crucial optimization. It returns a SHA1 hash of the script, which we can then call using the more efficient EVALSHA command, avoiding the need to send the full script body over the network for every request.

    redis-idempotency.js

    javascript
    const Redis = require('ioredis');
    const fs = require('fs');
    const path = require('path');
    
    class IdempotencyManager {
      constructor(redisOptions) {
        this.redis = new Redis(redisOptions);
        this.scriptShas = {};
      }
    
      async loadScripts() {
        try {
          const startScript = fs.readFileSync(path.join(__dirname, 'start_processing.lua'), 'utf8');
          const completeScript = fs.readFileSync(path.join(__dirname, 'complete_processing.lua'), 'utf8');
    
          this.scriptShas.start = await this.redis.script('load', startScript);
          this.scriptShas.complete = await this.redis.script('load', completeScript);
    
          console.log('Idempotency Lua scripts loaded successfully.');
          console.log(`- Start SHA: ${this.scriptShas.start}`);
          console.log(`- Complete SHA: ${this.scriptShas.complete}`);
        } catch (error) {
          console.error('Failed to load Lua scripts:', error);
          process.exit(1);
        }
      }
    
      async start(key, inProgressValue, lockTtl) {
        return this.redis.evalsha(this.scriptShas.start, 1, key, lockTtl, inProgressValue);
      }
    
      async complete(key, finalValue, finalTtl) {
        return this.redis.evalsha(this.scriptShas.complete, 1, key, finalValue, finalTtl);
      }
    }
    
    // Singleton instance
    const idempotencyManager = new IdempotencyManager({
        host: 'localhost',
        port: 6379
    });
    
    module.exports = idempotencyManager;
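
One caveat with calling EVALSHA directly: the Redis script cache is not persistent, so after a server restart or a SCRIPT FLUSH the stored SHA is unknown and EVALSHA fails with a NOSCRIPT error. A minimal sketch of a fallback is shown below. The helper name `evalWithFallback` is hypothetical, and `redis` is assumed to be any client exposing ioredis-style `evalsha` and `eval` methods.

```javascript
// Runs a script by SHA and falls back to EVAL if the server no longer has it cached.
// `redis` is any client exposing ioredis-style evalsha/eval methods.
async function evalWithFallback(redis, sha, script, numKeys, ...keysAndArgs) {
  try {
    return await redis.evalsha(sha, numKeys, ...keysAndArgs);
  } catch (error) {
    const message = error && typeof error.message === 'string' ? error.message : '';
    if (message.startsWith('NOSCRIPT')) {
      // EVAL also re-populates the script cache, so later EVALSHA calls succeed again.
      return redis.eval(script, numKeys, ...keysAndArgs);
    }
    throw error;
  }
}
```

To use it, the manager would need to keep the script bodies alongside their SHAs. ioredis also offers `defineCommand`, which is documented to handle the EVALSHA-then-EVAL fallback itself and may be the simpler choice.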

    Now, let's create the Express middleware that uses this manager.

    idempotency-middleware.js

    javascript
    const idempotencyManager = require('./redis-idempotency');
    
    const LOCK_TTL = 30; // 30 seconds for in-progress lock
    const RESULT_TTL = 86400; // 24 hours for final result
    
    function idempotencyMiddleware() {
      return async (req, res, next) => {
        const idempotencyKeyHeader = req.headers['idempotency-key'];
    
        if (!idempotencyKeyHeader) {
          // For simplicity, we proceed. In a strict system, you might reject.
          return next(); 
        }
    
        const key = `idem:${req.user.id}:${req.method}:${req.path}:${idempotencyKeyHeader}`;
    
        try {
          const inProgressValue = JSON.stringify({ status: 'IN_PROGRESS', timestamp: Date.now() });
          const result = await idempotencyManager.start(key, inProgressValue, LOCK_TTL);
    
          if (result === 'LOCK_ACQUIRED') {
            // We have the lock. Attach completion logic to the response stream.
            res.locals.idempotencyKey = key;
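            res.on('finish', async () => {
              // 'finish' fires once the response has been handed off for sending.
              // Non-2xx responses are not cached: the key stays IN_PROGRESS until LOCK_TTL expires.
              if (res.statusCode < 200 || res.statusCode >= 300) return;
              try {
                const finalValue = JSON.stringify({
                  status: 'COMPLETED',
                  statusCode: res.statusCode,
                  headers: res.getHeaders(),
                  body: res.locals.responseBody, // Captured by the res.send wrapper below
                });
                await idempotencyManager.complete(key, finalValue, RESULT_TTL);
              } catch (error) {
                // Nothing awaits this handler, so catch here to avoid an unhandled rejection.
                console.error(`Failed to store idempotent result for key ${key}:`, error);
              }
            });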
    
            // We need a way to capture the response body. This is a common pattern.
            const originalSend = res.send;
            res.send = function (body) {
              res.locals.responseBody = body;
              // Return the result so chained calls on res.send(...) keep working.
              return originalSend.call(this, body);
            };
            
            return next();
          } else {
            // The key already existed. The result is the stored value.
            const storedResult = JSON.parse(result);
    
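            if (storedResult.status === 'IN_PROGRESS') {
              // Another request is currently processing this. Return a conflict.
              return res.status(409).json({ message: 'A request with this idempotency key is already in progress.' });
            } else if (storedResult.status === 'COMPLETED') {
              // The request was already completed successfully. Return the cached response.
              console.log(`Returning cached response for key: ${key}`);
              res.set(storedResult.headers);
              return res.status(storedResult.statusCode).send(storedResult.body);
            }
            // FAILED or an unrecognized state: respond explicitly so the request does not hang.
            // The 422 status code is an illustrative choice, not a fixed convention.
            return res.status(422).json({ message: 'A previous request with this idempotency key did not complete successfully.' });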
          }
        } catch (error) {
          console.error(`Idempotency middleware error for key ${key}:`, error);
          // Let the standard error handler deal with it.
          return next(error);
        }
      };
    }
    
    module.exports = idempotencyMiddleware;

    Example Usage in an Express App

    javascript
    const express = require('express');
    const idempotencyManager = require('./redis-idempotency');
    const idempotencyMiddleware = require('./idempotency-middleware');
    
    const app = express();
    app.use(express.json());
    
    // Dummy user middleware
    app.use((req, res, next) => {
      req.user = { id: 'user-123' };
      next();
    });
    
    // Apply the idempotency middleware to critical routes
    app.post('/api/payments', idempotencyMiddleware(), async (req, res) => {
      try {
        console.log('Processing payment...');
        // Simulate a 2-second payment gateway call
        await new Promise(resolve => setTimeout(resolve, 2000));
        
        const response = { transactionId: `txn_${Date.now()}`, amount: req.body.amount, status: 'succeeded' };
        res.status(201).json(response);
      } catch (error) {
        res.status(500).json({ error: 'Payment processing failed' });
      }
    });
    
    async function startServer() {
      await idempotencyManager.loadScripts();
      app.listen(3000, () => {
        console.log('Server running on port 3000');
      });
    }
    
    startServer();

    To test this:

  • Start the server.
  • Send a POST request: curl -X POST http://localhost:3000/api/payments -H "Content-Type: application/json" -H "Idempotency-Key: unique-key-123" -d '{"amount": 100}'
  • You'll see "Processing payment..." in the logs, and after 2 seconds, you'll get a 201 Created response.
  • Once that response has arrived, send the exact same request again. You will receive the same 201 Created response straight from the Redis cache, and "Processing payment..." will not be logged a second time.
  • If you instead send two requests in quick succession, the second one will receive a 409 Conflict error while the first is still processing.

    Advanced Considerations and Edge Cases

    This pattern is robust, but in a high-stakes production environment, we must consider the sharp edges.

    1. Stale Lock Recovery

    Problem: A worker acquires a lock, starts processing, and then crashes. The IN_PROGRESS key remains in Redis until its 30-second TTL expires. What happens next?

    Solution: The next request with the same key will find the lock has expired and will acquire it. This is generally the desired behavior, but it has implications:

  • Non-atomic Operations: If the crashed worker was partway through a non-atomic operation (e.g., it called one API but not the second), the next worker might repeat that first call. Your underlying business logic must still be idempotent on its own or use transactional semantics where possible. The Redis layer prevents duplicate starts of the entire process, but cannot protect against partial execution failures.
  • Monitoring: Frequent lock expirations for IN_PROGRESS keys are a strong signal that your workers are crashing or are taking too long. Set up monitoring on your Redis keyspace (e.g., using keyspace notifications or scanning) to alert on this pattern.

    2. Large Response Payloads

    Problem: Our COMPLETED state stores the entire response body in Redis. If your API returns megabytes of data, this can quickly bloat your Redis memory.

    Solution: For large payloads, use a hybrid approach. Store a pointer to the data instead of the data itself.

    • Upon successful completion, save the full response body to a more suitable storage layer like Amazon S3 or a blob store.
    • In the idempotency key in Redis, store the pointer (e.g., the S3 object key) instead of the full body.
    json
    // Example value for a large payload
    {
      "status": "COMPLETED",
      "statusCode": 200,
      "headers": { ... },
      "body_ref": "s3://my-app-responses/idem-key-123.json"
    }

    When a retrying client hits a cached result, the application layer is responsible for fetching the body from the reference location before serving the response. This adds latency but keeps Redis lean and fast.
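
The hybrid approach might be sketched as follows. Everything named here is an assumption for illustration: the 64 KB threshold, the `blob://` reference format, and the injected `putBlob` and `getBlob` functions, which in practice would wrap calls to S3 or another blob store. The response body is assumed to be a string.

```javascript
// Hypothetical threshold: bodies above this size are stored outside Redis.
const MAX_INLINE_BYTES = 64 * 1024;

// Builds the COMPLETED value. `putBlob(ref, body)` is an injected async writer.
async function toStoredValue(key, response, putBlob, maxInlineBytes = MAX_INLINE_BYTES) {
  const { statusCode, headers, body } = response;
  const base = { status: 'COMPLETED', statusCode, headers };
  if (Buffer.byteLength(body, 'utf8') <= maxInlineBytes) {
    return JSON.stringify({ ...base, body });
  }
  // Illustrative reference format; a real system would use its own object keys.
  const bodyRef = `blob://idempotency/${encodeURIComponent(key)}.json`;
  await putBlob(bodyRef, body);
  return JSON.stringify({ ...base, body_ref: bodyRef });
}

// Returns the cached body, fetching it via `getBlob(ref)` when only a pointer is stored.
async function resolveBody(storedValue, getBlob) {
  const stored = JSON.parse(storedValue);
  return stored.body_ref ? getBlob(stored.body_ref) : stored.body;
}
```

Writing the blob before the Redis value means a retrying client never sees a pointer to data that has not been stored yet. The blob store also needs an expiry policy at least as long as the COMPLETED TTL.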

    3. Key Naming and Garbage Collection

    Key Schema: A predictable and conflict-free key schema is vital. The example idem:${req.user.id}:${req.method}:${req.path}:${idempotencyKeyHeader} is a good start. It scopes the key to the user, the specific endpoint, and the client-provided key, preventing collisions.
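
A small builder can make the schema harder to get wrong. The sketch below is a variation on the example schema, not the middleware's actual code: `buildIdempotencyKey` is a hypothetical name, and percent-encoding each part is an added assumption so that a ':' inside a path or client key cannot be mistaken for the separator.

```javascript
// Builds 'idem:<user>:<method>:<path>:<clientKey>' with each part percent-encoded.
function buildIdempotencyKey({ userId, method, path, clientKey }) {
  const parts = [userId, method, path, clientKey];
  if (parts.some((part) => typeof part !== 'string' || part.length === 0)) {
    throw new Error('All idempotency key parts must be non-empty strings.');
  }
  // Encoding each part keeps ':' unambiguous as the separator.
  return ['idem', ...parts.map((part) => encodeURIComponent(part))].join(':');
}

console.log(buildIdempotencyKey({
  userId: 'user-123',
  method: 'POST',
  path: '/api/payments',
  clientKey: 'unique-key-123',
}));
// idem:user-123:POST:%2Fapi%2Fpayments:unique-key-123
```

Rejecting empty parts up front also surfaces a missing `req.user` as a clear error instead of a key such as `idem:undefined:...`.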

    TTL Management: TTLs are your garbage collectors. Be deliberate:

  • IN_PROGRESS TTL: Should be comfortably longer than the P99 latency of the operation. If your endpoint typically takes 5 seconds, a 30-second TTL is a safe buffer. Too short, and you risk premature lock expiration on a slow but valid request, which lets a second worker start the same operation. Too long, and a crashed worker causes a longer outage for that operation.
  • COMPLETED TTL: This depends on your business requirements. How long should a client be able to retry and get a cached result? 24 hours is a common choice; Stripe's API, for example, keeps idempotency keys for at least 24 hours.

    4. Performance: Lua vs. Alternatives

    Why Lua is often superior:

  • Atomicity: Guarantees that no other command can run between the GET and SET operations within the script.
  • Reduced Network Round-Trips: A single EVALSHA call replaces multiple commands (GET, SET), reducing latency, especially in geo-distributed environments.
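  • Server-Side Execution: The logic runs inside the Redis server, next to the data, so no intermediate values cross the network. Because Redis blocks other commands while a script executes, scripts should stay short, as the two above do.

    Alternative: Optimistic Locking with WATCH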

    One could also implement this using Redis's WATCH, MULTI, and EXEC commands. The client would WATCH the key, GET its value, and then start a MULTI transaction to SET the new value. If another client modified the key after the WATCH, the EXEC would fail, and the client would have to retry the entire process.

    While this works, it's more complex on the client side (requires a retry loop) and often results in more network traffic. For this specific state management pattern, a server-side Lua script is cleaner, more performant, and less error-prone.
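
For comparison, the WATCH-based equivalent of the start script could be sketched like this. It assumes an ioredis-style client whose `multi().exec()` resolves to `null` when the watched key changed; `startWithWatch` and the `maxAttempts` parameter are hypothetical. Since WATCH state belongs to a connection, the sketch also assumes `redis` is a connection dedicated to this call rather than one shared across concurrent requests.

```javascript
// Optimistic-locking equivalent of start_processing.lua.
// `redis` is assumed to be a dedicated ioredis-style connection, because WATCH
// state is per connection and must not be shared between concurrent requests.
async function startWithWatch(redis, key, inProgressValue, lockTtl, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    await redis.watch(key);
    const existing = await redis.get(key);
    if (existing !== null) {
      await redis.unwatch();
      return existing; // Someone else already owns or finished this key.
    }
    const replies = await redis
      .multi()
      .set(key, inProgressValue, 'EX', lockTtl)
      .exec();
    if (replies !== null) {
      return 'LOCK_ACQUIRED';
    }
    // exec() resolved to null: the key changed after WATCH, so try again.
  }
  throw new Error(`Could not start processing ${key} after ${maxAttempts} attempts`);
}
```

Even in this compact form, each attempt costs at least three round trips (WATCH, GET, and the MULTI/EXEC batch) against one for EVALSHA, and the retry loop lives in application code.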

    Conclusion

    Building resilient distributed systems requires moving beyond simplistic patterns and embracing techniques that provide strong guarantees. By leveraging Redis's server-side Lua scripting, we can construct an atomic, state-aware idempotency layer that is both performant and fault-tolerant. This approach elevates idempotency from a simple de-duplication mechanism to a core part of your system's reliability, enabling safe client retries, preventing duplicate processing, and providing a transparent caching layer for repeated requests.

    This pattern is not a silver bullet: it cannot magically make non-idempotent business logic safe. It does, however, provide a robust and essential guardrail. While the lock is held, only one worker initiates your business logic for a given idempotency key, and completed results are replayed rather than recomputed, which is a foundational requirement for building predictable and trustworthy services at scale.
