Idempotency-Key Middleware: A Redis-Backed State Machine

Goh Ling Yong

The Inescapable Problem of Double Execution in Distributed Systems

As senior engineers, we've all encountered the scenario: a client performs a critical action—like processing a payment or creating an order—and a network glitch causes a timeout. The client, unsure if the request succeeded, retries. If your API endpoint isn't idempotent, you risk charging a customer twice or creating a duplicate order. This isn't a theoretical problem; it's a catastrophic failure mode for any system handling stateful operations.

The common solution is the Idempotency-Key header, a client-generated unique identifier for a request. While the concept is simple, a production-grade implementation has several pitfalls. Naive approaches that check for the key and then write it in a separate step are susceptible to race conditions (two concurrent requests can both pass the check before either one writes) and to performance bottlenecks. A robust solution requires treating each idempotent request not as a simple check, but as a managed lifecycle: a state machine.

This post details the architecture and implementation of a highly reliable idempotency middleware using a Redis-backed state machine. We will bypass introductory concepts and dive straight into the advanced patterns required to handle concurrency, atomicity, and failure recovery in a high-throughput environment.

Architecture: A Request Lifecycle State Machine

To prevent race conditions and correctly handle retries, we model the lifecycle of an idempotent request with a simple but powerful state machine. Each state is stored against the idempotency key in our chosen backend.

  • STARTED: The initial state. A request with a new idempotency key has been received, but the core business logic has not yet been invoked. We create a lock to prevent concurrent execution.
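  • PROCESSING: The business logic is currently executing. This state is functionally similar to STARTED in that it signals an in-flight operation, so the implementation below collapses the two and writes PROCESSING as soon as the lock is acquired.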
  • COMPLETED: The business logic finished successfully. The final HTTP response (status code and body) is stored alongside this state. Subsequent retries will immediately return this cached response.
  • FAILED: The business logic encountered an error. This state allows for a decision: should the client be allowed to retry with the same key? We'll explore this nuanced choice later.
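
The allowed moves between these states can be written down as a small transition table. The sketch below is illustrative only: the names TRANSITIONS, canTransition, and assertTransition are hypothetical, and letting a FAILED key go back to STARTED is an assumption about retry policy rather than something the design above mandates.

```javascript
// Hypothetical transition table for the four states described above.
const TRANSITIONS = Object.freeze({
  STARTED: ['PROCESSING'],
  PROCESSING: ['COMPLETED', 'FAILED'],
  COMPLETED: [],        // terminal: retries replay the stored response
  FAILED: ['STARTED'],  // assumption: a failed key may be retried from scratch
});

// Returns true only when `to` is listed as a legal successor of `from`.
function canTransition(from, to) {
  const allowed = TRANSITIONS[from];
  return Array.isArray(allowed) && allowed.includes(to);
}

// Throws on an illegal move so that a bug surfaces as an error
// instead of silently corrupting the stored state.
function assertTransition(from, to) {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal idempotency transition: ${from} -> ${to}`);
  }
  return to;
}
```

A guard like this is most useful at the point where a record is rewritten, for example before overwriting a PROCESSING record with COMPLETED.
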
    Why Redis is the Superior Choice for This Pattern

    While a transactional database like PostgreSQL could store this state, Redis offers a compelling set of features that make it purpose-built for this task:

    * Low Latency: Idempotency checks are in the critical path of every request. The sub-millisecond latency of Redis is essential to avoid adding significant overhead.

    * Atomic Operations: The SET key value [NX|XX] [GET] [EX seconds|PX milliseconds|EXAT unix-time-seconds|PXAT unix-time-milliseconds|KEEPTTL] command is the cornerstone of our implementation. The NX (Not eXists) option allows us to perform an atomic "set if not exists," which is the primitive for distributed locking and race condition prevention.

    * Time-To-Live (TTL): Redis's built-in key expiration is perfect for garbage collection. We can automatically purge old idempotency records, preventing unbounded memory growth.

    A relational database can do this too, typically with a unique constraint on the key plus row locks such as SELECT ... FOR UPDATE, but those locks are heavier, have higher latency, and can introduce more complex transaction management issues.

    Production-Grade Implementation in Node.js with Express

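    Let's build this as an Express.js middleware. We'll use the node-redis library (the redis package on npm) for its performance and robust connection management. The examples follow its v4-style API, where commands take an options object.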

    1. The Middleware Structure and Key Extraction

    Our middleware will intercept incoming requests, check for the Idempotency-Key header, and interact with Redis to manage the state machine.

    javascript
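    // idempotencyMiddleware.js
    import { createClient } from 'redis';
    
    // For brevity the client is created here and exported so that other modules
    // (such as the error handler later on) can reuse the same connection.
    // In a larger app, configure it once elsewhere and inject it instead.
    export const redisClient = createClient({ url: process.env.REDIS_URL });
    redisClient.on('error', (err) => console.error('Redis client error:', err));
    await redisClient.connect(); // top-level await: this file must be an ES module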
    
    const IDEMPOTENCY_KEY_HEADER = 'Idempotency-Key';
    const LOCK_TTL_MS = 10000; // 10 seconds for in-progress lock
    const RESULT_TTL_MS = 24 * 60 * 60 * 1000; // 24 hours for completed result
    
    export function idempotencyMiddleware() {
        return async (req, res, next) => {
            const idempotencyKey = req.get(IDEMPOTENCY_KEY_HEADER);
    
            if (!idempotencyKey) {
                // Not an idempotent request, proceed as normal.
                return next();
            }
    
            const redisKey = `idempotency:${idempotencyKey}`;
    
            try {
                // Implementation details to follow...
            } catch (error) {
                console.error('Idempotency middleware error:', error);
                return res.status(500).json({ error: 'Internal Server Error' });
            }
        };
    }
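
The skeleton above accepts any header value and uses it directly in the Redis key. A slightly more defensive variant might validate the key and scope it to the endpoint, so that one client key sent to two different routes cannot replay the wrong response. This is a sketch under assumptions: the length limit, the allowed character set, and the helper names isValidIdempotencyKey and buildRedisKey are all hypothetical choices, not part of the middleware shown here.

```javascript
// Hypothetical limits; pick values that match what your clients actually send.
const MAX_KEY_LENGTH = 255;
const KEY_PATTERN = /^[A-Za-z0-9_\-:.]+$/;

// Accepts non-empty strings of a bounded length made of URL-safe characters.
function isValidIdempotencyKey(key) {
  return typeof key === 'string'
    && key.length > 0
    && key.length <= MAX_KEY_LENGTH
    && KEY_PATTERN.test(key);
}

// Scopes the Redis key by HTTP method and path so the same client key
// used on two different endpoints maps to two different records.
function buildRedisKey(method, path, idempotencyKey) {
  return `idempotency:${method.toUpperCase()}:${path}:${idempotencyKey}`;
}
```

In the middleware this could look like rejecting invalid keys with a 400 and then calling buildRedisKey(req.method, req.path, idempotencyKey). Any other code that touches the record, such as a cleanup error handler, would need to build the key the same way.
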

    2. The Core Logic: Atomic Locking and State Handling

    This is the most critical part of the implementation. We use an atomic SET with the NX option to both check for the key's existence and acquire a lock in a single, non-interruptible operation.

    javascript
    // Inside the idempotencyMiddleware try block
    
    // Step 1: Attempt to acquire the lock atomically
    const initialState = JSON.stringify({ status: 'PROCESSING' });
    const lockAcquired = await redisClient.set(redisKey, initialState, {
        PX: LOCK_TTL_MS,
        NX: true, // Only set if the key does not already exist
    });
    
    if (lockAcquired) {
        // --- SCENARIO A: NEW REQUEST --- 
        // We successfully acquired the lock. This is the first time we've seen this key.
        console.log(`[${idempotencyKey}] Lock acquired. Processing...`);
    
        // We need to store the final result. We can't use `res.send` directly
        // because we need to capture its output. So we patch it.
        // (`res.json` serializes the body and then calls `res.send`, so JSON
        // responses pass through this patch as strings.)
        const originalSend = res.send;
        res.send = function (body) {
            const result = {
                status: 'COMPLETED',
                response_code: res.statusCode,
                // Keep the Content-Type so a replay is not served as text/html.
                content_type: res.get('Content-Type'),
                response_body: body,
            };
            
            // Store the final result with a longer TTL. The write is
            // fire-and-forget, so log failures instead of leaving an
            // unhandled promise rejection.
            redisClient
                .set(redisKey, JSON.stringify(result), { PX: RESULT_TTL_MS })
                .catch((err) => console.error(`[${idempotencyKey}] Failed to store result:`, err));
            
            return originalSend.call(this, body);
        };
    
        // If the client disconnects before the response is written, log it.
        // ('close' on the response replaces the deprecated 'aborted' request event.)
        res.on('close', () => {
            if (!res.writableFinished) {
                // Client gave up. We can optionally clear the lock to allow a quick retry.
                // Be cautious with this; the original request might still be processing.
                // A safer bet is to let the lock TTL expire.
                console.warn(`[${idempotencyKey}] Request aborted by client.`);
            }
        });
    
        return next(); // Proceed to the actual route handler
    }
    
    // --- SCENARIO B & C: DUPLICATE OR RETRIED REQUEST ---
    // Lock was not acquired, meaning the key already exists.
    console.log(`[${idempotencyKey}] Key exists. Checking status...`);
    
    const existingRecordRaw = await redisClient.get(redisKey);
    if (!existingRecordRaw) {
        // This is a rare edge case: the key existed moments ago but expired before we could GET it.
        // Treat it as a transient error and ask the client to retry.
        return res.status(503).json({ error: 'Service Unavailable, please retry.' });
    }
    
    const existingRecord = JSON.parse(existingRecordRaw);
    
    if (existingRecord.status === 'COMPLETED') {
        // --- SCENARIO C: RETRIED COMPLETED REQUEST ---
        console.log(`[${idempotencyKey}] Request already completed. Returning cached response.`);
        if (existingRecord.content_type) {
            res.set('Content-Type', existingRecord.content_type);
        }
        return res
            .status(existingRecord.response_code)
            .send(existingRecord.response_body);
    }
    
    if (existingRecord.status === 'PROCESSING') {
        // --- SCENARIO B: DUPLICATE IN-FLIGHT REQUEST ---
        console.log(`[${idempotencyKey}] Request is already in progress.`);
        return res.status(409).json({ error: 'Request in progress' });
    }
    
    // If we reach here, the state is unknown or corrupt. 
    // Best to return a server error.
    return res.status(500).json({ error: 'Inconsistent idempotency state' });
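
To see why the single atomic SET with NX matters, it can help to compare it with a check-then-set approach under concurrency. The sketch below uses an in-memory stand-in (the hypothetical FakeStore class, which is not Redis and only models the two behaviours being compared). Each store call yields once to mimic a network round trip.

```javascript
// Yields to other pending work, standing in for a network round trip.
const tick = () => Promise.resolve();

// In-memory stand-in for a key-value store. NOT Redis; illustration only.
class FakeStore {
  constructor() { this.data = new Map(); }
  async get(key) { await tick(); return this.data.has(key) ? this.data.get(key) : null; }
  async setPlain(key, value) { await tick(); this.data.set(key, value); return 'OK'; }
  // Models SET ... NX: the existence check and the write happen in one step.
  async setIfAbsent(key, value) {
    await tick();
    if (this.data.has(key)) return null;
    this.data.set(key, value);
    return 'OK';
  }
}

// Naive: check, then set. Another request can slip in between the two calls.
async function naiveAcquire(store, key) {
  if ((await store.get(key)) !== null) return false;
  await store.setPlain(key, 'PROCESSING');
  return true;
}

// Atomic: a single call decides the winner.
async function atomicAcquire(store, key) {
  return (await store.setIfAbsent(key, 'PROCESSING')) === 'OK';
}

// Fires three concurrent requests with the same key and counts the winners.
async function countWinners(acquire) {
  const store = new FakeStore();
  const results = await Promise.all(
    [1, 2, 3].map(() => acquire(store, 'idempotency:demo'))
  );
  return results.filter(Boolean).length;
}
```

With the naive version, more than one concurrent caller believes it owns the key; with the atomic version only one does, which is the property the middleware relies on.
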

    3. Handling Handler Failures

    What if the route handler throws an error after we've acquired the lock? Our current implementation would leave a PROCESSING record in Redis until the lock TTL expires. The client would receive a 500 error from our framework's error handler but would be blocked from retrying for 10 seconds.

    We can improve this by creating a dedicated error-handling middleware that runs after our routes and cleans up the idempotency key.

    javascript
    // In your main app file (e.g., server.js)
    import express from 'express';
    // Assumes idempotencyMiddleware.js also exports its Redis client.
    import { idempotencyMiddleware, redisClient } from './idempotencyMiddleware.js';
    
    const app = express();
    
    app.use(express.json());
    app.use(idempotencyMiddleware());
    
    app.post('/api/payments', async (req, res, next) => {
        try {
            // Simulate a complex, potentially failing operation
            console.log('Processing payment for key:', req.get('Idempotency-Key'));
            if (Math.random() > 0.5) {
                throw new Error('Payment processor failed');
            }
            // Success
            res.status(201).json({ transactionId: 'txn_' + Date.now() });
        } catch (err) {
            // Express 4 does not forward rejected promises from async handlers,
            // so pass the error on explicitly to reach the cleanup handler below.
            next(err);
        }
    });
    
    // Custom error handler specifically for idempotency cleanup
    app.use((err, req, res, next) => {
        const idempotencyKey = req.get('Idempotency-Key');
        if (idempotencyKey) {
            const redisKey = `idempotency:${idempotencyKey}`;
            console.error(`[${idempotencyKey}] Error occurred during processing. Deleting lock.`);
            // On failure, we delete the key entirely. This allows the client to retry the operation from scratch.
            // The alternative is to set the state to FAILED, which would block this key until the record expires.
            // Deleting is often the more pragmatic choice for transient failures.
            // Note: this also runs for errors raised before the lock was acquired (e.g. a body-parse
            // error); a stricter version would delete only when this request owns the lock.
            redisClient
                .del(redisKey)
                .catch((delErr) => console.error(`[${idempotencyKey}] Failed to delete lock:`, delErr));
        }
        
        // Default error response
        res.status(500).json({ error: err.message || 'An unexpected error occurred.' });
    });
    
    app.listen(3000, () => console.log('Server running on port 3000'));

    This error handler ensures that if the business logic fails, the lock is immediately released, allowing a client to perform a clean retry.
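
One caveat: in Express 4, a rejected promise from an async route handler is not forwarded to error-handling middleware automatically, so the cleanup handler only runs if the error reaches next(err). A small wrapper can take care of that for every route. The helper name asyncHandler is hypothetical; this is a minimal sketch, not a library API.

```javascript
// Hypothetical helper: wraps a route handler so that both synchronous throws
// and rejected promises are passed to next(), where error middleware sees them.
function asyncHandler(handler) {
  return (req, res, next) => {
    Promise.resolve()
      .then(() => handler(req, res, next))
      .catch(next);
  };
}
```

A route would then be registered as app.post('/api/payments', asyncHandler(async (req, res) => { ... })), leaving the handler body free of try/catch boilerplate.
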

    Advanced Edge Cases and Performance Considerations

    A basic implementation is a good start, but production systems demand resilience against more subtle failure modes.

    Edge Case 1: The Server Crash

    What happens if the server process crashes after the business logic completes but before the res.send patch can update the Redis record to COMPLETED?

    The idempotency record will remain in the PROCESSING state until its short TTL (10 seconds) expires. During this window, any retries will receive a 409 Conflict. After the TTL, a new request with the same key will be treated as a new operation, potentially leading to double execution.

    Solution: There is no perfect solution without a distributed transaction coordinator, which is overkill. A pragmatic approach is:

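  • Shorter Lock TTLs: Make the PROCESSING state TTL as short as is reasonable for your P99 request duration. If your endpoint typically responds in 200ms, a 2-second lock TTL might be sufficient. Keep it above your slowest expected request, though: if the lock expires while the handler is still running, a retry can acquire it and run the operation a second time.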
  • Client-Side Retry Logic: Clients should implement exponential backoff. If they receive a 409 Conflict, they should wait and retry. If they retry after the lock has expired, your system must be designed to detect the duplicate operation through other means (e.g., a unique constraint on an order ID in your database).
  • Audit Logging: Log every idempotency state transition. This allows for manual reconciliation in the rare event of a double execution.
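
The client-side half of this can be sketched as follows. Everything here is an assumption for illustration: the helper names backoffDelayMs and sendWithRetry, the default delays, and the choice to treat 409 and 503 as retryable. The send argument is any function that performs the request with the given key and resolves to an object with a numeric status.

```javascript
const RETRYABLE_STATUSES = new Set([409, 503]);

// Exponential backoff: base * 2^attempt, capped. Defaults are hypothetical.
function backoffDelayMs(attempt, baseMs = 100, capMs = 5000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retries `send` with the SAME idempotency key until it gets a final answer.
async function sendWithRetry(send, idempotencyKey, options = {}) {
  const {
    maxAttempts = 5,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = options;

  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const response = await send(idempotencyKey);
      if (!RETRYABLE_STATUSES.has(response.status)) return response;
      lastError = new Error(`Gave up after status ${response.status}`);
    } catch (err) {
      // Network error or timeout: the outcome is unknown, so retry with the same key.
      lastError = err;
    }
    if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
  }
  throw lastError;
}
```

A real client would usually add random jitter to the delay so that many clients retrying at once do not synchronize; it is left out here to keep the sketch deterministic.
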
    Edge Case 2: The `GET` before `SET` Race

    Our logic for the path where the lock was not acquired is:

    javascript
    const existingRecordRaw = await redisClient.get(redisKey);
    if (!existingRecordRaw) { /* ... rare edge case ... */ }

    It's possible for the key to expire between the SET...NX failing and the GET executing. Our code handles this by returning a 503 Service Unavailable, prompting the client to retry. This is a safe and correct approach, as the subsequent retry will successfully acquire the lock and restart the process cleanly.

    Performance Optimization: Lua Scripting

    Our current logic for handling an existing key involves two network roundtrips to Redis: one for the failed SET...NX and another for the GET.

    We can combine this logic into a single, atomic operation using a Redis Lua script. This reduces network latency and ensures that the check-and-get operation is atomic, eliminating the GET before SET race condition entirely.

    lua
    -- idempotency.lua
    -- KEYS[1]: The idempotency key (e.g., 'idempotency:uuid-123')
    -- ARGV[1]: The initial value for a new record (e.g., '{"status":"PROCESSING"}')
    -- ARGV[2]: The TTL in milliseconds for a new record
    
    -- Attempt to set the key if it doesn't exist
    local was_set = redis.call('SET', KEYS[1], ARGV[1], 'PX', ARGV[2], 'NX')
    
    if was_set then
      -- The key was successfully set, this is a new request
      return {'NEW'}
    else
      -- The key already exists, return its current value
      local existing_value = redis.call('GET', KEYS[1])
      return {'EXISTING', existing_value}
    end

    Now, we can update our middleware to use this script via EVAL or EVALSHA.

    javascript
    // In idempotencyMiddleware.js
    
    // Load the script during initialization
    // In a real app, you'd read this from a file and manage the SHA hash for EVALSHA
    const LUA_SCRIPT = `
      local was_set = redis.call('SET', KEYS[1], ARGV[1], 'PX', ARGV[2], 'NX')
      if was_set then
        return {'NEW'}
      else
        local existing_value = redis.call('GET', KEYS[1])
        return {'EXISTING', existing_value}
      end
    `;
    
    // In the middleware function...
    const initialState = JSON.stringify({ status: 'PROCESSING' });
    // node-redis takes keys and arguments as an options object;
    // arguments must be strings.
    const [status, existingValue] = await redisClient.eval(LUA_SCRIPT, {
        keys: [redisKey],
        arguments: [initialState, LOCK_TTL_MS.toString()],
    });
    
    if (status === 'NEW') {
        // We acquired the lock. Same logic as SCENARIO A before.
        // ...
        return next();
    } else if (status === 'EXISTING') {
        // Key already existed. Same logic as SCENARIO B & C before,
        // but now we use `existingValue` directly.
        if (!existingValue) {
            // This race condition is now impossible with the Lua script, 
            // but defensive coding is good practice.
            return res.status(503).json({ error: 'Service Unavailable, please retry.' });
        }
        const existingRecord = JSON.parse(existingValue);
        // ... handle COMPLETED or PROCESSING states
    }

    This Lua-based approach is more performant and robust, making it the preferred pattern for high-throughput systems.
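
Sending the full script text with EVAL on every request is wasteful once traffic grows. A common pattern is to load the script once, call it by SHA with EVALSHA, and reload it if Redis reports NOSCRIPT (for example after a restart or a script cache flush). The sketch below assumes a connected node-redis v4 client and its scriptLoad and evalSha methods; the helper name evalCached and the cache object are hypothetical.

```javascript
// Runs a script by its SHA, loading it first if needed and reloading it
// when the server no longer has it cached.
// `client`  : assumed to be a connected node-redis v4 client
// `cache`   : plain object used to remember the SHA between calls
// `options` : { keys: [...], arguments: [...] } as accepted by eval/evalSha
async function evalCached(client, script, cache, options) {
  if (!cache.sha) {
    cache.sha = await client.scriptLoad(script);
  }
  try {
    return await client.evalSha(cache.sha, options);
  } catch (err) {
    // Only recover from a missing script; rethrow everything else.
    if (!String(err && err.message).startsWith('NOSCRIPT')) throw err;
    cache.sha = await client.scriptLoad(script);
    return client.evalSha(cache.sha, options);
  }
}
```

In the middleware, the eval call would become something like evalCached(redisClient, LUA_SCRIPT, scriptCache, { keys: [redisKey], arguments: [initialState, LOCK_TTL_MS.toString()] }), with scriptCache being a module-level object.
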

    Complete Runnable Example

    Here is a simplified but complete server file to demonstrate the entire pattern in action.

    javascript
    // server.js
    import express from 'express';
    import { createClient } from 'redis';
    import { v4 as uuidv4 } from 'uuid';
    
    // --- Redis Client Setup ---
    const redisClient = createClient({ url: 'redis://localhost:6379' });
    redisClient.on('error', (err) => console.log('Redis Client Error', err));
    await redisClient.connect();
    
    // --- Idempotency Middleware ---
    const IDEMPOTENCY_KEY_HEADER = 'Idempotency-Key';
    const LOCK_TTL_MS = 10000;
    const RESULT_TTL_MS = 24 * 60 * 60 * 1000;
    
    // Using Lua for atomicity and performance
    const LUA_SCRIPT = `
      local was_set = redis.call('SET', KEYS[1], ARGV[1], 'PX', ARGV[2], 'NX')
      if was_set then
        return {'NEW'}
      else
        return {'EXISTING', redis.call('GET', KEYS[1])}
      end
    `;
    
    function idempotencyMiddleware() {
        return async (req, res, next) => {
            const idempotencyKey = req.get(IDEMPOTENCY_KEY_HEADER);
            if (!idempotencyKey) return next();
    
            const redisKey = `idempotency:${idempotencyKey}`;
    
            try {
                const initialState = JSON.stringify({ status: 'PROCESSING' });
                const [status, existingValue] = await redisClient.eval(LUA_SCRIPT, {
                    keys: [redisKey],
                    arguments: [initialState, LOCK_TTL_MS.toString()],
                });
    
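                if (status === 'NEW') {
                    const originalSend = res.send.bind(res);
                    res.send = (body) => {
                        const result = JSON.stringify({
                            status: 'COMPLETED',
                            response_code: res.statusCode,
                            // Keep the Content-Type so a replay is not served as text/html
                            content_type: res.get('Content-Type'),
                            response_body: body,
                        });
                        // Fire-and-forget write: log failures rather than
                        // leave an unhandled promise rejection
                        redisClient
                            .set(redisKey, result, { PX: RESULT_TTL_MS })
                            .catch((err) => console.error(`[${idempotencyKey}] Failed to store result:`, err));
                        return originalSend(body);
                    };
                    return next();
                }
    
                if (status === 'EXISTING') {
                    if (!existingValue) {
                        return res.status(503).json({ error: 'Retry required due to transient state.' });
                    }
                    const record = JSON.parse(existingValue);
                    if (record.status === 'COMPLETED') {
                        if (record.content_type) res.set('Content-Type', record.content_type);
                        return res.status(record.response_code).send(record.response_body);
                    }
                    if (record.status === 'PROCESSING') {
                        return res.status(409).json({ error: 'Request in progress' });
                    }
                }
    
                // Unknown or corrupt state: respond rather than leave the request hanging
                return res.status(500).json({ error: 'Inconsistent idempotency state' });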
            } catch (error) {
                console.error('Idempotency middleware error:', error);
                return res.status(500).json({ error: 'Internal Server Error' });
            }
        };
    }
    
    // --- Express App ---
    const app = express();
    app.use(express.json());
    app.use(idempotencyMiddleware());
    
    app.post('/api/orders', async (req, res, next) => {
        const key = req.get(IDEMPOTENCY_KEY_HEADER);
        try {
            console.log(`[${key}] Processing new order for user ${req.body.userId}`);
            
            // Simulate 2 seconds of work
            await new Promise(resolve => setTimeout(resolve, 2000));
        
            if (req.body.shouldFail) {
                throw new Error('Order creation failed!');
            }
        
            const orderId = `order_${uuidv4()}`;
            console.log(`[${key}] Order ${orderId} created successfully.`);
            res.status(201).json({ orderId });
        } catch (err) {
            // Express 4 does not forward rejections from async handlers,
            // so hand the error to the cleanup handler explicitly.
            next(err);
        }
    });
    
    // Error handler for cleanup
    app.use((err, req, res, next) => {
        const idempotencyKey = req.get(IDEMPOTENCY_KEY_HEADER);
        if (idempotencyKey) {
            const redisKey = `idempotency:${idempotencyKey}`;
            console.error(`[${idempotencyKey}] Deleting lock due to error: ${err.message}`);
            redisClient
                .del(redisKey)
                .catch((delErr) => console.error(`[${idempotencyKey}] Failed to delete lock:`, delErr));
        }
        res.status(500).json({ error: 'An unexpected error occurred.' });
    });
    
    app.listen(3000, () => {
        console.log('Server running on port 3000');
        console.log('Test with:');
        console.log(`curl -X POST -H "Content-Type: application/json" -H "Idempotency-Key: $(uuidgen)" -d '{"userId": 123}' http://localhost:3000/api/orders`);
    });
    

    Conclusion: Beyond the Basics

    Implementing a truly robust idempotency layer is a microcosm of distributed systems engineering. It forces us to confront race conditions, network failures, and the need for atomic operations. By leveraging Redis and modeling the request lifecycle as a state machine, we can build a solution that is both highly performant and resilient.

    The key takeaways for a production-grade system are:

  • Use Atomic Primitives: Rely on operations like SET...NX or, even better, a comprehensive Lua script to prevent race conditions.
  • Implement a State Machine: Don't just check for key existence. Track the PROCESSING and COMPLETED states to handle in-flight and finished requests correctly.
  • Plan for Failure: Your system will crash. Use short lock TTLs and robust error handling to ensure you can recover gracefully.
  • Cache the Result: The goal is not just to prevent double execution but also to return the identical result for a retried request. Storing the final response is non-negotiable.
    This pattern, while complex, is an essential tool in the arsenal of any senior engineer building reliable, mission-critical services.
