Idempotency Key Patterns for Resilient Distributed APIs

Goh Ling Yong

Beyond Theory: Production-Grade Idempotency

For senior engineers building distributed systems, particularly in domains like FinTech, e-commerce, or any service involving stateful mutations, idempotency isn't a 'nice-to-have'—it's a foundational requirement for reliability. The common interview question, "How do you handle duplicate requests?" has a simple answer: an idempotency key. However, the chasm between this simple answer and a production-ready, performant, and race-condition-proof implementation is vast.

This article bypasses the introductory definitions. We assume you understand why you need an Idempotency-Key header. Instead, we will dissect the complex engineering decisions you must make when implementing this pattern in a high-throughput, distributed environment. We'll focus on two primary state-management strategies—a relational database (PostgreSQL) and an in-memory cache (Redis)—and an advanced hybrid model, complete with full-fledged TypeScript examples for a Node.js/Express environment.

Our goal is to move from a conceptual model to a concrete architecture that addresses:

  • Atomicity and Race Conditions: How do you prevent two identical requests, arriving milliseconds apart, from both executing the business logic?
  • State Management: Where do you store the state of a request (e.g., in_progress, completed) and its result? What are the performance and consistency trade-offs?
  • Partial Failures: What happens if your service crashes after the database transaction commits but before the idempotency result is stored? How do you ensure the system remains consistent?
  • Performance Overheads: What is the latency cost of each idempotency check, and how can we optimize it?
  • Lifecycle and Cleanup: How do you manage the lifecycle of idempotency keys to prevent unbounded data growth?

The Anatomy of an Idempotent Request Flow

    Before diving into implementation specifics, let's establish a state machine for an idempotent request. This model will be the foundation for all subsequent patterns.

    A request associated with an idempotency key can be in one of three states:

  • IN_PROGRESS: The key has been seen, and the original request is currently being processed. Any subsequent requests with the same key should wait or receive a conflict error.
  • COMPLETED: The original request was successfully processed, and its result (HTTP status code, headers, and body) is stored. Subsequent requests with the same key should immediately return the stored result without re-executing any logic.
  • FAILED: The original request failed in a way that should be retried (e.g., a temporary downstream service failure). This state is more nuanced and often treated as non-existent, forcing a full retry after a lock timeout.
    Our middleware will orchestrate this flow:

    1. The client sends a POST /v1/payments request with an Idempotency-Key header.
    2. The middleware intercepts the request and extracts the key.
    3. The middleware checks the state store for this key:

    * If key exists and is COMPLETED: Intercept the request and immediately return the stored response.

    * If key exists and is IN_PROGRESS: Return an HTTP 409 Conflict to indicate the request is already being processed.

    * If key does not exist: This is the first time we've seen this request.

    a. Atomically create a record for the key and mark its state as IN_PROGRESS. This is the most critical step for preventing race conditions.

    b. Proceed to the actual business logic (e.g., the payment controller).

    c. Once the business logic completes, capture the response (status, headers, body).

    d. Update the idempotency record's state to COMPLETED and store the captured response.

    e. Release any locks and return the response to the client.
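The branching in step 3 can be written down as a small pure function. This is a minimal sketch, not the middleware itself: the record shape, the `decide` name, and the choice to treat an expired IN_PROGRESS lock like a missing key are illustrative assumptions. The atomic "create the record" step still has to happen in the state store.

```typescript
// Hypothetical record shape; field names are illustrative, not from a library.
type StoredResponse = { code: number; headers: Record<string, string>; body: unknown };

type IdempotencyRecord =
  | { status: 'IN_PROGRESS'; lockedUntil: number } // epoch milliseconds
  | { status: 'COMPLETED'; response: StoredResponse };

type Decision =
  | { action: 'replay'; response: StoredResponse } // return the stored result
  | { action: 'conflict' }                         // 409: original still running
  | { action: 'execute' };                         // take the lock, run the handler

function decide(record: IdempotencyRecord | undefined, now: number): Decision {
  if (record === undefined) return { action: 'execute' };
  if (record.status === 'COMPLETED') return { action: 'replay', response: record.response };
  // IN_PROGRESS: an expired lock is treated like a missing key so a retry can proceed.
  return record.lockedUntil > now ? { action: 'conflict' } : { action: 'execute' };
}
```

Keeping the decision separate from storage makes the state machine easy to unit test, whichever backend ends up holding the records.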

    Now, let's implement this flow using different backend technologies, analyzing the trade-offs of each.


    Strategy 1: PostgreSQL for Strong Consistency

    Using your primary relational database (like PostgreSQL) as the state store for idempotency offers the strongest consistency guarantees. It allows you to tie the idempotency check into the same transaction as your business logic, providing ACID compliance for the entire operation.

    Database Schema

    First, we need a table to store the state. The schema must be designed to enforce uniqueness and handle different states.

    sql
    CREATE TYPE idempotency_status AS ENUM ('in_progress', 'completed');
    
    CREATE TABLE idempotency_keys (
        -- User/Tenant ID to scope the key.
        -- CRITICAL: Prevents one user from hijacking another's key.
        user_id UUID NOT NULL,

        -- SHA-256 hex digest of the idempotency key provided by the client.
        -- A fixed-length digest keeps the index compact and bounds the stored size.
        key_hash CHAR(64) NOT NULL,

        -- The current state of the request processing.
        status idempotency_status NOT NULL,

        -- The stored HTTP response to return on subsequent requests.
        response_code SMALLINT,
        response_body JSONB,
        response_headers JSONB,

        -- Timestamps for lifecycle management.
        created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
        locked_until TIMESTAMPTZ, -- To prevent indefinite locks on failed processes.

        -- The composite primary key is the core of the lock. It enforces uniqueness
        -- per user (a primary key on key_hash alone would make two users' identical
        -- keys collide) and its index already serves lookups by (user_id, key_hash),
        -- so no separate lookup index is needed.
        PRIMARY KEY (user_id, key_hash)
    );

    -- An index for the cleanup process.
    CREATE INDEX idx_idempotency_keys_created_at ON idempotency_keys (created_at);

    Key Design Choices:

    * key_hash: We store a SHA-256 hash of the client-provided key. This gives the unique index a fixed-length value and bounds the stored size regardless of how long a key a client sends.

    * user_id: Scoping the key to a user or tenant is non-negotiable in a multi-tenant system.

    * locked_until: This timestamp acts as a dead-man's switch. If a process dies while holding a lock, the lock will eventually expire, allowing a new request to proceed.
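Hashing bounds what is stored, but it is still worth rejecting unusable keys before they reach the database. The sketch below shows one way to do that. The `toKeyHash` name and the 255-character limit are assumptions for illustration, not part of the schema above.

```typescript
import { createHash } from 'node:crypto';

const MAX_KEY_LENGTH = 255; // assumed limit; pick one that fits your clients

// Returns the SHA-256 hex digest to store, or null if the header is unusable.
function toKeyHash(rawKey: string | undefined): string | null {
  if (!rawKey || rawKey.length > MAX_KEY_LENGTH) return null;
  return createHash('sha256').update(rawKey, 'utf8').digest('hex');
}
```

A middleware could answer a `null` result with a 400 response instead of silently treating the request as non-idempotent; which behavior is right depends on your API contract.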

    Implementation: The Transactional Middleware

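    Here's how to implement the middleware in TypeScript using node-postgres (pg). Concurrency is handled by a transaction combined with the table's unique constraint: the first request to insert a key wins, and a concurrent duplicate fails with a unique violation. SELECT ... FOR UPDATE SKIP LOCKED covers keys that already exist, without blocking on a row that another transaction has locked.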

    typescript
    import { Request, Response, NextFunction } from 'express';
    import { Pool, PoolClient } from 'pg';
    import { createHash } from 'crypto';
    
    const pool = new Pool({ connectionString: process.env.DATABASE_URL });
    
    const LOCK_TIMEOUT_MS = 5000; // 5 seconds
    
    // A utility to hash the key
    function hashKey(key: string): string {
        return createHash('sha256').update(key).digest('hex');
    }
    
    export async function idempotencyMiddleware(req: Request, res: Response, next: NextFunction) {
        const idempotencyKey = req.headers['idempotency-key'] as string;
        if (!idempotencyKey) {
            return next(); // Not an idempotent request
        }
    
        // Assume user_id is available from a previous auth middleware
        const userId = (req as any).user?.id; // Optional chaining: no TypeError when auth did not run
        if (!userId) {
            return res.status(401).json({ error: 'Unauthorized' });
        }
    
        const keyHash = hashKey(idempotencyKey);
        let client: PoolClient | null = null;
    
        try {
            client = await pool.connect();
            await client.query('BEGIN');
    
            // 1. Look for an existing, committed key for this user.
            // FOR UPDATE locks the row so no other transaction can modify it while we read it.
            // SKIP LOCKED returns no row (instead of blocking) when another transaction holds that lock.
            // A key inserted by a transaction that is still open is not visible here at all;
            // that race is caught by the unique constraint on the INSERT below.
            const { rows } = await client.query(
                `SELECT status, response_code, response_body, response_headers
                 FROM idempotency_keys
                 WHERE key_hash = $1 AND user_id = $2
                 FOR UPDATE SKIP LOCKED`,
                [keyHash, userId]
            );
    
            if (rows.length > 0) {
                const existingKey = rows[0];
                
                if (existingKey.status === 'completed') {
                    // 2a. Request already completed. Return the stored response.
                    await client.query('COMMIT');
                    client.release();
                    res
                        .set(existingKey.response_headers || {})
                        .status(existingKey.response_code)
                        .json(existingKey.response_body);
                    return;
                } else if (existingKey.status === 'in_progress') {
                     // 2b. Request is in progress. Return 409 Conflict.
                    await client.query('COMMIT');
                    client.release();
                    return res.status(409).json({ error: 'Request in progress' });
                }
            } else {
                // 3. No key found. This is a new request. Create a lock.
                try {
                    await client.query(
                        `INSERT INTO idempotency_keys (key_hash, user_id, status, locked_until)
                         VALUES ($1, $2, 'in_progress', NOW() + INTERVAL '${LOCK_TIMEOUT_MS} milliseconds')`,
                        [keyHash, userId]
                    );
                } catch (error: any) {
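                    // This handles the race where another request inserted the key first. If that
                    // request's transaction is still open, the INSERT waits for it to finish and
                    // raises unique_violation only if the key still exists afterwards.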
                    if (error.code === '23505') { // unique_violation
                        await client.query('ROLLBACK');
                        client.release();
                        return res.status(409).json({ error: 'Concurrent request detected' });
                    } 
                    throw error; // Rethrow other errors
                }
            }
    
            // 4. We have the lock. Proceed with the business logic.
            // We override res.send and res.json to capture the response.
            const originalJson = res.json;
            const originalSend = res.send;
            let responseBody: any = null;
            let isResponseCaptured = false;
    
            const captureResponse = (body?: any) => {
                if (isResponseCaptured) return;
                responseBody = body;
                isResponseCaptured = true;
            };
    
            res.json = (body: any) => {
                captureResponse(body);
                return originalJson.call(res, body);
            };
    
            res.send = (body: any) => {
                captureResponse(body);
                return originalSend.call(res, body);
            };
            
            res.on('finish', async () => {
                if (!client) return; // Should not happen
    
                try {
                    if (res.statusCode >= 200 && res.statusCode < 300) {
                        // 5a. Success. Store the response and mark as completed.
                        await client.query(
                            `UPDATE idempotency_keys 
                             SET status = 'completed', response_code = $1, response_body = $2, response_headers = $3, locked_until = NULL
                             WHERE key_hash = $4 AND user_id = $5`,
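                            // Serialize explicitly: pg would send a JS array as a Postgres array
                            // literal and a plain string as raw text, neither of which is valid JSONB.
                            [res.statusCode, JSON.stringify(responseBody), JSON.stringify(res.getHeaders()), keyHash, userId]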
                        );
                    } else {
                        // 5b. Failure. Delete the key to allow retries.
                        await client.query(
                            `DELETE FROM idempotency_keys WHERE key_hash = $1 AND user_id = $2`,
                            [keyHash, userId]
                        );
                    }
                    await client.query('COMMIT');
                } catch (error) {
                    console.error('Failed to update idempotency key:', error);
                    await client.query('ROLLBACK');
                } finally {
                    client.release();
                    client = null;
                }
            });
    
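            // Expose the transaction client so the controller's writes join this transaction,
            // as the hybrid sketch later in this article does.
            (req as any).dbClient = client;
            next(); // Execute the actual controller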
    
        } catch (error) {
            if (client) {
                try { await client.query('ROLLBACK'); } catch (e) { /* ignore */ }
                client.release();
            }
            console.error('Idempotency middleware error:', error);
            next(error);
        }
    }
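The response capture in step 4 depends on one detail of Express 4: `res.json()` serializes the body and then calls `res.send()` with the resulting string, so both overrides fire for a single response. The sketch below reproduces that behavior with a stand-in response object. `ResLike`, `makeFakeRes`, and `captureBody` are hypothetical names used only for illustration. It shows why the "capture once" guard matters.

```typescript
// Minimal stand-in for the parts of an Express response used here (hypothetical).
interface ResLike {
  json(body: unknown): ResLike;
  send(body?: unknown): ResLike;
}

function makeFakeRes(sent: unknown[]): ResLike {
  const res: ResLike = {
    // Mirrors Express 4: json() serializes, then delegates to send().
    json(body) { return res.send(JSON.stringify(body)); },
    send(body) { sent.push(body); return res; },
  };
  return res;
}

// Wraps json/send so the first body passed by the handler is recorded exactly once.
function captureBody(res: ResLike): () => unknown {
  const originalJson = res.json;
  const originalSend = res.send;
  let captured: unknown = null;
  let isCaptured = false;
  const capture = (body: unknown): void => {
    if (isCaptured) return; // ignore the inner send() triggered by json()
    captured = body;
    isCaptured = true;
  };
  res.json = (body) => { capture(body); return originalJson.call(res, body); };
  res.send = (body) => { capture(body); return originalSend.call(res, body); };
  return () => captured;
}
```

Without the guard, the stored body would be the serialized string rather than the original object, and replaying it through `res.json()` would encode it a second time.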

    Edge Cases & Performance Considerations (PostgreSQL)

    * Performance: SELECT FOR UPDATE takes a row-level lock, but only on the row for a single key, so lock contention is limited to duplicates of the same request. The broader cost is that every idempotent API call runs inside a database transaction that stays open for the duration of the request, which adds latency (typically 5-20ms, depending on network and DB load). Connection pool exhaustion is a real risk if these transactions are long-running.

    * Partial Failures: This approach is resilient to partial failures as long as the controller performs its writes on the same transaction as the idempotency record. If the Node.js process crashes before the finish handler commits, PostgreSQL rolls back the open transaction, discarding the business writes together with the uncommitted in_progress row, so a retry starts from a clean slate. If you instead commit the in_progress row up front (so that duplicates see it and get an immediate 409), a crash leaves that row behind; the locked_until timestamp marks it as expired so that a cleanup job can remove it.

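    * Cleanup: A committed in_progress record with an expired locked_until is an orphaned lock, and a periodic background job is necessary to remove it. Completed records should likewise be purged by created_at once they are older than your retention window, or the table grows without bound. The orphaned-lock cleanup looks like this:

    sql
    -- Reclaim orphaned locks.
    DELETE FROM idempotency_keys
    WHERE status = 'in_progress' AND locked_until < NOW();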
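A background job can run both purges together. The sketch below is one possible shape, under stated assumptions: `Queryable` is a hypothetical narrow interface standing in for a pg-style client, and the 24-hour retention window is an illustrative value rather than a recommendation.

```typescript
// Narrow view of a pg-style client; anything with a compatible query() works (assumption).
interface Queryable {
  query(sql: string, params?: unknown[]): Promise<{ rowCount: number | null }>;
}

const COMPLETED_RETENTION_HOURS = 24; // assumed retention window

async function purgeIdempotencyKeys(
  db: Queryable
): Promise<{ orphaned: number; expired: number }> {
  // Orphaned locks: in_progress rows whose dead-man's switch has fired.
  const orphaned = await db.query(
    `DELETE FROM idempotency_keys
     WHERE status = 'in_progress' AND locked_until < NOW()`
  );
  // Old results: completed rows past the retention window (uses the created_at index).
  const expired = await db.query(
    `DELETE FROM idempotency_keys
     WHERE status = 'completed'
       AND created_at < NOW() - ($1::int * INTERVAL '1 hour')`,
    [COMPLETED_RETENTION_HOURS]
  );
  return { orphaned: orphaned.rowCount ?? 0, expired: expired.rowCount ?? 0 };
}
```

The job could be scheduled with a timer in the service or with an external scheduler. On large tables, deleting in batches is usually gentler than one unbounded DELETE.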

    Strategy 2: Redis for High-Performance Locking

    For systems where latency is paramount, offloading the idempotency check to an in-memory store like Redis can dramatically reduce overhead. Redis provides atomic operations that are perfect for this use case.

    The Redis-Based Flow

    The logic is similar, but the implementation of the atomic lock is different. We'll use Redis's SET command with the NX (Not Exists) and PX (expire in milliseconds) options.

    SET mykey myvalue NX PX 5000 is the atomic equivalent of: "If mykey does not exist, set it to myvalue with an expiry of 5000ms, and return success. Otherwise, do nothing and return failure."
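To make those semantics concrete, here is a toy in-memory model of `SET ... NX PX`. It is not a Redis client, and `TinyStore` is a hypothetical name. The current time is passed in explicitly so the behavior is deterministic.

```typescript
// Toy in-memory model of `SET key value NX PX ttl`; not a Redis client.
class TinyStore {
  private entries = new Map<string, { value: string; expiresAt: number }>();

  // Returns 'OK' if the key was set, or null if a live key already exists.
  setNxPx(key: string, value: string, ttlMs: number, now: number): 'OK' | null {
    const existing = this.entries.get(key);
    if (existing !== undefined && existing.expiresAt > now) return null;
    this.entries.set(key, { value, expiresAt: now + ttlMs });
    return 'OK';
  }
}

const store = new TinyStore();
const first = store.setNxPx('idempotency:u1:k1', 'a', 5000, 0);        // lock acquired
const duplicate = store.setNxPx('idempotency:u1:k1', 'b', 5000, 1);    // rejected: key is live
const afterTtl = store.setNxPx('idempotency:u1:k1', 'c', 5000, 5000);  // acquired: key expired
```

In real Redis the check and the write happen as a single command on the server, which is what makes the operation safe under concurrency; the model only illustrates the outcomes.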

    We will store a JSON object in Redis with the request's state and eventual result.

    Implementation: The High-Speed Middleware

    Here's the middleware using ioredis.

    typescript
    import { Request, Response, NextFunction } from 'express';
    import Redis from 'ioredis';
    
    const redisClient = new Redis(process.env.REDIS_URL);
    
    const LOCK_TIMEOUT_MS = 5000; // 5 seconds
    
    // Redis key format: idempotency:{user_id}:{key}
    function getRedisKey(userId: string, idempotencyKey: string): string {
        return `idempotency:${userId}:${idempotencyKey}`;
    }
    
    export async function idempotencyMiddlewareRedis(req: Request, res: Response, next: NextFunction) {
        const idempotencyKey = req.headers['idempotency-key'] as string;
        if (!idempotencyKey) {
            return next();
        }
    
        const userId = (req as any).user?.id; // Optional chaining: no TypeError when auth did not run
        if (!userId) {
            return res.status(401).json({ error: 'Unauthorized' });
        }
    
        const redisKey = getRedisKey(userId, idempotencyKey);
    
        try {
            // 1. Check if a final result already exists.
            const existingResult = await redisClient.get(redisKey);
    
            if (existingResult) {
                const data = JSON.parse(existingResult);
                if (data.status === 'completed') {
                    // 2a. Request completed. Return stored response.
                    res
                        .set(data.headers)
                        .status(data.code)
                        .json(data.body);
                    return;
                }
            }
    
            // 2. Attempt to acquire an atomic lock.
            const lockValue = JSON.stringify({ status: 'in_progress', timestamp: Date.now() });
            const lockAcquired = await redisClient.set(redisKey, lockValue, 'PX', LOCK_TIMEOUT_MS, 'NX');
    
            if (!lockAcquired) {
                // 2b. Lock not acquired, another request is in progress.
                return res.status(409).json({ error: 'Request in progress' });
            }
    
            // 3. Lock acquired. Proceed with business logic.
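            const originalJson = res.json;
            const originalSend = res.send;
            let responseBody: any = null;
            let isResponseCaptured = false;

            // res.json() calls res.send() internally with the serialized string, so keep only
            // the first body we see. Otherwise the stored body would be a JSON string and the
            // replayed response would be double-encoded.
            const captureResponse = (body?: any) => {
                if (isResponseCaptured) return;
                responseBody = body;
                isResponseCaptured = true;
            };

            res.json = (body: any) => {
                captureResponse(body);
                return originalJson.call(res, body);
            };
            res.send = (body: any) => {
                captureResponse(body);
                return originalSend.call(res, body);
            };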
    
            res.on('finish', async () => {
                try {
                    if (res.statusCode >= 200 && res.statusCode < 300) {
                        // 4a. Success. Store the final result with a longer TTL.
                        const finalResult = JSON.stringify({
                            status: 'completed',
                            code: res.statusCode,
                            headers: res.getHeaders(),
                            body: responseBody,
                        });
                        // Use a 24-hour TTL for the final result.
                        await redisClient.set(redisKey, finalResult, 'EX', 24 * 60 * 60);
                    } else {
                        // 4b. Failure. Release the lock to allow retries.
                        await redisClient.del(redisKey);
                    }
                } catch (error) {
                    console.error('Failed to update idempotency key in Redis:', error);
                    // The lock will expire automatically via its TTL.
                }
            });
    
            next();
    
        } catch (error) {
            console.error('Redis idempotency middleware error:', error);
            // If we fail to talk to Redis, it's safer to let the request through
            // but this depends on business requirements.
            next();
        }
    }

    Edge Cases & Performance Considerations (Redis)

    * Performance: This approach is significantly faster. A Redis SET executes in well under a millisecond on the server, so the overhead is dominated by a couple of network round trips rather than a PostgreSQL transaction. It's ideal for high-throughput endpoints.

    * Consistency vs. Availability: The biggest trade-off is the loss of transactional guarantees between your business logic (in PostgreSQL) and your idempotency state (in Redis). Consider this failure mode:

    1. Redis lock is acquired.

    2. PostgreSQL transaction for payment processing commits successfully.

    3. The Node.js process crashes before it can write the completed state to Redis.

    In this scenario, the Redis lock will expire after LOCK_TIMEOUT_MS. The next request with the same key will acquire a new lock and re-execute the business logic, potentially leading to a double payment. This is a critical risk.
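The timeline is easier to follow as a deterministic simulation. The sketch below uses plain in-memory stand-ins (a lock-expiry map, a set of completed keys, and a ledger of committed payments), all hypothetical, to replay the three steps above followed by two client retries.

```typescript
const LOCK_TIMEOUT_MS = 5000;

type Outcome = 'charged' | 'conflict' | 'replayed';

function runScenario(): { outcomes: Outcome[]; ledger: string[] } {
  const lockExpiry = new Map<string, number>(); // key -> time the lock expires (ms)
  const completed = new Set<string>();          // keys with a stored final result
  const ledger: string[] = [];                  // committed payments

  // One request arriving at time `now`; `crashAfterCommit` models step 3.
  function handle(key: string, now: number, crashAfterCommit: boolean): Outcome {
    if (completed.has(key)) return 'replayed';
    const expiresAt = lockExpiry.get(key);
    if (expiresAt !== undefined && expiresAt > now) return 'conflict';
    lockExpiry.set(key, now + LOCK_TIMEOUT_MS); // step 1: lock acquired
    ledger.push(key);                           // step 2: payment commits
    if (crashAfterCommit) return 'charged';     // step 3: crash, nothing recorded
    completed.add(key);
    return 'charged';
  }

  const outcomes: Outcome[] = [
    handle('key-1', 0, true),     // original request; process dies after commit
    handle('key-1', 1000, false), // retry while the lock is still live
    handle('key-1', 6000, false), // retry after the lock expired
    handle('key-1', 7000, false), // later retry
  ];
  return { outcomes, ledger };
}
```

The third call charges again because nothing durable records that the first payment committed: the ledger ends with two entries for one idempotency key. The fourth call is finally replayed from the stored result.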

    * Data Persistence: If your Redis instance fails and loses data, you lose all idempotency guarantees for in-flight and recently completed requests. While Redis persistence (AOF/RDB) mitigates this, it's not as durable as a relational database.


    Strategy 3: The Hybrid Approach - Redis for Locking, PostgreSQL for Truth

    To get the best of both worlds—the high performance of Redis for locking and the strong consistency of PostgreSQL for the source of truth—we can combine them. This pattern is more complex but offers a robust solution for demanding applications.

    The Hybrid Flow

    1. Check for Final Result (PostgreSQL): First, quickly check the idempotency_keys table for a completed record. If found, return the stored response. This is our source of truth.
    2. Attempt Fast Lock (Redis): If no completed record is found, attempt to acquire a fast, short-lived lock in Redis using SET NX.
       * If the Redis lock fails, another request is in progress. Return 409 Conflict.
    3. Acquire DB Lock & Verify: Once the Redis lock is acquired, start a database transaction.
       * Inside the transaction, perform the SELECT ... FOR UPDATE SKIP LOCKED check again. This is a crucial double-check to guard against race conditions where the Redis state is inconsistent with the DB (e.g., after a Redis failure).
    4. Execute Business Logic: If the DB lock is secured, execute the main business logic on the same transaction.
    5. Commit and Store: Update the idempotency_keys row to completed with the response, then commit. Because the business writes and the idempotency record share one transaction, they become durable together.
    6. Release Locks: Finally, delete the Redis key.

    This flow uses Redis as a fast path to reject concurrent requests but ultimately relies on the database for correctness.

    Implementation Sketch

    Implementing the full hybrid middleware is an extension of the previous two. Here's the core logic flow:

    typescript
    // (Inside the middleware function)
    
    // 1. Check for a final result in PostgreSQL first.
    const completedResult = await findCompletedInDb(keyHash, userId);
    if (completedResult) {
        return sendStoredResponse(res, completedResult);
    }
    
    // 2. Attempt to acquire a fast lock in Redis.
    const redisKey = getRedisKey(userId, idempotencyKey);
    const lockAcquired = await redisClient.set(redisKey, 'locked', 'PX', LOCK_TIMEOUT_MS, 'NX');
    
    if (!lockAcquired) {
        return res.status(409).json({ error: 'Request in progress' });
    }
    
    let client: PoolClient | null = null;
    try {
        // 3. Redis lock acquired. Now verify with the DB inside a transaction.
        client = await pool.connect();
        await client.query('BEGIN');
    
        // Double-check for race conditions.
        const { rows } = await client.query(
            `SELECT ... FROM idempotency_keys ... FOR UPDATE SKIP LOCKED`,
            [keyHash, userId]
        );
    
        if (rows.length > 0) {
            // A row already exists even though we hold the Redis lock. This can happen when
            // the Redis lock expired or was lost (e.g., after a Redis failure) while the
            // original request's row stayed in the DB.
            await client.query('ROLLBACK');
            client.release(); // Return the connection to the pool before responding
            client = null;
            await redisClient.del(redisKey); // Clean up our redis lock
            return res.status(409).json({ error: 'Concurrent request detected' });
        }
    
        // Insert the 'in_progress' record into PostgreSQL.
        await client.query(
            `INSERT INTO idempotency_keys (..., status) VALUES (..., 'in_progress')`
        );
    
        // Attach 'finish' handler to update PG to 'completed' on success.
        res.on('finish', async () => {
            // ... update PG to 'completed' and commit ...
            await redisClient.del(redisKey); // Release Redis lock
        });
    
        // ... proceed to next() to run business logic ...
        // The business logic must use the same transaction client.
        (req as any).dbClient = client;
        next();
    
    } catch (error) {
        // ... error handling, rollback, release Redis lock ...
    }
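The sketch calls `findCompletedInDb` and `sendStoredResponse` without defining them. One possible shape is shown below, assuming the idempotency_keys columns from Strategy 1. The `Queryable` and `ResLike` interfaces are hypothetical, and `db` is passed as an explicit third argument here for testability, whereas the sketch above would close over the shared pool.

```typescript
// Hypothetical shapes: the sketch above only names these helpers.
interface StoredResponse {
  response_code: number;
  response_body: unknown;
  response_headers: Record<string, string> | null;
}
interface Queryable {
  query(sql: string, params: unknown[]): Promise<{ rows: StoredResponse[] }>;
}
interface ResLike {
  set(headers: Record<string, string>): ResLike;
  status(code: number): ResLike;
  json(body: unknown): ResLike;
}

// Plain read, no row lock: this is only the fast "already done?" check.
async function findCompletedInDb(
  keyHash: string,
  userId: string,
  db: Queryable
): Promise<StoredResponse | null> {
  const { rows } = await db.query(
    `SELECT response_code, response_body, response_headers
     FROM idempotency_keys
     WHERE key_hash = $1 AND user_id = $2 AND status = 'completed'`,
    [keyHash, userId]
  );
  return rows[0] ?? null;
}

// Replays the stored status, headers, and body on the current response.
function sendStoredResponse(res: ResLike, stored: StoredResponse): ResLike {
  return res
    .set(stored.response_headers ?? {})
    .status(stored.response_code)
    .json(stored.response_body);
}
```

Because this first check takes no lock, it cannot replace the locked double-check inside the transaction; it only keeps replays of finished requests cheap.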
    

    Analysis of the Hybrid Approach

    * Pros:

    * High Performance: Most concurrent requests are rejected at the Redis layer, which is extremely fast, preventing load on the database.

    * Strong Consistency: The final state and business logic are tied to a database transaction, preventing the double-execution problem of the Redis-only approach.

    * Cons:

    * Complexity: This is the most complex of the three patterns. It requires managing state in two different systems, and the logic for handling failures (e.g., Redis is down but PostgreSQL is up) must be carefully considered.

    * Increased Infrastructure: It requires managing and maintaining both a PostgreSQL database and a Redis instance.
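One way to handle the "Redis is down but PostgreSQL is up" case is to treat a Redis error as "fast path unavailable" and let the database check decide. The sketch below assumes that policy. `tryFastLock`, `FastLockResult`, and the `LockClient` interface are hypothetical names; the interface covers only the `set` arguments used in the earlier snippets.

```typescript
type FastLockResult = 'acquired' | 'busy' | 'unavailable';

// Only the one call needed here, with the same arguments as the earlier snippets.
interface LockClient {
  set(key: string, value: string, px: 'PX', ttlMs: number, nx: 'NX'): Promise<'OK' | null>;
}

async function tryFastLock(
  redis: LockClient,
  key: string,
  ttlMs: number
): Promise<FastLockResult> {
  try {
    const reply = await redis.set(key, 'locked', 'PX', ttlMs, 'NX');
    return reply === 'OK' ? 'acquired' : 'busy';
  } catch {
    // Redis is unreachable: skip the fast path and let the DB check decide.
    return 'unavailable';
  }
}
```

A caller would return 409 on `'busy'`, continue to the database transaction on `'acquired'` or `'unavailable'`, and delete the Redis key afterwards only when the result was `'acquired'`. Failing open like this is reasonable only because the database's unique constraint still arbitrates duplicates; in the Redis-only strategy the same policy would remove the only guard.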

    Conclusion: Choosing the Right Pattern

    There is no single best solution for implementing idempotency keys. The optimal choice is a direct function of your system's specific requirements.

    * Choose PostgreSQL-only when strong consistency is non-negotiable and the additional latency per request is acceptable. This is the safest and simplest starting point for services handling financial transactions or critical data mutations.

    * Choose Redis-only for high-throughput, low-latency endpoints where a small risk of double-execution in rare failure scenarios is an acceptable trade-off for performance. This might be suitable for operations like updating a user's profile information or casting a vote, where eventual consistency is sufficient.

    * Choose the Hybrid approach for systems that require both high performance and strong consistency. It provides the fast-path rejection of Redis while falling back to the transactional safety of PostgreSQL, making it a powerful but complex pattern for mission-critical, high-load services.

    By moving beyond the simple concept of an idempotency key and into the details of transactional locking, atomic operations, and failure modes, you can architect truly resilient and predictable distributed APIs.
