Resilient Choreographed Sagas: Idempotency and Distributed Locking

Goh Ling Yong

The Inherent Fragility of Choreographed Sagas

In a distributed architecture, the choreographed saga pattern is an elegant solution for maintaining data consistency across multiple services without resorting to tightly coupled two-phase commits. Services communicate via asynchronous events, each performing its local transaction and publishing events to trigger downstream actions. While this promotes loose coupling and scalability, it introduces significant challenges that are often glossed over in introductory material. The two most critical are message duplication and race conditions.

Most message brokers (Kafka, RabbitMQ, SQS) provide an "at-least-once" delivery guarantee. This is a pragmatic choice, favoring durability over the complexity of exactly-once semantics. However, it pushes the problem of deduplication to the consumer. A network glitch, a consumer crash post-processing but pre-acknowledgment, or a broker-side rebalance can all lead to a message being delivered again. In an e-commerce order saga, this could mean charging a customer twice or deducting from inventory multiple times.

Simultaneously, the asynchronous nature of event-driven systems means there are no guarantees on the order or timing of operations concerning a single business entity. Two distinct events related to the same order ID—say, OrderUpdateSubmitted and OrderCancellationRequested—could be processed concurrently by different instances of a service, leading to a classic race condition where the final state of the order is unpredictable and incorrect.

This article presents a battle-tested, production-ready pattern that combines transactional idempotency checks with distributed locking to solve both problems, ensuring your sagas are not just loosely coupled, but also resilient and correct. We will build this pattern from the ground up using TypeScript, PostgreSQL for idempotency storage, and Redis for distributed locking.


Scenario: A Multi-Service E-Commerce Order Flow

Let's ground our discussion in a concrete example. Consider an e-commerce platform with three core microservices:

  • Order Service: Manages order creation and state.
  • Payment Service: Processes payments.
  • Inventory Service: Manages stock levels.

    A successful order saga looks like this:

    OrderCreated -> (Payment Service) ProcessPayment -> PaymentSucceeded -> (Inventory Service) ReserveStock -> StockReserved -> (Order Service) MarkOrderConfirmed

    A compensation path (rollback) might look like this:

    PaymentFailed -> (Order Service) MarkOrderFailed

    StockReservationFailed -> (Payment Service) RefundPayment -> PaymentRefunded -> (Order Service) MarkOrderFailed

    Our goal is to ensure that each step in this flow can be executed safely, even if events like ProcessPayment or ReserveStock are delivered multiple times or if related events (e.g., a stock update and a cancellation) are processed concurrently.
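
    To make these flows concrete, here is one possible shape for the saga events as a TypeScript discriminated union; the field names are illustrative assumptions rather than a prescribed contract:

    typescript
    // Illustrative event shapes for the order saga; all fields are assumptions.
    interface SagaEventBase {
        idempotencyKey: string; // unique per operation, set by the producer
        orderId: string;        // the business entity this saga step operates on
        occurredAt: string;     // ISO-8601 timestamp
    }

    type OrderSagaEvent =
        | (SagaEventBase & { type: 'OrderCreated'; totalCents: number })
        | (SagaEventBase & { type: 'PaymentSucceeded'; paymentId: string })
        | (SagaEventBase & { type: 'PaymentFailed'; reason: string })
        | (SagaEventBase & { type: 'StockReserved' })
        | (SagaEventBase & { type: 'StockReservationFailed'; reason: string })
        | (SagaEventBase & { type: 'PaymentRefunded'; paymentId: string });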

    Pattern 1: Transactional Idempotency for At-Least-Once Delivery

    Idempotency ensures that an operation, when performed multiple times, has the same effect as if it were performed only once. The standard approach is to use an Idempotency Key, a unique identifier for a specific operation, typically passed in the message headers.

    The consumer service must store these keys and the results of the operations they correspond to. When a new message arrives, the service first checks if it has already processed a message with the same idempotency key. If so, it simply returns the stored result without re-executing the business logic.

    The Storage Strategy: A Dedicated PostgreSQL Table

Storing idempotency records in the same transactional database as your business data is crucial. This allows you to wrap the business logic and the idempotency record creation in a single atomic transaction, preventing partial failures where business data is committed but the idempotency record is not (or vice versa).

    Here's a robust schema for an idempotency_store table in PostgreSQL:

    sql
    CREATE TYPE idempotency_status AS ENUM ('PENDING', 'COMPLETED', 'FAILED');
    
    CREATE TABLE idempotency_store (
        -- The unique key provided by the producer of the message.
        idempotency_key VARCHAR(255) NOT NULL,
    
        -- The service and action this key relates to, for namespacing.
        -- e.g., 'payment-service:process-payment'
        scope VARCHAR(255) NOT NULL,
    
        -- A hash of the request payload to prevent key reuse with different data.
        request_hash CHAR(64) NOT NULL, -- SHA-256
    
        -- The current processing status of the operation.
        status idempotency_status NOT NULL DEFAULT 'PENDING',
    
        -- The serialized response payload to return on subsequent requests.
        response_payload JSONB,
    
        -- Timestamps for lifecycle management and cleanup.
        created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
        updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),

        -- The same key may legitimately appear under different scopes, so
        -- uniqueness (and the lookup index) lives on the pair.
        PRIMARY KEY (idempotency_key, scope)
    );

    Key Design Choices:

    * (idempotency_key, scope) (Composite Primary Key): Guarantees uniqueness at the database level and doubles as the index for the key-and-scope lookup.

    * scope: Prevents key collisions between different types of operations.

    * request_hash: A critical field. It prevents a malicious or buggy client from reusing an old idempotency key for a new, different request. You should always verify that the hash of the incoming request payload matches the stored hash.

    * status ENUM: This is vital for handling process crashes. An operation is not truly complete until it moves from PENDING to COMPLETED. If a consumer receives a message with a key that is still PENDING, it knows a previous attempt crashed and can decide whether to wait, retry, or fail.
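
    To make the PENDING recovery concrete, here is a minimal sketch that reclaims stale PENDING records so their messages can be retried; the 15-minute threshold is an assumption and should comfortably exceed your worst-case processing time:

    typescript
    // Sketch: a record stuck in PENDING usually means a consumer crashed
    // mid-operation. Marking it FAILED lets a redelivered message retry.
    async function reclaimStalePending(dbPool: Pool): Promise<number> {
        const res = await dbPool.query(
            `UPDATE idempotency_store
             SET status = 'FAILED', updated_at = NOW()
             WHERE status = 'PENDING'
               AND updated_at < NOW() - INTERVAL '15 minutes'`
        );
        return res.rowCount ?? 0;
    }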

    Implementation: An Idempotency Middleware in TypeScript

    Let's create a reusable function that wraps our business logic with this idempotency check. We'll use node-postgres (pg) for database interaction.

    typescript
    import { Pool, PoolClient } from 'pg';
    import { createHash } from 'crypto';
    
    export interface IdempotencyRecord {
        idempotency_key: string;
        scope: string;
        request_hash: string;
        status: 'PENDING' | 'COMPLETED' | 'FAILED';
        response_payload: any;
    }
    
    // A generic error for idempotent replays
    export class IdempotentResponseError extends Error {
        constructor(public readonly response: any) {
            super('This operation has already been completed.');
            this.name = 'IdempotentResponseError';
        }
    }
    
    async function getRequestHash(payload: object): Promise<string> {
        // Note: JSON.stringify is key-order sensitive. If producers may serialize
        // the same payload differently, use a canonical (sorted-key) serializer.
        const payloadString = JSON.stringify(payload);
        return createHash('sha256').update(payloadString).digest('hex');
    }
    
    export async function withIdempotency<T>(
        dbPool: Pool,
        idempotencyKey: string,
        scope: string,
        requestPayload: object,
        businessLogic: (client: PoolClient) => Promise<T>
    ): Promise<T> {
        const requestHash = await getRequestHash(requestPayload);
        const client = await dbPool.connect();
    
        try {
            // Start at SERIALIZABLE so that concurrent check-then-insert races
            // on the same key fail safely instead of both proceeding.
            await client.query('BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE');
    
            // 1. Check for an existing record
            const existingRecordResult = await client.query<IdempotencyRecord>(
                'SELECT * FROM idempotency_store WHERE idempotency_key = $1 AND scope = $2 FOR UPDATE',
                [idempotencyKey, scope]
            );
    
            const existingRecord = existingRecordResult.rows[0];
    
            if (existingRecord) {
                // 2a. Key exists, verify request hash
                if (existingRecord.request_hash !== requestHash) {
                    await client.query('ROLLBACK');
                    throw new Error('Idempotency key reuse detected with different payload.');
                }
    
                // 2b. If completed, return the stored response
                if (existingRecord.status === 'COMPLETED') {
                    await client.query('ROLLBACK');
                    // We throw a special error to signal that this is a replay, not a failure.
                    throw new IdempotentResponseError(existingRecord.response_payload);
                }
                
                // Handle other statuses (PENDING, FAILED) as needed. For PENDING, you might wait and retry.
                // For this example, we'll treat it as a new attempt.
            } else {
                // 3. No record exists, create a new one in PENDING state
                await client.query(
                    'INSERT INTO idempotency_store (idempotency_key, scope, request_hash, status) VALUES ($1, $2, $3, $4)',
                    [idempotencyKey, scope, requestHash, 'PENDING']
                );
            }
    
            // 4. Execute the core business logic within the transaction
            const result = await businessLogic(client);
    
            // 5. Update the idempotency record to COMPLETED with the result
            await client.query(
                'UPDATE idempotency_store SET status = $1, response_payload = $2, updated_at = NOW() WHERE idempotency_key = $3 AND scope = $4',
                ['COMPLETED', JSON.stringify(result), idempotencyKey, scope]
            );
    
            // 6. Commit the entire transaction
            await client.query('COMMIT');
            return result;
    
        } catch (error) {
            // Roll back on any error. (A second ROLLBACK after an early rollback
            // above is a harmless no-op warning in PostgreSQL.)
            await client.query('ROLLBACK');
            if (error instanceof IdempotentResponseError) {
                // Not a failure: this signals a successful replay to the caller.
                throw error;
            }
            // You might update the idempotency record to FAILED here in a separate transaction.
            throw error;
        } finally {
            client.release();
        }
    }

    This wrapper is powerful. It ensures that the check, the business logic, and the persistence of the result are atomic. The FOR UPDATE clause locks the row, preventing another transaction from reading this idempotency record until the current one is complete, which is a crucial first step towards handling concurrency.
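
    As a usage sketch, here is how the Inventory Service might wrap a stock reservation; the stock table, payload fields, and handler name are assumptions for illustration:

    typescript
    // Hypothetical caller: the Inventory Service reserving stock idempotently.
    async function handleReserveStock(dbPool: Pool, message: any): Promise<void> {
        try {
            const result = await withIdempotency(
                dbPool,
                message.headers.idempotencyKey,
                'inventory-service:reserve-stock',
                message.payload,
                async (client) => {
                    // Runs on the same client/transaction as the idempotency record.
                    await client.query(
                        'UPDATE stock SET reserved = reserved + $1 WHERE sku = $2',
                        [message.payload.quantity, message.payload.sku]
                    );
                    return { status: 'RESERVED' };
                }
            );
            console.log('Stock reserved', result);
        } catch (error) {
            if (error instanceof IdempotentResponseError) {
                // A replay: the stored response is the result of the first run.
                console.log('Replay detected, returning stored result', error.response);
                return;
            }
            throw error;
        }
    }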


    Pattern 2: Distributed Locking for True Concurrency Control

    Our idempotency pattern is robust against message replays, but it has a subtle flaw. What if two different operations for the same entity arrive concurrently? For example:

    * Event 1: ReserveStock for orderId: 123

    * Event 2: CancelOrder for orderId: 123

    These will have different idempotency keys. If two different worker instances pick them up at the same time, they could both pass their respective idempotency checks and proceed to operate on the same order's data, leading to a race condition. The SERIALIZABLE transaction isolation level helps, but it might result in one transaction failing with a serialization error, which is not ideal. We need a way to ensure that only one process can operate on a given entity (orderId: 123) at any given time across our entire distributed service.

    This is a classic use case for a distributed lock.

    The Tool: Redis for High-Performance Locking

    Redis is an excellent choice for distributed locking due to its single-threaded nature and atomic commands like SET key value NX PX milliseconds. This command translates to: "Set key to value only if the key does not (NX) exist, and set an expiration time of milliseconds (PX)."

    Implementation: A Robust Redis Lock Manager

    While libraries like redlock exist, understanding the core principles is key. A naive implementation can be dangerous.

    typescript
    import Redis from 'ioredis';
    import { randomBytes } from 'crypto';
    
    export class DistributedLock {
        private readonly redis: Redis;
        private readonly key: string;
        private readonly lockValue: string;
        private readonly leaseTimeMs: number;
    
        constructor(redis: Redis, resourceId: string, leaseTimeMs: number = 30000) {
            this.redis = redis;
            this.key = `lock:${resourceId}`;
            this.leaseTimeMs = leaseTimeMs;
            // A unique value per lock attempt prevents releasing a lock acquired by another process.
            this.lockValue = randomBytes(16).toString('hex');
        }
    
        async acquire(): Promise<boolean> {
            const result = await this.redis.set(this.key, this.lockValue, 'PX', this.leaseTimeMs, 'NX');
            return result === 'OK';
        }
    
        async release(): Promise<boolean> {
            // Use a Lua script to ensure atomicity of the check-and-delete operation.
            // This prevents deleting a lock that was re-acquired by another process after our lock expired.
            const script = `
                if redis.call("get", KEYS[1]) == ARGV[1] then
                    return redis.call("del", KEYS[1])
                else
                    return 0
                end
            `;
            const result = await this.redis.eval(script, 1, this.key, this.lockValue);
            return result === 1;
        }
    }
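
    A quick usage sketch; the resource ID format and 30-second lease are illustrative:

    typescript
    // Sketch: guarding all work on one order behind the lock.
    async function withOrderLock(redisClient: Redis, orderId: string): Promise<void> {
        const lock = new DistributedLock(redisClient, `order-service:${orderId}`, 30000);
        if (!(await lock.acquire())) {
            // Another worker holds the lock: requeue the message or back off and retry.
            return;
        }
        try {
            // ... operate on the order
        } finally {
            await lock.release(); // always release, even when the work throws
        }
    }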

    Critical Design Points:

    * Lease Time (leaseTimeMs): This is a safety mechanism. If the process holding the lock crashes, the lock will eventually expire and be released, preventing a permanent deadlock. The lease time must be longer than the expected execution time of the business logic.

    * Unique Lock Value (lockValue): This is non-negotiable. Imagine this scenario:

    1. Process A acquires lock, starts work.

    2. Process A is paused by a long GC cycle, exceeding the lock lease time.

    3. The lock expires in Redis.

    4. Process B acquires the same lock.

    5. Process A wakes up, finishes its work, and releases the lock.

    If the release operation were a simple DEL key, Process A would have just released the lock belonging to Process B. The Lua script makes the release conditional: "Only delete the lock if it still has my unique value."

    The Grand Unification: Combining Idempotency and Locking

    Now we combine both patterns into a single, cohesive strategy. The distributed lock acts as a coarse-grained entry gate, ensuring single-threaded access to a specific entity. The idempotency check acts as a fine-grained guard within that locked section, preventing re-execution of the same operation.

    The Flow:

    1. Receive a message.

    2. Extract the logical entity identifier (e.g., orderId).

    3. Attempt to acquire a distributed lock for that entity ID.

    4. If the lock is acquired:

       a. Proceed with the withIdempotency wrapper function from Pattern 1.

       b. The wrapper handles the transactional idempotency check and executes the business logic.

       c. After completion (success or failure), release the distributed lock.

    5. If the lock is not acquired:

       a. The entity is being processed by another worker.

       b. The best strategy is often to requeue the message with a delay (if your message broker supports it) or to retry acquiring the lock with exponential backoff, as sketched below.
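
    Here is a minimal backoff sketch for that retry strategy; the attempt count and delay bounds are illustrative assumptions:

    typescript
    // Sketch: retry lock acquisition with exponential backoff plus jitter,
    // so competing workers do not wake up and collide in lockstep.
    export async function acquireWithBackoff(
        lock: DistributedLock,
        maxAttempts: number = 5
    ): Promise<boolean> {
        for (let attempt = 0; attempt < maxAttempts; attempt++) {
            if (await lock.acquire()) {
                return true;
            }
            // Exponential backoff capped at 10s, scaled by random jitter.
            const delayMs = Math.min(1000 * 2 ** attempt, 10000) * (0.5 + Math.random() / 2);
            await new Promise((resolve) => setTimeout(resolve, delayMs));
        }
        return false;
    }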

    The Final Production-Grade Implementation

    Let's create a final orchestrator function that puts it all together.

    typescript
    // Assuming previous imports for Pool, Redis, DistributedLock, withIdempotency, etc.
    
    interface ProcessMessageOptions<T> {
        dbPool: Pool;
        redisClient: Redis;
        message: {
            idempotencyKey: string;
            entityId: string;
            payload: object;
        };
        scope: string;
        businessLogic: (client: PoolClient) => Promise<T>;
        lockLeaseTimeMs?: number;
    }
    
    // A typed error lets consumers distinguish "please requeue" from real
    // failures without matching on error message strings.
    export class LockNotAcquiredError extends Error {
        constructor(entityId: string) {
            super(`Failed to acquire lock for entity ${entityId}. Requeue message.`);
            this.name = 'LockNotAcquiredError';
        }
    }

    export async function processSagaStep<T>(options: ProcessMessageOptions<T>): Promise<T> {
        const { dbPool, redisClient, message, scope, businessLogic, lockLeaseTimeMs = 30000 } = options;
    
        const lock = new DistributedLock(redisClient, `${scope}:${message.entityId}`, lockLeaseTimeMs);
    
        // 1. Attempt to acquire the lock
        const lockAcquired = await lock.acquire();
    
        if (!lockAcquired) {
            // This is a critical decision point. Requeueing is often best.
            throw new LockNotAcquiredError(message.entityId);
        }
    
        try {
            // 2. With the lock held, perform the idempotent operation
            return await withIdempotency(
                dbPool,
                message.idempotencyKey,
                scope,
                message.payload,
                businessLogic
            );
        } catch (error) {
            // If we get the special IdempotentResponseError, it means it's a successful replay.
            // We can extract the response and return it, treating it as a success.
            if (error instanceof IdempotentResponseError) {
                return error.response as T;
            }
            // For all other errors, re-throw them to be handled by the message consumer's error logic.
            throw error;
        } finally {
            // 3. ALWAYS ensure the lock is released
            await lock.release();
        }
    }
    
    // --- Example Usage in a Payment Service Consumer ---
    
    async function handleProcessPaymentEvent(message: any) {
        const paymentServiceLogic = async (dbClient: PoolClient) => {
            // ... charge the credit card via an external API
            // ... update payment status in our database using the provided dbClient
            const paymentId = 'payment-xyz-789';
            await dbClient.query('UPDATE payments SET status = \'COMPLETED\' WHERE order_id = $1', [message.entityId]);
            console.log(`Payment processed for order ${message.entityId}`);
            return { status: 'SUCCESS', paymentId };
        };
    
        try {
            const result = await processSagaStep({
                dbPool: myPostgresPool, // assume this is configured
                redisClient: myRedisClient, // assume this is configured
                message: {
                    idempotencyKey: message.headers.idempotencyKey, // from message
                    entityId: message.payload.orderId, // from message
                    payload: message.payload,
                },
                scope: 'payment-service:process-payment',
                businessLogic: paymentServiceLogic
            });
    
            // Acknowledge the message to the broker
            console.log('Successfully processed message', result);
    
        } catch (error) {
            if (error instanceof LockNotAcquiredError) {
                // Requeue the message
                console.warn(error.message);
            } else {
                // A real business logic failure. Move to Dead Letter Queue (DLQ).
                console.error('Critical error during payment processing:', error);
            }
        }
    }

    Advanced Considerations and Performance Tuning

    * Lock Contention and Lease Time: Setting the lock lease time is a trade-off. Too short, and long-running processes risk their locks expiring prematurely. Too long, and a crashed worker will hold up processing for an extended period. Monitor your P95 and P99 processing times and add a healthy buffer. For high-contention resources, this entire pattern may be too slow. Consider optimistic locking with version numbers in your database as an alternative if the cost of contention outweighs the cost of retrying a transaction (a minimal sketch follows this list).

    * Idempotency Store Cleanup: This table will grow indefinitely. A background job that deletes records older than a certain threshold (e.g., 30 days) is essential (see the cleanup sketch after this list). The threshold should be longer than the maximum possible lifetime of a saga instance.

    * Deadlocks in Multi-Lock Sagas: If a single saga step needs to lock multiple resources (e.g., updating both a user's balance and an item's inventory), you must always acquire the locks in a consistent, globally-defined order (e.g., alphabetically by resource ID) to prevent deadlocks (see the ordering sketch after this list).

    * Performance Impact: This pattern introduces latency. Each operation now involves at least one Redis round-trip and several database queries.

        * A Redis SET NX is extremely fast (sub-millisecond).

        * The database transaction adds overhead.

        * Benchmark: In a typical cloud environment, expect this wrapper to add 5-20ms of latency per call. This is usually a small price to pay for correctness; the alternative is data corruption that takes hours of expensive manual engineering time to reconcile.
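
    First, a minimal sketch of the optimistic-locking alternative mentioned above; the orders table, version column, and status value are assumptions for illustration:

    typescript
    // Sketch: optimistic concurrency via a version column. The UPDATE succeeds
    // only if no other writer has bumped the version since we read the row.
    async function confirmOrderOptimistically(
        client: PoolClient,
        orderId: string,
        expectedVersion: number
    ): Promise<boolean> {
        const res = await client.query(
            `UPDATE orders
             SET status = 'CONFIRMED', version = version + 1
             WHERE id = $1 AND version = $2`,
            [orderId, expectedVersion]
        );
        // Zero rows updated means a concurrent writer won: reload and retry.
        return res.rowCount === 1;
    }

    Next, a sketch of the cleanup job; the 30-day retention comes from the text, while the daily schedule is an assumption:

    typescript
    // Sketch: periodically purge idempotency records past their retention window.
    function startIdempotencyCleanup(dbPool: Pool): NodeJS.Timeout {
        return setInterval(async () => {
            await dbPool.query(
                "DELETE FROM idempotency_store WHERE created_at < NOW() - INTERVAL '30 days'"
            );
        }, 24 * 60 * 60 * 1000); // once a day
    }

    Finally, a sketch of deadlock-free multi-lock acquisition using the DistributedLock class from Pattern 2; the resource IDs are hypothetical:

    typescript
    // Sketch: acquire multiple locks in a globally consistent (sorted) order,
    // releasing everything already held if any single acquisition fails.
    async function acquireAll(redis: Redis, resourceIds: string[]): Promise<DistributedLock[] | null> {
        const locks = [...resourceIds].sort().map((id) => new DistributedLock(redis, id));
        const held: DistributedLock[] = [];
        for (const lock of locks) {
            if (await lock.acquire()) {
                held.push(lock);
            } else {
                // Contention: release in reverse order, then requeue or retry.
                for (const h of held.reverse()) await h.release();
                return null;
            }
        }
        return held;
    }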

    Conclusion

    Building resilient distributed systems requires moving beyond the happy path. The choreographed saga pattern, while powerful, is not inherently fault-tolerant. By systematically addressing the realities of at-least-once message delivery and concurrent processing, we can construct robust, production-ready services. The combination of a transactional idempotency store and a well-implemented distributed lock provides a formidable defense against the common pitfalls of event-driven architectures. While the implementation adds complexity, it provides correctness, auditability, and peace of mind, allowing your sagas to execute reliably even in the chaotic, unpredictable world of distributed systems.
