Idempotent Serverless Handlers via DynamoDB Conditional Writes

23 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Inevitability of Duplicate Events in Distributed Systems

In the world of distributed, event-driven architectures, the promise of 'exactly-once' message delivery is a siren's song—alluring but ultimately a myth in most practical, large-scale systems. Services like Amazon SQS, EventBridge, and SNS are built for resilience and scalability, and to achieve this, they offer an at-least-once delivery guarantee. This isn't a bug; it's a fundamental design trade-off. Acknowledging a message receipt can fail, network partitions can occur, or a consumer can crash post-processing but pre-acknowledgment. To prevent data loss, the broker's safest bet is to redeliver the message.

For a senior engineer, this means the responsibility for handling duplicates shifts from the infrastructure to our application code. A Lambda function triggered by an SQS queue might be invoked twice (or more) with the exact same event payload.

Consider a simple payment processing workflow:

  • A ProcessPayment event is pushed to an SQS queue.
    • A Lambda function consumes the event, calls a payment gateway API, and records the transaction in a database.
    • The Lambda successfully completes but, due to a transient network blip, fails to delete the message from the SQS queue before timing out.
    • SQS, seeing the message is still visible, redelivers it to another Lambda invocation.
    • The payment is processed a second time. The customer is double-charged.

    This is the classic failure mode that idempotency solves. An idempotent operation is one that can be applied multiple times without changing the result beyond the initial application. Our goal is to make our Lambda handlers idempotent, transforming the infrastructure's at-least-once guarantee into an effectively-once processing semantic at the application layer.

    This article presents a robust, production-grade pattern for achieving this using DynamoDB's conditional write capabilities, providing the atomicity and performance required for high-throughput serverless applications.

    The Idempotency Key Pattern with a Persistent Store

    The standard approach to idempotency involves an idempotency key. This is a unique client-generated identifier for a specific, repeatable operation. The server-side logic then uses this key to track the processing state of the request.

    The high-level workflow is as follows:

  • Receive Request: The handler receives an event containing a unique idempotencyKey.
  • Check State: The handler queries a persistent store using the idempotencyKey.
  • Decision Logic:
  • * Key Not Found: This is the first time we've seen this request. The handler marks the key as IN_PROGRESS in the store, executes the core business logic, and upon completion, updates the key's state to COMPLETED with the result.

    * Key Found (IN_PROGRESS): Another process is currently handling this request. The current invocation should terminate, perhaps returning a conflict error, to prevent duplicate work.

    * Key Found (COMPLETED): The request has already been successfully processed. The handler should not re-execute the business logic but instead return the stored result immediately.

    Why DynamoDB is the Superior Choice for an Idempotency Store

    For this pattern to work, the persistent store must provide atomic operations to prevent race conditions where two concurrent invocations might both decide they are the 'first' to see a key.

    * Relational Databases (e.g., PostgreSQL): Possible, but requires careful transaction management (SELECT FOR UPDATE) which can introduce locking contention and become a bottleneck at high concurrency.

    * In-Memory Caches (e.g., Redis): Very fast, but achieving the required atomicity and durability guarantees can be complex. A SETNX (set if not exists) operation is a good start, but the multi-step process (check, execute, update) requires more sophisticated locking or Lua scripting to be truly atomic.

    * DynamoDB: Natively designed for high-concurrency, low-latency workloads. Its key feature for our use case is Conditional Writes. A PutItem, UpdateItem, or DeleteItem operation can include a ConditionExpression which must evaluate to true for the write to succeed. This check and write occur as a single, atomic operation on the server side, providing the exact primitive we need to build a robust idempotency layer without complex locking.

    Deep Dive: The Two-Phase Atomic Write with DynamoDB

    We'll now construct the core logic. Our strategy relies on a two-phase process: first, atomically acquiring a 'lock' on the idempotency key, and second, updating the record with the final result after processing.

    1. DynamoDB Table Schema

    First, let's define our DynamoDB table. It will be simple but powerful. Let's call it IdempotencyStore.

    * idempotencyKey (Partition Key, String): The unique identifier for the operation. E.g., evt_id_12345 or a UUID from the event payload.

    * status (String): The current state of the operation. Can be IN_PROGRESS, COMPLETED, or FAILED.

    * expiry (Number): A Unix timestamp (in seconds) representing when the record should be considered stale. This is critical for handling function timeouts and will be configured as the table's Time to Live (TTL) attribute.

    * responseData (Map or String): The marshalled JSON response of the successfully completed business logic. Used to serve cached responses on retries.

    * invocationId (String): The AWS Lambda request ID of the invocation that currently holds the lock. Useful for debugging.

    Here's a sample AWS CLI command to create such a table:

    bash
    aws dynamodb create-table \
        --table-name IdempotencyStore \
        --attribute-definitions \
            AttributeName=idempotencyKey,AttributeType=S \
        --key-schema \
            AttributeName=idempotencyKey,KeyType=HASH \
        --billing-mode PAY_PER_REQUEST
    
    aws dynamodb update-time-to-live \
        --table-name IdempotencyStore \
        --time-to-live-specification "Enabled=true, AttributeName=expiry"

    2. Phase 1: Atomically Acquiring the Lock

    This is the most critical step. When a request comes in, we attempt to create a record for its idempotencyKey. We use a PutItem operation with a ConditionExpression that ensures the write only succeeds if no record for this key exists.

    However, we must also account for timed-out or crashed invocations that left a record in the IN_PROGRESS state indefinitely. Our condition must therefore allow a new invocation to take over if the existing lock has expired.

    The atomic condition is: attribute_not_exists(idempotencyKey) OR expiry < :now

    Let's implement this in TypeScript using the AWS SDK v3.

    typescript
    import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
    import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";
    
    const client = new DynamoDBClient({});
    const ddbDocClient = DynamoDBDocumentClient.from(client);
    
    const TABLE_NAME = 'IdempotencyStore';
    const LOCK_DURATION_SECONDS = 300; // e.g., Lambda timeout + buffer
    
    interface LockAcquisitionParams {
        idempotencyKey: string;
        invocationId: string;
    }
    
    async function acquireLock({ idempotencyKey, invocationId }: LockAcquisitionParams): Promise<boolean> {
        const now = Math.floor(Date.now() / 1000);
        const expiry = now + LOCK_DURATION_SECONDS;
    
        const command = new PutCommand({
            TableName: TABLE_NAME,
            Item: {
                idempotencyKey,
                status: 'IN_PROGRESS',
                expiry,
                invocationId,
            },
            ConditionExpression: 'attribute_not_exists(idempotencyKey) OR expiry < :now',
            ExpressionAttributeValues: {
                ':now': now,
            },
        });
    
        try {
            await ddbDocClient.send(command);
            console.log(`Lock acquired for key: ${idempotencyKey}`);
            return true; // Lock acquired successfully
        } catch (error: any) {
            if (error.name === 'ConditionalCheckFailedException') {
                console.warn(`Failed to acquire lock for key: ${idempotencyKey}. Another process may be running.`);
                return false; // Lock is held by another invocation
            } else {
                console.error('Error acquiring lock:', error);
                throw error; // Rethrow unexpected errors
            }
        }
    }

    In this code:

    * We construct a PutCommand to insert a new item with the status IN_PROGRESS.

    * The ConditionExpression is the heart of the atomicity. DynamoDB guarantees that it will check this condition and perform the write as a single, indivisible operation.

    * We explicitly catch ConditionalCheckFailedException. This is not an error in the traditional sense; it's the expected outcome when another concurrent invocation has already acquired the lock. It's our signal that we should not proceed.

    3. Phase 2: Business Logic Execution

    If acquireLock returns true, we can safely execute our core business logic. This logic should be completely unaware of the idempotency layer.

    typescript
    async function processPayment(orderId: string, amount: number) {
        // Imagine this calls Stripe, updates an RDS database, etc.
        console.log(`Processing payment for order ${orderId} for amount ${amount}`);
        // Simulate work
        await new Promise(resolve => setTimeout(resolve, 1000));
        return { transactionId: `txn_${Math.random().toString(36).substr(2, 9)}` };
    }

    4. Phase 3: Finalizing the State

    Once the business logic completes (either successfully or with an error), we must update the record in DynamoDB to reflect the final state. We use an UpdateItem command.

    typescript
    import { UpdateCommand } from "@aws-sdk/lib-dynamodb";
    
    interface FinalizeParams {
        idempotencyKey: string;
        status: 'COMPLETED' | 'FAILED';
        responseData?: any;
    }
    
    async function finalizeState({ idempotencyKey, status, responseData }: FinalizeParams) {
        const command = new UpdateCommand({
            TableName: TABLE_NAME,
            Key: {
                idempotencyKey,
            },
            UpdateExpression: 'SET #status = :status, #responseData = :responseData',
            ExpressionAttributeNames: {
                '#status': 'status',
                '#responseData': 'responseData',
            },
            ExpressionAttributeValues: {
                ':status': status,
                ':responseData': responseData || null,
            },
        });
    
        try {
            await ddbDocClient.send(command);
            console.log(`Finalized state for key ${idempotencyKey} to ${status}`);
        } catch (error) {
            console.error('Error finalizing state:', error);
            // Critical failure. May need alarming/monitoring.
            // If this fails, the lock will eventually expire, allowing a retry.
            throw error;
        }
    }

    Handling Advanced Edge Cases and Production Scenarios

    A basic implementation is a good start, but production systems are messy. Here's how this pattern holds up against real-world failures.

    Edge Case 1: The Retry/Concurrent Request

    This is the primary scenario we're solving. A second invocation with the same idempotencyKey arrives while the first is IN_PROGRESS.

  • Invocation B calls acquireLock.
  • The PutCommand's ConditionExpression evaluates to false because an item with the key already exists and its expiry is not less than :now.
  • DynamoDB throws a ConditionalCheckFailedException.
  • Our acquireLock function catches this and returns false.
  • What should Invocation B do now? It knows the operation is already in flight. The correct behavior is to read the existing record to determine its status.

    typescript
    // In your main handler, after acquireLock returns false
    const existingRecord = await getRecord(idempotencyKey);
    
    if (existingRecord) {
        if (existingRecord.status === 'COMPLETED') {
            // Success! Return the cached response.
            return existingRecord.responseData;
        } else if (existingRecord.status === 'IN_PROGRESS') {
            // Another process is working. Return a conflict error.
            // This signals to the client/caller to try again later.
            throw new Error('Conflict: Operation in progress.'); // Or return a specific HTTP status code like 409
        }
    }

    This prevents redundant processing and provides a consistent response for completed operations.

    Edge Case 2: Lambda Timeout After Lock Acquisition

    This is where the expiry TTL attribute becomes invaluable.

  • Invocation A acquires the lock, setting an expiry of now + 300 seconds.
    • It begins executing the business logic.
  • The Lambda function times out after its configured limit (e.g., 60 seconds) before it can call finalizeState.
  • The record in DynamoDB is left with status: 'IN_PROGRESS'. The lock is effectively orphaned.
  • A retry mechanism (e.g., SQS redrive) sends the same event again a few minutes later, starting Invocation B.
  • Invocation B calls acquireLock. The current time now is greater than the expiry timestamp left by Invocation A.
  • The ConditionExpression (... OR expiry < :now) evaluates to true.
  • The PutCommand succeeds, overwriting the stale record from Invocation A and acquiring a new lock.
  • Key Takeaway: The lock expiry should always be set to a value slightly greater than your function's timeout configuration to prevent a valid, running invocation from having its lock stolen.

    Edge Case 3: Partial Failure in Business Logic

    If the business logic itself fails (e.g., a third-party API is down), we should not leave the record IN_PROGRESS.

    typescript
    try {
        const result = await processPayment(...);
        await finalizeState({
            idempotencyKey,
            status: 'COMPLETED',
            responseData: result
        });
        return result;
    } catch (businessError) {
        await finalizeState({
            idempotencyKey,
            status: 'FAILED',
            responseData: { error: businessError.message }
        });
        // Rethrow the error so the event source (e.g., SQS) knows
        // the processing failed and can handle redrives/DLQ.
        throw businessError;
    }

    By marking the record as FAILED, we create a decision point for future retries. You might decide that a FAILED status means the operation is terminal and should not be retried, or you might allow a retry after a certain backoff period (by having the lock acquisition logic also check for a FAILED status and an updatedAt timestamp).

    A Complete, Reusable Production Implementation

    To make this pattern easy to use across many Lambda functions, we can encapsulate the entire logic in a higher-order function or a class.

    Here is a simplified but powerful IdempotencyHandler class in TypeScript.

    typescript
    import { 
        DynamoDBClient 
    } from "@aws-sdk/client-dynamodb";
    import { 
        DynamoDBDocumentClient, 
        PutCommand, 
        GetCommand, 
        UpdateCommand 
    } from "@aws-sdk/lib-dynamodb";
    
    // --- Configuration ---
    const TABLE_NAME = process.env.IDEMPOTENCY_TABLE_NAME || 'IdempotencyStore';
    const LOCK_DURATION_SECONDS = 60; // Should be > Lambda timeout
    
    // --- AWS SDK Clients ---
    const client = new DynamoDBClient({});
    const ddbDocClient = DynamoDBDocumentClient.from(client);
    
    // --- Types ---
    type IdempotencyRecordStatus = 'IN_PROGRESS' | 'COMPLETED' | 'FAILED';
    
    interface IdempotencyRecord {
        idempotencyKey: string;
        status: IdempotencyRecordStatus;
        expiry: number;
        invocationId: string;
        responseData?: any;
    }
    
    class IdempotencyHandler {
        private tableName: string;
        private lockDuration: number;
    
        constructor(tableName: string, lockDuration: number) {
            this.tableName = tableName;
            this.lockDuration = lockDuration;
        }
    
        private async getRecord(key: string): Promise<IdempotencyRecord | undefined> {
            const command = new GetCommand({ TableName: this.tableName, Key: { idempotencyKey: key } });
            const result = await ddbDocClient.send(command);
            return result.Item as IdempotencyRecord | undefined;
        }
    
        private async saveRecord(record: IdempotencyRecord, isFinalState: boolean = false): Promise<void> {
            // For final state, we use Update to be more targeted
            if (isFinalState) {
                const command = new UpdateCommand({
                    TableName: this.tableName,
                    Key: { idempotencyKey: record.idempotencyKey },
                    UpdateExpression: 'SET #status = :status, #responseData = :responseData',
                    ExpressionAttributeNames: { '#status': 'status', '#responseData': 'responseData' },
                    ExpressionAttributeValues: { ':status': record.status, ':responseData': record.responseData || null },
                });
                await ddbDocClient.send(command);
            } else { // For initial lock acquisition, we use Put with Condition
                const now = Math.floor(Date.now() / 1000);
                const command = new PutCommand({
                    TableName: this.tableName,
                    Item: record,
                    ConditionExpression: 'attribute_not_exists(idempotencyKey) OR expiry < :now',
                    ExpressionAttributeValues: { ':now': now },
                });
                await ddbDocClient.send(command);
            }
        }
    
        public async handle<TEvent, TResult>(
            event: TEvent,
            context: any, // Lambda context for invocationId
            idempotencyKeyExtractor: (event: TEvent) => string,
            businessLogic: (event: TEvent) => Promise<TResult>
        ): Promise<TResult> {
            const idempotencyKey = idempotencyKeyExtractor(event);
            const invocationId = context.awsRequestId;
    
            // 1. Check for an existing completed record
            const existingRecord = await this.getRecord(idempotencyKey);
            if (existingRecord && existingRecord.status === 'COMPLETED') {
                console.log(`[IDEMPOTENCY] Key ${idempotencyKey} already completed. Returning cached response.`);
                return existingRecord.responseData as TResult;
            }
    
            // 2. Try to acquire the lock
            const now = Math.floor(Date.now() / 1000);
            const lockRecord: IdempotencyRecord = {
                idempotencyKey,
                status: 'IN_PROGRESS',
                expiry: now + this.lockDuration,
                invocationId,
            };
    
            try {
                await this.saveRecord(lockRecord);
            } catch (error: any) {
                if (error.name === 'ConditionalCheckFailedException') {
                    console.warn(`[IDEMPOTENCY] Lock for ${idempotencyKey} is already held. Aborting.`);
                    // This is a graceful exit. Depending on requirements, you might throw an error
                    // that translates to a 409 Conflict or 429 Too Many Requests.
                    throw new Error(`Operation with key ${idempotencyKey} is already in progress.`);
                } else {
                    throw error; // Unexpected DynamoDB error
                }
            }
    
            // 3. Execute business logic
            try {
                const result = await businessLogic(event);
    
                // 4a. Save successful result
                const finalRecord: IdempotencyRecord = { ...lockRecord, status: 'COMPLETED', responseData: result };
                await this.saveRecord(finalRecord, true);
                return result;
            } catch (error: any) {
                // 4b. Mark as failed
                const finalRecord: IdempotencyRecord = { 
                    ...lockRecord, 
                    status: 'FAILED', 
                    responseData: { error: error.message, stack: error.stack } 
                };
                await this.saveRecord(finalRecord, true);
                throw error; // Propagate error to let Lambda/SQS handle retries
            }
        }
    }
    
    // --- Example Usage in a Lambda Handler ---
    
    const idempotencyHandler = new IdempotencyHandler(TABLE_NAME, LOCK_DURATION_SECONDS);
    
    interface SQSBody {
        orderId: string;
        amount: number;
        idempotencyKey: string;
    }
    
    // The actual business logic is clean and separate
    async function myBusinessLogic(event: SQSBody) {
        console.log(`Processing payment for order ${event.orderId}`);
        await new Promise(resolve => setTimeout(resolve, 500)); // Simulate work
        return { status: 'SUCCESS', transactionId: `txn_${event.orderId}` };
    }
    
    // The Lambda handler
    export const handler = async (event: any, context: any) => {
        // Assuming an SQS event with a single record
        const body: SQSBody = JSON.parse(event.Records[0].body);
    
        return idempotencyHandler.handle(
            body, 
            context,
            (evt) => evt.idempotencyKey, // How to extract the key
            myBusinessLogic // The function to execute idempotently
        );
    };

    Required IAM Policy

    Your Lambda's execution role will need the following permissions for the idempotency table:

    json
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "dynamodb:PutItem",
                    "dynamodb:GetItem",
                    "dynamodb:UpdateItem"
                ],
                "Resource": "arn:aws:dynamodb:REGION:ACCOUNT_ID:table/IdempotencyStore"
            }
        ]
    }

    Performance and Cost Considerations

    * Latency: This pattern adds latency to your function's execution time. Specifically:

    * Cold Path (New Key): One GetItem (typically misses) + one PutItem (to lock) + one UpdateItem (to finalize). This is roughly 2 writes and 1 read. Expect an additional 5-20ms of latency depending on your region.

    * Warm Path (Completed Key): One GetItem. This is the fastest path, adding only a few milliseconds.

    * Conflict Path: One GetItem + one failed PutItem. Again, very fast.

    * Cost: With DynamoDB's On-Demand pricing, the cost is minimal but not zero. For 1 million invocations on the cold path, you'd incur roughly 1M reads and 2M writes. At current us-east-1 prices (~$1.25 per million writes, ~$0.25 per million reads), this idempotency layer would cost approximately $2.75 per million transactions. This is often a tiny price to pay for data integrity.

    * Payload Size: Be mindful of the size of responseData. DynamoDB items have a 400 KB limit. If your responses are large, consider storing only a reference (e.g., an S3 object key) in the idempotency record.

    Conclusion: From Theory to Resilient Practice

    By leveraging DynamoDB's atomic conditional writes, we can build a highly reliable and performant idempotency layer for our serverless applications. This pattern is not just a theoretical exercise; it is a battle-tested approach used in high-stakes systems to prevent the costly errors that arise from duplicate event processing. It elegantly transforms the 'at-least-once' guarantee of our message brokers into the 'effectively-once' behavior our business logic demands.

    For teams already invested in the AWS ecosystem, it's worth noting that the AWS Lambda Powertools for TypeScript library provides a production-ready, well-maintained implementation of this exact pattern. Whether you choose to build your own lightweight wrapper for full control or adopt a library like Powertools, understanding the underlying mechanics of conditional writes, state management, and timeout handling is crucial for any senior engineer building resilient, event-driven systems in the cloud.

    Found this article helpful?

    Share it with others who might benefit from it.

    More Articles