DynamoDB Idempotency: Conditional Writes for Resilient APIs

October 13, 2025


Goh Ling Yong

Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The High Cost of Non-Idempotent APIs in Distributed Systems

In a distributed architecture, the guarantee of exactly-once message delivery or processing is a fallacy. Network partitions, client-side retry logic, and downstream service timeouts conspire to create scenarios where a single logical operation is attempted multiple times. For read operations, this is often benign. For state-changing mutations—processing a payment, creating an order, or provisioning a resource—the consequences are severe: duplicate charges, multiple identical orders, and inconsistent system state.

Idempotency is the property of an operation that ensures it can be applied multiple times without changing the result beyond the initial application. A senior engineer's responsibility is not just to implement the "happy path" but to architect systems that are resilient to the inherent unreliability of distributed communication. Naive solutions, such as a SELECT followed by an INSERT in a relational database, are fundamentally broken due to race conditions. A client request could be checked for existence by two concurrent server processes, both finding no record, and both proceeding to execute the mutation.
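The race described above can be made concrete with a small, deterministic sketch. It uses a plain dict as a stand-in for a database, and the interleaving is scripted by hand rather than produced by real threads. The names `run_naive`, `put_if_absent`, and `run_atomic` are hypothetical and exist only for this illustration.

```python
# A dict stands in for the database table; this is an illustration, not a driver.

def run_naive(store: dict, key: str) -> int:
    """Check-then-insert with two workers whose checks both run first.

    Returns how many times the business logic would have executed.
    """
    executions = 0
    # Step 1: both workers perform the existence check before either writes.
    worker_a_found = key in store
    worker_b_found = key in store
    # Step 2: each worker acts on its (now stale) check result.
    for found in (worker_a_found, worker_b_found):
        if not found:
            store[key] = "IN_PROGRESS"
            executions += 1  # the mutation would run here
    return executions


def put_if_absent(store: dict, key: str) -> bool:
    """Model of a conditional write: the check and the write are one step."""
    if key in store:
        return False
    store[key] = "IN_PROGRESS"
    return True


def run_atomic(store: dict, key: str) -> int:
    """Two workers using the combined check-and-write; only one can win."""
    executions = 0
    for _worker in ("A", "B"):
        if put_if_absent(store, key):
            executions += 1
    return executions
```

With separate check and write steps, both workers execute the mutation. With a combined step, the second worker is rejected no matter how the requests interleave, which is the guarantee a conditional write provides on the server side.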

This is where DynamoDB's atomic conditional operations provide a powerful, scalable, and cost-effective solution. By leveraging a ConditionExpression, we can combine the check-and-write into a single, atomic API call. This post dissects a production-ready pattern for implementing idempotency using DynamoDB, moving from table design to a complete state-machine implementation with robust error handling and performance considerations.

The Idempotency Record: Table Design and TTL Strategy

Our foundation is a dedicated DynamoDB table to track the state of each idempotent request. The design of this table is critical for both functionality and performance.

Core Schema

The table requires a minimal set of attributes to manage the idempotency lifecycle:

* idempotencyKey (String, Partition Key): A client-generated unique identifier for the operation. A UUIDv4 is an excellent choice as it ensures high cardinality, leading to even data distribution across DynamoDB's partitions and avoiding hot spots.

* status (String): Tracks the state of the operation. The state machine will typically use IN_PROGRESS, COMPLETED, or FAILED.

* expiry (Number, TTL Attribute): A Unix timestamp representing when the record should be automatically deleted by DynamoDB's Time To Live (TTL) feature. This is crucial for garbage collection and cost management.

* responsePayload (String or Map): Stores the serialized response of the original successful operation. When a duplicate request is detected, this payload is returned directly, ensuring the client receives the same result without re-executing the business logic.

* lockExpiry (Number): An optional but highly recommended attribute. This is a timestamp to prevent a request from being stuck in the IN_PROGRESS state indefinitely if the processing server crashes. We'll explore its use in handling partial failures.
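As a rough illustration of the schema above, the following sketch builds the record at the two main points of its lifecycle. The helper names `new_record` and `completed_record` are hypothetical, and the two durations are assumed values (24 hours of retention and a 5-minute lock), not requirements.

```python
import time
from typing import Any, Dict, Optional

IDEMPOTENCY_TTL_SECONDS = 24 * 60 * 60  # assumed retention window
LOCK_TIMEOUT_SECONDS = 5 * 60           # assumed upper bound on processing time


def new_record(idempotency_key: str, now: Optional[int] = None) -> Dict[str, Any]:
    """Record written when a request first claims its idempotency key."""
    now = int(time.time()) if now is None else now
    return {
        "idempotencyKey": idempotency_key,           # partition key
        "status": "IN_PROGRESS",
        "expiry": now + IDEMPOTENCY_TTL_SECONDS,     # TTL attribute (epoch seconds)
        "lockExpiry": now + LOCK_TIMEOUT_SECONDS,    # stale-lock detection
    }


def completed_record(record: Dict[str, Any], response_payload: Dict[str, Any]) -> Dict[str, Any]:
    """Copy of the record after the business logic has succeeded."""
    done = dict(record)
    done["status"] = "COMPLETED"
    done["responsePayload"] = response_payload
    return done
```

Note that `expiry` and `lockExpiry` serve different purposes: the first bounds how long a duplicate can still be recognized, the second bounds how long a single attempt may hold the key.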

Infrastructure as Code (AWS CDK - TypeScript)

Defining this table using an IaC tool ensures it's version-controlled and reproducible. Here's an example using the AWS CDK in TypeScript:

typescript

import * as cdk from 'aws-cdk-lib';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';

export class IdempotencyStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const idempotencyTable = new dynamodb.Table(this, 'IdempotencyTable', {
      tableName: 'ApiIdempotencyStore',
      partitionKey: {
        name: 'idempotencyKey',
        type: dynamodb.AttributeType.STRING,
      },
      // On-Demand is often a good choice for unpredictable, spiky API traffic.
      // For predictable workloads, provisioned throughput might be more cost-effective.
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
      
      // Enable DynamoDB's automated TTL feature for garbage collection.
      timeToLiveAttribute: 'expiry',
      
      // This is a critical production setting. Point-in-time recovery allows you
      // to restore your table to any point in the last 35 days.
      pointInTimeRecovery: true,
      
      // DESTROY is convenient for dev environments. In production, prefer
      // cdk.RemovalPolicy.RETAIN so that deleting the stack cannot drop the table.
      removalPolicy: cdk.RemovalPolicy.DESTROY,
    });

    new cdk.CfnOutput(this, 'IdempotencyTableName', {
      value: idempotencyTable.tableName,
    });
  }
}

Key Design Choices:

* Billing Mode: PAY_PER_REQUEST (On-Demand) is chosen here because idempotency checks often correlate with user-facing API calls, which can be spiky. It eliminates the need for capacity planning for this specific table.

* TTL (timeToLiveAttribute): We explicitly enable TTL on the expiry attribute. The TTL value itself (e.g., 24 hours, 7 days) is a business decision. It should be long enough to handle legitimate client retries but short enough to avoid indefinite storage costs. A 24-hour window is a common starting point.

* Point-in-Time Recovery (PITR): Non-negotiable for production tables. It provides a safety net against accidental data deletion or corruption.
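One caveat on TTL: DynamoDB deletes expired items in the background, so an item can remain readable for some time after its expiry timestamp has passed. A reader that wants the retention window to be exact can filter on the timestamp itself. The helper below is a minimal sketch of that filter; `is_live` is a hypothetical name, and it assumes the `expiry` attribute holds epoch seconds as in the schema above.

```python
import time
from typing import Any, Mapping, Optional


def is_live(item: Optional[Mapping[str, Any]], now: Optional[int] = None) -> bool:
    """Return True only if the item exists and its TTL has not yet passed.

    Items past their expiry are treated as absent even if the background
    TTL process has not physically deleted them yet.
    """
    if item is None:
        return False
    now = int(time.time()) if now is None else now
    expiry = item.get("expiry")
    if expiry is None:
        # No TTL attribute: the item is never expired by TTL.
        return True
    # int() also accepts the Decimal values that boto3 returns for numbers.
    return int(expiry) > now
```

Applying this check to the result of a `get_item` call keeps a lingering, logically expired record from being mistaken for a live duplicate.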

The Atomic Write-Execute-Update Flow

This pattern can be modeled as a state machine. A request with an idempotency key transitions through states, and DynamoDB's conditional writes act as the atomic gatekeeper for these transitions.

Let's implement this logic in Python using boto3, structured as a decorator for reusability across a Flask/FastAPI application.

State 1: Attempting to Acquire the Lock

The first step is to atomically create a record in the IN_PROGRESS state. This acts as a distributed lock for this specific idempotencyKey.

The magic is in the ConditionExpression: attribute_not_exists(idempotencyKey).

python

import boto3
import time
import logging
from botocore.exceptions import ClientError

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('ApiIdempotencyStore')

# Constants
IDEMPOTENCY_TTL_SECONDS = 24 * 60 * 60  # 24 hours
LOCK_TIMEOUT_SECONDS = 300 # 5 minutes

def acquire_lock(idempotency_key: str):
    """
    Attempts to create an 'IN_PROGRESS' record for the given key.
    This is an atomic operation that serves as a distributed lock.
    """
    current_time = int(time.time())
    try:
        table.put_item(
            Item={
                'idempotencyKey': idempotency_key,
                'status': 'IN_PROGRESS',
                'expiry': current_time + IDEMPOTENCY_TTL_SECONDS,
                'lockExpiry': current_time + LOCK_TIMEOUT_SECONDS
            },
            # This is the core of the atomic lock acquisition.
            # The write will only succeed if an item with this partition key does not already exist.
            ConditionExpression='attribute_not_exists(idempotencyKey)'
        )
        logger.info(f"Successfully acquired lock for key: {idempotency_key}")
        return True
    except ClientError as e:
        if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
            # This is an expected failure condition, not an error.
            # It means another request with the same key is already being processed or has completed.
            logger.warning(f"Lock acquisition failed for key: {idempotency_key}. Key already exists.")
            return False
        else:
            # Any other exception is unexpected and should be raised.
            logger.error(f"Unexpected DynamoDB error acquiring lock for key {idempotency_key}: {e}")
            raise

If this put_item call succeeds, the current process has successfully acquired the lock and can proceed to execute the business logic. If it fails with ConditionalCheckFailedException, it means we've detected a duplicate request.
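A possible refinement of the condition, shown here only as a sketch: because TTL deletion is not immediate, a record whose `expiry` has already passed can still block a new request. Extending the condition lets such a record be overwritten. The function `build_lock_put` is a hypothetical helper that only assembles the keyword arguments; the default durations are assumptions, and the `#exp` placeholder is used simply to keep the attribute name out of the expression text.

```python
from typing import Any, Dict


def build_lock_put(
    idempotency_key: str,
    now: int,
    ttl_seconds: int = 24 * 60 * 60,  # assumed retention window
    lock_seconds: int = 300,          # assumed lock timeout
) -> Dict[str, Any]:
    """Keyword arguments for a conditional put_item that claims the key.

    The write succeeds if no record exists, or if the existing record is
    already past its TTL but has not been physically deleted yet.
    """
    return {
        "Item": {
            "idempotencyKey": idempotency_key,
            "status": "IN_PROGRESS",
            "expiry": now + ttl_seconds,
            "lockExpiry": now + lock_seconds,
        },
        "ConditionExpression": "attribute_not_exists(idempotencyKey) OR #exp < :now",
        "ExpressionAttributeNames": {"#exp": "expiry"},
        "ExpressionAttributeValues": {":now": now},
    }
```

It would be used as `table.put_item(**build_lock_put(key, int(time.time())))`, with the same `ConditionalCheckFailedException` handling as in `acquire_lock`.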

State 2: Handling a Duplicate Request

When the lock acquisition fails, we must inspect the existing record to determine the correct action.

python

def handle_duplicate_request(idempotency_key: str):
    """
    Fetches the existing idempotency record to decide the next action.
    """
    try:
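        # Use a strongly consistent read: the default eventually consistent read
        # may not yet reflect the record that caused the conditional write to fail.
        response = table.get_item(
            Key={'idempotencyKey': idempotency_key},
            ConsistentRead=True
        )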
        item = response.get('Item')

        if not item:
            # This is a rare edge case. The lock acquisition failed, but the item is now gone.
            # This could happen if the item's TTL expired between the failed put and this get.
            # The safest action is to treat it as a transient failure and ask the client to retry.
            logger.warning(f"Idempotency key {idempotency_key} disappeared after lock failure. Potential race condition or TTL expiry.")
            return {'status_code': 409, 'body': {'error': 'Concurrent request detected, please retry'}}

        status = item.get('status')
        current_time = int(time.time())

        if status == 'COMPLETED':
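            # The original request was successful. Return the saved response.
            # Note: this always answers 200, even if the original call returned another
            # 2xx code such as 201; persist the status code too if clients rely on it.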
            logger.info(f"Duplicate request for completed key {idempotency_key}. Returning cached response.")
            return {'status_code': 200, 'body': item.get('responsePayload', {})}
        
        elif status == 'IN_PROGRESS':
            # Another process is currently working on this request.
            # We check the lockExpiry to see if the lock has timed out.
            if 'lockExpiry' in item and item['lockExpiry'] < current_time:
                logger.warning(f"Stale lock detected for key {idempotency_key}. The original process may have crashed.")
                # Business decision: either fail the request or attempt to take over the lock.
                # For now, we'll fail it to be safe.
                return {'status_code': 500, 'body': {'error': 'Request processing timed out'}}
            else:
                # The lock is still valid. Tell the client to wait and retry.
                logger.info(f"Duplicate request for in-progress key {idempotency_key}. Responding with conflict.")
                return {'status_code': 409, 'body': {'error': 'Request in progress'}}

        elif status == 'FAILED':
            # The original request failed. Depending on the business logic, we might allow a retry.
            # For this generic pattern, we'll allow a new attempt by treating it like a new request.
            # A more advanced implementation might require a new idempotency key.
            logger.info(f"Duplicate request for failed key {idempotency_key}. Allowing retry.")
            return None # Signal to retry the entire process
            
    except ClientError as e:
        logger.error(f"DynamoDB error handling duplicate request for key {idempotency_key}: {e}")
        # Propagate a server error as we cannot determine the state.
        return {'status_code': 500, 'body': {'error': 'Internal server error'}}

This logic is nuanced:

* COMPLETED: The ideal idempotent case. We fetch the stored response and return it, completely bypassing the expensive business logic.

* IN_PROGRESS: A concurrent request is active, so we check the lockExpiry. While the lock is still valid, we return a 409 Conflict, prompting the client to retry with backoff. If the lock is stale (meaning the original process likely crashed), we face a critical decision: fail the request, as the code above does with a 500, or attempt to take over the lock, which is more complex and riskier.

* FAILED: The previous attempt failed. The business may decide to allow a retry with the same key. Here, we signal that the process should restart from the lock acquisition phase. Note that acquire_lock, as written, will still be rejected while the FAILED record exists, so the record must be deleted or conditionally overwritten before the retry can succeed.
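One way to let a retry reuse a FAILED key without a separate delete is a conditional update that flips the record back to IN_PROGRESS only if it is still FAILED. The sketch below just assembles the arguments for such a call; `build_failed_reclaim` is a hypothetical helper and the default durations are assumptions.

```python
from typing import Any, Dict


def build_failed_reclaim(
    idempotency_key: str,
    now: int,
    ttl_seconds: int = 24 * 60 * 60,  # assumed retention window
    lock_seconds: int = 300,          # assumed lock timeout
) -> Dict[str, Any]:
    """Keyword arguments for update_item that re-claims a FAILED record.

    The condition makes the transition FAILED -> IN_PROGRESS atomic, so if
    several retries arrive together only one of them wins the record.
    """
    return {
        "Key": {"idempotencyKey": idempotency_key},
        "UpdateExpression": "SET #status = :in_progress, #exp = :exp, lockExpiry = :lock",
        "ConditionExpression": "#status = :failed",
        "ExpressionAttributeNames": {"#status": "status", "#exp": "expiry"},
        "ExpressionAttributeValues": {
            ":in_progress": "IN_PROGRESS",
            ":failed": "FAILED",
            ":exp": now + ttl_seconds,
            ":lock": now + lock_seconds,
        },
    }
```

A caller would pass the result to `table.update_item(**kwargs)` and treat `ConditionalCheckFailedException` as "another retry got there first". Because the condition compares an attribute value, it also fails when no record exists, so the update cannot create a record by accident.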

State 3 & 4: Executing Logic and Finalizing the State

If the lock is acquired, the core business logic runs. Afterward, we must update the idempotency record to COMPLETED or FAILED.

python

import json
from decimal import Decimal

def update_record(idempotency_key: str, status: str, response_payload: dict = None):
    """
    Updates the idempotency record to a final state (COMPLETED or FAILED).
    """
    logger.info(f"Updating record for key {idempotency_key} to status: {status}")
    try:
        update_expression = "SET #status = :status, #expiry = :expiry"
        expression_attribute_names = {
            '#status': 'status',
            '#expiry': 'expiry'
        }
        expression_attribute_values = {
            ':status': status,
            ':expiry': int(time.time()) + IDEMPOTENCY_TTL_SECONDS
        }

        if response_payload is not None:
            update_expression += ", #payload = :payload"
            expression_attribute_names['#payload'] = 'responsePayload'
            # boto3 rejects Python floats, so round-trip through JSON to turn
            # any floats in the payload into Decimal values.
            expression_attribute_values[':payload'] = json.loads(
                json.dumps(response_payload), parse_float=Decimal
            )

        table.update_item(
            Key={'idempotencyKey': idempotency_key},
            UpdateExpression=update_expression,
            ExpressionAttributeNames=expression_attribute_names,
            ExpressionAttributeValues=expression_attribute_values
        )
    except ClientError as e:
        # This update should ideally never fail. If it does, we have an inconsistent state.
        # This requires immediate alerting and investigation.
        logger.critical(f"FATAL: Failed to update idempotency record for key {idempotency_key} to status {status}. Manual intervention required. Error: {e}")
        # We do not re-raise, as the business logic has already completed.
        # The record stays IN_PROGRESS, so retries get a 409 until lockExpiry passes
        # and a stale-lock error after that; the cached response is never served.

Critical Note: The failure to update the record to COMPLETED after successful business logic execution is a significant failure mode. The record remains IN_PROGRESS, so subsequent retries receive a 409 until the lockExpiry is hit and a stale-lock error afterwards, and they never see the cached response. Once TTL finally removes the record, a late retry would re-execute the business logic. This scenario should trigger high-priority alerts for manual intervention.
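One way to shrink this window is to retry the finalizing write a few times before giving up. boto3 already retries some errors internally, so the helper below is only a sketch of an additional application-level layer. `retry_with_backoff` is a hypothetical name, and the attempt count and delays are assumed values. It would need to wrap the raw `table.update_item` call, since `update_record` as written swallows `ClientError`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 3,          # assumed number of tries
    base_delay: float = 0.1,    # assumed first delay in seconds
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or the attempts are exhausted.

    The delay doubles after every failed attempt. The last failure is
    re-raised so the caller can alert on it.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the helper can be tested without waiting. In production code, `retry_on` would typically be narrowed to the exception types that are actually transient.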

A Production-Grade Decorator

Let's tie this all together into a reusable Python decorator. This abstracts the complexity away from the individual API endpoint logic.

python

from functools import wraps
import traceback

def idempotent(key_location: str):
    """
    A decorator to make a function idempotent based on a key in the request.
    :param key_location: A string indicating where to find the idempotency key, 
                         e.g., 'json.idempotency_key' or 'headers.Idempotency-Key'.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # This is a simplified example of extracting the key from a request object.
            # In a real framework (Flask, FastAPI), you'd access the request context.
            # For this example, let's assume the request object is the first arg.
            try:
                # Inside the try so that a missing positional argument (IndexError)
                # is reported as a 400 instead of escaping the decorator.
                request_obj = args[0]

                # Poor man's object path resolver
                parts = key_location.split('.')
                value = request_obj
                for part in parts:
                    if isinstance(value, dict):
                        value = value.get(part)
                    else:
                        value = getattr(value, part)
                idempotency_key = value
                if not idempotency_key:
                    raise ValueError("Idempotency key not found or is empty.")
            except (AttributeError, ValueError, IndexError) as e:
                logger.error(f"Could not extract idempotency key from location '{key_location}': {e}")
                return {'status_code': 400, 'body': {'error': 'Idempotency key is missing or invalid.'}}

            # --- Main Idempotency Flow ---
            if not acquire_lock(idempotency_key):
                # Lock acquisition failed, handle the duplicate request.
                duplicate_response = handle_duplicate_request(idempotency_key)
                if duplicate_response is not None:
                    return duplicate_response
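                # If handle_duplicate_request returns None (e.g., for a FAILED state),
                # we retry the lock acquisition once. This second attempt only succeeds
                # if the FAILED record has been removed in the meantime; otherwise the
                # caller receives the 409 below.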
                if not acquire_lock(idempotency_key):
                     return {'status_code': 409, 'body': {'error': 'Concurrent request detected, please retry'}}

            # --- Business Logic Execution ---
            try:
                result = func(*args, **kwargs)
                # Assuming the function returns a tuple of (status_code, body)
                status_code, body = result
                
                # Only cache successful responses
                if 200 <= status_code < 300:
                    update_record(idempotency_key, 'COMPLETED', body)
                else:
                    # For client or server errors, mark as FAILED but don't cache response body.
                    update_record(idempotency_key, 'FAILED')
                
                return {'status_code': status_code, 'body': body}

            except Exception as e:
                logger.error(f"Exception during execution for key {idempotency_key}: {traceback.format_exc()}")
                update_record(idempotency_key, 'FAILED')
                # Return a generic 500 error to the client
                return {'status_code': 500, 'body': {'error': 'An internal error occurred'}}
        return wrapper
    return decorator

# --- Example Usage with a mock request object ---

class MockRequest:
    def __init__(self, json_data):
        self.json = json_data

@idempotent(key_location='json.idempotency_key')
def process_payment(request: MockRequest):
    """
    A mock function that simulates processing a payment.
    """
    logger.info(f"--- Starting business logic for payment: {request.json['payment_id']} ---")
    # Simulate a network call to a payment gateway
    time.sleep(2)
    logger.info(f"--- Business logic finished for payment: {request.json['payment_id']} ---")
    
    # Simulate a failure for a specific payment ID for testing
    if request.json['payment_id'] == 'fail_me':
        raise ValueError("Payment gateway timeout")

    return 201, {'transactionId': f"txn_{request.json['payment_id']}", 'status': 'success'}

# --- Simulation ---
if __name__ == '__main__':
    # First call - should execute logic
    request1 = MockRequest(json_data={'idempotency_key': 'uuid-1234-abc', 'payment_id': 'p_one'})
    print("First call result:", process_payment(request1))

    # Second call (duplicate) - should return cached response
    request2 = MockRequest(json_data={'idempotency_key': 'uuid-1234-abc', 'payment_id': 'p_one'})
    print("Second call result:", process_payment(request2))

    # Call that will fail
    failing_request = MockRequest(json_data={'idempotency_key': 'uuid-5678-def', 'payment_id': 'fail_me'})
    print("Failing call result:", process_payment(failing_request))
    
    # Retry of the failed call: in this simple model it does NOT re-execute, because
    # acquire_lock is rejected while the FAILED record exists and the call returns 409.
    print("Retrying the 'failed' call will require manual cleanup in this simple model")
    # To support retries, delete the FAILED item (or conditionally overwrite it)
    # before acquiring the lock again.

This decorator encapsulates the entire flow, making the API endpoint code clean and focused solely on the business logic.

Advanced Considerations and Performance Tuning

While the core pattern is robust, production environments demand attention to nuance.

1. Choosing an Idempotency Key

The client MUST generate the idempotency key. If the server generates it, it's useless for retries where the client doesn't know the key of the original failed request. The key should be a UUIDv4 or similarly high-entropy string to ensure good partition distribution in DynamoDB. Mandate this via API contracts and validation.
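On the client side, the important detail is that the key is generated once per logical operation and reused on every retry. The sketch below illustrates this under a few assumptions: the key travels in an `Idempotency-Key` header, `send` is a hypothetical callable that performs the HTTP request and returns the status code, and 409 and 5xx responses are treated as retryable.

```python
import uuid
from typing import Callable, Dict


def new_idempotency_key() -> str:
    """Generate a fresh UUIDv4 key for one logical operation."""
    return str(uuid.uuid4())


def is_valid_key(value: str) -> bool:
    """Server-side validation: accept only canonical UUIDv4 strings."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    return parsed.version == 4 and str(parsed) == value.lower()


def send_with_retries(send: Callable[[Dict[str, str]], int], max_attempts: int = 3) -> int:
    """Send one logical operation, reusing the same key for every attempt."""
    headers = {"Idempotency-Key": new_idempotency_key()}  # generated exactly once
    status = 0
    for _ in range(max_attempts):
        status = send(headers)
        if status != 409 and status < 500:
            break  # success or a non-retryable client error
    return status
```

A real client would also wait between attempts. The delay is omitted here to keep the focus on key reuse.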

2. Performance and Cost

This pattern adds at least two DynamoDB operations per unique request (PutItem and UpdateItem). Every duplicate costs a rejected conditional PutItem, which still consumes write capacity, plus a GetItem to inspect the existing record. For a high-throughput API, this can be significant.

* Capacity Mode: As mentioned, On-Demand (PAY_PER_REQUEST) is excellent for unpredictable workloads. For a stable, high-volume workload, Provisioned Capacity with Auto Scaling can be more cost-effective. Monitor your table's consumed capacity metrics in CloudWatch to make an informed decision.

* Payload Size: The responsePayload is stored in DynamoDB. Large response bodies (approaching the 400 KB DynamoDB item size limit) will increase storage costs and consumed WCUs/RCUs. If responses are large, consider storing only a confirmation token or ID in the idempotency record, with the full payload stored elsewhere (e.g., S3).
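The effect of payload size on cost can be estimated with simple arithmetic. DynamoDB sizes writes in 1 KB increments and strongly consistent reads in 4 KB increments, with eventually consistent reads costing half as much, and an update is sized by the larger of the item's size before and after the change. The helper names and the example sizes below are illustrative assumptions.

```python
import math


def write_units(item_bytes: int) -> int:
    """Write units for one write: 1 per started KB, minimum 1."""
    return max(1, math.ceil(item_bytes / 1024))


def read_units(item_bytes: int, strongly_consistent: bool = True) -> float:
    """Read units for one read: 1 per started 4 KB, halved if eventually consistent."""
    units = max(1, math.ceil(item_bytes / 4096))
    return float(units) if strongly_consistent else units / 2


def write_units_per_unique_request(record_bytes: int, payload_bytes: int) -> int:
    """PutItem for the lock record plus UpdateItem that attaches the payload.

    The update is sized by the item after the change, which is the larger
    of the before and after sizes when a payload is being added.
    """
    lock_write = write_units(record_bytes)
    final_write = write_units(record_bytes + payload_bytes)
    return lock_write + final_write
```

For example, a 200-byte lock record with a 3,000-byte response payload costs 1 + 4 = 5 write units per unique request, while the same record with a short reference to an S3 object would cost 2.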

3. Stale Lock Handling

Our handle_duplicate_request function detects a stale lock using lockExpiry but takes a conservative approach (failing the request). A more aggressive strategy could be to attempt to "steal" the lock. This would involve an UpdateItem call with a ConditionExpression that checks if lockExpiry is in the past. This is complex and can lead to its own race conditions if not carefully implemented. For most systems, failing the request and relying on client retries is safer.
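For completeness, the lock takeover described above could look roughly like the following. This is a sketch of the request arguments only, not a recommendation. `build_lock_takeover` is a hypothetical helper and the lock duration is an assumed value.

```python
from typing import Any, Dict


def build_lock_takeover(idempotency_key: str, now: int, lock_seconds: int = 300) -> Dict[str, Any]:
    """Keyword arguments for update_item that takes over a stale lock.

    The condition only passes while the record is IN_PROGRESS and its
    lockExpiry is in the past. The first successful caller pushes
    lockExpiry into the future, so concurrent takeover attempts made at
    about the same time fail the condition.
    """
    return {
        "Key": {"idempotencyKey": idempotency_key},
        "UpdateExpression": "SET lockExpiry = :new_lock",
        "ConditionExpression": "#status = :in_progress AND lockExpiry < :now",
        "ExpressionAttributeNames": {"#status": "status"},
        "ExpressionAttributeValues": {
            ":in_progress": "IN_PROGRESS",
            ":now": now,
            ":new_lock": now + lock_seconds,
        },
    }
```

The conditional update only arbitrates between competing takeover attempts. It cannot tell whether the original worker is dead or merely slow, so the business logic itself must tolerate being run by the new owner while the old one may still finish.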

4. Comparison with Alternatives

* Relational Databases (e.g., PostgreSQL): You can implement a similar pattern using a dedicated table and INSERT ... ON CONFLICT DO NOTHING. However, this can put significant write load on your primary business database. Separating the idempotency concern into DynamoDB isolates the workloads, and DynamoDB is often better suited to this specific key-value, high-write-throughput pattern at scale.

* In-Memory Stores (e.g., Redis): Redis is incredibly fast but prioritizes speed over durability. With its default asynchronous persistence and replication, a node failure or failover can lose recently written idempotency keys, and a retry arriving afterwards would re-execute the operation. For financial transactions or critical operations, the durability of DynamoDB is paramount.
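To show the shape of the relational variant mentioned above, here is a small sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL, since SQLite accepts the same `ON CONFLICT ... DO NOTHING` clause. The table and column names are hypothetical, and a real PostgreSQL setup would use its own driver and connection handling.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory database with a minimal idempotency table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE idempotency ("
        "  idempotency_key TEXT PRIMARY KEY,"
        "  status TEXT NOT NULL"
        ")"
    )
    return conn


def claim_key(conn: sqlite3.Connection, idempotency_key: str) -> bool:
    """Try to claim the key; True means this caller inserted the row."""
    cursor = conn.execute(
        "INSERT INTO idempotency (idempotency_key, status) "
        "VALUES (?, 'IN_PROGRESS') "
        "ON CONFLICT(idempotency_key) DO NOTHING",
        (idempotency_key,),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted and 0 when the key already existed.
    return cursor.rowcount == 1
```

The primary-key constraint plays the same role as `attribute_not_exists(idempotencyKey)`: the database, not the application, decides which of two concurrent inserts wins.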

Conclusion

Implementing idempotency is not an optional feature for critical API endpoints; it is a fundamental requirement for building resilient, reliable distributed systems. By leveraging DynamoDB's atomic conditional writes, we can build a highly scalable and robust idempotency layer that is decoupled from our primary business logic. This pattern transforms the difficult problem of distributed locking into a manageable, state-driven workflow. The provided decorator serves as a starting template that handles concurrency, duplicate requests, and the most common partial failures; once you have settled your policies for FAILED retries and stale locks, it helps your API behave predictably even in the chaotic reality of distributed networks.
