Idempotency Patterns in Event-Driven Systems via Distributed Locks
The Inescapable Problem: At-Least-Once Delivery and Its Perils
In the world of distributed systems, especially those built on message queues and event streams (Kafka, SQS, RabbitMQ, etc.), the concept of "at-least-once" delivery is a foundational guarantee. It ensures that a message, once published, will be delivered to a consumer at least one time. While this prevents data loss, it introduces a significant challenge: messages can, and often will, be delivered more than once. Network glitches, consumer crashes after processing but before acknowledging, and broker failovers can all lead to duplicate message delivery.
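To make the redelivery window concrete, here is a minimal sketch of an SQS-style consumer loop (the queue URL and handle_message function are hypothetical; boto3 is assumed). If the worker crashes after processing but before the delete_message acknowledgement, the message becomes visible again after its visibility timeout and is delivered to another consumer.

import boto3

sqs = boto3.client('sqs')
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # hypothetical queue

def handle_message(body: str) -> None:
    print(f"Processing: {body}")  # stand-in for real business logic

def consume_forever():
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for msg in resp.get('Messages', []):
            handle_message(msg['Body'])  # 1. process the message
            # --- a crash here means the message is redelivered later ---
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg['ReceiptHandle'])  # 2. acknowledge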
For a senior engineer, the implication is clear: any non-idempotent operation triggered by such an event can lead to catastrophic failures. Imagine a payment processing service that charges a credit card twice, an inventory system that decrements stock multiple times for a single order, or an email service that spams a user with the same notification. The business impact is severe. Therefore, building idempotent consumers is not merely a best practice; it is a hard requirement for system correctness.
This article bypasses introductory concepts. We assume you understand why idempotency is critical. Instead, we will dissect and implement advanced, production-ready patterns to achieve it under high concurrency, focusing on the combination of an Idempotency Key and a Distributed Lock.
Why Naive Database Checks Fail
A common first attempt at idempotency is to store a unique identifier from the event (e.g., event_id) in a database and check for its existence before processing.
# WARNING: This is a naive and flawed implementation
def process_event_naive(event):
    event_id = event['id']

    # 1. CHECK
    if database.has_processed(event_id):
        print(f"Event {event_id} already processed.")
        return

    # --- RACE CONDITION WINDOW ---
    # Two concurrent processes can pass the check above for the same event_id
    # before either has a chance to write the record.

    # 2. ACT
    result = execute_business_logic(event['data'])

    # 3. WRITE
    database.mark_as_processed(event_id, result)
    return result
This CHECK -> ACT -> WRITE sequence is a classic race condition. If two instances of your consumer receive the same event simultaneously, both can execute the CHECK and find no record. Both will then proceed to ACT, executing the business logic twice, before one of them finally performs the WRITE. The second write might fail on a unique constraint, but by then, the damage is done.

To solve this, we need an atomic operation that combines the CHECK and the initial WRITE into a single, indivisible step. This is where distributed locking comes in.
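Before reaching for a dedicated lock store, note that a relational database can already provide this atomic step: write the claim first and let a unique constraint arbitrate. A minimal sketch with the standard-library sqlite3 module (table name and schema are illustrative; in production this would be a shared database, e.g., Postgres with INSERT ... ON CONFLICT DO NOTHING):

import sqlite3

conn = sqlite3.connect("idempotency.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY, result TEXT)"
)

def try_claim(event_id: str) -> bool:
    # The insert is the CHECK and the WRITE in one step: the PRIMARY KEY
    # constraint guarantees only one caller ever inserts a given event_id.
    with conn:  # commits on success
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id, result) VALUES (?, NULL)",
            (event_id,),
        )
    return cur.rowcount == 1  # True -> we own this event and may safely ACT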
The Core Pattern: Idempotency Key + Atomic Lock Acquisition
The robust pattern involves three components: a unique Idempotency Key for each logical operation, an idempotency store that supports an atomic "write-if-not-exists" operation, and a small state machine (IN_PROGRESS -> COMPLETED) whose completed records cache the response.

Our refined flow looks like this (a technology-agnostic sketch follows the list):

1. Extract the Idempotency-Key from the event/request.
2. Atomically write a record for that key to the store with a status of IN_PROGRESS. This operation must fail if the record already exists.
3. If the write succeeds, you own the key; execute the business logic.
   a. Upon completion, update the record in the store with a COMPLETED status and cache the result/response.
   b. If the business logic fails, delete the record to allow for a clean retry.
4. If the write fails, the key has already been claimed.
   a. Query the record. If its status is COMPLETED, return the cached result immediately without executing the business logic.
   b. If its status is IN_PROGRESS, another process is actively working on it. You must decide on a concurrency strategy: reject the request (e.g., HTTP 429), wait with a timeout, or enter a short retry loop.
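As a technology-agnostic sketch, the flow above is a small state machine built around a store with an atomic conditional write. The store interface below (put_if_absent, get, complete, delete) is hypothetical, not a real library; the Redis and DynamoDB sections implement it concretely.

from typing import Any

class InProgress(Exception):
    """Raised when another worker currently owns the key."""

def run_idempotent(store, key: str, business_logic) -> Any:
    # 2. Atomic "write-if-not-exists": records IN_PROGRESS and acts as the lock
    if store.put_if_absent(key, status="IN_PROGRESS"):
        try:
            result = business_logic()
            store.complete(key, result)   # 3a. mark COMPLETED and cache the response
            return result
        except Exception:
            store.delete(key)             # 3b. allow a clean retry
            raise

    # 4. The record already exists
    record = store.get(key)
    if record and record["status"] == "COMPLETED":
        return record["result"]           # 4a. return the cached result
    raise InProgress(key)                 # 4b. reject, wait, or retry per your policy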
Let's implement this robust pattern using two popular technologies: Redis and AWS DynamoDB.
Implementation 1: Redis for High-Throughput, Low-Latency Locking
Redis is an excellent choice for an idempotency store due to its speed and atomic commands. The SET command with the NX (Not eXists) and PX (millisecond TTL) options is our primary tool. It allows us to set a key only if it doesn't already exist and to apply a timeout, all in one atomic operation.
This TTL is crucial. It acts as a safety net, preventing indefinite locks if a consumer crashes after acquiring the lock but before completing the operation.
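At the wire level, acquisition is a single command. A quick redis-py illustration (a localhost Redis instance is assumed):

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# The first call sets the key and returns True; while the key exists (lock held),
# subsequent NX calls return None.
print(r.set("idempotency:lock:req-123", "in_progress", nx=True, px=60_000))  # True
print(r.set("idempotency:lock:req-123", "in_progress", nx=True, px=60_000))  # None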
The Idempotency Decorator in Python
We can encapsulate this logic cleanly in a Python decorator. This makes it reusable and separates the idempotency concern from the core business logic.
import redis
import json
import time
import functools
from uuid import uuid4
# Configure your Redis client
# In a real application, this would come from configuration management
redis_client = redis.StrictRedis(host='localhost', port=6379, db=0, decode_responses=True)
# Constants
LOCK_TTL_SECONDS = 60 # How long the lock is held (should exceed max processing time)
RESULT_TTL_SECONDS = 3600 # How long to cache the final result (e.g., 1 hour)
class IdempotencyException(Exception):
    pass

class RequestInProgress(IdempotencyException):
    pass
def idempotent(key_arg_name='idempotency_key'):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            idempotency_key = kwargs.get(key_arg_name)
            if not idempotency_key:
                raise ValueError(f"Missing required keyword argument: {key_arg_name}")

            lock_key = f"idempotency:lock:{idempotency_key}"
            result_key = f"idempotency:result:{idempotency_key}"

            # 0. Fast path: if a completed result is already cached, return it without
            # touching the lock. Without this check, a duplicate arriving after
            # completion would re-acquire the (already deleted) lock and re-execute
            # the business logic.
            cached_result = redis_client.get(result_key)
            if cached_result:
                print(f"[{idempotency_key}] Returning cached result.")
                return json.loads(cached_result)

            # 1. Atomically try to acquire the lock
            # SET key value NX PX ttl -> set key to value only if it does not exist, with a millisecond TTL
            if redis_client.set(lock_key, "in_progress", nx=True, px=LOCK_TTL_SECONDS * 1000):
                # 2. Re-check the cache after acquiring the lock: another worker may have
                # completed (writing the result and releasing the lock) between our
                # fast-path check and the SET NX above.
                cached_result = redis_client.get(result_key)
                if cached_result:
                    redis_client.delete(lock_key)
                    return json.loads(cached_result)

                try:
                    # 3. Lock acquired, execute the business logic
                    print(f"[{idempotency_key}] Lock acquired. Executing function.")
                    result = func(*args, **kwargs)

                    # 4. Store the result with its TTL and release the lock.
                    # We use a transaction (pipeline) to ensure atomicity.
                    pipe = redis_client.pipeline()
                    pipe.set(result_key, json.dumps(result), px=RESULT_TTL_SECONDS * 1000)
                    pipe.delete(lock_key)
                    pipe.execute()

                    print(f"[{idempotency_key}] Execution successful. Result stored.")
                    return result
                except Exception as e:
                    # 5. On failure, release the lock to allow retries
                    print(f"[{idempotency_key}] Execution failed. Releasing lock.")
                    redis_client.delete(lock_key)
                    raise e  # Re-raise the exception
            else:
                # 6. Lock could not be acquired
                print(f"[{idempotency_key}] Lock conflict. Checking for result.")

                # Check whether a result has been cached since the fast-path check
                cached_result = redis_client.get(result_key)
                if cached_result:
                    print(f"[{idempotency_key}] Returning cached result.")
                    return json.loads(cached_result)
                else:
                    # This is the critical edge case: another process holds the lock but hasn't finished.
                    # You must choose a strategy: fail fast, wait, or retry.
                    print(f"[{idempotency_key}] Request in progress by another worker.")
                    raise RequestInProgress("This request is currently being processed.")
        return wrapper
    return decorator
# --- Example Usage ---
@idempotent(key_arg_name='request_id')
def process_payment(request_id: str, amount: float, currency: str):
    """A mock function that simulates a long-running, critical operation."""
    print(f"Processing payment of {amount} {currency} for request {request_id}...")
    time.sleep(5)  # Simulate network latency or heavy computation
    return {"status": "success", "transaction_id": str(uuid4()), "amount_processed": amount}

if __name__ == '__main__':
    # Simulate two identical requests coming in at the same time.
    # In a real system, these would come from two different threads/processes/servers.
    from threading import Thread

    request_id_1 = f"req-{uuid4()}"

    def call_processor(req_id):
        try:
            result = process_payment(request_id=req_id, amount=100.0, currency="USD")
            print(f"Thread finished with result: {result}")
        except RequestInProgress:
            print("Thread caught RequestInProgress. Request is being handled elsewhere.")
        except Exception as e:
            print(f"Thread caught an unexpected error: {e}")

    print("--- Scenario 1: Concurrent identical requests ---")
    t1 = Thread(target=call_processor, args=(request_id_1,))
    t2 = Thread(target=call_processor, args=(request_id_1,))
    t1.start()
    time.sleep(0.1)  # Ensure t1 gets the lock first
    t2.start()
    t1.join()
    t2.join()

    print("\n--- Scenario 2: A third request arrives after the first has completed ---")
    # This call should be very fast and return the cached result
    call_processor(request_id_1)
Analysis of the Redis Pattern
* Strengths:
* Performance: Redis is exceptionally fast. The SETNX operation is O(1) and typically completes in sub-millisecond time, making the lock acquisition phase very lightweight.
* Simplicity: The logic is relatively straightforward, relying on a single atomic command.
* Built-in TTL: Redis's native TTL support is perfect for handling stale locks from crashed workers.
* Weaknesses & Edge Cases:
* Persistence: Standard Redis configurations prioritize speed over durability. If a Redis primary node fails before a write is replicated, a lock or a result could be lost. For absolute certainty, you might need Redis Sentinel/Cluster with AOF persistence, which adds complexity.
* Clock Skew: Because the TTL is enforced by the Redis server itself, client clock skew is not an issue for this single-node pattern. More elaborate distributed lock algorithms (e.g., Redlock) can, however, be sensitive to clock drift.
* Handling RequestInProgress: Our example fails fast. In a real system, you might implement a backoff-and-retry loop for the second caller, polling the result_key for a few seconds before giving up; this can smooth over short processing delays. A sketch of such a helper follows this list.
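One possible shape for that wait-and-poll strategy, reusing redis_client, json, time, and RequestInProgress from the decorator above (the timeout and polling interval are illustrative defaults):

def wait_for_result(idempotency_key: str, timeout_s: float = 10.0, poll_interval_s: float = 0.25):
    """Poll the cached result instead of failing fast on a lock conflict."""
    result_key = f"idempotency:result:{idempotency_key}"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        cached = redis_client.get(result_key)
        if cached:
            return json.loads(cached)  # the other worker finished in time
        time.sleep(poll_interval_s)
    # Still no result: give up and let the caller decide when to retry
    raise RequestInProgress(f"Timed out waiting for result of {idempotency_key}")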
Implementation 2: AWS DynamoDB for Serverless, Durable Locking
For architectures running in AWS, particularly serverless (Lambda), DynamoDB offers a compelling alternative. It's fully managed, highly available, and provides the atomic operations we need through Condition Expressions.
A ConditionExpression allows you to make a write operation (like PutItem or UpdateItem) conditional on the state of the item on the server. We can use attribute_not_exists(IdempotencyKey) to achieve the exact same atomic "write-if-not-exists" behavior as Redis's SETNX.
The DynamoDB Idempotency Table
First, define a simple DynamoDB table. A pay-per-request model is often cost-effective for this use case.
* Table Name: IdempotencyStore
* Partition Key: IdempotencyKey (String)
* TTL Attribute: ExpiryTimestamp (enable DynamoDB TTL on this attribute)
An item in this table will represent the state of a single idempotent request and will have a schema like:
* IdempotencyKey: The unique request ID.
* Status: IN_PROGRESS or COMPLETED.
* ExpiryTimestamp: A Unix timestamp after which the item is automatically deleted.
* ResponseData: The marshalled JSON response of the successful operation.

A boto3 sketch for creating this table and enabling TTL follows.
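One way to provision the table with boto3 (region, credentials, and the pay-per-request billing mode are assumptions; adjust for your environment):

import boto3

client = boto3.client('dynamodb')

client.create_table(
    TableName='IdempotencyStore',
    AttributeDefinitions=[{'AttributeName': 'IdempotencyKey', 'AttributeType': 'S'}],
    KeySchema=[{'AttributeName': 'IdempotencyKey', 'KeyType': 'HASH'}],
    BillingMode='PAY_PER_REQUEST',
)
client.get_waiter('table_exists').wait(TableName='IdempotencyStore')

# TTL is enabled as a separate call, targeting the ExpiryTimestamp attribute
client.update_time_to_live(
    TableName='IdempotencyStore',
    TimeToLiveSpecification={'Enabled': True, 'AttributeName': 'ExpiryTimestamp'},
)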
Python Implementation with Boto3
This implementation is slightly more complex due to the state machine (IN_PROGRESS -> COMPLETED) we need to manage explicitly.
import boto3
import json
import time
import functools
from botocore.exceptions import ClientError
from uuid import uuid4
# Configure Boto3 client
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('IdempotencyStore') # Make sure you've created this table
# Constants
IN_PROGRESS_TTL_SECONDS = 60
COMPLETED_TTL_SECONDS = 3600
class IdempotencyException(Exception):
    pass

class RequestInProgress(IdempotencyException):
    pass
def idempotent_dynamodb(key_arg_name='idempotency_key'):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            idempotency_key = kwargs.get(key_arg_name)
            if not idempotency_key:
                raise ValueError(f"Missing required keyword argument: {key_arg_name}")

            current_time = int(time.time())
            in_progress_expiry = current_time + IN_PROGRESS_TTL_SECONDS

            try:
                # 1. Atomically create the record if it doesn't exist.
                # This is our lock acquisition step.
                table.put_item(
                    Item={
                        'IdempotencyKey': idempotency_key,
                        'Status': 'IN_PROGRESS',
                        'ExpiryTimestamp': in_progress_expiry
                    },
                    ConditionExpression='attribute_not_exists(IdempotencyKey)'
                )
                print(f"[{idempotency_key}] Lock acquired via DynamoDB. Executing function.")
            except ClientError as e:
                if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
                    # 2. Record already exists, lock was not acquired
                    print(f"[{idempotency_key}] Lock conflict. Checking item status.")

                    # We need to fetch the existing item to check its status
                    response = table.get_item(Key={'IdempotencyKey': idempotency_key})
                    item = response.get('Item')

                    if not item or item.get('Status') == 'IN_PROGRESS':
                        # Item might have expired or is still being processed
                        print(f"[{idempotency_key}] Request in progress by another worker.")
                        raise RequestInProgress("This request is currently being processed.")
                    elif item.get('Status') == 'COMPLETED':
                        # Return the cached result
                        print(f"[{idempotency_key}] Returning cached result from DynamoDB.")
                        return json.loads(item.get('ResponseData', '{}'))
                else:
                    raise  # Re-raise other DynamoDB errors

            try:
                # 3. Execution of the core business logic
                result = func(*args, **kwargs)
                result_json = json.dumps(result)
                completed_expiry = int(time.time()) + COMPLETED_TTL_SECONDS

                # 4. Update the record to COMPLETED and store the result
                table.update_item(
                    Key={'IdempotencyKey': idempotency_key},
                    UpdateExpression='SET #status = :status, #data = :data, #expiry = :expiry',
                    ExpressionAttributeNames={
                        '#status': 'Status',
                        '#data': 'ResponseData',
                        '#expiry': 'ExpiryTimestamp'
                    },
                    ExpressionAttributeValues={
                        ':status': 'COMPLETED',
                        ':data': result_json,
                        ':expiry': completed_expiry
                    }
                )
                print(f"[{idempotency_key}] Execution successful. Result stored in DynamoDB.")
                return result
            except Exception as e:
                # 5. On failure, delete the record to allow retries
                print(f"[{idempotency_key}] Execution failed. Deleting idempotency record.")
                table.delete_item(Key={'IdempotencyKey': idempotency_key})
                raise e

        return wrapper
    return decorator
# --- Example Usage ---
@idempotent_dynamodb(key_arg_name='request_id')
def create_order(request_id: str, user_id: str, items: list):
    """Simulates creating an order in a database."""
    print(f"Creating order for user {user_id} with items: {items}")
    time.sleep(5)  # Simulate database writes and other operations
    order_id = f"order_{uuid4()}"
    return {"status": "created", "order_id": order_id, "user_id": user_id}
# You would need to set up AWS credentials and create the DynamoDB table
# for this to run. The simulation logic is the same as the Redis example.
# For brevity, the threading simulation is omitted here but would work identically.
Analysis of the DynamoDB Pattern
* Strengths:
* Durability & Availability: DynamoDB is a highly available, multi-AZ service. Your idempotency state is far more durable than with a standard Redis setup.
* Serverless-Friendly: This pattern is a perfect match for AWS Lambda. There are no connections to manage, and it scales seamlessly with your function's concurrency.
* Managed TTL: The built-in TTL feature offloads the responsibility of cleaning up old records, preventing your table from growing indefinitely.
* Weaknesses & Edge Cases:
* Latency: DynamoDB latency is higher than Redis, typically in the single-digit to low double-digit milliseconds. For most web services, this is perfectly acceptable, but for extreme high-frequency trading systems, it might be a factor.
* Cost Model: Pay-per-request can be very cheap, but a high volume of idempotent checks will incur costs. You are paying for at least two write units (initial put, final update) and potentially one read unit (on conflict) per successful operation. Monitor your costs.
* Hot Partitions: If a single idempotency key is retried many times in a short period, it could create a hot partition. This is unlikely for this pattern, as keys should be unique and short-lived, but it's a consideration for any DynamoDB design.
Performance and Architectural Considerations
| Feature | Redis (ElastiCache) | DynamoDB |
|---|---|---|
| Latency | Sub-millisecond to low single-digit ms | Single-digit to low double-digit ms |
| Durability | Configurable (AOF, replication); lower by default | Extremely high (multi-AZ replication) |
| Scalability | Vertical scaling; clustering adds complexity | Seamless, effectively infinite |
| Operational Cost | Higher (managing cluster, patching, scaling) | Very low (fully managed serverless service) |
| Monetary Cost | Pay for provisioned instances (can be idle) | Pay-per-request or provisioned capacity; more granular |
| Best Fit | High-throughput systems where lowest latency is critical; on-prem or multi-cloud deployments | AWS-native and serverless architectures; when durability of the idempotency record is paramount |
Choosing Your TTL
Selecting the right TTL for your IN_PROGRESS state is a critical design decision. It's a trade-off:
* Too Short: If your business logic takes longer than the TTL, another process could prematurely acquire the lock, leading to a race condition.
* Too Long: If a worker crashes, the key will remain locked for the full TTL, blocking any retries until it expires.
Rule of Thumb: Set the IN_PROGRESS TTL to be slightly longer than the maximum expected execution time of your function (e.g., P99 latency plus a buffer). For an AWS Lambda function with a 30-second timeout, a 35-40 second TTL would be a safe starting point.
The COMPLETED TTL should be based on business requirements: how long do you need to protect against duplicates? 24 hours is a common choice.
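Applied to the Lambda example above, the rule of thumb reduces to a few lines of configuration (the P99 figure is illustrative):

# Execution profile (illustrative numbers)
LAMBDA_TIMEOUT_SECONDS = 30   # hard upper bound on a single execution
P99_EXECUTION_SECONDS = 22    # what you actually observe in production
SAFETY_BUFFER_SECONDS = 10

# IN_PROGRESS lock TTL: just above the worst case you are willing to allow
IN_PROGRESS_TTL_SECONDS = max(LAMBDA_TIMEOUT_SECONDS, P99_EXECUTION_SECONDS) + SAFETY_BUFFER_SECONDS  # 40

# COMPLETED record TTL: driven by how long duplicates can plausibly arrive
COMPLETED_TTL_SECONDS = 24 * 60 * 60  # 24 hours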
Conclusion: From Theory to Production Resiliency
Idempotency is not an optional feature in modern distributed systems; it is a cornerstone of reliability. By moving beyond naive database checks and embracing atomic operations via distributed locks, we can build consumers that are resilient to the inherent quirks of at-least-once message delivery.
Both the Redis SETNX pattern and the DynamoDB ConditionExpression pattern provide robust, production-ready solutions to this problem. The choice between them is not about which is "better" but which is the best fit for your specific architectural context, performance requirements, and operational model.
For senior engineers, mastering these patterns is essential. It represents the shift from simply writing code that works on a sunny day to engineering systems that are correct, predictable, and fault-tolerant in the chaotic reality of a distributed environment.