Idempotency in Event-Driven Systems via Distributed Locks & Keys
The Idempotency Imperative in Modern Event-Driven Architectures
In distributed systems, especially those built on event-driven or message-based architectures, the concept of "exactly-once" processing is the holy grail. However, most message brokers (like Kafka, RabbitMQ, or AWS SQS) offer at-least-once delivery guarantees by default. This is a pragmatic trade-off. It's far easier for a system to guarantee a message will be delivered one or more times than to guarantee it will be delivered precisely once, especially in the face of network partitions, consumer crashes, and broker failures.
This guarantee shifts the responsibility of ensuring exactly-once processing to the consumer application. A senior engineer knows that if an operation is not naturally idempotent (e.g., creating a user, processing a payment, sending a notification), receiving the same message twice can lead to catastrophic data corruption, duplicate charges, or a degraded user experience.
Consider the classic failure scenario:
- A consumer picks up a ProcessPayment event.
- It successfully processes the payment and writes the transaction to the database.
- The consumer crashes (or is cut off from the broker) just before it can acknowledge the message.
- The broker, never receiving the acknowledgment, assumes the message was not processed and redelivers it to another (or the same restarted) consumer instance.
- The payment is processed a second time.
This is where idempotency becomes a non-negotiable architectural requirement. An idempotent operation is one that can be applied multiple times without changing the result beyond the initial application. Our goal is to make our non-idempotent business logic behave idempotently. The standard tool for this is the Idempotency Key.
An idempotency key is a unique value, typically provided by the client or message producer, that identifies a specific operation. The consumer system uses this key to track the status of operations and ensure that the logic for a given key is executed only once.
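To make this concrete, here is a minimal producer-side sketch. The Message envelope type is hypothetical (a real producer would set a Kafka header, RabbitMQ property, or SQS message attribute instead), and it shows one common option: deriving the key deterministically from the business identifiers so that every retry and redelivery of the same operation carries the same value. A randomly generated UUID stored alongside the operation works just as well.

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Message is a hypothetical broker-agnostic envelope, used only for illustration.
type Message struct {
	Headers map[string]string
	Body    []byte
}

// deterministicKey derives the idempotency key from the identifiers of the
// business operation, so retries and redeliveries always carry the same key.
func deterministicKey(operation, orderID string) string {
	sum := sha256.Sum256([]byte(operation + ":" + orderID))
	return hex.EncodeToString(sum[:16])
}

func main() {
	msg := Message{
		Headers: map[string]string{
			"Idempotency-Key": deterministicKey("ProcessPayment", "order-42"),
		},
		Body: []byte(`{"order_id":"order-42","amount_cents":1999}`),
	}
	fmt.Println("publishing with Idempotency-Key:", msg.Headers["Idempotency-Key"])
}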
The Naive Approach and Its Inherent Race Condition
A first attempt at implementing idempotency might look something like this. We'll use a key-value store like Redis to track processed keys.
The Basic Workflow:
- The producer attaches an Idempotency-Key to the message header (e.g., a UUID).
- The consumer receives the message and extracts the key.
- The consumer checks the store: GET idempotency:<key>.
- If the key exists, the message has already been processed. Skip and acknowledge.
- If the key does not exist, execute the business logic.
- Record the key and result with a TTL: SET idempotency:<key> <result> EX 86400.
- Acknowledge the message.
Here is a simplified Go implementation of this flawed logic:
// WARNING: This implementation contains a critical race condition.
package main
import (
"context"
"fmt"
"time"
"github.com/go-redis/redis/v8"
)
var redisClient *redis.Client
func init() {
redisClient = redis.NewClient(&redis.Options{
Addr: "localhost:6379",
})
}
// NaiveIdempotentProcess simulates processing a message idempotently.
func NaiveIdempotentProcess(ctx context.Context, idempotencyKey string, businessLogic func() (string, error)) (string, error) {
redisKey := fmt.Sprintf("idempotency:%s", idempotencyKey)
// 1. Check if the key already exists
result, err := redisClient.Get(ctx, redisKey).Result()
if err == nil {
fmt.Printf("Request %s already processed. Returning cached result.\n", idempotencyKey)
return result, nil // Success, already done
} else if err != redis.Nil {
return "", fmt.Errorf("redis GET failed: %w", err) // Real error
}
// 2. Key does not exist, so we process it.
// !!! RACE CONDITION WINDOW STARTS HERE !!!
fmt.Printf("Processing request %s for the first time...\n", idempotencyKey)
result, err = businessLogic()
if err != nil {
return "", err
}
// 3. Store the result
err = redisClient.Set(ctx, redisKey, result, 24*time.Hour).Err()
if err != nil {
// Critical problem: logic executed but we failed to record it.
return "", fmt.Errorf("failed to cache result: %w", err)
}
// !!! RACE CONDITION WINDOW ENDS HERE !!!
return result, nil
}
func main() {
// Simulate two concurrent consumers getting the same message
idempotencyKey := "payment-xyz-123"
businessLogic := func() (string, error) {
fmt.Println("Executing expensive, non-idempotent payment logic...")
time.Sleep(100 * time.Millisecond) // Simulate work
return "payment_successful", nil
}
// Using a channel to wait for goroutines
done := make(chan bool)
go func() {
NaiveIdempotentProcess(context.Background(), idempotencyKey, businessLogic)
done <- true
}()
go func() {
NaiveIdempotentProcess(context.Background(), idempotencyKey, businessLogic)
done <- true
}()
<-done
<-done
}
The Flaw: This code is vulnerable to a classic Check-Then-Act race condition. In a distributed environment with multiple consumer instances, the following sequence is not just possible, but probable under load:
- Consumer A receives a message with idempotency key k1.
- Consumer A checks idempotency:k1. It doesn't exist.
- Consumer B receives the same message with key k1 (due to redelivery).
- Consumer B checks idempotency:k1. It also doesn't exist, because Consumer A hasn't written it yet.
- Both consumers execute the business logic.
The business logic ran twice. Our idempotency guarantee is broken.
Achieving Atomicity with Distributed Locks
To solve the race condition, the check-then-act sequence must be an atomic operation. In a distributed system, the tool for this is a distributed lock. By acquiring a lock before we check the idempotency store, we ensure that only one consumer can be inside that critical section for a given idempotency key at any given time.
The Refined Workflow:
- Extract the idempotency-key from the message.
- Acquire a distributed lock scoped to that key (e.g., lock:<key>). If unable to acquire, wait or fail.
- Inside the lock, check the idempotency store (e.g., processed:<key>).
- If already processed, retrieve and return the stored result. Release the lock and we're done.
- If not processed, execute the business logic.
- Store the result and processing status in the idempotency store.
- Release the lock.
Implementing a Distributed Lock with Redis
Redis provides a simple and efficient way to implement a distributed lock using a single atomic command: SET key value NX PX milliseconds.
* key: The lock key (e.g., lock:payment-xyz-123).
* value: A unique random value (e.g., a UUID). This is crucial to ensure that a consumer only deletes a lock that it owns, preventing a slow consumer from deleting a lock acquired by a subsequent consumer.
* NX: Only set the key if it does not already exist. This is the core of the locking mechanism.
* PX milliseconds: Set an expiration time on the key. This is a safety mechanism (a lease) to prevent deadlocks if a consumer crashes while holding the lock.
Here is a robust implementation in Go:
package main
import (
"context"
"fmt"
"time"
"github.com/go-redis/redis/v8"
"github.com/google/uuid"
)
var redisClient *redis.Client
const (
lockTimeout = 10 * time.Second
recordTTL = 24 * time.Hour
)
func init() {
redisClient = redis.NewClient(&redis.Options{
Addr: "localhost:6379",
})
}
// acquireLock attempts to get a lock in Redis.
func acquireLock(ctx context.Context, lockKey string, lockValue string, timeout time.Duration) (bool, error) {
return redisClient.SetNX(ctx, lockKey, lockValue, timeout).Result()
}
// releaseLock safely releases a lock using a Lua script to ensure atomicity.
func releaseLock(ctx context.Context, lockKey string, lockValue string) error {
// This Lua script ensures we only delete the lock if the value matches.
// This prevents a process from deleting a lock held by another process.
script := `
if redis.call("get", KEYS[1]) == ARGV[1] then
return redis.call("del", KEYS[1])
else
return 0
end
`
_, err := redisClient.Eval(ctx, script, []string{lockKey}, lockValue).Result()
return err
}
// LockBasedIdempotentProcess uses a distributed lock to prevent race conditions.
func LockBasedIdempotentProcess(ctx context.Context, idempotencyKey string, businessLogic func() (string, error)) (string, error) {
lockKey := fmt.Sprintf("lock:%s", idempotencyKey)
lockValue := uuid.New().String()
recordKey := fmt.Sprintf("record:%s", idempotencyKey)
// 1. Attempt to acquire the lock
acquired, err := acquireLock(ctx, lockKey, lockValue, lockTimeout)
if err != nil {
return "", fmt.Errorf("failed to acquire lock: %w", err)
}
if !acquired {
return "", fmt.Errorf("another process is handling this request") // Or could wait/retry
}
// IMPORTANT: Ensure the lock is released no matter what happens.
defer releaseLock(ctx, lockKey, lockValue)
// 2. Inside the lock, check if the record exists
result, err := redisClient.Get(ctx, recordKey).Result()
if err == nil {
fmt.Printf("[LOCKED] Request %s already processed. Returning cached result.\n", idempotencyKey)
return result, nil
} else if err != redis.Nil {
return "", fmt.Errorf("redis GET failed: %w", err)
}
// 3. Not processed yet, execute the logic
fmt.Printf("[LOCKED] Processing request %s for the first time...\n", idempotencyKey)
result, err = businessLogic()
if err != nil {
// Note: We are NOT storing a failure result here, allowing for retries.
// This is a design decision. See the next section for a more advanced pattern.
return "", err
}
// 4. Store the result
err = redisClient.Set(ctx, recordKey, result, recordTTL).Err()
if err != nil {
return "", fmt.Errorf("failed to cache result: %w", err)
}
return result, nil
}
func main() {
idempotencyKey := "payment-def-456"
businessLogic := func() (string, error) {
fmt.Println("Executing expensive, non-idempotent payment logic...")
time.Sleep(100 * time.Millisecond)
return "payment_successful", nil
}
done := make(chan bool)
// Simulate two concurrent requests
go func() {
_, err := LockBasedIdempotentProcess(context.Background(), idempotencyKey, businessLogic)
if err != nil {
fmt.Printf("Goroutine 1 finished with error: %v\n", err)
}
done <- true
}()
go func() {
time.Sleep(10 * time.Millisecond) // Give the first goroutine a head start so it takes the lock
_, err := LockBasedIdempotentProcess(context.Background(), idempotencyKey, businessLogic)
if err != nil {
fmt.Printf("Goroutine 2 finished with error: %v\n", err)
}
done <- true
}()
<-done
<-done
}
When you run this, you will see that the business logic is executed only once. The second goroutine will fail to acquire the lock and exit, correctly preventing duplicate processing.
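The second goroutine fails fast here, which is often the right call for queue consumers: the broker will redeliver the message later, by which point the record will exist. If you prefer the "wait" option mentioned in the workflow, here is a small sketch of a bounded retry wrapper around the acquireLock helper from the listing above; the attempt count and backoff values are illustrative, not tuned.

// acquireLockWithRetry wraps acquireLock (defined above) with a bounded wait,
// for callers that would rather block briefly than fail when the lock is busy.
func acquireLockWithRetry(ctx context.Context, lockKey, lockValue string, ttl time.Duration) (bool, error) {
	const maxAttempts = 10         // illustrative
	delay := 50 * time.Millisecond // illustrative starting backoff
	for attempt := 0; attempt < maxAttempts; attempt++ {
		acquired, err := acquireLock(ctx, lockKey, lockValue, ttl)
		if err != nil {
			return false, err
		}
		if acquired {
			return true, nil
		}
		select {
		case <-ctx.Done():
			return false, ctx.Err()
		case <-time.After(delay):
			delay *= 2 // simple exponential backoff
		}
	}
	return false, nil // still busy after all attempts; caller decides what to do
}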
Production-Grade Implementation: State Management and Edge Cases
The lock-based solution is a huge improvement, but it has limitations in a real-world production environment. What if the business logic is very long-running? Holding a distributed lock for minutes is an anti-pattern; it can cause contention and reduce system throughput. What if the process crashes after executing the logic but before writing the result?
To solve these, we need to evolve our idempotency record from a simple key-value pair into a state machine.
Idempotency Record State Machine:
* STARTED: The operation has been accepted and is currently in progress.
* COMPLETED: The operation finished successfully. The record should contain the result.
* FAILED: The operation failed. The record might contain error details.
This allows for a more nuanced and resilient workflow that minimizes lock duration.
The Advanced, Two-Phase Lock Workflow:
1. Start (under a short-lived lock):
   a. Acquire a short-lived distributed lock.
   b. Check the idempotency record store.
      i. If COMPLETED, return the cached result. Release lock.
      ii. If FAILED, decide on a retry strategy. Perhaps return an error. Release lock.
      iii. If STARTED, another process might be working on it. Check the record's timestamp. If it's old, it might be a crashed process. You could either wait, fail, or attempt to take over. For now, we'll assume we fail fast. Release lock.
      iv. If nil (not found), create a new record with status STARTED and a TTL. Store it.
   c. Release the lock.
2. Execute (outside the lock):
   a. Run the potentially long-running and non-idempotent business logic.
3. Finalize (under the lock again):
   a. Re-acquire the distributed lock.
   b. Update the idempotency record with the final status (COMPLETED or FAILED) and the result/error data. Update the TTL to its final, longer value.
   c. Release the lock.
This two-phase lock approach ensures that the lock is only held for the brief moments needed to perform atomic state transitions in Redis, not for the entire duration of the business logic. This drastically improves system concurrency.
Advanced Implementation with State
Let's implement this robust pattern. We'll define a struct for our idempotency record and use JSON to store it in Redis.
package main
import (
"context"
"encoding/json"
"errors"
"fmt"
"time"
"github.com/go-redis/redis/v8"
"github.com/google/uuid"
)
// ... (redisClient, acquireLock, releaseLock from previous example)
// IdempotencyStatus defines the state of a request.
type IdempotencyStatus string
const (
StatusStarted IdempotencyStatus = "STARTED"
StatusCompleted IdempotencyStatus = "COMPLETED"
StatusFailed IdempotencyStatus = "FAILED"
)
// IdempotencyRecord stores the state and result of a request.
type IdempotencyRecord struct {
Status IdempotencyStatus `json:"status"`
Result string `json:"result,omitempty"`
Error string `json:"error,omitempty"`
}
// ErrRequestInProgress is a specific error for concurrent requests.
var ErrRequestInProgress = errors.New("request already in progress")
const (
lockTimeout = 5 * time.Second // Short lock for state transitions
processingTTL = 5 * time.Minute // How long a 'STARTED' record is valid
finalRecordTTL = 24 * time.Hour // How long a 'COMPLETED'/'FAILED' record is kept
)
// AdvancedIdempotentProcessor encapsulates the full logic.
type AdvancedIdempotentProcessor struct {
Redis *redis.Client
}
func (p *AdvancedIdempotentProcessor) Process(ctx context.Context, key string, logic func() (string, error)) (string, error) {
lockKey := fmt.Sprintf("lock:%s", key)
recordKey := fmt.Sprintf("record:%s", key)
lockValue := uuid.New().String()
// --- PHASE 1: START PROCESSING ---
acquired, err := acquireLock(ctx, lockKey, lockValue, lockTimeout)
if err != nil {
return "", fmt.Errorf("failed to acquire lock: %w", err)
}
if !acquired {
return "", ErrRequestInProgress
}
defer releaseLock(ctx, lockKey, lockValue)
// Check existing record
rawRecord, err := p.Redis.Get(ctx, recordKey).Bytes()
if err != nil && err != redis.Nil {
return "", fmt.Errorf("failed to get record from redis: %w", err)
}
if err == nil { // Record exists
var record IdempotencyRecord
if json.Unmarshal(rawRecord, &record) != nil {
return "", fmt.Errorf("failed to unmarshal existing record")
}
switch record.Status {
case StatusCompleted:
fmt.Printf("Request %s already completed. Returning cached result.\n", key)
return record.Result, nil
case StatusFailed:
fmt.Printf("Request %s previously failed. Returning cached error.\n", key)
return "", errors.New(record.Error)
case StatusStarted:
// This could indicate a crashed process. For now, we fail fast.
// A more advanced system might check the record's age (see the stale-record sketch after this listing).
return "", ErrRequestInProgress
}
}
// No record found, create a 'STARTED' record
startRecord := IdempotencyRecord{Status: StatusStarted}
rawStartRecord, _ := json.Marshal(startRecord)
if err := p.Redis.Set(ctx, recordKey, rawStartRecord, processingTTL).Err(); err != nil {
return "", fmt.Errorf("failed to set STARTED record: %w", err)
}
// --- RELEASE LOCK and EXECUTE BUSINESS LOGIC ---
// Release lock before long-running task
releaseLock(ctx, lockKey, lockValue)
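// Note: the deferred releaseLock above will also fire at function exit. That is
// harmless: it either releases a lock we still hold (the early-return paths), or
// the Lua value check finds the lock gone/owned by someone else and does nothing.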
fmt.Printf("Executing long-running business logic for key %s...\n", key)
result, logicErr := logic()
// --- PHASE 2: FINALIZE PROCESSING ---
// Re-acquire the lock to update the record atomically
for i := 0; i < 5; i++ { // Retry acquiring lock
acquired, err = acquireLock(ctx, lockKey, lockValue, lockTimeout)
if err == nil && acquired {
break
}
time.Sleep(100 * time.Millisecond)
}
if !acquired {
return "", fmt.Errorf("failed to re-acquire lock to finalize record")
}
defer releaseLock(ctx, lockKey, lockValue)
var finalRecord IdempotencyRecord
if logicErr != nil {
finalRecord = IdempotencyRecord{Status: StatusFailed, Error: logicErr.Error()}
} else {
finalRecord = IdempotencyRecord{Status: StatusCompleted, Result: result}
}
rawFinalRecord, _ := json.Marshal(finalRecord)
if err := p.Redis.Set(ctx, recordKey, rawFinalRecord, finalRecordTTL).Err(); err != nil {
// This is a tricky state: logic succeeded but we can't save the result.
// The 'STARTED' record will eventually expire.
return "", fmt.Errorf("CRITICAL: failed to set final record: %w", err)
}
if logicErr != nil {
return "", logicErr
}
return result, nil
}
func main() {
processor := &AdvancedIdempotentProcessor{Redis: redisClient}
idempotencyKey := "order-ghi-789"
// Simulate a successful run
fmt.Println("--- First Run (Success) ---")
result, err := processor.Process(context.Background(), idempotencyKey, func() (string, error) {
time.Sleep(200 * time.Millisecond)
return "order_created_successfully", nil
})
fmt.Printf("Result: %s, Error: %v\n\n", result, err)
// Simulate a duplicate run
fmt.Println("--- Second Run (Duplicate) ---")
result, err = processor.Process(context.Background(), idempotencyKey, func() (string, error) {
fmt.Println("THIS SHOULD NOT RUN")
return "", nil
})
fmt.Printf("Result: %s, Error: %v\n", result, err)
}
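The StatusStarted branch in Process fails fast, so a consumer that crashes mid-flight effectively blocks retries until processingTTL expires. Below is a sketch of the age check hinted at in that comment. It reuses the types from the listing above, assumes the record gains a StartedAt field (not present in the struct above) and an illustrative staleAfter threshold, and taking over still requires the business logic itself to be safe to re-attempt.

// staleAfter is an illustrative threshold: a STARTED record older than this is
// treated as abandoned by a crashed consumer.
const staleAfter = 2 * time.Minute

// IdempotencyRecordV2 extends the record with the time processing began.
type IdempotencyRecordV2 struct {
	Status    IdempotencyStatus `json:"status"`
	Result    string            `json:"result,omitempty"`
	Error     string            `json:"error,omitempty"`
	StartedAt time.Time         `json:"started_at"`
}

// handleStarted decides, while the caller still holds the distributed lock,
// whether an existing STARTED record should block us or be taken over.
func handleStarted(rec IdempotencyRecordV2) (takeOver bool, err error) {
	if time.Since(rec.StartedAt) < staleAfter {
		// A live consumer is most likely still working on this key.
		return false, ErrRequestInProgress
	}
	// The record is stale: the previous consumer probably crashed after claiming
	// the key but before finalizing. The caller can reset the record to a fresh
	// STARTED state (with an updated StartedAt) and re-run the business logic.
	return true, nil
}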
Advanced Considerations and Optimizations
Building a truly production-ready system requires thinking about even more edge cases.
1. Choosing the Right Store
* Redis: Excellent choice due to its high performance, atomic operations (SETNX), and built-in TTLs. The main drawback is that its default configuration is not strongly persistent. You must configure AOF (Append Only File) persistence with fsync set to everysec or always to minimize data loss, which has performance implications.
* DynamoDB/Cassandra: These are great persistent alternatives. You can achieve atomicity using conditional writes (ConditionExpression in DynamoDB). Latency will be higher than Redis, but you gain durability and scalability. TTL features can be used for garbage collection.
* PostgreSQL: You can implement this pattern in a relational database using a dedicated idempotency_records table. You would use SELECT ... FOR UPDATE to acquire a row-level lock, providing strong transactional guarantees. However, this can become a point of contention and requires careful index design. A minimal sketch follows this list.
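For the PostgreSQL route, here is a minimal sketch using the standard database/sql package (a registered PostgreSQL driver is assumed). The idempotency_records table and its columns are illustrative names: the INSERT ... ON CONFLICT claims the key atomically, and SELECT ... FOR UPDATE takes the row-level lock described above when the key already exists.

package idempotency

import (
	"context"
	"database/sql"
)

// Assumed, illustrative schema:
//   CREATE TABLE idempotency_records (
//       key    TEXT PRIMARY KEY,
//       status TEXT NOT NULL,
//       result TEXT
//   );

// claimOrFetch atomically claims an idempotency key, or fetches the existing
// record under a row-level lock when another consumer got there first.
func claimOrFetch(ctx context.Context, db *sql.DB, key string) (claimed bool, status, result string, err error) {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return false, "", "", err
	}
	defer tx.Rollback() // harmless after a successful Commit

	// The ON CONFLICT clause makes the claim race-free across consumers.
	res, err := tx.ExecContext(ctx,
		`INSERT INTO idempotency_records (key, status) VALUES ($1, 'STARTED')
		 ON CONFLICT (key) DO NOTHING`, key)
	if err != nil {
		return false, "", "", err
	}
	if n, raErr := res.RowsAffected(); raErr == nil && n == 1 {
		return true, "STARTED", "", tx.Commit()
	}

	// Another consumer owns the key: read its record under FOR UPDATE so a
	// concurrent finalizer cannot change it while we decide what to do.
	var nullResult sql.NullString
	if err := tx.QueryRowContext(ctx,
		`SELECT status, result FROM idempotency_records WHERE key = $1 FOR UPDATE`,
		key).Scan(&status, &nullResult); err != nil {
		return false, "", "", err
	}
	return false, status, nullResult.String, tx.Commit()
}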
2. Garbage Collection
Idempotency records cannot live forever. TTLs are the most common and effective strategy. The processingTTL on STARTED records acts as a timeout for crashed jobs, while the finalRecordTTL determines how long you will protect against duplicates. A 24-48 hour window is common, aligning with the maximum time a message might be retained and redelivered by the broker.
3. Storing the Full Response
Our example stores a simple string result. In many real-world API contexts (e.g., Stripe's API), the idempotency layer must store the entire original response, including HTTP status code and headers. When a duplicate request is detected, the system serves the identical cached response, making the non-idempotent endpoint behave like a pure, idempotent one from the client's perspective.
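A sketch of what such a record could hold, assuming net/http on the serving side; the field names are illustrative rather than any specific provider's schema.

// StoredResponse captures everything needed to replay the original HTTP
// response verbatim when a duplicate request arrives.
type StoredResponse struct {
	StatusCode int                 `json:"status_code"`
	Headers    map[string][]string `json:"headers"`
	Body       []byte              `json:"body"`
}

// replay writes a cached response back to a duplicate caller unchanged.
func replay(w http.ResponseWriter, cached StoredResponse) {
	for name, values := range cached.Headers {
		for _, v := range values {
			w.Header().Add(name, v)
		}
	}
	w.WriteHeader(cached.StatusCode)
	w.Write(cached.Body)
}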
4. Performance and Benchmarking
This pattern is not free. It introduces network round-trips to your chosen store for every request.
* Latency: The overhead of the simple key check is typically two Redis round-trips for a new request and one for a duplicate; the lock-based and two-phase variants add a few more for lock acquisition and release. This is usually a few milliseconds, which is acceptable for most asynchronous operations.
* Throughput: The distributed lock introduces serialization for operations with the same idempotency key. This is the desired behavior. For operations with *different* keys, performance is limited only by the throughput of your store (e.g., Redis). It's crucial to ensure your store is not a bottleneck.
* Benchmarking: You should benchmark the overhead. Measure the latency of your business logic with and without the idempotency wrapper (a minimal benchmark sketch follows this list). For a 10ms business logic, adding 2-3ms of idempotency overhead might be a 30% increase. For a 500ms logic, it's negligible. The cost of the overhead must be weighed against the cost of data corruption from duplicate processing.
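As a starting point, here is a minimal go test benchmark sketch, assuming a local Redis and reusing LockBasedIdempotentProcess from earlier; each iteration uses a fresh key, so it measures the first-time (non-duplicate) path.

// idempotency_bench_test.go
package main

import (
	"context"
	"fmt"
	"testing"
)

func noopLogic() (string, error) { return "ok", nil }

// BenchmarkBusinessLogicOnly measures the raw logic with no idempotency wrapper.
func BenchmarkBusinessLogicOnly(b *testing.B) {
	for i := 0; i < b.N; i++ {
		noopLogic()
	}
}

// BenchmarkWithIdempotencyWrapper measures the same logic behind the lock-based
// wrapper; a fresh key per iteration exercises the first-time path.
func BenchmarkWithIdempotencyWrapper(b *testing.B) {
	ctx := context.Background()
	for i := 0; i < b.N; i++ {
		key := fmt.Sprintf("bench-%d-%d", b.N, i)
		if _, err := LockBasedIdempotentProcess(ctx, key, noopLogic); err != nil {
			b.Fatal(err)
		}
	}
}

Run both with go test -bench . and compare the reported ns/op to see the wrapper's absolute and relative cost for your environment.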
Conclusion
Implementing idempotency in event-driven systems is a complex but essential task for building resilient and reliable software. Moving from a naive check-then-act approach to a stateful, two-phase lock pattern addresses critical race conditions, handles process crashes, and minimizes performance bottlenecks from lock contention.
By combining a unique Idempotency Key, a Stateful Record (STARTED, COMPLETED, FAILED), and a Distributed Lock, you can transform non-idempotent operations into robust, predictable, and fault-tolerant components of your architecture. While the implementation requires careful attention to detail, especially around lock management and failure modes, it is a foundational pattern that every senior engineer working with distributed systems should master.