Building Idempotent Kafka Consumers with Redis & Distributed Locking

Goh Ling Yong

The Inescapable Problem: At-Least-Once Delivery and Its Perils

In the world of distributed event streaming, Kafka stands as a titan. Its guarantee of at-least-once message delivery is a cornerstone of its resilience. However, this guarantee places a significant burden on the consumer. Network partitions, consumer group rebalancing, and producer retries can all lead to a single logical event being delivered to your application more than once. For any non-trivial system, processing the same event twice can be catastrophic: a user might be charged twice, an inventory count could be incorrectly decremented, or critical state could be corrupted.

The textbook answer is simple: "make your consumers idempotent." But the chasm between knowing this and implementing a truly robust, scalable, and fault-tolerant idempotency layer in a high-throughput production environment is vast. A simple check against a processed_messages table in your database quickly falls apart under the harsh realities of concurrency and failure.

This article dissects the problem and provides a battle-tested pattern for implementing a generic idempotency middleware for Kafka consumers. We will leverage the speed and atomic operations of Redis to build a system that provides effectively-exactly-once processing semantics at the application layer. We will not be discussing the basics of Kafka or idempotency; we assume you are a senior engineer who has faced this problem and is looking for a comprehensive, production-ready solution.

Our journey will cover:

  • Architecting the Idempotency Key and Middleware.
  • The Naive Approach: Using SETNX and why it's insufficient.
  • A Robust Pattern: A state machine managed by atomic Lua scripts.
  • Production Hardening: Handling long-running jobs, consumer crashes, and performance tuning.

    Architecture of an Idempotency Layer

    A robust idempotency layer isn't just a function; it's a stateful component that intercepts messages before they reach your business logic. Its design requires careful consideration of three core elements.

    1. The Idempotency Key

    The entire system hinges on a unique identifier for each logical event, the Idempotency Key. The key must be deterministic and derived from the message itself. It should be provided by the producer to ensure that even if the producer sends the exact same logical event twice (e.g., due to a retry), it carries the same key.

    Good sources for an idempotency key include:

  • A UUID generated by the client application initiating the action (e.g., order-placement-uuid).
  • A composite key of stable, unique business identifiers (e.g., tenant_id:user_id:transaction_id).

    It's crucial to embed this key in the message payload or, preferably, in the Kafka message headers for a clean separation of concerns; a consumer-side extraction sketch follows the example payload below.

    json
    // Example Kafka Message Payload with Idempotency Key
    {
      "eventId": "a1b2c3d4-e5f6-7890-1234-567890abcdef", // This can be our key
      "payload": {
        "userId": "usr_123",
        "amount": 99.99,
        "currency": "USD"
      }
    }
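
    On the consumer side, extracting the key is mechanical. The sketch below assumes the segmentio/kafka-go client and a header named idempotency-key; both are illustrative choices, not requirements of the pattern.

    go
    package main
    
    import (
    	"crypto/sha256"
    	"encoding/hex"
    
    	"github.com/segmentio/kafka-go"
    )
    
    // idempotencyKeyFromMessage prefers an explicit header and falls back to a
    // deterministic hash of the payload. The fallback is only safe if retried
    // deliveries carry a byte-identical payload.
    func idempotencyKeyFromMessage(msg kafka.Message) string {
    	for _, h := range msg.Headers {
    		if h.Key == "idempotency-key" && len(h.Value) > 0 {
    			return string(h.Value)
    		}
    	}
    	sum := sha256.Sum256(msg.Value)
    	return hex.EncodeToString(sum[:])
    }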

    2. The Processing State Machine

    To handle concurrency and failures, we must track the state of each message processing attempt. A simple boolean is_processed is not enough. A more robust state machine is required:

  • RECEIVED: The initial state. The key has been seen, and processing is about to be attempted.
  • PROCESSING: A consumer has acquired a lock and is actively executing the business logic.
  • COMPLETED: Business logic finished successfully. The result of the operation is stored.
  • FAILED: Business logic failed. This allows for targeted retry or dead-lettering strategies.
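
    These states can be modeled in Go as a small string enum so the values can be written verbatim into the store; the names below are illustrative.

    go
    // ProcessingState mirrors the states described above.
    type ProcessingState string
    
    const (
    	StateReceived   ProcessingState = "RECEIVED"
    	StateProcessing ProcessingState = "PROCESSING"
    	StateCompleted  ProcessingState = "COMPLETED"
    	StateFailed     ProcessingState = "FAILED"
    )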

    3. The Middleware/Decorator Pattern

    To keep our business logic clean, we'll wrap it in a middleware or decorator. This pattern is responsible for the entire idempotency workflow: checking the key's status, acquiring a lock, executing the business logic, and updating the status.

    Here's a conceptual Go interface for our handler and middleware:

    go
    package main
    
    import "context"
    
    // Message represents a generic message from Kafka
    type Message struct {
    	ID      string // The Idempotency Key
    	Payload []byte
    }
    
    // MessageHandler is the interface for our business logic
    type MessageHandler interface {
    	Handle(ctx context.Context, msg Message) error
    }
    
    // Middleware is a function that wraps a MessageHandler
    type Middleware func(next MessageHandler) MessageHandler
    
    // IdempotencyMiddleware creates a middleware for ensuring idempotency
    func IdempotencyMiddleware(store IdempotencyStore) Middleware {
    	return func(next MessageHandler) MessageHandler {
    		return MessageHandlerFunc(func(ctx context.Context, msg Message) error {
    			// 1. Check/Acquire lock using msg.ID
    			// 2. If acquired, call next.Handle(ctx, msg)
    			// 3. Update state (Completed/Failed)
    			// 4. Release lock
    			return nil // Simplified for now
    		})
    	}
    }
    
    // MessageHandlerFunc is an adapter to allow the use of ordinary functions as MessageHandlers.
    type MessageHandlerFunc func(ctx context.Context, msg Message) error
    
    func (f MessageHandlerFunc) Handle(ctx context.Context, msg Message) error {
    	return f(ctx, msg)
    }

    This structure allows us to chain middlewares and keep the idempotency logic entirely separate from the PaymentProcessingHandler or OrderFulfillmentHandler.
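
    For illustration, a hypothetical OrderFulfillmentHandler could be wrapped like this. It extends the snippet above; IdempotencyStore stands for whatever store interface the middleware expects, and a concrete Redis-backed store is built later in the article.

    go
    // OrderFulfillmentHandler is a hypothetical business handler.
    type OrderFulfillmentHandler struct{}
    
    func (OrderFulfillmentHandler) Handle(ctx context.Context, msg Message) error {
    	// ... fulfil the order described by msg.Payload ...
    	return nil
    }
    
    // buildHandler composes the middleware chain around the business logic;
    // further middlewares (logging, metrics, tracing) would wrap in the same way.
    func buildHandler(store IdempotencyStore) MessageHandler {
    	return IdempotencyMiddleware(store)(OrderFulfillmentHandler{})
    }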


    Attempt 1: The `SETNX` Approach (And Its Flaws)

    A common first attempt at implementing this with Redis is to use the atomic SETNX (SET if Not eXists) command.

    The logic is straightforward:

  • On receiving a message with key K, execute SETNX idempotency:K 1.
  • If SETNX returns 1, the key did not previously exist and we have just set it, so we are the first consumer. We proceed with processing.
  • If SETNX returns 0, the key already exists. Another consumer is processing or has processed it, so we skip.

    To handle consumer crashes, we must add a TTL to the key: SET idempotency:K 1 EX 300 NX.

    Here's what that looks like in Go:

    go
    package main
    
    import (
    	"context"
    	"fmt"
    	"time"
    
    	"github.com/go-redis/redis/v8"
    )
    
    // NaiveIdempotencyStore uses SETNX
    type NaiveIdempotencyStore struct {
    	client *redis.Client
    	lockTTL time.Duration
    }
    
    func NewNaiveIdempotencyStore(client *redis.Client) *NaiveIdempotencyStore {
    	return &NaiveIdempotencyStore{
    		client:  client,
    		lockTTL: 5 * time.Minute, // 5 minute lock TTL
    	}
    }
    
    // Attempt to start processing a message. Returns true if successful.
    func (s *NaiveIdempotencyStore) TryStartProcessing(ctx context.Context, key string) (bool, error) {
    	// SET key value EX seconds NX
    	wasSet, err := s.client.SetNX(ctx, key, "processing", s.lockTTL).Result()
    	if err != nil {
    		return false, fmt.Errorf("redis SETNX failed: %w", err)
    	}
    	return wasSet, nil
    }
    
    // This approach has no explicit 'MarkCompleted' phase, which is a core part of its weakness.
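
    Called from a consumer loop, the naive flow looks roughly like the sketch below, reusing the Message type from the middleware section; the helper name and the key prefix are illustrative.

    go
    func handleNaively(ctx context.Context, store *NaiveIdempotencyStore, msg Message) error {
    	acquired, err := store.TryStartProcessing(ctx, "idempotency:"+msg.ID)
    	if err != nil {
    		return err // fail closed: let Kafka redeliver the message later
    	}
    	if !acquired {
    		// The key exists, but we cannot tell whether processing is in flight,
    		// finished, or died mid-way -- so we silently skip.
    		return nil
    	}
    	// ... business logic ...
    	// There is no reliable place to record COMPLETED here, which is the core weakness.
    	return nil
    }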

    Why This Is Dangerously Insufficient

    This simple approach has several critical failure modes in a production environment:

  • Indistinguishable States: The key's existence can mean two things: "processing is in progress" or "processing completed successfully." If a message is successfully processed, the key remains until the TTL expires. During this window, any redelivered messages are incorrectly skipped, but after the TTL expires, a redelivered message would be processed again. We could DEL the key on completion, but a crash between completion and DEL re-introduces the race condition.
  • Long-Running Processes: What if your business logic takes longer than the lock TTL? Once the key expires, a redelivered copy of the message will find no key, acquire a new lock via SETNX, and start processing the same message, leading to a classic race condition and duplicate processing.
  • No Stored Result: If an API call needs to return the result of the original operation on a duplicate request, this model can't help. There's no way to retrieve the outcome of the first successful processing attempt.

    This approach is a ticking time bomb in any system that requires strong consistency.


    Attempt 2: A Production-Grade State Machine with Lua

    To solve the flaws of the SETNX model, we need to implement our full state machine (RECEIVED, PROCESSING, COMPLETED, FAILED) and manage it atomically. A Redis Hash is a perfect data structure for this, storing the state, owner, and potentially the result for each key.

    The challenge is that operations like "check the status, if X, then set status to Y and update TTL" must be atomic. A series of commands from the client (HGET, then HSET) introduces race conditions. This is the perfect use case for a Redis Lua script. Lua scripts are executed atomically on the Redis server: no other command runs while the script executes, so the read-check-write sequence cannot be interleaved.

    The State-Aware Lua Script

    We'll design two scripts: acquire_lock and release_lock.

    acquire_lock.lua

    This script is the heart of our logic. It takes the key, a unique consumer ID (for ownership), and a TTL as arguments.

    lua
    -- acquire_lock.lua
    -- KEYS[1]: The idempotency key (e.g., 'idempotency:a1b2c3d4')
    -- ARGV[1]: The ID of the consumer acquiring the lock (e.g., 'consumer-pod-xyz')
    -- ARGV[2]: The lock TTL in milliseconds
    
    local key = KEYS[1]
    local owner_id = ARGV[1]
    local ttl_ms = tonumber(ARGV[2])
    
    -- Check if the key exists
    local existing_data = redis.call('HGETALL', key)
    
    -- Case 1: Key does not exist. This is the first time we see this message.
    if #existing_data == 0 then
      redis.call('HSET', key, 'status', 'PROCESSING', 'owner', owner_id)
      redis.call('PEXPIRE', key, ttl_ms)
      return 'ACQUIRED'
    end
    
    -- Key exists, convert array to map for easier access
    local state = {}
    for i = 1, #existing_data, 2 do
      state[existing_data[i]] = existing_data[i+1]
    end
    
    -- Case 2: The message has already been completed successfully.
    if state['status'] == 'COMPLETED' then
      return {'COMPLETED', state['result'] or ''} -- Return the stored result
    end
    
    -- Case 3: Another consumer is currently processing it.
    if state['status'] == 'PROCESSING' then
      -- We can check for expired locks here, but it's safer to rely on TTLs and re-processing.
      -- A more advanced implementation might use fencing tokens.
      return 'LOCKED'
    end
    
    -- Case 4: The message previously failed. We can allow a retry.
    if state['status'] == 'FAILED' then
      redis.call('HSET', key, 'status', 'PROCESSING', 'owner', owner_id, 'retries', (tonumber(state['retries']) or 0) + 1)
      redis.call('PEXPIRE', key, ttl_ms)
      return 'ACQUIRED_RETRY'
    end
    
    return 'UNKNOWN_STATE'

    release_lock.lua

    This script marks processing as complete, but only if the current consumer still owns the lock. This prevents a slow consumer from overwriting the result of a faster one that may have taken over after a lock expired.

    lua
    -- release_lock.lua
    -- KEYS[1]: The idempotency key
    -- ARGV[1]: The consumer ID that is trying to release the lock
    -- ARGV[2]: The final status ('COMPLETED' or 'FAILED')
    -- ARGV[3]: The result to store (e.g., a JSON string)
    -- ARGV[4]: The final TTL for the result (e.g., 24 hours)
    
    local key = KEYS[1]
    local owner_id = ARGV[1]
    local final_status = ARGV[2]
    local result = ARGV[3]
    local final_ttl_ms = tonumber(ARGV[4])
    
    local current_owner = redis.call('HGET', key, 'owner')
    
    -- Only the current lock holder can mark it as complete/failed
    if current_owner == owner_id then
      redis.call('HSET', key, 'status', final_status, 'result', result)
      redis.call('PEXPIRE', key, final_ttl_ms)
      return 'RELEASED'
    else
      return 'NOT_OWNER'
    end

    Go Implementation

    Now, let's implement the IdempotencyStore in Go using these scripts.

    go
    package main
    
    import (
    	"context"
    	"fmt"
    	"time"
    
    	"github.com/go-redis/redis/v8"
    	"github.com/google/uuid"
    )
    
    const (
    	scriptAcquireLock = `... (lua script content here) ...`
    	scriptReleaseLock = `... (lua script content here) ...`
    )
    
    // AcquireStatus represents the outcome of an acquire attempt
    type AcquireStatus string
    
    const (
    	Acquired       AcquireStatus = "ACQUIRED"
    	AcquiredRetry  AcquireStatus = "ACQUIRED_RETRY"
    	Locked         AcquireStatus = "LOCKED"
    	Completed      AcquireStatus = "COMPLETED"
    	UnknownState   AcquireStatus = "UNKNOWN_STATE"
    )
    
    // RedisIdempotencyStore implements the robust pattern
    type RedisIdempotencyStore struct {
    	client       *redis.Client
    	consumerID   string
    	lockTTL      time.Duration
    	completedTTL time.Duration
    	acquireSha   string
    	releaseSha   string
    }
    
    func NewRedisIdempotencyStore(ctx context.Context, client *redis.Client) (*RedisIdempotencyStore, error) {
    	// Load Lua scripts into Redis and get their SHA1 hashes for efficient execution
    	acquireSha, err := client.ScriptLoad(ctx, scriptAcquireLock).Result()
    	if err != nil {
    		return nil, fmt.Errorf("failed to load acquire script: %w", err)
    	}
    
    	releaseSha, err := client.ScriptLoad(ctx, scriptReleaseLock).Result()
    	if err != nil {
    		return nil, fmt.Errorf("failed to load release script: %w", err)
    	}
    
    	return &RedisIdempotencyStore{
    		client:       client,
    		consumerID:   uuid.NewString(), // Unique ID for this consumer instance
    		lockTTL:      5 * time.Minute,
    		completedTTL: 24 * time.Hour,
    		acquireSha:   acquireSha,
    		releaseSha:   releaseSha,
    	}, nil
    }
    
    func (s *RedisIdempotencyStore) Acquire(ctx context.Context, key string) (AcquireStatus, string, error) {
    	keys := []string{key}
    	args := []interface{}{s.consumerID, s.lockTTL.Milliseconds()}
    
    	res, err := s.client.EvalSha(ctx, s.acquireSha, keys, args...).Result()
    	if err != nil {
    		return "", "", fmt.Errorf("failed to execute acquire script: %w", err)
    	}
    
    	switch result := res.(type) {
    	case string:
    		return AcquireStatus(result), "", nil
    	case []interface{}:
    		if len(result) == 2 {
    			status, ok1 := result[0].(string)
    			payload, ok2 := result[1].(string)
    			if ok1 && ok2 {
    				return AcquireStatus(status), payload, nil
    			}
    		}
    	}
    	return UnknownState, "", fmt.Errorf("unexpected result type from Lua script: %T", res)
    }
    
    func (s *RedisIdempotencyStore) Release(ctx context.Context, key, finalStatus, result string) error {
    	keys := []string{key}
    	args := []interface{}{s.consumerID, finalStatus, result, s.completedTTL.Milliseconds()}
    
    	res, err := s.client.EvalSha(ctx, s.releaseSha, keys, args...).Result()
    	if err != nil {
    		return fmt.Errorf("failed to execute release script: %w", err)
    	}
    
    	if status, ok := res.(string); ok && status == "NOT_OWNER" {
    		// Log this! It means our process was too slow and another consumer may have taken over.
    		return fmt.Errorf("failed to release lock for key '%s': not the owner", key)
    	}
    
    	return nil
    }
    

    This implementation is far more resilient. It distinguishes between states, prevents slow consumers from corrupting data, and can even return the original result on duplicate requests.
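
    With Acquire and Release in place, the middleware skeleton from earlier can be filled in. The sketch below is one possible wiring, reusing the Message, MessageHandler, Middleware, MessageHandlerFunc and status types defined above (plus the fmt import); the retry and dead-letter policy is application-specific and only hinted at here.

    go
    func IdempotencyMiddleware(store *RedisIdempotencyStore) Middleware {
    	return func(next MessageHandler) MessageHandler {
    		return MessageHandlerFunc(func(ctx context.Context, msg Message) error {
    			key := "idempotency:" + msg.ID
    
    			status, storedResult, err := store.Acquire(ctx, key)
    			if err != nil {
    				return err // fail closed: Redis is unreachable, let Kafka redeliver later
    			}
    
    			switch status {
    			case Completed:
    				// Already processed; storedResult holds the original outcome if one was saved.
    				_ = storedResult
    				return nil
    			case Locked:
    				// Another consumer holds the lock. Acking here assumes that consumer's own
    				// delivery will drive completion or redelivery; a stricter policy could
    				// return an error so this copy is retried later.
    				return nil
    			case Acquired, AcquiredRetry:
    				if handleErr := next.Handle(ctx, msg); handleErr != nil {
    					_ = store.Release(ctx, key, "FAILED", handleErr.Error())
    					return handleErr
    				}
    				return store.Release(ctx, key, "COMPLETED", "")
    			default:
    				return fmt.Errorf("unexpected idempotency status %q for key %s", status, key)
    			}
    		})
    	}
    }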


    Production Hardening and Edge Cases

    With the core logic in place, we must consider the harsh realities of a production environment.

    Handling Long-Running Processes: The Heartbeat

    What if a job legitimately takes longer than our lockTTL? We can't just increase the TTL to hours, as that would mean a crashed consumer would block a message for that entire duration. The solution is a lock heartbeat. The consumer that holds the lock must periodically extend its expiry.

    This can be implemented with a background goroutine that runs alongside the main business logic.

    go
    // MyBusinessHandler holds a reference to the idempotency store so the
    // heartbeat can reach the Redis client and the configured lock TTL.
    type MyBusinessHandler struct {
    	store *RedisIdempotencyStore
    }
    
    func (h *MyBusinessHandler) Handle(ctx context.Context, msg Message) error {
    	// ... lock already acquired by the middleware ...
    
    	// Create a context that we can cancel to stop the heartbeat
    	heartbeatCtx, cancelHeartbeat := context.WithCancel(ctx)
    	defer cancelHeartbeat()
    
    	// Start heartbeat in the background
    	go func() {
    		ticker := time.NewTicker(h.store.lockTTL / 2) // Extend lock at half its lifetime
    		defer ticker.Stop()
    		for {
    			select {
    			case <-ticker.C:
    				// PEXPIRE is an efficient way to push the TTL forward. The key must match
    				// the one the middleware acquired. A stricter version would extend the TTL
    				// only if we still own the lock (e.g. via a small ownership-checking script).
    				h.store.client.PExpire(heartbeatCtx, "idempotency:"+msg.ID, h.store.lockTTL)
    			case <-heartbeatCtx.Done():
    				return
    			}
    		}
    	}()
    
    	// --- Execute long-running business logic here ---
    	// time.Sleep(10 * time.Minute)
    
    	// On completion, the deferred cancelHeartbeat() stops the goroutine.
    	return nil
    }

    This pattern ensures that as long as our consumer process is alive and healthy, it will retain the lock. If it crashes, the lock will expire after lockTTL, allowing another consumer to safely take over.

    Fencing Tokens for Zombie Processes

    There's a subtle but dangerous race condition:

    • Consumer A acquires a lock with a 5-minute TTL.
    • Consumer A experiences a long GC pause or network partition, lasting > 5 minutes. It's now a "zombie".
    • The lock expires. Consumer B acquires a new lock for the same key and starts processing.
    • Consumer A wakes up and, not knowing its lock expired, proceeds to complete its work and writes its result.
    • Consumer B finishes and writes its result, potentially overwriting A's or vice-versa, leading to an inconsistent state.

    The solution is fencing. Each time a lock is acquired, we can include a generation number or a unique token. The release_lock script must be modified to only accept the write if the provided token matches the currently stored token. Our current release_lock script's check against owner_id provides a basic form of this protection, but a monotonically increasing token can be even more robust.

    Performance and Scalability Considerations

    * Redis Latency: Every message now incurs at least two round-trips to Redis. This adds latency. Ensure your Redis instance is geographically close to your consumers. For ultra-low latency requirements, this might be a bottleneck.

    * Redis Throughput: Redis is single-threaded. While extremely fast, a single instance can be saturated. Use Redis Cluster for horizontal scaling. The Lua scripts work seamlessly with Redis Cluster as long as they only operate on a single key (which ours do).

    * Connection Pooling: Ensure your Go (or other language) application is using a properly configured connection pool to Redis to avoid the overhead of establishing connections for every message.

    * Memory Usage: The keys for completed messages will remain in Redis for completedTTL (e.g., 24 hours). Calculate the memory footprint: (avg_key_size + avg_hash_size) × messages_per_day. Use tools like redis-memory-for-key to analyze usage and adjust your TTLs or Redis instance size accordingly.
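
    As an illustrative calculation (not a measurement): at 10 million messages per day and roughly 300 bytes per entry (key name plus a small hash), the completed-message state alone occupies on the order of 3 GB for a 24-hour completedTTL; halving the TTL roughly halves that footprint.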

    What About Redis Failure?

    If Redis is down, your idempotency layer fails. You have two choices:

  • Fail Closed (Recommended): Stop processing Kafka messages. This preserves correctness at the cost of availability. This is the right choice for financial transactions or critical state changes.
  • Fail Open: Continue processing messages, but without the idempotency guarantee. This prioritizes availability but risks data corruption. This might be acceptable for non-critical tasks like updating view counters.

    Implement health checks against Redis and configure your consumer to pause consumption if Redis is unreachable; a minimal sketch follows. A robust deployment should use Redis Sentinel or a managed Redis Cluster for high availability.
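
    A minimal fail-closed guard might look like the sketch below, reusing the go-redis client from the earlier snippets; the retry interval and loop shape are assumptions.

    go
    // waitForRedis blocks until Redis answers a PING, backing off between attempts.
    // While blocked, the consumer should not poll Kafka, which effectively pauses
    // consumption (fail closed).
    func waitForRedis(ctx context.Context, client *redis.Client) error {
    	for {
    		if err := client.Ping(ctx).Err(); err == nil {
    			return nil
    		}
    		select {
    		case <-time.After(5 * time.Second): // arbitrary retry interval
    		case <-ctx.Done():
    			return ctx.Err()
    		}
    	}
    }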

    Conclusion: The Price of Correctness

    Implementing a truly resilient idempotency layer for an asynchronous, distributed system is a complex undertaking. A naive SETNX approach is fraught with peril and should be avoided in any system where correctness is paramount. By embracing a state machine model and leveraging the atomicity of Redis Lua scripts, we can build a generic, reusable middleware that provides effectively-exactly-once processing semantics.

    This pattern, while adding latency and operational complexity (managing Redis), is a necessary investment for mission-critical systems. It transforms the vague requirement of "make consumers idempotent" into a concrete, battle-tested architecture that protects against duplicate processing in the face of concurrency, consumer crashes, and message redeliveries. It's a foundational component for building predictable and reliable event-driven microservices.
