Robust Idempotency Layers in Microservices with Redis and Lua

Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Inevitable Problem: Duplicate Events in Distributed Systems

In any non-trivial event-driven architecture, you don't control the network. Message brokers such as Kafka, RabbitMQ, and AWS SQS (standard queues) are typically operated with at-least-once delivery guarantees. This is a pragmatic trade-off. End-to-end exactly-once delivery is expensive and, once processing has side effects outside the broker, often impossible in the face of network partitions and consumer failures. The consequence is that your message consumers will inevitably receive the same message more than once.

A naive consumer might re-charge a customer, send a duplicate notification, or corrupt state by processing the same event multiple times. The solution is not to fix the broker, but to make the consumer idempotent. An idempotent operation is one that can be applied multiple times without changing the result beyond the initial application.

This article bypasses introductory concepts. We assume you understand why idempotency is critical. Instead, we will architect and implement a robust, high-performance idempotency layer using Redis and Lua scripting. We will dissect common but flawed approaches and build a solution that withstands the concurrency and failure modes of a real-world production environment.


The Idempotency Key Pattern: A Foundation

The core pattern is straightforward: the event producer generates a unique idempotencyKey for each distinct operation and includes it in the event payload. The consumer uses this key to track the processing status of the event.

json
// Example Event Payload from a Kafka topic
{
  "eventId": "evt_1J9X2Y2eZvKYlo2CiBqjF9aA",
  "eventType": "payment.created",
  "idempotencyKey": "c1e6b5c8-3b1a-4f5c-8d3f-7e9a0b1c4d2e",
  "data": {
    "amount": 10000,
    "currency": "usd",
    "customerId": "cus_12345"
  }
}

The consumer's responsibility is to maintain a state machine for each idempotencyKey:

  • STARTED: The event has been received, but processing is not yet complete.
  • COMPLETED: The event has been successfully processed, and the result is stored.
  • FAILED: The event processing failed irrecoverably.

Our persistence layer for this state will be Redis, chosen for its high performance and atomic operation capabilities.
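
As a rough sketch of how these states might be represented on the consumer side, the snippet below defines a record type and a defensive parser for the JSON stored under each key. The names IdempotencyStatus, StoredRecord, and parseRecord are assumptions made for illustration; they are not part of Redis or any client library.

```typescript
// Hypothetical helper types; the field names are assumptions for illustration.
type IdempotencyStatus = "STARTED" | "COMPLETED" | "FAILED";

interface StoredRecord {
  status: IdempotencyStatus;
  response?: unknown;
}

const KNOWN_STATUSES: readonly string[] = ["STARTED", "COMPLETED", "FAILED"];

// Parses the raw string read from Redis; returns null when the key was absent.
function parseRecord(raw: string | null): StoredRecord | null {
  if (raw === null) return null;
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Idempotency record is not a JSON object");
  }
  const candidate = parsed as { status?: unknown; response?: unknown };
  if (typeof candidate.status !== "string" || !KNOWN_STATUSES.includes(candidate.status)) {
    throw new Error(`Unknown idempotency status: ${String(candidate.status)}`);
  }
  return { status: candidate.status as IdempotencyStatus, response: candidate.response };
}
```

Rejecting unknown statuses early keeps a corrupted or foreign value from being mistaken for a fresh event.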

    The Naive Approach and Its Race Condition Flaw

    A junior engineer's first attempt might look like this in a Node.js consumer:

    typescript
    // DO NOT USE THIS IN PRODUCTION - FLAWED EXAMPLE
    import { createClient } from 'redis';
    
    const redisClient = createClient();
    await redisClient.connect();
    
    async function handleEvent(event: { idempotencyKey: string; data: any }) {
      const { idempotencyKey, data } = event;
      const key = `idempotency:${idempotencyKey}`;
    
      const existingStatus = await redisClient.get(key);
    
      if (existingStatus) {
        console.log(`Event ${idempotencyKey} already processed or in-progress.`);
        // Potentially return stored result if available
        return;
      }
    
      // Mark as in-progress
      await redisClient.set(key, JSON.stringify({ status: 'STARTED' }), { EX: 3600 });
    
      try {
        // --- CRITICAL BUSINESS LOGIC --- 
        const result = await processPayment(data);
        // --------------------------------
    
        // Mark as completed with result
        await redisClient.set(key, JSON.stringify({ status: 'COMPLETED', response: result }), { EX: 3600 });
        return result;
      } catch (error) {
        // Handle error, maybe set status to FAILED
        await redisClient.del(key); // Or set to FAILED
        throw error;
      }
    }

    This code appears logical, but it contains a critical race condition. Imagine two instances of your consumer service running in a Kubernetes cluster. They both receive the same message from a Kafka topic partition due to a consumer group rebalance.

  • Instance A: Executes redisClient.get(key). The key does not exist. It receives null.
  • Instance B: Executes redisClient.get(key) fractions of a millisecond later. The key still does not exist. It also receives null.
  • Instance A: Proceeds to redisClient.set(key, ...'STARTED'...) and starts processing the payment.
  • Instance B: Also proceeds to redisClient.set(key, ...'STARTED'...) and starts processing the same payment.

    You have just double-charged your customer.

    The root problem is that the GET and SET operations are not atomic. We need a way to check for the key's existence and set it in a single, indivisible operation.
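
The interleaving can be reproduced without Redis at all. The sketch below uses a plain Map as a stand-in for the store and replays the reads and writes in the problematic order; the function name simulateNaiveRace and the demo key are assumptions for illustration.

```typescript
// In-memory stand-in for Redis; illustrative only, not a client.
function simulateNaiveRace(): string[] {
  const store = new Map<string, string>();
  const key = "idempotency:demo";
  const charges: string[] = [];

  // Both instances read before either one writes.
  const seenByA = store.get(key); // undefined
  const seenByB = store.get(key); // undefined

  // Each instance acts on its own stale read.
  const naiveHandle = (instance: string, staleRead: string | undefined): void => {
    if (staleRead === undefined) {
      store.set(key, "STARTED");
      charges.push(instance);
    }
  };
  naiveHandle("A", seenByA);
  naiveHandle("B", seenByB);

  return charges;
}

console.log(simulateNaiveRace()); // [ 'A', 'B' ]: the payment ran twice
```

Neither write fails or warns; the second SET silently overwrites the first, which is why the bug tends to surface only under load.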

    Attempt 2: The `SETNX` Command

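    Redis provides SETNX (SET if Not eXists) semantics, which seem purpose-built for this problem: the key is set only if it does not already exist. The standalone SETNX command cannot attach an expiration in the same step, so the code below uses SET with the NX and EX options, which does both atomically.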

    Let's refine our logic:

    typescript
    // BETTER, BUT STILL INCOMPLETE
    async function handleEventWithSetNX(event: { idempotencyKey: string; data: any }) {
      const { idempotencyKey, data } = event;
      const key = `idempotency:${idempotencyKey}`;
    
      const wasSet = await redisClient.set(key, JSON.stringify({ status: 'STARTED' }), {
        NX: true, // Only set if key does not exist
        EX: 3600, // Set an expiration
      });
    
      if (!wasSet) {
        // Key already existed. Another process is handling it or has handled it.
        console.log(`Event ${idempotencyKey} is being processed or is complete.`);
        // We need more logic here to check the actual status!
        return;
      }
    
      try {
        const result = await processPayment(data);
        await redisClient.set(key, JSON.stringify({ status: 'COMPLETED', response: result }), { EX: 3600 });
        return result;
      } catch (error) {
        await redisClient.del(key);
        throw error;
      }
    }

    This correctly prevents the race condition on initial processing. Only one consumer instance will successfully execute the SET with the NX option. However, this implementation introduces a new problem: it doesn't distinguish between a STARTED (still in progress) event and a COMPLETED event.

    If a request comes in for a key that has already been successfully processed, wasSet will be false. The function will simply log a message and return. But the caller might need the result of the original operation. We need to fetch the value, parse its status, and act accordingly. This leads us back to multiple commands (SETNX followed by a GET), re-opening a window for new race conditions.
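
To make the limitation concrete, here is a small in-memory model of the NX behaviour. It is only a sketch: setIfAbsent and simulateNxClaims are hypothetical names, and a Map stands in for Redis.

```typescript
// In-memory model of "SET key value NX"; illustrative only, not a Redis client.
function setIfAbsent(store: Map<string, string>, key: string, value: string): boolean {
  if (store.has(key)) return false;
  store.set(key, value);
  return true;
}

function simulateNxClaims(): { a: boolean; b: boolean; c: boolean } {
  const store = new Map<string, string>();
  const key = "idempotency:demo";
  const started = JSON.stringify({ status: "STARTED" });

  const a = setIfAbsent(store, key, started); // first delivery wins the claim
  const b = setIfAbsent(store, key, started); // duplicate while A is still working
  store.set(key, JSON.stringify({ status: "COMPLETED" })); // A finishes
  const c = setIfAbsent(store, key, started); // duplicate after completion

  return { a, b, c };
}

console.log(simulateNxClaims()); // { a: true, b: false, c: false }
```

Callers b and c receive the same `false`, although one should wait and the other should return the stored result. Telling them apart takes a second read.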

    The Definitive Solution: Atomic State Management with Lua Scripting

    To solve this cleanly, we need to perform our entire check-and-set logic in a single atomic operation on the Redis server. This is the perfect use case for Lua scripts. Redis guarantees that a Lua script is executed atomically. No other command can run concurrently with the script.

    Our Lua script will encapsulate the core state-machine logic.

    idempotency.lua

    lua
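    -- ARGV[1]: The idempotency key (e.g., idempotency:c1e6b5c8...)
    --          NOTE: Redis convention is to pass key names through KEYS (KEYS[1]),
    --          and Redis Cluster requires it because scripts are routed by their keys.
    --          ARGV is used here to match the caller shown later in the article.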
    -- ARGV[2]: The initial value to set for the 'STARTED' state
    -- ARGV[3]: The expiration time in seconds for the key
    
    -- Try to get the current value of the key
    local currentValue = redis.call('GET', ARGV[1])
    
    -- If the key already exists, return its value
    if currentValue then
      return currentValue
    end
    
    -- If the key does not exist, set it with the 'STARTED' status and an expiration
    -- redis.call returns 'OK' on success
    redis.call('SET', ARGV[1], ARGV[2], 'EX', ARGV[3])
    
    -- Return the value we just set to confirm the 'STARTED' state
    return ARGV[2]

    This script is the heart of our idempotency layer. It does the following, atomically:

  • Gets the current value of the idempotency key.
  • If a value exists, it immediately returns it. This could be a STARTED marker or a COMPLETED result from a previous run.
  • If no value exists, it sets the key to a STARTED state with a TTL and returns that STARTED value.

    The consumer application can now interpret the script's return value to decide on the next action.
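
The decision logic can be sketched against an in-memory model of the script. One detail matters: if every caller sends the same STARTED string, the return value cannot reveal who created the key. The sketch therefore assumes each caller embeds a unique owner token in its marker. That token, and the names getOrStart and decide, are assumptions added for illustration.

```typescript
type Decision = "PROCESS" | "IN_PROGRESS" | "DUPLICATE";

// In-memory model of the script: return the existing value, or store and return ours.
function getOrStart(store: Map<string, string>, key: string, startedValue: string): string {
  const current = store.get(key);
  if (current !== undefined) return current;
  store.set(key, startedValue);
  return startedValue;
}

// Hypothetical interpretation of the script's return value.
function decide(returned: string, ownMarker: string): Decision {
  if (returned === ownMarker) return "PROCESS"; // we created the key
  const record = JSON.parse(returned) as { status: string };
  return record.status === "COMPLETED" ? "DUPLICATE" : "IN_PROGRESS";
}

function simulateDecisions(): Decision[] {
  const store = new Map<string, string>();
  const key = "idempotency:demo";
  const markerA = JSON.stringify({ status: "STARTED", owner: "worker-a" });
  const markerB = JSON.stringify({ status: "STARTED", owner: "worker-b" });

  const first = decide(getOrStart(store, key, markerA), markerA);
  const concurrent = decide(getOrStart(store, key, markerB), markerB);
  store.set(key, JSON.stringify({ status: "COMPLETED", response: { transactionId: "txn_demo" } }));
  const later = decide(getOrStart(store, key, markerB), markerB);

  return [first, concurrent, later];
}

console.log(simulateDecisions()); // [ 'PROCESS', 'IN_PROGRESS', 'DUPLICATE' ]
```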

    Production-Grade Node.js/TypeScript Implementation

    Let's build a reusable IdempotencyService that encapsulates this logic.

    1. The Service Structure

    typescript
    import { createClient, RedisClientType } from 'redis';
    import fs from 'fs';
    import path from 'path';
    
    // Define the structure of our stored idempotency data
    interface IdempotencyRecord {
      status: 'STARTED' | 'COMPLETED' | 'FAILED';
      response?: any; // The stored result of the operation
      error?: any; // Stored error information
    }
    
    // Define the outcome of an idempotent execution attempt
    export type IdempotentExecutionResult<T> = 
      | { status: 'SUCCESS'; result: T }
      | { status: 'DUPLICATE'; result: T } // Successfully retrieved from cache
      | { status: 'IN_PROGRESS' }
      | { status: 'ERROR'; error: Error };
    
    export class IdempotencyService {
      private redisClient: RedisClientType;
      private luaScriptSha: string | null = null;
      private readonly keyPrefix = 'idempotency';
      private readonly defaultTtl = 3600; // 1 hour
    
      constructor(redisUrl: string) {
        this.redisClient = createClient({ url: redisUrl });
      }
    
      public async connect(): Promise<void> {
        await this.redisClient.connect();
        await this.loadScript();
      }
    
      private async loadScript(): Promise<void> {
        const scriptPath = path.join(__dirname, 'idempotency.lua');
        const script = fs.readFileSync(scriptPath, 'utf8');
        this.luaScriptSha = await this.redisClient.scriptLoad(script);
        console.log('Idempotency Lua script loaded with SHA:', this.luaScriptSha);
      }
    
      // ... implementation of executeIdempotently and update methods
    }

    2. The Core executeIdempotently Method

    This method orchestrates the entire process.

    typescript
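    // Inside the IdempotencyService class
    
    public async executeIdempotently<T>(
      idempotencyKey: string,
      businessLogic: () => Promise<T>,
      ttl: number = this.defaultTtl
    ): Promise<IdempotentExecutionResult<T>> {
      if (!this.luaScriptSha) {
        throw new Error('Lua script not loaded.');
      }
    
      const redisKey = `${this.keyPrefix}:${idempotencyKey}`;
      // Every caller would otherwise send the identical '{"status":"STARTED"}' string,
      // so the script's return value could not reveal who created the key. A per-call
      // owner token makes our own marker distinguishable from another worker's.
      // (Math.random keeps this snippet import-free; crypto.randomUUID() is a sturdier choice.)
      const owner = `${process.pid}-${Date.now()}-${Math.random().toString(36).slice(2)}`;
      const ownMarker = JSON.stringify({ status: 'STARTED', owner });
    
      let rawResult: string;
      try {
        // Atomically check and set the initial state
        rawResult = await this.redisClient.evalSha(this.luaScriptSha, {
          keys: [], // The script reads the key name from ARGV[1]
          arguments: [redisKey, ownMarker, String(ttl)],
        }) as string;
      } catch (error) {
        // Redis failed before we owned anything, so the key must not be touched.
        console.error(`[${idempotencyKey}] Idempotency check failed:`, error);
        return { status: 'ERROR', error: error instanceof Error ? error : new Error(String(error)) };
      }
    
      if (rawResult !== ownMarker) {
        // The key already existed; inspect what the earlier attempt left behind.
        const record: IdempotencyRecord = JSON.parse(rawResult);
        if (record.status === 'COMPLETED') {
          console.log(`[${idempotencyKey}] Duplicate request: returning cached result.`);
          return { status: 'DUPLICATE', result: record.response as T };
        }
        if (record.status === 'FAILED') {
          return { status: 'ERROR', error: new Error(`[${idempotencyKey}] A previous attempt failed irrecoverably.`) };
        }
        // An existing 'STARTED' record set by another process. We should back off.
        console.log(`[${idempotencyKey}] Request already in progress.`);
        return { status: 'IN_PROGRESS' };
      }
    
      // If we are here, we are the first process. `rawResult` is our own 'STARTED' marker.
      console.log(`[${idempotencyKey}] New request: executing business logic.`);
      let logicResult: T;
      try {
        logicResult = await businessLogic();
      } catch (error) {
        console.error(`[${idempotencyKey}] Error during idempotent execution:`, error);
        // We own the marker, so removing it is safe and allows a retry.
        await this.redisClient.del(redisKey);
        return { status: 'ERROR', error: error instanceof Error ? error : new Error(String(error)) };
      }
    
      try {
        // Store the final result
        await this.updateRecord(redisKey, { status: 'COMPLETED', response: logicResult }, ttl);
      } catch (error) {
        // The side effect has already happened. Deleting the key here would invite a
        // second execution, so the 'STARTED' marker is left in place until its TTL expires.
        console.error(`[${idempotencyKey}] Failed to store the completed result:`, error);
      }
    
      return { status: 'SUCCESS', result: logicResult };
    }
    
    private async updateRecord(key: string, record: IdempotencyRecord, ttl: number): Promise<void> {
      await this.redisClient.set(key, JSON.stringify(record), { EX: ttl });
    }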

    3. Putting it all together in a Consumer

    Now, our event consumer becomes clean and declarative.

    typescript
    // Example consumer logic
    
    const idempotencyService = new IdempotencyService('redis://localhost:6379');
    await idempotencyService.connect();
    
    interface PaymentEvent {
      idempotencyKey: string;
      data: { amount: number; customerId: string };
    }
    
    // Mock of a payment processing function
    async function processPayment(data: { amount: number; customerId: string }): Promise<{ transactionId: string }> {
      console.log(`Processing payment for ${data.customerId} of ${data.amount}`);
      // Simulate network delay and processing time
      await new Promise(resolve => setTimeout(resolve, 1000));
      return { transactionId: `txn_${Math.random().toString(36).slice(2, 11)}` };
    }
    
    async function handlePaymentEvent(event: PaymentEvent) {
      const result = await idempotencyService.executeIdempotently(
        event.idempotencyKey,
        () => processPayment(event.data)
      );
    
      switch (result.status) {
        case 'SUCCESS':
          console.log('Payment processed successfully:', result.result.transactionId);
          // Acknowledge the message from the broker
          break;
        case 'DUPLICATE':
          console.log('Duplicate payment event, original result:', result.result.transactionId);
          // Acknowledge the message from the broker
          break;
        case 'IN_PROGRESS':
          console.warn('Payment is already being processed by another worker. NACKing message for later retry.');
          // Do NOT acknowledge the message, let the broker redeliver it after a delay
          break;
        case 'ERROR':
          console.error('An error occurred:', result.error.message);
          // Move the message to a dead-letter queue or NACK for retry
          break;
      }
    }
    
    // Simulate receiving the same event twice quickly
    const event: PaymentEvent = {
      idempotencyKey: 'c1e6b5c8-3b1a-4f5c-8d3f-7e9a0b1c4d2e',
      data: { amount: 10000, customerId: 'cus_12345' }
    };
    
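    // Start both calls before awaiting, so the second arrives while the first is in flight
    await Promise.all([
      handlePaymentEvent(event), // First call
      handlePaymentEvent(event), // Second call (simulates duplicate delivery)
    ]);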

    Analysis of the Lua-based Solution

    This architecture correctly handles the state machine:

  • First Request: The Lua script finds no key, sets the STARTED state, and returns it. The executeIdempotently function proceeds to run the business logic, and upon success, overwrites the key with the COMPLETED state and result.
  • Concurrent Request: A second instance calls the Lua script while the first is processing. The script finds the STARTED key and immediately returns it. The service identifies this as an IN_PROGRESS state and backs off, preventing duplicate execution.
  • Post-Completion Request: A third request arrives after the first has completed. The Lua script finds the COMPLETED key with the stored result and returns it. The service identifies this as a DUPLICATE and returns the cached response without re-executing the business logic.

    Advanced Topic: Handling Long-Running Jobs

    The IN_PROGRESS state works well for operations that complete within seconds. But what if your business logic takes 5 minutes to run? A short-term NACK (Negative Acknowledgement) might not be sufficient. The message could be redelivered multiple times, with each consumer instance seeing the IN_PROGRESS state and backing off, effectively stalling the process if the original worker dies.

    For long-running jobs, we need to augment our idempotency check with a distributed lock.

    The flow becomes:

  • Check idempotency status (using the Lua script). If COMPLETED, return the result.
  • If the status is new or STARTED (from a failed previous run), attempt to acquire a distributed lock using the idempotency key.
    • If the lock is acquired, proceed with the business logic.
    • If the lock cannot be acquired, another worker holds it. Back off and retry.
  • Upon completion (success or failure), update the idempotency record and release the lock.

    Using a library like redlock simplifies this greatly.

    typescript
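    // PSEUDOCODE - Integrating a distributed lock
    import Redlock from 'redlock';
    
    // Assumes a `redlock` instance was created at startup from a Redis client the library supports.
    // ... inside executeIdempotently ...
    
    // After checking the Lua script result...
    if (record.status === 'STARTED') {
      const lockKey = `lock:${idempotencyKey}`;
      // The lock TTL must outlive the job (or be extended while it runs);
      // otherwise a second worker can acquire the lock mid-execution.
      const lockTtlMs = 10 * 60 * 1000; // illustrative: 10 minutes for a ~5 minute job
      let lock;
      try {
        // How long this waits and retries is governed by the Redlock instance's settings
        lock = await redlock.acquire([lockKey], lockTtlMs);
      } catch (err) {
        // Failed to acquire lock, another worker is active.
        return { status: 'IN_PROGRESS' };
      }
    
      try {
        // We got the lock! Re-read the idempotency record here before doing any work:
        // the previous lock holder may have completed in the meantime.
        const logicResult = await businessLogic();
        await this.updateRecord(redisKey, { status: 'COMPLETED', response: logicResult }, ttl);
        return { status: 'SUCCESS', result: logicResult };
      } catch (err) {
        // A business-logic failure is a real error, not a lock conflict.
        return { status: 'ERROR', error: err instanceof Error ? err : new Error(String(err)) };
      } finally {
        await lock.release();
      }
    }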

    This pattern ensures that only one worker can execute the long-running logic at a time, making the system more robust against worker failures during processing.

    Performance and Edge Case Considerations

    1. TTL Management is Critical

    Every key set in Redis for idempotency must have a TTL. Without it, your Redis memory will grow indefinitely. The TTL should be chosen carefully:

    • It must be longer than the maximum possible time for your message broker to redeliver a message plus the maximum processing time.
    • A common value is 24-72 hours. This ensures that if a message gets stuck in a retry loop, the key will eventually expire, preventing permanent blockage, but is long enough to handle standard operational delays.
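
The lower bound from the first bullet can be written down as a small helper. This is a sketch under stated assumptions: minimumTtlSeconds is a hypothetical name, and the safety factor of 2 is an arbitrary margin, not a recommendation from Redis or any broker.

```typescript
// All durations are in seconds. The safety factor is an assumption for illustration.
function minimumTtlSeconds(
  maxRedeliverySeconds: number,
  maxProcessingSeconds: number,
  safetyFactor: number = 2
): number {
  if (maxRedeliverySeconds < 0 || maxProcessingSeconds < 0 || safetyFactor < 1) {
    throw new RangeError("durations must be non-negative and safetyFactor at least 1");
  }
  return Math.ceil((maxRedeliverySeconds + maxProcessingSeconds) * safetyFactor);
}

// Example: a 12-hour redelivery window and 5 minutes of processing.
console.log(minimumTtlSeconds(12 * 3600, 300)); // 87000 seconds, a little over 24 hours
```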

    2. Consumer Crashes

    What happens if a consumer sets the STARTED state and then crashes?

  • The TTL is our safety net. The STARTED key will eventually expire.
  • The message broker will time out on the message acknowledgement and redeliver the message to another consumer.
  • Until the key expires, redelivered copies still see the STARTED marker and back off, so recovery takes as long as the remaining TTL. Once the key has expired, the new consumer finds no record and starts processing from scratch. This is the desired behavior.
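
The recovery sequence can be traced with an in-memory store that supports expiry and a controllable clock. This is a sketch, not a Redis client: ExpiringStore, claim, and the 60-second TTL are assumptions chosen to keep the example short.

```typescript
// Minimal in-memory store with expiry and an injectable clock.
class ExpiringStore {
  private entries = new Map<string, { value: string; expiresAt: number }>();
  private readonly now: () => number;

  constructor(now: () => number) {
    this.now = now;
  }

  // Models "SET key value NX EX ttlSeconds": true only when the key is absent or expired.
  claim(key: string, value: string, ttlSeconds: number): boolean {
    const entry = this.entries.get(key);
    if (entry !== undefined && entry.expiresAt > this.now()) return false;
    this.entries.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
    return true;
  }
}

function simulateCrashRecovery(): boolean[] {
  let clockMs = 0;
  const store = new ExpiringStore(() => clockMs);
  const key = "idempotency:demo";

  const firstClaim = store.claim(key, "STARTED", 60); // worker A claims, then crashes
  clockMs = 30000;
  const earlyRetry = store.claim(key, "STARTED", 60); // redelivery at 30s: still blocked
  clockMs = 61000;
  const lateRetry = store.claim(key, "STARTED", 60); // redelivery after expiry: proceeds

  return [firstClaim, earlyRetry, lateRetry];
}

console.log(simulateCrashRecovery()); // [ true, false, true ]
```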

    3. Redis Availability

    If Redis is down, your idempotency layer is down. You must decide on a failure strategy:

  • Fail-Closed (Recommended): If the service cannot connect to Redis, it should not process the event. NACK the message and retry. This prioritizes correctness over availability.
  • Fail-Open: Process the event without the idempotency check. This risks duplicate processing but maintains service availability. This is only acceptable for non-critical operations.

    Implement a circuit breaker pattern around the Redis client to handle transient network issues gracefully.
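
A circuit breaker for this purpose can be quite small. The following is a minimal sketch, assuming a consecutive-failure threshold and a fixed cooldown; the class name, its methods, and the parameter values are illustrative and do not come from any Redis client.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "CLOSED";
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // True when a Redis call may be attempted.
  canAttempt(): boolean {
    if (this.state === "OPEN" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "HALF_OPEN"; // cooldown elapsed: let probe calls through
    }
    return this.state !== "OPEN";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "CLOSED";
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed probe, or too many consecutive failures, opens the circuit.
    if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      this.openedAt = this.now();
    }
  }
}
```

In a fail-closed consumer, a `false` from canAttempt() would translate into NACKing the message immediately instead of waiting on a Redis timeout for every event.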

    4. Idempotency Key Generation

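    The producer is responsible for creating a high-quality key. A random UUID (v4) is often sufficient, provided the producer stores it and reuses it whenever it resends the event. If the producer itself might retry sending an event, it must generate the same key for the same logical operation. A UUIDv5, which is a hash of a namespace UUID and a name, can be used to generate a deterministic key from the operation's parameters, e.g., UUIDv5(chargeNamespace, customerId + ':' + orderId), where chargeNamespace is a fixed UUID chosen for this operation type and the delimiter keeps different ID pairs from concatenating to the same name.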
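
For illustration, a name-based (version 5) UUID can be derived with Node's built-in crypto module as sketched below. The helper names uuidV5 and chargeKey are hypothetical, and the namespace constant is RFC 4122's predefined DNS namespace, reused here purely as a placeholder; a real service would pick its own fixed namespace UUID.

```typescript
import { createHash } from "node:crypto";

// Placeholder namespace (RFC 4122's DNS namespace); an assumption for this sketch.
const CHARGE_NAMESPACE = "6ba7b810-9dad-11d1-80b4-00c04fd430c8";

// Name-based UUID (version 5): SHA-1 over the namespace bytes followed by the name.
function uuidV5(namespace: string, name: string): string {
  const nsBytes = Buffer.from(namespace.replace(/-/g, ""), "hex");
  if (nsBytes.length !== 16) throw new Error("namespace must be a UUID");
  const hash = createHash("sha1").update(nsBytes).update(name, "utf8").digest();
  const bytes = hash.subarray(0, 16);
  bytes[6] = (bytes[6] & 0x0f) | 0x50; // set the version field to 5
  bytes[8] = (bytes[8] & 0x3f) | 0x80; // set the RFC 4122 variant bits
  const hex = bytes.toString("hex");
  return [
    hex.slice(0, 8), hex.slice(8, 12), hex.slice(12, 16), hex.slice(16, 20), hex.slice(20),
  ].join("-");
}

// The delimiter keeps ("ab", "c") and ("a", "bc") from producing the same name.
function chargeKey(customerId: string, orderId: string): string {
  return uuidV5(CHARGE_NAMESPACE, `${customerId}:${orderId}`);
}
```

Calling chargeKey with the same customer and order always yields the same key, so a producer that retries after a timeout does not need to persist the key it sent the first time.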

    5. Storing Large Responses

    Be mindful of storing large response objects in the idempotency record. Redis is an in-memory database. If your operation returns a multi-megabyte payload, storing it in the idempotency key is inefficient. A better pattern is to store the large payload in an object store (like S3) and save only the S3 object locator in the Redis record.
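
One way to apply this is to decide at write time whether the response is stored inline or by reference. The sketch below assumes a size threshold and an `upload` callback that writes the JSON to an object store and returns its locator; both are hypothetical, and the callback is synchronous only to keep the example short (a real upload would be asynchronous).

```typescript
// Either the response itself or a pointer to where it was stored.
type StoredResponse =
  | { kind: "inline"; body: unknown }
  | { kind: "reference"; locator: string };

function toStoredResponse(
  response: unknown,
  maxInlineBytes: number,
  upload: (json: string) => string // hypothetical: returns e.g. an s3:// URI
): StoredResponse {
  const json = JSON.stringify(response) ?? "null";
  const sizeBytes = new TextEncoder().encode(json).length; // UTF-8 size, not string length
  if (sizeBytes <= maxInlineBytes) {
    return { kind: "inline", body: response };
  }
  return { kind: "reference", locator: upload(json) };
}
```

A duplicate request then either returns the inline body directly or fetches the payload through the stored locator.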

    Conclusion

    Implementing a robust idempotency layer is a non-negotiable requirement for building reliable event-driven systems. While the concept is simple, a production-grade implementation must be resilient to concurrency and failure. By leveraging the atomic nature of Redis Lua scripts, we can create a clean, reusable, and high-performance service that elevates the reliability of our entire architecture. This pattern moves the complexity of handling at-least-once delivery into a single, well-tested component, allowing your core business logic to remain focused, simple, and correct.
