Idempotency Layers in Event-Driven Systems with Redis & Lua

Goh Ling Yong

The Inevitability of Duplicates and the Fallacy of Naive Checks

In any non-trivial distributed system that relies on message queues (Kafka, RabbitMQ, SQS), the guarantee is typically at-least-once delivery. Network partitions, consumer crashes, and acknowledgement timeouts conspire to create duplicate messages. For operations that are not naturally idempotent (e.g., charging a credit card, sending a notification), processing a duplicate message can be catastrophic.

A senior engineer's first instinct is to build an idempotency layer. The common junior-level approach involves a simple check against a shared store like Redis:

javascript
// DO NOT USE THIS IN PRODUCTION - FLAWED EXAMPLE
async function naiveProcessMessage(message) {
  const idempotencyKey = `idempotency:${message.id}`;
  const isProcessed = await redisClient.get(idempotencyKey);

  if (isProcessed) {
    console.log(`Message ${message.id} is a duplicate, skipping.`);
    return;
  }

  // Business logic here...
  await chargeCreditCard(message.payload);

  // Mark as processed
  await redisClient.set(idempotencyKey, '1', 'EX', 86400); // 24-hour expiry
}

This code is critically flawed due to a race condition. Imagine two consumers (C1, C2) receive the same message simultaneously:

  • C1 executes redisClient.get(key) and receives null.
  • C2 executes redisClient.get(key) and also receives null.
  • C1 proceeds to the business logic.
  • C2 also proceeds to the business logic. The credit card is charged twice.
  • Both C1 and C2 eventually call redisClient.set(key, '1').

This is a classic time-of-check to time-of-use (TOCTOU) bug: the state of the system changed between the check (get) and the action (set).
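
To see the race in action, run two concurrent invocations against the same message (a minimal sketch; it assumes the naiveProcessMessage, redisClient, and a stubbed chargeCreditCard from above):

javascript
// Minimal harness: both calls GET before either SETs, so both charge the card.
const message = { id: 'msg-123', payload: { amount: 100, currency: 'USD' } };

async function demonstrateRace() {
  await Promise.all([
    naiveProcessMessage(message), // consumer C1
    naiveProcessMessage(message), // consumer C2
  ]);
  // chargeCreditCard has now run twice for the same message id.
}

demonstrateRace();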

    The Foundation: Atomic `SETNX` for Basic Locking

    The obvious next step is to use an atomic operation. Redis provides SET key value [EX seconds] [PX milliseconds] [NX|XX]. The NX option means "set the key only if it does not already exist." This single command combines the check and the set into one atomic operation, eliminating the race condition described above.

    javascript
    // A better, but still incomplete, implementation
    const { Redis } = require('ioredis');
    const redisClient = new Redis();
    
    async function processMessageWithSetNX(message) {
      const idempotencyKey = `idempotency:v1:${message.producer}:${message.id}`;
      const lockAcquired = await redisClient.set(idempotencyKey, 'PROCESSING', 'EX', 3600, 'NX');
    
      if (!lockAcquired) {
        console.log(`Message ${message.id} is a duplicate or is being processed, skipping.`);
        // In a real system, you would check the value to see if it's 'COMPLETED' vs 'PROCESSING'
        return; // Acknowledge the message
      }
    
      try {
        console.log(`Processing message ${message.id}...`);
        // ... execute critical business logic ...
        await someLongRunningTask(message.payload);
    
        // Mark as completed. Note the TTL update.
        await redisClient.set(idempotencyKey, 'COMPLETED', 'EX', 86400); // Keep record for 24 hours
        console.log(`Message ${message.id} processed successfully.`);
      } catch (error) {
        console.error(`Failed to process message ${message.id}`, error);
        // Release the lock on failure to allow for retries
        await redisClient.del(idempotencyKey);
        throw error; // NACK the message to trigger a redelivery
      }
    }

    This is a significant improvement. The SET ... NX command guarantees that only one consumer can acquire the initial lock. However, this pattern introduces a new, more subtle set of problems.

    The `SETNX` Pattern's Critical Flaw: The Crash-and-Stall Scenario

    What happens if a consumer acquires the lock (sets the key to 'PROCESSING') and then crashes before it can complete the work and update the key to 'COMPLETED' or delete it on failure?

    The idempotency key is now stuck in the 'PROCESSING' state for its entire TTL (1 hour in our example). Any subsequent attempts to process this message will see the key exists and incorrectly assume it's a duplicate or being actively processed, effectively stalling the message for an hour. This is often an unacceptable delay.

    We need a way to distinguish between a message that has been successfully completed and one that is merely in-progress. A simple key existence check is insufficient. We need to store and atomically update the state of the processing lifecycle within the key's value.

    Multi-State Management and the Need for Transactional Logic

    Let's define our desired states:

    * NULL (Key doesn't exist): Ready for processing.

    * PROCESSING: A consumer has acquired the lock and is working on the message. Includes a timestamp to detect stale locks.

    * COMPLETED: The message has been successfully processed. Store the result if needed.

    * FAILED: The message processing failed after multiple retries.
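
    Concretely, we can persist this lifecycle as a small JSON document in the key's value. A representative shape (the field names below are the ones the Lua script in the next section relies on):

    json
    {
      "status": "PROCESSING",
      "owner": "consumer-abc",
      "timestamp": 1718000000000
    }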

    Our logic for an incoming message now looks like this:

    • Check the key.
    • If it doesn't exist, set it to PROCESSING with a processing TTL and start work.
    • If it exists:
      a. If the value is COMPLETED, it's a duplicate. Acknowledge and stop.
      b. If the value is PROCESSING, check its timestamp. If it's older than our timeout threshold (e.g., 5 minutes), assume the previous consumer crashed. Re-acquire the lock and start work. If it's recent, another consumer is likely working on it. NACK the message for a short delay.
    Attempting to implement this with separate Redis commands (GET, then SET) re-introduces the original race condition. A Redis MULTI/EXEC transaction block also fails us here. Transactions in Redis are atomic, but they are not conditional. You cannot GET a value, use it in your application logic to decide what to do next, and then SET a new value all within a single atomic transaction. The commands are queued before the first one is even executed.
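
    To make the limitation concrete, here is a minimal ioredis sketch of that dead end (illustrative only):

    javascript
    const { Redis } = require('ioredis');
    const redis = new Redis();

    async function whyMultiExecFails(key) {
      const tx = redis.multi();
      tx.get(key); // queued, not executed -- there is no value to branch on yet
      // We cannot inspect the GET result here to decide between SET, DEL, or no-op;
      // every command must be queued up-front, before EXEC runs them atomically.
      tx.set(key, 'PROCESSING'); // would be queued and executed unconditionally
      const results = await tx.exec(); // results only become visible after EXEC
      return results; // e.g., [[null, previousValue], [null, 'OK']]
    }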

    This is the exact problem that Redis's server-side Lua scripting is designed to solve.

    Achieving True Atomicity with Lua Scripting

    A Lua script executed with EVAL or EVALSHA runs atomically on the Redis server. No other Redis command can run concurrently with the script. This allows us to implement our complex conditional logic safely.

    Let's design our first script: acquire_lock.lua.

    acquire_lock.lua

    lua
    -- KEYS[1]: The idempotency key (e.g., 'idempotency:msg123')
    -- ARGV[1]: The current processing instance ID (e.g., 'consumer-abc')
    -- ARGV[2]: The processing timeout in milliseconds (e.g., 300000 for 5 minutes)
    -- ARGV[3]: The initial TTL for the processing lock in seconds (e.g., 3600)
    
    local key = KEYS[1]
    local instance_id = ARGV[1]
    local processing_timeout_ms = tonumber(ARGV[2])
    local lock_ttl_s = tonumber(ARGV[3])
    
    local existing_data = redis.call('get', key)
    
    if not existing_data then
      -- Key does not exist, we can acquire the lock.
      local new_value = cjson.encode({
        status = 'PROCESSING',
        owner = instance_id,
        timestamp = redis.call('time')[1] * 1000 -- Redis server time in ms (whole-second resolution)
      })
      redis.call('set', key, new_value, 'EX', lock_ttl_s)
      return 'ACQUIRED'
    end
    
    -- Key exists, inspect its state.
    local data = cjson.decode(existing_data)
    
    if data.status == 'COMPLETED' then
      return 'DUPLICATE'
    end
    
    if data.status == 'PROCESSING' then
      local current_time_ms = redis.call('time')[1] * 1000
      local lock_age_ms = current_time_ms - data.timestamp
    
      if lock_age_ms > processing_timeout_ms then
        -- The lock is stale. Re-acquire it.
        local new_value = cjson.encode({
          status = 'PROCESSING',
          owner = instance_id,
          timestamp = current_time_ms
        })
        redis.call('set', key, new_value, 'EX', lock_ttl_s)
        return 'ACQUIRED_STALE'
      else
        -- Lock is held by another process and is not stale.
        return 'CONFLICT'
      end
    end
    
    -- Default case for other statuses like FAILED, etc.
    return 'CONFLICT'

    Key Features of this Script:

    * Atomicity: All logic from redis.call('get', key) to the final redis.call('set', ...) is executed as a single, indivisible operation.

    * Stateful Inspection: It decodes the JSON stored in the key to make decisions based on the status field.

    * Stale Lock Detection: It uses redis.call('time'), the Redis server's own clock, to check whether a PROCESSING lock has exceeded its timeout, solving our crash-and-stall problem. Note that TIME is a non-deterministic command; this is safe because modern Redis replicates a script's effects (the resulting writes) rather than re-running the script on replicas.

    * Clear Return Values: The script returns distinct strings (ACQUIRED, DUPLICATE, CONFLICT) that the application code can use to drive its behavior.

    Integrating the Lua Script into a Node.js Consumer

    Here's how you would build a robust consumer in Node.js using ioredis, which has excellent support for Lua scripting.

    javascript
    const { Redis } = require('ioredis');
    const fs = require('fs');
    const path = require('path');
    const crypto = require('crypto');
    
    class IdempotentConsumer {
      constructor(redisOptions) {
        this.redis = new Redis(redisOptions);
        this.instanceId = `consumer-${crypto.randomBytes(8).toString('hex')}`;
    
        // Load and register the Lua script with Redis for better performance via EVALSHA
        const acquireLockScript = fs.readFileSync(path.join(__dirname, 'acquire_lock.lua'), 'utf8');
        this.redis.defineCommand('acquireLock', {
          numberOfKeys: 1,
          lua: acquireLockScript,
        });
      }
    
      async process(message) {
        const idempotencyKey = `idempotency:v2:${message.producer}:${message.id}`;
        const PROCESSING_TIMEOUT_MS = 300000; // 5 minutes
        const LOCK_TTL_S = 3600; // 1 hour
    
        try {
          const result = await this.redis.acquireLock(
            idempotencyKey,
            this.instanceId,
            PROCESSING_TIMEOUT_MS,
            LOCK_TTL_S
          );
    
          switch (result) {
            case 'ACQUIRED':
            case 'ACQUIRED_STALE':
              console.log(`[${this.instanceId}] Lock acquired for message ${message.id}. Processing...`);
              await this.executeBusinessLogic(message, idempotencyKey);
              // Acknowledge the message to the broker
              return 'ACK';
    
            case 'DUPLICATE':
              console.log(`[${this.instanceId}] Message ${message.id} is a duplicate. Skipping.`);
              // Acknowledge the message, work is already done
              return 'ACK';
    
            case 'CONFLICT':
              console.log(`[${this.instanceId}] Message ${message.id} is being processed by another consumer. NACKing.`);
              // Negative-acknowledge to retry after a delay
              return 'NACK';
    
            default:
              throw new Error(`Unknown lock acquisition result: ${result}`);
          }
        } catch (error) {
          console.error(`[${this.instanceId}] An unexpected error occurred for message ${message.id}`, error);
          // NACK for retry
          return 'NACK';
        }
      }
    
      async executeBusinessLogic(message, idempotencyKey) {
        try {
          // Your actual business logic goes here
          // For example, calling a payment gateway, writing to a database, etc.
          await new Promise(resolve => setTimeout(resolve, 2000)); // Simulate work
    
          // On success, update the key to 'COMPLETED'
          const finalValue = JSON.stringify({
            status: 'COMPLETED',
            owner: this.instanceId,
            completedAt: Date.now(),
            result: { success: true, transactionId: 'txn_123' },
          });
          // Use a longer TTL for completed records
          await this.redis.set(idempotencyKey, finalValue, 'EX', 86400);
          console.log(`[${this.instanceId}] Successfully processed message ${message.id}.`);
    
        } catch (businessError) {
          console.error(`[${this.instanceId}] Business logic failed for message ${message.id}`, businessError);
          // On failure, release the lock so another consumer can retry
          await this.redis.del(idempotencyKey);
          throw businessError; // Propagate to trigger NACK in the main processor
        }
      }
    }
    
    // --- Example Usage ---
    async function main() {
      const message = {
        id: 'msg-unique-456',
        producer: 'order-service',
        payload: { amount: 100, currency: 'USD' },
      };
    
      // Simulate two consumers trying to process the same message
      const consumer1 = new IdempotentConsumer();
      const consumer2 = new IdempotentConsumer();
    
      console.log('--- Concurrent Processing Simulation ---');
      await Promise.all([
        consumer1.process(message),
        consumer2.process(message)
      ]);
    
      console.log('\n--- Duplicate Processing Simulation ---');
      // Wait for the first one to complete, then try again
      await new Promise(resolve => setTimeout(resolve, 3000));
      await consumer1.process(message);

      // Close the Redis connections so the Node.js process can exit cleanly
      await consumer1.redis.quit();
      await consumer2.redis.quit();
    }
    
    main();
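
    One caveat in executeBusinessLogic above: the plain SET that writes COMPLETED can clobber a lock that another consumer legitimately re-acquired after a stale-lock takeover. A sketch of an owner-checked completion script closes that gap (a hypothetical complete_lock.lua, following the same conventions as acquire_lock.lua):

    lua
    -- KEYS[1]: The idempotency key
    -- ARGV[1]: The instance ID claiming completion
    -- ARGV[2]: The JSON payload to store as the COMPLETED record
    -- ARGV[3]: The TTL in seconds for the completed record

    local existing = redis.call('get', KEYS[1])
    if not existing then
      return 'MISSING' -- the lock expired or was already released
    end

    local data = cjson.decode(existing)
    if data.status == 'PROCESSING' and data.owner == ARGV[1] then
      redis.call('set', KEYS[1], ARGV[2], 'EX', tonumber(ARGV[3]))
      return 'COMPLETED'
    end

    return 'LOST' -- another consumer took over; do not overwrite its state

    The consumer would register this via defineCommand just like acquireLock and treat a LOST result as a signal that its work may be repeated by the new lock owner.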

    Performance Considerations and Benchmarking

    While correct, this pattern is not free. Every message now incurs at least one Redis round trip.

  • Latency Overhead: The EVAL command itself is extremely fast, but you are still subject to network latency between your consumer and the Redis server. In a high-throughput system processing thousands of messages per second, this can become a bottleneck. Co-locating your consumers and Redis cluster in the same availability zone is critical.
  • EVAL vs. EVALSHA: ioredis's defineCommand method cleverly uses EVALSHA under the hood. The first time the command is called, it sends the full script to Redis via SCRIPT LOAD, which returns an SHA1 hash. Subsequent calls use the much smaller EVALSHA command with this hash, saving bandwidth. Always use a Redis client that supports this optimization (see the sketch after this list).
  • CPU Load on Redis: Lua scripts are blocking and single-threaded. A slow-running script can block all other Redis operations. Our script is composed of fast, native Redis commands and minimal logic, so its performance is nearly identical to the underlying commands. Avoid complex loops, string manipulation, or other computationally intensive tasks in your Lua scripts.
  • Memory Usage: Each idempotency key consumes memory in Redis. For a system with 1 million messages per day and a 24-hour TTL, you will have roughly 1 million keys in Redis at all times. The size of the stored value matters:
    * A simple key 'idempotency:msg-123' with value 'COMPLETED' is ~50 bytes.
    * Our JSON object is larger, perhaps 150-250 bytes.

    1,000,000 keys × 250 bytes/key ≈ 250 MB. This is manageable for most Redis instances, but it's crucial to monitor and provision memory accordingly. Use realistic TTLs; there's rarely a need to keep idempotency records for more than a few days.
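
    For illustration, here is roughly what that optimization looks like when done by hand (a sketch; ioredis's defineCommand handles the caching and the NOSCRIPT fallback for you):

    javascript
    const { Redis } = require('ioredis');
    const redis = new Redis();

    async function runViaEvalsha(script, key, ...args) {
      // Cache the script server-side once; Redis returns its SHA1 digest.
      const sha = await redis.script('LOAD', script);
      try {
        // Subsequent calls send only the 40-character hash, not the full script.
        return await redis.evalsha(sha, 1, key, ...args);
      } catch (err) {
        // The script cache can be flushed (SCRIPT FLUSH, failover to a fresh
        // node); fall back to a full EVAL, which re-caches the script.
        if (String(err.message).includes('NOSCRIPT')) {
          return redis.eval(script, 1, key, ...args);
        }
        throw err;
      }
    }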

    Advanced Edge Cases and Production Hardening

    Even with this robust pattern, there are production realities to consider.

    Edge Case 1: The Out-of-Order Completion Update

    Consider this sequence:

  • Consumer C1 acquires the lock.
  • C1 successfully executes the business logic (e.g., writes to a PostgreSQL database).
  • C1 attempts to update the Redis key to COMPLETED but fails due to a network blip or Redis failover.
  • The PROCESSING lock eventually times out (after 5 minutes).
  • Consumer C2 sees the stale lock, re-acquires it, and re-runs the business logic, causing a duplicate database write.

    Solution: This reveals that the idempotency layer is only one part of the solution. The business logic itself must be idempotent where possible. If not, you need to combine this pattern with a transactional outbox pattern, as sketched below. The state of the message (COMPLETED) should be written to your primary database within the same transaction as the business logic. A separate process then relays this state change to Redis. This moves the source of truth to your transactional database, with Redis acting as a fast cache for the idempotency check.
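
    A minimal sketch of that idea, assuming node-postgres and two hypothetical tables, payments and idempotency_records (with a unique constraint on message_id):

    javascript
    const { Pool } = require('pg');
    const pool = new Pool();

    async function processWithOutbox(message) {
      const client = await pool.connect();
      try {
        await client.query('BEGIN');

        // Claim the message inside the transaction; a duplicate insert is a no-op.
        const claimed = await client.query(
          `INSERT INTO idempotency_records (message_id, status)
           VALUES ($1, 'COMPLETED') ON CONFLICT (message_id) DO NOTHING`,
          [message.id]
        );
        if (claimed.rowCount === 0) {
          await client.query('ROLLBACK'); // already processed by a prior transaction
          return 'DUPLICATE';
        }

        // The business write commits or rolls back together with the claim above.
        await client.query(
          'INSERT INTO payments (message_id, amount) VALUES ($1, $2)',
          [message.id, message.payload.amount]
        );

        await client.query('COMMIT');
        // A relay process (or this consumer, best-effort) then mirrors the
        // COMPLETED state into Redis as the fast-path cache.
        return 'ACK';
      } catch (err) {
        await client.query('ROLLBACK');
        throw err;
      } finally {
        client.release();
      }
    }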

    Edge Case 2: Clock Skew

    Our stale lock detection relies on timestamps. data.timestamp is set by the Redis server time when the lock is acquired. The check current_time_ms - data.timestamp also uses Redis server time. This makes the calculation robust against clock skew between consumer instances. However, if you were to pass in a client-generated timestamp, you would be vulnerable to issues where a consumer with a clock set in the future could create locks that never appear stale to other consumers.

    Rule of thumb: Always use the Redis server's time (redis.call('time')) as the source of truth for time-based calculations within Lua scripts.

    Edge Case 3: Redis Cluster and Key Hashing

    If you are using Redis Cluster, all keys used in a single script execution must belong to the same hash slot. Our acquire_lock.lua script only operates on a single key (KEYS[1]), so it is perfectly compatible with Redis Cluster without any special considerations. If you were to design a more complex script that operated on multiple keys, you would need to use hash tags (e.g., idempotency:{user-123}:msg-abc, state:{user-123}:profile) to ensure those keys land on the same shard.
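
    For example, a trivial two-key script is cluster-safe only because both keys share the {user-123} tag (a hypothetical sketch; a production setup would use new Redis.Cluster):

    javascript
    const { Redis } = require('ioredis');
    const redis = new Redis(); // in production: new Redis.Cluster([...])

    // Both keys contain {user-123}, so they hash to the same slot.
    const script = `return { redis.call('exists', KEYS[1]), redis.call('exists', KEYS[2]) }`;

    async function checkBothKeys() {
      return redis.eval(
        script,
        2,                                 // number of keys
        'idempotency:{user-123}:msg-abc',  // KEYS[1]
        'state:{user-123}:profile'         // KEYS[2]
      );
    }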

    Conclusion: Beyond Simple Locks

    Building a correct, production-grade idempotency layer in an event-driven system requires moving beyond simple atomic primitives like SETNX. While SETNX solves the most basic race condition, it fails to handle the operational reality of consumer crashes, which can lead to stalled messages and unacceptable processing delays.

    By leveraging Redis's server-side Lua scripting, we can implement complex, conditional, and truly atomic logic. The multi-state pattern (PROCESSING, COMPLETED) combined with stale lock detection provides a resilient foundation for effectively-once processing at the application layer. This pattern is not a silver bullet: it must be paired with careful error handling, appropriate TTLs, and an understanding of its place within a larger distributed transaction strategy. However, for the vast majority of use cases, it provides the atomicity and correctness needed to prevent the costly errors that arise from duplicate message processing.
