Atomic Idempotency Key Management in Distributed Systems using Redis & Lua
The Illusion of Idempotency with Simple Primitives
In any distributed system, the promise of exactly-once processing is a siren's call. Network partitions, client retries, and asynchronous message delivery conspire to turn single operations into duplicates. The standard defense is idempotency: designing operations so that receiving them multiple times has the same effect as receiving them once. A common first-pass implementation for this in a system using Redis is the SETNX pattern (SET if Not eXists, expressed in modern Redis as SET with the NX flag).
Consider a payment processing endpoint. A client sends a request with a unique Idempotency-Key header. The server logic looks like this:
// DO NOT USE IN PRODUCTION - FLAWED EXAMPLE
async function naiveProcessPayment(request) {
  // Node.js lowercases incoming header names.
  const idempotencyKey = request.headers['idempotency-key'];
  if (!idempotencyKey) {
    throw new Error('Idempotency-Key header is required.');
  }
  const key = `idem:${idempotencyKey}`;
  // SET ... NX returns 'OK' if the key was created, null if it already existed.
  const isNew = await redisClient.set(key, 'in_progress', 'EX', 60, 'NX');
  if (!isNew) {
    // This is where it gets tricky. Is it already done? Or just in progress?
    // This naive approach can't tell.
    return { status: 'conflict', message: 'Request already in progress or completed.' };
  }
  try {
    // --- CRITICAL SECTION ---
    const result = await paymentGateway.charge(request.body);
    // What if the server crashes right here? The key is 'in_progress' forever (or until TTL).
    await redisClient.set(key, JSON.stringify({ status: 'completed', data: result }), 'EX', 86400); // Keep result for 24h
    return { status: 'success', data: result };
  } catch (error) {
    // Clean up the key on failure so it can be retried.
    await redisClient.del(key);
    throw error;
  }
}
This seemingly plausible approach is riddled with critical flaws that manifest under real-world production load:
- Ambiguous conflicts. When the SETNX fails, we don't know why. Is another process actively working on it, or did a previous process complete it successfully? The client receives a generic conflict and doesn't know whether to retry or accept a previous success.
- A crash window that blocks retries. If the server crashes after the SETNX but before the final SET, the idempotency key is locked in the in_progress state until its 60-second TTL expires. For that minute, no other process can handle the request, effectively causing a temporary outage for that specific operation.
- Non-atomic follow-up logic. While SETNX is atomic, the subsequent logic to check for a completed result is not. A client might retry, find the key exists, and before it can fetch the result, the key expires and another process starts working on it.
To build a truly resilient system, we need an atomic, state-aware mechanism that can handle the full lifecycle of an idempotent request. This requires moving beyond simple Redis commands to the power of server-side Lua scripting.
A State-Aware Idempotency Lifecycle
We will model the idempotency key's lifecycle with three distinct states, stored as a JSON object within the Redis key:
- IN_PROGRESS: A process has acquired the lock and is executing the business logic. The key should have a short TTL (e.g., 30-60 seconds) to act as a safety net against process death.
- COMPLETED: The process finished successfully. The value will contain the final result of the operation. This key should have a long TTL (e.g., 24 hours) to serve cached responses to retrying clients.
- FAILED: The process encountered a terminal error. This state prevents clients from endlessly retrying a doomed operation. This key might have a medium TTL (e.g., 1 hour).
To manage transitions between these states without race conditions, we need to perform the check-and-set logic in a single, atomic operation. This is the perfect use case for Redis's EVAL command, which executes a Lua script on the server as a single, uninterruptible transaction.
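For concreteness, the stored values might look like this; the fields beyond status mirror the middleware we build later and are otherwise up to you:
{ "status": "IN_PROGRESS", "timestamp": 1700000000000 }
{ "status": "COMPLETED", "statusCode": 201, "headers": { ... }, "body": { ... } }
{ "status": "FAILED", "statusCode": 500, "body": { "error": "Payment processing failed" } }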
Atomicity with Lua: The `start_processing` Script
Our first script will handle the initial locking and state checking. It will be the gatekeeper for our critical section.
Lua Script: start_processing.lua
-- KEYS[1]: The idempotency key (e.g., 'idem:user123:charge:req456')
-- ARGV[1]: The TTL for the IN_PROGRESS lock (e.g., 60 seconds)
-- ARGV[2]: The initial value for the key (e.g., '{"status":"IN_PROGRESS"}')
local key = KEYS[1]
local lock_ttl = ARGV[1]
local in_progress_value = ARGV[2]
-- Check if the key exists
local existing_value = redis.call('GET', key)
if existing_value then
  -- Key exists, return its current value so the client can decide what to do
  -- e.g., if status is 'COMPLETED', use the cached response.
  return existing_value
else
  -- Key does not exist, this is the first time we've seen it.
  -- Atomically set it to IN_PROGRESS with a short TTL.
  redis.call('SET', key, in_progress_value, 'EX', lock_ttl)
  -- Return a special value to indicate we've acquired the lock
  return 'LOCK_ACQUIRED'
end
This script provides the atomicity we need. When executed:
- If the key exists, it immediately returns the stored value (which will be a JSON string containing the state and potentially the result). The application layer can then inspect this.
- If the key does not exist, it atomically sets it to the IN_PROGRESS state, applies the short TTL, and returns a unique LOCK_ACQUIRED string.
This completely eliminates the race condition of a separate GET and SET.
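A quick sketch of how a caller branches on that return value (script loading is done properly in the implementation section below; here the script body is passed directly to EVAL):
// Interpreting the start_processing result (sketch; assumes an ioredis client).
const fs = require('fs');
const startScript = fs.readFileSync('start_processing.lua', 'utf8');

const reply = await redisClient.eval(startScript, 1, key, 60, JSON.stringify({ status: 'IN_PROGRESS' }));
if (reply === 'LOCK_ACQUIRED') {
  // We own the key: run the business logic, then finalize with complete_processing.
} else {
  const state = JSON.parse(reply);
  // state.status is 'IN_PROGRESS', 'COMPLETED', or 'FAILED'; respond accordingly.
}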
Atomicity with Lua: The `complete_processing` Script
Once the business logic is complete, we need another atomic operation to transition the key from IN_PROGRESS to COMPLETED or FAILED and set the final, longer TTL.
Lua Script: complete_processing.lua
-- KEYS[1]: The idempotency key
-- ARGV[1]: The final value to set (e.g., '{"status":"COMPLETED", "data":{...}}')
-- ARGV[2]: The final TTL for the completed key (e.g., 86400 seconds)
local key = KEYS[1]
local final_value = ARGV[1]
local final_ttl = ARGV[2]
-- We simply overwrite the key with the final result and a new TTL.
-- This is safe because only the process that acquired the lock should be calling this.
-- An optional enhancement could be to pass a unique worker ID to ensure the completer is the original locker.
redis.call('SET', key, final_value, 'EX', final_ttl)
return 'OK'
This script is simpler. It atomically updates the key with the final result and applies the long-term TTL. This ensures that a subsequent retry hitting the start_processing.lua script will receive the cached COMPLETED state.
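The comment above suggests passing a unique worker ID; here is a hedged sketch of what that guard could look like, with the Lua embedded as a JavaScript string. The owner field and the return codes are illustrative, not part of the scripts above:
// Guarded complete script (sketch). Assumes the IN_PROGRESS value embeds an
// "owner" field (a hypothetical worker/request ID) set at lock-acquisition time.
const GUARDED_COMPLETE_LUA = `
local current = redis.call('GET', KEYS[1])
if not current then
  return 'EXPIRED' -- the lock TTL elapsed; another worker may now own the key
end
local state = cjson.decode(current)
if state.owner ~= ARGV[3] then
  return 'NOT_OWNER' -- someone else re-acquired the lock after our TTL expired
end
redis.call('SET', KEYS[1], ARGV[1], 'EX', ARGV[2])
return 'OK'
`;
// Usage: await redis.eval(GUARDED_COMPLETE_LUA, 1, key, finalValue, finalTtl, workerId);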
Production Implementation with Node.js
Let's integrate these scripts into a robust middleware for an Express.js application. This example uses the ioredis library.
First, we'll create a module to manage our Redis connection and load the scripts. Loading scripts on application startup using SCRIPT LOAD is a crucial optimization. It returns a SHA1 hash of the script, which we can then call using the more efficient EVALSHA command, avoiding the need to send the full script body over the network for every request.
redis-idempotency.js
const Redis = require('ioredis');
const fs = require('fs');
const path = require('path');
class IdempotencyManager {
  constructor(redisOptions) {
    this.redis = new Redis(redisOptions);
    this.scriptShas = {};
  }

  async loadScripts() {
    try {
      const startScript = fs.readFileSync(path.join(__dirname, 'start_processing.lua'), 'utf8');
      const completeScript = fs.readFileSync(path.join(__dirname, 'complete_processing.lua'), 'utf8');
      this.scriptShas.start = await this.redis.script('load', startScript);
      this.scriptShas.complete = await this.redis.script('load', completeScript);
      console.log('Idempotency Lua scripts loaded successfully.');
      console.log(`- Start SHA: ${this.scriptShas.start}`);
      console.log(`- Complete SHA: ${this.scriptShas.complete}`);
    } catch (error) {
      console.error('Failed to load Lua scripts:', error);
      process.exit(1);
    }
  }

  async start(key, inProgressValue, lockTtl) {
    return this.redis.evalsha(this.scriptShas.start, 1, key, lockTtl, inProgressValue);
  }

  async complete(key, finalValue, finalTtl) {
    return this.redis.evalsha(this.scriptShas.complete, 1, key, finalValue, finalTtl);
  }
}

// Singleton instance
const idempotencyManager = new IdempotencyManager({
  host: 'localhost',
  port: 6379
});

module.exports = idempotencyManager;
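As an aside, ioredis can handle script caching itself: defineCommand registers the Lua script as a custom command and transparently retries with EVAL if the server replies NOSCRIPT (for example, after a Redis restart). A minimal sketch of the equivalent setup:
// Equivalent setup using ioredis's built-in script handling (sketch).
const redis = new Redis();
redis.defineCommand('idemStart', {
  numberOfKeys: 1,
  lua: fs.readFileSync(path.join(__dirname, 'start_processing.lua'), 'utf8'),
});
// Now callable like a native command; ioredis manages EVALSHA/EVAL fallback:
// const result = await redis.idemStart(key, lockTtl, inProgressValue);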
Now, let's create the Express middleware that uses this manager.
idempotency-middleware.js
const idempotencyManager = require('./redis-idempotency');
const LOCK_TTL = 30; // 30 seconds for in-progress lock
const RESULT_TTL = 86400; // 24 hours for final result
const FAILED_TTL = 3600; // 1 hour for terminal failures
function idempotencyMiddleware() {
  return async (req, res, next) => {
    const idempotencyKeyHeader = req.headers['idempotency-key'];
    if (!idempotencyKeyHeader) {
      // For simplicity, we proceed. In a strict system, you might reject.
      return next();
    }
    const key = `idem:${req.user.id}:${req.method}:${req.path}:${idempotencyKeyHeader}`;
    try {
      const inProgressValue = JSON.stringify({ status: 'IN_PROGRESS', timestamp: Date.now() });
      const result = await idempotencyManager.start(key, inProgressValue, LOCK_TTL);

      if (result === 'LOCK_ACQUIRED') {
        // We have the lock. Attach completion logic to the response stream.
        res.locals.idempotencyKey = key;
        res.on('finish', async () => {
          // 'finish' fires when the response has been sent.
          if (res.statusCode >= 200 && res.statusCode < 300) {
            const finalValue = JSON.stringify({
              status: 'COMPLETED',
              statusCode: res.statusCode,
              headers: res.getHeaders(),
              body: res.locals.responseBody, // captured below, before it is sent
            });
            await idempotencyManager.complete(key, finalValue, RESULT_TTL);
          } else {
            // Terminal failure: record it so clients don't endlessly retry a doomed operation.
            const finalValue = JSON.stringify({
              status: 'FAILED',
              statusCode: res.statusCode,
              body: res.locals.responseBody,
            });
            await idempotencyManager.complete(key, finalValue, FAILED_TTL);
          }
        });

        // We need a way to capture the response body. This is a common pattern.
        const originalSend = res.send;
        res.send = function (body) {
          res.locals.responseBody = body;
          return originalSend.call(this, body);
        };
        return next();
      } else {
        // The key already existed. The result is the stored value.
        const storedResult = JSON.parse(result);
        if (storedResult.status === 'IN_PROGRESS') {
          // Another request is currently processing this. Return a conflict.
          return res.status(409).json({ message: 'A request with this idempotency key is already in progress.' });
        } else if (storedResult.status === 'COMPLETED') {
          // The request was already completed successfully. Return the cached response.
          console.log(`Returning cached response for key: ${key}`);
          res.set(storedResult.headers);
          return res.status(storedResult.statusCode).send(storedResult.body);
        } else {
          // FAILED (or unknown): replay the recorded failure instead of leaving the request hanging.
          return res.status(storedResult.statusCode || 422).send(storedResult.body || { message: 'A previous attempt with this idempotency key failed.' });
        }
      }
    } catch (error) {
      console.error(`Idempotency middleware error for key ${key}:`, error);
      // Let the standard error handler deal with it.
      return next(error);
    }
  };
}
Example Usage in an Express App
const express = require('express');
const idempotencyManager = require('./redis-idempotency');
const idempotencyMiddleware = require('./idempotency-middleware');

const app = express();
app.use(express.json());

// Dummy user middleware
app.use((req, res, next) => {
  req.user = { id: 'user-123' };
  next();
});

// Apply the idempotency middleware to critical routes
app.post('/api/payments', idempotencyMiddleware(), async (req, res) => {
  try {
    console.log('Processing payment...');
    // Simulate a 2-second payment gateway call
    await new Promise(resolve => setTimeout(resolve, 2000));
    const response = { transactionId: `txn_${Date.now()}`, amount: req.body.amount, status: 'succeeded' };
    res.status(201).json(response);
  } catch (error) {
    res.status(500).json({ error: 'Payment processing failed' });
  }
});

async function startServer() {
  await idempotencyManager.loadScripts();
  app.listen(3000, () => {
    console.log('Server running on port 3000');
  });
}

startServer();
To test this:
- Start the server.
- Send a request: curl -X POST http://localhost:3000/api/payments -H "Content-Type: application/json" -H "Idempotency-Key: unique-key-123" -d '{"amount": 100}'
- You will receive a 201 Created response.
- Send the exact same request again. You will receive the same 201 Created response from the Redis cache, and the log "Processing payment..." will not appear.
- Send the same request concurrently from a second terminal before the first completes. The second request will receive a 409 Conflict error while the first is processing.
Advanced Considerations and Edge Cases
This pattern is robust, but in a high-stakes production environment, we must consider the sharp edges.
1. Stale Lock Recovery
Problem: A worker acquires a lock, starts processing, and then crashes. The IN_PROGRESS key remains in Redis until its 30-second TTL expires. What happens next?
Solution: The next request with the same key will find the lock has expired and will acquire it. This is generally the desired behavior, but it has implications:
- The business logic will run again even though the crashed worker may have completed some side effects (e.g., the gateway charge may have gone through before the crash), so downstream operations still need their own safeguards.
- Frequently expiring IN_PROGRESS keys are a strong signal that your workers are crashing or are taking too long. Set up monitoring on your Redis keyspace (e.g., using keyspace notifications or scanning) to alert on this pattern; a minimal sketch follows this list.
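A minimal monitoring sketch, assuming Redis is configured with notify-keyspace-events "Ex" so key-expiry events are published; the metrics client here is hypothetical:
// Subscribe to key-expiry events and count expiring idempotency keys (sketch).
// Requires: CONFIG SET notify-keyspace-events Ex (or the equivalent in redis.conf).
const Redis = require('ioredis');
const subscriber = new Redis();

subscriber.psubscribe('__keyevent@0__:expired', (err) => {
  if (err) console.error('Failed to subscribe to expiry events:', err);
});

subscriber.on('pmessage', (_pattern, _channel, expiredKey) => {
  if (expiredKey.startsWith('idem:')) {
    // An idempotency key expired. Note: COMPLETED keys also expire eventually,
    // so in practice you might give locks a distinct prefix to disambiguate.
    metrics.increment('idempotency.key_expired'); // hypothetical metrics client
  }
});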
2. Large Response Payloads
Problem: Our COMPLETED state stores the entire response body in Redis. If your API returns megabytes of data, this can quickly bloat your Redis memory.
Solution: For large payloads, use a hybrid approach. Store a pointer to the data instead of the data itself.
- Upon successful completion, save the full response body to a more suitable storage layer like Amazon S3 or a blob store.
- In the idempotency key in Redis, store the pointer (e.g., the S3 object key) instead of the full body.
// Example value for a large payload
{
  "status": "COMPLETED",
  "statusCode": 200,
  "headers": { ... },
  "body_ref": "s3://my-app-responses/idem-key-123.json"
}
When a retrying client hits a cached result, the application layer is responsible for fetching the body from the reference location before serving the response. This adds latency but keeps Redis lean and fast.
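A hedged sketch of that hydration step, assuming the AWS SDK v3 and a body_ref stored as an s3:// URI; the bucket layout and function name are illustrative:
// Fetch a cached response body from S3 when the Redis value holds a pointer (sketch).
const { S3Client, GetObjectCommand } = require('@aws-sdk/client-s3');
const s3 = new S3Client({});

async function hydrateCachedBody(storedResult) {
  if (!storedResult.body_ref) return storedResult.body; // small payload: stored inline
  const [bucket, ...keyParts] = storedResult.body_ref.replace('s3://', '').split('/');
  const object = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: keyParts.join('/') }));
  return object.Body.transformToString(); // SDK v3 helper on the response stream
}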
3. Key Naming and Garbage Collection
Key Schema: A predictable and conflict-free key schema is vital. The example idem:${req.user.id}:${req.method}:${req.path}:${idempotencyKeyHeader} is a good start. It scopes the key to the user, the specific endpoint, and the client-provided key, preventing collisions.
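A trivial helper mirroring that schema (a sketch; adapt the fields to your routing):
// Build a per-user, per-endpoint idempotency key from the client-supplied header.
function buildIdempotencyKey(userId, method, path, clientKey) {
  return `idem:${userId}:${method}:${path}:${clientKey}`;
}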
TTL Management: TTLs are your garbage collectors. Be deliberate:
- IN_PROGRESS TTL: Should be slightly longer than the P99 latency of the operation. If your endpoint typically takes 5 seconds, a 30-second TTL is a safe buffer. Too short, and you risk premature lock expiration on a slow but valid request. Too long, and a crashed worker causes a longer outage for that operation.
- COMPLETED TTL: This depends on your business requirements. How long should a client be able to retry and get a cached result? 24 hours is a common standard, matching Stripe's API.
4. Performance: Lua vs. Alternatives
Why Lua is often superior:
- Atomicity: the script executes as one uninterruptible unit, so there is no race between the GET and SET operations within the script.
- Fewer network round trips: a single EVALSHA call replaces multiple commands (GET, SET), reducing latency, especially in geo-distributed environments.
Alternative: Optimistic Locking with WATCH
One could also implement this using Redis's WATCH, MULTI, and EXEC commands. The client would WATCH the key, GET its value, and then start a MULTI transaction to SET the new value. If another client modified the key after the WATCH, the EXEC would fail, and the client would have to retry the entire process.
While this works, it's more complex on the client side (requires a retry loop) and often results in more network traffic. For this specific state management pattern, a server-side Lua script is cleaner, more performant, and less error-prone.
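For comparison, a minimal sketch of that optimistic-locking loop with ioredis; the retry budget is arbitrary, and in a real application you would use a dedicated connection for WATCH to avoid interference from concurrent commands on a shared client:
// The WATCH/MULTI/EXEC alternative, shown for comparison only (sketch).
async function acquireWithWatch(redis, key, inProgressValue, lockTtl) {
  for (let attempt = 0; attempt < 3; attempt++) { // arbitrary retry budget
    await redis.watch(key);
    const existing = await redis.get(key);
    if (existing) {
      await redis.unwatch(); // nothing to write; release the watch
      return existing;       // caller inspects the stored state
    }
    // exec() resolves to null if the watched key changed since WATCH.
    const result = await redis.multi().set(key, inProgressValue, 'EX', lockTtl).exec();
    if (result !== null) return 'LOCK_ACQUIRED';
    // Lost the race; loop and re-inspect the key.
  }
  throw new Error('Could not acquire idempotency lock under contention');
}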
Conclusion
Building resilient distributed systems requires moving beyond simplistic patterns and embracing techniques that provide strong guarantees. By leveraging Redis's server-side Lua scripting, we can construct an atomic, state-aware idempotency layer that is both performant and fault-tolerant. This approach elevates idempotency from a simple de-duplication mechanism to a core part of your system's reliability, enabling safe client retries, preventing duplicate processing, and providing a transparent caching layer for repeated requests.
This pattern is not a silver bullet—it cannot magically make non-idempotent business logic safe—but it provides a robust and essential guardrail. It ensures that for any given idempotency key, the process of initiating your business logic happens exactly once, which is a foundational requirement for building predictable and trustworthy services at scale.