Idempotency-Key State Machines in Redis for Fault-Tolerant APIs
The Inevitability of Retries and the Idempotency Imperative
In any non-trivial distributed system, network unreliability is a given. Client applications, reverse proxies, and load balancers will inevitably retry requests upon timeouts or transient network failures. While GET, PUT, and DELETE requests can be designed to be idempotent, POST requests, which create new resources, are inherently not. A client retrying a POST /v1/payments request could easily result in a customer being charged twice—a catastrophic failure for most businesses.
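Retry safety begins on the client side: every retry of the same logical operation must carry the same key. The sketch below shows the intended client contract (the endpoint URL and function name are illustrative, assuming a fetch-capable Node.js runtime):
import { randomUUID } from 'crypto';

// One key per logical operation, reused verbatim across every retry attempt.
async function createPaymentWithRetry(body: unknown, maxAttempts = 3): Promise<Response> {
  const idempotencyKey = randomUUID();
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fetch('https://api.example.com/v1/payments', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Idempotency-Key': idempotencyKey, // identical on every attempt
        },
        body: JSON.stringify(body),
      });
    } catch (err) {
      lastError = err; // timeout or network failure: safe to retry with the same key
    }
  }
  throw lastError;
}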
The common, yet dangerously naive, approach is to store the idempotency key and check for its existence before processing:
// DO NOT USE THIS - FLAWED EXAMPLE
async function handlePayment(req) {
const idempotencyKey = req.headers['idempotency-key'];
const keyExists = await redis.get(`idempotency:${idempotencyKey}`);
if (keyExists) {
return { status: 200, body: JSON.parse(keyExists) };
}
// RACE CONDITION HERE!
// Another request with the same key could pass the check above
// before this next line executes.
const result = await processPayment(req.body);
await redis.set(`idempotency:${idempotencyKey}`, JSON.stringify(result), 'EX', 86400);
return { status: 201, body: result };
}
This code contains a critical race condition. Two concurrent requests with the same idempotency-key can both execute the redis.get check, find nothing, and proceed to process the payment. This is precisely the failure mode we must prevent. To build a truly robust system, we need to think in terms of atomic state transitions. This brings us to the state machine model.
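As an aside, Redis's SET with the NX and EX options already provides an atomic reserve-if-absent primitive; the state machine developed below generalizes this idea to track the full request lifecycle. A minimal sketch, assuming ioredis:
// Atomic check-and-reserve: the SET succeeds only if the key does not exist.
// ioredis returns 'OK' on success and null if another request already holds the key.
const reserved = await redis.set(
  `idempotency:${idempotencyKey}`,
  'STARTED',
  'EX', 300, // safety TTL so a crashed process cannot hold the key forever
  'NX'
);
if (reserved === null) {
  // A concurrent or earlier request owns this key: do not process again.
}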
Designing the Idempotency State Machine
Instead of a simple binary exists/doesn't-exist check, we'll model the lifecycle of an idempotent request as a state machine. This gives us the visibility and control needed to handle complex scenarios like concurrent requests and process failures.
States:
* STARTED: We have received the request and reserved the idempotency key. The operation is currently in-flight. No other request with this key should be allowed to proceed.
* COMPLETED: The operation finished successfully. We have stored the result (e.g., the HTTP status code and response body) and will serve it for any subsequent requests with the same key.
* FAILED: The operation failed due to a recoverable or non-recoverable error. We may choose to allow retries or cache the error response.
Data Structure in Redis:
A Redis Hash is the ideal data structure for this pattern. It allows us to store multiple fields under a single key, which is perfect for our state and result.
Key: idem:v1:payments:<idempotency-key>
Fields:
* state: (string) STARTED, COMPLETED, or FAILED
* response_code: (string) e.g., 201
* response_body: (string) JSON-stringified response
* Note: the short in-progress TTL for the STARTED state is not stored as a field; it is set on the key itself via EXPIRE, preventing orphaned locks.
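For illustration, here is what a completed entry might look like once parsed into an object (values are hypothetical, mirroring the example route later in this article):
// Hypothetical contents of a COMPLETED idempotency hash:
const cachedEntry = {
  state: 'COMPLETED',
  response_code: '201',
  response_body: '{"transactionId":"txn_12345","status":"succeeded"}',
};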
Request Lifecycle:
- The client sends a request with an Idempotency-Key header. Our middleware intercepts the request.
- The middleware atomically attempts to create the Redis key with the state STARTED.
* If successful (key was new): Proceed to the business logic.
* If the key exists with state STARTED: Another request is in-flight. Return 409 Conflict immediately.
* If the key exists with state COMPLETED: The request was already processed. Return the cached response code and body immediately.
- After the business logic executes:
* On success: Atomically update the Redis hash with state: 'COMPLETED', the response_code, and response_body. Set a final, longer TTL (e.g., 24 hours) on the key.
* On failure: Update the state to FAILED. Decide on a retry policy.
This state machine design is inherently more robust and provides the necessary foundation for handling concurrency and failures correctly.
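For reference, the states and stored fields can be captured as TypeScript types (a small sketch; the names mirror the Redis hash fields described above):
type IdempotencyState = 'STARTED' | 'COMPLETED' | 'FAILED';

interface IdempotencyRecord {
  state: IdempotencyState;
  response_code?: string; // present once COMPLETED
  response_body?: string; // JSON-stringified response, present once COMPLETED
}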
Core Implementation: Atomic Operations with Redis Lua Scripts
To implement the atomic state transitions, especially the initial "check-and-set" operation, we cannot rely on multiple client-side commands. Even a WATCH/MULTI/EXEC transaction in Redis isn't ideal: it relies on optimistic locking, so if another client modifies the key between our WATCH and EXEC, the transaction aborts and we have to retry the entire logic loop. This adds complexity and can be inefficient under high contention.
The superior solution is to use Redis's server-side Lua scripting capabilities. A Lua script is executed atomically on the Redis server, guaranteeing that no other command can run concurrently against the database. This is the key to our lock-free, race-condition-free implementation.
The `check_and_start` Lua Script
This script is the entry point for our idempotency middleware. It implements the atomic check-and-reserve portion of the request lifecycle above: everything that must happen before the business logic runs.
-- File: check_and_start.lua
-- KEYS[1]: The idempotency key (declared as a KEY so Redis Cluster can route correctly)
-- ARGV[1]: The initial 'in-progress' TTL in seconds (e.g., 300 for 5 minutes)
local key = KEYS[1]
local in_progress_ttl = tonumber(ARGV[1])
-- Check if the key already exists
local existing_data = redis.call('HGETALL', key)
-- If key does not exist, it's a new request
if #existing_data == 0 then
redis.call('HSET', key, 'state', 'STARTED')
redis.call('EXPIRE', key, in_progress_ttl)
-- Return 'PROCEED' to signal the application to run the business logic
return 'PROCEED'
end
-- Key exists, inspect its state
local state
for i = 1, #existing_data, 2 do
if existing_data[i] == 'state' then
state = existing_data[i+1]
break
end
end
if state == 'STARTED' then
-- Another request is in-flight. Return 'CONFLICT'.
return 'CONFLICT'
elseif state == 'COMPLETED' then
-- Request was already completed. Return the stored data.
return existing_data
else
-- Could be FAILED or some other state. For now, treat as conflict.
return 'CONFLICT'
end
Key aspects of this script:
* Atomicity: The entire script runs as a single, indivisible operation.
* HGETALL: We fetch the entire hash to check the state.
* New Request Logic: If the hash is empty (#existing_data == 0), we create it, set the state to STARTED, and—critically—set a short-term TTL. This in_progress_ttl acts as a safety net. If our server crashes mid-operation, the key will eventually expire, preventing an indefinite lock.
* Existing Request Logic: We parse the result of HGETALL (which is a flat list of key-value pairs) to find the state. Based on the state, we return a specific string ('CONFLICT') or the entire dataset to the application layer.
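Before wiring the script into middleware, it can be exercised in isolation with a plain EVAL call. A throwaway sketch, assuming a local Redis and the file layout used in the next section:
// File: smoke_test.ts (illustrative)
import { Redis } from 'ioredis';
import * as fs from 'fs';

const redis = new Redis();
const script = fs.readFileSync('./lua/check_and_start.lua', 'utf8');

async function smokeTest() {
  // First call: the key is new, so the script reserves it and returns 'PROCEED'.
  console.log(await redis.eval(script, 1, 'idem:v1:smoke', '300'));
  // Second call: the key is STARTED, so the script returns 'CONFLICT'.
  console.log(await redis.eval(script, 1, 'idem:v1:smoke', '300'));
  await redis.quit();
}

smokeTest().catch(console.error);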
Node.js/TypeScript Middleware Implementation
Now let's integrate this script into our application middleware. We'll use ioredis, which has excellent support for loading and executing Lua scripts.
// File: idempotency.middleware.ts
import { Redis } from 'ioredis';
import { FastifyRequest, FastifyReply } from 'fastify';
import * as fs from 'fs';
import * as path from 'path';
// --- Redis Setup and Lua Script Loading ---
const redisClient = new Redis({
// your redis config
});
const LUA_SCRIPTS_DIR = path.join(__dirname, 'lua');
const checkAndStartScript = fs.readFileSync(path.join(LUA_SCRIPTS_DIR, 'check_and_start.lua'), 'utf8');
// Define a new command on the ioredis client for our script
// ioredis executes it via EVALSHA, transparently falling back to EVAL if the script is not yet cached
redisClient.defineCommand('checkAndStartIdempotency', {
numberOfKeys: 1, // The first argument is passed to the script as KEYS[1]
lua: checkAndStartScript,
});
// Extend the ioredis client's type definition for TypeScript
declare module 'ioredis' {
interface Redis {
checkAndStartIdempotency(key: string, inProgressTtl: number): Promise<'PROCEED' | 'CONFLICT' | string[]>;
}
}
// --- Middleware Logic ---
const IN_PROGRESS_TTL_SECONDS = 300; // 5 minutes
const FINAL_TTL_SECONDS = 86400; // 24 hours
// Utility to parse HGETALL result
function parseHgetall(data: string[]): Record<string, string> {
const result: Record<string, string> = {};
for (let i = 0; i < data.length; i += 2) {
result[data[i]] = data[i + 1];
}
return result;
}
// Note: an async Fastify hook must not also take a `done` callback; returning resolves the hook.
export async function idempotencyMiddleware(req: FastifyRequest, reply: FastifyReply) {
const idempotencyKey = req.headers['idempotency-key'] as string;
if (!idempotencyKey) {
// Or handle as a bad request, depending on your API contract
return;
}
const redisKey = `idem:v1:${idempotencyKey}`;
try {
const result = await redisClient.checkAndStartIdempotency(redisKey, IN_PROGRESS_TTL_SECONDS);
if (result === 'PROCEED') {
// Attach key to request object for post-processing
(req as any).idempotencyContext = { key: redisKey, completed: false };
return; // continue to the route handler
} else if (result === 'CONFLICT') {
reply.code(409).send({ error: 'Request with this Idempotency-Key is already in progress.' });
return;
} else if (Array.isArray(result)) {
// This was a completed request, serve the cached response
const cachedData = parseHgetall(result);
const statusCode = parseInt(cachedData.response_code || '500', 10);
const body = cachedData.response_body ? JSON.parse(cachedData.response_body) : {};
reply.code(statusCode).header('Content-Type', 'application/json').send(body);
return;
}
} catch (error) {
console.error('Idempotency middleware error:', error);
// Fail open or closed? Failing open is risky. Failing closed is safer.
reply.code(500).send({ error: 'Internal server error during idempotency check.' });
return;
}
}
This middleware performs the initial check. If the result is PROCEED, it attaches context to the request object and lets the request continue to the actual business logic. The finalization happens when we handle the response.
Handling Business Logic and Finalizing the State
After the business logic runs, we need to capture its result (success or failure) and update the state in Redis from STARTED to COMPLETED or FAILED. In a framework like Fastify, the onSend hook is a perfect place for this post-processing logic.
// File: server.ts (where you register hooks)
import Fastify from 'fastify';
import { idempotencyMiddleware } from './idempotency.middleware';
const app = Fastify();
// Register the preHandler middleware for all relevant routes
app.addHook('preHandler', idempotencyMiddleware);
// The crucial post-processing hook
app.addHook('onSend', async (request, reply, payload) => {
const context = (request as any).idempotencyContext;
// Only act if our middleware initiated the context and it's not yet completed
if (!context || context.completed) {
return;
}
try {
// Use a MULTI/EXEC transaction so the state, response, and TTL are committed atomically
// (a plain pipeline only batches commands into one round-trip; it is not atomic)
const pipeline = redisClient.multi();
// Store the final result
pipeline.hset(context.key, 'state', 'COMPLETED');
pipeline.hset(context.key, 'response_code', reply.statusCode.toString());
pipeline.hset(context.key, 'response_body', payload as string); // for JSON routes, Fastify has already serialized the payload to a string
// Set the final, longer TTL
pipeline.expire(context.key, FINAL_TTL_SECONDS);
await pipeline.exec();
// Mark as completed to prevent double execution
context.completed = true;
} catch (error) {
console.error(`Failed to save idempotency result for key ${context.key}:`, error);
// This is a critical failure. The key might be stuck in 'STARTED' until its TTL expires.
// Monitoring and alerting on this error is essential.
}
});
// Example route
app.post('/payments', async (request, reply) => {
// ... your actual payment processing logic here ...
// This will only run if the idempotency check passes.
const paymentResult = { transactionId: 'txn_12345', status: 'succeeded' };
reply.code(201).send(paymentResult);
});
// ... start server ...
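A quick behavioral check using Fastify's inject confirms that a replayed request is served from the cache (a sketch living alongside server.ts; the key and payload are illustrative):
// Two sequential requests with the same key: the first is processed,
// the second is answered from the Redis hash by the preHandler middleware.
async function verifyIdempotency() {
  const opts = {
    method: 'POST' as const,
    url: '/payments',
    headers: { 'idempotency-key': 'test-key-1' },
    payload: { amount: 100 },
  };
  const first = await app.inject(opts);  // 201 from the route handler
  const second = await app.inject(opts); // 201 replayed from Redis
  console.log(first.statusCode, second.statusCode, first.body === second.body);
}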
Handling Failures:
What if the business logic throws an error? Fastify still serializes the error response through onSend, but we don't want to cache an error payload as COMPLETED. An onError hook, which runs before the reply is sent, lets us mark the state appropriately (and set context.completed so the onSend hook skips the key).
// Add this hook in server.ts
app.addHook('onError', async (request, reply, error) => {
const context = (request as any).idempotencyContext;
if (!context || context.completed) {
return;
}
try {
// For a failure, we just update the state and let the short TTL expire.
// This allows a client to retry after the IN_PROGRESS_TTL.
await redisClient.hset(context.key, 'state', 'FAILED');
context.completed = true;
} catch (redisError) {
console.error(`Failed to set FAILED idempotency state for key ${context.key}:`, redisError);
}
});
This onError hook marks the operation as FAILED. The key will still expire based on the initial IN_PROGRESS_TTL_SECONDS, which is often the desired behavior for transient server errors, allowing the client to safely retry after a few minutes.
Advanced Scenarios and Edge Cases
A production-ready system must handle more than the happy path.
1. Server Crash Mid-Operation
* Problem: The server processes the checkAndStartIdempotency script (setting state to STARTED), begins business logic, and then crashes before the onSend or onError hook can run.
* State: The key idem:v1:<key> is left in the STARTED state in Redis.
* Solution: This is precisely why we set the in_progress_ttl in the initial Lua script. The key is not locked forever. It will automatically be deleted by Redis after IN_PROGRESS_TTL_SECONDS (e.g., 5 minutes). A subsequent request from the client after this period will be treated as a new request, which is the correct recovery behavior.
* Tuning: The value of IN_PROGRESS_TTL_SECONDS is critical. It should comfortably exceed your p99.9 request processing time to avoid premature lock expiry, yet stay short enough to allow reasonable recovery after a crash. For operations whose duration is hard to bound, a TTL heartbeat (sketched below) can keep the key alive only while work is genuinely in flight.
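The heartbeat idea, as a sketch (not part of the middleware above; assumes ioredis, and startTtlHeartbeat is a hypothetical helper name):
import { Redis } from 'ioredis';

// Refresh the STARTED key's TTL every ttlSeconds/3 while work is in flight.
// If the process crashes, the refreshes stop and the key expires as before.
function startTtlHeartbeat(redis: Redis, key: string, ttlSeconds: number): () => void {
  const timer = setInterval(() => {
    redis.expire(key, ttlSeconds).catch(() => { /* transient error; next tick retries */ });
  }, (ttlSeconds / 3) * 1000);
  return () => clearInterval(timer); // call this once the state is finalized
}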
2. Idempotency Key Generation and Scope
* Problem: How does the client generate a key? What if two different users generate the same key (e.g., a simple UUIDv4)?
* Solution: The idempotency key should be unique per operation, per user/tenant. The server should not blindly trust the client-provided key. A robust pattern is to scope the key in Redis with user-specific information.
* Bad: idem:v1:<client-key>
* Good: idem:v1:user_<user_id>:<client-key>
This prevents one user's key from colliding with another's. The user_id would be extracted from a JWT or session cookie. A small key-builder helper is sketched below.
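A helper makes the scoping explicit (a sketch; buildIdempotencyKey is a hypothetical name, and the {...} hash tag is the optional Redis Cluster detail discussed in the performance section):
// Scope the client-supplied key under the authenticated user.
function buildIdempotencyKey(userId: string, clientKey: string): string {
  return `idem:v1:user_{${userId}}:${clientKey}`;
}

// buildIdempotencyKey('1234', 'a1b2c3') => 'idem:v1:user_{1234}:a1b2c3'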
3. Result Caching and Non-Deterministic Operations
* Problem: Your business logic creates a resource with a timestamp, e.g., createdAt: new Date(). A client makes a request and gets a successful response. They retry 10 seconds later. The idempotency layer correctly serves the cached response, so the createdAt timestamp in the cached response will be from the original request.
* Implication: This is usually the desired behavior. The idempotency layer guarantees that the result of the first successful operation is consistently returned. Engineers must be aware that any non-deterministic values (timestamps, random IDs) generated during the initial execution will be frozen in the cached response.
4. Garbage Collection
* Problem: If you have millions of idempotent requests per day, Redis memory usage can grow. What if a FAILED state should be retried sooner than the IN_PROGRESS_TTL?
* Solution: The TTLs are the primary garbage collection mechanism. For more complex retry logic (e.g., allowing immediate retry on a FAILED state), you would modify the check_and_start.lua script:
-- ... inside the script
elseif state == 'FAILED' then
-- This is a design choice. Here, we allow a retry by re-starting the process.
redis.call('HSET', key, 'state', 'STARTED')
redis.call('EXPIRE', key, in_progress_ttl)
return 'PROCEED'
-- ...
This change transforms a FAILED key into a retryable one. This is highly application-specific. For a payment API, you might never want to automatically retry a failed state, whereas for a resource provisioning API, it might be acceptable.
Performance Considerations and Benchmarking
Introducing this layer is not free. It adds one Redis round-trip before every idempotent request (the check) and another after it (the finalization).
* Latency Overhead: A round-trip to a co-located Redis instance is typically well under a millisecond; a regional instance usually adds 1-5ms. For most API operations, this is an acceptable trade-off for the safety it provides.
* Redis CPU Usage: Lua scripts are blocking, but they are also extremely fast as they run server-side and avoid network overhead between commands. The CPU cost of our script is negligible. The primary load on Redis will be from network I/O and memory, not script execution.
* Throughput: A single Redis instance can handle tens of thousands of these operations per second. For most applications, the idempotency layer will not be the bottleneck. The bottleneck will almost always be the downstream business logic (database writes, calls to other services).
* Benchmarking: When load testing, measure the latency difference between endpoints with and without the idempotency middleware. For an API with a p99 of 200ms, adding 5ms for idempotency is a 2.5% increase, which is a small price for correctness.
* Redis Topology: For high availability, this pattern works seamlessly with a Redis Sentinel setup. For horizontal scaling, use Redis Cluster. If you want all of a user's keys to land on the same shard, include a hash tag in your key-scoping strategy (e.g., idem:v1:user_{1234}:...), though for this specific pattern it isn't strictly necessary, since each key is self-contained.
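A rough in-process measurement of the per-check overhead can be taken with a timed call to the script command defined earlier (a sketch; the key name is illustrative):
// Time a single idempotency check against Redis.
const start = process.hrtime.bigint();
await redisClient.checkAndStartIdempotency('idem:v1:bench-key', 300);
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
console.log(`idempotency check took ${elapsedMs.toFixed(2)}ms`);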
Conclusion: A Pattern for Production-Grade Reliability
The naive GET-then-SET approach to idempotency is a ticking time bomb in any concurrent system. By modeling the process as a state machine and leveraging the atomicity of Redis Lua scripts, we can build a truly fault-tolerant layer that correctly handles race conditions, server crashes, and client retries.
This pattern is more than just a theoretical exercise; it is a direct implementation of the techniques used by major payment processors and cloud providers to ensure that operations happen exactly once, even in the face of unreliable networks and system failures. While it introduces complexity, this complexity is essential for services where correctness is non-negotiable. The investment in building a robust idempotency layer pays for itself the first time it prevents a double-charge or a duplicate resource creation.