Idempotency Key Patterns for Resilient Event-Driven Architectures
The Inevitable Problem: At-Least-Once Delivery and Its Perils
In any mature, distributed system, asynchronous communication via message brokers (Kafka, RabbitMQ, SQS) or webhooks is the norm. The fundamental contract of these systems is typically at-least-once delivery. This guarantee is a cornerstone of reliability—it ensures that messages aren't lost in transit. However, it introduces a severe side effect: duplicate messages. A network partition, a consumer crash post-processing but pre-acknowledgment, or a simple client-side retry mechanism can all lead to the same event being delivered multiple times.
For many operations, this is benign. For critical business logic—processing a payment, creating an order, decrementing inventory—a duplicate operation can be catastrophic. It can lead to double-billing customers, shipping duplicate products, or corrupting financial data. The naive solution, hoping duplicates won't happen, is a ticking time bomb in any production system.
The industry-standard solution is to enforce idempotency at the consumer or API endpoint level. An operation is idempotent if making the same request multiple times produces the same result as making it once. The mechanism for this is the idempotency key.
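As a concrete sketch (the helper name and hashing scheme here are our own, not a standard API), a client can derive a stable key from the fields that define the logical operation, so that retries naturally reuse the same key:

```typescript
import { createHash } from 'crypto';

// Hypothetical helper: derive a stable idempotency key from the fields that
// define the logical operation. Retries of the same operation hash to the
// same key; a genuinely new operation produces a new one.
function deriveIdempotencyKey(operation: string, payload: Record<string, unknown>): string {
  // Canonicalize by sorting keys so field order doesn't change the hash
  const canonical = JSON.stringify(
    Object.keys(payload)
      .sort()
      .map((k) => [k, payload[k]])
  );
  return createHash('sha256').update(`${operation}:${canonical}`).digest('hex');
}

// Same logical operation, same key, regardless of field order
const a = deriveIdempotencyKey('create-order', { sku: 'X1', qty: 2 });
const b = deriveIdempotencyKey('create-order', { qty: 2, sku: 'X1' });
console.log(a === b); // true
```

A randomly generated UUID per logical operation works equally well; the essential property is that retries of the same operation carry the same key, while distinct operations never collide.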
This article is not an introduction to the concept. We assume you know why you need an idempotency key. We will focus on the how: architecting and implementing a robust, production-grade idempotency layer that handles the complex realities of distributed systems—race conditions, partial failures, and performance at scale.
We will build a state machine-based idempotency middleware, exploring two backend implementations (Redis and PostgreSQL), and dissect the edge cases that separate a textbook example from a battle-hardened solution.
Architecting the Idempotency Layer: A State Machine Approach
A common but flawed initial approach is to simply store the idempotency key in a cache or database upon successful completion. On subsequent requests, you check if the key exists. If it does, you skip processing. This pattern is dangerously incomplete.
Consider this scenario:
1. Request A with idempotency key K1 arrives and begins a 10-second operation.
2. Request B with the same key K1 arrives due to a client-side timeout.
3. The consumer checks for K1, finds nothing (since Request A hasn't finished), and begins a second, concurrent execution of the same operation.

This is a classic race condition that nullifies the entire purpose of the idempotency check. To solve this, we must treat the lifecycle of an idempotent request as a state machine, not a simple boolean flag.
Our state machine will have three primary states:
* STARTED: The key has been seen, and processing has begun. A lock is acquired.
* COMPLETED: Processing finished successfully. The final result (HTTP status code, response body) is stored.
* FAILED: Processing failed. The error details are stored.

Here's the refined workflow:
* On Request Arrival:
1. Extract the Idempotency-Key from the request (header or payload).
2. Query the state store for this key.
3. Case 1: Key Not Found. This is a new request. Atomically create a record for the key with the status STARTED and a lock timeout. If the atomic creation succeeds, proceed with the business logic. If it fails (because another process just created it), treat it as a race condition and proceed to Case 2.
4. Case 2: Key Found with STARTED status. Another process is currently handling this request. The system must immediately return a conflict response (e.g., 409 Conflict) to the caller, signaling them to wait and potentially retry later.
5. Case 3: Key Found with COMPLETED status. The operation has already been successfully executed. The system should not re-run the business logic. Instead, it should immediately return the stored response from the original request.
6. Case 4: Key Found with FAILED status. The original operation failed. Depending on the desired semantics, the system could either return the stored error or allow a new attempt to proceed. For simplicity and safety, we'll focus on returning the stored error.
* Post-Processing:
1. If the business logic succeeds, update the idempotency record's status to COMPLETED, store the response, and release the lock.
2. If the business logic fails, update the status to FAILED, store the error details, and release the lock.
This stateful approach elegantly solves the concurrency problem and provides a predictable, deterministic behavior for retries.
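The four arrival cases can be captured in a small, storage-agnostic decision function. This is a sketch with our own type and function names, not code from any particular library:

```typescript
type IdempotencyStatus = 'STARTED' | 'COMPLETED' | 'FAILED';

interface StoredRecord {
  status: IdempotencyStatus;
  responseCode?: number;
  responseBody?: string;
}

type Decision =
  | { action: 'EXECUTE' }                              // Case 1: new key, we hold the lock
  | { action: 'CONFLICT' }                             // Case 2: another process holds the lock
  | { action: 'REPLAY'; code: number; body: string };  // Cases 3 & 4: return the stored result

// Pure decision logic: given what the state store returned (null means the
// key was absent and we atomically claimed it), choose the next step.
function decide(record: StoredRecord | null): Decision {
  if (record === null) return { action: 'EXECUTE' };              // Case 1
  if (record.status === 'STARTED') return { action: 'CONFLICT' }; // Case 2
  // Cases 3 and 4: COMPLETED and FAILED both replay the stored response
  return { action: 'REPLAY', code: record.responseCode ?? 500, body: record.responseBody ?? '' };
}

console.log(decide(null).action);                  // "EXECUTE"
console.log(decide({ status: 'STARTED' }).action); // "CONFLICT"
```

Both backend implementations below are, in essence, this function plus an atomic claim step and a durable store.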
Implementation 1: High-Throughput with Redis
Redis is an excellent choice for an idempotency store due to its high performance and atomic operations. We'll use ioredis in a TypeScript/Node.js context.
The Idempotency Middleware
Let's structure this as an Express.js middleware. This pattern is portable to any framework or consumer logic.
// src/middleware/idempotency.ts
import { Request, Response, NextFunction } from 'express';
import { Redis } from 'ioredis';
const redisClient = new Redis({
// Your Redis connection options
});
const LOCK_TIMEOUT_SECONDS = 30; // Max expected processing time
const KEY_EXPIRATION_SECONDS = 24 * 60 * 60; // 24 hours
interface IdempotencyRecord {
status: 'STARTED' | 'COMPLETED' | 'FAILED';
responseCode?: number;
responseBody?: string;
lockExpiresAt?: number;
}
export const idempotencyMiddleware = async (req: Request, res: Response, next: NextFunction) => {
const idempotencyKey = req.headers['idempotency-key'] as string;
if (!idempotencyKey) {
return next(); // Or return 400 Bad Request if the key is mandatory
}
const key = `idempotency:${idempotencyKey}`;
try {
// 1. Atomically set the key if it doesn't exist (NX) with an expiration (EX)
// This single command is the core of our lock acquisition.
const lockAcquired = await redisClient.set(
key,
JSON.stringify({
status: 'STARTED',
lockExpiresAt: Date.now() + LOCK_TIMEOUT_SECONDS * 1000,
}),
'EX', KEY_EXPIRATION_SECONDS,
'NX' // Only set if the key does not exist
);
if (lockAcquired) {
// --- We have the lock, proceed with business logic ---
const originalJson = res.json;
const originalSend = res.send;
let responseBody: any;
// Intercept response to save it before sending
res.json = (body) => {
responseBody = body;
return originalJson.call(res, body);
};
res.send = (body) => {
responseBody = body;
return originalSend.call(res, body);
};
res.on('finish', async () => {
const status = (res.statusCode >= 200 && res.statusCode < 300) ? 'COMPLETED' : 'FAILED';
const finalRecord: IdempotencyRecord = {
status,
responseCode: res.statusCode,
// res.json routes through res.send, so responseBody may already be a
// serialized string; avoid double-stringifying it.
responseBody: typeof responseBody === 'string' ? responseBody : JSON.stringify(responseBody),
};
// Update the key with the final result, retaining the original TTL
await redisClient.set(key, JSON.stringify(finalRecord), 'KEEPTTL');
});
return next();
} else {
// --- Lock not acquired, another request is in progress or has completed ---
const existingRecordRaw = await redisClient.get(key);
if (!existingRecordRaw) {
// This is a rare edge case where the key expired between SETNX and GET.
// Treat as a transient error and ask the client to retry.
return res.status(500).json({ error: 'Idempotency key conflict, please retry.' });
}
const existingRecord: IdempotencyRecord = JSON.parse(existingRecordRaw);
if (existingRecord.status === 'STARTED') {
// Check for stale lock
if (existingRecord.lockExpiresAt && Date.now() > existingRecord.lockExpiresAt) {
// The lock has expired, indicating a crashed process.
// This is a complex recovery scenario. For now, we'll treat it as a conflict.
// A more advanced implementation might attempt to take over the lock.
return res.status(409).json({ error: 'Request in progress (stale lock detected).' });
}
return res.status(409).json({ error: 'A request with this idempotency key is already in progress.' });
}
if (existingRecord.status === 'COMPLETED') {
res.set('Content-Type', 'application/json');
return res
.status(existingRecord.responseCode!)
.send(existingRecord.responseBody!);
}
if (existingRecord.status === 'FAILED') {
res.set('Content-Type', 'application/json');
return res
.status(existingRecord.responseCode!)
.send(existingRecord.responseBody!);
}
}
} catch (error) {
console.error('Idempotency middleware error:', error);
return next(error);
}
};
Analysis of the Redis Implementation
* Atomicity: The SET key value EX seconds NX command is the workhorse. It's an atomic operation that creates the key and sets its value and expiration, but only if the key doesn't already exist. This single command prevents the race condition we identified earlier. There is no need for a separate EXISTS check followed by a SET.
* Performance: Redis is in-memory, making this check extremely fast—typically sub-millisecond. This ensures the idempotency layer adds minimal overhead to the request path.
* Lock Management: We embed a lockExpiresAt timestamp inside the STARTED record. This is crucial for detecting stale locks from crashed processes. Our current implementation simply reports a conflict, but a more aggressive strategy could use a Lua script to atomically check the timestamp and take over the lock.
* Response Caching: By intercepting the res.json and res.send methods, we capture the response of the original successful request. This allows us to serve the exact same response on subsequent retries, fulfilling the idempotency contract.
* State Updates: The res.on('finish', ...) event listener ensures we update the idempotency record after the response has been sent to the client. We use SET ... KEEPTTL (available since Redis 6.0) to update the value without altering the original expiration time.
Implementation 2: Transactional Integrity with PostgreSQL
While Redis is fast, its data is not typically persisted with the same durability guarantees as a primary database like PostgreSQL. For systems where the idempotency record is as critical as the business data itself (e.g., financial ledgers), using PostgreSQL provides transactional guarantees.
Database Schema
First, we need a table to store the idempotency state.
CREATE TABLE idempotency_keys (
key VARCHAR(255) PRIMARY KEY,
status VARCHAR(20) NOT NULL CHECK (status IN ('STARTED', 'COMPLETED', 'FAILED')),
lock_expires_at TIMESTAMPTZ,
response_code INTEGER,
response_body JSONB,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Index to support purging old keys by creation time
CREATE INDEX idx_idempotency_keys_created_at ON idempotency_keys(created_at);
The PRIMARY KEY constraint on key is our primary mechanism for enforcing atomicity. An INSERT will fail if the key already exists, which we can use to detect a race condition.
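An alternative to relying on a failed INSERT is to claim the key in a single statement. Under the schema above, a hedged sketch of that pattern:

```sql
-- Sketch: atomically claim the key in one statement. If RETURNING yields a
-- row, we inserted it and hold the lock; if it yields nothing, another
-- request already holds (or has completed) this key and we fall back to a
-- SELECT to inspect its status.
INSERT INTO idempotency_keys (key, status, lock_expires_at)
VALUES ($1, 'STARTED', NOW() + INTERVAL '5 minutes')
ON CONFLICT (key) DO NOTHING
RETURNING key;
```

This collapses the SELECT-then-INSERT window, though you still need the follow-up SELECT on conflict to distinguish STARTED from COMPLETED.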
The Middleware (PostgreSQL version)
This implementation requires careful transaction management.
// src/middleware/idempotency-pg.ts
import { Request, Response, NextFunction } from 'express';
import { Pool, PoolClient } from 'pg';
const pool = new Pool({
// Your PostgreSQL connection options
});
const LOCK_TIMEOUT_MINUTES = 5;
// A helper to manage transactions
async function withTransaction<T>(callback: (client: PoolClient) => Promise<T>): Promise<T> {
const client = await pool.connect();
try {
await client.query('BEGIN');
const result = await callback(client);
await client.query('COMMIT');
return result;
} catch (e) {
await client.query('ROLLBACK');
throw e;
} finally {
client.release();
}
}
export const idempotencyMiddlewarePg = async (req: Request, res: Response, next: NextFunction) => {
const idempotencyKey = req.headers['idempotency-key'] as string;
if (!idempotencyKey) return next();
let client: PoolClient | null = null;
try {
client = await pool.connect();
// Open the transaction explicitly: without BEGIN, the FOR UPDATE row lock
// (when the row exists) would be released at the end of the statement, and
// the handler's later COMMIT would have nothing to commit.
await client.query('BEGIN');
// 1. Check for an existing key
const { rows } = await client.query('SELECT * FROM idempotency_keys WHERE key = $1 FOR UPDATE', [idempotencyKey]);
const existingRecord = rows[0];
if (existingRecord) {
// --- Key exists: no work to do in this transaction, so clean up before returning ---
await client.query('ROLLBACK');
client.release();
client = null;
if (existingRecord.status === 'STARTED') {
if (new Date() > new Date(existingRecord.lock_expires_at)) {
// Stale lock - advanced recovery needed. For now, conflict.
return res.status(409).json({ error: 'Request in progress (stale lock).' });
}
return res.status(409).json({ error: 'Request in progress.' });
}
// COMPLETED or FAILED: replay the stored response
res.set('Content-Type', 'application/json');
return res.status(existingRecord.response_code).json(existingRecord.response_body);
} else {
// --- Key does not exist, attempt to create it ---
try {
await client.query(
`INSERT INTO idempotency_keys (key, status, lock_expires_at) VALUES ($1, 'STARTED', NOW() + make_interval(mins => $2))`,
[idempotencyKey, LOCK_TIMEOUT_MINUTES]
);
} catch (error: any) {
// Check for unique_violation, which indicates a race condition we lost.
if (error.code === '23505') { // unique_violation
// Another process inserted the key between our SELECT and INSERT.
// This is expected under contention. Clean up and tell the client to retry.
await client.query('ROLLBACK');
client.release();
client = null;
return res.status(409).json({ error: 'Concurrent request detected, please retry.' });
}
throw error; // Re-throw other errors
}
}
// --- We have the lock, proceed with business logic ---
// We attach the client to the request so the main handler can use the same transaction
(req as any).dbClient = client;
(req as any).idempotencyKey = idempotencyKey;
const originalJson = res.json;
let responseBody: any;
res.json = (body) => {
responseBody = body;
return originalJson.call(res, body);
};
res.on('finish', async () => {
// This part runs *outside* the main request-response cycle.
// It needs its own client to update the record.
const updateClient = await pool.connect();
try {
const status = (res.statusCode >= 200 && res.statusCode < 300) ? 'COMPLETED' : 'FAILED';
await updateClient.query(
'UPDATE idempotency_keys SET status = $1, response_code = $2, response_body = $3, updated_at = NOW() WHERE key = $4',
[status, res.statusCode, JSON.stringify(responseBody), idempotencyKey]
);
} finally {
updateClient.release();
}
});
// The transaction will be committed/rolled back by the main handler
// We don't release the client here yet.
return next();
} catch (error) {
console.error('Idempotency PG middleware error:', error);
if (client) client.release();
return next(error);
}
};
// Example of how the main handler would use the transaction
app.post('/api/orders', idempotencyMiddlewarePg, async (req: Request, res: Response) => {
const { dbClient, idempotencyKey } = req as any;
try {
// Business logic uses the provided client
await dbClient.query('INSERT INTO orders (...) VALUES (...)');
await dbClient.query('UPDATE inventory SET count = count - 1 WHERE ...');
const result = { orderId: '123' };
// Commit before responding, so we never acknowledge an uncommitted write
await dbClient.query('COMMIT');
res.status(201).json(result);
} catch (error) {
await dbClient.query('ROLLBACK');
res.status(500).json({ error: 'Failed to create order' });
} finally {
if (dbClient) dbClient.release();
}
});
Analysis of the PostgreSQL Implementation
* Transactional Guarantees: The key advantage here is the ability to tie the business logic and the idempotency state update into a single, atomic transaction. In the example above, the INSERT into orders and the eventual UPDATE idempotency_keys can be part of the same transaction. If the inventory update fails, the entire transaction is rolled back, and the idempotency key is never marked as COMPLETED. This is a significantly more robust failure mode than the Redis approach, which has a small window of inconsistency between business logic completion and the idempotency record update.
* Locking Strategy: We use SELECT ... FOR UPDATE. When a row for the key already exists, this places a row-level lock on it, so a second concurrent request for the same key blocks until the first transaction commits or rolls back. Note that in PostgreSQL, FOR UPDATE cannot lock a row that doesn't exist yet, so the SELECT alone does not serialize two requests racing to create a brand-new key; that gap is covered by the unique-constraint fallback.
* Race Condition Handling: Our SELECT ... FOR UPDATE followed by an INSERT is not fully atomic. Another transaction could commit between our SELECT (which finds nothing) and our INSERT. This is why we have a try/catch block around the INSERT specifically to handle the unique_violation error. This is our fallback mechanism to detect the race.
* Complexity: This approach is more complex. It requires careful management of the database client and transaction lifecycle, passing the client from the middleware to the main request handler. The res.on('finish', ...) callback runs after the transaction is committed, requiring a separate client to perform the final state update. An alternative, more robust pattern is to perform the final UPDATE to COMPLETED inside the main transaction, just before the COMMIT.
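Because FOR UPDATE cannot lock a row that does not yet exist, another way to serialize concurrent requests for the same key in PostgreSQL is a transaction-scoped advisory lock. A hedged sketch (mapping the key to a lock ID via hashtext is our own choice):

```sql
-- Sketch: serialize all work for one idempotency key within this
-- transaction. hashtext() maps the key to an integer lock ID; the advisory
-- lock is released automatically at COMMIT or ROLLBACK.
BEGIN;
SELECT pg_advisory_xact_lock(hashtext($1));
-- ... SELECT the idempotency row, INSERT it if absent, run business logic ...
COMMIT;
```

This removes the SELECT/INSERT race entirely, at the cost of holding a lock for the full duration of the business logic and accepting rare hash collisions between unrelated keys.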
Advanced Edge Cases and Production Hardening
An implementation is only as good as its handling of edge cases.
1. Stale Lock Recovery
What happens if a process acquires a lock (sets status to STARTED) and then crashes? The lock is never released, and all subsequent requests for that key will fail with a 409 Conflict until the lock expires.
Solution:
* Lock Timeout: As implemented, both Redis and PG solutions have a lock timeout (LOCK_TIMEOUT_SECONDS or lock_expires_at).
* Recovery Logic: When a new request encounters an expired STARTED record, it faces a choice:
1. Pessimistic: Reject the request. This is the safest option but can lead to manual intervention if a process is truly stuck.
2. Optimistic: Attempt to take over the lock. This is complex. The new process could atomically update the lock_expires_at timestamp to a new future value, but only if the old value matches the one it read. This is a form of check-and-set (CAS) operation. In PostgreSQL, this could be an UPDATE ... WHERE key = $1 AND lock_expires_at = $2. In Redis, it would require a LUA script for atomicity.
Optimistic recovery is risky. The original process might not have crashed; it might just be experiencing extreme slowdown (e.g., GC pause). If it resumes, you could have two processes executing the same logic.
Recommendation: For most systems, a reasonably long lock timeout (e.g., 5-15 minutes) combined with robust monitoring and alerting for stale locks is the most pragmatic approach.
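The optimistic take-over described above is a compare-and-set: update the lock only if the expiry you originally read is still the one stored. A minimal in-memory sketch of that logic (in Redis the comparison must run inside a single Lua script, in PostgreSQL inside one conditional UPDATE; the store here is a stand-in of our own):

```typescript
interface LockRecord {
  status: string;
  lockExpiresAt: number;
}

// Hypothetical in-memory store standing in for Redis/PostgreSQL.
const store = new Map<string, LockRecord>();

// Attempt to take over a stale lock. Succeeds only if the record still
// carries the exact expiry we observed, i.e. nobody else took it first.
function tryTakeOverLock(key: string, observedExpiresAt: number, newExpiresAt: number): boolean {
  const rec = store.get(key);
  if (!rec || rec.status !== 'STARTED' || rec.lockExpiresAt !== observedExpiresAt) {
    return false; // lost the race, or the record changed under us
  }
  store.set(key, { status: 'STARTED', lockExpiresAt: newExpiresAt });
  return true;
}

store.set('k1', { status: 'STARTED', lockExpiresAt: 1000 });
console.log(tryTakeOverLock('k1', 1000, 9999)); // true  - expiry matched, lock taken over
console.log(tryTakeOverLock('k1', 1000, 9999)); // false - expiry has already moved to 9999
```

The check-and-update must be one atomic unit in the real store; performing the read and the write as two separate round trips reintroduces the very race this pattern exists to prevent.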
2. Partial Failures in the Redis Model
In the Redis implementation, there's a non-transactional gap:
- Business logic completes successfully.
- The response is sent to the client.
- The process crashes before the res.on('finish') callback can SET the key status to COMPLETED.

In this state, the idempotency key is still STARTED. A subsequent retry will see the STARTED lock and fail with a 409, even though the operation was successful. The system has performed the action but failed to record its success.
Solutions:
* Client-Side Behavior: Well-behaved clients, upon receiving a 409, should wait and retry. Eventually, the lock will expire, and a new request can proceed. This might re-execute the business logic, which must be designed to be safely re-runnable (e.g., INSERT ... ON CONFLICT DO NOTHING).
* Out-of-Band Reconciliation: A background job could periodically scan for old STARTED keys and check the primary data store to determine if the operation actually completed, then update the idempotency record accordingly.
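The well-behaved-client side of this can be sketched as a generic backoff helper. The names, attempt shape, and backoff constants below are our own, not from any client library:

```typescript
// Hypothetical retry helper: retry an operation that signals "in progress"
// (e.g. an HTTP 409) with exponential backoff, up to a fixed attempt count.
async function retryOnConflict<T>(
  attempt: () => Promise<{ conflict: boolean; value?: T }>,
  maxAttempts = 5,
  baseDelayMs = 200
): Promise<T> {
  for (let i = 0; i < maxAttempts; i++) {
    const result = await attempt();
    if (!result.conflict) return result.value as T;
    // Exponential backoff: 200ms, 400ms, 800ms, ...
    await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
  }
  throw new Error('Operation still in progress after retries');
}

// Usage sketch: `attempt` would POST with the same Idempotency-Key on every
// try and report conflict=true whenever the server answers 409.
```

Crucially, the client must reuse the same Idempotency-Key across attempts; generating a fresh key per retry defeats the entire mechanism.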
This is the fundamental trade-off for Redis's performance: a small window of potential inconsistency that must be handled at the application or operational level. The PostgreSQL transactional model completely avoids this specific problem.
3. Garbage Collection of Keys
An idempotency store cannot grow indefinitely. We need a strategy to purge old keys.
* Redis: This is trivial. The EX option on the initial SET command establishes a TTL. The key will be automatically evicted by Redis after it expires. A 24-48 hour TTL is a common choice, covering most reasonable client retry windows.
* PostgreSQL: This requires a dedicated background job. A simple cron job that runs a DELETE FROM idempotency_keys WHERE created_at < NOW() - INTERVAL '48 hours' is effective. pg_cron or a similar scheduler can run this directly within the database.
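With pg_cron installed, the purge can be scheduled in-database. A sketch (the job name and 03:00 schedule are our own choices):

```sql
-- Sketch: run the purge nightly at 03:00 via pg_cron.
SELECT cron.schedule(
  'purge-idempotency-keys',
  '0 3 * * *',
  $$DELETE FROM idempotency_keys WHERE created_at < NOW() - INTERVAL '48 hours'$$
);
```

Keep the retention window longer than both your lock timeout and your clients' maximum retry horizon, or a late retry could re-execute an already-completed operation.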
4. Performance at Scale
* Write-Heavy Workload: The idempotency store is a write-heavy system. Every new request is a write. This can become a bottleneck.
* Benchmarking: Profile your chosen implementation. A Redis SET ... NX is likely sub-millisecond. A PostgreSQL SELECT ... FOR UPDATE plus INSERT will be in the single-digit milliseconds, heavily dependent on network latency and transaction contention. At 10,000 requests per second, this overhead becomes significant.
* Connection Pooling: Ensure you are using a robust connection pool for your database. Starving the application of connections for idempotency checks will degrade overall performance.
* Read Replicas: The check for COMPLETED keys is a read operation. For very high-traffic endpoints, you could potentially direct these read-only checks to a PostgreSQL read replica, but only if eventual consistency is acceptable. If a request hits a replica that hasn't yet seen the COMPLETED record from the primary, it might incorrectly start a new operation. This is a dangerous optimization and generally not recommended unless the stakes are low.
Conclusion: Choosing Your Pattern
Implementing a correct idempotency layer is a non-trivial engineering task that is absolutely critical for building reliable distributed systems. A simple key/value check is insufficient and dangerous.
Choose Redis when:
* Performance is paramount. The sub-millisecond overhead is ideal for latency-sensitive APIs.
* You can tolerate a small window of inconsistency on process failure and can design your business logic to be safely re-runnable to compensate.
* Your operational maturity includes monitoring for stale locks.
Choose PostgreSQL when:
* Transactional integrity is non-negotiable. The ability to atomically commit business data and idempotency state in a single transaction eliminates entire classes of failure modes.
* The operation is already database-heavy, and the additional transactional overhead is acceptable.
* Your system handles high-value operations (payments, orders) where correctness trumps raw performance.
The state machine pattern—STARTED, COMPLETED, FAILED—is the core concept that makes either implementation robust. It correctly handles concurrency, provides deterministic responses for retries, and forms the foundation for building truly resilient, fault-tolerant asynchronous services.