Idempotency Keys via PostgreSQL Advisory Locks for Resilient APIs
The Subtle but Dangerous Flaw in Naive Idempotency Checks
As senior engineers designing distributed systems, we understand that network fallibility is a given. Clients will retry requests, and our APIs must handle these retries gracefully, especially for mutating operations like creating a resource or processing a payment. The standard solution is the idempotency key, typically passed in a header like Idempotency-Key. The server is expected to process the first request for a given key and return a cached response for all subsequent retries.
A common first attempt at implementing this involves a dedicated table and a simple SELECT-then-INSERT logic:
-- idempotency_keys table schema
CREATE TABLE idempotency_keys (
  key VARCHAR(255) NOT NULL,
  tenant_id BIGINT NOT NULL,
  status TEXT NOT NULL DEFAULT 'processing',
  response JSONB,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  -- Scope keys per tenant; the composite primary key enforces uniqueness
  PRIMARY KEY (key, tenant_id)
);
The application logic in a request handler might look something like this (pseudocode):
// THIS IS THE FLAWED, NAIVE APPROACH
async function handleRequest(request) {
  const idempotencyKey = request.headers['idempotency-key'];
  const tenantId = request.auth.tenantId;

  // 1. Check if the key exists
  const existingRecord = await db.query(
    'SELECT * FROM idempotency_keys WHERE key = $1 AND tenant_id = $2',
    [idempotencyKey, tenantId]
  );
  if (existingRecord) {
    // Key found, return cached response
    return existingRecord.response;
  }

  // 2. Key not found, so insert it and process the request
  await db.query(
    'INSERT INTO idempotency_keys (key, tenant_id, status) VALUES ($1, $2, $3)',
    [idempotencyKey, tenantId, 'processing']
  );
  const result = await performBusinessLogic();

  // 3. Update the record with the final result
  await db.query(
    'UPDATE idempotency_keys SET response = $1, status = $2 WHERE key = $3 AND tenant_id = $4',
    [result, 'completed', idempotencyKey, tenantId]
  );
  return result;
}
This logic appears sound, but it contains a critical race condition. Imagine two identical requests arriving at nearly the same time, processed by two different application threads or nodes:
1. Request A runs the SELECT query. It finds no record.
2. Request B runs its SELECT query before Request A has inserted anything. It also finds no record.
3. Request A proceeds to INSERT the key and executes the business logic.
4. Request B also attempts to INSERT the key. This fails due to the UNIQUE constraint, throwing an error. However, the business logic on Request A's thread is already running or may have even completed. The client for Request B gets an error, while the business logic has been triggered twice.

This is a catastrophic failure for operations like charging a credit card or provisioning a critical resource. We need a mechanism to make the check-and-set operation atomic. This is where PostgreSQL Advisory Locks provide an elegant and highly performant solution.
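The interleaving can be reproduced in miniature without a database. The sketch below is purely illustrative (an in-memory Set stands in for the table): two concurrent handlers both pass the existence check before either one inserts, so the "business logic" runs twice.

```typescript
// Minimal in-memory simulation of the SELECT-then-INSERT race condition.
async function simulateRace(): Promise<number> {
  const keys = new Set<string>();
  let executions = 0;

  const handle = async (key: string) => {
    const exists = keys.has(key); // 1. the SELECT check
    await Promise.resolve();      // yield: the other request runs its check here
    if (!exists) {
      keys.add(key);              // 2. the INSERT (no atomic check-and-set)
      executions += 1;            // 3. the business logic
    }
  };

  await Promise.all([handle('key-1'), handle('key-1')]);
  return executions; // resolves to 2: both requests saw "no record"
}
```

The single await between the check and the insert is enough for Node's event loop to interleave the two handlers, which is exactly the window that two separate application servers hit in production.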
Why Advisory Locks are the Superior Tool for This Problem
PostgreSQL Advisory Locks are a cooperative locking mechanism. Unlike standard row or table locks, they don't lock any specific data. Instead, they lock an application-defined integer value from a shared, global pool. This makes them lightweight and perfectly suited for managing concurrency around a concept—like an idempotency key—rather than a specific table row.
Key characteristics that make them ideal:
* Application-Defined: We control the lock's scope and meaning entirely.
* Non-Blocking Acquisition: The pg_try_advisory_lock() function attempts to acquire a lock without blocking. If the lock is already held, it immediately returns false. This is perfect for an API, as we can instantly respond with a 409 Conflict or a custom status, rather than holding a request handler thread waiting for a lock to be released.
* Session-Scoped: A standard advisory lock is held for the duration of the database session (connection). If the application server crashes or the connection is terminated, PostgreSQL automatically cleans up the lock, preventing deadlocks.
By acquiring an advisory lock before we check our idempotency_keys table, we can create a critical section that guarantees only one process can operate on a given idempotency key at a time.
The Production-Grade Implementation Pattern
Let's build a robust implementation from the ground up, covering schema, middleware logic, and code.
1. Enhanced Schema Design
We need a more comprehensive schema to handle in-progress states and cache responses effectively.
CREATE TYPE idempotency_status AS ENUM ('processing', 'completed', 'failed');

CREATE TABLE idempotency_keys (
  id BIGSERIAL PRIMARY KEY,
  -- The client-provided idempotency key
  idempotency_key TEXT NOT NULL,
  -- Scope the key to a specific user or tenant to prevent collisions
  user_id BIGINT NOT NULL,
  -- The advisory lock key we will use. Storing it helps with debugging.
  lock_key BIGINT NOT NULL,
  -- Current status of the operation
  status idempotency_status NOT NULL DEFAULT 'processing',
  -- Store the full response to return on retries
  response_code INT,
  response_body JSONB,
  -- Timestamps for lifecycle management and cleanup
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  -- Ensure a user cannot use the same key twice
  CONSTRAINT idempotency_keys_user_id_idempotency_key_key UNIQUE (user_id, idempotency_key)
);

-- Critical index for fast lookups inside our locked critical section
CREATE INDEX idx_idempotency_keys_lookup ON idempotency_keys (user_id, idempotency_key);
Key Design Choices:
* idempotency_status ENUM: Provides clear states for the key's lifecycle.
* user_id / tenant_id: Absolutely crucial in multi-tenant systems. An idempotency key must be unique per user/tenant.
* lock_key: We'll store the bigint used for the advisory lock. This is derived from a hash of the (user_id, idempotency_key) tuple.
* response_code & response_body: Caching the exact HTTP response code and body is essential for true idempotency.
* UNIQUE Constraint: This still acts as a final safety net, but our advisory lock logic should prevent us from ever hitting it under normal concurrent operation.
2. The Hashing Strategy for Lock Keys
Advisory locks operate on bigint (64-bit signed integer) values. Our idempotency keys are strings (often UUIDs). We need a deterministic way to map our composite key (user_id, idempotency_key) to a single bigint.
A simple and effective approach is to use a non-cryptographic hash function like xxhash or murmur3: they are extremely fast, and their 64-bit output maps directly onto the lock key (after conversion into the signed range).
import { xxh64 } from '@node-rs/xxhash';

const keyBuffer = Buffer.from(`${userId}:${idempotencyKey}`);
// xxh64 returns an unsigned 64-bit BigInt
const hash = xxh64(keyBuffer);
// pg_try_advisory_lock takes a signed bigint, so map the hash into the signed range
const lockKey: bigint = BigInt.asIntN(64, hash);
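If pulling in a native hashing dependency is undesirable, the same mapping can be sketched with Node's built-in crypto module. This is an alternative, not a drop-in replacement: it produces different lock keys than the xxhash version, so pick one scheme and keep it.

```typescript
import { createHash } from 'crypto';

// Derive a signed 64-bit lock key from the first 8 bytes of a SHA-256 digest.
// Slower than xxhash, but dependency-free and equally deterministic.
function getLockKey(userId: string, idempotencyKey: string): bigint {
  const digest = createHash('sha256')
    .update(`${userId}:${idempotencyKey}`)
    .digest();
  // readBigInt64BE interprets the bytes as a signed 64-bit integer,
  // which is exactly the range pg_try_advisory_lock accepts.
  return digest.readBigInt64BE(0);
}
```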
Collision Risk: A 64-bit lock-key collision is possible but benign here. It only matters if two different idempotency keys that hash to the same value are in flight at the same instant, and even then the worst case is one spurious 409 response; the UNIQUE constraint on the real key remains the source of truth, so data is never corrupted. By the birthday bound, the probability that any two of n concurrent requests share a lock key is roughly n^2 / 2^65. Even with 10,000 simultaneous in-flight requests that is about 3e-12, far lower than other potential system failures.
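The birthday-bound arithmetic is easy to sanity-check, under the assumption that collisions only matter between requests that are in flight simultaneously:

```typescript
// Approximate birthday bound: the probability that any two of n concurrently
// held 64-bit lock keys collide is ~ n^2 / 2^65.
function collisionProbability(n: number): number {
  return (n * n) / 2 ** 65;
}

// Even 10,000 simultaneous in-flight requests give a vanishing chance:
// collisionProbability(10_000) ≈ 2.7e-12
```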
3. The Idempotency Middleware (TypeScript/Express/node-postgres)
Now, let's tie everything together in an Express middleware. This example uses the pg library.
Crucial Prerequisite: Stable Connection Management
An advisory lock is tied to the connection that acquired it. When using a connection pool, you must ensure that all operations within the critical section (acquiring the lock, checking the table, inserting the initial record) happen on the same database connection. The pg library's pool.connect() is perfect for this.
import { Pool, PoolClient } from 'pg';
import { Request, Response, NextFunction } from 'express';
import { xxh64 } from '@node-rs/xxhash';

const pool = new Pool({ /* ...db config... */ });

// Helper to generate the lock key. xxh64 returns an unsigned 64-bit BigInt;
// advisory locks take a signed bigint, so map it into the signed range.
function getLockKey(userId: string, idempotencyKey: string): bigint {
  const keyBuffer = Buffer.from(`${userId}:${idempotencyKey}`);
  return BigInt.asIntN(64, xxh64(keyBuffer));
}

export async function idempotencyMiddleware(req: Request, res: Response, next: NextFunction) {
  const idempotencyKey = req.headers['idempotency-key'] as string;
  const userId = req.user?.id; // Assuming user is attached by auth middleware (via type augmentation)

  if (!idempotencyKey || !userId) {
    // Or handle as an error, depending on your API contract
    return next();
  }

  let client: PoolClient | null = null;
  // Declared outside the try block so the cleanup code can see them
  let lockKey: bigint | null = null;
  let holdingLock = false;

  try {
    // 1. Get a dedicated client from the pool
    client = await pool.connect();
    lockKey = getLockKey(userId, idempotencyKey);

    // 2. Attempt to acquire the advisory lock in a NON-BLOCKING way
    const lockResult = await client.query('SELECT pg_try_advisory_lock($1)', [lockKey.toString()]);
    const lockAcquired = lockResult.rows[0].pg_try_advisory_lock;

    if (!lockAcquired) {
      // Could not get the lock, another request is processing this key.
      return res.status(409).json({ message: 'A request with this idempotency key is already in progress.' });
    }
    holdingLock = true;

    // --- CRITICAL SECTION START ---
    // 3. WITH THE LOCK HELD, check if the key has already been processed.
    // This is the crucial step that prevents the race condition.
    const existingRecordResult = await client.query(
      'SELECT * FROM idempotency_keys WHERE user_id = $1 AND idempotency_key = $2',
      [userId, idempotencyKey]
    );

    if (existingRecordResult.rows.length > 0) {
      const record = existingRecordResult.rows[0];
      // Found a completed or failed record, return the cached response.
      // (The finally block below releases the lock on these early returns.)
      if (record.status === 'completed' || record.status === 'failed') {
        res.set('Content-Type', 'application/json');
        return res.status(record.response_code).send(record.response_body);
      }
      // If it's still 'processing', this indicates a stale record from a crashed process.
      // We'll discuss handling this in the Edge Cases section.
      // For now, we can treat it like a conflict.
      return res.status(409).json({ message: 'Stale request detected for this idempotency key.' });
    }

    // 4. No record found. Insert a 'processing' record to claim this key.
    await client.query(
      `INSERT INTO idempotency_keys (idempotency_key, user_id, lock_key, status)
       VALUES ($1, $2, $3, 'processing')`,
      [idempotencyKey, userId, lockKey.toString()]
    );
    // --- CRITICAL SECTION END ---

    // We can now release the lock. The business logic can take time,
    // and we don't need to hold the lock during its execution.
    await client.query('SELECT pg_advisory_unlock($1)', [lockKey.toString()]);
    holdingLock = false;

    // Attach the key and a dedicated function to update it to the response object.
    // This allows the final route handler to update the result.
    res.locals.idempotency = {
      key: idempotencyKey,
      userId: userId,
      update: async (code: number, body: any) => {
        await pool.query(
          `UPDATE idempotency_keys
           SET status = 'completed', response_code = $1, response_body = $2, updated_at = NOW()
           WHERE user_id = $3 AND idempotency_key = $4`,
          [code, body, userId, idempotencyKey]
        );
      }
    };

    // Proceed to the actual route handler
    next();
  } catch (error) {
    next(error);
  } finally {
    if (client) {
      // Release the lock if we still hold it. This covers the early-return
      // paths and errors: returning a client to the pool does NOT release a
      // session-level advisory lock, so leaking it here would wedge the key.
      if (holdingLock && lockKey !== null) {
        try {
          await client.query('SELECT pg_advisory_unlock($1)', [lockKey.toString()]);
        } catch {
          // If the connection is broken, the lock dies with the session anyway
        }
      }
      // ALWAYS release the client back to the pool
      client.release();
    }
  }
}
And here's how you'd use it in your route handler:
app.post('/api/payments', idempotencyMiddleware, async (req: Request, res: Response) => {
  try {
    const paymentResult = await paymentService.process(req.body);
    // Use the attached update function to save the final result
    if (res.locals.idempotency) {
      await res.locals.idempotency.update(201, paymentResult);
    }
    res.status(201).json(paymentResult);
  } catch (error) {
    const message = error instanceof Error ? error.message : 'Internal error';
    // Handle business logic errors: cache the failure so retries see it too
    if (res.locals.idempotency) {
      await pool.query(
        `UPDATE idempotency_keys
         SET status = 'failed', response_code = 500, response_body = $1, updated_at = NOW()
         WHERE user_id = $2 AND idempotency_key = $3`,
        [{ message }, res.locals.idempotency.userId, res.locals.idempotency.key]
      );
    }
    res.status(500).json({ message });
  }
});
Advanced Considerations & Edge Case Handling
A production system requires more than the happy path.
1. Stale Record Cleanup
What happens if your server process crashes after inserting the 'processing' record but before updating it to 'completed'? The advisory lock is released automatically when the database connection dies, but the row in idempotency_keys remains in a 'processing' state indefinitely.
Another request with the same key will acquire the lock, see the 'processing' record, and (based on our current logic) return a 409 error, effectively blocking that key forever.
Solution: Implement a background worker (e.g., a cron job) that periodically cleans up stale records.
-- A query to run periodically
DELETE FROM idempotency_keys
WHERE status = 'processing'
  AND created_at < NOW() - INTERVAL '15 minutes';
The '15 minutes' interval should be longer than your longest-running possible transaction to avoid cleaning up a genuinely active request. This acts as a garbage collector for failed processes.
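The cutoff logic can also be mirrored as a pure function in the application layer. This is a hypothetical helper, not part of the middleware above; it is mainly useful for unit-testing the TTL choice without a database.

```typescript
interface IdempotencyRecord {
  status: 'processing' | 'completed' | 'failed';
  createdAt: Date;
}

// Mirrors the DELETE query's WHERE clause. STALE_AFTER_MS matches the
// 15-minute interval and must exceed the longest legitimate request.
const STALE_AFTER_MS = 15 * 60 * 1000;

function isStale(record: IdempotencyRecord, now: Date): boolean {
  return (
    record.status === 'processing' &&
    now.getTime() - record.createdAt.getTime() > STALE_AFTER_MS
  );
}
```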
2. Transactional vs. Session-Level Locks
We used pg_try_advisory_lock(), which is a session-level lock. There is also pg_try_advisory_xact_lock(), which is a transaction-level lock. A transaction-level lock is automatically released at the end of the transaction (COMMIT or ROLLBACK).
For our pattern, a session-level lock is often preferable. Why? We want to minimize the lock duration. Our critical section is very short: lock, check DB, insert record, unlock. The actual business logic (paymentService.process) might be long-running and could even involve multiple, separate transactions. By releasing the lock before calling the business logic, we allow other requests (with different keys) to proceed without being blocked by a long-held lock on a single connection.
If we used a transaction-level lock, we would have to keep the transaction open for the entire duration of the business logic, which is an anti-pattern that can lead to connection pool exhaustion.
3. Performance and Scalability
* Lock Contention: This pattern intentionally serializes requests for the same key. This is the desired behavior. For different keys, requests will be hashed to different lock IDs and can proceed in parallel without contention. The performance of pg_try_advisory_lock itself is extremely high.
* Indexing: A fast lookup path on (user_id, idempotency_key) is non-negotiable; note that the UNIQUE constraint already creates a backing index, so the explicit idx_idempotency_keys_lookup is belt-and-suspenders. The SELECT query inside the critical section must be lightning fast to minimize lock hold time.
* Connection Pooling: As emphasized before, the pool.connect() pattern is mandatory. If you were to grab different clients from the pool for each query (pool.query(...)), your lock would be acquired on one connection and you'd attempt to check the table and release the lock on different connections, completely defeating the purpose of the lock.
4. Client-Side Considerations
Your API contract must be clear. Clients should generate a high-entropy key (like a UUIDv4) and use the same key for retries of the same conceptual operation. They must generate a new key for a new operation. The server-side logic we've built is only as good as the client's adherence to this contract.
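On the client side, the contract boils down to one rule: one key per logical operation, reused across every retry of it. A minimal sketch (a hypothetical helper, not part of the server code above):

```typescript
import { randomUUID } from 'crypto';

// One IdempotentOperation per conceptual action (e.g. "this checkout").
// Every retry of that action reuses the same high-entropy key; a new
// operation gets a new instance and therefore a new key.
class IdempotentOperation {
  readonly key: string = randomUUID();

  headers(): Record<string, string> {
    return { 'Idempotency-Key': this.key };
  }
}
```

A retry loop would call op.headers() on every attempt; only constructing a new IdempotentOperation generates a new key.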
Alternatives and When Not to Use This Pattern
* Redis-based Locking (e.g., Redlock): Using a SET NX command in Redis is a common alternative.
* Pros: Can be faster as Redis is in-memory. Decouples locking from your primary database.
* Cons: Introduces another system to manage, monitor, and scale. Distributed locking in Redis has its own complexities and failure modes (see Martin Kleppmann's critique of Redlock). Our PostgreSQL approach keeps the locking mechanism and the data in the same transactional system, simplifying consistency.
* INSERT ... ON CONFLICT DO NOTHING: This is a simpler atomic operation.
INSERT INTO idempotency_keys (idempotency_key, user_id, lock_key) VALUES ($1, $2, $3) ON CONFLICT (user_id, idempotency_key) DO NOTHING;
* Limitation: This only tells you if the insert was successful (it was the first) or not. It doesn't allow you to retrieve the cached response from the first request. It's suitable for fire-and-forget operations where you only need to guarantee at-most-once execution, but not for idempotent GET-after-POST style operations.
When is this overkill? This pattern is designed for critical, state-changing mutations. For non-critical operations or read-only endpoints, it's unnecessary complexity. Always apply it judiciously where the cost of duplicate execution is high.
By leveraging PostgreSQL's often-overlooked advisory locks, we can build truly robust, race-condition-free idempotency layers that are performant, scalable, and reside within the same trusted data store as our business logic, providing a powerful pattern for any senior engineer's toolkit.