Robust Idempotency Key Patterns for Asynchronous API Operations
The Inevitability of Duplicates: Why Idempotency is Non-Negotiable
In any non-trivial distributed system, the contract of "exactly-once" delivery is a myth. Between client-side retry logic, network proxies, load balancers, and message queue redelivery mechanisms (like in RabbitMQ or Kafka with at-least-once semantics), your API endpoints will inevitably receive the same logical request multiple times. For senior engineers, the question isn't if this will happen, but how the system is designed to handle it gracefully.
Consider a payment processing API. A client initiates a POST /v1/charges request. The request times out due to a transient network partition. The client, following best practices, retries with the same payload. If the first request actually succeeded and the system isn't idempotent, you've just double-charged the customer. This single failure mode can have catastrophic business consequences.
This article assumes you understand this fundamental problem. We will not cover the basics. Instead, we'll focus on the hard parts of building a bulletproof idempotency layer: managing state, preventing race conditions under high concurrency, and handling system failures mid-operation.
Our goal is to implement a system that can receive the same POST or PUT request multiple times with an Idempotency-Key and guarantee the underlying business logic is executed only once.
The Core Pattern: A Foundational Control Flow
The standard approach involves a client-generated unique identifier sent in an HTTP header, typically Idempotency-Key.
* The client generates a unique key (e.g., a UUID v4) for each logical operation and sends it in the Idempotency-Key header. For any retry of this exact same operation, it MUST use the same key.
* The server checks: has a request with this key been seen before?
* If yes, and the original operation completed, return the stored response from the original operation.
* If no, proceed with the business logic. Upon completion, store the result (HTTP status code, headers, body) against the idempotency key and then return the result to the client.
This sounds simple, but the complexity lies in the implementation details, particularly in the phrase "proceed with the business logic." What happens if two identical requests arrive at the same microsecond? This is where our deep dive begins.
Section 1: The Storage Layer - Your System's Memory
The choice of storage for your idempotency records dictates the performance, consistency, and durability guarantees of your system. The two primary contenders are a relational database (like PostgreSQL) or an in-memory datastore (like Redis).
Option A: PostgreSQL for Durability and Consistency
Using your primary relational database is often the best choice for operations that are already transactional and require high durability. The idempotency check can be wrapped in the same transaction as the business logic, providing strong ACID guarantees.
Schema Design
A minimal but effective schema for storing idempotency records in PostgreSQL would look like this:
CREATE TYPE idempotency_status AS ENUM ('started', 'completed', 'failed');
CREATE TABLE idempotency_keys (
-- The idempotency key provided by the client, scoped to a user/tenant.
key_hash CHAR(64) NOT NULL, -- SHA-256 hash of the user-provided key
user_id UUID NOT NULL,
-- State machine fields
status idempotency_status NOT NULL DEFAULT 'started',
-- The response to be returned on subsequent requests.
response_status_code SMALLINT,
response_body JSONB,
-- Timestamps for lifecycle management and debugging.
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
last_updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
-- Unlock marker for recovering from failures.
unlock_at TIMESTAMPTZ,
PRIMARY KEY (user_id, key_hash)
);
-- Index to support garbage collection of old keys by age.
CREATE INDEX idx_idempotency_keys_created_at ON idempotency_keys (created_at);
Key Design Decisions:
* key_hash and user_id Composite Primary Key: We don't store the raw idempotency key. Instead, we hash it (e.g., SHA-256) to ensure a fixed-length, indexed field. Crucially, the key is scoped by user_id (or a tenant ID). This prevents a key like "first-payment" from colliding between different users.
* status Enum: This is the core of our state machine, which we'll explore in the race condition section. It tracks whether an operation is in-flight, completed, or failed.
* response_body as JSONB: PostgreSQL's JSONB type is highly efficient for storing and retrieving the cached API response.
* unlock_at: A critical field for handling process failures. If a process starts an operation but dies, this timestamp determines when another process can retry the operation.
Option B: Redis for Low-Latency Operations
For performance-critical endpoints where sub-millisecond lookup times are essential, Redis is a compelling alternative. Its atomic commands like SETNX provide a powerful primitive for building an idempotency layer.
Data Model
In Redis, we'd typically use a simple key-value structure. The key would be a composite of the user ID and the idempotency key.
* Key: idempotency:{user_id}:{idempotency_key}
* Value: A JSON string containing the status, response code, and response body.
{
"status": "completed",
"response_code": 201,
"response_body": "{\"charge_id\": \"ch_123abc\"}"
}
We would use the SET command with the NX (only set if it does not exist) and EX (set an expiration) options.
# Atomically set the key only if it doesn't exist, with a 24-hour TTL
SET idempotency:user123:uuid-abc-123 '{"status":"started"}' EX 86400 NX
Comparison and Trade-offs
| Feature | PostgreSQL | Redis |
|---|---|---|
| Latency | Higher (milliseconds, disk I/O) | Lower (sub-millisecond, in-memory) |
| Durability | High (persisted to disk, WAL) | Lower (configurable persistence, potential for data loss) |
| Consistency | Strong (ACID transactions with business logic) | Weaker (operations are atomic, but not in a transaction with your DB) |
| Complexity | Lower if already using Postgres. Higher if adding it. | Higher if it introduces a new piece of infrastructure. |
| Garbage Collection | Requires a background job or cron to clean up old keys. | Built-in (TTL expiration). |
Recommendation: For financial transactions or critical state changes, co-locating the idempotency records in your primary PostgreSQL database within the same transaction provides the strongest safety guarantees. For less critical, high-throughput operations (e.g., event ingestion), Redis is a superior choice for performance.
Section 2: The Crux of the Problem - Defeating Race Conditions
A naive check-then-act implementation will fail under concurrency. Suppose Process A and Process B receive the same request with the same key at nearly the same time: both query the store, both find no existing record, and both proceed to execute the business logic. The operation runs twice despite the key. We need an atomic way to "claim" an idempotency key. This is where we implement a robust, two-phase state machine.
The Two-Phase State Machine Pattern
Instead of a simple boolean processed flag, we use a multi-state record (started, completed, failed). This allows us to differentiate between an operation that has finished and one that is currently in-flight.
Here's the refined control flow:
* Upon receiving a request, immediately try to INSERT a new record into the idempotency_keys table with the status started.
* We use a database feature like INSERT ... ON CONFLICT DO NOTHING (PostgreSQL) or SETNX (Redis). This is an atomic operation.
* If the insert succeeds: This process has successfully claimed the key. It can now proceed to execute the business logic.
* If the insert fails (due to a key collision): This means another process has already claimed this key. We now read the existing record.
* Read the status of the record claimed by the other process.
* If status is completed: The original operation finished successfully. Retrieve the cached response and return it immediately.
* If status is failed: The original operation failed. You can choose to either return a generic error or re-attempt the operation.
* If status is started: This is the critical case. An operation is currently in-flight. The correct response is to return an HTTP 409 Conflict or 429 Too Many Requests, telling the client to wait and retry. Polling the record in a tight loop is an anti-pattern that can lead to thundering herd problems.
* After the business logic is executed, the process that claimed the key must update the record.
* On success, UPDATE the status to completed and store the response details.
* On failure, UPDATE the status to failed.
Go Implementation with PostgreSQL
Let's implement this logic as a middleware in Go using the database/sql package.
package idempotency
import (
	"context"
	"crypto/sha256"
	"database/sql"
	"encoding/hex"
	"fmt"
	"net/http"
	"time"
)
// Represents the record in our idempotency_keys table
type IdempotencyRecord struct {
KeyHash string
UserID string
Status string
ResponseStatusCode sql.NullInt16
ResponseBody []byte
UnlockAt sql.NullTime
}
// The core idempotency store interface
type Store interface {
// Get finds a record by its key.
Get(ctx context.Context, userID, keyHash string) (*IdempotencyRecord, error)
// Create attempts to atomically insert a new record.
Create(ctx context.Context, record *IdempotencyRecord) error
// Update transitions the state of an existing record.
Update(ctx context.Context, record *IdempotencyRecord) error
}
// PostgresStore implements the Store interface for PostgreSQL
type PostgresStore struct {
DB *sql.DB
}
// NewPostgresStore creates a new store.
func NewPostgresStore(db *sql.DB) *PostgresStore {
return &PostgresStore{DB: db}
}
// Create uses INSERT ... ON CONFLICT to atomically create the record.
// It returns a special error if the key already exists.
var ErrKeyAlreadyExists = fmt.Errorf("idempotency key already exists")
func (s *PostgresStore) Create(ctx context.Context, record *IdempotencyRecord) error {
// Set a lock timeout. If an operation is stuck, we want to be able to recover.
record.UnlockAt = sql.NullTime{Time: time.Now().Add(5 * time.Minute), Valid: true}
query := `
INSERT INTO idempotency_keys (user_id, key_hash, status, unlock_at)
VALUES ($1, $2, 'started', $3)
ON CONFLICT (user_id, key_hash) DO NOTHING`
res, err := s.DB.ExecContext(ctx, query, record.UserID, record.KeyHash, record.UnlockAt)
if err != nil {
return fmt.Errorf("failed to create idempotency record: %w", err)
}
rowsAffected, err := res.RowsAffected()
if err != nil {
return fmt.Errorf("failed to check rows affected: %w", err)
}
if rowsAffected == 0 {
return ErrKeyAlreadyExists
}
return nil
}
// Get retrieves the current state of a key.
func (s *PostgresStore) Get(ctx context.Context, userID, keyHash string) (*IdempotencyRecord, error) {
	query := `
		SELECT user_id, key_hash, status, response_status_code, response_body, unlock_at
		FROM idempotency_keys
		WHERE user_id = $1 AND key_hash = $2`
	rec := &IdempotencyRecord{}
	err := s.DB.QueryRowContext(ctx, query, userID, keyHash).Scan(
		&rec.UserID, &rec.KeyHash, &rec.Status, &rec.ResponseStatusCode, &rec.ResponseBody, &rec.UnlockAt)
	if err != nil {
		return nil, fmt.Errorf("failed to get idempotency record: %w", err)
	}
	return rec, nil
}
// Update saves the final result of the operation.
func (s *PostgresStore) Update(ctx context.Context, record *IdempotencyRecord) error {
query := `
UPDATE idempotency_keys
SET status = $3, response_status_code = $4, response_body = $5, last_updated_at = NOW(), unlock_at = NULL
WHERE user_id = $1 AND key_hash = $2`
_, err := s.DB.ExecContext(ctx, query, record.UserID, record.KeyHash, record.Status, record.ResponseStatusCode, record.ResponseBody)
if err != nil {
return fmt.Errorf("failed to update idempotency record: %w", err)
}
return nil
}
// --- Middleware Logic ---
// responseWriter captures the response to be cached.
type responseWriter struct {
http.ResponseWriter
statusCode int
body []byte
}
func (rw *responseWriter) WriteHeader(statusCode int) {
rw.statusCode = statusCode
rw.ResponseWriter.WriteHeader(statusCode)
}
func (rw *responseWriter) Write(b []byte) (int, error) {
rw.body = append(rw.body, b...)
return rw.ResponseWriter.Write(b)
}
// Middleware provides the idempotency layer.
func Middleware(store Store) func(http.Handler) http.Handler {
return func(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
idempotencyKey := r.Header.Get("Idempotency-Key")
if idempotencyKey == "" {
next.ServeHTTP(w, r)
return
}
// In a real app, you'd get this from a JWT or session.
userID := "user-123"
// Remember to hash the key!
keyHash := calculateHash(idempotencyKey)
// 1. Attempt to create the initial record.
createRecord := &IdempotencyRecord{UserID: userID, KeyHash: keyHash}
err := store.Create(r.Context(), createRecord)
if err == ErrKeyAlreadyExists {
// The key exists, so we need to check its status.
existingRecord, getErr := store.Get(r.Context(), userID, keyHash)
if getErr != nil {
http.Error(w, "Internal Server Error", http.StatusInternalServerError)
return
}
switch existingRecord.Status {
case "completed":
// Operation is done, return the cached response.
w.Header().Set("Content-Type", "application/json")
w.Header().Set("Idempotency-Replayed", "true")
w.WriteHeader(int(existingRecord.ResponseStatusCode.Int16))
w.Write(existingRecord.ResponseBody)
return
case "started":
// Operation is in-flight. Return a conflict.
http.Error(w, "Request for this idempotency key is already in progress", http.StatusConflict)
return
case "failed":
// You might decide to retry here, but for now, we'll treat it as a conflict.
http.Error(w, "A previous request with this key failed", http.StatusConflict)
return
}
} else if err != nil {
http.Error(w, "Internal Server Error", http.StatusInternalServerError)
return
}
// 2. We've claimed the key. Proceed with the actual handler.
// We wrap the response writer to capture the result.
// Default to 200 OK: handlers may write a body without ever calling WriteHeader.
crw := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
next.ServeHTTP(crw, r)
// 3. Update the record with the captured response.
finalRecord := &IdempotencyRecord{
UserID: userID,
KeyHash: keyHash,
}
if crw.statusCode >= 200 && crw.statusCode < 300 {
finalRecord.Status = "completed"
} else {
finalRecord.Status = "failed"
}
finalRecord.ResponseStatusCode = sql.NullInt16{Int16: int16(crw.statusCode), Valid: true}
finalRecord.ResponseBody = crw.body
if updateErr := store.Update(r.Context(), finalRecord); updateErr != nil {
// Log this critical failure! The idempotency record is out of sync.
}
})
}
}
func calculateHash(s string) string {
	// Requires "crypto/sha256" and "encoding/hex" in the import block.
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}
This implementation correctly handles the race condition by leveraging the atomic nature of INSERT ... ON CONFLICT. It provides a robust foundation for an idempotent API.
Section 3: Advanced Considerations and Production Hardening
Building a system that works 99.9% of the time is one thing. Building one that is resilient to failure is another.
Edge Case 1: The Process Dies Mid-Flight
What happens if our Go application crashes after claiming the key (status=started) but before it can update the record to completed? The key is now poisoned. Any subsequent request with the same key will see the started status and return a 409 Conflict forever.
Solution: The unlock_at Field
This is precisely why our schema included an unlock_at timestamp. When we create the record in the started state, we set unlock_at to a time in the near future (e.g., NOW() + 5 minutes).
Our Get logic needs to be enhanced:
// Modified Get logic
// ... inside the 'case "started":' block
if existingRecord.UnlockAt.Valid && time.Now().After(existingRecord.UnlockAt.Time) {
// The lock has expired. We can consider this operation failed and allow a new one to start.
// This is a recovery path. You could either delete the old record and restart the flow,
// or update it to 'failed' and then restart.
// For simplicity, let's treat it as a failed attempt.
http.Error(w, "A previous request for this key timed out and failed", http.StatusConflict)
} else {
// Lock is still valid, operation is in-flight.
http.Error(w, "Request for this idempotency key is already in progress", http.StatusConflict)
}
A background job should also periodically scan for old, started records with expired unlock_at timestamps and mark them as failed to prevent database bloat.
Edge Case 2: Partial Business Logic Failures
Imagine a complex operation that involves multiple steps:
1. Start a database transaction.
2. Write to table_a.
3. Make an external API call to a third party.
4. Write to table_b.
5. Commit the transaction.
If the external API call (step 3) fails, the entire transaction should be rolled back. Our middleware captures the resulting 500 error and marks the idempotency key as failed. This is correct.
A subsequent retry from the client will see the failed status. The system can now decide its policy: should it allow a full retry? Or should it return a permanent failure? For most use cases, allowing a retry is desirable. The logic would be to treat a failed record similarly to a non-existent one, allowing the Create step to proceed again (after perhaps deleting the old failed record).
Edge Case 3: Request Body Divergence
What if a client retries with the same Idempotency-Key but a different request body? The specification for idempotency generally implies that the entire request (method, path, headers, body) is identical.
Solution: Hashing the Request Body
For an even more robust implementation, you can store a hash of the request body along with the idempotency key. When a key collision occurs, you can re-calculate the hash of the incoming request and compare it to the stored hash.
ALTER TABLE idempotency_keys ADD COLUMN request_body_hash CHAR(64);
If the hashes do not match, it's a client error. The client is improperly re-using an idempotency key for a different logical operation. The correct response is a 422 Unprocessable Entity with a clear error message.
Conclusion: From Theory to Production-Ready
Implementing a truly robust idempotency layer is a hallmark of a mature, reliable distributed system. It moves beyond simple checks into the realm of state machines, atomic database operations, and careful failure mode analysis.
Key Takeaways for Senior Engineers:
* Atomicity is paramount: The race condition is defeated by an atomic claim operation, achieved via INSERT ... ON CONFLICT or SETNX.
* Model the full lifecycle: A started, completed, failed state machine, combined with a lock timeout (unlock_at), is necessary to handle in-flight requests and process crashes.
By moving past the naive check-then-act pattern and implementing the two-phase state machine discussed here, you can build APIs that are resilient, predictable, and safe, even in the chaotic world of distributed systems.