Building Resilient APIs: Idempotency Layers with Redis and Lua
The Inevitability of Retries and the Peril of Side Effects
In any non-trivial distributed system, network partitions, transient service unavailability, and client-side timeouts are not edge cases; they are operational certainties. A well-behaved client, upon receiving a timeout or a network error for a state-changing operation (POST, PATCH, DELETE), will implement a retry strategy. This is where the contract of a resilient API is truly tested. If a POST /api/v1/payments request times out, did the payment process? Did it fail? The client doesn't know. A naive retry could result in a double charge, a catastrophic business logic failure.
This is the core problem that idempotency solves. An operation is idempotent if making the same request multiple times produces the same result and the same side effects as making it once. While GET, HEAD, OPTIONS, PUT, and DELETE are defined as idempotent by the HTTP specification, POST and PATCH are not. It's our responsibility as backend engineers to enforce idempotency for these endpoints when they trigger critical, non-reversible side effects.
This article bypasses introductory concepts. We assume you understand why idempotency is critical. Instead, we will focus on building a robust, high-performance, and race-condition-free idempotency layer suitable for production systems, using Redis as our state store and Lua scripting for atomicity.
Why Naive Approaches Fail Under Concurrency
Before diving into the robust solution, let's briefly dissect why simpler patterns are insufficient for high-throughput systems.
Pattern 1: The Database Flag
A common first attempt is to use a relational database. Create a table idempotency_keys with a unique constraint on the key.
CREATE TABLE idempotency_keys (
key VARCHAR(255) PRIMARY KEY,
request_hash VARCHAR(255) NOT NULL,
response_code INT,
response_body TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
The flow would be:
- Begin transaction.
- SELECT * FROM idempotency_keys WHERE key = ?
- If a row exists, return the stored response.
- Otherwise, INSERT the key.
- Commit transaction.
- Execute business logic.
- UPDATE the row with the final response.

The Flaw: This introduces a significant race condition. Two concurrent requests with the same key can both execute the SELECT query (step 2) before either has performed the INSERT. Both will see that the key doesn't exist and proceed, leading to a duplicate operation. While a PRIMARY KEY or UNIQUE constraint will cause one of the INSERTs to fail, the business logic may have already been initiated in parallel. This pattern is also slow due to disk I/O and transactional overhead.
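To see the race concretely, here is a self-contained Go sketch. The `fakeDB` type is a stand-in for the `idempotency_keys` table (not part of any real stack); a channel handshake forces the harmful interleaving so both "requests" run the existence check before either inserts:

```go
package main

import (
	"fmt"
	"sync"
)

// fakeDB stands in for the idempotency_keys table.
type fakeDB struct {
	mu   sync.Mutex
	keys map[string]bool
}

func (db *fakeDB) exists(key string) bool {
	db.mu.Lock()
	defer db.mu.Unlock()
	return db.keys[key]
}

func (db *fakeDB) insert(key string) {
	db.mu.Lock()
	defer db.mu.Unlock()
	db.keys[key] = true
}

// raceDemo forces the bad interleaving: both "requests" run the
// SELECT-equivalent check before either inserts, so both execute
// the business logic.
func raceDemo() int {
	db := &fakeDB{keys: map[string]bool{}}
	executions := 0
	var mu sync.Mutex
	handshake := make(chan struct{})
	var wg sync.WaitGroup

	worker := func(leader bool) {
		defer wg.Done()
		if db.exists("key-A") { // step 2: SELECT — both see "not found"
			return // would return the cached response
		}
		// Barrier: neither side proceeds until BOTH have run the check.
		if leader {
			handshake <- struct{}{}
			<-handshake
		} else {
			<-handshake
			handshake <- struct{}{}
		}
		db.insert("key-A") // INSERT — the second one would hit the unique constraint, but too late
		mu.Lock()
		executions++ // business logic runs anyway
		mu.Unlock()
	}

	wg.Add(2)
	go worker(true)
	go worker(false)
	wg.Wait()
	return executions
}

func main() {
	fmt.Println("business logic executed", raceDemo(), "times") // prints 2
}
```

The unique constraint catches the duplicate INSERT only after both workers have already committed to running the side effect, which is exactly the failure mode described above.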
Pattern 2: Redis SETNX
Using Redis is a step in the right direction due to its in-memory performance. A naive Redis approach might use SETNX (Set if Not Exists).
The flow:
- Read the Idempotency-Key header.
- Run SET idempotency:<key> in_progress NX EX 60 (note: the legacy SETNX command takes no TTL; SET with the NX and EX options provides the same set-if-not-exists semantics plus an expiry).
- If the command succeeds (the SETNX-style return of 1, key was set), proceed with business logic.
- If it fails (return of 0, key already exists), you have a problem. Is the original request still in progress? Or did it complete? You don't know the state.

This is better, but it's an incomplete state machine. You can't differentiate between an in-progress request and a completed one. You could try to GET the key first, but that re-introduces a read-modify-write race condition between your application server and Redis.
To solve this correctly, we need to perform the entire check-and-set logic in a single, atomic operation on the Redis server. This is the perfect use case for Lua scripting.
The Production Pattern: Atomic State Machine with Redis Hashes and Lua
Our goal is to build a state machine for each idempotency key. The key can be in one of three states:
- Absent: the key does not exist; either no request with this key has been seen, or an abandoned lock has expired.
- in_progress: a request with this key is currently executing and holds the lock.
- completed: the request finished, and its response is cached for replay.
We will store this state in a Redis Hash. A Hash is ideal because it allows us to store multiple fields (e.g., status, response_code, response_body) under a single key, which can be manipulated atomically.
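In application code it helps to model these states explicitly. A minimal sketch (the type and constant names here are our own illustrations, not a library API):

```go
package main

import "fmt"

// LockState mirrors the `status` field of the Redis Hash.
type LockState string

const (
	// StateAbsent: the key does not exist; this request may acquire the lock.
	StateAbsent LockState = ""
	// StateInProgress: another request currently holds the lock.
	StateInProgress LockState = "in_progress"
	// StateCompleted: the request finished; serve the cached response.
	StateCompleted LockState = "completed"
)

// keyRecord mirrors the fields we store in the Redis Hash.
type keyRecord struct {
	Status       LockState
	ResponseCode string
	ResponseBody string
}

// nextAction maps a state to what the middleware should do.
func nextAction(s LockState) string {
	switch s {
	case StateAbsent:
		return "acquire lock and run handler"
	case StateInProgress:
		return "return 409 Conflict"
	case StateCompleted:
		return "return cached response"
	default:
		return "error"
	}
}

func main() {
	fmt.Println(nextAction(StateInProgress)) // prints "return 409 Conflict"
}
```

The Lua scripts below implement exactly these transitions on the server side, which is what makes them race-free.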
The Core Lua Script: `acquire_lock.lua`
This script is the heart of our idempotency layer. It's executed atomically by Redis. It takes the idempotency key and a timeout for the lock as arguments.
-- acquire_lock.lua
-- KEYS[1]: The idempotency key (e.g., 'idempotency:uuid-123')
-- ARGV[1]: The lock timeout in seconds (e.g., 60)
-- ARGV[2]: The current request's unique identifier (optional, for logging/debugging)
-- Check if the key exists
local key_exists = redis.call('EXISTS', KEYS[1])
if key_exists == 1 then
-- Key exists, check its state
local status = redis.call('HGET', KEYS[1], 'status')
if status == 'in_progress' then
-- Another request is currently processing. Return 'locked'.
return {'locked'}
elseif status == 'completed' then
-- The request was already completed. Return the cached response.
local response_code = redis.call('HGET', KEYS[1], 'response_code')
local response_body = redis.call('HGET', KEYS[1], 'response_body')
return {'completed', response_code, response_body}
else
-- Should not happen in a correct implementation, but handle as an error.
return {'error', 'unknown_status'}
end
else
-- Key does not exist. This is the first request.
-- Create the hash and set its status to 'in_progress'.
redis.call('HSET', KEYS[1], 'status', 'in_progress', 'request_id', ARGV[2])
redis.call('EXPIRE', KEYS[1], ARGV[1])
return {'acquired'}
end
Analysis of the Lua Script:
* Atomicity: Redis guarantees that this entire script executes without interruption. This completely eliminates the read-modify-write race condition.
* KEYS and ARGV: We use the KEYS table for the key name and ARGV for values. This is a Redis best practice that enables future compatibility with Redis Cluster (as all keys in a script must belong to the same hash slot).
* Stateful Responses: The script returns different arrays based on the state. Our application code will need to parse this response to decide on the next action (proceed, return 409 Conflict, or return cached response).
* Lock Timeout (EXPIRE): This is a crucial guardrail. If our application server crashes after acquiring the lock but before completing the request, the key would be stuck in the in_progress state forever. The EXPIRE command ensures the lock is automatically released after a reasonable period (e.g., 60 seconds), allowing a subsequent retry to proceed. This is a form of garbage collection.
The Second Lua Script: `release_lock_and_cache.lua`
Once our business logic is complete, we need to update the key's state to completed and store the response. We use another script for atomicity.
-- release_lock_and_cache.lua
-- KEYS[1]: The idempotency key
-- ARGV[1]: Final response code (e.g., '201')
-- ARGV[2]: Final response body (e.g., '{"id":"payment_123"}')
-- ARGV[3]: The TTL for the completed key in seconds (e.g., 86400 for 24 hours)
redis.call('HSET', KEYS[1], 'status', 'completed', 'response_code', ARGV[1], 'response_body', ARGV[2])
redis.call('EXPIRE', KEYS[1], ARGV[3])
return 'OK'
Analysis:
* Atomic Update: HSET with multiple field/value pairs is atomic. We update the status and store the response in a single command.
* Completed Key TTL: We set a new, typically longer, TTL for the completed key. This determines how long we'll recognize retries for a completed request. 24 hours is a common choice. After this period, the key will be removed, and a new request with the same idempotency key will be treated as a new, unique request.
Implementation in a Go Middleware
Now, let's integrate these scripts into a practical HTTP middleware using Go. We'll use the popular go-redis library.
1. Setup and Script Loading
First, we define our middleware struct and load the Lua scripts into Redis on application startup. Redis will cache the SHA1 hash of the scripts, allowing us to call them efficiently using EVALSHA.
package idempotency
import (
"context"
"fmt"
"net/http"
"strconv"
"time"

"github.com/go-redis/redis/v8"
"github.com/google/uuid"
)
const (
IdempotencyKeyHeader = "Idempotency-Key"
InProgressLockTTL = 60 * time.Second
CompletedKeyTTL = 24 * time.Hour
)
// Pre-load Lua scripts
const acquireLockScript = `
local key_exists = redis.call('EXISTS', KEYS[1])
if key_exists == 1 then
local status = redis.call('HGET', KEYS[1], 'status')
if status == 'in_progress' then
return {'locked'}
elseif status == 'completed' then
local response_code = redis.call('HGET', KEYS[1], 'response_code')
local response_body = redis.call('HGET', KEYS[1], 'response_body')
return {'completed', response_code, response_body}
else
return {'error', 'unknown_status'}
end
else
redis.call('HSET', KEYS[1], 'status', 'in_progress', 'request_id', ARGV[2])
redis.call('EXPIRE', KEYS[1], ARGV[1])
return {'acquired'}
end
`
const releaseLockScript = `
redis.call('HSET', KEYS[1], 'status', 'completed', 'response_code', ARGV[1], 'response_body', ARGV[2])
redis.call('EXPIRE', KEYS[1], ARGV[3])
return 'OK'
`
type Middleware struct {
rdb *redis.Client
acquireLockSHA string
releaseLockSHA string
}
func New(rdb *redis.Client) (*Middleware, error) {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
acquireSHA, err := rdb.ScriptLoad(ctx, acquireLockScript).Result()
if err != nil {
return nil, fmt.Errorf("failed to load acquire lock script: %w", err)
}
releaseSHA, err := rdb.ScriptLoad(ctx, releaseLockScript).Result()
if err != nil {
return nil, fmt.Errorf("failed to load release lock script: %w", err)
}
return &Middleware{
rdb: rdb,
acquireLockSHA: acquireSHA,
releaseLockSHA: releaseSHA,
}, nil
}
2. The HTTP Middleware Handler
The core logic resides here. We intercept the request, run the acquire_lock script, and then wrap the http.ResponseWriter to capture the response before running the release_lock script.
// responseRecorder captures the status code and body
type responseRecorder struct {
http.ResponseWriter
statusCode int
body []byte
}
func (rec *responseRecorder) WriteHeader(statusCode int) {
rec.statusCode = statusCode
rec.ResponseWriter.WriteHeader(statusCode)
}
func (rec *responseRecorder) Write(body []byte) (int, error) {
rec.body = append(rec.body, body...)
return rec.ResponseWriter.Write(body)
}
func (m *Middleware) Handler(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
idempotencyKey := r.Header.Get(IdempotencyKeyHeader)
if idempotencyKey == "" {
// Not an idempotent request, pass through
next.ServeHTTP(w, r)
return
}
// It's good practice to prefix keys
redisKey := "idempotency:" + idempotencyKey
requestID := uuid.New().String()
// 1. Acquire Lock
result, err := m.rdb.EvalSha(r.Context(), m.acquireLockSHA, []string{redisKey}, int(InProgressLockTTL.Seconds()), requestID).Result()
if err != nil {
http.Error(w, "Internal Server Error", http.StatusInternalServerError)
return
}
resSlice, ok := result.([]interface{})
if !ok || len(resSlice) == 0 {
http.Error(w, "Internal Server Error", http.StatusInternalServerError)
return
}
status, ok := resSlice[0].(string)
if !ok {
http.Error(w, "Internal Server Error", http.StatusInternalServerError)
return
}
switch status {
case "locked":
w.WriteHeader(http.StatusConflict)
w.Write([]byte("Request with this Idempotency-Key is already in progress"))
return
case "completed":
// Return cached response
responseCode, _ := resSlice[1].(string)
responseBody, _ := resSlice[2].(string)
code, convErr := strconv.Atoi(responseCode)
if convErr != nil {
code = http.StatusOK // defensive default; should not happen
}
w.Header().Set("Content-Type", "application/json") // Assuming JSON
w.WriteHeader(code)
w.Write([]byte(responseBody))
return
case "acquired":
// Continue below and execute the wrapped handler
default:
http.Error(w, "Internal Server Error", http.StatusInternalServerError)
return
}
// 2. Execute Business Logic and Capture Response
recorder := &responseRecorder{ResponseWriter: w, statusCode: http.StatusOK}
next.ServeHTTP(recorder, r)
// 3. Release Lock and Cache Response
// Note: We do this even if the handler panics, hence the need for a proper middleware framework with recovery
// For simplicity, we do it directly here.
_, err = m.rdb.EvalSha(r.Context(), m.releaseLockSHA, []string{redisKey}, recorder.statusCode, string(recorder.body), CompletedKeyTTL.Seconds()).Result()
if err != nil {
// Log the error, but don't fail the request as the client already has the response.
// The lock will eventually expire via TTL.
fmt.Printf("Error releasing idempotency lock for key %s: %v\n", redisKey, err)
}
})
}
Analysis of the Middleware:
* Header Check: The middleware is a no-op if the Idempotency-Key header is missing, allowing the same API to serve both idempotent and non-idempotent requests.
* Response Recorder: The use of a custom ResponseWriter is a standard pattern in Go for intercepting the response. This is critical for capturing the status code and body to be cached.
* Error Handling on Release: Notice the comment in the releaseLockSHA call. If caching the response fails, we log it but don't fail the request. The client has already received a successful response. The worst-case scenario is that the lock expires via its TTL, and a future retry is re-processed. This is a graceful degradation path.
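The recorder pattern itself can be sanity-checked in isolation with the standard library's httptest package. This standalone sketch re-declares the type so it compiles on its own:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// responseRecorder captures the status code and body (same shape as above).
type responseRecorder struct {
	http.ResponseWriter
	statusCode int
	body       []byte
}

func (rec *responseRecorder) WriteHeader(statusCode int) {
	rec.statusCode = statusCode
	rec.ResponseWriter.WriteHeader(statusCode)
}

func (rec *responseRecorder) Write(body []byte) (int, error) {
	rec.body = append(rec.body, body...)
	return rec.ResponseWriter.Write(body)
}

// captureDemo runs a handler through the recorder and returns what was captured.
func captureDemo() (int, string) {
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusCreated)
		w.Write([]byte(`{"ok":true}`))
	})
	rec := &responseRecorder{ResponseWriter: httptest.NewRecorder(), statusCode: http.StatusOK}
	handler.ServeHTTP(rec, httptest.NewRequest("POST", "/payments", nil))
	return rec.statusCode, string(rec.body)
}

func main() {
	code, body := captureDemo()
	fmt.Println(code, body) // prints: 201 {"ok":true}
}
```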
Advanced Considerations and Production Hardening
Implementing the core logic is only half the battle. Senior engineers must consider the edge cases and operational realities.
Edge Case: Handling Large Response Payloads
Our current implementation stores the entire response body in Redis. This is fine for small JSON payloads, but what if your API returns a 10MB file or a large dataset? Storing this in Redis is inefficient and can strain your Redis instance's memory.
Solution: A hybrid caching approach.
- For responses under a certain threshold (e.g., 64KB), store them directly in the Redis Hash.
- For larger responses, upload the response body to a dedicated object store like Amazon S3 or Google Cloud Storage.
- In the Redis Hash, instead of the body, store a pointer or URL to the object in S3.
- When serving a cached response, the middleware would check for this pointer, fetch the object from S3, and stream it back to the client.
This adds complexity but keeps your Redis instance lean and fast, using it only for metadata and small payloads.
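A sketch of the storage decision, with a hypothetical `objectStore` interface standing in for S3/GCS (nothing here is a real SDK call; the 64KB threshold is illustrative):

```go
package main

import "fmt"

const inlineThreshold = 64 * 1024 // 64KB: illustrative cutoff

// objectStore is a hypothetical stand-in for S3/GCS.
type objectStore interface {
	Put(key string, body []byte) (url string, err error)
}

// memStore is an in-memory fake used only for illustration.
type memStore struct{ objects map[string][]byte }

func (m *memStore) Put(key string, body []byte) (string, error) {
	m.objects[key] = body
	return "mem://" + key, nil
}

// cachedBody decides what to store in the Redis Hash: the body itself for
// small payloads, or a pointer to the object store for large ones.
func cachedBody(store objectStore, idemKey string, body []byte) (field, value string, err error) {
	if len(body) <= inlineThreshold {
		return "response_body", string(body), nil
	}
	url, err := store.Put("idempotency/"+idemKey, body)
	if err != nil {
		return "", "", err
	}
	return "response_body_url", url, nil
}

func main() {
	store := &memStore{objects: map[string][]byte{}}
	f, v, _ := cachedBody(store, "uuid-123", []byte(`{"small":true}`))
	fmt.Println(f, v) // prints: response_body {"small":true}
	f, _, _ = cachedBody(store, "uuid-456", make([]byte, 10<<20)) // 10MB payload
	fmt.Println(f)    // prints: response_body_url
}
```

On the read path, the middleware would mirror this check: if the hash contains `response_body_url` instead of `response_body`, fetch the object and stream it.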
Edge Case: Request Body Hashing
The idempotency key guarantees that two requests with the same key are treated as one. But what if a client accidentally reuses a key for a different request payload?
POST /payments with Idempotency-Key: A and body: {amount: 100}
POST /payments with Idempotency-Key: A and body: {amount: 200}
With our current implementation, the second request would receive the cached response for the first, which is incorrect and dangerous. The idempotency key should be tied to the request's parameters.
Solution: Store a hash of the request body along with the initial lock.
Modify acquire_lock.lua:
-- ARGV[3]: A hash of the request body (e.g., SHA256)
-- ... inside the 'if key_exists == 1' block ...
local stored_hash = redis.call('HGET', KEYS[1], 'request_hash')
if stored_hash and stored_hash ~= ARGV[3] then
-- Key is being reused for a different request. This is a client error.
return {'conflict', 'key_reused'}
end
-- ... inside the 'else' block (first request) ...
redis.call('HSET', KEYS[1], 'status', 'in_progress', 'request_id', ARGV[2], 'request_hash', ARGV[3])
Your middleware would now be responsible for reading and hashing the request body before calling the script. This makes your API significantly safer against client-side bugs.
Performance and Scalability
* Redis Latency: The entire process adds two Redis round trips to each idempotent request. Since these are EVALSHA calls to in-memory data, the latency is typically sub-millisecond, which is acceptable for most APIs.
* Redis Cluster: The use of KEYS[1] ensures our scripts are compatible with Redis Cluster. The idempotency key itself would act as the sharding key, distributing the load across the cluster.
* Connection Pooling: Ensure your application maintains a healthy pool of connections to Redis to avoid connection setup overhead on each request.
Client-Side Responsibilities
An idempotent API is a contract. The server does its part, but the client must also behave correctly.
- 5xx / timeout: If the client receives a 5xx error or a timeout, it should retry with the exact same idempotency key and request body.
- 409 Conflict: If the client receives a 409 Conflict (our locked state), it means another request is being processed. The client should wait using an exponential backoff strategy before retrying.
- 422 Unprocessable Entity: If using request body hashing, the server should return a 422 or 400 error if the key is reused with a different payload. The client should treat this as a fatal error and not retry.

Full Example: Tying It All Together
Here is a runnable main.go to demonstrate the middleware in action.
package main
import (
"context"
"encoding/json"
"fmt"
"log"
"net/http"
"time"
"idempotency-example/idempotency"
"github.com/go-redis/redis/v8"
)
func main() {
rdb := redis.NewClient(&redis.Options{
Addr: "localhost:6379",
})
if err := rdb.Ping(context.Background()).Err(); err != nil {
log.Fatalf("Could not connect to Redis: %v", err)
}
idemMiddleware, err := idempotency.New(rdb)
if err != nil {
log.Fatalf("Could not create idempotency middleware: %v", err)
}
// A simple handler that simulates work and returns a response
paymentHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
log.Println("Executing business logic...")
time.Sleep(2 * time.Second) // Simulate a slow operation
resp := map[string]string{
"status": "success",
"transaction_id": "txn_" + fmt.Sprintf("%d", time.Now().UnixNano()),
}
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusCreated)
json.NewEncoder(w).Encode(resp)
log.Println("Business logic finished.")
})
http.Handle("/payments", idemMiddleware.Handler(paymentHandler))
log.Println("Server starting on :8080")
if err := http.ListenAndServe(":8080", nil); err != nil {
log.Fatalf("Server failed: %v", err)
}
}
How to Test:
- Start a local Redis instance: docker run -p 6379:6379 redis
- Start the server: go run .
- In one terminal, send the first request:
curl -v -X POST \
-H "Content-Type: application/json" \
-H "Idempotency-Key: 111f787c-3538-4034-8032-a1649959f65d" \
http://localhost:8080/payments
This will take 2 seconds, and you will see the "Executing business logic..." log.
- While the first request is still in flight, send the same request from a second terminal:
curl -v -X POST \
-H "Content-Type: application/json" \
-H "Idempotency-Key: 111f787c-3538-4034-8032-a1649959f65d" \
http://localhost:8080/payments
This will return a 409 Conflict instantly.
- After the first request completes, send it again:
curl -v -X POST \
-H "Content-Type: application/json" \
-H "Idempotency-Key: 111f787c-3538-4034-8032-a1649959f65d" \
http://localhost:8080/payments
This will return instantly with the exact same 201 Created response and body as the first request, but you will not see the "Executing business logic..." log message. The response was served from the Redis cache.
Conclusion
Building a truly resilient, idempotent API requires moving beyond simple database checks and embracing atomic operations. By leveraging Redis Hashes for a flexible state machine and Lua scripts for guaranteed atomicity, we can construct an idempotency layer that is fast, scalable, and safe from the race conditions that plague naive implementations. This pattern, while adding a layer of complexity, is a non-negotiable component of any system where the side effects of duplicate operations are unacceptable. It provides the strong guarantees necessary to build reliable distributed systems in finance, e-commerce, and any domain where correctness under failure conditions is paramount.