Atomic Idempotency Middleware in Go with Redis and Lua Scripting

Goh Ling Yong

The Inescapable Problem of Duplicates in Distributed Systems

In any non-trivial distributed system, particularly those employing message queues or asynchronous API calls, the promise of "exactly-once" delivery is often a costly, complex, and sometimes unattainable ideal. The far more common reality is "at-least-once" delivery. This pragmatic guarantee ensures messages are never lost, but at the cost of potential duplicates. For a senior engineer, this isn't a surprise; it's a fundamental design constraint.

The burden of handling these duplicates invariably falls upon the consumer or service endpoint. A simple POST /orders request, if it encounters a network timeout, will likely be retried by the client. If the first request actually succeeded but the response was lost, this retry could result in a duplicate order. This is where idempotency ceases to be an academic concept and becomes a critical business requirement.

This article is not an introduction to idempotency. We assume you understand the why. Instead, we will perform a deep, surgical dive into the how: specifically, how to implement a robust, high-performance, and atomic idempotency middleware in Go, using Redis as a distributed state store and Lua scripting as our guarantee of atomicity. We will dissect naive approaches, expose their race conditions, and build a production-grade solution that handles concurrent requests, in-progress states, and graceful failure modes.


Establishing the Contract: The Idempotency Key

Before a single line of middleware code is written, we must establish the contract with our clients: the Idempotency-Key. This client-generated, unique identifier is the cornerstone of the entire pattern. The server uses this key to recognize resent requests for the same operation.

Key Generation Strategies:

  • UUIDs (v4 or v7): The most common and recommended approach. A client generates a UUIDv4 before the first attempt and reuses it for all retries of that specific operation. UUIDv7 is an emerging alternative that combines a timestamp with random bits, making keys sortable and offering better database locality, which can be a marginal benefit if you're persisting these keys.
  • Hash of Request Payload: A deterministic hash (e.g., SHA-256) of the stable parts of the request body can also serve as a key. This is less common as it requires careful normalization of the payload (e.g., ordering JSON keys) to ensure the hash is consistent. It's brittle and can fail if any trivial, non-functional part of the payload changes.

For our implementation, we'll assume the client sends the key via an HTTP header, Idempotency-Key. Our middleware's first job is to extract this key. A request without this key is treated as non-idempotent and bypasses the middleware's logic.

    go
    // A simplified example of extracting the key
    func getIdempotencyKey(r *http.Request) string {
        return r.Header.Get("Idempotency-Key")
    }
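
On the client side, the other half of the contract is generating the key once and reusing it on every retry. Here is a minimal sketch of that behaviour, assuming the github.com/google/uuid package; the endpoint URL and the retry/backoff policy are placeholders for illustration only.

    go
    // Client-side sketch: one Idempotency-Key per logical operation, reused
    // across retries. Assumes github.com/google/uuid; the URL and retry policy
    // are illustrative.
    package client

    import (
        "bytes"
        "net/http"
        "time"

        "github.com/google/uuid"
    )

    func submitOrder(httpClient *http.Client, body []byte) (*http.Response, error) {
        key := uuid.NewString() // generated once, before the first attempt

        var lastErr error
        for attempt := 0; attempt < 3; attempt++ {
            req, err := http.NewRequest(http.MethodPost, "https://api.example.com/orders", bytes.NewReader(body))
            if err != nil {
                return nil, err
            }
            req.Header.Set("Idempotency-Key", key) // the same key on every retry
            req.Header.Set("Content-Type", "application/json")

            resp, err := httpClient.Do(req)
            if err == nil {
                return resp, nil
            }
            lastErr = err
            time.Sleep(time.Duration(attempt+1) * time.Second) // naive backoff
        }
        return nil, lastErr
    }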

    The Flawed "Check-Then-Act" Approach

    A junior engineer's first attempt at this problem might involve a simple two-step Redis operation:

  1. GET idempotency_key.
  2. If the key doesn't exist (redis.Nil), proceed with the operation, and then SET idempotency_key result.
  3. If the key exists, return the cached result.
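
In code, the naive version looks something like the sketch below (error handling trimmed, and the go-redis v8 client used throughout this article assumed). The two Redis calls bracket the business logic, but nothing makes them atomic.

    go
    // Naive check-then-act: GET, run the business logic, then SET.
    // Nothing makes the two Redis calls atomic, which is exactly the flaw.
    package naive

    import (
        "context"
        "time"

        "github.com/go-redis/redis/v8"
    )

    func naiveIdempotency(ctx context.Context, rdb *redis.Client, key string, doWork func() (string, error)) (string, error) {
        cached, err := rdb.Get(ctx, key).Result()
        if err == nil {
            return cached, nil // key exists: return the cached result
        }
        if err != redis.Nil {
            return "", err // a genuine Redis error
        }

        // Key doesn't exist: run the business logic...
        result, err := doWork()
        if err != nil {
            return "", err
        }

        // ...and only then record it. Two concurrent requests can both reach
        // this point, and both will have executed doWork().
        return result, rdb.Set(ctx, key, result, 24*time.Hour).Err()
    }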

    This is a classic race condition waiting to happen in production. Imagine two identical requests, A and B, arriving milliseconds apart:

  1. Request A executes GET. The key doesn't exist.
  2. Request B executes GET. The key still doesn't exist.
  3. Request A's business logic executes.
  4. Request B's business logic executes (duplicate operation!).
  5. Request A finishes and SETs the result.
  6. Request B finishes and SETs the result, overwriting A's.

This is completely broken. The check and the action must be atomic.

    Atomic Operations: The `SET NX` Primitive

    Redis provides the perfect primitive for this: the SET command with the NX option. SET key value NX will only set the key if it does not already exist. This single, atomic command eliminates the race condition described above.

    Let's build a slightly more robust middleware using this primitive.

    go
    package main
    
    import (
    	"context"
    	"fmt"
    	"log"
    	"net/http"
    	"time"
    
    	"github.com/go-redis/redis/v8"
    )
    
    var rdb *redis.Client
    
    func init() {
    	rdb = redis.NewClient(&redis.Options{
    		Addr: "localhost:6379",
    	})
    }
    
    // IdempotencyMiddlewareV1 provides a basic, but still flawed, idempotency check.
    func IdempotencyMiddlewareV1(next http.Handler) http.Handler {
    	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    		key := r.Header.Get("Idempotency-Key")
    		if key == "" {
    			next.ServeHTTP(w, r)
    			return
    		}
    
    		redisKey := fmt.Sprintf("idempotency:%s", key)
    		ctx := context.Background()
    
    		// Atomically set the key if it doesn't exist.
    		// We set a placeholder value "in-progress" with a short TTL.
    		// This acts as a lock.
    		wasSet, err := rdb.SetNX(ctx, redisKey, "in-progress", 5*time.Minute).Result()
    		if err != nil {
    			http.Error(w, "Internal Server Error", http.StatusInternalServerError)
    			return
    		}
    
    		if wasSet {
    			// We acquired the lock. Proceed with the handler.
    			next.ServeHTTP(w, r)
    
    			// After processing, we should update the key with the actual response.
    			// For simplicity, we'll just set it to "completed".
    			// A real implementation would store status code, headers, and body.
    			rdb.Set(ctx, redisKey, "completed", 24*time.Hour)
    		} else {
    			// The key already existed. This could be a concurrent request or a completed one.
    			// This is where V1 is flawed. We don't know the state.
    			val, err := rdb.Get(ctx, redisKey).Result()
    			if err != nil {
    				http.Error(w, "Internal Server Error", http.StatusInternalServerError)
    				return
    			}
    
    			if val == "completed" {
    				// Here you would retrieve and serve the cached response.
    				w.WriteHeader(http.StatusOK)
    				w.Write([]byte("Returning cached response"))
    			} else { // val is "in-progress"
    				// A concurrent request is being processed. Reject this one.
    				http.Error(w, "Request in progress", http.StatusConflict)
    			}
    		}
    	})
    }

    The Lingering Flaw in `SET NX`

    This IdempotencyMiddlewareV1 is better, but still has a subtle race condition and a poor user experience.

  • The Race: There's a gap between SetNX returning false and us executing GET, so the state we read can already be stale by the time we act on it. We might read "in-progress" an instant before the first request completes and return a 409 Conflict, when waiting a moment longer would have let us return the now-completed response.
  • The User Experience: Returning a 409 Conflict immediately forces the client to implement its own polling or backoff logic. It would be far better if the server could handle this, perhaps by waiting for the first request to complete.

To solve these issues, we need to manage multiple states (PENDING, COMPLETED) and perform our check-and-act logic in a single, indivisible operation. This is a textbook use case for Redis Lua scripting.


    The Production-Grade Solution: Multi-State Atomicity with Lua

    By embedding Lua scripts within Redis, we can execute complex logic on the Redis server itself, ensuring that no other command can run concurrently against the keys our script is touching. This is our guarantee of atomicity.

    We will design a two-stage process, orchestrated by two Lua scripts:

  • check_and_lock.lua: Executed at the beginning of a request. It checks the key's state and either acquires a PENDING lock or returns the COMPLETED result.
  • set_completed.lua: Executed at the end of a successful request. It atomically updates the key from PENDING to COMPLETED, storing the final response and setting a long-term TTL.

    Script 1: `check_and_lock.lua`

    This script is the core of our logic. It will atomically handle all initial states.

    lua
    -- check_and_lock.lua
    -- KEYS[1]: The idempotency key (e.g., "idempotency:uuid-123")
    -- ARGV[1]: The lock TTL in seconds (for the PENDING state)
    
    local key = KEYS[1]
    local lock_ttl = ARGV[1]
    
    local value = redis.call('GET', key)
    
    if value == false then
        -- Key does not exist. This is the first time we see this request.
        -- Set it to PENDING with a short lock TTL.
        redis.call('SET', key, '{"status":"PENDING"}', 'EX', lock_ttl)
        -- Return 'PROCEED' to signal the application to run the business logic.
        return 'PROCEED'
    else
        -- Key exists. We need to check its status.
        local data = cjson.decode(value)
        if data.status == 'PENDING' then
            -- Another request is currently processing this key.
            -- Return 'CONFLICT' to signal the application to wait or reject.
            return 'CONFLICT'
        elseif data.status == 'COMPLETED' then
            -- The request has already been completed.
            -- Return the cached response.
            return value
        end
    end
    
    -- Fallback, should not be reached
    return 'ERROR'

    Key Design Decisions in this Script:

    * JSON Payload: We store a JSON object as the value. This allows us to evolve the state machine later (e.g., add a FAILED state) and to store the full response for COMPLETED requests.

    * Lock TTL (ARGV[1]): This is critical. If a server instance crashes after acquiring the lock but before completing the request, this TTL ensures the lock is eventually released, allowing a subsequent retry to proceed. A value of 5-10 minutes is a reasonable starting point.

    * Return Values: The script returns distinct strings (PROCEED, CONFLICT) or the cached JSON data. This creates a simple, clear API for our Go middleware to consume.

    Script 2: `set_completed.lua`

    This script is simpler. It's called after the business logic has successfully executed.

    lua
    -- set_completed.lua
    -- KEYS[1]: The idempotency key
    -- ARGV[1]: The final response data (JSON string)
    -- ARGV[2]: The final TTL for the completed key
    
    local key = KEYS[1]
    local response_data = ARGV[1]
    local final_ttl = ARGV[2]
    
    local value = redis.call('GET', key)
    
    -- We should only update if the key is in PENDING state,
    -- as a safeguard against unexpected state transitions.
    if value ~= false then
        local data = cjson.decode(value)
        if data.status == 'PENDING' then
            redis.call('SET', key, response_data, 'EX', final_ttl)
            return 'OK'
        end
    end
    
    -- This might happen if the lock expired. The caller should handle this.
    return 'ERROR'

    Implementing the Full Go Middleware

    Now, let's integrate these scripts into a production-ready Go middleware. We'll need a way to capture the response from the next handler to cache it.

    go
    package main
    
    import (
    	"bytes"
    	"context"
    	"encoding/json"
    	"fmt"
    	"io/ioutil"
    	"log"
    	"net/http"
    	"net/http/httptest"
    	"time"
    
    	"github.com/go-redis/redis/v8"
    )
    
    // StoredResponse is the structure we'll save in Redis.
    type StoredResponse struct {
    	Status  string      `json:"status"` // PENDING, COMPLETED
    	Code    int         `json:"code"`
    	Headers http.Header `json:"headers"`
    	Body    []byte      `json:"body"`
    }
    
    // IdempotencyMiddleware manages the entire lifecycle.
    type IdempotencyMiddleware struct {
    	RedisClient *redis.Client
    	CheckScript *redis.Script
    	SetScript   *redis.Script
    	LockTTL     time.Duration
    	FinalTTL    time.Duration
    }
    
    // NewIdempotencyMiddleware initializes the middleware and loads the Lua scripts into Redis.
    func NewIdempotencyMiddleware(rdb *redis.Client) *IdempotencyMiddleware {
    	checkLua := `
            local key = KEYS[1]
            local lock_ttl = ARGV[1]
            local value = redis.call('GET', key)
            if value == false then
                redis.call('SET', key, '{"status":"PENDING"}', 'EX', lock_ttl)
                return 'PROCEED'
            else
                local data = cjson.decode(value)
                if data.status == 'PENDING' then
                    return 'CONFLICT'
                elseif data.status == 'COMPLETED' then
                    return value
                end
            end
            return 'ERROR'
        `
    
    	setLua := `
            local key = KEYS[1]
            local response_data = ARGV[1]
            local final_ttl = ARGV[2]
            local value = redis.call('GET', key)
            if value ~= false then
                local data = cjson.decode(value)
                if data.status == 'PENDING' then
                    redis.call('SET', key, response_data, 'EX', final_ttl)
                    return 'OK'
                end
            end
            return 'ERROR'
        `
    
    	return &IdempotencyMiddleware{
    		RedisClient: rdb,
    		CheckScript: redis.NewScript(checkLua),
    		SetScript:   redis.NewScript(setLua),
    		LockTTL:     5 * time.Minute,
    		FinalTTL:    24 * time.Hour,
    	}
    }
    
    func (im *IdempotencyMiddleware) Middleware(next http.Handler) http.Handler {
    	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    		key := r.Header.Get("Idempotency-Key")
    		if key == "" {
    			next.ServeHTTP(w, r)
    			return
    		}
    
    		redisKey := fmt.Sprintf("idempotency:%s", key)
    		ctx := r.Context()
    
    		// === STAGE 1: Check and Lock ===
    		result, err := im.CheckScript.Run(ctx, im.RedisClient, []string{redisKey}, im.LockTTL.Seconds()).Result()
    		if err != nil {
    			log.Printf("Error running CheckScript: %v", err)
    			// Fail-closed: If our idempotency store is down, we reject the request.
    			http.Error(w, "Idempotency service unavailable", http.StatusServiceUnavailable)
    			return
    		}
    
    		switch resStr := result.(string); resStr {
    		case "PROCEED":
    			// This is a new request. We have the lock.
    			// Use a response recorder to capture the response from the next handler.
    			recorder := httptest.NewRecorder()
    			next.ServeHTTP(recorder, r)
    
    			// Copy the recorded response to the actual response writer.
    			for k, v := range recorder.Header() {
    				w.Header()[k] = v
    			}
    			w.WriteHeader(recorder.Code)
    			w.Write(recorder.Body.Bytes())
    
    			// === STAGE 2: Store the final response ===
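    			// Note: responses >= 400 are not cached below; the PENDING lock is
    			// simply left to expire, after which the client may retry.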
    			if recorder.Code < 400 { // Only cache successful responses
    				responseToCache := StoredResponse{
    					Status:  "COMPLETED",
    					Code:    recorder.Code,
    					Headers: recorder.Header(),
    					Body:    recorder.Body.Bytes(),
    				}
    				jsonData, _ := json.Marshal(responseToCache)
    
    				err := im.SetScript.Run(ctx, im.RedisClient, []string{redisKey}, jsonData, im.FinalTTL.Seconds()).Err()
    				if err != nil {
    					// Critical: Failed to save the result. The lock will eventually expire.
    					// Log this as a high-priority error.
    					log.Printf("CRITICAL: Failed to set completed state for key %s: %v", redisKey, err)
    				}
    			}
    
    		case "CONFLICT":
    			// Another request is in-flight. Reject this one.
    			http.Error(w, "A request with this Idempotency-Key is already in progress.", http.StatusConflict)
    
    		default: // This is a cached response
    			var cachedResponse StoredResponse
    			if err := json.Unmarshal([]byte(resStr), &cachedResponse); err != nil {
    				log.Printf("Error unmarshalling cached response: %v", err)
    				http.Error(w, "Internal Server Error", http.StatusInternalServerError)
    				return
    			}
    
    			// Serve the cached response.
    			for k, v := range cachedResponse.Headers {
    				w.Header()[k] = v
    			}
    			w.WriteHeader(cachedResponse.Code)
    			w.Write(cachedResponse.Body)
    		}
    	})
    }
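
Wiring the middleware into a server is then a matter of wrapping the handlers you care about. The sketch below continues the same package main as the middleware above; the /orders route, the handler body, and the Redis address are placeholders.

    go
    // Illustrative wiring; continues the package defined above.
    func main() {
        rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

        im := NewIdempotencyMiddleware(rdb)

        mux := http.NewServeMux()
        mux.Handle("/orders", im.Middleware(http.HandlerFunc(createOrderHandler)))

        log.Fatal(http.ListenAndServe(":8080", mux))
    }

    // createOrderHandler stands in for the real, side-effect-producing handler.
    func createOrderHandler(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusCreated)
        w.Write([]byte(`{"order_id":"example"}`))
    }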

    Production Patterns & Performance Considerations

    This implementation is robust, but deploying it requires careful thought about performance and failure modes.

    Performance Overhead

    * Best Case (New Request): Two Redis round-trips are added to the request lifecycle. CheckScript.Run() before the handler, SetScript.Run() after. With a co-located Redis, this might be ~1-2ms of added latency.

    * Cached Case (Duplicate Request): One Redis round-trip. CheckScript.Run() returns the cached data immediately. This is extremely fast.

    * Serialization: We're using encoding/json. For high-throughput services, consider a faster binary format like MessagePack or Protobuf to reduce CPU overhead and the size of data sent to/from Redis.

    This pattern is best suited for operations where the business logic is significantly more expensive than the Redis overhead (e.g., database writes, calls to third-party services). Do not apply it to trivial GET requests.
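
As a sketch of the serialization point above, swapping encoding/json for MessagePack on the Go side is close to a drop-in change, assuming a library such as github.com/vmihailenco/msgpack/v5. Note that the Lua scripts would need a matching change: Redis's Lua environment ships cmsgpack, so cjson.decode would become cmsgpack.unpack, and the PENDING placeholder would have to be msgpack-encoded as well.

    go
    // Sketch: MessagePack helpers for StoredResponse, assuming
    // github.com/vmihailenco/msgpack/v5 is added to the import block.
    func encodeStoredResponse(resp StoredResponse) ([]byte, error) {
        return msgpack.Marshal(resp) // smaller and cheaper to encode than JSON
    }

    func decodeStoredResponse(data []byte) (StoredResponse, error) {
        var resp StoredResponse
        err := msgpack.Unmarshal(data, &resp)
        return resp, err
    }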

    TTL Strategy Deep Dive

    * Lock TTL: This is your safety net. It must be longer than the maximum expected processing time for your handler, but short enough to not block retries for an unreasonable amount of time if a server dies. If your P99 response time is 2 seconds, a Lock TTL of 30-60 seconds might be appropriate. If it's a long-running batch job, this could be 10-15 minutes.

    * Final TTL: This determines how long you remember a completed request. 24 hours is a common standard. It depends on your clients' retry policies. For payment processing, this might be extended to 72 hours or more.
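
Because LockTTL and FinalTTL are exported fields on the middleware struct, tuning them per service is a one-line change wherever the middleware is constructed; the values below simply mirror the guidance above and are illustrative.

    go
    im := NewIdempotencyMiddleware(rdb)
    im.LockTTL = 60 * time.Second // comfortably above a ~2s P99 handler latency
    im.FinalTTL = 72 * time.Hour  // e.g., longer retention for payment-style flows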

    Advanced Edge Cases and Error Handling

    * Redis Unavailability: Our code implements a fail-closed strategy. If Redis is down, we return a 503 Service Unavailable. This is the safest approach for operations with side effects. Processing the request (fail-open) would break the idempotency guarantee. This choice must be a conscious one, aligned with business requirements.

    * Partial Failures (The Happy Path for Self-Healing):

    1. Request A acquires the PENDING lock.

    2. The handler logic writes to the database.

    3. The server process crashes before it can run SetScript to mark the request as COMPLETED.

    4. The PENDING key remains in Redis.

    5. A client retry (Request B) arrives. CheckScript sees the PENDING state and returns CONFLICT. This is correct; it prevents a duplicate DB write.

    6. After the Lock TTL elapses, the key expires and is removed from Redis.

    7. A further client retry (Request C) arrives. CheckScript sees no key, acquires a new PENDING lock, and proceeds. The application logic must now be idempotent on its own (e.g., INSERT ... ON CONFLICT DO NOTHING in the database; a minimal sketch appears at the end of this section) to handle this second execution. The middleware guarantees atomicity at the entrypoint, but your core logic should still strive for idempotency where possible.

    * What to Cache: Be cautious about caching large response bodies in Redis, as this can strain memory. For responses with large payloads, you might choose to only cache the headers and status code, and store the body in a more suitable object store like S3, putting only the S3 object URL in the Redis value.
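
To make step 7 of the partial-failure walkthrough concrete, here is a minimal sketch of an application-level idempotent write, assuming Postgres, the standard database/sql package, and a hypothetical orders table with a unique idempotency_key column.

    go
    // Hypothetical schema: orders(idempotency_key TEXT UNIQUE, payload JSONB, ...).
    // ON CONFLICT DO NOTHING makes the write safe to repeat if the middleware's
    // PENDING lock expired and a retry re-executed the handler.
    package store

    import (
        "context"
        "database/sql"
    )

    func insertOrder(ctx context.Context, db *sql.DB, idempotencyKey string, payload []byte) error {
        _, err := db.ExecContext(ctx,
            `INSERT INTO orders (idempotency_key, payload)
             VALUES ($1, $2)
             ON CONFLICT (idempotency_key) DO NOTHING`,
            idempotencyKey, payload,
        )
        return err
    }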

    Conclusion

    We have journeyed from a fundamentally flawed check-then-act pattern to a production-grade, atomic idempotency layer. By leveraging the atomicity of Redis Lua scripts, we created a state machine (PENDING -> COMPLETED) that correctly handles concurrency, provides a clean API for our application logic, and self-heals from partial failures through a well-considered TTL strategy.

    Implementing this pattern is not trivial. It adds operational complexity and a dependency on a healthy Redis cluster. However, for any service that exposes critical, side-effect-producing endpoints in an at-least-once delivery environment, this level of rigor is not a luxury—it is a prerequisite for building a reliable and trustworthy system.
