Idempotent gRPC Retries with Linkerd Service Profiles
The Peril of Naive Retries in Distributed Systems
In any non-trivial microservices architecture, transient network failures are not an edge case; they are an inevitability. A common response is to implement client-side retry logic. However, for state-changing, non-idempotent operations, this is a dangerous path. Consider a canonical CreateOrder gRPC endpoint. A standard interaction flow that leads to disaster looks like this:
1. Client A sends a CreateOrderRequest to the OrderService.
2. The OrderService creates the order and sends a CreateOrderResponse back to Client A.
3. A transient network partition occurs. The response packet is dropped and never reaches the client.
4. Client A times out and, having seen no response, sends the same CreateOrderRequest again.
5. The OrderService has no memory of the first request, so it creates a second, duplicate order.

This is the classic double-spend or duplicate-operation problem. The root cause is that the CreateOrder operation is not idempotent. An operation is idempotent if making the same call multiple times produces the same result as making it once. GET requests are typically idempotent; POST or CREATE requests are not.
To solve this, we must make our mutation operations effectively idempotent. The most robust pattern for this is the Idempotency-Key. The client generates a unique key (e.g., a UUID) for each distinct operation. It sends this key with the request. The server then tracks these keys and ensures that the logic for a given key is executed only once.
While this server-side logic is crucial, we can elevate the architecture by offloading the decision to retry from the client to the service mesh. This is where Linkerd's ServiceProfile Custom Resource Definition (CRD) becomes an incredibly powerful tool. It allows us to declaratively configure per-route behavior, including retries, without a single line of code change in our clients. This post details the end-to-end implementation of this pattern.
Step 1: Implementing Server-Side Idempotency
Before we can tell Linkerd to retry anything safely, our gRPC service must be able to handle repeated requests for the same operation without causing side effects. We'll build an OrderService in Go that uses Redis to track idempotency keys.
Defining the gRPC Service
First, we modify our .proto definition to include the idempotency_key field. Placing it directly in the message is explicit and avoids potential ambiguity with gRPC metadata, which can be stripped by intermediaries.
proto/orders/v1/orders.proto
syntax = "proto3";
package orders.v1;
option go_package = "github.com/your-org/your-repo/gen/orders/v1;ordersv1";
service OrderService {
rpc CreateOrder(CreateOrderRequest) returns (CreateOrderResponse);
}
message Order {
string id = 1;
string user_id = 2;
repeated string item_ids = 3;
int64 total_price_cents = 4;
}
message CreateOrderRequest {
string user_id = 1;
repeated string item_ids = 2;
// A UUIDv4 generated by the client for this specific operation.
string idempotency_key = 3;
}
message CreateOrderResponse {
Order order = 1;
}
The Idempotency Store Logic
Our server will use Redis to store the status and result of operations associated with an idempotency key. The state machine for a key is simple:
1. Unknown: the key has never been seen. The server claims it and executes the business logic.
2. IN_PROGRESS: another request with the same key is currently executing. Concurrent requests fail fast with an ABORTED status to prevent a thundering herd problem.
3. COMPLETED: the operation already finished. The server returns the cached result without re-executing anything.

We'll use Redis's SETNX (SET if Not eXists) command to atomically claim an idempotency key and handle race conditions.
internal/idempotency/store.go
package idempotency
import (
"context"
"encoding/json"
"errors"
"time"
"github.com/redis/go-redis/v9"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
)
// Status represents the state of an idempotent operation.
type Status string
const (
InProgress Status = "IN_PROGRESS"
Completed Status = "COMPLETED"
)
// Result stores the outcome of a completed operation.
type Result struct {
Status Status `json:"status"`
Body json.RawMessage `json:"body"`
}
// Store provides an interface for an idempotency key store.
type Store struct {
client *redis.Client
// How long to hold the lock while an operation is in progress.
// This should be longer than the expected operation duration.
inProgressTTL time.Duration
// How long to cache the final result of a completed operation.
completedTTL time.Duration
}
func NewStore(client *redis.Client) *Store {
return &Store{
client: client,
inProgressTTL: 5 * time.Second,
completedTTL: 24 * time.Hour,
}
}
// CheckAndSet checks the status of an idempotency key. If the key is new,
// it sets it to IN_PROGRESS. If it's already completed, it returns the cached result.
// It returns an error if the operation is already in progress.
func (s *Store) CheckAndSet(ctx context.Context, key string) (*Result, error) {
val, err := s.client.Get(ctx, key).Result()
if err == nil {
// Key exists, decode it
var res Result
if err := json.Unmarshal([]byte(val), &res); err != nil {
return nil, status.Error(codes.Internal, "failed to decode idempotency result")
}
if res.Status == InProgress {
// Another request is already processing this. Fail fast.
return nil, status.Error(codes.Aborted, "request with this idempotency key is already in progress")
}
if res.Status == Completed {
// Operation already completed, return cached result.
return &res, nil
}
}
if err != nil && !errors.Is(err, redis.Nil) {
// A real error occurred with Redis.
return nil, status.Error(codes.Internal, "idempotency store check failed")
}
// Key does not exist. Try to claim it.
progressMarker := Result{Status: InProgress}
progressJSON, _ := json.Marshal(progressMarker)
// Use SETNX to atomically claim the key.
claimed, err := s.client.SetNX(ctx, key, progressJSON, s.inProgressTTL).Result()
if err != nil {
return nil, status.Error(codes.Internal, "failed to claim idempotency key")
}
if !claimed {
// Race condition: another process claimed it between our GET and SETNX.
// Treat as if it's already in progress.
return nil, status.Error(codes.Aborted, "request with this idempotency key is already in progress")
}
// We successfully claimed the key. Return nil to signal the caller to proceed.
return nil, nil
}
// SetCompleted marks an operation as completed and stores its result.
func (s *Store) SetCompleted(ctx context.Context, key string, body interface{}) error {
bodyJSON, err := json.Marshal(body)
if err != nil {
return status.Error(codes.Internal, "failed to marshal response body")
}
result := Result{
Status: Completed,
Body: json.RawMessage(bodyJSON),
}
resultJSON, _ := json.Marshal(result)
if err := s.client.Set(ctx, key, resultJSON, s.completedTTL).Err(); err != nil {
// Log this critical failure. The operation succeeded but we failed to cache it.
// Subsequent retries will re-execute the logic.
// This is a trade-off: better to re-execute than to lose the result of a successful call.
return status.Error(codes.Internal, "failed to save idempotency result")
}
return nil
}
The gRPC Server Implementation
Now, we integrate this idempotency store into our CreateOrder handler.
internal/server/server.go
package server
import (
"context"
"encoding/json"
"log"
"github.com/google/uuid"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
ordersv1 "github.com/your-org/your-repo/gen/orders/v1"
"github.com/your-org/your-repo/internal/idempotency"
)
type OrderServiceServer struct {
ordersv1.UnimplementedOrderServiceServer
idemStore *idempotency.Store
// In a real app, this would be a database connection pool.
db *mockDB
}
func NewOrderServiceServer(idemStore *idempotency.Store) *OrderServiceServer {
return &OrderServiceServer{
idemStore: idemStore,
db: &mockDB{},
}
}
func (s *OrderServiceServer) CreateOrder(ctx context.Context, req *ordersv1.CreateOrderRequest) (*ordersv1.CreateOrderResponse, error) {
if req.GetIdempotencyKey() == "" {
return nil, status.Error(codes.InvalidArgument, "idempotency_key is required")
}
// 1. Check the idempotency store
cachedResult, err := s.idemStore.CheckAndSet(ctx, req.GetIdempotencyKey())
if err != nil {
return nil, err // Propagate Aborted or Internal errors
}
if cachedResult != nil {
// Operation was already completed. Return the cached response.
log.Printf("Idempotency hit for key: %s", req.GetIdempotencyKey())
var resp ordersv1.CreateOrderResponse
if err := json.Unmarshal(cachedResult.Body, &resp); err != nil {
return nil, status.Error(codes.Internal, "failed to unmarshal cached response")
}
return &resp, nil
}
// 2. If we are here, we have the lock. Execute the business logic.
log.Printf("Idempotency miss for key: %s. Processing...", req.GetIdempotencyKey())
newOrder, err := s.processNewOrder(ctx, req)
if err != nil {
// Note: We don't clear the idempotency key on failure.
// If the client retries with the same key, they will get this same error.
// This prevents a client from retrying a failing call indefinitely.
// To allow retries on business logic failures, you would need to clear the key here.
return nil, status.Error(codes.Internal, "failed to process order")
}
resp := &ordersv1.CreateOrderResponse{Order: newOrder}
// 3. Store the successful result before returning.
if err := s.idemStore.SetCompleted(ctx, req.GetIdempotencyKey(), resp); err != nil {
// This is a critical failure. The DB commit succeeded but the idempotency
// cache failed. The system is in an inconsistent state for this key.
// A robust system might have a background job to reconcile these.
log.Printf("CRITICAL: Failed to set idempotency key %s to completed: %v", req.GetIdempotencyKey(), err)
return nil, err
}
return resp, nil
}
// processNewOrder simulates the actual work: interacting with a database.
func (s *OrderServiceServer) processNewOrder(ctx context.Context, req *ordersv1.CreateOrderRequest) (*ordersv1.Order, error) {
// Simulate database write
log.Printf("Writing order to database for user %s", req.GetUserId())
time.Sleep(100 * time.Millisecond)
order := &ordersv1.Order{
Id: uuid.New().String(),
UserId: req.GetUserId(),
ItemIds: req.GetItemIds(),
TotalPriceCents: 19999, // Calculated in a real service
}
return order, nil
}
type mockDB struct{}
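For completeness, here is a minimal sketch of wiring the Redis-backed store into a running gRPC server. The file path, listen address, and Redis address are illustrative assumptions; the RegisterOrderServiceServer call follows the standard protoc-gen-go-grpc naming for the service defined above.
cmd/order-service/main.go
package main

import (
    "log"
    "net"

    "github.com/redis/go-redis/v9"
    "google.golang.org/grpc"

    ordersv1 "github.com/your-org/your-repo/gen/orders/v1"
    "github.com/your-org/your-repo/internal/idempotency"
    "github.com/your-org/your-repo/internal/server"
)

func main() {
    // The Redis address would normally come from configuration.
    rdb := redis.NewClient(&redis.Options{Addr: "redis:6379"})

    lis, err := net.Listen("tcp", ":8080")
    if err != nil {
        log.Fatalf("failed to listen: %v", err)
    }

    grpcServer := grpc.NewServer()
    ordersv1.RegisterOrderServiceServer(grpcServer, server.NewOrderServiceServer(idempotency.NewStore(rdb)))

    log.Println("order-service listening on :8080")
    if err := grpcServer.Serve(lis); err != nil {
        log.Fatalf("failed to serve: %v", err)
    }
}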
With this server-side implementation, our CreateOrder endpoint is now idempotent. A client can send the same request with the same idempotency_key a dozen times, and only one order will ever be created. Subsequent calls will receive the cached response of the first successful call.
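For reference, here is a client-side sketch of that contract. The helper below is hypothetical; it assumes the generated ordersv1 package, github.com/google/uuid, and an already-dialed OrderServiceClient.
// Hypothetical client helper: one idempotency key per logical operation.
// Assumes "context", "github.com/google/uuid", and the ordersv1 package are imported.
func createOrder(ctx context.Context, client ordersv1.OrderServiceClient, userID string, itemIDs []string) (*ordersv1.Order, error) {
    req := &ordersv1.CreateOrderRequest{
        UserId:  userID,
        ItemIds: itemIDs,
        // Generated once per logical operation; any retry of this operation must reuse it.
        IdempotencyKey: uuid.New().String(),
    }
    resp, err := client.CreateOrder(ctx, req)
    if err != nil {
        return nil, err
    }
    return resp.GetOrder(), nil
}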
Step 2: Configuring Linkerd for Smart Retries
Now that the server is safe, we can configure the service mesh. Linkerd uses a ServiceProfile CRD to define per-route behavior for a given Kubernetes service. This is where we tell Linkerd when it's safe to retry a request.
Creating the ServiceProfile Resource
We will define a route for our CreateOrder RPC. The key fields are:
* isRetryable: A boolean flag that marks the route as safe to retry. With it set, Linkerd retries responses it classifies as transient failures, such as gRPC unavailable errors, which typically signify network issues. It will not retry application-level errors such as aborted (our idempotency lock) or invalid_argument.
* retryBudget: A spec-level limit on how much extra load retries may add on top of the original request rate, which keeps retries from amplifying an outage into a retry storm.
* timeout: A per-route timeout, which is more specific and often more useful than a global client timeout.
Here is the complete ServiceProfile manifest.
k8s/serviceprofile.yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: order-service.default.svc.cluster.local
  namespace: default
spec:
  routes:
  - name: /orders.v1.OrderService/CreateOrder
    condition:
      method: POST
      pathRegex: /orders\.v1\.OrderService/CreateOrder
    isRetryable: true
    timeout: 250ms
  # It's good practice to define a default route for other RPCs
  - name: default_route
    condition:
      method: POST
      pathRegex: /.*
    # Do not retry other routes by default
    isRetryable: false
    timeout: 500ms
  # The retry budget applies to the ServiceProfile as a whole, not to a single route.
  retryBudget:
    # The ratio of retries to requests. 0.2 means for every 5 requests,
    # we can add 1 retry request.
    retryRatio: 0.2
    # The minimum number of retries that can be sent per second, even if
    # it violates the retryRatio. This keeps retries available when traffic is low.
    minRetriesPerSecond: 10
    # How long the retry budget is calculated over.
    ttl: 10s
A critical note on isRetryable: By default, Linkerd retries on POST requests if the gRPC status code is unavailable, resource_exhausted, or if the response is a 503 from the underlying transport. By setting isRetryable: true, we are opting into this default safe retry policy. For more granular control, you can define responseClasses to control exactly which responses Linkerd classifies as failures, and therefore which ones it will retry on a retryable route.
For our use case, the default behavior is perfect. A lost response from our server will manifest as an unavailable status to the client-side Linkerd proxy, which will then trigger a retry. Because our server is idempotent, this retry is completely safe.
Architectural View
The full, resilient flow now looks like this:
1. The client sends a CreateOrderRequest with a unique idempotency_key.
2. The client-side Linkerd proxy routes the request to the order-service Pod.
3. The proxy in that Pod forwards the request to the order-service container, which processes it.
4. If the response is lost or the request fails transiently, the client-side proxy retries it according to the isRetryable rule in the ServiceProfile.
5. The retried request reaches the server again, which finds the COMPLETED key in Redis and returns the cached response instead of creating a duplicate order.

The client application is completely unaware that a retry ever happened. The retry logic is fully encapsulated within the service mesh, and the server's idempotency guarantee ensures safety.
Advanced Considerations and Edge Cases
This pattern is robust, but in a high-throughput production environment, several edge cases must be considered.
Performance of the Idempotency Store
Redis is an excellent choice for its low latency. However, its default persistence models (RDB/AOF) can have data loss windows. For financial transactions, using a durable database like PostgreSQL for the idempotency store might be required, at the cost of higher latency. If using PostgreSQL, be sure to place a unique index on the idempotency_key column to enforce uniqueness at the database level.
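As a rough sketch of the PostgreSQL variant (the idempotency_keys table and its columns are hypothetical, and this assumes database/sql with a Postgres driver), the claim step becomes a single INSERT guarded by the unique index:
// Hypothetical PostgreSQL-backed claim: the unique index on idempotency_key
// makes the claim atomic; ON CONFLICT DO NOTHING turns a duplicate into a no-op.
func claimKey(ctx context.Context, db *sql.DB, key string) (bool, error) {
    res, err := db.ExecContext(ctx,
        `INSERT INTO idempotency_keys (idempotency_key, status)
         VALUES ($1, 'IN_PROGRESS')
         ON CONFLICT (idempotency_key) DO NOTHING`, key)
    if err != nil {
        return false, err
    }
    n, err := res.RowsAffected()
    if err != nil {
        return false, err
    }
    // One row affected means we claimed the key; zero means another request already holds it.
    return n == 1, nil
}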
Benchmark your store's latency. The P99 latency of your CheckAndSet call will be added to every single request to an idempotent endpoint. It must be extremely fast.
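One simple way to measure this is a Go benchmark against the store. The sketch below assumes a local Redis at localhost:6379 and exercises only the miss-and-claim path.
internal/idempotency/store_bench_test.go
package idempotency

import (
    "context"
    "fmt"
    "testing"

    "github.com/redis/go-redis/v9"
)

func BenchmarkCheckAndSet(b *testing.B) {
    client := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
    store := NewStore(client)
    ctx := context.Background()

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        // A fresh key per iteration measures the GET-miss + SETNX claim path.
        key := fmt.Sprintf("bench-key-%d", i)
        if _, err := store.CheckAndSet(ctx, key); err != nil {
            b.Fatal(err)
        }
    }
}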
Idempotency Key TTL Management
Our completedTTL is set to 24 hours. This is a business decision. How long does a client's retry window last? If a mobile client can be offline for 3 days and then retry an operation upon reconnecting, your TTL needs to be longer. This has a direct impact on the storage size of your idempotency store. A robust garbage collection strategy for old keys is essential.
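If the retry window differs between environments, the TTLs can be made configurable instead of hard-coded. A minimal sketch, with a hypothetical constructor that would sit next to NewStore in the idempotency package:
// Hypothetical constructor variant with explicit TTLs, so the retry window
// can be tuned per environment instead of being hard-coded.
func NewStoreWithTTLs(client *redis.Client, inProgressTTL, completedTTL time.Duration) *Store {
    return &Store{
        client:        client,
        inProgressTTL: inProgressTTL,
        completedTTL:  completedTTL,
    }
}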
Propagating Idempotency Keys Across Services
Imagine the OrderService needs to call a PaymentService to complete an order. If the entire workflow needs to be idempotent, the original idempotency_key must be propagated. The best way to do this is via gRPC metadata. The OrderService would extract the key from the request body and insert it into the outgoing metadata of its call to the PaymentService.
// In OrderService, when calling PaymentService
md := metadata.Pairs("x-idempotency-key", req.GetIdempotencyKey())
ctx = metadata.NewOutgoingContext(ctx, md)
paymentResp, err := s.paymentClient.ProcessPayment(ctx, paymentReq)
The PaymentService would then have middleware to inspect incoming metadata for this key and apply its own idempotency logic.
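A sketch of the core of that middleware (the helper name is illustrative; it assumes the x-idempotency-key header used above and the standard grpc metadata package):
// Assumes "context" and "google.golang.org/grpc/metadata" are imported.
// Returns the propagated key from incoming gRPC metadata, if present.
func idempotencyKeyFromContext(ctx context.Context) (string, bool) {
    md, ok := metadata.FromIncomingContext(ctx)
    if !ok {
        return "", false
    }
    if vals := md.Get("x-idempotency-key"); len(vals) > 0 && vals[0] != "" {
        return vals[0], true
    }
    return "", false
}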
Monitoring and Observability
How do you know this system is working? Two key metrics are essential:
* Linkerd route metrics: Run linkerd viz routes deploy/order-service --to svc/order-service to see the effective success rate and retry rate for each route defined in your ServiceProfile. A high retry rate can indicate underlying network instability.
* Application-level idempotency hits: Instrument server.go with Prometheus metrics:
import "github.com/prometheus/client_golang/prometheus"
var idempotencyHits = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "idempotency_cache_hits_total",
Help: "Total number of idempotent responses served from cache.",
},
[]string{"method"},
)
func init() {
prometheus.MustRegister(idempotencyHits)
}
// In CreateOrder, when a cached result is found:
if cachedResult != nil {
idempotencyHits.WithLabelValues("/orders.v1.OrderService/CreateOrder").Inc()
// ... return cached response
}
By observing both metrics, you get a complete picture of network health (from Linkerd) and application-level resilience (from your custom metric).
Conclusion
By combining application-aware, stateful idempotency logic on the server with declarative, client-agnostic retry configuration in the service mesh, we achieve a highly resilient and decoupled architecture. The application is responsible for guaranteeing the safety of retries, while the infrastructure is responsible for executing them. This separation of concerns is a hallmark of a mature microservices implementation. This pattern moves the complex and error-prone task of retry logic out of the hands of every client developer and into a single, observable, and configurable layer of the platform, enabling your services to weather the storm of transient failures gracefully and correctly.