Idempotent gRPC Retries with Linkerd Service Profiles

Goh Ling Yong

The Peril of Naive Retries in Distributed Systems

In any non-trivial microservices architecture, transient network failures are not an edge case; they are an inevitability. A common response is to implement client-side retry logic. However, for state-changing, non-idempotent operations, this is a dangerous path. Consider a canonical CreateOrder gRPC endpoint. A standard interaction flow that leads to disaster looks like this:

  • Client A sends a CreateOrderRequest to the OrderService.
  • OrderService receives the request, validates it, and successfully persists a new order record in its database.
  • OrderService attempts to send a CreateOrderResponse back to Client A, but a transient network partition occurs. The response packet is dropped and never reaches the client.
  • Client A's request times out. Its retry logic kicks in, and it sends the exact same CreateOrderRequest again.
  • OrderService receives the second request. From its perspective, this is a new, valid request. It proceeds to create a second order in the database, potentially charging the customer twice.

This is the classic double-spend or duplicate-operation problem. The root cause is that the CreateOrder operation is not idempotent. An operation is idempotent if making the same call multiple times produces the same result as making it once. GET requests are typically idempotent; POST or CREATE requests are not.

    To solve this, we must make our mutation operations effectively idempotent. The most robust pattern for this is the Idempotency-Key. The client generates a unique key (e.g., a UUID) for each distinct operation. It sends this key with the request. The server then tracks these keys and ensures that the logic for a given key is executed only once.
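To make the contract concrete, here is a client-side sketch. The helper name, connection setup, and field values are illustrative; it assumes the generated ordersv1 client and the github.com/google/uuid package. The important point is that the key is minted once per logical operation and reused for any retry of that same operation.

go
package client

import (
	"context"

	"github.com/google/uuid"

	ordersv1 "github.com/your-org/your-repo/gen/orders/v1"
)

// createOrderOnce is a hypothetical helper: it generates one idempotency key
// per logical operation and attaches it to the request, so any retry of this
// call carries the same key.
func createOrderOnce(ctx context.Context, c ordersv1.OrderServiceClient, userID string, itemIDs []string) (*ordersv1.CreateOrderResponse, error) {
	key := uuid.New().String() // generated once, before the first attempt

	req := &ordersv1.CreateOrderRequest{
		UserId:         userID,
		ItemIds:        itemIDs,
		IdempotencyKey: key,
	}

	// Whether this call succeeds on the first attempt or is retried (by the
	// client or by the mesh), the server sees the same key and executes the
	// business logic at most once.
	return c.CreateOrder(ctx, req)
}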

    While this server-side logic is crucial, we can elevate the architecture by offloading the decision to retry from the client to the service mesh. This is where Linkerd's ServiceProfile Custom Resource Definition (CRD) becomes an incredibly powerful tool. It allows us to declaratively configure per-route behavior, including retries, without a single line of code change in our clients. This post details the end-to-end implementation of this pattern.


    Step 1: Implementing Server-Side Idempotency

    Before we can tell Linkerd to retry anything safely, our gRPC service must be able to handle repeated requests for the same operation without causing side effects. We'll build an OrderService in Go that uses Redis to track idempotency keys.

    Defining the gRPC Service

    First, we modify our .proto definition to include the idempotency_key field. Placing it directly in the message is explicit and avoids potential ambiguity with gRPC metadata, which can be stripped by intermediaries.

    proto/orders/v1/orders.proto

    protobuf
    syntax = "proto3";
    
    package orders.v1;
    
    option go_package = "github.com/your-org/your-repo/gen/orders/v1;ordersv1";
    
    service OrderService {
      rpc CreateOrder(CreateOrderRequest) returns (CreateOrderResponse);
    }
    
    message Order {
      string id = 1;
      string user_id = 2;
      repeated string item_ids = 3;
      int64 total_price_cents = 4;
    }
    
    message CreateOrderRequest {
      string user_id = 1;
      repeated string item_ids = 2;
      // A UUIDv4 generated by the client for this specific operation.
      string idempotency_key = 3;
    }
    
    message CreateOrderResponse {
      Order order = 1;
    }

    The Idempotency Store Logic

    Our server will use Redis to store the status and result of operations associated with an idempotency key. The state machine for a key is simple:

  • Not Found: The operation has not been seen. Proceed.
  • IN_PROGRESS: The operation is currently being processed by another request. The current request should fail fast with an ABORTED status to prevent a thundering herd problem.
  • COMPLETED: The operation has already succeeded. The server should return the cached response without re-executing the business logic.

We'll use Redis's SETNX (SET if Not eXists) command to atomically claim an idempotency key and handle race conditions.

    internal/idempotency/store.go

    go
    package idempotency
    
    import (
    	"context"
    	"encoding/json"
    	"errors"
    	"time"
    
    	"github.com/redis/go-redis/v9"
    	"google.golang.org/grpc/codes"
    	"google.golang.org/grpc/status"
    )
    
    // Status represents the state of an idempotent operation.
type Status string
    
    const (
    	InProgress Status = "IN_PROGRESS"
    	Completed  Status = "COMPLETED"
    )
    
    // Result stores the outcome of a completed operation.
type Result struct {
    	Status Status          `json:"status"`
    	Body   json.RawMessage `json:"body"`
    }
    
    // Store provides an interface for an idempotency key store.
type Store struct {
    	client *redis.Client
    	// How long to hold the lock while an operation is in progress.
    	// This should be longer than the expected operation duration.
    	inProgressTTL time.Duration
    	// How long to cache the final result of a completed operation.
    	completedTTL time.Duration
    }
    
    func NewStore(client *redis.Client) *Store {
    	return &Store{
    		client:        client,
    		inProgressTTL: 5 * time.Second,
    		completedTTL:  24 * time.Hour,
    	}
    }
    
    // CheckAndSet checks the status of an idempotency key. If the key is new,
    // it sets it to IN_PROGRESS. If it's already completed, it returns the cached result.
    // It returns an error if the operation is already in progress.
    func (s *Store) CheckAndSet(ctx context.Context, key string) (*Result, error) {
    	val, err := s.client.Get(ctx, key).Result()
    	if err == nil {
    		// Key exists, decode it
    		var res Result
    		if err := json.Unmarshal([]byte(val), &res); err != nil {
    			return nil, status.Error(codes.Internal, "failed to decode idempotency result")
    		}
    
    		if res.Status == InProgress {
    			// Another request is already processing this. Fail fast.
    			return nil, status.Error(codes.Aborted, "request with this idempotency key is already in progress")
    		}
    
    		if res.Status == Completed {
    			// Operation already completed, return cached result.
    			return &res, nil
    		}
    	}
    
	if err != nil && !errors.Is(err, redis.Nil) {
		// A real error occurred with Redis.
		return nil, status.Error(codes.Internal, "idempotency store check failed")
	}
    
    	// Key does not exist. Try to claim it.
    	progressMarker := Result{Status: InProgress}
    	progressJSON, _ := json.Marshal(progressMarker)
    
    	// Use SETNX to atomically claim the key.
    	claimed, err := s.client.SetNX(ctx, key, progressJSON, s.inProgressTTL).Result()
    	if err != nil {
    		return nil, status.Error(codes.Internal, "failed to claim idempotency key")
    	}
    
    	if !claimed {
    		// Race condition: another process claimed it between our GET and SETNX.
    		// Treat as if it's already in progress.
    		return nil, status.Error(codes.Aborted, "request with this idempotency key is already in progress")
    	}
    
    	// We successfully claimed the key. Return nil to signal the caller to proceed.
    	return nil, nil
    }
    
    // SetCompleted marks an operation as completed and stores its result.
    func (s *Store) SetCompleted(ctx context.Context, key string, body interface{}) error {
    	bodyJSON, err := json.Marshal(body)
    	if err != nil {
    		return status.Error(codes.Internal, "failed to marshal response body")
    	}
    
    	result := Result{
    		Status: Completed,
    		Body:   json.RawMessage(bodyJSON),
    	}
    
    	resultJSON, _ := json.Marshal(result)
    
    	if err := s.client.Set(ctx, key, resultJSON, s.completedTTL).Err(); err != nil {
    		// Log this critical failure. The operation succeeded but we failed to cache it.
    		// Subsequent retries will re-execute the logic.
    		// This is a trade-off: better to re-execute than to lose the result of a successful call.
    		return status.Error(codes.Internal, "failed to save idempotency result")
    	}
    
    	return nil
    }

    The gRPC Server Implementation

    Now, we integrate this idempotency store into our CreateOrder handler.

    internal/server/server.go

    go
    package server
    
    import (
    	"context"
    	"encoding/json"
    	"log"
    
    	"github.com/google/uuid"
    	"google.golang.org/grpc/codes"
    	"google.golang.org/grpc/status"
    
    	ordersv1 "github.com/your-org/your-repo/gen/orders/v1"
    	"github.com/your-org/your-repo/internal/idempotency"
    )
    
type OrderServiceServer struct {
    	ordersv1.UnimplementedOrderServiceServer
    	idemStore *idempotency.Store
    	// In a real app, this would be a database connection pool.
    	db *mockDB
    }
    
    func NewOrderServiceServer(idemStore *idempotency.Store) *OrderServiceServer {
    	return &OrderServiceServer{
    		idemStore: idemStore,
    		db:        &mockDB{},
    	}
    }
    
    func (s *OrderServiceServer) CreateOrder(ctx context.Context, req *ordersv1.CreateOrderRequest) (*ordersv1.CreateOrderResponse, error) {
    	if req.GetIdempotencyKey() == "" {
    		return nil, status.Error(codes.InvalidArgument, "idempotency_key is required")
    	}
    
    	// 1. Check the idempotency store
    	cachedResult, err := s.idemStore.CheckAndSet(ctx, req.GetIdempotencyKey())
    	if err != nil {
    		return nil, err // Propagate Aborted or Internal errors
    	}
    
    	if cachedResult != nil {
    		// Operation was already completed. Return the cached response.
    		log.Printf("Idempotency hit for key: %s", req.GetIdempotencyKey())
    		var resp ordersv1.CreateOrderResponse
    		if err := json.Unmarshal(cachedResult.Body, &resp); err != nil {
    			return nil, status.Error(codes.Internal, "failed to unmarshal cached response")
    		}
    		return &resp, nil
    	}
    
    	// 2. If we are here, we have the lock. Execute the business logic.
    	log.Printf("Idempotency miss for key: %s. Processing...", req.GetIdempotencyKey())
    	newOrder, err := s.processNewOrder(ctx, req)
    	if err != nil {
		// Note: we don't mark the key COMPLETED on failure. Until the IN_PROGRESS
		// marker's TTL expires, a retry with the same key is rejected with ABORTED;
		// once it expires, the business logic can be re-executed. To make a business
		// failure permanent instead, you could cache the error result under the key here.
    		return nil, status.Error(codes.Internal, "failed to process order")
    	}
    
    	resp := &ordersv1.CreateOrderResponse{Order: newOrder}
    
    	// 3. Store the successful result before returning.
    	if err := s.idemStore.SetCompleted(ctx, req.GetIdempotencyKey(), resp); err != nil {
    		// This is a critical failure. The DB commit succeeded but the idempotency 
    		// cache failed. The system is in an inconsistent state for this key.
    		// A robust system might have a background job to reconcile these.
    		log.Printf("CRITICAL: Failed to set idempotency key %s to completed: %v", req.GetIdempotencyKey(), err)
    		return nil, err
    	}
    
    	return resp, nil
    }
    
    // processNewOrder simulates the actual work: interacting with a database.
    func (s *OrderServiceServer) processNewOrder(ctx context.Context, req *ordersv1.CreateOrderRequest) (*ordersv1.Order, error) {
    	// Simulate database write
    	log.Printf("Writing order to database for user %s", req.GetUserId())
    	time.Sleep(100 * time.Millisecond)
    
    	order := &ordersv1.Order{
    		Id:              uuid.New().String(),
    		UserId:          req.GetUserId(),
    		ItemIds:         req.GetItemIds(),
    		TotalPriceCents: 19999, // Calculated in a real service
    	}
    
    	return order, nil
    }
    
type mockDB struct{}

    With this server-side implementation, our CreateOrder endpoint is now idempotent. A client can send the same request with the same idempotency_key a dozen times, and only one order will ever be created. Subsequent calls will receive the cached response of the first successful call.
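Before moving to the mesh configuration, here is a minimal wiring sketch under the assumptions used in this post: a Redis instance reachable at redis:6379, the placeholder module path from above, and a cmd/order-service/main.go entry point (all of these names are illustrative).

cmd/order-service/main.go

go
package main

import (
	"log"
	"net"

	"github.com/redis/go-redis/v9"
	"google.golang.org/grpc"

	ordersv1 "github.com/your-org/your-repo/gen/orders/v1"
	"github.com/your-org/your-repo/internal/idempotency"
	"github.com/your-org/your-repo/internal/server"
)

func main() {
	// Placeholder addresses; in a real deployment these come from configuration.
	rdb := redis.NewClient(&redis.Options{Addr: "redis:6379"})
	idemStore := idempotency.NewStore(rdb)

	lis, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}

	grpcServer := grpc.NewServer()
	ordersv1.RegisterOrderServiceServer(grpcServer, server.NewOrderServiceServer(idemStore))

	log.Println("order-service listening on :8080")
	if err := grpcServer.Serve(lis); err != nil {
		log.Fatalf("gRPC server exited: %v", err)
	}
}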


    Step 2: Configuring Linkerd for Smart Retries

    Now that the server is safe, we can configure the service mesh. Linkerd uses a ServiceProfile CRD to define per-route behavior for a given Kubernetes service. This is where we tell Linkerd when it's safe to retry a request.

Creating the ServiceProfile Resource

    We will define a route for our CreateOrder RPC. The key fields are:

  • isRetryable: A boolean flag that marks the route as safe to retry. We rely on Linkerd's default classification here, which retries transient failures such as gRPC unavailable but not application errors like aborted (our idempotency lock) or invalid_argument.

  • timeout: A per-route timeout, which is more specific and often more useful than a global client timeout.

    Here is the complete ServiceProfile manifest.

    k8s/serviceprofile.yaml

    yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: order-service.default.svc.cluster.local
  namespace: default
spec:
  # The retry budget is defined at the spec level and applies to the
  # ServiceProfile as a whole, not to an individual route.
  retryBudget:
    # The ratio of retries to original requests. 0.2 means for every 5 requests,
    # we can add 1 retry request.
    retryRatio: 0.2
    # The minimum number of retries that can be sent per second, even if
    # it violates the retryRatio. This helps during cold starts.
    minRetriesPerSecond: 10
    # The window over which the retry budget is calculated.
    ttl: 10s
  routes:
    - name: /orders.v1.OrderService/CreateOrder
      condition:
        method: POST
        pathRegex: /orders\.v1\.OrderService/CreateOrder
      isRetryable: true
      timeout: 250ms
    # It's good practice to define a default route for other RPCs
    - name: default_route
      condition:
        method: POST
        pathRegex: /.*
      # Do not retry other routes by default
      isRetryable: false
      timeout: 500ms

    A critical note on isRetryable: By default, Linkerd retries on POST requests if the gRPC status code is unavailable, resource_exhausted, or if the response is a 503 from the underlying transport. By setting isRetryable: true, we are opting into this default safe retry policy. For more granular control, you could define a responseClasses block to specify exactly which status codes to retry on.

For our use case, the default behavior is perfect. A lost response from our server will manifest as an unavailable status to the client-side Linkerd proxy, which will then trigger a retry. Because our server is idempotent, this retry is completely safe.

    Architectural View

    The full, resilient flow now looks like this:

  • Client generates a CreateOrderRequest with a unique idempotency_key.
  • The request is intercepted by the client's Linkerd sidecar proxy.
  • The sidecar forwards the request to the order-service Pod.
  • The request is intercepted by the server's Linkerd sidecar proxy and forwarded to the order-service container.
  • OrderService executes the idempotency logic described in Step 1. It creates the order and saves the result to Redis.
  • OrderService sends the response back.
  • FAILURE SCENARIO: The response packet is dropped on the network between the server and client sidecars.
  • The client's Linkerd sidecar does not receive a response within its timeout window. It sees that the request matches the isRetryable rule in the ServiceProfile.
  • The client's Linkerd sidecar automatically sends the exact same request again.
  • The OrderService receives the retried request. Its idempotency logic finds the COMPLETED key in Redis.
  • The OrderService immediately returns the cached response without touching the database.
  • The client's Linkerd sidecar receives the response and forwards it to the client application.

The client application is completely unaware that a retry ever happened. The retry logic is fully encapsulated within the service mesh, and the server's idempotency guarantee ensures safety.


    Advanced Considerations and Edge Cases

    This pattern is robust, but in a high-throughput production environment, several edge cases must be considered.

    Performance of the Idempotency Store

    Redis is an excellent choice for its low latency. However, its default persistence models (RDB/AOF) can have data loss windows. For financial transactions, using a durable database like PostgreSQL for the idempotency store might be required, at the cost of higher latency. If using PostgreSQL, be sure to place a unique index on the idempotency_key column to enforce uniqueness at the database level.
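As a rough sketch of that alternative (the table name, schema, and helper below are hypothetical), the claim step can lean on the database's uniqueness guarantee via a conditional insert:

go
package idempotency

import (
	"context"
	"database/sql"
)

// PGStore is an illustrative PostgreSQL-backed variant of the claim step.
// It assumes a table like:
//
//   CREATE TABLE idempotency_keys (
//     key        TEXT PRIMARY KEY,   -- uniqueness enforced by the database
//     status     TEXT NOT NULL,
//     body       JSONB,
//     created_at TIMESTAMPTZ NOT NULL DEFAULT now()
//   );
//
// A concrete driver (e.g. pgx's database/sql adapter) must be registered by the caller.
type PGStore struct {
	db *sql.DB
}

// Claim attempts to insert the key. If another request already inserted it,
// ON CONFLICT DO NOTHING makes this a no-op and zero rows are affected,
// signalling the caller that the operation is already in progress or completed.
func (s *PGStore) Claim(ctx context.Context, key string) (bool, error) {
	res, err := s.db.ExecContext(ctx,
		`INSERT INTO idempotency_keys (key, status) VALUES ($1, 'IN_PROGRESS')
		 ON CONFLICT (key) DO NOTHING`, key)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return false, err
	}
	return n == 1, nil
}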

    Benchmark your store's latency. The P99 latency of your CheckAndSet call will be added to every single request to an idempotent endpoint. It must be extremely fast.
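A rough local benchmark of the Redis-backed store might look like the following. It assumes a Redis instance on localhost:6379 and measures round-trip latency from the test process, which is only a proxy for production P99.

go
package idempotency

import (
	"context"
	"fmt"
	"testing"

	"github.com/redis/go-redis/v9"
)

// BenchmarkCheckAndSet measures the full claim round-trip against a local
// Redis instance. Run with: go test -bench=CheckAndSet ./internal/idempotency
func BenchmarkCheckAndSet(b *testing.B) {
	store := NewStore(redis.NewClient(&redis.Options{Addr: "localhost:6379"}))
	ctx := context.Background()

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// Unique key per iteration so every call exercises the claim path.
		if _, err := store.CheckAndSet(ctx, fmt.Sprintf("bench-key-%d", i)); err != nil {
			b.Fatal(err)
		}
	}
}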

    Idempotency Key TTL Management

    Our completedTTL is set to 24 hours. This is a business decision. How long does a client's retry window last? If a mobile client can be offline for 3 days and then retry an operation upon reconnecting, your TTL needs to be longer. This has a direct impact on the storage size of your idempotency store. A robust garbage collection strategy for old keys is essential.

    Propagating Idempotency Keys Across Services

    Imagine the OrderService needs to call a PaymentService to complete an order. If the entire workflow needs to be idempotent, the original idempotency_key must be propagated. The best way to do this is via gRPC metadata. The OrderService would extract the key from the request body and insert it into the outgoing metadata of its call to the PaymentService.

    go
// In OrderService, when calling PaymentService.
// Requires the "google.golang.org/grpc/metadata" package.
md := metadata.Pairs("x-idempotency-key", req.GetIdempotencyKey())
ctx = metadata.NewOutgoingContext(ctx, md)

paymentResp, err := s.paymentClient.ProcessPayment(ctx, paymentReq)

    The PaymentService would then have middleware to inspect incoming metadata for this key and apply its own idempotency logic.
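A minimal sketch of that middleware on the PaymentService side, written as a gRPC unary server interceptor (the context key and the actual idempotency handling are illustrative):

go
package middleware

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

type idempotencyKeyContextKey struct{}

// IdempotencyKeyInterceptor extracts the propagated key from incoming gRPC
// metadata and makes it available to handlers via the context.
func IdempotencyKeyInterceptor(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (interface{}, error) {
	if md, ok := metadata.FromIncomingContext(ctx); ok {
		if keys := md.Get("x-idempotency-key"); len(keys) > 0 {
			// The handler (or this interceptor) would then run the same
			// CheckAndSet / SetCompleted flow shown in Step 1.
			ctx = context.WithValue(ctx, idempotencyKeyContextKey{}, keys[0])
		}
	}
	return handler(ctx, req)
}

It would be registered with grpc.NewServer(grpc.UnaryInterceptor(IdempotencyKeyInterceptor)), and the PaymentService's handlers would apply their own idempotency logic using the extracted key.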

    Monitoring and Observability

    How do you know this system is working? Two key metrics are essential:

  • Linkerd Retry Rate: Use linkerd viz routes deploy/order-service --to svc/order-service to see the effective success rate and retry rate for each route defined in your ServiceProfile. A high retry rate can indicate underlying network instability.
  • Idempotency Cache Hits: Instrument your server to export a Prometheus metric for idempotency cache hits. This tells you how often the safety mechanism is being used.

server.go with Prometheus metrics:

    go
    import "github.com/prometheus/client_golang/prometheus"
    
    var idempotencyHits = prometheus.NewCounterVec(
    	prometheus.CounterOpts{
    		Name: "idempotency_cache_hits_total",
    		Help: "Total number of idempotent responses served from cache.",
    	},
    	[]string{"method"},
    )
    
    func init() {
    	prometheus.MustRegister(idempotencyHits)
    }
    
    // In CreateOrder, when a cached result is found:
    if cachedResult != nil {
        idempotencyHits.WithLabelValues("/orders.v1.OrderService/CreateOrder").Inc()
        // ... return cached response
    }

    By observing both metrics, you get a complete picture of network health (from Linkerd) and application-level resilience (from your custom metric).

    Conclusion

    By combining application-aware, stateful idempotency logic on the server with declarative, client-agnostic retry configuration in the service mesh, we achieve a highly resilient and decoupled architecture. The application is responsible for guaranteeing the safety of retries, while the infrastructure is responsible for executing them. This separation of concerns is a hallmark of a mature microservices implementation. This pattern moves the complex and error-prone task of retry logic out of the hands of every client developer and into a single, observable, and configurable layer of the platform, enabling your services to weather the storm of transient failures gracefully and correctly.
