Asynchronous Saga Pattern with Temporal for Microservice Resilience

Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Inherent Challenge of Data Consistency in Microservices

In a monolithic architecture, ACID transactions, managed by the database, provide a safety net for complex business operations. A multi-step process either completes entirely or is rolled back, ensuring data integrity. When we decompose this monolith into distributed microservices, we gain scalability and team autonomy, but we lose the safety of a single database transaction that spans the whole operation. Distributed-transaction protocols like Two-Phase Commit (2PC) are notoriously brittle in distributed environments, creating availability bottlenecks and violating the principle of service independence.

This is where the Saga pattern emerges. A Saga is a sequence of local transactions where each transaction updates data within a single service and publishes an event or message to trigger the next transaction in the chain. If a local transaction fails, the Saga executes a series of compensating transactions to semantically undo the preceding transactions.

While the concept is powerful, implementing a robust Saga is fraught with complexity. Choreography-based Sagas, relying on event-driven communication, can become a debugging nightmare. Tracking the state of a distributed transaction across multiple services, handling message ordering, and ensuring exactly-once processing creates significant accidental complexity.

This is the problem space where a durable execution engine like Temporal excels. By shifting from choreography to orchestration, we can define the entire Saga as a single, durable, stateful function—a Temporal Workflow. This approach externalizes the state management, retry logic, and failure handling, allowing developers to write straightforward business logic while the Temporal platform provides the resilience.

This article will dissect the implementation of a complex, asynchronous Saga for an e-commerce order processing system using Temporal and Go. We will bypass introductory concepts and focus on production-level patterns, edge cases, and optimizations that senior engineers must contend with.


Scenario: A Multi-Step E-commerce Order Saga

Let's define our business process. When a customer places an order, the following sequence of operations must occur across different microservices:

  • Order Service: Creates an order record in a PENDING state.
  • Payment Service: Charges the customer's credit card.
  • Inventory Service: Reserves the items in the warehouse.
  • Notification Service: Sends a confirmation email to the customer.
Each step is a local transaction. A failure at any point must trigger a precise set of compensating actions. For instance, if inventory reservation fails after a successful payment, the payment must be refunded, and the order status must be updated to CANCELLED.

    Our goal is to model this entire flow as a single Temporal Workflow, where each step and its compensation is an Activity.

    The Core Architecture: Workflows and Activities

    A Temporal Workflow is a deterministic, resumable function. Its state is preserved by the Temporal Cluster across any process or server failure. The code you write looks sequential, but the execution is durable.

    Activities are the functions that execute the actual business logic: making API calls, database queries, etc. They are non-deterministic and can fail. Temporal manages their execution, providing configurable retries, timeouts, and heartbeating.

    Here's the basic structure in Go:

    go
    // main.go - Worker setup
    package main
    
    import (
        "log"
        "go.temporal.io/sdk/client"
        "go.temporal.io/sdk/worker"
        "your_project/app"
    )
    
    func main() {
        c, err := client.Dial(client.Options{})
        if err != nil {
            log.Fatalln("Unable to create client", err)
        }
        defer c.Close()
    
        w := worker.New(c, app.OrderTaskQueue, worker.Options{})
    
        w.RegisterWorkflow(app.OrderProcessingWorkflow)
        w.RegisterActivity(app.CreateOrderActivity)
        w.RegisterActivity(app.ProcessPaymentActivity)
        w.RegisterActivity(app.ReserveInventoryActivity)
        w.RegisterActivity(app.SendConfirmationEmailActivity)
        w.RegisterActivity(app.CancelOrderActivity)
        w.RegisterActivity(app.RefundPaymentActivity)
        w.RegisterActivity(app.ReleaseInventoryActivity)
    
        err = w.Run(worker.InterruptCh())
        if err != nil {
            log.Fatalln("Unable to start worker", err)
        }
    }

    This worker listens on a specific task queue (app.OrderTaskQueue) and knows how to execute our defined Workflow and Activities.

    Implementing the Saga with Compensation Logic

    The most critical aspect of the Saga pattern is handling failures and compensations. In Temporal, we can achieve this elegantly using Go's defer statement within the workflow. A deferred function call is pushed onto a stack and executed when the surrounding function returns. This creates a natural mechanism for building a compensation stack.

    Here is the complete OrderProcessingWorkflow implementation.

    go
    // app/workflow.go
    package app
    
    import (
        "time"

        "go.temporal.io/sdk/workflow"
    )
    
    const OrderTaskQueue = "ORDER_PROCESSING_QUEUE"
    
    type OrderDetails struct {
        OrderID    string
        CustomerID string
        ItemID     string
        Quantity   int
        Amount     float64
    }
    
    func OrderProcessingWorkflow(ctx workflow.Context, order OrderDetails) (string, error) {
        logger := workflow.GetLogger(ctx)
        logger.Info("Order processing workflow started", "OrderID", order.OrderID)
    
        // Set Activity options: timeouts are crucial for production systems
        ao := workflow.ActivityOptions{
            StartToCloseTimeout: 10 * time.Second,
        }
        ctx = workflow.WithActivityOptions(ctx, ao)
    
        // Each compensation receives the context it should run under, so the
        // deferred handler can pass in a disconnected one.
        var compensationStack []func(workflow.Context) error
        var executionErr error
    
        defer func() {
            if executionErr != nil {
                logger.Error("Workflow failed, executing compensations.", "Error", executionErr)
                // Execute compensations in a new, detached context to ensure they run
                // even if the main workflow context is cancelled.
                dCtx, _ := workflow.NewDisconnectedContext(ctx)
                // Walk the stack in reverse (LIFO): undo the most recent step first.
                for i := len(compensationStack) - 1; i >= 0; i-- {
                    if err := compensationStack[i](dCtx); err != nil {
                        // Log compensation failures but continue processing the stack
                        logger.Error("Compensation activity failed", "Error", err)
                    }
                }
            }
        }()
    
        // 1. Create Order
        var orderID string
        err := workflow.ExecuteActivity(ctx, CreateOrderActivity, order).Get(ctx, &orderID)
        if err != nil {
            executionErr = err
            return "", err
        }
        compensationStack = append(compensationStack, func(cctx workflow.Context) error {
            var result string
            return workflow.ExecuteActivity(cctx, CancelOrderActivity, orderID).Get(cctx, &result)
        })
    
        // 2. Process Payment
        var paymentID string
        err = workflow.ExecuteActivity(ctx, ProcessPaymentActivity, order).Get(ctx, &paymentID)
        if err != nil {
            executionErr = err
            return "", err
        }
        compensationStack = append(compensationStack, func(cctx workflow.Context) error {
            var result string
            return workflow.ExecuteActivity(cctx, RefundPaymentActivity, paymentID).Get(cctx, &result)
        })
    
        // 3. Reserve Inventory
        var inventoryReservationID string
        err = workflow.ExecuteActivity(ctx, ReserveInventoryActivity, order.ItemID, order.Quantity).Get(ctx, &inventoryReservationID)
        if err != nil {
            executionErr = err
            return "", err
        }
        // No explicit compensation needed here in this simplified model, but you could have a ReleaseInventoryActivity
        // compensationStack = append(compensationStack, ...)
    
        // 4. Send Confirmation Email (final step, less critical for compensation)
        err = workflow.ExecuteActivity(ctx, SendConfirmationEmailActivity, order.CustomerID, orderID).Get(ctx, nil)
        if err != nil {
            // A failure here is non-critical to the transaction's integrity.
            // We log it but don't trigger compensations.
            logger.Warn("Failed to send confirmation email, but order is complete.", "Error", err)
        }
    
        logger.Info("Workflow completed successfully", "OrderID", orderID)
        return "Order processed successfully: " + orderID, nil
    }

    Key Implementation Details:

  • Compensation Stack: We use a slice of functions (compensationStack) to store our compensation activities. After each successful forward step, we push its corresponding compensation function onto the stack.
  • defer for Failure Handling: The defer block checks if executionErr is non-nil. If any activity fails, executionErr is set, and the workflow function attempts to return. Before it returns, the defer block executes, iterating through the compensationStack in LIFO (Last-In, First-Out) order, which is the correct sequence for Saga compensation.
  • Disconnected Context: Inside the defer block, we use workflow.NewDisconnectedContext. This is a critical production pattern. If the original workflow context is cancelled (e.g., by an external cancellation request), activities scheduled on it fail with a cancellation error; a disconnected context is not affected by that cancellation, so the compensation activities still run. Note that this covers cancellation only: a workflow that is terminated or that hits its execution timeout is stopped by the server and runs no further code, deferred or otherwise.
  • Error Handling Granularity: Note the handling of the SendConfirmationEmailActivity failure. We treat it as a non-critical error, logging a warning but not rolling back the entire transaction. This reflects real-world business requirements where not all failures are created equal.
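
To make the ordering concrete, here is a minimal sketch of the compensation stack using only the Go standard library, with no Temporal involved. The step type, runSaga, ok, and fail are hypothetical names introduced for illustration; in the real workflow each action and compensation would be an Activity call.

```go
package main

import (
	"errors"
	"fmt"
)

// step pairs a forward action with the compensation that undoes it.
// Both are hypothetical stand-ins for Temporal Activities.
type step struct {
	name       string
	action     func() error
	compensate func() error
}

// runSaga executes steps in order. On the first failure it runs the
// compensations of all completed steps in reverse (LIFO) order and
// returns the names of the steps it compensated along with the error.
func runSaga(steps []step) (compensated []string, err error) {
	var done []step
	for _, s := range steps {
		if err = s.action(); err != nil {
			for i := len(done) - 1; i >= 0; i-- {
				if cerr := done[i].compensate(); cerr != nil {
					// A failed compensation is reported but does not stop the rest.
					fmt.Printf("compensation for %s failed: %v\n", done[i].name, cerr)
					continue
				}
				compensated = append(compensated, done[i].name)
			}
			return compensated, fmt.Errorf("step %s failed: %w", s.name, err)
		}
		done = append(done, s)
	}
	return nil, nil
}

func ok() error   { return nil }
func fail() error { return errors.New("out of stock") }

func main() {
	steps := []step{
		{"create-order", ok, ok},
		{"process-payment", ok, ok},
		{"reserve-inventory", fail, ok},
	}
	compensated, err := runSaga(steps)
	fmt.Println(compensated, err)
	// [process-payment create-order] step reserve-inventory failed: out of stock
}
```

The payment is refunded before the order is cancelled, mirroring the reverse of the order in which the forward steps ran.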

    Advanced Pattern 1: Ensuring Activity Idempotency

    Temporal guarantees at-least-once execution for activities. A worker could execute an activity, successfully complete it, but crash before it can report the completion back to the Temporal Cluster. When the worker recovers or another worker picks up the task, it will re-execute the same activity. If your activity is not idempotent (i.e., safe to run multiple times with the same input), this can lead to severe data corruption, like charging a customer twice.

    The solution is to design your activities and the services they call to be idempotent. A common pattern is to pass a unique idempotency key with each call.

    Let's modify our ProcessPaymentActivity.

    go
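    // app/activities.go
    package app
    
    import (
        "context"
        "errors"
        "fmt"
        "log"
        "time" // used by ReserveInventoryActivity later in this file
    
        "go.temporal.io/sdk/activity"
    )
    
    // ... other activities
    
    func ProcessPaymentActivity(ctx context.Context, order OrderDetails) (string, error) {
        // The idempotency key must be identical on every retry of this operation;
        // otherwise the Payment Service cannot recognise a retry as a duplicate.
        // Do NOT include the attempt number: it changes on each retry.
        // Here we combine the workflow ID with the stable OrderID.
        info := activity.GetInfo(ctx)
        idempotencyKey := fmt.Sprintf("%s-%s", info.WorkflowExecution.ID, order.OrderID)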
    
        // Assume paymentGateway is a client for our Payment Service
        paymentRequest := &PaymentRequest{
            Amount:         order.Amount,
            CreditCardInfo: "...", // from a secure source
            IdempotencyKey: idempotencyKey,
        }
    
        paymentResponse, err := paymentGateway.Charge(paymentRequest)
        if err != nil {
            // Check for a specific idempotency conflict error from the service
            if errors.Is(err, ErrIdempotencyConflict) {
                log.Printf("Idempotency conflict detected for key %s. Assuming success.", idempotencyKey)
                // We need to fetch the original transaction ID to return
                originalTxID, findErr := paymentGateway.FindTransactionByIDempotencyKey(idempotencyKey)
                if findErr != nil {
                    return "", fmt.Errorf("idempotency conflict detected but failed to retrieve original transaction: %w", findErr)
                }
                return originalTxID, nil
            }
            return "", err
        }
    
        return paymentResponse.TransactionID, nil
    }

    On the Payment Service side:

    The service must maintain a table (idempotency_keys) with columns like key, status, request_hash, and response_body.

  • When a request arrives, the service first checks if the idempotencyKey exists in the table.
  • If it exists:
      * Compare the hash of the current request body with the stored request_hash. If they don't match, it's an invalid reuse of the key; return an error.
      * If the hashes match, check the status. If it's COMPLETED, return the stored response_body without re-processing. If it's PENDING, a previous attempt is still in progress or ended without recording a result; the service can decide whether to wait or return a conflict.
  • If it does not exist:
      * Insert a new record with the key, request hash, and a PENDING status within a database transaction.
      * Process the payment.
      * On success, update the record's status to COMPLETED and store the response.
      * On failure, update the status to FAILED.
      * Commit the database transaction.

    Provided the same idempotency key is sent on every retry, this pattern ensures that even if Temporal retries the ProcessPaymentActivity ten times, the customer is charged only once.
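
The service-side check described above can be sketched in memory as follows. This is an illustration under stated assumptions, not the Payment Service's actual implementation: PaymentService, Charge, and the error values are hypothetical names, a map stands in for the idempotency_keys table, and a mutex stands in for the database transaction.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
)

var (
	ErrKeyReuse   = errors.New("idempotency key reused with a different request")
	ErrInProgress = errors.New("request with this idempotency key is still in progress")
)

type record struct {
	requestHash string
	status      string // PENDING, COMPLETED, or FAILED
	response    string
}

// PaymentService is a hypothetical in-memory stand-in for the real service.
type PaymentService struct {
	mu      sync.Mutex
	records map[string]*record
	charges int // number of times the card was actually charged
}

func NewPaymentService() *PaymentService {
	return &PaymentService{records: make(map[string]*record)}
}

func hashBody(body string) string {
	sum := sha256.Sum256([]byte(body))
	return hex.EncodeToString(sum[:])
}

// Charge processes a payment at most once per idempotency key.
func (s *PaymentService) Charge(key, body string) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	h := hashBody(body)
	if rec, ok := s.records[key]; ok {
		if rec.requestHash != h {
			return "", ErrKeyReuse
		}
		switch rec.status {
		case "COMPLETED":
			return rec.response, nil // replay the stored response
		case "PENDING":
			// Unreachable while the mutex is held for the whole call;
			// a real service with separate transactions can hit this.
			return "", ErrInProgress
		}
		// FAILED: this sketch allows a fresh attempt (an assumption).
	}

	rec := &record{requestHash: h, status: "PENDING"}
	s.records[key] = rec

	// Stand-in for the real charge against the card network.
	s.charges++
	rec.response = fmt.Sprintf("tx-%d", s.charges)
	rec.status = "COMPLETED"
	return rec.response, nil
}

func main() {
	svc := NewPaymentService()
	first, _ := svc.Charge("order-42", `{"amount":99.5}`)
	retry, _ := svc.Charge("order-42", `{"amount":99.5}`)
	_, err := svc.Charge("order-42", `{"amount":1.0}`)
	fmt.Println(first, retry, svc.charges, err)
	// tx-1 tx-1 1 idempotency key reused with a different request
}
```

The retry returns the stored transaction ID without charging again, and reusing the key with a different body is rejected.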


    Advanced Pattern 2: Long-Running Activities and Heartbeating

    What if an activity takes a long time to complete? For example, ReserveInventoryActivity might need to communicate with a legacy Warehouse Management System (WMS) that can take several minutes to respond. The 10-second StartToCloseTimeout set in the workflow above would be too short, and the activity would fail unnecessarily.

    We could set a very long timeout (e.g., 30 minutes), but what if the worker process crashes 15 minutes into the operation? Without any feedback, Temporal would wait for the full 30-minute timeout before retrying, introducing significant latency.

    This is solved with Activity Heartbeating. The activity periodically reports back to the Temporal Cluster that it is still alive and making progress.

    go
    // app/activities.go
    
    func ReserveInventoryActivity(ctx context.Context, itemID string, quantity int) (string, error) {
        logger := activity.GetLogger(ctx)
    
        // This activity has a long StartToCloseTimeout, but a short HeartbeatTimeout
        // In workflow options:
        // ao := workflow.ActivityOptions{
        //     StartToCloseTimeout: 30 * time.Minute,
        //     HeartbeatTimeout:    1 * time.Minute, // If no heartbeat in 1 min, retry
        // }
    
        wmsClient := getWMSClient()
        reservationID, err := wmsClient.InitiateReservation(itemID, quantity)
        if err != nil {
            return "", err
        }
    
        // Poll the WMS for status
        for {
            status, err := wmsClient.GetReservationStatus(reservationID)
            if err != nil {
                return "", err
            }
    
            if status == "COMPLETED" {
                logger.Info("Inventory reservation complete.")
                return reservationID, nil
            }
    
            if status == "FAILED" {
                logger.Error("Inventory reservation failed in WMS.")
                return "", errors.New("WMS failed to reserve inventory")
            }
    
            logger.Info("Inventory reservation still pending, sending heartbeat.")
            // Record a heartbeat to let Temporal know we're still alive.
            // We can also include progress details in the heartbeat.
            activity.RecordHeartbeat(ctx, status)
    
            // Check if the context is cancelled (e.g., workflow was cancelled)
            select {
            case <-ctx.Done():
                // The workflow requested cancellation. We should attempt to clean up.
                wmsClient.CancelReservation(reservationID)
                return "", ctx.Err()
            case <-time.After(15 * time.Second):
                // Continue polling
            }
        }
    }

    How it works:

  • In the workflow, we configure a long StartToCloseTimeout but a short HeartbeatTimeout.
  • Inside the activity's processing loop, we call activity.RecordHeartbeat(ctx, ...). This call informs the Temporal Cluster that the activity is still making progress.
  • If the worker crashes, the cluster will not receive a heartbeat within the HeartbeatTimeout duration. Once that timeout elapses, it marks the attempt as failed and schedules a retry on another available worker according to the retry policy, without waiting for the full StartToCloseTimeout.
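  • The ctx.Done() check is crucial for graceful cancellation. If the workflow is cancelled externally, the activity context will be cancelled, and we can perform cleanup actions (like telling the WMS to cancel the reservation). Temporal delivers cancellation to a running activity through its heartbeats, so an activity that never heartbeats will not observe it.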
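
As a rough worked example of the latency difference, the sketch below computes how long after a crash the failure would be noticed, with and without a HeartbeatTimeout. It is a simplified model: detectionDelay and exampleDelays are hypothetical helpers, the last heartbeat is assumed to have arrived 5 seconds before the crash, and real timings depend on the server and on how the SDK batches heartbeats.

```go
package main

import (
	"fmt"
	"time"
)

// detectionDelay returns how long after a worker crash the failure is
// noticed. All arguments are offsets from the activity's start time.
// A heartbeatTimeout of zero means heartbeating is not configured.
func detectionDelay(crashAt, lastHeartbeatAt, startToClose, heartbeatTimeout time.Duration) time.Duration {
	// Without heartbeats, only the StartToCloseTimeout can expire.
	detectedAt := startToClose
	if heartbeatTimeout > 0 {
		// The heartbeat timer restarts at every heartbeat, so it fires
		// heartbeatTimeout after the last one that was recorded.
		if t := lastHeartbeatAt + heartbeatTimeout; t < detectedAt {
			detectedAt = t
		}
	}
	return detectedAt - crashAt
}

// exampleDelays evaluates the scenario from the text: a crash 15 minutes
// into an activity with a 30-minute StartToCloseTimeout, first without
// heartbeating and then with a 1-minute HeartbeatTimeout.
func exampleDelays() (withoutHeartbeat, withHeartbeat time.Duration) {
	crash := 15 * time.Minute
	lastBeat := crash - 5*time.Second // assumed: last heartbeat 5s before the crash
	withoutHeartbeat = detectionDelay(crash, lastBeat, 30*time.Minute, 0)
	withHeartbeat = detectionDelay(crash, lastBeat, 30*time.Minute, time.Minute)
	return
}

func main() {
	a, b := exampleDelays()
	fmt.Println(a, b) // 15m0s 55s
}
```

Under these assumptions the retry is scheduled in under a minute instead of after a further 15 minutes.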

    Advanced Pattern 3: Evolving Sagas with Workflow Versioning

    Business requirements change. Suppose we need to add a Fraud Check step between payment and inventory reservation. If we simply add a new ExecuteActivity call to our workflow code and deploy it, we risk breaking all in-flight workflows. When such a workflow is replayed, its recorded execution history will not match the commands produced by the new code, and the worker will fail it with a non-determinism error.

    Temporal provides a workflow.GetVersion API to manage this exact scenario. It allows you to create branching logic that is deterministic.

    go
    // app/workflow.go (updated)
    
    func OrderProcessingWorkflow(ctx workflow.Context, order OrderDetails) (string, error) {
        // ... setup and compensation stack logic remains the same ...
    
        // ... after ProcessPaymentActivity succeeds ...
    
        // VERSIONING: Introduce a fraud check step
        v := workflow.GetVersion(ctx, "AddFraudCheck", workflow.DefaultVersion, 1)
        if v == 1 {
            var fraudCheckResult string
            err = workflow.ExecuteActivity(ctx, FraudCheckActivity, order.CustomerID, order.Amount).Get(ctx, &fraudCheckResult)
            if err != nil {
                executionErr = err
                return "", err
            }
            if fraudCheckResult == "DENIED" {
                // Not a technical failure, but a business rule failure. Trigger compensations.
                executionErr = errors.New("fraud check denied")
                return "", executionErr
            }
        }
    
        // 3. Reserve Inventory
        // ... rest of the workflow logic ...
    }

    How it works:

  • workflow.GetVersion(ctx, "AddFraudCheck", workflow.DefaultVersion, 1) is a special, deterministic Temporal API call.
  • When a workflow execution reaches this line for the first time (that is, not during replay), it records a marker with version 1 for the change ID "AddFraudCheck" in its history and returns 1.
  • On any subsequent replay of this workflow, the API will read the version from the history and return the same value (1), ensuring the execution path remains deterministic.
  • Workflows that had already progressed past this point before the new code was deployed have no entry for "AddFraudCheck" in their history. For them, GetVersion returns workflow.DefaultVersion (which is -1) on replay. The if v == 1 block will be skipped, and they will continue along the old execution path safely.
  • New workflows, and in-flight workflows that had not yet reached this point at deployment time, will execute the FraudCheckActivity.
    This powerful feature allows you to evolve complex Sagas over time without ever needing to take the system down or perform complex data migrations on in-flight processes.
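
The branching rules above can be modelled in a few lines of plain Go. This is a toy simulation for intuition only, not the SDK's implementation: history, getVersion, and the replaying flag are hypothetical, and the helper omits the minSupported argument that the real workflow.GetVersion takes.

```go
package main

import "fmt"

// defaultVersion mirrors workflow.DefaultVersion in the Go SDK.
const defaultVersion = -1

// history is a hypothetical stand-in for a workflow's event history:
// it maps a change ID to the version recorded by a marker.
type history map[string]int

// getVersion mimics the behaviour described in the text. replaying is
// true when the code is re-executing a point that already exists in
// history rather than running it for the first time.
func getVersion(h history, replaying bool, changeID string, maxSupported int) int {
	if v, ok := h[changeID]; ok {
		return v // a marker exists: always return the recorded version
	}
	if replaying {
		// This point was passed before the change was deployed.
		return defaultVersion
	}
	// First execution of this line: record the newest version.
	h[changeID] = maxSupported
	return maxSupported
}

func main() {
	// A workflow that passed this point on the old code, now replaying.
	old := history{}
	fmt.Println(getVersion(old, true, "AddFraudCheck", 1)) // -1

	// A workflow reaching this line for the first time on the new code.
	fresh := history{}
	fmt.Println(getVersion(fresh, false, "AddFraudCheck", 1)) // 1

	// The same workflow replayed later: the marker pins the version.
	fmt.Println(getVersion(fresh, true, "AddFraudCheck", 1)) // 1
}
```

Because the decision is stored in history the first time it is made, every later replay takes the same branch regardless of which code version the worker is running.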

    Conclusion: From Fragile Choreography to Resilient Orchestration

    Implementing the Saga pattern correctly is a significant engineering challenge that goes far beyond a simple sequence of API calls. By leveraging an orchestration engine like Temporal, we elevate the implementation from a fragile, implicit state machine spread across service logs and message queues to an explicit, observable, and durable workflow.

    We've demonstrated how to:

    * Model a complex Saga as a single, comprehensible piece of code.

    * Implement robust compensation logic using standard language features (defer) combined with Temporal's guarantees.

    * Solve for at-least-once execution by designing idempotent activities and services.

    * Manage long-running tasks reliably with heartbeating, preventing unnecessary timeouts and reducing recovery time.

    * Evolve the Saga's business logic over time without disrupting live operations using workflow versioning.

    This approach allows developers to focus on the core business logic of the Saga, while the underlying platform provides the heavy lifting of state management, fault tolerance, and resilience required for modern, production-grade microservice architectures.
