Production-Grade Saga Pattern with Temporal.io for Microservices

16 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Implementation Chasm of the Saga Pattern

As senior engineers designing distributed systems, we accept the Saga pattern as a canonical solution for maintaining data consistency across microservice boundaries without resorting to brittle, unavailable two-phase commits. The theory is sound: a sequence of local transactions, where each subsequent step is triggered by the completion of the previous one. If any step fails, a series of compensating transactions are executed in reverse to undo the preceding operations.

Conceptually, this is straightforward. In practice, implementing a robust Saga orchestrator from scratch is a minefield of subtle complexities. Consider the state machine you must build and persist. Where does it live? A database? A Redis instance? How do you handle orchestrator crashes and failovers without losing the Saga's state? How do you guarantee that activities are idempotent, especially when network partitions and retries are a given? What about the logic for executing compensating transactions? This often becomes a complex, error-prone state management problem that pollutes the primary business logic.

This is the implementation chasm. We know the pattern, but the tooling often fails us, forcing us to build significant infrastructure before writing a single line of business logic. This article asserts that Temporal.io is not merely a workflow engine but a durable execution framework that fundamentally solves the Saga implementation problem. It externalizes state, retries, and failure handling, allowing us to write Saga logic as simple, sequential code while Temporal ensures its durability and fault tolerance.

We will dissect a complex e-commerce order processing Saga, moving from the happy path to intricate failure and compensation scenarios, all while using production-grade Temporal patterns.


Scenario: A Multi-Service E-commerce Order Saga

Let's model a realistic order placement process involving several microservices:

  • Inventory Service: Reserves items from stock. Compensation: Releases reserved items.
  • Payment Service: Charges the customer's credit card. Compensation: Refunds the charge.
  • Shipment Service: Arranges for shipping with a third-party logistics provider. Compensation: Voids the shipping label.
  • Notification Service: Sends an order confirmation email to the customer. Compensation: Arguably none needed, but could send a "cancellation" email.
  • Our goal is to orchestrate these steps. If the Shipment Service fails (e.g., the provider's API is down), we must automatically refund the payment and release the inventory.

    The Temporal Primitives: Workflows and Activities

    Before we dive into the code, let's align on how Temporal primitives map to the Saga pattern:

    Saga Orchestrator -> Temporal Workflow: The entire end-to-end business logic of the Saga is encapsulated within a single Workflow function. Temporal makes this function durable, meaning its state is preserved across any process or server failure. The code looks* sequential, but its execution can span minutes, days, or even years.

    * Saga Step (Local Transaction) -> Temporal Activity: Each interaction with a microservice (ReserveInventory, ProcessPayment) is an Activity. Activities are where side effects happen. They are designed to be retried automatically and are guaranteed to be executed at least once.

    This mapping is the foundation of our implementation. The Workflow orchestrates, and the Activities execute.

    Implementation: The Core Order Workflow (Go)

    We'll use Go for our examples, as it's a popular choice for Temporal development, but the concepts are identical in other SDKs like TypeScript, Java, and Python.

    First, let's define the data structures and the initial "happy path" workflow.

    go
    // file: shared/types.go
    package shared
    
    type OrderDetails struct {
    	UserID   string
    	ItemID   string
    	Quantity int
    	Charge   float64
    	OrderID  string // Unique ID for this transaction
    }
    
    // file: activities/activities.go
    package activities
    
    import (
    	"context"
    	"errors"
    	"fmt"
    	"time"
    
    	"go.temporal.io/sdk/activity"
    	"your.company/saga-example/shared"
    )
    
    // Mock activities that simulate calls to external services
    func ReserveInventory(ctx context.Context, details shared.OrderDetails) (string, error) {
    	logger := activity.GetLogger(ctx)
    	logger.Info("Reserving inventory...", "OrderID", details.OrderID)
    	// Simulate a potential failure
    	if details.ItemID == "FAIL_INVENTORY" {
    		return "", errors.New("inventory service unavailable")
    	}
    	time.Sleep(500 * time.Millisecond) // Simulate network latency
    	return fmt.Sprintf("RESERVATION-%s", details.OrderID), nil
    }
    
    func ProcessPayment(ctx context.Context, details shared.OrderDetails) (string, error) {
    	logger := activity.GetLogger(ctx)
    	logger.Info("Processing payment...", "OrderID", details.OrderID, "Amount", details.Charge)
    	if details.Charge > 1000 {
    		return "", errors.New("payment declined: amount exceeds limit")
    	}
    	time.Sleep(500 * time.Millisecond)
    	return fmt.Sprintf("PAYMENT_TXN-%s", details.OrderID), nil
    }
    
    func CreateShipment(ctx context.Context, details shared.OrderDetails) (string, error) {
    	logger := activity.GetLogger(ctx)
    	logger.Info("Creating shipment...", "OrderID", details.OrderID)
    	if details.ItemID == "FAIL_SHIPMENT" {
    		return "", errors.New("shipping provider API is down")
    	}
    	time.Sleep(500 * time.Millisecond)
    	return fmt.Sprintf("SHIPMENT_ID-%s", details.OrderID), nil
    }
    
    // Compensation Activities
    func ReleaseInventory(ctx context.Context, details shared.OrderDetails) error {
    	logger := activity.GetLogger(ctx)
    	logger.Info("COMPENSATION: Releasing inventory...", "OrderID", details.OrderID)
    	time.Sleep(500 * time.Millisecond)
    	return nil
    }
    
    func RefundPayment(ctx context.Context, details shared.OrderDetails) error {
    	logger := activity.GetLogger(ctx)
    	logger.Info("COMPENSATION: Refunding payment...", "OrderID", details.OrderID)
    	time.Sleep(500 * time.Millisecond)
    	return nil
    }
    
    func VoidShipment(ctx context.Context, details shared.OrderDetails) error {
    	logger := activity.GetLogger(ctx)
    	logger.Info("COMPENSATION: Voiding shipment...", "OrderID", details.OrderID)
    	time.Sleep(500 * time.Millisecond)
    	return nil
    }

    Now, for the orchestrator—the Workflow. A naive implementation might look like this:

    go
    // file: workflows/order_workflow_naive.go
    package workflows
    
    import (
    	"time"
    
    	"go.temporal.io/sdk/workflow"
    	"your.company/saga-example/activities"
    	"your.company/saga-example/shared"
    )
    
    // DO NOT USE THIS IN PRODUCTION - LACKS COMPENSATION
    func OrderSagaWorkflow_Naive(ctx workflow.Context, order shared.OrderDetails) (string, error) {
    	ao := workflow.ActivityOptions{
    		StartToCloseTimeout: 10 * time.Second,
    	}
    	ctx = workflow.WithActivityOptions(ctx, ao)
    
    	var reservationID string
    	err := workflow.ExecuteActivity(ctx, activities.ReserveInventory, order).Get(ctx, &reservationID)
    	if err != nil {
    		return "", err
    	}
    
    	var paymentID string
    	err = workflow.ExecuteActivity(ctx, activities.ProcessPayment, order).Get(ctx, &paymentID)
    	if err != nil {
    		// PROBLEM: What about the inventory we just reserved?
    		return "", err
    	}
    
    	var shipmentID string
    	err = workflow.ExecuteActivity(ctx, activities.CreateShipment, order).Get(ctx, &shipmentID)
    	if err != nil {
    		// PROBLEM: What about the inventory and payment?
    		return "", err
    	}
    
    	return "Order completed successfully!", nil
    }

    This naive approach highlights the core problem. If ProcessPayment fails, the workflow exits, but the inventory remains reserved, leading to data inconsistency. This is where we introduce the robust Saga compensation pattern.

    Advanced Implementation: Sagas with Explicit Compensation

    Temporal's deterministic execution rules mean we can't use standard try...catch...finally blocks in the same way as regular code. However, we can achieve the same goal using Go's defer statement, which is re-implemented by the Temporal SDK to be workflow-safe. A deferred function will execute whenever the workflow function exits—either by returning successfully or by returning an error.

    This is the key to our Saga implementation. We will add compensation functions to a list as each step succeeds. If the workflow function exits with an error, a deferred function will iterate through our list of completed steps and execute their corresponding compensations in reverse order.

    go
    // file: workflows/order_workflow_saga.go
    package workflows
    
    import (
    	"time"
    
    	"go.temporal.io/sdk/workflow"
    	"your.company/saga-example/activities"
    	"your.company/saga-example/shared"
    )
    
    func OrderSagaWorkflow(ctx workflow.Context, order shared.OrderDetails) (string, error) {
    	ao := workflow.ActivityOptions{
    		StartToCloseTimeout: 10 * time.Second,
    		// We will discuss RetryPolicy in more detail later
    		RetryPolicy: &workflow.RetryPolicy{
    			MaximumAttempts: 1, // Fail fast for this example
    		},
    	}
    	ctx = workflow.WithActivityOptions(ctx, ao)
    
    	// Use a compensation stack
    	var compensations []func(workflow.Context)
    
    	// Use a deferred function to handle cleanup/compensation
    	defer func() {
    		// If the workflow is exiting with an error, run compensations
    		if workflow.GetInfo(ctx).IsExecutionDone() && workflow.GetInfo(ctx).GetLastCompletionResult() != nil {
    			// Execute compensations in LIFO order
    			for i := len(compensations) - 1; i >= 0; i-- {
    				// Use a disconnected context to ensure compensations run even if the main workflow context is cancelled.
    				disconnectedCtx, _ := workflow.NewDisconnectedContext(ctx)
    				compensations[i](disconnectedCtx)
    			}
    		}
    	}()
    
    	// Step 1: Reserve Inventory
    	var reservationID string
    	err := workflow.ExecuteActivity(ctx, activities.ReserveInventory, order).Get(ctx, &reservationID)
    	if err != nil {
    		workflow.GetLogger(ctx).Error("Failed to reserve inventory.", "Error", err)
    		return "", err
    	}
    	// If successful, add compensation to the stack
    	compensations = append(compensations, func(compCtx workflow.Context) {
    		workflow.ExecuteActivity(compCtx, activities.ReleaseInventory, order).Get(compCtx, nil)
    	})
    
    	// Step 2: Process Payment
    	var paymentID string
    	err = workflow.ExecuteActivity(ctx, activities.ProcessPayment, order).Get(ctx, &paymentID)
    	if err != nil {
    		workflow.GetLogger(ctx).Error("Failed to process payment.", "Error", err)
    		return "", err
    	}
    	compensations = append(compensations, func(compCtx workflow.Context) {
    		workflow.ExecuteActivity(compCtx, activities.RefundPayment, order).Get(compCtx, nil)
    	})
    
    	// Step 3: Create Shipment
    	var shipmentID string
    	err = workflow.ExecuteActivity(ctx, activities.CreateShipment, order).Get(ctx, &shipmentID)
    	if err != nil {
    		workflow.GetLogger(ctx).Error("Failed to create shipment.", "Error", err)
    		return "", err
    	}
    	compensations = append(compensations, func(compCtx workflow.Context) {
    		workflow.ExecuteActivity(compCtx, activities.VoidShipment, order).Get(compCtx, nil)
    	})
    
    	workflow.GetLogger(ctx).Info("Order completed successfully!", "ShipmentID", shipmentID)
    	return "Order completed successfully!", nil
    }

    Let's break down the critical patterns here:

  • Compensation Stack: We use a slice of functions (compensations) to act as a Last-In, First-Out (LIFO) stack. Each time an activity completes successfully, we push its corresponding compensation function onto the stack.
  • Deferred Execution: The defer block contains our core compensation logic. It checks if the workflow is finishing with an error (GetLastCompletionResult() != nil). If so, it iterates through the compensations slice in reverse, executing each one.
  • workflow.NewDisconnectedContext: This is a crucial, advanced pattern. If our workflow was cancelled by an external request, the main ctx would also be cancelled. Any activities started with that context would fail immediately. By creating a disconnected context, we ensure that our compensation logic runs to completion regardless of the parent workflow's cancellation status. This is essential for guaranteeing cleanup.
  • Now, if we run this workflow with an order where ItemID is FAIL_SHIPMENT, the CreateShipment activity will fail. The workflow will return an error, triggering the deferred function. It will then execute RefundPayment and ReleaseInventory in that order, correctly rolling back the Saga.

    Edge Case 1: Idempotency in Activities

    Temporal guarantees at-least-once execution for activities. A worker could execute half of an activity, crash, and then another worker could pick up the same task and re-run it from the beginning. If your ProcessPayment activity doesn't account for this, you could double-charge a customer.

    Activities must be designed to be idempotent. The standard pattern is to pass a unique token with each activity execution request. The service implementing the activity then uses this token to de-duplicate requests.

    Fortunately, every Temporal Workflow Execution has a unique ID. We can combine this with an identifier for the specific action to create a perfect idempotency key.

    Let's refactor our ProcessPayment activity to be idempotent.

    go
    // file: activities/activities.go (updated)
    
    // Imagine this is our payment service's client
    type PaymentServiceClient struct{}
    
    // The client's method now requires an idempotency key
    func (c *PaymentServiceClient) Charge(ctx context.Context, amount float64, cardToken string, idempotencyKey string) (string, error) {
    	// In a real system, you would:
    	// 1. BEGIN a database transaction.
    	// 2. SELECT * FROM payment_idempotency WHERE key = idempotencyKey.
    	// 3. If a record exists, return the stored result (success or failure).
    	// 4. If no record exists, call the payment gateway API.
    	// 5. INSERT the result (transaction ID or error) into payment_idempotency with the key.
    	// 6. COMMIT the transaction.
    	// 7. Return the result.
    	fmt.Printf("Charging %f with idempotency key %s\n", amount, idempotencyKey)
    	time.Sleep(500 * time.Millisecond)
    	return fmt.Sprintf("PAYMENT_TXN-%s", idempotencyKey), nil
    }
    
    var paymentClient = &PaymentServiceClient{}
    
    func ProcessPayment(ctx context.Context, details shared.OrderDetails) (string, error) {
    	logger := activity.GetLogger(ctx)
    	info := activity.GetInfo(ctx)
    
    	// Create a stable, unique idempotency key
    	idempotencyKey := fmt.Sprintf("%s-payment-%d", info.WorkflowExecution.ID, info.Attempt)
    
    	logger.Info("Processing payment with idempotency key...", "key", idempotencyKey)
    
    	if details.Charge > 1000 {
    		return "", errors.New("payment declined: amount exceeds limit")
    	}
    
    	// Call the service with the key
    	return paymentClient.Charge(context.Background(), details.Charge, "tok_visa", idempotencyKey)
    }

    By using info.WorkflowExecution.ID and info.Attempt, we create a key that is unique and stable for each specific attempt of this activity within this specific workflow run. The backend service can now safely de-duplicate this call, making our Saga resilient to worker failures and retries.

    Edge Case 2: Failures in Compensation Logic

    What happens if our RefundPayment activity fails because the payment provider's refund API is down? In our current setup, the compensation would fail, and the defer block would finish, leaving the customer charged for a failed order.

    This is where Temporal's automatic retries for activities become critical. We should configure our compensation activities with a robust retry policy to handle transient failures.

    Let's refine our compensation execution logic:

    go
    // file: workflows/order_workflow_saga.go (updated defer block)
    
    	// ... inside defer func() ...
    		if workflow.GetInfo(ctx).IsExecutionDone() && workflow.GetInfo(ctx).GetLastCompletionResult() != nil {
    			// Use a separate, robust activity options for compensations
    			compensationAo := workflow.ActivityOptions{
    				StartToCloseTimeout: 10 * time.Second,
    				RetryPolicy: &workflow.RetryPolicy{
    					InitialInterval:    time.Second,
    					BackoffCoefficient: 2.0,
    					MaximumInterval:    time.Minute,
    					MaximumAttempts:    10, // Retry compensation up to 10 times over ~17 minutes
    				},
    			}
    
    			logger := workflow.GetLogger(ctx)
    			logger.Info("Saga failed. Starting compensation process.")
    
    			for i := len(compensations) - 1; i >= 0; i-- {
    				disconnectedCtx, _ := workflow.NewDisconnectedContext(ctx)
    				compensationCtx := workflow.WithActivityOptions(disconnectedCtx, compensationAo)
    				
    				// We use ExecuteActivity here but don't wait for the result with .Get()
    				// This is a stylistic choice; we could also wait. The important part is the retry policy.
    				compensations[i](compensationCtx)
    			}
    		}
    	// ... end of defer func()

    By defining a separate ActivityOptions with an exponential backoff retry policy specifically for our compensation context, we instruct Temporal to diligently retry any failing compensation activity. This makes it highly likely that transient issues (like a brief API outage) will be resolved and our Saga will correctly roll back, maintaining system consistency.

    Production Patterns: Interaction with a Running Saga

    A long-running Saga is not a black box. We need mechanisms to interact with it and observe its state. Temporal provides two powerful primitives for this: Signals and Queries.

    Using a Signal to Cancel an Order

    Imagine the customer calls support to cancel their order while it's being processed. We can send a Signal to the running workflow to trigger a graceful cancellation and rollback.

    First, define the signal channel in the workflow.

    go
    // file: workflows/order_workflow_saga.go (updated)
    
    func OrderSagaWorkflow(ctx workflow.Context, order shared.OrderDetails) (string, error) {
        // ... (activity options, compensation stack, defer block as before) ...
    
        // Add a signal channel to listen for cancellations
        cancellationChan := workflow.GetSignalChannel(ctx, "cancel-order-signal")
    
        // Use a selector to react to either activity completion or cancellation
        selector := workflow.NewSelector(ctx)
    
        var workflowErr error
        selector.AddReceive(cancellationChan, func(c workflow.Channel, more bool) {
            var signalMsg string
            c.Receive(ctx, &signalMsg)
            workflow.GetLogger(ctx).Info("Received cancellation signal", "Message", signalMsg)
            // Create a cancellation error to trigger the compensation logic
            workflowErr = workflow.NewCanceledError("Order cancelled by user signal")
        })
    
        // Helper function to execute an activity and add its compensation
        executeSagaStep := func(activityFunc interface{}, compensationFunc func(workflow.Context)) {
            if workflowErr != nil {
                return // Stop processing if an error (like cancellation) has occurred
            }
    
            future := workflow.ExecuteActivity(ctx, activityFunc, order)
            selector.AddFuture(future, func(f workflow.Future) {
                err := f.Get(ctx, nil) // We just care about the error here
                if err != nil {
                    workflowErr = err
                } else {
                    compensations = append(compensations, compensationFunc)
                }
            })
        }
    
        // Execute steps
        executeSagaStep(activities.ReserveInventory, func(cCtx workflow.Context) {
            workflow.ExecuteActivity(cCtx, activities.ReleaseInventory, order).Get(cCtx, nil)
        })
        workflow.Select(ctx, selector) // Wait for the step to complete or a signal to arrive
        if workflowErr != nil { return "", workflowErr }
    
        executeSagaStep(activities.ProcessPayment, func(cCtx workflow.Context) {
            workflow.ExecuteActivity(cCtx, activities.RefundPayment, order).Get(cCtx, nil)
        })
        workflow.Select(ctx, selector) // Wait again
        if workflowErr != nil { return "", workflowErr }
    
        executeSagaStep(activities.CreateShipment, func(cCtx workflow.Context) {
            workflow.ExecuteActivity(cCtx, activities.VoidShipment, order).Get(cCtx, nil)
        })
        workflow.Select(ctx, selector) // And again
        if workflowErr != nil { return "", workflowErr }
    
        return "Order completed successfully!", nil
    }

    This refactoring introduces workflow.Selector. It allows the workflow to wait on multiple events simultaneously: the completion of an activity future OR the arrival of a message on a signal channel. If the cancel-order-signal arrives, we set a workflowErr. This error will prevent subsequent steps from running and, upon exiting the function, will trigger our defer block to run the compensations for any steps that have already completed.

    Using a Query to Inspect State

    How does our frontend know what to display to the customer? "Processing"? "Payment Complete"? We can add a Query handler to our workflow to synchronously expose its internal state without affecting its execution history.

    go
    // file: workflows/order_workflow_saga.go (updated)
    
    func OrderSagaWorkflow(ctx workflow.Context, order shared.OrderDetails) (string, error) {
        // ... (setup as before) ...
        currentState := "ORDER_STARTED"
    
        // Set up a query handler
        err := workflow.SetQueryHandler(ctx, "get-order-status", func() (string, error) {
            return currentState, nil
        })
        if err != nil {
            return "", err
        }
    
        // ... (defer block and compensations) ...
    
        // Step 1: Reserve Inventory
        currentState = "RESERVING_INVENTORY"
        err = workflow.ExecuteActivity(ctx, activities.ReserveInventory, order).Get(ctx, nil)
        if err != nil { /* ... */ }
        compensations = append(compensations, /* ... */)
    
        // Step 2: Process Payment
        currentState = "PROCESSING_PAYMENT"
        err = workflow.ExecuteActivity(ctx, activities.ProcessPayment, order).Get(ctx, nil)
        if err != nil { /* ... */ }
        compensations = append(compensations, /* ... */)
        currentState = "PAYMENT_COMPLETE"
    
        // ... and so on for other steps
    
        currentState = "ORDER_COMPLETE"
        return "Order completed successfully!", nil
    }

    Now, from a separate client application (e.g., an API backend serving the frontend), we can query the workflow's state:

    go
    // Client code to query the workflow
    resp, err := temporalClient.QueryWorkflow(context.Background(), workflowID, runID, "get-order-status")
    if err != nil {
        log.Fatalln("Unable to query workflow", err)
    }
    var status string
    if err := resp.Get(&status); err != nil {
        log.Fatalln("Unable to decode query result", err)
    }
    fmt.Printf("Current order status: %s\n", status)

    This provides a powerful, consistent way to build observability into your complex business processes.


    Conclusion: Beyond Orchestration to Durable Execution

    The Saga pattern is essential for maintaining consistency in a microservices world, but its manual implementation is a significant engineering burden. By adopting Temporal, we shift our perspective from building a complex state machine to simply writing the business logic of our Saga as a durable, fault-tolerant function.

    We've demonstrated how to:

  • Map the Saga pattern directly to Temporal's Workflow and Activity primitives.
  • Implement robust compensation logic using defer and disconnected contexts to guarantee cleanup, even in the face of workflow cancellation.
  • Address critical edge cases like activity idempotency and compensation retries using Temporal's built-in features.
  • Build interactive and observable Sagas using Signals and Queries for real-time control and state inspection.
  • By leveraging Temporal, the immense challenge of Saga orchestration is reduced to writing straightforward, sequential code. The framework handles the persistence, retries, state management, and timers, freeing senior engineers to focus on the high-value business logic that defines the Saga, not the plumbing that makes it run.

    Found this article helpful?

    Share it with others who might benefit from it.

    More Articles