Saga Pattern with Temporal: A Production-Grade Implementation Guide

15 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Saga Pattern Reimagined: From Theory to Production with Temporal

In any non-trivial microservices architecture, maintaining data consistency across service boundaries is a principal challenge. Two-phase commit (2PC) is widely avoided in this setting because of its tight coupling, blocking coordinator, and performance bottlenecks, which led to the rise of asynchronous, event-driven patterns. Among these, the Saga pattern is the de facto standard for managing long-running, multi-step business transactions.

However, implementing a Saga from scratch is fraught with peril. Developers must manually manage a distributed state machine, build a durable Saga log, implement complex retry logic, and handle the intricate dance of compensating transactions. This boilerplate is not only error-prone but also distracts from the core business logic.

This is where a durable execution engine like Temporal fundamentally changes the game. Temporal abstracts away the complexities of distributed state management, allowing you to write Saga orchestration logic as straightforward, sequential code. This article is not an introduction to Sagas or Temporal. It is a deep, technical guide for senior engineers on how to implement a production-grade, orchestration-based Saga using Temporal, focusing on the advanced patterns and edge cases you will inevitably encounter.

We will dissect a complex e-commerce order processing workflow, providing complete Go code examples to illustrate:

  • Robust compensation logic using defer.
  • Ensuring idempotency at both the activity and service level.
  • Managing long-running activities with heartbeating.
  • Handling non-compensatable failures with human-in-the-loop patterns.
  • Performance considerations like workflow history and the Continue-As-New pattern.

    Why Temporal is a Natural Fit for Orchestrated Sagas

    Sagas come in two flavors: Choreography and Orchestration. In Choreography, services react to events from other services, leading to a decentralized but often hard-to-trace and debug system. In Orchestration, a central coordinator (the orchestrator) explicitly tells services what to do.

    Temporal is purpose-built for the orchestration model. A Temporal Workflow is a durable, stateful function whose execution is preserved across process and server failures. This makes it a perfect implementation of a Saga orchestrator.

  • Durable State: A Temporal Workflow's state (local variables, execution stack) is automatically persisted by the Temporal Cluster. This completely eliminates the need for a custom Saga log database. The code is the state machine.
  • Guaranteed Execution: Activities (the individual steps of a Saga) are guaranteed to execute at least once. Temporal handles the retries with configurable backoff policies, freeing you from building this logic.
  • Simplified Compensation: The statefulness of the workflow allows for simple, powerful compensation patterns. As we'll see, Go's defer statement or a try...catch...finally block in other languages becomes a robust mechanism for queuing and executing compensation logic.

    Scenario: A Production-Grade E-commerce Order Saga

    Let's model a realistic order processing flow that interacts with multiple microservices. A customer placing an order triggers a workflow that must successfully complete a series of steps.

    Forward Transaction (The "Happy Path")

  1. CreateOrder: The Order Service creates an order record with a PENDING status.
  2. ReserveInventory: The Inventory Service reserves the specified items. If inventory is insufficient, the Saga fails.
  3. ProcessPayment: The Payment Service charges the customer's card via a third-party gateway.
  4. UpdateOrderStatus: The Order Service updates the order status to CONFIRMED.
  5. DispatchShipping: The Shipping Service is notified to begin the fulfillment process.

    Compensating Transactions

    If any step from 2 through 5 fails, we must roll back the preceding actions to maintain a consistent state.

  • Failure at DispatchShipping: No rollback. Call UpdateOrderStatus to set the status to PAYMENT_CONFIRMED_SHIPPING_FAILED and leave the payment in place (see Edge Case 3).
  • Failure at UpdateOrderStatus: Compensate by calling RefundPayment, then ReleaseInventory.
  • Failure at ProcessPayment: Compensate by calling ReleaseInventory.
  • Failure at ReserveInventory: Nothing has been reserved yet, so call UpdateOrderStatus to mark the order as FAILED.

    This sequence of compensations, executed in reverse order of the original operations, is the core of the Saga pattern.
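
The reverse-order rule can be sketched without Temporal at all. The following is a minimal, standard-library-only illustration; the step type, runSaga, ok, and fail are hypothetical names invented for this sketch and are not part of any SDK.

```go
package main

import (
	"errors"
	"fmt"
)

// step pairs a forward action with the compensation that undoes it.
type step struct {
	name     string
	action   func() error
	undoName string
	undo     func() error
}

// runSaga executes steps in order. On the first failure it undoes the steps
// that already succeeded, newest first, and returns the names of the
// compensations in the order they ran together with the original error.
func runSaga(steps []step) (ran []string, err error) {
	var done []step
	for _, s := range steps {
		if err = s.action(); err != nil {
			for i := len(done) - 1; i >= 0; i-- {
				if undoErr := done[i].undo(); undoErr != nil {
					// A failed compensation needs its own handling; this sketch only reports it.
					fmt.Println("compensation failed:", done[i].undoName, undoErr)
				}
				ran = append(ran, done[i].undoName)
			}
			return ran, fmt.Errorf("%s failed: %w", s.name, err)
		}
		done = append(done, s)
	}
	return nil, nil
}

func ok() error   { return nil }
func fail() error { return errors.New("service unavailable") }

func main() {
	ran, err := runSaga([]step{
		{"ReserveInventory", ok, "ReleaseInventory", ok},
		{"ProcessPayment", ok, "RefundPayment", ok},
		{"UpdateOrderStatus", fail, "", ok},
	})
	fmt.Println(ran, err)
	// [RefundPayment ReleaseInventory] UpdateOrderStatus failed: service unavailable
}
```

What this sketch leaves out is exactly what Temporal provides: the done slice lives only in memory, so a process crash in the middle of the loop would lose track of which steps still need undoing.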

    Implementation Deep Dive: Building the Saga Workflow in Go

    Let's translate this business process into a Temporal Workflow. We'll start with the project structure and then build out the activities and the orchestrating workflow.

    Project Structure

    text
    /order-saga
    |-- /activities
    |   |-- inventory_activity.go
    |   |-- order_activity.go
    |   |-- payment_activity.go
    |   `-- shipping_activity.go
    |-- /shared
    |   `-- types.go
    |-- /worker
    |   `-- main.go
    |-- /starter
    |   `-- main.go
    `-- workflow.go

    Defining Activities: The Saga Steps

    Activities are functions that execute the actual business logic, typically by making RPC calls to your microservices. It's crucial to define their options, such as timeouts and retry policies, correctly.
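
To get a feel for what a retry policy implies, it helps to compute the waits it produces. The sketch below assumes the usual exponential scheme (start at the initial interval, multiply by the backoff coefficient after each attempt, cap at the maximum interval) and uses the same values as the workflow later in this article. retryPolicy here is a local, hypothetical struct for illustration, not the SDK's temporal.RetryPolicy.

```go
package main

import (
	"fmt"
	"time"
)

// retryPolicy mirrors the four fields used in this article. It is a local,
// hypothetical type for illustration, not the SDK's temporal.RetryPolicy.
type retryPolicy struct {
	InitialInterval    time.Duration
	BackoffCoefficient float64
	MaximumInterval    time.Duration
	MaximumAttempts    int
}

// backoffSchedule returns the wait before each retry: the interval starts at
// InitialInterval, is multiplied by BackoffCoefficient after every attempt,
// and is capped at MaximumInterval. N attempts means N-1 waits.
func backoffSchedule(p retryPolicy) []time.Duration {
	var waits []time.Duration
	interval := p.InitialInterval
	for attempt := 1; attempt < p.MaximumAttempts; attempt++ {
		if interval > p.MaximumInterval {
			interval = p.MaximumInterval
		}
		waits = append(waits, interval)
		interval = time.Duration(float64(interval) * p.BackoffCoefficient)
	}
	return waits
}

// examplePolicy returns the values used in this article's workflow.
func examplePolicy() retryPolicy {
	return retryPolicy{
		InitialInterval:    time.Second,
		BackoffCoefficient: 2.0,
		MaximumInterval:    100 * time.Second,
		MaximumAttempts:    5,
	}
}

func main() {
	fmt.Println(backoffSchedule(examplePolicy())) // [1s 2s 4s 8s]
}
```

With a 10-second StartToCloseTimeout, five attempts plus these waits add up to roughly 65 seconds in the worst case before the activity fails for good, which is a useful number to compare against what the caller is willing to wait.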

    Here's an example of the payment activities, ProcessPayment and RefundPayment. Note how ProcessPayment simulates a call to an external payment gateway.

    go
    // activities/payment_activity.go
    package activities
    
    import (
    	"context"
    	"fmt"
    	"time"
    
    	"go.temporal.io/sdk/activity"
    )
    
    // Simulating a call to a payment microservice
    func ProcessPayment(ctx context.Context, orderID string, amount float64, idempotencyKey string) (string, error) {
    	logger := activity.GetLogger(ctx)
    	logger.Info("Processing payment", "OrderID", orderID, "Amount", amount)
    
    	// In a real-world scenario, this would be an RPC call to the payment service.
    	// The idempotencyKey would be passed in the request header/body.
    
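    	// Simulate a long-running API call, heartbeating so the server can detect a stalled worker
    	for i := 0; i < 5; i++ {
    		activity.RecordHeartbeat(ctx, i)
    		select {
    		case <-ctx.Done():
    			// Stop early if the activity was cancelled or timed out
    			return "", ctx.Err()
    		case <-time.After(1 * time.Second):
    		}
    	}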
    
    	// Simulate a potential failure
    	if amount > 1000 {
    		return "", fmt.Errorf("payment gateway declined for amount > 1000")
    	}
    
    	transactionID := fmt.Sprintf("txn-%s", idempotencyKey)
    	logger.Info("Payment processed successfully", "TransactionID", transactionID)
    	return transactionID, nil
    }
    
    func RefundPayment(ctx context.Context, transactionID string, idempotencyKey string) error {
    	logger := activity.GetLogger(ctx)
    	logger.Info("Refunding payment", "TransactionID", transactionID)
    
    	// RPC call to the payment service to issue a refund.
    	// The idempotencyKey ensures we don't double-refund.
    
    	logger.Info("Payment refunded successfully")
    	return nil
    }

    The Orchestrator: Implementing the Workflow with Compensation

    The core of our Saga is the workflow function. We'll use Go's defer statement to manage the compensation stack: each successful step appends its compensation to a slice, and one deferred function unwinds that slice in last-in-first-out (LIFO) order when the workflow function returns with an error. This mirrors the reverse-order execution required for Saga compensations.

    go
    // workflow.go
    package main
    
    import (
    	"fmt"
    	"time"
    
    	"your_project/activities"
    	"your_project/shared"
    
    	"go.temporal.io/sdk/temporal"
    	"go.temporal.io/sdk/workflow"
    )
    
    // The named result err lets the deferred compensation logic inspect the error
    // the workflow actually returns.
    func OrderProcessingWorkflow(ctx workflow.Context, orderDetails shared.OrderDetails) (result string, err error) {
    	// Configure activity options
    	ao := workflow.ActivityOptions{
    		StartToCloseTimeout: 10 * time.Second,
    		// HeartbeatTimeout is crucial for long-running activities
    		HeartbeatTimeout: 2 * time.Second,
    		RetryPolicy: &temporal.RetryPolicy{
    			InitialInterval:    time.Second,
    			BackoffCoefficient: 2.0,
    			MaximumInterval:    100 * time.Second,
    			MaximumAttempts:    5, // Or 0 for unlimited
    		},
    	}
    	ctx = workflow.WithActivityOptions(ctx, ao)
    
    	logger := workflow.GetLogger(ctx)
    	logger.Info("Order processing workflow started", "OrderID", orderDetails.OrderID)
    
    	// 1. Create Order in DB
    	var orderID string
    	err = workflow.ExecuteActivity(ctx, activities.CreateOrder, orderDetails).Get(ctx, &orderID)
    	if err != nil {
    		logger.Error("Failed to create order", "Error", err)
    		return "", err
    	}
    
    	// Compensation stack: each entry runs one compensating activity on the context it is given
    	var compensations []func(workflow.Context) error
    	defer func() {
    		// Compensate only if the workflow is returning a non-nil error
    		if err == nil {
    			return
    		}
    		logger.Warn("Workflow failed, starting compensation logic", "OrderID", orderID)
    		// Run compensations on a disconnected context so they still execute
    		// if the main workflow context has been cancelled
    		compensationCtx, _ := workflow.NewDisconnectedContext(ctx)
    		// Walk the slice backwards so the most recent step is undone first (LIFO)
    		for i := len(compensations) - 1; i >= 0; i-- {
    			if compErr := compensations[i](compensationCtx); compErr != nil {
    				logger.Error("Compensation activity failed", "Error", compErr)
    				// Handle non-compensatable errors here (e.g., signal for manual intervention)
    			}
    		}
    	}()
    
    	// 2. Reserve Inventory
    	err = workflow.ExecuteActivity(ctx, activities.ReserveInventory, orderID, orderDetails.Items).Get(ctx, nil)
    	if err != nil {
    		logger.Error("Failed to reserve inventory", "Error", err)
    		// Nothing to compensate yet, as only the initial order was created.
    		// Mark the order as failed before returning.
    		statusErr := workflow.ExecuteActivity(ctx, activities.UpdateOrderStatus, orderID, "FAILED_INVENTORY_RESERVATION").Get(ctx, nil)
    		if statusErr != nil {
    			logger.Error("Failed to mark order as failed", "Error", statusErr)
    		}
    		return "", err
    	}
    	// Add compensation to the stack
    	compensations = append(compensations, func(cctx workflow.Context) error {
    		return workflow.ExecuteActivity(cctx, activities.ReleaseInventory, orderID, orderDetails.Items).Get(cctx, nil)
    	})
    
    	// 3. Process Payment
    	var transactionID string
    	idempotencyKey := workflow.GetInfo(ctx).WorkflowExecution.ID
    	err = workflow.ExecuteActivity(ctx, activities.ProcessPayment, orderID, orderDetails.TotalAmount, idempotencyKey).Get(ctx, &transactionID)
    	if err != nil {
    		logger.Error("Failed to process payment", "Error", err)
    		return "", err // Defer will trigger compensations
    	}
    	// Add compensation to the stack
    	compensations = append(compensations, func(cctx workflow.Context) error {
    		refundIdempotencyKey := fmt.Sprintf("refund-%s", idempotencyKey)
    		return workflow.ExecuteActivity(cctx, activities.RefundPayment, transactionID, refundIdempotencyKey).Get(cctx, nil)
    	})
    
    	// 4. Update Order Status
    	err = workflow.ExecuteActivity(ctx, activities.UpdateOrderStatus, orderID, "CONFIRMED").Get(ctx, nil)
    	if err != nil {
    		logger.Error("Failed to update order status", "Error", err)
    		return "", err // Defer will trigger compensations
    	}
    	// No direct compensation for status update, as refunding is the business compensation.
    
    	// 5. Dispatch Shipping
    	err = workflow.ExecuteActivity(ctx, activities.DispatchShipping, orderID).Get(ctx, nil)
    	if err != nil {
    		logger.Error("Failed to dispatch shipping", "Error", err)
    		// This is a partial failure. The order is paid. We don't want to refund yet.
    		// We can update status and let another process handle it.
    		statusErr := workflow.ExecuteActivity(ctx, activities.UpdateOrderStatus, orderID, "PAYMENT_CONFIRMED_SHIPPING_FAILED").Get(ctx, nil)
    		if statusErr != nil {
    			logger.Error("Failed to record shipping failure", "Error", statusErr)
    		}
    		// Returning nil sets the named result err to nil before the defer runs,
    		// so compensations are NOT triggered. The workflow succeeds technically.
    		return "Order completed with shipping failure. Manual intervention required.", nil
    	}
    
    	logger.Info("Workflow completed successfully", "OrderID", orderID)
    	return "Order processed successfully", nil
    }

    Key Patterns in this Implementation:

  • Centralized Activity Options: We define ActivityOptions once and apply them to the context. This is good for consistency, but you can override them for specific activity calls if needed.
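  • Heartbeat Timeout: The HeartbeatTimeout is set to 2 seconds, and our ProcessPayment activity calls activity.RecordHeartbeat every second. If the worker process crashes during this activity, the Temporal server notices the missing heartbeat once the 2-second timeout elapses, fails that attempt, and retries the activity according to the retry policy, possibly on another worker. The retry starts the activity function from the beginning; to resume from the last known state, the activity has to read the details it recorded with its last heartbeat (activity.GetHeartbeatDetails).
  • LIFO Compensation with defer: The deferred function holds the core compensation logic. It always runs when the workflow function returns, but it should only compensate when the returned err is non-nil, which requires err to be the function's named result so the defer sees the value actually being returned. The compensations slice is appended to in forward order, and a single defer does not reverse anything by itself, so the loop has to walk the slice from the last element to the first to get the LIFO order a Saga needs.
  • Disconnected Context for Compensations: We use workflow.NewDisconnectedContext for compensation activities. This is critical. If the workflow is cancelled, the original ctx is cancelled too, and any activity started on it fails immediately. A disconnected context lets our cleanup logic (compensations) still run to completion after a cancellation. It does not help when the workflow is terminated or hits its execution timeout, because in those cases no further workflow code runs at all.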
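
The heartbeat point deserves a closer look, because resuming is something the activity has to do itself. The sketch below simulates the idea with the standard library only: checkpoint stands in for the details attached to the last heartbeat, and processItems and crashAt are hypothetical names for this illustration. In a real activity the checkpoint would be written with activity.RecordHeartbeat and read back on retry with activity.HasHeartbeatDetails and activity.GetHeartbeatDetails.

```go
package main

import (
	"errors"
	"fmt"
)

// checkpoint stands in for the details attached to the last heartbeat.
type checkpoint struct {
	lastDone int
	valid    bool
}

// processItems handles items [0,total) and records progress after each one.
// crashAt simulates a worker crash just before that item (-1 means never).
// It returns how many items this attempt actually processed.
func processItems(total int, cp *checkpoint, crashAt int) (int, error) {
	start := 0
	if cp.valid {
		start = cp.lastDone + 1 // resume after the last recorded item
	}
	processed := 0
	for i := start; i < total; i++ {
		if i == crashAt {
			return processed, errors.New("worker crashed")
		}
		processed++
		cp.lastDone, cp.valid = i, true // the simulated heartbeat
	}
	return processed, nil
}

func main() {
	cp := &checkpoint{}
	first, err := processItems(5, cp, 3)
	fmt.Println(first, err) // 3 worker crashed
	second, err := processItems(5, cp, -1)
	fmt.Println(second, err) // 2 <nil>
}
```

The second attempt processes only the two remaining items. Without the checkpoint it would redo all five, which is what happens to an activity that heartbeats but never reads its heartbeat details.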

    Handling Advanced Edge Cases and Failure Modes

    A simple happy-path-with-rollback Saga is table stakes. Production systems demand resilience against more complex failures.

    Edge Case 1: Idempotency is Non-Negotiable

    Temporal's at-least-once execution guarantee for activities means an activity might run more than once (e.g., a worker crashes after completing the work but before reporting back to the server). If your ProcessPayment activity is not idempotent, you will double-charge customers.

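    The Solution: The workflow must generate a stable key for each non-idempotent operation. A good candidate is WorkflowExecution.ID, the Workflow ID: it stays the same across activity retries and across retries of the workflow itself, so every attempt presents the same key. (The RunID, by contrast, changes on each run.) If a workflow performs more than one such operation, add a per-step suffix to keep the keys distinct.

    In our workflow code:

    go
    // Use the Workflow ID as a key that stays stable across retries
    idempotencyKey := workflow.GetInfo(ctx).WorkflowExecution.ID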
    // Pass it to the activity
    err = workflow.ExecuteActivity(ctx, activities.ProcessPayment, orderID, orderDetails.TotalAmount, idempotencyKey).Get(ctx, &transactionID)

    Your downstream microservice (the Payment Service) MUST be designed to handle this key. A common pattern is to have a database table (idempotency_key, transaction_result).

    Pseudo-code for the Payment Service endpoint:

    go
    func (s *PaymentService) ChargeCard(request ChargeRequest) (*ChargeResponse, error) {
        dbTx, err := s.db.Begin()
        if err != nil {
            return nil, err
        }
        defer dbTx.Rollback() // Has no effect once the transaction is committed

        // 1. Check if we've seen this idempotency key before.
        // A unique constraint on key is still needed to serialize two concurrent first requests.
        var stored string
        err = dbTx.QueryRow("SELECT result FROM processed_transactions WHERE key = ? FOR UPDATE", request.IdempotencyKey).Scan(&stored)
        switch {
        case err == nil && stored == "FAILED":
            return nil, errors.New("payment already failed for this idempotency key")
        case err == nil:
            // We've already processed this. Return the stored result.
            return &ChargeResponse{TransactionID: stored}, nil
        case !errors.Is(err, sql.ErrNoRows):
            return nil, err // Some other DB error
        }

        // 2. Not seen before. Process the payment.
        gatewayResponse, err := s.paymentGateway.Charge(request.CardDetails, request.Amount)
        result := "FAILED" // Stored on failure so we don't retry a failing card indefinitely
        if err == nil {
            result = gatewayResponse.TransactionID
        }

        // 3. Store the result before returning (storeResult is assumed to return an error)
        if storeErr := s.storeResult(dbTx, request.IdempotencyKey, result); storeErr != nil {
            return nil, storeErr
        }
        if commitErr := dbTx.Commit(); commitErr != nil {
            return nil, commitErr
        }
        if err != nil {
            return nil, err
        }
        return &ChargeResponse{TransactionID: gatewayResponse.TransactionID}, nil
    }

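    This pattern ensures that even if the Temporal activity retries 10 times, the service calls the gateway only once for a key whose result has been stored, so the customer is only charged once. To also cover a crash between the gateway call and the database commit, forward the same idempotency key to the payment gateway if it supports one.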
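
The effect is easy to demonstrate in isolation. The following is a minimal in-memory sketch, not a substitute for the database-backed version above: paymentService, newPaymentService, Charge, and the key format are all hypothetical names chosen for this illustration, and a mutex-guarded map stands in for the processed_transactions table.

```go
package main

import (
	"fmt"
	"sync"
)

// paymentService is a hypothetical in-memory stand-in for the Payment Service.
type paymentService struct {
	mu           sync.Mutex
	results      map[string]string // idempotency key -> transaction ID
	gatewayCalls int               // how often the simulated gateway was charged
}

func newPaymentService() *paymentService {
	return &paymentService{results: make(map[string]string)}
}

// Charge returns the stored transaction ID when the key was seen before and
// only calls the simulated gateway for a key it has not seen yet.
func (s *paymentService) Charge(key string, amountCents int64) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if txn, seen := s.results[key]; seen {
		return txn
	}
	s.gatewayCalls++ // the simulated gateway charge
	txn := fmt.Sprintf("txn-%d", s.gatewayCalls)
	s.results[key] = txn
	return txn
}

func main() {
	svc := newPaymentService()
	key := "order-workflow-42/ProcessPayment"
	for attempt := 1; attempt <= 3; attempt++ {
		fmt.Println(attempt, svc.Charge(key, 4999))
	}
	fmt.Println("gateway calls:", svc.gatewayCalls) // gateway calls: 1
}
```

All three attempts return txn-1 and the gateway is charged once. The key combines a workflow identifier with a step name, which keeps keys distinct if one workflow performs several charges.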

    Edge Case 2: Non-Compensatable Failures (The Human-in-the-loop Pattern)

    What happens if your RefundPayment activity fails? Perhaps the payment gateway's refund API is down for an extended period. The workflow is now stuck in a failed state, unable to complete its compensation. This is a non-compensatable error.

    The Solution: The workflow should stop trying to compensate automatically and signal for human intervention.

    We can modify our defer block to handle this:

    go
    // ... inside the defer block ...
    // Assumes each compensation is a func(workflow.Context) error that runs one compensating activity.
    compensationCtx, _ := workflow.NewDisconnectedContext(ctx)
    // Use a dedicated, bounded retry policy for compensations so that a
    // persistent failure surfaces as an error instead of retrying forever.
    compensationAo := workflow.ActivityOptions{
        StartToCloseTimeout: 30 * time.Second,
        RetryPolicy:         &temporal.RetryPolicy{MaximumAttempts: 3},
    }
    compensationCtx = workflow.WithActivityOptions(compensationCtx, compensationAo)

    for i := len(compensations) - 1; i >= 0; i-- {
        if compErr := compensations[i](compensationCtx); compErr != nil {
            logger.Error("CRITICAL: Compensation activity failed. Manual intervention required.", "Error", compErr)
            // The workflow will now fail and remain in a failed state in the Temporal UI.
            // An external monitoring system should alert on workflows failing in this specific state.
            // Alternatively, we can use Signals to attempt a fix.
        }
    }

    An operations team can then be alerted. They can inspect the failed workflow in the Temporal UI, diagnose the external issue (e.g., the payment gateway outage), and once it's resolved, they could potentially use a Temporal Signal to instruct the workflow to retry the failed compensation step. This requires more advanced workflow logic to handle such signals.

    Code Example: Handling a RetryCompensation Signal

    go
    // Add this to your workflow logic
    
    // A channel to receive the signal
    retrySignalChan := workflow.GetSignalChannel(ctx, "RetryCompensationSignal")
    
    // ... inside the compensation loop, replacing the error handling ...
    // Assumes compensations[i] is a func(workflow.Context) error that runs one compensating activity.
    compErr := compensations[i](compensationCtx)
    for compErr != nil {
        logger.Error("Compensation failed, awaiting manual signal to retry...", "Error", compErr)
        // Block until an operator sends the signal
        retrySignalChan.Receive(compensationCtx, nil)
        logger.Info("Retry signal received, re-attempting compensation.")
        // Re-run the failed compensation and keep waiting for signals until it succeeds
        compErr = compensations[i](compensationCtx)
    }

    An operator could then send the signal using the tctl CLI:

    bash
    tctl workflow signal --workflow_id <your_workflow_id> --name RetryCompensationSignal
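
The control flow of this pattern (try, wait for a human, try again) can be exercised without a Temporal cluster. In the standard-library sketch below a plain Go channel stands in for the signal channel; compensateWithOperator and flakyRefund are hypothetical names for this illustration, and closing the channel models an operator giving up.

```go
package main

import (
	"errors"
	"fmt"
)

// compensateWithOperator runs compensate and, after each failure, blocks until
// a value arrives on retry (standing in for the operator's signal). It returns
// the number of attempts made, or an error if the channel is closed before the
// compensation succeeds.
func compensateWithOperator(compensate func() error, retry <-chan struct{}) (int, error) {
	attempts := 0
	for {
		attempts++
		err := compensate()
		if err == nil {
			return attempts, nil
		}
		if _, open := <-retry; !open {
			return attempts, fmt.Errorf("gave up after %d attempts: %w", attempts, err)
		}
	}
}

// flakyRefund fails the given number of times, simulating a gateway outage
// that eventually ends.
func flakyRefund(failures int) func() error {
	return func() error {
		if failures > 0 {
			failures--
			return errors.New("refund API unavailable")
		}
		return nil
	}
}

func main() {
	retry := make(chan struct{}, 2)
	retry <- struct{}{} // operator retries once: gateway still down
	retry <- struct{}{} // operator retries again: outage resolved
	attempts, err := compensateWithOperator(flakyRefund(2), retry)
	fmt.Println(attempts, err) // 3 <nil>
}
```

In a workflow the waiting state is durable: the workflow can sit blocked on the signal for days and survive worker restarts, which an in-process channel cannot do.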

    Edge Case 3: Partial Success is a Business Decision

    In our example, if DispatchShipping fails, the order is already paid for. The business likely does not want to automatically refund the customer. They'd prefer to retry shipping later or contact the customer.

    Our workflow code correctly handles this:

    go
    // ...
    err = workflow.ExecuteActivity(ctx, activities.DispatchShipping, orderID).Get(ctx, nil)
    if err != nil {
        logger.Error("Failed to dispatch shipping", "Error", err)
        // Update status to a specific error state
        workflow.ExecuteActivity(ctx, activities.UpdateOrderStatus, orderID, "PAYMENT_CONFIRMED_SHIPPING_FAILED").Get(ctx, nil)
        
        // DO NOT return the error. This prevents the compensation stack from running.
        return "Order completed with shipping failure. Manual intervention required.", nil
    }

    By not returning an error object, we tell Temporal the workflow has technically succeeded. For the compensation defer to stay quiet, it must check the error the function actually returns (a named result in Go), not a local variable that still holds the shipping error. The return value indicates the partial failure, and another system (e.g., a dashboard for the logistics team) can now query for workflows that ended in this state and take appropriate action.
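
Whether the deferred compensation is skipped depends on which err variable the defer inspects, and this is an easy detail to get wrong. The standard-library sketch below contrasts the two variants; withLocalErr and withNamedErr are hypothetical names for this illustration, and a boolean flag stands in for running the compensation stack.

```go
package main

import (
	"errors"
	"fmt"
)

// withLocalErr checks a local variable in its defer. The variable still holds
// the shipping error when the function returns nil, so the rollback runs.
func withLocalErr(rolledBack *bool) (string, error) {
	var err error
	defer func() {
		if err != nil {
			*rolledBack = true
		}
	}()
	err = errors.New("shipping unavailable")
	if err != nil {
		return "completed with shipping failure", nil
	}
	return "ok", nil
}

// withNamedErr checks the named result. "return ..., nil" assigns nil to err
// before deferred functions run, so the rollback is skipped.
func withNamedErr(rolledBack *bool) (result string, err error) {
	defer func() {
		if err != nil {
			*rolledBack = true
		}
	}()
	err = errors.New("shipping unavailable")
	if err != nil {
		return "completed with shipping failure", nil
	}
	return "ok", nil
}

func main() {
	var a, b bool
	withLocalErr(&a)
	withNamedErr(&b)
	fmt.Println(a, b) // true false
}
```

Both functions return a nil error to the caller, yet only the named-result version leaves the payment untouched. In the local-variable version the customer would be refunded even though the workflow reported success.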

    Performance and Scalability Considerations

    Workflow History Size and Continue-As-New

    Temporal records every event in a workflow's execution (workflow start, activity scheduled, activity completed, timer fired, etc.) in a Workflow History. This history is replayed on a worker to recover the state of a workflow. However, this history has size and length limits (typically 50MB and 50,000 events).

    A Saga that runs for a very long time (e.g., a monthly subscription workflow that runs for years) or has a huge number of steps can exceed these limits.

    The Solution: The Continue-As-New pattern. This allows a workflow to complete its execution and start a new, fresh execution of itself in a single atomic operation, carrying over state as needed. This effectively resets the history.

    Example: A Monthly Subscription Saga

    go
    func SubscriptionWorkflow(ctx workflow.Context, subDetails Subscription) error {
        // Run this month's billing logic (a mini-Saga)
        err := runBillingCycle(ctx, subDetails)
        if err != nil {
            // Handle billing failure, maybe signal for dunning process
            return err
        }
    
        // Wait for the next billing cycle. Sleeping on ctx keeps the timer cancellable
        // along with the workflow; a disconnected context is not needed here.
        err = workflow.Sleep(ctx, 30*24*time.Hour)
        if err != nil {
            return err
        }
    
        // Instead of looping, continue as a new workflow
        return workflow.NewContinueAsNewError(ctx, SubscriptionWorkflow, subDetails)
    }

    This workflow runs the billing logic, sleeps for a month, and then, instead of looping (which would grow the history), it starts a new instance of itself. The history size never grows beyond one month's worth of events, allowing the subscription to run indefinitely.
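
The same idea works when a single run should handle several cycles before rolling over. The sketch below is a standard-library simulation, not Temporal code: runOnce, runSubscription, and the per-cycle event cost are assumptions made for illustration. Each simulated run bills cycles until its history would exceed a budget, then hands its state to a fresh run.

```go
package main

import "fmt"

// subscriptionState is the state carried from one run to the next.
type subscriptionState struct {
	CyclesBilled int
}

// runOnce simulates a single workflow run: it bills cycles until either the
// subscription is complete or the next cycle would push the run's history past
// maxEvents, in which case it asks to continue as new.
// Assumes eventsPerCycle <= maxEvents, otherwise no run could make progress.
func runOnce(st subscriptionState, totalCycles, eventsPerCycle, maxEvents int) (next subscriptionState, continueAsNew bool, events int) {
	for st.CyclesBilled < totalCycles {
		if events+eventsPerCycle > maxEvents {
			return st, true, events
		}
		events += eventsPerCycle
		st.CyclesBilled++
	}
	return st, false, events
}

// runSubscription chains runs until the subscription completes and reports
// how many runs were needed and the largest history any single run reached.
func runSubscription(totalCycles, eventsPerCycle, maxEvents int) (runs, peakEvents int) {
	st := subscriptionState{}
	for {
		next, again, events := runOnce(st, totalCycles, eventsPerCycle, maxEvents)
		runs++
		if events > peakEvents {
			peakEvents = events
		}
		st = next
		if !again {
			return runs, peakEvents
		}
	}
}

func main() {
	runs, peak := runSubscription(120, 20, 500)
	fmt.Println(runs, peak) // 5 500
}
```

Ten years of monthly billing at an assumed 20 events per cycle would produce 2,400 events in one history; with a budget of 500 it is spread over five runs and no run exceeds the budget. Only subscriptionState crosses the boundary, so anything the next run needs must be part of that state.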

    Conclusion: Beyond a Pattern, a Platform

    The Saga pattern is a powerful concept for maintaining consistency in a distributed architecture. However, its manual implementation is a significant engineering burden. By leveraging a durable execution platform like Temporal, you offload the most complex and error-prone aspects of Saga implementation—state management, retries, timeouts, and logging—to the platform itself.

    This allows you to write your Saga orchestration as clear, sequential, and highly testable code. You can focus on the core business logic and the nuanced handling of edge cases, such as idempotency and non-compensatable failures, which is where the real value lies. By adopting these production-grade patterns, you can build truly resilient, observable, and scalable long-running workflows that form the backbone of your microservices architecture.
