Implementing the Saga Pattern with Temporal for Resilient Microservices

Goh Ling Yong

The Peril of Distributed Transactions and the Promise of Sagas

In a microservices architecture, keeping data consistent across service boundaries is a constant source of complexity. The two-phase commit (2PC) protocol, the classic way to make a transaction atomic across several databases, is brittle and scales poorly in this setting because it requires synchronous coordination and holds locks on every participant until the coordinator decides. The Saga pattern is a widely used alternative for managing long-running transactions.

A Saga is a sequence of local transactions where each transaction updates data within a single service and publishes a message or event to trigger the next transaction in the chain. If a local transaction fails, a series of compensating transactions are executed to semantically undo the work of the preceding successful transactions.
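
To make the definition concrete, here is a minimal plain-Go sketch of the idea, with no Temporal involved. The names Step, RunSaga, and errDeclined are hypothetical and exist only for this illustration. Each step carries its own compensation, and on failure the completed steps are undone in reverse order.

```go
package main

import (
	"errors"
	"fmt"
)

// errDeclined is a hypothetical failure used by the example in main.
var errDeclined = errors.New("card declined")

// Step pairs a local transaction with the compensation that semantically undoes it.
type Step struct {
	Name       string
	Action     func() error
	Compensate func() error
}

// RunSaga executes steps in order. On the first failure it runs the
// compensations of the already-completed steps in reverse order and
// returns the original error. Every event is appended to trace.
func RunSaga(steps []Step, trace *[]string) error {
	var done []Step
	for _, s := range steps {
		if err := s.Action(); err != nil {
			*trace = append(*trace, "failed:"+s.Name)
			for i := len(done) - 1; i >= 0; i-- {
				// A real system must also decide what to do when a compensation fails.
				if cerr := done[i].Compensate(); cerr != nil {
					*trace = append(*trace, "compensation-failed:"+done[i].Name)
					continue
				}
				*trace = append(*trace, "compensated:"+done[i].Name)
			}
			return fmt.Errorf("saga aborted at %s: %w", s.Name, err)
		}
		*trace = append(*trace, "done:"+s.Name)
		done = append(done, s)
	}
	return nil
}

func main() {
	ok := func() error { return nil }
	declined := func() error { return errDeclined }

	var trace []string
	err := RunSaga([]Step{
		{Name: "reserve-inventory", Action: ok, Compensate: ok},
		{Name: "process-payment", Action: declined, Compensate: ok},
		{Name: "ship-order", Action: ok, Compensate: ok},
	}, &trace)

	fmt.Println("error:", err)
	for _, event := range trace {
		fmt.Println(event)
	}
}
```

Note what this sketch leaves out: nothing here survives a process crash, and a failed compensation is merely recorded. Those gaps are exactly the responsibilities listed next.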

However, implementing a Saga from scratch is a significant engineering challenge. You become responsible for:

  • State Management: Persisting the state of the Saga at each step. Where are we in the process? Which steps succeeded? Which compensations are required?
  • Orchestration Logic: A central orchestrator or a choreographed event-driven flow needs to reliably trigger the next step or compensation.
  • Failure Handling: What happens if the orchestrator crashes? Or if a compensating transaction fails repeatedly?
  • Temporal Logic: Handling timeouts, retries with backoff, and other time-based concerns.

    This is where Temporal.io fundamentally changes the game. Temporal is not a message queue or a simple task scheduler; it's a durable execution system. It allows you to write complex, long-running, and stateful logic as a single piece of code, a Workflow, while the Temporal cluster handles the persistence, retries, and state management. The state of your Saga lives within the durable context of a Temporal Workflow, not in a fragile, custom-built state machine in your database.

    This article will demonstrate how to implement a robust Saga pattern in Go using Temporal, focusing on production-grade patterns that address idempotency, error handling, and long-term maintainability.

    Our Scenario: An E-commerce Order Processing Saga

    We will model a classic e-commerce order workflow, which is an ideal candidate for a Saga due to its multi-service, long-running nature:

  • Create Order: The initial state in the OrderService.
  • Reserve Inventory: The InventoryService must reserve the items.
  • Process Payment: The PaymentService must charge the customer.
  • Ship Order: The ShippingService must arrange for delivery.
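
    If the payment fails, we must compensate by un-reserving the inventory. If shipping fails, we must refund the payment and then un-reserve the inventory. If inventory reservation fails, there is nothing to undo, so we simply fail the workflow. This is a classic orchestration-based Saga. In the code that follows, the order is assumed to exist already, so the workflow starts at the inventory step.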


    Section 1: Workflow and Activity Definitions

    First, let's define the structure of our workflow and the activities it will orchestrate. A Temporal Workflow is the orchestrator, and each step in the Saga is a Temporal Activity. Activities are where you interact with the outside world (databases, APIs, etc.).

    Project Structure

    text
    /temporal-saga-project
    ├── activities
    │   ├── inventory_activity.go
    │   ├── payment_activity.go
    │   └── shipping_activity.go
    ├── shared
    │   └── types.go
    ├── workflow
    │   └── order_workflow.go
    ├── worker
    │   └── main.go
    └── starter
        └── main.go

    Shared Data Structures

    We'll start with the data structures that will be passed between our workflow and activities.

    shared/types.go

    go
    package shared
    
    // OrderDetails contains all the information for an order.
    type OrderDetails struct {
    	OrderID    string
    	UserID     string
    	Items      []string
    	TotalPrice float64
    }
    
    // PaymentDetails for processing payment.
    type PaymentDetails struct {
    	OrderID       string
    	UserID        string
    	Amount        float64
    	TransactionID string
    }
    
    // ShippingDetails for shipping the order.
    type ShippingDetails struct {
    	OrderID string
    	UserID  string
    	Address string
    }

    The Saga Workflow Definition

    The core of our Saga is the OrderWorkflow. Notice how it reads like straight-line procedural code. The magic of Temporal is that this code can effectively 'sleep' for days or weeks between steps, and if the worker process crashes, Temporal will seamlessly resume the execution on another worker from the exact same point.

    workflow/order_workflow.go

    go
    package workflow
    
    import (
    	"fmt"
    	"time"
    
    	"go.temporal.io/sdk/temporal"
    	"go.temporal.io/sdk/workflow"
    
    	"temporal-saga-project/activities"
    	"temporal-saga-project/shared"
    )
    
    func OrderWorkflow(ctx workflow.Context, orderDetails shared.OrderDetails) (string, error) {
    	// Configure activity options with timeouts and retry policies.
    	// RetryPolicy is defined in the temporal package, not in workflow.
    	ao := workflow.ActivityOptions{
    		StartToCloseTimeout: 10 * time.Second,
    		RetryPolicy: &temporal.RetryPolicy{
    			InitialInterval:    time.Second,
    			BackoffCoefficient: 2.0,
    			MaximumInterval:    100 * time.Second,
    			MaximumAttempts:    3, // total attempts, including the first
    		},
    	}
    	ctx = workflow.WithActivityOptions(ctx, ao)
    
    	logger := workflow.GetLogger(ctx)
    	logger.Info("Order workflow started", "OrderID", orderDetails.OrderID)
    
    	// 1. Reserve Inventory
    	var inventoryResult string
    	err := workflow.ExecuteActivity(ctx, activities.ReserveInventory, orderDetails).Get(ctx, &inventoryResult)
    	if err != nil {
    		logger.Error("Failed to reserve inventory", "Error", err)
    		return "", fmt.Errorf("inventory reservation failed: %w", err)
    	}
    
    	// Add compensation for inventory reservation.
    	// This will execute if the workflow fails at any point after this line.
    	defer func() {
    		if err != nil {
    			// Workflow failed, we need to compensate.
    			logger.Warn("Workflow failed, compensating inventory reservation.")
    			compensationCtx, _ := workflow.NewDisconnectedContext(ctx)
    			_ = workflow.ExecuteActivity(compensationCtx, activities.CancelInventoryReservation, orderDetails).Get(compensationCtx, nil)
    		}
    	}()
    
    	// 2. Process Payment
    	paymentDetails := shared.PaymentDetails{
    		OrderID: orderDetails.OrderID,
    		UserID:  orderDetails.UserID,
    		Amount:  orderDetails.TotalPrice,
    	}
    	var paymentResult string
    	err = workflow.ExecuteActivity(ctx, activities.ProcessPayment, paymentDetails).Get(ctx, &paymentResult)
    	if err != nil {
    		logger.Error("Failed to process payment", "Error", err)
    		return "", fmt.Errorf("payment processing failed: %w", err)
    	}
    
    	// Add compensation for payment.
    	defer func() {
    		if err != nil {
    			// Workflow failed, we need to compensate.
    			logger.Warn("Workflow failed, compensating payment.")
    			compensationCtx, _ := workflow.NewDisconnectedContext(ctx)
    			_ = workflow.ExecuteActivity(compensationCtx, activities.RefundPayment, paymentDetails).Get(compensationCtx, nil)
    		}
    	}()
    
    	// 3. Ship Order
    	shippingDetails := shared.ShippingDetails{
    		OrderID: orderDetails.OrderID,
    		UserID:  orderDetails.UserID,
    		Address: "123 Temporal Lane", // Example address
    	}
    	var shippingResult string
    	err = workflow.ExecuteActivity(ctx, activities.ShipOrder, shippingDetails).Get(ctx, &shippingResult)
    	if err != nil {
    		logger.Error("Failed to ship order", "Error", err)
    		return "", fmt.Errorf("shipping failed: %w", err)
    	}
    
    	// No compensation for shipping in this simple model once it's shipped.
    
    	result := fmt.Sprintf("Order %s processed successfully! Inventory: %s, Payment: %s, Shipping: %s", orderDetails.OrderID, inventoryResult, paymentResult, shippingResult)
    	logger.Info("Workflow completed successfully.")
    
    	return result, nil
    }
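
    The retry policy in the workflow above determines how long a failing activity keeps the Saga waiting before compensation starts. The following stdlib-only sketch computes the waits between attempts, assuming the interval starts at InitialInterval, is multiplied by BackoffCoefficient after each failed attempt, and is capped at MaximumInterval. The Policy type and RetryWaits function are hypothetical helpers for this illustration, not part of the Temporal SDK.

```go
package main

import (
	"fmt"
	"time"
)

// Policy mirrors the four retry fields used in the workflow above.
type Policy struct {
	InitialInterval    time.Duration
	BackoffCoefficient float64
	MaximumInterval    time.Duration
	MaximumAttempts    int
}

// orderPolicy holds the same values as the article's ActivityOptions.
var orderPolicy = Policy{
	InitialInterval:    time.Second,
	BackoffCoefficient: 2.0,
	MaximumInterval:    100 * time.Second,
	MaximumAttempts:    3,
}

// RetryWaits returns the wait before each retry. With N total attempts
// there are N-1 waits. Unlimited attempts (MaximumAttempts <= 0) are not
// modelled here and yield an empty schedule.
func RetryWaits(p Policy) []time.Duration {
	var waits []time.Duration
	interval := float64(p.InitialInterval)
	for attempt := 1; attempt < p.MaximumAttempts; attempt++ {
		if p.MaximumInterval > 0 && interval > float64(p.MaximumInterval) {
			interval = float64(p.MaximumInterval)
		}
		waits = append(waits, time.Duration(interval))
		interval *= p.BackoffCoefficient
	}
	return waits
}

func main() {
	fmt.Println(RetryWaits(orderPolicy)) // prints [1s 2s]
}
```

    Under these assumptions a permanently failing activity delays the rollback by about three seconds plus the run time of the three attempts, each bounded by the 10-second StartToCloseTimeout.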

    The Durable `defer` Pattern for Compensations

    The most critical pattern in this workflow is the use of defer. In regular Go, defer schedules a function call to be executed when the surrounding function returns. In a Temporal workflow the mechanics are the same, but the pattern survives worker crashes. Temporal does not persist the defer stack itself; it persists the workflow's event history. When another worker picks up the workflow, it replays the code against that history, which registers the same deferred functions again in the same order. The compensations then run when the workflow function finally returns.

    By checking if err != nil inside the deferred function, we create a compensating transaction. The err variable will be nil if the workflow completes successfully, and non-nil if any subsequent ExecuteActivity call fails. This ensures compensations run only when the Saga needs to be rolled back.

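    We also use workflow.NewDisconnectedContext(ctx). This is crucial for compensations. It creates a new context that is not cancelled when the main workflow context is cancelled. Without it, a cancelled workflow could not run its compensation activities, because activities scheduled on a cancelled context fail immediately. This protects against cancellation only: if the workflow is terminated or exceeds its execution timeout, no further workflow code runs, including the deferred compensations.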
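
    The defer mechanics can be tried out in plain Go, without Temporal. The sketch below is a simplified model of the workflow's shape; processOrder, errShipping, and the trace strings are hypothetical. It uses a named return value, a small variation on the article's code, so the deferred closures see whatever error the function finally returns, even if a return statement builds a fresh error.

```go
package main

import (
	"errors"
	"fmt"
)

// errShipping is a hypothetical failure for the last step.
var errShipping = errors.New("carrier unavailable")

// processOrder registers each compensation with defer right after its step
// succeeds. Deferred functions run in last-in, first-out order, so the
// payment is refunded before the inventory reservation is cancelled.
func processOrder(failShipping bool, trace *[]string) (err error) {
	*trace = append(*trace, "reserve inventory")
	defer func() {
		if err != nil {
			*trace = append(*trace, "cancel inventory reservation")
		}
	}()

	*trace = append(*trace, "process payment")
	defer func() {
		if err != nil {
			*trace = append(*trace, "refund payment")
		}
	}()

	if failShipping {
		// The named return value is set here, before the deferred closures run.
		return fmt.Errorf("ship order: %w", errShipping)
	}
	*trace = append(*trace, "ship order")
	return nil
}

func main() {
	var failed, succeeded []string
	fmt.Println(processOrder(true, &failed), failed)
	fmt.Println(processOrder(false, &succeeded), succeeded)
}
```

    The same reasoning explains why a compensation must be registered only after its step has succeeded: a defer that is never reached never runs.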


    Section 2: Implementing Idempotent Activities

    Activities are where side effects happen. A core requirement for a reliable Saga is that these side effects must be idempotent. Temporal guarantees at-least-once execution of activities. If an activity executes but the worker crashes before it can report completion back to the Temporal cluster, the cluster will time out and reschedule the activity. The downstream service must be able to handle this duplicate call without causing issues (e.g., charging a customer twice).

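    Idempotency is typically achieved by passing a unique key with each request. A combination of the WorkflowID and an activity-specific identifier is a good candidate because it stays the same across activity retries. The RunID also works, as long as the workflow is not retried or continued as new, since either produces a new RunID. The mock activities in this article simply use the OrderID as the key.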
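
    As a rough sketch of how a downstream service might honor such a key, the stdlib-only example below records the result of the first call per key and returns it for every duplicate. PaymentGateway, Charge, and IdempotencyKey are hypothetical names. In a real Temporal activity the workflow ID could be read from activity.GetInfo(ctx), and the store would be a database table with a unique constraint rather than a map.

```go
package main

import (
	"fmt"
	"sync"
)

// IdempotencyKey combines the workflow ID with a step name, so every retry
// of the same step in the same workflow produces the same key.
func IdempotencyKey(workflowID, step string) string {
	return workflowID + "/" + step
}

// PaymentGateway is a toy downstream service that deduplicates by key.
type PaymentGateway struct {
	mu      sync.Mutex
	results map[string]string // idempotency key -> recorded result
	Charges int               // number of real side effects performed
}

func NewPaymentGateway() *PaymentGateway {
	return &PaymentGateway{results: make(map[string]string)}
}

// Charge performs the side effect at most once per key. A duplicate call
// returns the result recorded by the first call instead of charging again.
func (g *PaymentGateway) Charge(key string, amountCents int64) string {
	g.mu.Lock()
	defer g.mu.Unlock()

	if res, ok := g.results[key]; ok {
		return res
	}
	g.Charges++
	res := fmt.Sprintf("charged %d cents (charge #%d)", amountCents, g.Charges)
	g.results[key] = res
	return res
}

func main() {
	gateway := NewPaymentGateway()
	key := IdempotencyKey("order-workflow-42", "process-payment")

	fmt.Println(gateway.Charge(key, 12050)) // first delivery
	fmt.Println(gateway.Charge(key, 12050)) // redelivery after a lost acknowledgement
	fmt.Println("real charges:", gateway.Charges)
}
```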

    Here are the mock implementations for our activities, demonstrating how to handle idempotency and simulate failures.

    activities/inventory_activity.go

    go
    package activities
    
    import (
    	"context"
    	"fmt"
    	"sync"
    
    	"temporal-saga-project/shared"
    )
    
    // In a real app, this would be a database or another service client.
    // The mutex guards the map because a worker can run activities concurrently.
    var (
    	inventoryMu   sync.Mutex
    	reservedItems = make(map[string][]string)
    )
    
    func ReserveInventory(ctx context.Context, order shared.OrderDetails) (string, error) {
    	inventoryMu.Lock()
    	defer inventoryMu.Unlock()
    
    	// Idempotency Check: if already reserved for this order, just return success.
    	if _, ok := reservedItems[order.OrderID]; ok {
    		return fmt.Sprintf("Inventory for order %s already reserved (idempotent)", order.OrderID), nil
    	}
    
    	fmt.Printf("Reserving inventory for order: %s\n", order.OrderID)
    	reservedItems[order.OrderID] = order.Items
    	return fmt.Sprintf("Inventory reserved for order %s", order.OrderID), nil
    }
    
    // CancelInventoryReservation is naturally idempotent: deleting a missing key is a no-op.
    func CancelInventoryReservation(ctx context.Context, order shared.OrderDetails) (string, error) {
    	inventoryMu.Lock()
    	defer inventoryMu.Unlock()
    
    	fmt.Printf("Compensating: canceling inventory reservation for order: %s\n", order.OrderID)
    	delete(reservedItems, order.OrderID)
    	return fmt.Sprintf("Inventory reservation canceled for order %s", order.OrderID), nil
    }

    activities/payment_activity.go

    go
    package activities
    
    import (
    	"context"
    	"errors"
    	"fmt"
    	"sync"
    
    	"temporal-saga-project/shared"
    )
    
    // In-memory stand-ins for the payment service's records. The mutex guards
    // the maps because a worker can run activities concurrently.
    var (
    	paymentMu         sync.Mutex
    	processedPayments = make(map[string]bool)
    	refundedPayments  = make(map[string]bool)
    )
    
    func ProcessPayment(ctx context.Context, payment shared.PaymentDetails) (string, error) {
    	paymentMu.Lock()
    	defer paymentMu.Unlock()
    
    	// Idempotency Check: a retried call for an already-charged order is a no-op.
    	if processedPayments[payment.OrderID] {
    		return fmt.Sprintf("Payment for order %s already processed (idempotent)", payment.OrderID), nil
    	}
    
    	fmt.Printf("Processing payment for order: %s for amount %.2f\n", payment.OrderID, payment.Amount)
    
    	// Simulate a payment failure for a specific user to test compensation
    	if payment.UserID == "user-who-fails-payment" {
    		return "", errors.New("payment gateway declined: insufficient funds")
    	}
    
    	processedPayments[payment.OrderID] = true
    	return fmt.Sprintf("Payment successful for order %s", payment.OrderID), nil
    }
    
    func RefundPayment(ctx context.Context, payment shared.PaymentDetails) (string, error) {
    	paymentMu.Lock()
    	defer paymentMu.Unlock()
    
    	// Idempotency Check: a retried refund must not pay the customer twice.
    	if refundedPayments[payment.OrderID] {
    		return fmt.Sprintf("Payment for order %s already refunded (idempotent)", payment.OrderID), nil
    	}
    
    	fmt.Printf("Compensating: refunding payment for order: %s\n", payment.OrderID)
    	delete(processedPayments, payment.OrderID)
    	refundedPayments[payment.OrderID] = true
    	return fmt.Sprintf("Payment refunded for order %s", payment.OrderID), nil
    }

    activities/shipping_activity.go

    go
    package activities
    
    import (
    	"context"
    	"fmt"
    
    	"temporal-saga-project/shared"
    )
    
    func ShipOrder(ctx context.Context, shipping shared.ShippingDetails) (string, error) {
    	fmt.Printf("Shipping order %s to %s\n", shipping.OrderID, shipping.Address)
    	// In a real scenario, this would call a shipping provider's API.
    	return fmt.Sprintf("Order %s shipped", shipping.OrderID), nil
    }

    Section 3: The Worker and Starter

    To run this, we need two more pieces: a Worker process that hosts the workflow and activity implementations, and a Starter process that initiates a workflow execution.

    worker/main.go

    go
    package main
    
    import (
    	"log"
    
    	"go.temporal.io/sdk/client"
    	"go.temporal.io/sdk/worker"
    
    	"temporal-saga-project/activities"
    	"temporal-saga-project/workflow"
    )
    
    func main() {
    	c, err := client.Dial(client.Options{})
    	if err != nil {
    		log.Fatalln("Unable to create client", err)
    	}
    	defer c.Close()
    
    	w := worker.New(c, "order-processing", worker.Options{})
    
    	w.RegisterWorkflow(workflow.OrderWorkflow)
    	w.RegisterActivity(activities.ReserveInventory)
    	w.RegisterActivity(activities.CancelInventoryReservation)
    	w.RegisterActivity(activities.ProcessPayment)
    	w.RegisterActivity(activities.RefundPayment)
    	w.RegisterActivity(activities.ShipOrder)
    
    	err = w.Run(worker.InterruptCh())
    	if err != nil {
    		log.Fatalln("Unable to start worker", err)
    	}
    }

    starter/main.go

    go
    package main
    
    import (
    	"context"
    	"fmt"
    	"log"
    	"os"
    
    	"github.com/google/uuid"
    	"go.temporal.io/sdk/client"
    
    	"temporal-saga-project/shared"
    	"temporal-saga-project/workflow"
    )
    
    func main() {
    	c, err := client.Dial(client.Options{})
    	if err != nil {
    		log.Fatalln("Unable to create client", err)
    	}
    	defer c.Close()
    
    	// Determine if we should simulate a failure
    	userID := "user-happy-path"
    	if len(os.Args) > 1 && os.Args[1] == "fail" {
    		userID = "user-who-fails-payment"
    	}
    
    	orderDetails := shared.OrderDetails{
    		OrderID:    uuid.New().String(),
    		UserID:     userID,
    		Items:      []string{"item-1", "item-2"},
    		TotalPrice: 120.50,
    	}
    
    	workflowOptions := client.StartWorkflowOptions{
    		ID:        "order-workflow-" + orderDetails.OrderID,
    		TaskQueue: "order-processing",
    	}
    
    	we, err := c.ExecuteWorkflow(context.Background(), workflowOptions, workflow.OrderWorkflow, orderDetails)
    	if err != nil {
    		log.Fatalln("Unable to execute workflow", err)
    	}
    
    	log.Println("Started workflow", "WorkflowID", we.GetID(), "RunID", we.GetRunID())
    
    	// Wait for the workflow to complete.
    	var result string
    	err = we.Get(context.Background(), &result)
    	if err != nil {
    		log.Printf("Workflow failed: %v\n", err)
    	} else {
    		log.Printf("Workflow completed: %s\n", result)
    	}
    }

    Running the Saga

  • Start Temporal: Ensure you have a Temporal cluster running (e.g., via temporal server start-dev).
  • Start the Worker: go run worker/main.go
  • Run the Happy Path: go run starter/main.go
  • Run the Failure Path: go run starter/main.go fail
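
    When you run the failure path, the payment activity fails on each of its three attempts, with short backoff waits in between as set by the retry policy. The workflow then runs the inventory compensation (CancelInventoryReservation), demonstrating the Saga rollback in action. RefundPayment does not run in this scenario: its defer is registered only after ProcessPayment succeeds, so there is no payment to undo. To see both compensations run, make ShipOrder fail instead.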


    Section 4: Advanced Edge Cases and Production Patterns

    While the above implementation is robust, real-world systems introduce more complexity. Let's discuss how to handle some advanced scenarios.

    Non-Deterministic Logic Errors

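    A Temporal workflow must be deterministic. Its logic must produce the same sequence of commands given the same history of events. This is how Temporal can safely replay a workflow's history to recover its state. Calling non-deterministic functions like time.Now(), or letting the order of a map iteration (which is randomized in Go) decide which activities are scheduled, breaks this contract. On replay the SDK detects that the commands no longer match the recorded history and reports a non-determinism error, which typically leaves the workflow stuck until the code is fixed.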

    Incorrect (Non-Deterministic):

    go
    // In a workflow
    if time.Now().Weekday() == time.Friday {
        // ... apply discount
    }
    
    // Ranging over a map
    for k, v := range myMap {
        // ... this order is not guaranteed
    }

    Correct (Deterministic):

    go
    // Use Temporal's deterministic time function
    if workflow.Now(ctx).Weekday() == time.Friday {
        // ... apply discount
    }
    
    // Get keys, sort them, then iterate
    keys := make([]string, 0, len(myMap))
    for k := range myMap {
        keys = append(keys, k)
    }
    sort.Strings(keys)
    for _, k := range keys {
        v := myMap[k]
        // ... this order is now guaranteed
    }
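
    To see why the wall clock is a problem, consider this toy model of replay, written in plain Go. It is a deliberate simplification and not how the SDK is implemented; CheckReplay, orderCommands, and the command names are hypothetical. A workflow is reduced to the ordered list of commands it would schedule, and replay checks that list against what the history recorded.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical days: the workflow first ran on a Friday and is replayed
// by another worker on a Saturday.
var (
	recordedDay = time.Friday
	replayDay   = time.Saturday
)

// orderCommands returns the commands the workflow would schedule when the
// discount decision is made on the given day.
func orderCommands(day time.Weekday) []string {
	cmds := []string{"ReserveInventory"}
	if day == time.Friday {
		cmds = append(cmds, "ApplyDiscount")
	}
	return append(cmds, "ProcessPayment")
}

// CheckReplay re-runs the workflow and verifies that the commands it emits
// match the recorded history. Replay may emit more commands than the
// history holds (the workflow continues), but never different ones.
func CheckReplay(history []string, wf func() []string) error {
	cmds := wf()
	for i, recorded := range history {
		if i >= len(cmds) {
			return fmt.Errorf("non-determinism: history has %d commands, replay produced %d", len(history), len(cmds))
		}
		if cmds[i] != recorded {
			return fmt.Errorf("non-determinism at command %d: history has %q, replay produced %q", i, recorded, cmds[i])
		}
	}
	return nil
}

func main() {
	history := orderCommands(recordedDay)

	// Like time.Now(): the decision depends on when the replay happens.
	fmt.Println(CheckReplay(history, func() []string { return orderCommands(replayDay) }))

	// Like workflow.Now(ctx): the decision uses the time stored in the history.
	fmt.Println(CheckReplay(history, func() []string { return orderCommands(recordedDay) }))
}
```

    The first check reports a mismatch at command 1, the second passes. Sorting map keys before iterating serves the same purpose: it makes the command order a function of the data alone.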

    Workflow Versioning for In-flight Sagas

    What happens when you need to change the logic of your OrderWorkflow? If you deploy new worker code, any in-flight workflows that are 'sleeping' might be resumed by the new code, potentially breaking them if the history is no longer compatible.

    Temporal provides a versioning API, workflow.GetVersion, to manage this. You can mark sections of your code with a version, allowing you to introduce new logic paths while maintaining compatibility for old workflows.

    Example: Let's add a new fraud check step between inventory and payment.

    go
    // ... inside OrderWorkflow after inventory reservation
    
    // Use a change ID and a version number
    version := workflow.GetVersion(ctx, "add-fraud-check", workflow.DefaultVersion, 1)
    
    if version == 1 {
        // New logic path: taken by executions that reach this point for the first time on the new code.
        var fraudResult string
        err = workflow.ExecuteActivity(ctx, activities.CheckFraud, orderDetails).Get(ctx, &fraudResult)
        if err != nil {
            logger.Error("Fraud check failed", "Error", err)
            return "", fmt.Errorf("fraud check failed: %w", err)
        }
    }
    
    // The rest of the workflow (payment, shipping) continues here...

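    Workflow executions whose history already passed this point before the new code was deployed have no version marker for the "add-fraud-check" change ID, so GetVersion returns workflow.DefaultVersion on replay and they skip the new activity. Executions that reach this point for the first time on the new code record version 1 and execute the fraud check. This allows for safe, graceful migration of long-running Sagas. Note that activities.CheckFraud is assumed to exist; it is not defined in this article.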
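
    The behaviour of the version marker can be modelled in a few lines of plain Go. This is a toy model under simplifying assumptions, not the SDK's implementation: History, GetVersion, and steps are hypothetical stand-ins, the minimum-supported-version argument is omitted, and DefaultVersion mirrors the value of workflow.DefaultVersion.

```go
package main

import "fmt"

// DefaultVersion stands in for workflow.DefaultVersion.
const DefaultVersion = -1

// History is a toy stand-in for a workflow's recorded event history.
type History struct {
	Markers   map[string]int // change ID -> version recorded on first execution
	Replaying bool           // true while re-executing code that already ran
}

// GetVersion returns the recorded version if there is one. Code that is
// being replayed without a marker ran before the change existed, so it
// gets DefaultVersion. Code running for the first time records and
// returns maxSupported.
func GetVersion(h *History, changeID string, maxSupported int) int {
	if v, ok := h.Markers[changeID]; ok {
		return v
	}
	if h.Replaying {
		return DefaultVersion
	}
	h.Markers[changeID] = maxSupported
	return maxSupported
}

// steps lists the activities the versioned workflow would schedule.
func steps(h *History) []string {
	s := []string{"ReserveInventory"}
	if GetVersion(h, "add-fraud-check", 1) == 1 {
		s = append(s, "CheckFraud")
	}
	return append(s, "ProcessPayment", "ShipOrder")
}

func main() {
	// An execution that passed this point on the old code, now replayed.
	old := &History{Markers: map[string]int{}, Replaying: true}
	fmt.Println(steps(old))

	// An execution that reaches this point for the first time on the new code.
	fresh := &History{Markers: map[string]int{}}
	fmt.Println(steps(fresh))

	// Replaying the new execution later keeps the fraud check.
	fresh.Replaying = true
	fmt.Println(steps(fresh))
}
```

    The point of the marker is that the decision is made once and then read back from the history, so old and new executions each stay consistent with what they originally did.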

    Handling Compensation Failures

    Our current model assumes compensations will succeed. What if RefundPayment fails? The Saga is now in an inconsistent state—inventory was un-reserved, but the customer was not refunded.

    This is a business logic problem, not a technical one, but Temporal gives you the tools to model the solution.

  • Infinite Retries on Compensations: You can configure the activity options for compensation activities with an infinite retry policy. This is common for critical cleanup tasks. The activity will keep retrying until it succeeds (e.g., the payment gateway comes back online).

    go
    // Inside the defer block. temporal is the go.temporal.io/sdk/temporal package.
    compensationAo := workflow.ActivityOptions{
        StartToCloseTimeout: 30 * time.Second,
        RetryPolicy: &temporal.RetryPolicy{
            MaximumAttempts: 0, // 0 means unlimited attempts
        },
    }
    compensationCtx, _ := workflow.NewDisconnectedContext(ctx)
    compensationCtx = workflow.WithActivityOptions(compensationCtx, compensationAo)
    _ = workflow.ExecuteActivity(compensationCtx, activities.RefundPayment, paymentDetails).Get(compensationCtx, nil)

  • Human-in-the-Loop: If a compensation fails after a set number of retries, the workflow can escalate. It can execute another activity that creates a ticket in a system like Jira or PagerDuty, signaling for manual intervention. The workflow can then block using workflow.Await until a human signals it (using a Temporal Signal) that the issue has been resolved manually, at which point the Saga can conclude.
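
    The escalation flow from the second option can be sketched in plain Go. This is a simplified model, not Temporal code: CompensateWithEscalation and errGatewayDown are hypothetical, the escalate callback stands in for an activity that opens a ticket, and the channel stands in for a Temporal Signal sent by an operator.

```go
package main

import (
	"errors"
	"fmt"
)

// errGatewayDown is a hypothetical persistent failure.
var errGatewayDown = errors.New("payment gateway unavailable")

// CompensateWithEscalation tries the compensation up to maxAttempts times.
// If every attempt fails, it calls escalate once with the last error and
// then blocks until an operator reports a resolution on the channel.
func CompensateWithEscalation(
	compensate func() error,
	maxAttempts int,
	escalate func(lastErr error),
	resolved <-chan string,
) string {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = compensate(); err == nil {
			return "compensated automatically"
		}
	}
	escalate(err)
	return "resolved manually: " + <-resolved
}

func main() {
	// Buffered so the simulated operator can answer before we start waiting.
	resolved := make(chan string, 1)

	outcome := CompensateWithEscalation(
		func() error { return errGatewayDown },
		3,
		func(lastErr error) {
			fmt.Println("opening ticket:", lastErr)
			resolved <- "refund issued by support" // the operator's signal
		},
		resolved,
	)
	fmt.Println(outcome)
}
```

    In a workflow, the blocking wait costs nothing while the ticket is open, since the workflow's state lives in its history rather than in a running process.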

    Scaling with Task Queues

    In a high-volume system, you may find that payment processing is a bottleneck. You can scale the workers that handle payments independently from those that handle inventory or shipping. This is done by assigning activities to different Task Queues.

    In the Workflow:

    go
    // Use a specific task queue for payment activities
    paymentAo := workflow.ActivityOptions{
        TaskQueue:           "payment-processing-queue",
        StartToCloseTimeout: 10 * time.Second,
    }
    ctxForPayment := workflow.WithActivityOptions(ctx, paymentAo)
    err = workflow.ExecuteActivity(ctxForPayment, activities.ProcessPayment, paymentDetails).Get(ctxForPayment, &paymentResult)

    On the Worker:

    go
    // A dedicated worker pool for payments
    paymentWorker := worker.New(c, "payment-processing-queue", worker.Options{})
    paymentWorker.RegisterActivity(activities.ProcessPayment)
    paymentWorker.RegisterActivity(activities.RefundPayment)
    
    // Run this worker on a separate set of machines
    err = paymentWorker.Run(worker.InterruptCh())

    This allows you to provision more resources specifically for the payment-processing-queue without over-provisioning for the less-intensive inventory and shipping activities.

    Conclusion

    The Saga pattern is a powerful tool for maintaining data consistency in a distributed architecture. However, its manual implementation is a minefield of state management, error handling, and concurrency problems. By leveraging a durable execution engine like Temporal, you can implement complex, long-running Sagas as straightforward, readable code.

    The key takeaway is to shift your mindset from building state machines and message handlers to writing durable business logic. By using patterns like durable defer for compensations, ensuring activity idempotency, planning for workflow versioning, and thoughtfully handling compensation failures, you can build truly resilient and maintainable microservice applications. Temporal doesn't just make implementing Sagas easier; it makes them fundamentally more robust by handling the hardest parts of distributed systems for you.
