Implementing the Saga Pattern with Temporal for Distributed Transactions

15 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Inevitable Complexity of Distributed State

In a monolithic world, ACID transactions are our safety net. A database BEGIN, a series of operations, and a COMMIT or ROLLBACK provides a powerful atomicity guarantee. When we decompose systems into microservices, we trade this simplicity for scalability and organizational autonomy, but we lose the safety net of distributed transactions. The two-phase commit protocol (2PC), while theoretically providing atomicity, is often a non-starter in modern architectures due to its synchronous, blocking nature and its creation of tight coupling between services. A failure in any participant or the coordinator can lock resources system-wide, torpedoing the very availability we sought with microservices.

This is where the Saga pattern enters. A saga is a sequence of local transactions where each transaction updates data within a single service. If a local transaction fails, the saga executes a series of compensating transactions to semantically undo the preceding successful transactions. The critical distinction is that a saga guarantees either successful completion of all steps or the successful execution of all necessary compensations. It's an A (Atomicity) and D (Durability) guarantee at the business logic level, sacrificing C (Consistency) and I (Isolation) during its execution (i.e., using semantic locks instead of database locks).

Sagas are typically implemented in two styles:

  • Choreography: Each service publishes events that trigger subsequent services in the chain. This is decentralized but creates significant challenges. Business logic becomes distributed and implicit, making the overall flow incredibly difficult to understand, debug, and modify. Monitoring a single business transaction requires correlating events across multiple services. Cyclical dependencies are a constant risk.
  • Orchestration: A central orchestrator explicitly calls services to perform operations and is responsible for invoking compensations in case of failure. This centralizes the workflow logic, making it explicit, observable, and easier to manage.
  • While orchestration is conceptually superior for complex flows, building a resilient orchestrator from scratch is a monumental task. You need to manage state durably, handle orchestrator process failures, manage retries with backoff, handle timeouts, and provide visibility. This is precisely the problem that Temporal solves, not just for Sagas, but for any long-running, stateful business logic.

    Temporal: A Durable Execution Engine, Not Just a Workflow Orchestrator

    To understand why Temporal is a game-changer for implementing Sagas, you must first discard the notion of it being a simple task queue or a BPMN engine. Temporal provides a durable execution environment. A Temporal Workflow is not a static definition but a piece of code written in a standard programming language (Go, TypeScript, Java, Python) that executes durably.

    Here's the core concept senior engineers must grasp: the state of your workflow function—local variables, threads, execution point—is transparently and continuously persisted by the Temporal Cluster. If the worker running your workflow code crashes, the Temporal Cluster will eventually detect the timeout and reschedule the execution on another available worker. This new worker will then replay the history of events for that workflow execution, instantly rebuilding the exact state (local variables and all) and resuming execution from the exact line of code where it left off.

    This durable execution model has profound implications for Saga implementation:

    * State Management is Solved: You don't need a saga_state table in a database. The state is the workflow's execution state itself, managed by Temporal.

    * Complex Logic is Just Code: Your Saga logic is expressed in your language of choice, with loops, conditionals, and variables. No restrictive YAML or JSON definitions.

    * Failure Handling is Trivial: Retries, timeouts, and error handling are built-in primitives of the Temporal SDKs.

    * Compensation is Simplified: Standard language constructs like defer in Go or try...catch...finally in TypeScript become powerful tools for ensuring compensation logic runs.

    Let's move from theory to a production-grade implementation.

    Deep Dive: An E-commerce Order Saga with Temporal and Go

    We'll model a canonical e-commerce order placement saga. The business transaction involves three microservices:

  • Inventory Service: Reserves items for the order.
  • Payment Service: Charges the customer's credit card.
  • Shipping Service: Creates a shipment record for fulfillment.
  • If any step fails, we must compensate for the preceding successful steps. A failure in CreateShipment must trigger a refund (ProcessPayment compensation) and a stock release (ReserveInventory compensation).

    Project Structure

    bash
    /temporal-saga-example
    ├── activities
    │   ├── inventory_activity.go
    │   ├── payment_activity.go
    │   └── shipping_activity.go
    ├── shared
    │   └── types.go
    ├── workflow
    │   └── order_workflow.go
    ├── worker
    │   └── main.go
    ├── starter
    │   └── main.go
    └── go.mod

    Shared Types (`shared/types.go`)

    First, let's define the data structures that will be passed between our workflow and activities.

    go
    package shared
    
    // OrderDetails contains all the necessary information for the order saga.
    type OrderDetails struct {
    	UserID     string
    	ItemID     string
    	Quantity   int
    	TotalPrice float64
    	OrderID    string // Generated by the workflow
    }
    
    // PaymentDetails holds the result of the payment activity.
    type PaymentDetails struct {
    	TransactionID string
    }
    
    // ShipmentDetails holds the result of the shipping activity.
    type ShipmentDetails struct {
    	TrackingID string
    }

    The Saga Orchestration: `OrderWorkflow` (`workflow/order_workflow.go`)

    This is the heart of our Saga. Notice how the logic reads like simple, sequential code. The magic of Temporal is that this code can execute over minutes, hours, or even days, surviving any worker or network failure.

    The key pattern for compensation is Go's defer statement. A deferred function's execution is postponed until the surrounding function returns. By placing our compensation calls within deferred functions, we ensure they will execute if the workflow function returns, either successfully or due to a panic (error).

    go
    package workflow
    
    import (
    	"fmt"
    	"time"
    
    	"go.temporal.io/sdk/log"
    	"go.temporal.io/sdk/workflow"
    
    	"temporal-saga-example/activities"
    	"temporal-saga-example/shared"
    )
    
    const OrderProcessingTaskQueue = "ORDER_PROCESSING_TASK_QUEUE"
    
    // OrderWorkflow orchestrates the entire order process as a saga.
    func OrderWorkflow(ctx workflow.Context, order shared.OrderDetails) (string, error) {
    	// Use a logger that is workflow-aware
    	logger := workflow.GetLogger(ctx)
    	logger.Info("Order workflow started", "OrderID", order.OrderID)
    
    	// Set up activity options. Timeouts are crucial for production systems.
    	ao := workflow.ActivityOptions{
    		StartToCloseTimeout: 10 * time.Second,
    		// HeartbeatTimeout can be used for long-running activities
    	}
    	ctx = workflow.WithActivityOptions(ctx, ao)
    
    	var a *activities.Activities // Activities are stateless, so a nil receiver is fine.
    	var compensationStack []func() error
    	var requiresCompensation bool = true
    
    	// Defer the compensation logic. This will run when the workflow function exits.
    	defer func() {
    		if !requiresCompensation {
    			logger.Info("Workflow completed successfully, no compensation needed.")
    			return
    		}
    
    		logger.Warn("Workflow failed or was cancelled, starting compensation logic...")
    		// Execute compensations in reverse order.
    		for i := len(compensationStack) - 1; i >= 0; i-- {
    			err := compensationStack[i]()
    			if err != nil {
    				// This is a critical failure. The compensation itself failed.
    				// In a real system, you'd alert a human operator here.
    				logger.Error("Compensation activity failed. Manual intervention required.", "Error", err)
    			}
    		}
    	}()
    
    	// 1. Reserve Inventory
    	logger.Info("Reserving inventory...")
    	err := workflow.ExecuteActivity(ctx, a.ReserveInventory, order).Get(ctx, nil)
    	if err != nil {
    		logger.Error("Failed to reserve inventory.", "Error", err)
    		return "", fmt.Errorf("inventory reservation failed: %w", err)
    	}
    	compensationStack = append(compensationStack, func() error {
    		logger.Info("Compensating: Releasing inventory...")
    		// Use a new context for compensation to ensure it runs even if the main context is cancelled.
    		compensationCtx, _ := workflow.NewDisconnectedContext(ctx)
    		return workflow.ExecuteActivity(compensationCtx, a.ReleaseInventory, order).Get(compensationCtx, nil)
    	})
    
    	// 2. Process Payment
    	logger.Info("Processing payment...")
    	var paymentDetails shared.PaymentDetails
    	err = workflow.ExecuteActivity(ctx, a.ProcessPayment, order).Get(ctx, &paymentDetails)
    	if err != nil {
    		logger.Error("Failed to process payment.", "Error", err)
    		return "", fmt.Errorf("payment processing failed: %w", err)
    	}
    	compensationStack = append(compensationStack, func() error {
    		logger.Info("Compensating: Refunding payment...", "TransactionID", paymentDetails.TransactionID)
    		compensationCtx, _ := workflow.NewDisconnectedContext(ctx)
    		return workflow.ExecuteActivity(compensationCtx, a.RefundPayment, paymentDetails.TransactionID).Get(compensationCtx, nil)
    	})
    
    	// 3. Create Shipment
    	logger.Info("Creating shipment...")
    	var shipmentDetails shared.ShipmentDetails
    	err = workflow.ExecuteActivity(ctx, a.CreateShipment, order).Get(ctx, &shipmentDetails)
    	if err != nil {
    		logger.Error("Failed to create shipment.", "Error", err)
    		return "", fmt.Errorf("shipment creation failed: %w", err)
    	}
    	// Note: We don't add a compensation for CreateShipment in this simple example,
    	// but a real system might have a CancelShipment activity.
    
    	// If we've reached this point, the saga is successful.
    	// We disable the deferred compensation logic.
    	requiresCompensation = false
    
    	result := fmt.Sprintf("Order completed successfully! Tracking ID: %s", shipmentDetails.TrackingID)
    	logger.Info(result)
    	return result, nil
    }
    

    The Activities (`activities/`)

    Activities are the functions that perform the actual work, like making API calls to other services. They are stateless and can be retried automatically by Temporal. Here we'll just simulate the work.

    activities/inventory_activity.go

    go
    package activities
    
    import (
    	"context"
    	"errors"
    	"fmt"
    
    	"temporal-saga-example/shared"
    )
    
    // Activities struct is a placeholder for activities.
    type Activities struct{}
    
    func (a *Activities) ReserveInventory(ctx context.Context, order shared.OrderDetails) error {
    	fmt.Printf("Reserving %d of item %s for order %s\n", order.Quantity, order.ItemID, order.OrderID)
    	// Simulate a failure for demonstration purposes
    	// if order.ItemID == "FAIL_INVENTORY" {
    	// 	 return errors.New("simulated inventory failure")
    	// }
    	return nil
    }
    
    func (a *Activities) ReleaseInventory(ctx context.Context, order shared.OrderDetails) error {
    	fmt.Printf("Releasing %d of item %s for order %s from inventory\n", order.Quantity, order.ItemID, order.OrderID)
    	return nil
    }

    activities/payment_activity.go

    go
    package activities
    
    import (
    	"context"
    	"errors"
    	"fmt"
    
    	"temporal-saga-example/shared"
    	"github.com/google/uuid"
    )
    
    func (a *Activities) ProcessPayment(ctx context.Context, order shared.OrderDetails) (*shared.PaymentDetails, error) {
    	fmt.Printf("Charging $%.2f for order %s\n", order.TotalPrice, order.OrderID)
    
    	if order.ItemID == "FAIL_PAYMENT" {
    		return nil, errors.New("simulated payment failure: insufficient funds")
    	}
    
    	transactionID := uuid.New().String()
    	return &shared.PaymentDetails{TransactionID: transactionID}, nil
    }
    
    func (a *Activities) RefundPayment(ctx context.Context, transactionID string) error {
    	fmt.Printf("Refunding payment for transaction %s\n", transactionID)
    	return nil
    }

    activities/shipping_activity.go

    go
    package activities
    
    import (
    	"context"
    	"errors"
    	"fmt"
    
    	"temporal-saga-example/shared"
    	"github.com/google/uuid"
    )
    
    func (a *Activities) CreateShipment(ctx context.Context, order shared.OrderDetails) (*shared.ShipmentDetails, error) {
    	fmt.Printf("Creating shipment for order %s\n", order.OrderID)
    
    	if order.ItemID == "FAIL_SHIPPING" {
    		return nil, errors.New("simulated shipping failure: address not verifiable")
    	}
    
    	trackingID := fmt.Sprintf("1Z%s", uuid.New().String())
    	return &shared.ShipmentDetails{TrackingID: trackingID}, nil
    }

    The Worker and Starter

    Finally, we need a Worker process to host and execute our workflow and activity code, and a Starter process to initiate a workflow execution.

    worker/main.go

    go
    package main
    
    import (
    	"log"
    
    	"go.temporal.io/sdk/client"
    	"go.temporal.io/sdk/worker"
    
    	"temporal-saga-example/activities"
    	"temporal-saga-example/workflow"
    )
    
    func main() {
    	c, err := client.Dial(client.Options{})
    	if err != nil {
    		log.Fatalln("Unable to create client", err)
    	}
    	defer c.Close()
    
    	w := worker.New(c, workflow.OrderProcessingTaskQueue, worker.Options{})
    
    	w.RegisterWorkflow(workflow.OrderWorkflow)
    	w.RegisterActivity(&activities.Activities{})
    
    	err = w.Run(worker.InterruptCh())
    	if err != nil {
    		log.Fatalln("Unable to start worker", err)
    	}
    }

    starter/main.go

    go
    package main
    
    import (
    	"context"
    	"fmt"
    	"log"
    
    	"github.com/google/uuid"
    	"go.temporal.io/sdk/client"
    
    	"temporal-saga-example/shared"
    	"temporal-saga-example/workflow"
    )
    
    func main() {
    	c, err := client.Dial(client.Options{})
    	if err != nil {
    		log.Fatalln("Unable to create client", err)
    	}
    	defer c.Close()
    
    	// Change ItemID to "FAIL_PAYMENT" or "FAIL_SHIPPING" to test compensation.
    	order := shared.OrderDetails{
    		UserID:     "user-123",
    		ItemID:     "item-456", // or "FAIL_PAYMENT", "FAIL_SHIPPING"
    		Quantity:   1,
    		TotalPrice: 29.99,
    		OrderID:    uuid.New().String(),
    	}
    
    	options := client.StartWorkflowOptions{
    		ID:        "order-workflow-" + order.OrderID,
    		TaskQueue: workflow.OrderProcessingTaskQueue,
    	}
    
    	we, err := c.ExecuteWorkflow(context.Background(), options, workflow.OrderWorkflow, order)
    	if err != nil {
    		log.Fatalln("Unable to execute workflow", err)
    	}
    	log.Println("Started workflow", "WorkflowID", we.GetID(), "RunID", we.GetRunID())
    
    	var result string
    	err = we.Get(context.Background(), &result)
    	if err != nil {
    		log.Println("Workflow failed.", "Error", err)
    		return
    	}
    	log.Println("Workflow completed", "Result", result)
    }

    Running this code against a local Temporal development server will demonstrate the happy path. If you change the ItemID in the starter to FAIL_PAYMENT, you will see the logs indicating the payment activity failed, followed immediately by the Releasing inventory... log from the compensation activity.

    Advanced Patterns and Edge Case Handling

    This simple implementation is a great start, but production systems require more rigor.

    1. Activity Idempotency: The Cornerstone of Resilience

    Temporal guarantees that a workflow will execute effectively-once, but activities execute at-least-once. A worker could execute an activity (e.g., charge a credit card), crash before it can report completion to the Temporal server, and then another worker could be assigned the same task. This would result in a double charge.

    Your downstream services must be idempotent. The standard pattern is to generate an idempotency key within the workflow and pass it to all activities. The workflow is deterministic, so the same key will be generated on any replay.

    go
    // In OrderWorkflow
    idempotencyKey := workflow.NewUUID().String() // Stable across replays
    
    // Pass this key to your activities
    err := workflow.ExecuteActivity(ctx, a.ProcessPayment, order, idempotencyKey).Get(ctx, &paymentDetails)

    The ProcessPayment gRPC endpoint in your Payment Service would then look something like this:

    go
    // In Payment Service (gRPC implementation)
    func (s *server) ProcessPayment(ctx context.Context, req *pb.PaymentRequest) (*pb.PaymentResponse, error) {
        // 1. Check a cache (e.g., Redis) for the idempotency key.
        // If found, return the cached response.
    
        // 2. If not in cache, begin a database transaction.
        tx, err := s.db.BeginTx(ctx, nil)
        if err != nil { return nil, status.Error(codes.Internal, "..." ) }
        defer tx.Rollback() // Rollback on any error
    
        // 3. Check for the idempotency key in a dedicated DB table within the transaction.
        // If it exists, another request is in-flight or completed. Return an error or the stored result.
    
        // 4. Perform the actual payment processing logic...
        
        // 5. Store the result and the idempotency key in the idempotency table.
    
        // 6. Commit the transaction.
        // 7. Cache the result in Redis with the idempotency key.
    
        return response, nil
    }

    This pattern ensures that even if Temporal retries the activity, the actual business logic is only performed once.

    2. Non-Compensatable Failures: When Automation Fails

    What if RefundPayment itself fails? This is a non-compensatable failure. The default Temporal activity retry policy is exponential backoff, effectively forever. While this is great for transient network issues, a permanent failure (e.g., a bug in the refund logic, a hard-down dependent service) will stall the compensation.

    Your strategy here must involve a human. You can customize the activity's retry policy to give up after a certain number of attempts and then escalate.

    go
    // In OrderWorkflow, for a compensation activity
    compensationCtx, _ = workflow.NewDisconnectedContext(ctx)
    
    // Custom retry policy for critical compensation
    retryPolicy := &temporal.RetryPolicy{
        InitialInterval:    time.Second,
        BackoffCoefficient: 2.0,
        MaximumInterval:    time.Minute,
        MaximumAttempts:    5, // Give up after 5 attempts
    }
    
    compensationActivityOptions := workflow.ActivityOptions{
        StartToCloseTimeout: 30 * time.Second,
        RetryPolicy:         retryPolicy,
    }
    compensationCtx = workflow.WithActivityOptions(compensationCtx, compensationActivityOptions)
    
    err := workflow.ExecuteActivity(compensationCtx, a.RefundPayment, ...).Get(compensationCtx, nil)
    if err != nil {
        logger.Error("CRITICAL: Compensation for refund failed permanently. Paging on-call.", "Error", err)
        // This is where you would execute another activity to send an alert to PagerDuty, Slack, etc.
        _ = workflow.ExecuteActivity(ctx, a.AlertHumanOperator, "Failed to refund payment for order " + order.OrderID).Get(ctx, nil)
    }

    Another advanced pattern is to use Temporal Signals. A signal is an external, asynchronous event sent to a running workflow. You could design the workflow to enter a waiting state after a compensation failure, where it waits for a signal from an operator who has manually fixed the underlying issue. The signal could contain data on how to proceed.

    3. Performance and Scalability: Task Queues and Workers

    Temporal uses Task Queues to decouple workflows/activities from workers. When a worker starts, it subscribes to a specific task queue.

    In a large-scale system, you wouldn't run all activities on one generic worker pool. You'd use different task queues for different workloads:

    * inventory-tasks: Workers for this queue might be co-located with the inventory database for low latency.

    * payment-tasks: These workers might have special IAM roles to access a payment provider's API.

    * high-memory-tasks: A fleet of workers with high memory for data processing activities.

    This is configured in the ActivityOptions:

    go
    // In OrderWorkflow
    paymentActivityOptions := workflow.ActivityOptions{
        TaskQueue:           "PAYMENT_PROCESSING_TASK_QUEUE",
        StartToCloseTimeout: 15 * time.Second,
    }
    paymentCtx := workflow.WithActivityOptions(ctx, paymentActivityOptions)
    err := workflow.ExecuteActivity(paymentCtx, a.ProcessPayment, order).Get(paymentCtx, &paymentDetails)

    Scaling is as simple as deploying more worker processes listening to the same task queue. Temporal's server-side load balancing will distribute the tasks among them.

    Production Considerations: Observability and Testing

    Code that you can't test or monitor is a liability.

    Observability

    * Logging: Use workflow.GetLogger(ctx) inside workflows. This logger automatically includes workflow metadata and correctly handles replay logic (log statements are only written to the output on the first execution, not on replay).

    * Tracing: The Temporal SDKs have built-in support for OpenTelemetry. You can configure it on the client to automatically propagate tracing context from the workflow starter, through the Temporal cluster, to the workflow, and down into each activity. This gives you a single trace that spans your entire distributed transaction.

    * Metrics: Temporal emits a rich set of metrics (e.g., temporal_activity_execution_latency, temporal_workflow_task_schedule_to_start_latency) that can be scraped by Prometheus for monitoring the health of your Temporal cluster and workers.

    * Visibility: Use the Temporal Web UI or tctl (the command-line tool) to inspect the state, inputs, and history of any workflow execution. This is invaluable for debugging.

    Testing

    One of Temporal's most powerful features is its test framework. It allows you to run your entire workflow logic in a simulated, in-memory environment, enabling true unit tests for your orchestration logic.

    Here's a test for our OrderWorkflow that simulates a payment failure and verifies that the inventory compensation is called.

    go
    // In workflow/order_workflow_test.go
    package workflow
    
    import (
    	"errors"
    	"testing"
    
    	"github.com/stretchr/testify/mock"
    	"github.com/stretchr/testify/require"
    	"go.temporal.io/sdk/testsuite"
    
    	"temporal-saga-example/activities"
    	"temporal-saga-example/shared"
    )
    
    func TestOrderWorkflow_PaymentFails(t *testing.T) {
    	suite := testsuite.WorkflowTestSuite{}
    	env := suite.NewTestWorkflowEnvironment()
    
    	var a *activities.Activities
    	order := shared.OrderDetails{
    		UserID:     "test-user",
    		ItemID:     "FAIL_PAYMENT",
    		Quantity:   2,
    		TotalPrice: 50.0,
    		OrderID:    "test-order-id",
    	}
    
    	// Mock the activities
    	env.OnActivity(a.ReserveInventory, mock.Anything, order).Return(nil)
    	env.OnActivity(a.ProcessPayment, mock.Anything, order).Return(nil, errors.New("insufficient funds"))
    	// We expect the compensation activity to be called
    	env.OnActivity(a.ReleaseInventory, mock.Anything, order).Return(nil).Once()
    
    	env.ExecuteWorkflow(OrderWorkflow, order)
    
    	// Assertions
    	require.True(t, env.IsWorkflowCompleted())
    	require.Error(t, env.GetWorkflowError())
    
    	// Verify that the mocked activities were called as expected
    	env.AssertExpectations(t)
    }

    This test executes the entire OrderWorkflow function, including the defer logic, without any external dependencies. You can simulate any failure scenario and assert that the correct compensation path was taken. This allows you to achieve extremely high confidence in your complex orchestration logic before deploying.

    Conclusion: Durable Execution is the Future of Distributed Systems

    The Saga pattern is a powerful concept for managing consistency across microservice boundaries. However, traditional implementations often trade one form of complexity (distributed transactions) for another (brittle event choreography or complex, hand-rolled state machines).

    Temporal's durable execution model provides a paradigm shift. By abstracting away the hard problems of distributed systems—state persistence, retries, timeouts, and failure recovery—it allows senior engineers to write complex, long-running, fault-tolerant orchestration logic as simple, sequential code. The result is a system that is more resilient, easier to understand, simpler to debug, and vastly more testable than any alternative. For building mission-critical, stateful applications in a microservices world, this approach is not just an improvement; it's a fundamental leap forward.

    Found this article helpful?

    Share it with others who might benefit from it.

    More Articles