Fault-Tolerant Sagas with Temporal for Distributed Transactions

12 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Inherent Contradiction: Atomicity in a Decoupled World

In a monolithic architecture, ACID transactions managed by a single database are the bedrock of data consistency. As we decompose systems into microservices, we gain scalability and team autonomy, but we sacrifice this transactional simplicity. A single business operation, like booking a vacation, may now span multiple services—Flight, Hotel, Car Rental—each with its own private database. The classic solution, two-phase commit (2PC), is often a non-starter in these environments due to its chatty nature, reliance on synchronous communication, and tendency to hold locks, creating a single point of failure and reducing the very availability we sought with microservices.

This leaves us with a critical challenge: how do we maintain data consistency across service boundaries without compromising the principles of loose coupling and resilience? The answer lies in embracing eventual consistency through advanced architectural patterns. The Saga pattern emerges as the dominant solution for managing long-running, multi-step business processes in a distributed environment.

This article assumes you are familiar with the fundamental trade-offs between 2PC and Sagas. We will not re-litigate that debate. Instead, we will focus on the practical, and often treacherous, implementation details of an orchestrated Saga, and demonstrate why a durable execution engine like Temporal.io is a superior tool for this task compared to homegrown solutions built on message queues and databases.

Sagas: From Choreography Pitfalls to Orchestration Power

The Saga pattern executes a sequence of local transactions across different services. If any local transaction fails, the Saga executes a series of compensating transactions to semantically undo the work of the preceding successful transactions. There are two primary coordination strategies:

  • Choreography: Services subscribe to events from other services and react accordingly. A FlightBooked event might trigger the Hotel service to book a room. This is highly decoupled but suffers from poor visibility. Understanding the state of a business transaction requires tracing a chain of events across multiple services, making debugging and reasoning about failure modes exponentially difficult.
  • Orchestration: A central orchestrator explicitly calls services to perform their local transactions and is responsible for invoking compensating transactions in case of failure. This centralizes the business logic, making the workflow explicit, observable, and easier to manage.
  • The primary challenge with orchestration is the orchestrator itself. A naive implementation—a simple service that calls other services and stores its state in a database—is a fragile single point of failure. If the orchestrator crashes mid-saga, how do you reliably resume from the correct step? How do you handle timeouts and retries without complex, error-prone state machines and locking?

    This is precisely the problem Temporal solves. A Temporal Workflow is a durable, resumable function. Its state, execution stack, and local variables are persisted transparently by the Temporal Cluster. A worker can crash, the network can fail, but the workflow execution will not be lost. It will be seamlessly resumed on another worker from its exact state once resources are available. This transforms the Saga orchestrator from a complex, stateful service into a straightforward piece of code that looks deceptively simple but is incredibly resilient.

    A Production Scenario: The Trip Booking Saga

    Let's model a trip booking process that involves three distinct microservices:

  • Flight Service: Books a flight.
  • Hotel Service: Reserves a hotel room.
  • Car Service: Rents a car.
  • Our Saga will orchestrate these steps. If booking the flight succeeds, but the hotel reservation fails, the Saga must execute a compensating transaction to cancel the flight. Our goal is to ensure that at the end of the process, either all three bookings are successful, or all successfully completed bookings are successfully canceled.

    We will implement this in Go.

    Step 1: Defining the Activities - The Building Blocks

    In Temporal, any interaction with the outside world (API calls, database queries, etc.) must happen within an Activity. Activities are normal functions that can be retried, have timeouts, and report heartbeats. They represent a single step in our Saga.

    First, we define the interfaces for our external service clients. In a real application, these would make gRPC or REST calls.

    go
    // services/clients.go
    package services
    
    import "context"
    
    // A unique request ID for idempotency
    type IdempotencyKey string
    
    // --- Flight Service ---
    type BookFlightInput struct {
    	IdempotencyKey IdempotencyKey
    	UserID string
    	FlightNumber string
    }
    type BookFlightOutput struct {
    	BookingID string
    }
    
    // ... inputs/outputs for CancelFlight, BookHotel, etc.
    
    type FlightClient interface {
    	Book(ctx context.Context, input *BookFlightInput) (*BookFlightOutput, error)
    	Cancel(ctx context.Context, input *CancelFlightInput) (*CancelFlightOutput, error)
    }
    
    // --- Hotel Service ---
    type HotelClient interface {
    	Book(ctx context.Context, input *BookHotelInput) (*BookHotelOutput, error)
    	Cancel(ctx context.Context, input *CancelHotelInput) (*CancelHotelOutput, error)
    }
    
    // --- Car Service ---
    type CarClient interface {
    	Book(ctx context.Context, input *BookCarInput) (*BookCarOutput, error)
    	Cancel(ctx context.Context, input *CancelCarInput) (*CancelCarOutput, error)
    }

    Now, we implement the Temporal Activities that use these clients. Notice how we inject the service clients into our Activities struct. This is a standard pattern for managing dependencies.

    go
    // activities/trip_activities.go
    package activities
    
    import (
    	"context"
    	"fmt"
    	"gihub.com/my-app/services"
    )
    
    type Activities struct {
    	FlightSvc services.FlightClient
    	HotelSvc  services.HotelClient
    	CarSvc    services.CarClient
    }
    
    // BookFlight is the activity that calls the Flight service.
    func (a *Activities) BookFlight(ctx context.Context, input *services.BookFlightInput) (*services.BookFlightOutput, error) {
    	fmt.Println("Booking flight...", "FlightNumber", input.FlightNumber)
    	output, err := a.FlightSvc.Book(ctx, input)
    	if err != nil {
    		return nil, fmt.Errorf("failed to book flight: %w", err)
    	}
    	return output, nil
    }
    
    // CancelFlight is the compensating activity for BookFlight.
    func (a *Activities) CancelFlight(ctx context.Context, input *services.CancelFlightInput) (*services.CancelFlightOutput, error) {
    	fmt.Println("Canceling flight...", "BookingID", input.BookingID)
    	// In a real system, you might have different error handling for compensation.
    	output, err := a.FlightSvc.Cancel(ctx, input)
    	if err != nil {
    		return nil, fmt.Errorf("failed to cancel flight: %w", err)
    	}
    	return output, nil
    }
    
    // ... Implement BookHotel, CancelHotel, BookCar, CancelCar activities similarly ...

    Step 2: The Core Logic - Orchestrating the Saga Workflow

    This is where the power of Temporal becomes evident. The workflow code reads like a simple, synchronous function, but it's durable and fault-tolerant. We'll use Go's defer statement to elegantly handle compensation.

    go
    // workflows/trip_workflow.go
    package workflows
    
    import (
    	"time"
    	"github.com/my-app/activities"
    	"github.com/my-app/services"
    	"go.temporal.io/sdk/workflow"
    )
    
    func TripBookingWorkflow(ctx workflow.Context, tripID string) (err error) {
    	// Configure activity options with timeouts and retry policies.
    	// Business-critical operations should have a long timeout.
    	ao := workflow.ActivityOptions{
    		StartToCloseTimeout: time.Minute * 2,
    		// Add a retry policy for transient errors.
    		RetryPolicy: &temporal.RetryPolicy{
    			InitialInterval:    time.Second,
    			BackoffCoefficient: 2.0,
    			MaximumInterval:    time.Minute,
    			NonRetryableErrorTypes: []string{"InvalidInputError"}, // Don't retry on bad input
    		},
    	}
    	ctx = workflow.WithActivityOptions(ctx, ao)
    
    	var acts *activities.Activities
    
    	// This slice will hold the compensating functions.
    	compensations := []func(){}
    	// Use a deferred function to execute compensations on any failure.
    	// This is the core of the Saga's rollback logic.
    	defer func() {
    		if err != nil {
    			// A failure occurred, run all registered compensations in reverse order.
    			wfLogger := workflow.GetLogger(ctx)
    			wfLogger.Error("Workflow failed, starting compensation.", "Error", err)
    
    			// Execute compensations. We don't want the workflow to fail on compensation failure,
    			// so we create a disconnected context.
    			compensationCtx := workflow.NewDisconnectedContext(ctx)
    
    			for i := len(compensations) - 1; i >= 0; i-- {
    				compensations[i](compensationCtx)
    			}
    		}
    	}()
    
    	// --- Step 1: Book Flight ---
    	flightInput := &services.BookFlightInput{
    		IdempotencyKey: services.IdempotencyKey(fmt.Sprintf("%s-flight", tripID)),
    		// ... other params
    	}
    	var flightResult services.BookFlightOutput
    	err = workflow.ExecuteActivity(ctx, acts.BookFlight, flightInput).Get(ctx, &flightResult)
    	if err != nil {
    		return err
    	}
    	// If successful, add the compensation to the list.
    	compensations = append(compensations, func(compensationCtx workflow.Context) {
    		cancelInput := &services.CancelFlightInput{BookingID: flightResult.BookingID}
    		// We don't care about the result of the compensation activity future here.
    		// The activity itself should be configured to retry indefinitely.
    		_ = workflow.ExecuteActivity(compensationCtx, acts.CancelFlight, cancelInput)
    	})
    
    	// --- Step 2: Book Hotel ---
    	hotelInput := &services.BookHotelInput{
    		IdempotencyKey: services.IdempotencyKey(fmt.Sprintf("%s-hotel", tripID)),
    		// ... other params
    	}
    	var hotelResult services.BookHotelOutput
    	err = workflow.ExecuteActivity(ctx, acts.BookHotel, hotelInput).Get(ctx, &hotelResult)
    	if err != nil {
    		return err // This will trigger the defer block
    	}
    	compensations = append(compensations, func(compensationCtx workflow.Context) {
    		cancelInput := &services.CancelHotelInput{BookingID: hotelResult.BookingID}
    		_ = workflow.ExecuteActivity(compensationCtx, acts.CancelHotel, cancelInput)
    	})
    
    	// --- Step 3: Book Car ---
    	carInput := &services.BookCarInput{
    		IdempotencyKey: services.IdempotencyKey(fmt.Sprintf("%s-car", tripID)),
    		// ... other params
    	}
    	var carResult services.BookCarOutput
    	err = workflow.ExecuteActivity(ctx, acts.BookCar, carInput).Get(ctx, &carResult)
    	if err != nil {
    		return err // This will trigger the defer block
    	}
    	// No need to add a compensation for the final step unless it can also be part of a larger workflow.
    
    	workflow.GetLogger(ctx).Info("Trip booked successfully!", "Flight", flightResult.BookingID, "Hotel", hotelResult.BookingID, "Car", carResult.BookingID)
    	return nil
    }

    Key takeaways from this implementation:

  • defer for Compensation: Using a defer block with a closure that executes a list of compensation functions is a clean, idiomatic Go pattern for ensuring cleanup logic runs. It guarantees that if err is non-nil upon function exit, all registered compensations will fire.
  • Reverse Order Execution: The for loop iterates backward through the compensations slice, correctly implementing the Saga rollback logic (undoing C, then B, then A).
  • Disconnected Context for Compensations: This is a critical and advanced pattern. If the main workflow context is canceled (e.g., due to a timeout), we still want our compensation logic to run to completion. workflow.NewDisconnectedContext creates a new context that is not a child of the main workflow context, preventing it from being canceled along with its parent. We then configure the compensation activities with a very long or infinite timeout/retry policy, as their failure is a catastrophic event requiring manual intervention.
  • Idempotency Keys: We generate a deterministic idempotency key for each activity call, combining the unique tripID with a step identifier. This is crucial for resilience.
  • Step 3: The Worker and Client

    Finally, we need code to run a Temporal Worker that will poll for tasks and execute our workflow and activities, and a client to start the workflow.

    go
    // worker/main.go
    package main
    
    import (
    	"log"
    	"github.com/my-app/activities"
    	"github.com/my-app/workflows"
    	"go.temporal.io/sdk/client"
    	"go.temporal.io/sdk/worker"
    )
    
    func main() {
    	c, err := client.Dial(client.Options{})
    	if err != nil {
    		log.Fatalln("Unable to create client", err)
    	}
    	defer c.Close()
    
    	w := worker.New(c, "trip-booking-task-queue", worker.Options{})
    
    	// Create mock service clients for this example
    	acts := &activities.Activities{
    		FlightSvc: &services.MockFlightClient{},
    		HotelSvc:  &services.MockHotelClient{FailOnBook: true}, // Simulate a failure
    		CarSvc:    &services.MockCarClient{},
    	}
    
    	w.RegisterWorkflow(workflows.TripBookingWorkflow)
    	w.RegisterActivities(acts)
    
    	err = w.Run(worker.InterruptCh())
    	if err != nil {
    		log.Fatalln("Unable to start worker", err)
    	}
    }
    
    // client/main.go
    package main
    
    import (
    	"context"
    	"log"
    	"github.com/my-app/workflows"
    	"go.temporal.io/sdk/client"
    	"github.com/google/uuid"
    )
    
    func main() {
    	c, err := client.Dial(client.Options{})
    	if err != nil {
    		log.Fatalln("Unable to create client", err)
    	}
    	defer c.Close()
    
    	workflowOptions := client.StartWorkflowOptions{
    		ID:        "trip-booking-" + uuid.New().String(),
    		TaskQueue: "trip-booking-task-queue",
    	}
    
    	we, err := c.ExecuteWorkflow(context.Background(), workflowOptions, workflows.TripBookingWorkflow, "my-trip-123")
    	if err != nil {
    		log.Fatalln("Unable to execute workflow", err)
    	}
    
    	log.Println("Started workflow", "WorkflowID", we.GetID(), "RunID", we.GetRunID())
    
    	// Wait for the workflow to complete.
    	var result string // We don't expect a result, but we need to wait for completion.
    	err = we.Get(context.Background(), &result)
    	if err != nil {
    		log.Println("Workflow completed with an error (as expected due to compensation):", err)
    	} else {
    		log.Println("Workflow completed successfully.")
    	}
    }

    When you run this code with MockHotelClient configured to fail, you will see logs indicating that the flight was booked, the hotel booking failed, and then the flight cancellation activity was executed. The workflow will ultimately fail, but the application state will be consistent.

    Advanced Edge Cases and Production Considerations

    The implementation above is robust, but real-world systems present further challenges.

    Deeper on Idempotency

    Temporal guarantees at-least-once execution for activities. If an activity executes, but the acknowledgment is lost, Temporal will re-run it. Your downstream services must be idempotent. Passing the IdempotencyKey is only half the battle. The receiving service (e.g., Flight Service) must implement logic like this:

  • Receive request with IdempotencyKey.
    • Check a cache (e.g., Redis) or a dedicated database table for this key.
    • If key exists and a response is stored, return the stored response immediately without re-processing.
    • If key does not exist, begin processing.
  • Before returning the response, store the IdempotencyKey and the response with a reasonable TTL.
  • This ensures that a retried BookFlight call doesn't book two flights.

    The Nightmare: Compensation Failure

    What happens if a compensating activity, like CancelFlight, fails? This is the most critical failure mode of a Saga. The system is now in an inconsistent state (a flight is booked, but the rest of the trip is not) and the automated recovery has failed.

    Strategy: Compensating activities must be designed for maximum resilience. Configure their ActivityOptions in the disconnectedContext with an aggressive, potentially infinite retry policy.

    go
    // Inside the defer block
    compensationCtx := workflow.NewDisconnectedContext(ctx)
    
    // Configure options for activities that MUST succeed.
    compensationAO := workflow.ActivityOptions{
        StartToCloseTimeout: time.Hour, // Give it plenty of time
        RetryPolicy: &temporal.RetryPolicy{
            InitialInterval:    time.Second * 5,
            BackoffCoefficient: 2.0,
            MaximumInterval:    time.Minute * 10,
            // No MaximumAttempts, retry forever.
        },
    }
    compensationCtx = workflow.WithActivityOptions(compensationCtx, compensationAO)
    
    // ... execute activity with this context
    _ = workflow.ExecuteActivity(compensationCtx, acts.CancelFlight, cancelInput)

    If the activity continues to fail (e.g., the cancellation service is down for days), the activity will keep retrying. The workflow will remain in a running state. This is now an operational issue. Your monitoring should alert on long-running Sagas or activities that have been retrying for an extended period. This allows an operator to intervene manually, and once the underlying issue is fixed, the activity will succeed and the Saga will complete its compensation.

    Performance and Scalability Tuning

    Temporal is highly scalable, but workers need to be tuned for your workload.

    * MaxConcurrentActivityExecutionSize: Controls how many activities a single worker can execute concurrently. If your activities are I/O-bound (like most API calls), you can set this to a high number (e.g., 1000). If they are CPU-bound, it should be closer to the number of available cores.

    * WorkerActivitiesPerSecond: A rate limit on the worker. Useful for preventing your workers from overwhelming a downstream service.

    * Task Queue Partitions: For very high-throughput workloads, you can partition your task queue to increase concurrency at the cluster level.

    * Workflow History Size: A workflow's history grows with each event. Very long-running workflows with thousands of steps can have a large history, which can impact performance. For Sagas that could theoretically be very long, use Temporal's Continue-As-New feature to periodically start a new workflow run with the same ID, effectively truncating the history.

    Conclusion: Sagas as a Solved Problem

    The Saga pattern is a powerful conceptual tool for managing consistency in microservices. However, a naive, homegrown implementation quickly becomes a complex, brittle distributed system in its own right. You end up building a poor, incomplete version of a workflow engine.

    By leveraging a durable execution engine like Temporal, the most challenging aspects of Saga orchestration—state management, retries, timeouts, and recoverability—are handled by the platform. This allows senior engineers to focus on modeling the business logic of the workflow and its compensations, rather than the plumbing of its execution. The resulting code is not only more reliable and observable but also dramatically simpler to write, read, and maintain. Sagas, when powered by the right abstraction, are no longer a dreaded complexity but a solved, production-ready pattern for building resilient distributed applications.

    Found this article helpful?

    Share it with others who might benefit from it.

    More Articles