Production-Grade Sagas with Temporal: Compensation & Idempotency

October 13, 2025

15 min read

Goh Ling Yong

Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Brittle Reality of Choreographed Sagas

For senior engineers building distributed systems, the problem of maintaining data consistency across microservice boundaries is a familiar and persistent challenge. The classic two-phase commit (2PC) is often dismissed for its tight coupling and blocking nature, making it a poor fit for modern, scalable architectures. This leads us to the Saga pattern, a design choice that favors eventual consistency through a sequence of local transactions, each with a corresponding compensating action.

Sagas come in two flavors: Choreography and Orchestration. The choreography approach, often implemented with a message bus like Kafka or RabbitMQ, is deceptively simple at first glance. Each service performs its transaction and publishes an event, which triggers the next service in the chain.

Consider a simple trip booking system:

OrderService receives a request, creates an order, and emits OrderCreated.

PaymentService listens for OrderCreated, processes payment, and emits PaymentProcessed.

FlightService listens for PaymentProcessed, books a flight, and emits FlightBooked.

This seems elegant. Services are decoupled, right? But what happens when the FlightService fails? The PaymentService must now listen for a FlightBookingFailed event to trigger a refund. The OrderService must listen for PaymentFailed or FlightBookingFailed to mark the order as failed.

This leads to a combinatorial explosion of event listeners and a tangled web of dependencies that is anything but decoupled. The key problems with choreographed sagas in production are:

* Implicit State Management: The overall state of the saga is not explicitly stored anywhere. To understand the status of a trip booking, you must correlate events across multiple topics and services, a significant observability challenge.

* Cyclic Dependencies: Service A emits an event that triggers B, which might emit an event that requires A to take a compensating action. This creates hidden cyclic dependencies at the business logic level, making the system incredibly difficult to reason about.

* Complex Error Handling: Implementing a complete rollback requires each service to know the entire business process to correctly interpret failure events. A single unhandled event or a bug in a consumer can lead to permanent data inconsistency.

* Difficult Debugging and Testing: Tracing a single failed transaction through a dozen event logs is a nightmare. Integration testing requires spinning up the entire ecosystem of services and the message bus.

Let's illustrate with a code snippet representing this flawed pattern:

// DO NOT USE THIS PATTERN IN PRODUCTION
// Example of a fragile, choreographed saga participant

package main

import (
	"fmt"
	"github.com/streadway/amqp"
)

// PaymentService listens for OrderCreated events
func handleOrderCreated(d amqp.Delivery) {
	orderID := string(d.Body)
	fmt.Printf("Processing payment for order %s\n", orderID)

	// 1. Process payment via some gateway
	err := processPayment(orderID)

	// 2. Publish result to the event bus
	if err != nil {
		fmt.Printf("Payment failed for order %s: %v\n", orderID, err)
		// What if this publish fails? The system is now in an inconsistent state.
		publishEvent("payment.events", "PaymentFailed", []byte(orderID))
	} else {
		fmt.Printf("Payment successful for order %s\n", orderID)
		// What if this publish fails after a successful payment? Money is taken, but no flight is booked.
		publishEvent("payment.events", "PaymentProcessed", []byte(orderID))
	}
}

// ... imagine similar handlers for FlightService, HotelService, etc.
// ... and then more handlers for compensation events like PaymentFailed, FlightBookingFailed, etc.

The comments highlight the core issue: the state of the business transaction is intertwined with the reliability of the message bus. A failed publishEvent call after a successful database transaction leaves the system in a broken state, a problem often addressed with the complex Transactional Outbox pattern, which adds even more infrastructure and complexity.

This is where orchestration, supercharged by a durable execution engine like Temporal, provides a fundamentally more robust and maintainable solution.

Temporal's Orchestration: State as Code

Temporal isn't just a workflow engine; it's a paradigm shift. It offers durable execution, meaning your workflow code's state—including local variables, call stacks, and blocked threads—is automatically and continuously persisted. A Temporal Worker can crash, the server can go down, or a deployment can happen, and when a worker comes back online, your workflow code resumes from the exact line it left off, with all state intact.

For senior engineers, this means you can stop thinking about state machines, database tables for tracking progress, and complex message queue topologies. You can write your business logic as a straightforward, sequential piece of code.

Key concepts that make this possible:

* Workflow: This is your orchestrator function. It's written in a standard programming language (like Go) but is deterministic and sandboxed. It cannot perform I/O directly (no HTTP calls, no database access). Its purpose is to orchestrate the business logic.

* Activity: This is a regular function that executes your non-deterministic, side-effect-heavy code. This is where you call other microservices, interact with databases, or talk to third-party APIs. Activities are where the real work happens.

* Worker: A process you run that hosts your Workflow and Activity implementations. It polls the Temporal Cluster for tasks to execute.

By separating the orchestration logic (Workflow) from the implementation details (Activities), Temporal provides a powerful abstraction. The Workflow code is durable and resumable, while the Activities can be retried, have timeouts, and are designed to fail and be handled gracefully.

This model directly addresses the flaws of choreography:

* Explicit State Management: The state of the saga is the state of the Workflow Execution itself, managed entirely by Temporal. You can query it, inspect its history, and know exactly which step it's on.

* Clear Dependencies: The data flow is explicit. The workflow calls activities sequentially, passing data as function arguments and return values. There are no hidden event listeners.

* Centralized Error Handling: All errors from Activities are returned to the Workflow, allowing for centralized, coherent error handling and compensation logic in one place.

Building a Production-Grade Booking Saga with Temporal

Let's rebuild our trip booking system using Temporal and Go. The scenario: book a flight, a hotel, and a car rental. If any step fails, we must compensate for all previously successful steps.

Project Structure:

text

/trip-booking-saga
|-- /activities
|   |-- flight_activity.go
|   |-- hotel_activity.go
|   |-- car_activity.go
|-- /workflow
|   |-- trip_workflow.go
|-- /worker
|   |-- main.go
|-- /starter
|   |-- main.go
|-- go.mod
|-- go.sum

Step 1: Defining the Activities

Activities are the bridge to the outside world. They are standard Go functions that take a context.Context and any required parameters. They return a result and an error.

activities/hotel_activity.go

package activities

import (
	"context"
	"fmt"
	"time"

	"go.temporal.io/sdk/activity"
)

// MockHotelService simulates interacting with an external hotel booking service.
type MockHotelService struct{}

func (s *MockHotelService) BookHotel(ctx context.Context, bookingID, guestName string) (string, error) {
	logger := activity.GetLogger(ctx)
	logger.Info("Booking hotel", "BookingID", bookingID, "GuestName", guestName)

	// Simulate a failure for specific cases to test compensation
	if guestName == "Failing Guest" {
		logger.Error("Hotel service failed to book for Failing Guest")
		return "", fmt.Errorf("hotel booking failed for guest: %s", guestName)
	}

	// Simulate network latency
	time.Sleep(1 * time.Second)

	confirmationID := fmt.Sprintf("HOTEL-CONF-%s", bookingID)
	logger.Info("Hotel booked successfully", "ConfirmationID", confirmationID)
	return confirmationID, nil
}

func (s *MockHotelService) CancelHotel(ctx context.Context, bookingID, confirmationID string) error {
	logger := activity.GetLogger(ctx)
	logger.Info("Cancelling hotel booking", "BookingID", bookingID, "ConfirmationID", confirmationID)

	// In a real app, this would be an API call. Compensation actions MUST be idempotent.
	time.Sleep(500 * time.Millisecond)

	logger.Info("Hotel booking cancelled successfully")
	return nil
}

We would define similar flight_activity.go and car_activity.go files, each with a Book and a Cancel method.

Step 2: Implementing the Workflow

This is where the orchestration logic lives. Note the use of workflow.ExecuteActivity and the configuration of timeouts. This code is deterministic.

workflow/trip_workflow.go

package workflow

import (
	"time"

	"go.temporal.io/sdk/workflow"
	"trip-booking-saga/activities"
)

// TripBookingWorkflow orchestrates the entire booking process.
func TripBookingWorkflow(ctx workflow.Context, guestName string) (string, error) {
	wfLogger := workflow.GetLogger(ctx)
	wfLogger.Info("Trip booking workflow started", "GuestName", guestName)

	// Configure activity options: timeouts are crucial for production systems.
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Second, // Max time for a single attempt.
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	// ... Compensation logic will be added here in the next section ...

	// Step 1: Book Flight
	var flightConfirmationID string
	err := workflow.ExecuteActivity(ctx, activities.MockFlightService{}.BookFlight, "TRIP-123", guestName).Get(ctx, &flightConfirmationID)
	if err != nil {
		wfLogger.Error("Failed to book flight", "Error", err)
		return "", err
	}

	// Step 2: Book Hotel
	var hotelConfirmationID string
	err = workflow.ExecuteActivity(ctx, activities.MockHotelService{}.BookHotel, "TRIP-123", guestName).Get(ctx, &hotelConfirmationID)
	if err != nil {
		wfLogger.Error("Failed to book hotel", "Error", err)
		// Naive compensation attempt (we will improve this)
		cancelErr := workflow.ExecuteActivity(ctx, activities.MockFlightService{}.CancelFlight, "TRIP-123", flightConfirmationID).Get(ctx, nil)
		if cancelErr != nil {
			wfLogger.Error("Failed to cancel flight during hotel booking failure compensation", "SubError", cancelErr)
			// The saga is now in an inconsistent state! What do we do?
		}
		return "", err
	}

	// Step 3: Book Car
	var carConfirmationID string
	err = workflow.ExecuteActivity(ctx, activities.MockCarService{}.BookCar, "TRIP-123", guestName).Get(ctx, &carConfirmationID)
	if err != nil {
		wfLogger.Error("Failed to book car", "Error", err)
		// ... even more complex compensation logic needed here ...
		return "", err
	}

	result := fmt.Sprintf("Trip booked successfully! Flight: %s, Hotel: %s, Car: %s", flightConfirmationID, hotelConfirmationID, carConfirmationID)
	wfLogger.Info("Workflow completed successfully.")
	return result, nil
}

The naive error handling above already shows the cracks. A chain of if err != nil blocks becomes unwieldy and itself is a source of bugs. What if the compensation call fails? This approach is not robust.

The Core Pattern: Bulletproof Compensation with `defer`

Temporal's durable execution model allows us to use standard language constructs for powerful patterns. In Go, the defer statement is perfect for building a reliable compensation stack. A deferred function call is pushed onto a stack and executed when the surrounding function returns.

Because Temporal persists the entire workflow state, this defer stack is also persisted. If a worker crashes after successfully booking a flight but before booking a hotel, when the workflow resumes on a new worker, the defer statement to cancel the flight is still on the stack, ready to be executed if the workflow eventually fails.

Here is the robust, production-grade implementation of the workflow:

workflow/trip_workflow.go (Revised and Production-Ready)

package workflow

import (
	"fmt"
	"time"

	"go.temporal.io/sdk/workflow"
	"trip-booking-saga/activities"
)

type BookingConfirmations struct {
	Flight string
	Hotel  string
	Car    string
}

func TripBookingWorkflow(ctx workflow.Context, guestName string) (string, error) {
	wfLogger := workflow.GetLogger(ctx)
	wfLogger.Info("Trip booking workflow started", "GuestName", guestName)

	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Second,
		// We will add retry policies in the idempotency section
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	var confirmations BookingConfirmations
	var compensationStack []func(workflow.Context)
	var err error

	// Use a deferred function to execute compensations on any failure.
	// This is guaranteed to run because Temporal persists the defer stack.
	defer func() {
		if err != nil {
			wfLogger.Warn("Workflow failed, starting compensation logic.")
			// Run compensations in reverse order
			for i := len(compensationStack) - 1; i >= 0; i-- {
				compensationStack[i](ctx)
			}
		}
	}()

	// Step 1: Book Flight
	err = workflow.ExecuteActivity(ctx, activities.MockFlightService{}.BookFlight, "TRIP-123", guestName).Get(ctx, &confirmations.Flight)
	if err != nil {
		wfLogger.Error("Failed to book flight", "Error", err)
		return "", err
	}
	// If successful, add the compensation to the stack.
	compensationStack = append(compensationStack, func(ctx workflow.Context) {
		wfLogger.Info("Executing compensation: CancelFlight")
		// Use a new context for compensation to avoid cancellation from the main workflow.
		// Set a longer timeout for compensation activities as they are critical.
		cao := workflow.ActivityOptions{StartToCloseTimeout: 30 * time.Second}
		ctxComp := workflow.WithActivityOptions(ctx, cao)
		_ = workflow.ExecuteActivity(ctxComp, activities.MockFlightService{}.CancelFlight, "TRIP-123", confirmations.Flight).Get(ctxComp, nil)
		// Note: We are ignoring the error here. This is a critical design decision.
		// We discuss this in the 'Edge Cases' section.
	})

	// Step 2: Book Hotel
	err = workflow.ExecuteActivity(ctx, activities.MockHotelService{}.BookHotel, "TRIP-123", guestName).Get(ctx, &confirmations.Hotel)
	if err != nil {
		wfLogger.Error("Failed to book hotel", "Error", err)
		return "", err
	}
	compensationStack = append(compensationStack, func(ctx workflow.Context) {
		wfLogger.Info("Executing compensation: CancelHotel")
		cao := workflow.ActivityOptions{StartToCloseTimeout: 30 * time.Second}
		ctxComp := workflow.WithActivityOptions(ctx, cao)
		_ = workflow.ExecuteActivity(ctxComp, activities.MockHotelService{}.CancelHotel, "TRIP-123", confirmations.Hotel).Get(ctxComp, nil)
	})

	// Step 3: Book Car
	err = workflow.ExecuteActivity(ctx, activities.MockCarService{}.BookCar, "TRIP-123", guestName).Get(ctx, &confirmations.Car)
	if err != nil {
		wfLogger.Error("Failed to book car", "Error", err)
		return "", err
	}

	result := fmt.Sprintf("Trip booked successfully! Flight: %s, Hotel: %s, Car: %s", confirmations.Flight, confirmations.Hotel, confirmations.Car)
	wfLogger.Info("Workflow completed successfully.")
	return result, nil // err is nil here, so the defer func does nothing
}

This pattern is vastly superior:

Atomicity: The logic for an action and its compensation are defined together, improving code locality and readability.

Reliability: The compensation stack is durably persisted. The defer block will execute if the function returns with an error, regardless of worker failures.

Centralization: The decision to compensate is made once, at the end of the workflow, based on the final err value.

Advanced Edge Cases: Idempotency and Compensation Failures

We've built a robust compensation model, but two critical production concerns remain: what if a compensation activity fails, and how do we handle retries without causing duplicate actions?

The Idempotency Imperative

Temporal guarantees at-least-once execution for Activities. A network partition could occur after your activity successfully completes but before the result is acknowledged by the Temporal server. The server will time out and reschedule the activity on another worker. If your activity is not idempotent, this will result in a duplicate action (e.g., booking the same flight twice).

Both your primary actions (BookFlight) and your compensation actions (CancelFlight) MUST be idempotent.

The standard pattern is to use an idempotency key. The workflow, being the reliable orchestrator, is the perfect place to generate this key.

Implementation Pattern:

Generate a key in the Workflow: Use workflow.SideEffect to generate a unique ID for each logical action. SideEffect ensures the code runs only once and its result is recorded in the workflow history, making it available on replay.

Pass the key to the Activity: Include the key in the activity's parameters.

Implement a check in your Microservice: The receiving service (e.g., FlightService) must check for the idempotency key before performing the action.

Workflow Code with Idempotency Keys:

// Inside TripBookingWorkflow...

// Generate a unique, deterministic key for the flight booking action
var flightBookingIdempotencyKey string
encodedRandom := workflow.SideEffect(ctx, func(ctx workflow.Context) interface{} {
	return uuid.NewString()
})
_ = encodedRandom.Get(&flightBookingIdempotencyKey)

// Pass the key to the activity
err = workflow.ExecuteActivity(ctx, activities.MockFlightService{}.BookFlight, "TRIP-123", guestName, flightBookingIdempotencyKey).Get(ctx, &confirmations.Flight)
// ...

Microservice Implementation (Pseudo-code):

// In the FlightService API handler
func (s *FlightService) BookFlightHandler(w http.ResponseWriter, r *http.Request) {
	idempotencyKey := r.Header.Get("Idempotency-Key")

	// 1. Check if we've processed this key before (e.g., in Redis or a DB table)
	processed, err := s.idempotencyStore.IsProcessed(idempotencyKey)
	if err != nil {
		// Handle storage error
		http.Error(w, "Internal Server Error", http.StatusInternalServerError)
		return
	}

	if processed {
		// 2. If yes, return the previously stored successful response
		response := s.idempotencyStore.GetResponse(idempotencyKey)
		w.WriteHeader(http.StatusOK)
		json.NewEncoder(w).Encode(response)
		return
	}

	// 3. If no, begin a transaction
	tx, _ := s.db.Begin()
	// 4. Perform the booking logic
	result, err := s.performBooking(tx, r.Body)
	// 5. Store the idempotency key and the result *within the same transaction*
	_ = s.idempotencyStore.StoreResponse(tx, idempotencyKey, result)
	// 6. Commit the transaction
	tx.Commit()

	// 7. Return the result
	w.WriteHeader(http.StatusCreated)
	json.NewEncoder(w).Encode(result)
}

This server-side implementation ensures that even if Temporal retries the BookFlight activity, the flight is booked only once.

Handling Compensation Failures

In our defer block, we ignored the error from the compensation activity: _ = workflow.ExecuteActivity(...). This is a deliberate and critical design choice.

Why? A saga's goal is to achieve eventual consistency. If a compensation (CancelHotel) fails, what is the workflow supposed to do? It cannot undo the other cancellations. The entire saga has entered a state that requires manual intervention. Retrying the CancelHotel activity indefinitely might not solve the problem if the hotel's system is down for an extended period.

The correct pattern is to retry the compensation activity aggressively for a period, and if it still fails, escalate to a human.

Implementation Pattern:

Configure a robust RetryPolicy: For compensation activities, use a custom RetryPolicy with a long maximum interval and unlimited attempts.

Log the error: If the activity ultimately fails after all retries, the error will be returned. The workflow should log this critical failure.

Escalate: The workflow can then execute another activity to notify an on-call engineer via PagerDuty, create a Jira ticket, or push a message to a dead-letter queue for manual review.

Revised Compensation Code:

// Inside the defer block's compensation function for CancelHotel

cao := workflow.ActivityOptions{
	StartToCloseTimeout: 5 * time.Minute, // Give each attempt more time
	RetryPolicy: &temporal.RetryPolicy{
		InitialInterval:    1 * time.Second,
		BackoffCoefficient: 2.0,
		MaximumInterval:    1 * time.Minute,
		MaximumAttempts:    0, // Unlimited retries
	},
}
ctxComp := workflow.WithActivityOptions(ctx, cao)

err := workflow.ExecuteActivity(ctxComp, activities.MockHotelService{}.CancelHotel, "TRIP-123", confirmations.Hotel).Get(ctxComp, nil)
if err != nil {
	wfLogger.Error("CRITICAL: FAILED TO COMPENSATE HOTEL BOOKING. MANUAL INTERVENTION REQUIRED.", "BookingID", "TRIP-123", "ConfirmationID", confirmations.Hotel, "Error", err)
	// Escalate to a human operator
	_ = workflow.ExecuteActivity(ctx, activities.AlertOncallActivity, "Hotel cancellation failed for TRIP-123").Get(ctx, nil)
}

This pattern ensures the system does its best to self-heal via retries. When it can't, it reliably hands off the problem for human resolution, preventing the failure from being silently lost.

Performance and Scalability Considerations

* Worker Tuning: Your Temporal Workers are stateless. You can scale them horizontally. The key parameters to tune are the number of concurrent task pollers for workflows and activities (Worker.Options.MaxConcurrentActivityTaskPollers). If your activities are I/O-bound, you can increase this number significantly to improve throughput.

* Activity Heartbeating: For activities that might run for a long time (e.g., processing a large file), the StartToCloseTimeout might be too coarse. If the worker running the activity crashes 30 minutes into a 1-hour task, you don't want to wait another 30 minutes for the timeout to fire. Activities can periodically call activity.RecordHeartbeat. If the server doesn't receive a heartbeat within the configured HeartbeatTimeout, it will quickly fail and reschedule the activity.

* Workflow History Size: Every action in a workflow (activity scheduled, timer fired, signal received) is recorded in its history. For very long-running workflows or workflows with many thousands of steps, this history can become large, impacting performance. For sagas that could potentially be very long or run forever (e.g., a subscription management workflow), use the workflow.ContinueAsNew API. This completes the current workflow and starts a new one with the same ID, effectively truncating the history while carrying over the state.

Conclusion: Sagas as First-Class Citizens

By moving from a brittle, choreographed event-driven model to a durable, orchestrated one with Temporal, we elevate the Saga from a conceptual pattern to a first-class, testable, and maintainable component of our architecture. The state of the business transaction is no longer implicitly spread across message logs but is explicitly managed and visible within the workflow execution.

Using advanced patterns like durable defer for compensation, generating idempotency keys within the workflow, and designing robust retry/escalation policies for compensation failures allows us to build truly resilient distributed systems. Temporal provides the primitives to solve these hard problems, letting senior engineers focus on the business logic itself, rather than on the complex and error-prone plumbing of distributed state management.