Temporal Sagas: Fault-Tolerant Workflows Beyond Two-Phase Commit
The Inescapable Consistency Problem in Microservices
In a monolithic world, ACID transactions, managed by the database, were our bedrock of consistency. The all-or-nothing guarantee of a database transaction simplified application logic immensely. As we've decomposed our systems into distributed microservices, we've gained scalability and team autonomy, but we've lost this fundamental tool. The Two-Phase Commit (2PC) protocol, often touted as a solution for distributed transactions, is notoriously problematic in modern architectures. It introduces synchronous coupling, is a performance bottleneck, and is incredibly fragile in the face of network partitions, making it a non-starter for most high-availability systems.
This leaves us with a critical question: how do we ensure a business process that spans multiple services either completes successfully or is properly undone, leaving the system in a consistent state? The answer lies in embracing eventual consistency through the Saga pattern.
A Saga is a sequence of local transactions where each transaction updates data within a single service. After each step, the Saga publishes an event or calls the next service to trigger the subsequent local transaction. If any local transaction fails, a series of compensating transactions are executed to undo the preceding transactions. While powerful, implementing a Saga orchestrator from scratch is a significant engineering challenge. You need to manage state, handle retries, track timeouts, and ensure compensations are reliably executed—often leading to a complex, bespoke state machine built on message queues and databases.
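The core control flow can be sketched in a few lines of plain Go. This is only an in-memory illustration: the `Step` type, `RunSaga`, and `demoSaga` are hypothetical names invented for this sketch, and nothing here survives a process crash, which is exactly the gap a real orchestrator has to close.

```go
package main

import (
	"errors"
	"fmt"
)

// Step pairs a local transaction with the compensation that undoes it.
// The type is hypothetical and not part of any library.
type Step struct {
	Name       string
	Action     func() error
	Compensate func()
}

// RunSaga executes the steps in order. If a step fails, it runs the
// compensations of the already-completed steps in reverse order and
// returns the error.
func RunSaga(steps []Step) error {
	var done []Step
	for _, s := range steps {
		if err := s.Action(); err != nil {
			for i := len(done) - 1; i >= 0; i-- {
				done[i].Compensate()
			}
			return fmt.Errorf("step %q failed: %w", s.Name, err)
		}
		done = append(done, s)
	}
	return nil
}

// demoSaga runs a three-step saga whose last step fails and returns
// the trace of actions and compensations that were executed.
func demoSaga() ([]string, error) {
	var trace []string
	record := func(msg string) { trace = append(trace, msg) }
	steps := []Step{
		{"reserve", func() error { record("reserve"); return nil }, func() { record("release") }},
		{"charge", func() error { record("charge"); return nil }, func() { record("refund") }},
		{"ship", func() error { return errors.New("carrier unavailable") }, func() { record("cancel shipment") }},
	}
	err := RunSaga(steps)
	return trace, err
}

func main() {
	trace, err := demoSaga()
	fmt.Println(trace, err)
}
```

The failed step itself is not compensated, since it never completed; only the steps before it are undone, newest first.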
This is where Temporal.io enters the picture. Temporal is not just a workflow engine; it's a durable execution platform. It treats your workflow logic—your Saga orchestrator—as a durable, fault-tolerant, and stateful function that can run for seconds or years. It abstracts away the complexity of state management, retries, and failure handling, allowing us to write Saga logic as straightforward, sequential code. This article will demonstrate how to leverage Temporal to build a robust, production-ready Saga, moving beyond theory and into the weeds of implementation.
Section 1: Mapping Saga Concepts to Temporal Primitives
Before diving into code, it's crucial to understand how the abstract concepts of the Saga pattern map directly to Temporal's powerful primitives. We'll focus on the Orchestration-based Saga, where a central coordinator is responsible for sequencing the steps and their compensations. This model is a natural fit for Temporal's workflow-as-code paradigm.
* Saga Orchestrator: This is the brain of the operation, defining the sequence of steps and the corresponding compensation logic. In Temporal, this is your Workflow Definition. The entire state of the orchestrator is durably persisted by the Temporal Cluster, meaning the workflow code can be simple, stateful, and resilient to process failures.
* Saga Transaction (Step): A single business operation within a service (e.g., reserving inventory, processing a payment). In Temporal, this is an Activity. Activities are where you perform all your side effects: database calls, API requests, etc. Temporal guarantees that an activity will be executed at least once.
* Compensating Transaction: An operation that semantically undoes a corresponding Saga transaction (e.g., releasing inventory, refunding a payment). In Temporal, this is simply another Activity. The key is how the workflow orchestrates its execution.
* Saga Log: In traditional Saga implementations, you need a database table to persist the state of the Saga at each step. This is a critical point of failure and complexity. Temporal completely eliminates the need for a manual Saga Log. The workflow's execution history, managed internally by the Temporal Cluster, serves as the durable, auditable log.
By using Temporal, our primary task shifts from building complex state management infrastructure to simply defining the business logic of our workflow. The most elegant way to handle compensations is by leveraging the structured error handling features of your chosen language.
In Go, the defer statement is well suited to this. After each primary activity succeeds, the workflow records the matching compensation. A single deferred function then runs the recorded compensations in last-in, first-out (LIFO) order if the workflow function exits with an error. This sequences the compensations correctly without complex conditional logic.
// Simplified conceptual example
func MySagaWorkflow(ctx workflow.Context, input MyInput) (err error) {
	// ... setup
	var compensations []func()

	// On failure, run the recorded compensations in reverse (LIFO) order.
	// err is a named return value, so the deferred function sees the final error.
	defer func() {
		if err == nil {
			return // Only run compensations on failure
		}
		for i := len(compensations) - 1; i >= 0; i-- {
			// Each compensation blocks until its activity finishes, so the
			// workflow does not complete while undo work is still pending.
			// A real implementation needs more robust error handling.
			compensations[i]()
		}
	}()

	// Step 1: Reserve Inventory
	err = workflow.ExecuteActivity(ctx, ReserveInventoryActivity, input).Get(ctx, nil)
	if err != nil {
		return err
	}
	compensations = append(compensations, func() {
		_ = workflow.ExecuteActivity(ctx, ReleaseInventoryActivity, input).Get(ctx, nil)
	})

	// Step 2: Process Payment
	err = workflow.ExecuteActivity(ctx, ProcessPaymentActivity, input).Get(ctx, nil)
	if err != nil {
		return err
	}
	compensations = append(compensations, func() {
		_ = workflow.ExecuteActivity(ctx, RefundPaymentActivity, input).Get(ctx, nil)
	})

	// ... more steps
	return nil
}
This pattern is the foundation of our robust Saga. Now, let's build a complete, production-grade example.
Section 2: Production-Grade Implementation: An E-Commerce Order Saga
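Let's model a realistic e-commerce order processing flow. When a user places an order, the following must happen:
1. Reserve inventory for the ordered items.
2. Process the customer's payment.
3. Create the shipment.
4. Notify the user that the order is confirmed.
If any step fails, all preceding steps must be compensated. For example, if payment fails, we must release the reserved inventory.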
The Workflow Orchestrator
Here is the complete Go code for the workflow orchestrator. Note the use of the SagaState helper struct to manage compensation logic cleanly.
package order

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// SagaState holds the compensation stack for our Saga.
type SagaState struct {
	compensations []func(workflow.Context)
}

// AddCompensation pushes a compensation function onto the stack.
func (s *SagaState) AddCompensation(f func(workflow.Context)) {
	s.compensations = append(s.compensations, f)
}

// Compensate runs all registered compensation functions in LIFO order.
func (s *SagaState) Compensate(ctx workflow.Context) {
	// Use a disconnected context so compensations still run when the
	// main workflow context has been cancelled.
	detachedCtx, _ := workflow.NewDisconnectedContext(ctx)
	compensationCtx := workflow.WithActivityOptions(detachedCtx, workflow.ActivityOptions{
		StartToCloseTimeout: 5 * time.Minute,
		// MaximumAttempts is left at zero (unlimited), so each compensation
		// retries with capped exponential backoff until it succeeds.
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:        time.Second,
			BackoffCoefficient:     2.0,
			MaximumInterval:        time.Minute,
			NonRetryableErrorTypes: []string{}, // No error types are excluded from retry
		},
	})
	for i := len(s.compensations) - 1; i >= 0; i-- {
		// Each compensation is a workflow-side closure that executes its
		// activity and blocks until it completes, which keeps the undo
		// steps strictly in reverse order.
		s.compensations[i](compensationCtx)
	}
}

// OrderWorkflow implements the e-commerce order Saga.
func OrderWorkflow(ctx workflow.Context, orderID string, items []string, customerID string) (result string, err error) {
	// Activity options for the forward-moving steps: a short timeout
	// and a bounded retry policy.
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    time.Second,
			BackoffCoefficient: 2.0,
			MaximumInterval:    time.Minute,
			MaximumAttempts:    3,
		},
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	logger := workflow.GetLogger(ctx)
	saga := &SagaState{}

	// err is a named return value, so this deferred function sees the
	// error from any return path and runs the compensations on failure.
	defer func() {
		if err != nil {
			logger.Error("Workflow failed, starting compensation.", "Error", err)
			saga.Compensate(ctx)
		}
	}()

	// 1. Reserve Inventory
	var reservationID string
	err = workflow.ExecuteActivity(ctx, ReserveInventoryActivity, orderID, items).Get(ctx, &reservationID)
	if err != nil {
		return "", err
	}
	saga.AddCompensation(func(cctx workflow.Context) {
		_ = workflow.ExecuteActivity(cctx, ReleaseInventoryActivity, orderID, reservationID).Get(cctx, nil)
	})

	// 2. Process Payment
	var transactionID string
	err = workflow.ExecuteActivity(ctx, ProcessPaymentActivity, orderID, customerID).Get(ctx, &transactionID)
	if err != nil {
		return "", err
	}
	saga.AddCompensation(func(cctx workflow.Context) {
		_ = workflow.ExecuteActivity(cctx, RefundPaymentActivity, orderID, transactionID).Get(cctx, nil)
	})

	// 3. Create Shipment
	var shipmentID string
	err = workflow.ExecuteActivity(ctx, CreateShipmentActivity, orderID, items).Get(ctx, &shipmentID)
	if err != nil {
		return "", err
	}
	saga.AddCompensation(func(cctx workflow.Context) {
		_ = workflow.ExecuteActivity(cctx, CancelShipmentActivity, orderID, shipmentID).Get(cctx, nil)
	})

	// 4. Notify User (non-critical, no compensation)
	// This step has its own activity options, and its error is ignored, so a
	// failed notification neither fails the workflow nor triggers compensation.
	notifyCtx, _ := workflow.NewDisconnectedContext(ctx)
	notifyAo := workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Second,
		RetryPolicy:         &temporal.RetryPolicy{MaximumAttempts: 2},
	}
	notifyCtx = workflow.WithActivityOptions(notifyCtx, notifyAo)
	_ = workflow.ExecuteActivity(notifyCtx, NotifyUserActivity, customerID, "Your order is confirmed!").Get(notifyCtx, nil)

	logger.Info("Order workflow completed successfully.", "OrderID", orderID)
	return "Order completed: " + orderID, nil
}
This workflow implementation demonstrates several production-ready patterns:
* Saga helper: The SagaState struct provides a clean API (AddCompensation, Compensate) for managing the compensation stack.
* Compensation via defer: The defer block ensures that if err is non-nil upon function exit, the Compensate function is called, which executes compensations in reverse order.
* Disconnected context: In Compensate, we create a DisconnectedContext. This is critical. It ensures that the compensation activities still run even if the workflow has been cancelled. Their retry policies are configured to be extremely aggressive, as a failed compensation leaves the system in an inconsistent state.
* Non-critical steps: NotifyUserActivity is treated as non-critical. It runs in a disconnected context with its own timeout and only two attempts. Its failure does not trigger the Saga rollback.
Section 3: Handling Advanced Edge Cases and Failures
The real world is more complex than a happy-path scenario. A robust Saga implementation must anticipate and handle difficult edge cases.
Edge Case 1: Non-Idempotent Activities
Temporal's at-least-once guarantee means an activity might execute more than once (e.g., a worker crashes after completing the work but before reporting back). If your activity is not idempotent (like charging a credit card), this can be disastrous.
Solution: The workflow generates one unique idempotency key per logical operation and passes it to the activity. Every retry attempt of that activity receives the same key, which the service uses to de-duplicate requests.
// In the Workflow:
func OrderWorkflow(...) (string, error) {
	// ...
	// Generate one idempotency key for the payment step. Retries of the
	// activity reuse this key because it is recorded in the workflow history.
	// SideEffect returns an encoded value, so decode it with Get.
	// (uuid is assumed to be github.com/google/uuid.)
	var paymentKey string
	err = workflow.SideEffect(ctx, func(ctx workflow.Context) interface{} {
		return uuid.NewString()
	}).Get(&paymentKey)
	if err != nil {
		return "", err
	}
	var transactionID string
	err = workflow.ExecuteActivity(ctx, ProcessPaymentActivity, orderID, customerID, paymentKey).Get(ctx, &transactionID)
	// ...
}

// In the Activity:
func ProcessPaymentActivity(ctx context.Context, orderID string, customerID string, idempotencyKey string) (string, error) {
	// 1. Check a database table (e.g., `payment_attempts`) for the idempotencyKey.
	// 2. If found, return the result of the previous successful call.
	// 3. If not found, proceed with the payment API call.
	// 4. Store the result and the idempotencyKey in the database within the same local transaction.
	// 5. Return the result.

	// Example call to payment service.
	// amount is assumed to be loaded from the order record (lookup omitted).
	chargeResult, err := paymentServiceClient.Charge(customerID, amount, idempotencyKey)
	if err != nil {
		return "", err
	}
	return chargeResult.TransactionID, nil
}
We use workflow.SideEffect to generate the UUID. This is crucial because workflow code must be deterministic. SideEffect ensures the random UUID is generated only once and recorded in the workflow history, so it remains the same on replay.
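The service-side half of this contract, steps 1 to 5 in the activity comments above, can be illustrated with a small in-memory model. This is a sketch under stated assumptions: `PaymentService`, its `Charge` signature, and the `tx-N` transaction IDs are all hypothetical, and a real service would keep the key-to-result mapping in a database updated in the same local transaction as the charge.

```go
package main

import (
	"fmt"
	"sync"
)

// PaymentService is a hypothetical in-memory stand-in for a payment
// backend that de-duplicates charges by idempotency key.
type PaymentService struct {
	mu      sync.Mutex
	results map[string]string // idempotency key -> transaction ID
	charges int               // number of charges actually performed
}

// NewPaymentService returns an empty service.
func NewPaymentService() *PaymentService {
	return &PaymentService{results: make(map[string]string)}
}

// Charge performs the charge at most once per idempotency key. A repeated
// call with the same key returns the stored transaction ID instead of
// charging again.
func (p *PaymentService) Charge(customerID string, amountCents int64, key string) string {
	p.mu.Lock()
	defer p.mu.Unlock()
	if txID, ok := p.results[key]; ok {
		return txID // duplicate delivery: replay the earlier result
	}
	p.charges++
	txID := fmt.Sprintf("tx-%d", p.charges)
	p.results[key] = txID
	return txID
}

// Charges reports how many charges were actually performed.
func (p *PaymentService) Charges() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.charges
}

func main() {
	svc := NewPaymentService()
	first := svc.Charge("cust-1", 4999, "order-42-payment")
	retry := svc.Charge("cust-1", 4999, "order-42-payment") // simulated activity retry
	fmt.Println(first == retry, svc.Charges())
}
```

Because the retry carries the same key, the second call returns the first transaction ID and the customer is charged once.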
Edge Case 2: Compensation Failures
This is the nightmare scenario: your Saga has failed, and now the compensation activity itself is failing. For example, RefundPaymentActivity fails because the payment gateway is down.
Solution: Compensation activities must be designed to eventually succeed. They should be idempotent and have a retry policy that effectively runs forever. The Compensate function we wrote earlier already configures this: it leaves MaximumAttempts unset, which means unlimited retries, and keeps NonRetryableErrorTypes empty so that no error type is excluded from retry. If a compensation still fails after hours or days, it signifies a critical infrastructure problem that requires human intervention. You should have monitoring and alerting in place for activities that have been retrying for an extended period.
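To get a feel for what that retry policy means in practice, the wait between attempts can be approximated with a few lines of arithmetic. The helper below is a hypothetical model written for this article, not SDK code; it assumes the interval grows geometrically from InitialInterval by BackoffCoefficient and is capped at MaximumInterval.

```go
package main

import (
	"fmt"
	"time"
)

// backoffInterval approximates the wait before retry number `attempt`
// (1-based): initial * coefficient^(attempt-1), capped at maxInterval.
// It is an illustrative model, not the SDK's implementation.
func backoffInterval(initial time.Duration, coefficient float64, maxInterval time.Duration, attempt int) time.Duration {
	interval := float64(initial)
	for i := 1; i < attempt; i++ {
		interval *= coefficient
		if interval >= float64(maxInterval) {
			return maxInterval
		}
	}
	if interval > float64(maxInterval) {
		return maxInterval
	}
	return time.Duration(interval)
}

func main() {
	// Same values as the compensation retry policy: 1s initial, x2, capped at 1 minute.
	for attempt := 1; attempt <= 8; attempt++ {
		fmt.Println(attempt, backoffInterval(time.Second, 2.0, time.Minute, attempt))
	}
}
```

With these values the waits are 1s, 2s, 4s, 8s, 16s, 32s, and then one minute for every later retry, so a compensation that has been failing for an hour has already been attempted roughly sixty times. That is a reasonable point at which to alert a human.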
Edge Case 3: Long-Running Activities and Heartbeating
What if a Saga step involves a human approval that could take days? A StartToCloseTimeout of several days on its own is impractical: a crashed worker would go unnoticed until the timeout finally expires.
Solution: Use Activity Heartbeating. The long-running activity periodically reports back to the Temporal Cluster that it is still alive and making progress.
// A long-running activity that polls for manual approval
func WaitForHumanApprovalActivity(ctx context.Context, orderID string) (bool, error) {
	for {
		// Check for approval
		isApproved, err := checkApprovalSystem(orderID)
		if err != nil {
			return false, err
		}
		if isApproved {
			return true, nil
		}
		// Record a heartbeat to show we're still alive.
		// The details can be used to checkpoint progress.
		activity.RecordHeartbeat(ctx, "Still waiting for approval...")
		// Wait before polling again, but return promptly if the
		// activity context is cancelled or times out.
		select {
		case <-ctx.Done():
			return false, ctx.Err()
		case <-time.After(30 * time.Second):
		}
	}
}
In the workflow, you would set a short HeartbeatTimeout (e.g., 2 minutes) along with a very long StartToCloseTimeout. If the Temporal Cluster doesn't receive a heartbeat within the HeartbeatTimeout window, it will fail the activity and retry it, potentially on a different worker. This provides rapid failure detection for long-running tasks.
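The interaction between the two timeouts can be modelled with a small decision function. This is a conceptual sketch only: `timeoutFired` is a hypothetical helper, and the 72-hour StartToCloseTimeout is an assumed value chosen for illustration, while the 2-minute heartbeat window comes from the paragraph above.

```go
package main

import (
	"fmt"
	"time"
)

// timeoutFired reports which timeout, if any, would fail a running
// activity attempt, given the time since the attempt started and the
// time since its last heartbeat. Hypothetical model, not SDK code.
func timeoutFired(sinceStart, sinceLastHeartbeat, startToClose, heartbeatTimeout time.Duration) string {
	switch {
	case sinceLastHeartbeat > heartbeatTimeout:
		return "heartbeat"
	case sinceStart > startToClose:
		return "start-to-close"
	default:
		return "none"
	}
}

func main() {
	const (
		startToClose     = 72 * time.Hour  // assumed value for this sketch
		heartbeatTimeout = 2 * time.Minute // from the text above
	)
	// Healthy worker: a day into the wait, last heartbeat 30 seconds ago.
	fmt.Println(timeoutFired(26*time.Hour, 30*time.Second, startToClose, heartbeatTimeout))
	// Crashed worker: no heartbeat for 5 minutes, detected long before 72 hours.
	fmt.Println(timeoutFired(26*time.Hour, 5*time.Minute, startToClose, heartbeatTimeout))
	// Approval never arrived: the overall budget is exhausted.
	fmt.Println(timeoutFired(73*time.Hour, 30*time.Second, startToClose, heartbeatTimeout))
}
```

The second case is the one heartbeating exists for: the failure is noticed within minutes rather than at the end of the multi-day window.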
Section 4: Performance and Scalability Considerations
Temporal is designed for high throughput, but understanding its mechanics is key to scaling your Sagas.
* Worker Scaling: Temporal Workers are stateless. To increase throughput, you simply run more instances of your worker process. They poll a specific Task Queue for work. You can scale your worker pool up or down based on the load on a given task queue.
* Task Queue Partitioning: For very high-load systems, you can partition your workers. For instance, you could have a high-priority-orders task queue and a standard-orders task queue, served by different pools of workers with different resource allocations. This isolates workloads and prevents a surge in standard orders from impacting high-priority ones.
* History Size Limits: Every action in a workflow (activity started, timer fired, signal received) is recorded in its history. Temporal limits the size of this history (by default a hard limit of 51,200 events or 50 MB, with warnings logged well before that point). A Saga with thousands of steps or one that runs in a loop could exceed this limit.
Solution: For Sagas that are extremely long or cyclical, use Continue-As-New. In the Go SDK, the workflow function returns the error created by workflow.NewContinueAsNewError. This completes the current workflow execution and immediately starts a new run with the same workflow ID, carrying over whatever state you pass as arguments. This effectively resets the history while maintaining the logical continuity of the Saga.
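The idea of carrying state across runs can be simulated without the SDK. In the sketch below, `runState`, `runOnce`, and `runToCompletion` are hypothetical names, and the per-run event budget stands in for whatever history threshold you choose. Each call to `runOnce` represents one workflow run that stops early and hands its state to the next run once the budget is used up.

```go
package main

import "fmt"

// runState is the state carried from one run to the next.
type runState struct {
	NextItem  int // index of the next item to process
	Processed int // total items processed across all runs
}

// runOnce processes items until the work is done or the per-run event
// budget is exhausted. It returns the updated state and whether another
// run is needed. eventBudget must be positive.
func runOnce(state runState, totalItems, eventBudget int) (runState, bool) {
	events := 0
	for state.NextItem < totalItems {
		if events >= eventBudget {
			return state, true // "continue as new" with the carried-over state
		}
		state.NextItem++
		state.Processed++
		events++
	}
	return state, false
}

// runToCompletion chains runs until the work is finished and reports
// the final state together with the number of runs it took.
func runToCompletion(totalItems, eventBudget int) (runState, int) {
	state, runs := runState{}, 0
	for {
		runs++
		next, again := runOnce(state, totalItems, eventBudget)
		state = next
		if !again {
			return state, runs
		}
	}
}

func main() {
	state, runs := runToCompletion(10, 4)
	fmt.Println(state.Processed, runs)
}
```

Ten items with a budget of four events per run take three runs, and no single run ever records more than four events. The carried state should stay small, since it becomes the input of the next run.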
Section 5: Versioning Sagas in Production
Your business logic will change. What happens when you need to add a new step to your order Saga, but thousands of old versions are still in-flight?
Temporal's determinism requirement means you cannot simply add a new activity call to your workflow code. A workflow that replays past that point would encounter a call that is not in its recorded history and fail with a non-determinism error.
Solution: Use Temporal's built-in workflow versioning API, workflow.GetVersion.
Let's say we want to add an ApplyDiscountActivity step after processing the payment.
const (
	ApplyDiscountChangeID = "apply-discount-feature"
	InitialVersion        = workflow.DefaultVersion
	ApplyDiscountVersion  = 1
)

func OrderWorkflow(ctx workflow.Context, ...) (string, error) {
	// ... existing code up to payment processing
	err = workflow.ExecuteActivity(ctx, ProcessPaymentActivity, ...).Get(ctx, &transactionID)
	if err != nil {
		return "", err
	}
	saga.AddCompensation(func(cctx workflow.Context) { ... })

	// VERSIONING BLOCK: Safely add the new step
	version := workflow.GetVersion(ctx, ApplyDiscountChangeID, InitialVersion, ApplyDiscountVersion)
	if version == ApplyDiscountVersion {
		// This code runs for executions that reach this point for the
		// first time on the new code. Executions that already passed this
		// point on the old code get InitialVersion on replay and skip it.
		var discountApplied bool
		err = workflow.ExecuteActivity(ctx, ApplyDiscountActivity, orderID).Get(ctx, &discountApplied)
		if err != nil {
			return "", err
		}
		// Note: A real implementation would also need a compensation for the discount.
	}

	// ... continue with Create Shipment activity, etc.
	return "Order completed", nil
}
workflow.GetVersion is a deterministic way to branch your code. When an execution reaches the call for the first time, it records ApplyDiscountVersion (the maximum supported version) in its history and runs the new block. When an execution that already passed this point on the old code replays, GetVersion finds no version marker in the history and returns InitialVersion (workflow.DefaultVersion), so the if block is skipped and determinism is preserved.
This mechanism allows you to safely evolve your complex Sagas over time without ever needing to take down your system or perform a complex data migration on your running workflows.
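The marker behaviour described above can be modelled in plain Go. This is a simplified, hypothetical model (the `history` type and `getVersion` method are invented for illustration and omit the minimum-supported-version check), intended only to show why old and new executions take different branches.

```go
package main

import "fmt"

// defaultVersion plays the role of workflow.DefaultVersion in this model.
const defaultVersion = -1

// history is a hypothetical stand-in for a workflow's recorded version markers.
type history struct {
	markers   map[string]int // change ID -> recorded version
	replaying bool           // true when re-executing code that already ran past this point
}

// getVersion models the branching rule:
//   - a recorded marker always wins,
//   - a replay without a marker means the code ran before the change existed,
//   - a first-time execution records and returns the newest version.
func (h *history) getVersion(changeID string, maxSupported int) int {
	if v, ok := h.markers[changeID]; ok {
		return v
	}
	if h.replaying {
		return defaultVersion
	}
	h.markers[changeID] = maxSupported
	return maxSupported
}

func main() {
	const changeID = "apply-discount-feature"

	// Execution that reaches the call for the first time on the new code.
	fresh := &history{markers: map[string]int{}}
	fmt.Println(fresh.getVersion(changeID, 1))

	// The same execution replaying later: the marker is found.
	fresh.replaying = true
	fmt.Println(fresh.getVersion(changeID, 1))

	// In-flight execution that passed this point before the change was deployed.
	old := &history{markers: map[string]int{}, replaying: true}
	fmt.Println(old.getVersion(changeID, 1))
}
```

The first two calls return the new version and the third returns the default version, so each execution keeps taking the branch it took originally.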
Conclusion: Sagas as First-Class Citizens
The Saga pattern is a powerful tool for maintaining consistency in a distributed world, but its traditional implementations are fraught with complexity. Building and maintaining a custom orchestration engine is a significant, undifferentiated effort that distracts from core business logic.
By leveraging a durable execution platform like Temporal, we elevate the Saga from a low-level, error-prone infrastructure task to a first-class citizen of our business logic. The ability to write orchestrations as simple, sequential code—while Temporal transparently handles state persistence, retries, timeouts, and failures—is a paradigm shift. Advanced features like heartbeating, ContinueAsNew, and GetVersion provide the tools necessary to move beyond simple examples and build Sagas that are truly scalable, resilient, and maintainable in a production environment. For senior engineers tasked with building reliable distributed systems, this approach isn't just an improvement; it's a fundamental change in how we solve the problem of consistency.