Saga Pattern with Temporal: A Production-Grade Implementation Guide
The Saga Pattern Reimagined: From Theory to Production with Temporal
In any non-trivial microservices architecture, the problem of maintaining data consistency across service boundaries is a principal challenge. The retirement of the two-phase commit (2PC) protocol in distributed systems, due to its tight coupling and performance bottlenecks, led to the rise of asynchronous, event-driven patterns. Among these, the Saga pattern stands out as the de facto standard for managing long-running, multi-step business transactions.
However, implementing a Saga from scratch is fraught with peril. Developers must manually manage a distributed state machine, build a durable Saga log, implement complex retry logic, and handle the intricate dance of compensating transactions. This boilerplate is not only error-prone but also distracts from the core business logic.
This is where a durable execution engine like Temporal fundamentally changes the game. Temporal abstracts away the complexities of distributed state management, allowing you to write Saga orchestration logic as straightforward, sequential code. This article is not an introduction to Sagas or Temporal. It is a deep, technical guide for senior engineers on how to implement a production-grade, orchestration-based Saga using Temporal, focusing on the advanced patterns and edge cases you will inevitably encounter.
We will dissect a complex e-commerce order processing workflow, providing complete Go code examples to illustrate:
defer.- Ensuring idempotency at both the activity and service level.
- Managing long-running activities with heartbeating.
- Handling non-compensatable failures with human-in-the-loop patterns.
Continue-As-New pattern.Why Temporal is a Natural Fit for Orchestrated Sagas
Sagas come in two flavors: Choreography and Orchestration. In Choreography, services react to events from other services, leading to a decentralized but often hard-to-trace and debug system. In Orchestration, a central coordinator (the orchestrator) explicitly tells services what to do.
Temporal is purpose-built for the orchestration model. A Temporal Workflow is a durable, stateful function whose execution is preserved across process and server failures. This makes it a perfect implementation of a Saga orchestrator.
defer statement or a try...catch...finally block in other languages becomes a robust mechanism for queuing and executing compensation logic.Scenario: A Production-Grade E-commerce Order Saga
Let's model a realistic order processing flow that interacts with multiple microservices. A customer placing an order triggers a workflow that must successfully complete a series of steps.
Forward Transaction (The "Happy Path")
PENDING status.CONFIRMED.Compensating Transactions
If any step from 2 through 5 fails, we must roll back the preceding actions to maintain a consistent state.
DispatchShipping: Compensate by calling UpdateOrderStatus back to PAYMENT_CONFIRMED_SHIPPING_FAILED.UpdateOrderStatus: Compensate by calling RefundPayment.ProcessPayment: Compensate by calling ReleaseInventory.ReserveInventory: Compensate by calling UpdateOrderStatus to FAILED.This sequence of compensations, executed in reverse order of the original operations, is the core of the Saga pattern.
Implementation Deep Dive: Building the Saga Workflow in Go
Let's translate this business process into a Temporal Workflow. We'll start with the project structure and then build out the activities and the orchestrating workflow.
Project Structure
/order-saga
|-- /activities
| |-- inventory_activity.go
| |-- order_activity.go
| |-- payment_activity.go
| `-- shipping_activity.go
|-- /shared
| `-- types.go
|-- /worker
| `-- main.go
|-- /starter
| `-- main.go
`-- workflow.go
Defining Activities: The Saga Steps
Activities are functions that execute the actual business logic, typically by making RPC calls to your microservices. It's crucial to define their options, such as timeouts and retry policies, correctly.
Here's an example of the PaymentActivity interface and implementation. Note how it simulates a call to an external payment gateway.
// activities/payment_activity.go
package activities
import (
"context"
"fmt"
"time"
"go.temporal.io/sdk/activity"
)
// Simulating a call to a payment microservice
func ProcessPayment(ctx context.Context, orderID string, amount float64, idempotencyKey string) (string, error) {
logger := activity.GetLogger(ctx)
logger.Info("Processing payment", "OrderID", orderID, "Amount", amount)
// In a real-world scenario, this would be an RPC call to the payment service.
// The idempotencyKey would be passed in the request header/body.
// Simulate a long-running API call that might time out
for i := 0; i < 5; i++ {
activity.RecordHeartbeat(ctx, i)
time.Sleep(1 * time.Second)
}
// Simulate a potential failure
if amount > 1000 {
return "", fmt.Errorf("payment gateway declined for amount > 1000")
}
transactionID := fmt.Sprintf("txn-%s", idempotencyKey)
logger.Info("Payment processed successfully", "TransactionID", transactionID)
return transactionID, nil
}
func RefundPayment(ctx context.Context, transactionID string, idempotencyKey string) error {
logger := activity.GetLogger(ctx)
logger.Info("Refunding payment", "TransactionID", transactionID)
// RPC call to the payment service to issue a refund.
// The idempotencyKey ensures we don't double-refund.
logger.Info("Payment refunded successfully")
return nil
}
The Orchestrator: Implementing the Workflow with Compensation
The core of our Saga is the workflow function. We'll use Go's defer statement to elegantly manage the compensation stack. A deferred function call is pushed onto a stack, and the calls are executed in last-in-first-out (LIFO) order when the surrounding function returns. This perfectly mirrors the reverse-order execution required for Saga compensations.
// workflow.go
package main
import (
"time"
"your_project/activities"
"your_project/shared"
"go.temporal.io/sdk/workflow"
)
func OrderProcessingWorkflow(ctx workflow.Context, orderDetails shared.OrderDetails) (string, error) {
// Configure activity options
ao := workflow.ActivityOptions{
StartToCloseTimeout: 10 * time.Second,
// HeartbeatTimeout is crucial for long-running activities
HeartbeatTimeout: 2 * time.Second,
RetryPolicy: &temporal.RetryPolicy{
InitialInterval: time.Second,
BackoffCoefficient: 2.0,
MaximumInterval: 100 * time.Second,
MaximumAttempts: 5, // Or 0 for unlimited
},
}
ctx = workflow.WithActivityOptions(ctx, ao)
logger := workflow.GetLogger(ctx)
logger.Info("Order processing workflow started", "OrderID", orderDetails.OrderID)
// 1. Create Order in DB
var orderID string
err := workflow.ExecuteActivity(ctx, activities.CreateOrder, orderDetails).Get(ctx, &orderID)
if err != nil {
logger.Error("Failed to create order", "Error", err)
return "", err
}
// Compensation stack
var compensations []func()
defer func() {
// If the workflow failed (non-nil err), execute compensations
if err != nil {
logger.Warn("Workflow failed, starting compensation logic", "OrderID", orderID)
for _, compensation := range compensations {
// Execute compensation activities in a new, detached context
// This ensures they run even if the main workflow context is cancelled
disconnectedCtx, _ := workflow.NewDisconnectedContext(ctx)
compensationCtx, _ := workflow.NewDisconnectedContext(ctx)
err := workflow.ExecuteActivity(compensationCtx, compensation).Get(compensationCtx, nil)
if err != nil {
logger.Error("Compensation activity failed", "Error", err)
// Handle non-compensatable errors here (e.g., signal for manual intervention)
}
}
}
}()
// 2. Reserve Inventory
err = workflow.ExecuteActivity(ctx, activities.ReserveInventory, orderID, orderDetails.Items).Get(ctx, nil)
if err != nil {
logger.Error("Failed to reserve inventory", "Error", err)
// No compensation needed yet, as only the initial order was created.
// We might update the order status to FAILED here.
workflow.ExecuteActivity(ctx, activities.UpdateOrderStatus, orderID, "FAILED_INVENTORY_RESERVATION").Get(ctx, nil)
return "", err
}
// Add compensation to the stack
compensations = append(compensations, func() {
workflow.ExecuteActivity(ctx, activities.ReleaseInventory, orderID, orderDetails.Items).Get(ctx, nil)
})
// 3. Process Payment
var transactionID string
idempotencyKey := workflow.GetInfo(ctx).WorkflowExecution.ID
err = workflow.ExecuteActivity(ctx, activities.ProcessPayment, orderID, orderDetails.TotalAmount, idempotencyKey).Get(ctx, &transactionID)
if err != nil {
logger.Error("Failed to process payment", "Error", err)
return "", err // Defer will trigger compensations
}
// Add compensation to the stack
compensations = append(compensations, func() {
refundIdempotencyKey := fmt.Sprintf("refund-%s", idempotencyKey)
workflow.ExecuteActivity(ctx, activities.RefundPayment, transactionID, refundIdempotencyKey).Get(ctx, nil)
})
// 4. Update Order Status
err = workflow.ExecuteActivity(ctx, activities.UpdateOrderStatus, orderID, "CONFIRMED").Get(ctx, nil)
if err != nil {
logger.Error("Failed to update order status", "Error", err)
return "", err // Defer will trigger compensations
}
// No direct compensation for status update, as refunding is the business compensation.
// 5. Dispatch Shipping
err = workflow.ExecuteActivity(ctx, activities.DispatchShipping, orderID).Get(ctx, nil)
if err != nil {
logger.Error("Failed to dispatch shipping", "Error", err)
// This is a partial failure. The order is paid. We don't want to refund yet.
// We can update status and let another process handle it.
workflow.ExecuteActivity(ctx, activities.UpdateOrderStatus, orderID, "PAYMENT_CONFIRMED_SHIPPING_FAILED").Get(ctx, nil)
// We don't return the error here, so compensations are NOT triggered.
// The workflow succeeds from a technical perspective.
return "Order completed with shipping failure. Manual intervention required.", nil
}
logger.Info("Workflow completed successfully", "OrderID", orderID)
return "Order processed successfully", nil
}
Key Patterns in this Implementation:
ActivityOptions once and apply them to the context. This is good for consistency, but you can override them for specific activity calls if needed.HeartbeatTimeout is set to 2 seconds. Our ProcessPayment activity calls activity.RecordHeartbeat every second. If the worker process crashes during this activity, Temporal will know within 2 seconds that it's no longer making progress and will reschedule it on another worker, resuming from the last known state.defer: The defer block contains the core compensation logic. It only executes if the function returns with a non-nil err. Inside the defer, we iterate through our compensations slice, which was built in FIFO order. The defer executes them in LIFO order, which is exactly what we need.workflow.NewDisconnectedContext for compensation activities. This is critical. If the workflow times out or is cancelled, the original ctx becomes invalid. A disconnected context ensures that our cleanup logic (compensations) will still run to completion, regardless of the parent workflow's state.Handling Advanced Edge Cases and Failure Modes
A simple happy-path-with-rollback Saga is table stakes. Production systems demand resilience against more complex failures.
Edge Case 1: Idempotency is Non-Negotiable
Temporal's at-least-once execution guarantee for activities means an activity might run more than once (e.g., a worker crashes after completing the work but before reporting back to the server). If your ProcessPayment activity is not idempotent, you will double-charge customers.
The Solution: The workflow must generate a unique key for each non-idempotent operation. A fantastic candidate for this is the WorkflowExecution.ID, which is guaranteed to be unique for each workflow run.
In our workflow code:
// Get a unique ID for this specific workflow execution
idempotencyKey := workflow.GetInfo(ctx).WorkflowExecution.ID
// Pass it to the activity
err = workflow.ExecuteActivity(ctx, activities.ProcessPayment, orderID, orderDetails.TotalAmount, idempotencyKey).Get(ctx, &transactionID)
Your downstream microservice (the Payment Service) MUST be designed to handle this key. A common pattern is to have a database table (idempotency_key, transaction_result).
Pseudo-code for the Payment Service endpoint:
func (s *PaymentService) ChargeCard(request ChargeRequest) (*ChargeResponse, error) {
// 1. Check if we've seen this idempotency key before
dbTx, err := s.db.Begin()
existingResult, err := dbTx.QueryRow("SELECT result FROM processed_transactions WHERE key = ? FOR UPDATE", request.IdempotencyKey)
if err == nil && existingResult != nil {
// We've already processed this. Return the stored result.
return existingResult, nil
}
if err != sql.ErrNoRows {
return nil, err // Some other DB error
}
// 2. Not seen before. Process the payment.
gatewayResponse, err := s.paymentGateway.Charge(request.CardDetails, request.Amount)
if err != nil {
// Store the failure result so we don't retry a failing card indefinitely
s.storeResult(dbTx, request.IdempotencyKey, "FAILED")
dbTx.Commit()
return nil, err
}
// 3. Store the successful result before returning
s.storeResult(dbTx, request.IdempotencyKey, gatewayResponse.TransactionID)
dbTx.Commit()
return &ChargeResponse{TransactionID: gatewayResponse.TransactionID}, nil
}
This pattern ensures that even if the Temporal activity retries 10 times, the customer is only charged once.
Edge Case 2: Non-Compensatable Failures (The Human-in-the-loop Pattern)
What happens if your RefundPayment activity fails? Perhaps the payment gateway's refund API is down for an extended period. The workflow is now stuck in a failed state, unable to complete its compensation. This is a non-compensatable error.
The Solution: The workflow should stop trying to compensate automatically and signal for human intervention.
We can modify our defer block to handle this:
// ... inside the defer block ...
for _, compensation := range compensations {
compensationCtx, _ := workflow.NewDisconnectedContext(ctx)
// Use a specific retry policy for compensations. Maybe fewer retries.
compensationAo := workflow.ActivityOptions{StartToCloseTimeout: 30 * time.Second}
compensationCtx = workflow.WithActivityOptions(compensationCtx, compensationAo)
err := workflow.ExecuteActivity(compensationCtx, compensation).Get(compensationCtx, nil)
if err != nil {
logger.Error("CRITICAL: Compensation activity failed. Manual intervention required.", "Error", err)
// The workflow will now fail and remain in a failed state in the Temporal UI.
// An external monitoring system should alert on workflows failing in this specific state.
// Alternatively, we can use Signals to attempt a fix.
}
}
An operations team can then be alerted. They can inspect the failed workflow in the Temporal UI, diagnose the external issue (e.g., the payment gateway outage), and once it's resolved, they could potentially use a Temporal Signal to instruct the workflow to retry the failed compensation step. This requires more advanced workflow logic to handle such signals.
Code Example: Handling a RetryCompensation Signal
// Add this to your workflow logic
// A channel to receive the signal
retrySignalChan := workflow.GetSignalChannel(ctx, "RetryCompensationSignal")
// ... inside the compensation defer block's error handling ...
if err != nil {
logger.Error("Compensation failed, awaiting manual signal to retry...", "Error", err)
// Block until a signal is received
retrySignalChan.Receive(ctx, nil)
logger.Info("Retry signal received, re-attempting compensation.")
// Re-run the failed activity
workflow.ExecuteActivity(compensationCtx, compensation).Get(compensationCtx, nil)
}
An operator could then send the signal using the tctl CLI:
tctl workflow signal --workflow_id <your_workflow_id> --name RetryCompensationSignal
Edge Case 3: Partial Success is a Business Decision
In our example, if DispatchShipping fails, the order is already paid for. The business likely does not want to automatically refund the customer. They'd prefer to retry shipping later or contact the customer.
Our workflow code correctly handles this:
// ...
err = workflow.ExecuteActivity(ctx, activities.DispatchShipping, orderID).Get(ctx, nil)
if err != nil {
logger.Error("Failed to dispatch shipping", "Error", err)
// Update status to a specific error state
workflow.ExecuteActivity(ctx, activities.UpdateOrderStatus, orderID, "PAYMENT_CONFIRMED_SHIPPING_FAILED").Get(ctx, nil)
// DO NOT return the error. This prevents the compensation stack from running.
return "Order completed with shipping failure. Manual intervention required.", nil
}
By not returning an error object, we tell Temporal the workflow has technically succeeded. The compensation defer is not triggered. The return value indicates the partial failure, and another system (e.g., a dashboard for the logistics team) can now query for workflows that ended in this state and take appropriate action.
Performance and Scalability Considerations
Workflow History Size and Continue-As-New
Temporal records every event in a workflow's execution (workflow start, activity scheduled, activity completed, timer fired, etc.) in a Workflow History. This history is replayed on a worker to recover the state of a workflow. However, this history has size and length limits (typically 50MB and 50,000 events).
A Saga that runs for a very long time (e.g., a monthly subscription workflow that runs for years) or has a huge number of steps can exceed these limits.
The Solution: The Continue-As-New pattern. This allows a workflow to complete its execution and start a new, fresh execution of itself in a single atomic operation, carrying over state as needed. This effectively resets the history.
Example: A Monthly Subscription Saga
func SubscriptionWorkflow(ctx workflow.Context, subDetails Subscription) error {
// Run this month's billing logic (a mini-Saga)
err := runBillingCycle(ctx, subDetails)
if err != nil {
// Handle billing failure, maybe signal for dunning process
return err
}
// Wait for the next billing cycle
nextBillingDate := workflow.Now(ctx).Add(30 * 24 * time.Hour)
await_ctx, _ := workflow.NewDisconnectedContext(ctx)
err = workflow.Sleep(await_ctx, nextBillingDate.Sub(workflow.Now(ctx)))
if err != nil {
return err
}
// Instead of looping, continue as a new workflow
return workflow.NewContinueAsNewError(ctx, SubscriptionWorkflow, subDetails)
}
This workflow runs the billing logic, sleeps for a month, and then, instead of looping (which would grow the history), it starts a new instance of itself. The history size never grows beyond one month's worth of events, allowing the subscription to run indefinitely.
Conclusion: Beyond a Pattern, a Platform
The Saga pattern is a powerful concept for maintaining consistency in a distributed architecture. However, its manual implementation is a significant engineering burden. By leveraging a durable execution platform like Temporal, you offload the most complex and error-prone aspects of Saga implementation—state management, retries, timeouts, and logging—to the platform itself.
This allows you to write your Saga orchestration as clear, sequential, and highly testable code. You can focus on the core business logic and the nuanced handling of edge cases, such as idempotency and non-compensatable failures, which is where the real value lies. By adopting these production-grade patterns, you can build truly resilient, observable, and scalable long-running workflows that form the backbone of your microservices architecture.