Resilient Sagas with Temporal: A Production-Ready Implementation
The Fragility of State in Traditional Saga Orchestration
As senior engineers building distributed systems, we're all intimately familiar with the saga pattern as a mechanism for maintaining data consistency across microservices without resorting to distributed locks or two-phase commits. The core idea—a sequence of local transactions where each step has a compensating transaction for rollback—is straightforward. The implementation, however, is notoriously fraught with peril.
Traditional saga orchestrators, whether custom-built or using a message broker like RabbitMQ or Kafka, force you to manage state explicitly. You end up building a state machine where the saga's progress is persisted in a database. This approach introduces significant accidental complexity:
Temporal fundamentally changes this paradigm. It's a durable execution engine that turns multi-step, asynchronous, and fallible business processes into straightforward, testable code. The workflow's state, including local variables, call stack, and scheduled timers, is durably persisted by the Temporal cluster. A worker process can die and restart mid-execution, and Temporal will seamlessly resume the workflow from its exact last state. This lets us implement a saga not as a state machine, but as a single piece of code that looks deceptively sequential.
This article dives deep into a production-grade implementation of the saga pattern using Temporal and Go. We will not cover the basics of Temporal. We assume you understand workflows, activities, workers, and task queues. Instead, we'll focus on the nuanced implementation details, edge cases, and performance patterns required for mission-critical systems.
Scenario: A Multi-Step E-Commerce Order Fulfillment Saga
Let's model a complex e-commerce order process. A successful order requires coordinating four distinct services:
Each of these steps can fail. If any step after the first one fails, we must execute compensating actions to revert the preceding successful steps.
- If payment fails, we must release the inventory.
- If the loyalty service is down, we must refund the payment and release the inventory.
- If shipping fails, we must refund payment, release inventory, and revert the loyalty points.
Our goal is to implement this complex, stateful, and error-prone logic in a way that is readable, resilient, and highly testable.
The Core Implementation: `workflow.Saga`
Temporal's Go SDK provides a built-in Saga utility that simplifies compensation management. The core principle is to co-locate the registration of a compensation action with the successful execution of the forward action. Let's build our OrderFulfillmentWorkflow.
1. Defining the Workflow and Activities
First, we define the inputs and the activities that represent our service interactions. Notice how each forward action has a corresponding compensation action.
// file: shared/activities.go
package shared
import (
"context"
"errors"
"time"
)
// Dummy external service structs
type InventoryService struct{}
type PaymentService struct{}
type LoyaltyService struct{}
type ShippingService struct{}
// Activity inputs/outputs
type OrderDetails struct {
UserID string
ItemID string
OrderID string
Amount int
}
// --- Inventory Activities ---
func (s *InventoryService) ReserveInventory(ctx context.Context, details OrderDetails) (string, error) {
// Production: Make idempotent call to inventory service
// For this example, we'll simulate potential failures
if details.ItemID == "FAIL_INVENTORY" {
return "", errors.New("inventory service unavailable")
}
reservationID := "INV-" + details.OrderID
return reservationID, nil
}
func (s *InventoryService) ReleaseInventory(ctx context.Context, reservationID string, orderID string) error {
// Production: Idempotent call to release inventory
// This should be designed to succeed even if called multiple times.
return nil
}
// --- Payment Activities ---
func (s *PaymentService) ProcessPayment(ctx context.Context, details OrderDetails) (string, error) {
if details.Amount > 1000 {
return "", errors.New("payment declined: insufficient funds")
}
transactionID := "PAY-" + details.OrderID
return transactionID, nil
}
func (s *PaymentService) RefundPayment(ctx context.Context, transactionID string, orderID string) error {
return nil
}
// --- Loyalty Activities ---
func (s *LoyaltyService) UpdateLoyaltyPoints(ctx context.Context, details OrderDetails) (string, error) {
if details.UserID == "FAIL_LOYALTY" {
// Simulate a transient failure
return "", errors.New("loyalty service timeout")
}
loyaltyUpdateID := "LOY-" + details.OrderID
return loyaltyUpdateID, nil
}
func (s *LoyaltyService) RevertLoyaltyPoints(ctx context.Context, loyaltyUpdateID string, orderID string) error {
return nil
}
// --- Shipping Activities ---
func (s *ShippingService) DispatchShipping(ctx context.Context, details OrderDetails) (string, error) {
if details.ItemID == "FAIL_SHIPPING" {
return "", errors.New("invalid shipping address")
}
trackingID := "SHIP-" + details.OrderID
return trackingID, nil
}
func (s *ShippingService) CancelShipping(ctx context.Context, trackingID string, orderID string) error {
return nil
}
2. The Saga Workflow Implementation
Now for the core workflow logic. We'll use a defer block to ensure our compensation logic runs if any error occurs during the workflow execution. This is a powerful and clean pattern for ensuring cleanup.
// file: workflow/workflow.go
package workflow
import (
"time"
"go.temporal.io/sdk/workflow"
"your-project/shared"
)
func OrderFulfillmentWorkflow(ctx workflow.Context, orderDetails shared.OrderDetails) (string, error) {
// Set a short timeout for activities for this example.
// In production, this would be longer and configurable.
ao := workflow.ActivityOptions{
StartToCloseTimeout: 10 * time.Second,
}
ctx = workflow.WithActivityOptions(ctx, ao)
// Instantiate our activity clients
var activities shared.Activities
// Create a new Saga instance.
// The default is parallel compensation. We'll set it to sequential for predictable rollback.
saga := workflow.NewSaga(&workflow.SagaOptions{ParallelCompensation: false})
// The core of the resilience pattern: a deferred function that compensates on any error.
// The 'err' variable is a named return value, crucial for this pattern to work correctly.
var err error
defer func() {
if err != nil {
// An error occurred in the main workflow logic.
// Execute all registered compensations.
wfErr := saga.Compensate(ctx)
if wfErr != nil {
workflow.GetLogger(ctx).Error("Saga compensation failed.", "Error", wfErr)
// You might want to emit a critical metric here.
}
}
}()
// Step 1: Reserve Inventory
var reservationID string
err = workflow.ExecuteActivity(ctx, activities.ReserveInventory, orderDetails).Get(ctx, &reservationID)
if err != nil {
return "", err
}
// If successful, add the compensation to the saga stack.
saga.AddCompensation(func(ctx workflow.Context) {
_ = workflow.ExecuteActivity(ctx, activities.ReleaseInventory, reservationID, orderDetails.OrderID).Get(ctx, nil)
})
// Step 2: Process Payment
var transactionID string
err = workflow.ExecuteActivity(ctx, activities.ProcessPayment, orderDetails).Get(ctx, &transactionID)
if err != nil {
return "", err
}
saga.AddCompensation(func(ctx workflow.Context) {
_ = workflow.ExecuteActivity(ctx, activities.RefundPayment, transactionID, orderDetails.OrderID).Get(ctx, nil)
})
// Step 3: Update Loyalty Points
var loyaltyUpdateID string
err = workflow.ExecuteActivity(ctx, activities.UpdateLoyaltyPoints, orderDetails).Get(ctx, &loyaltyUpdateID)
if err != nil {
return "", err
}
saga.AddCompensation(func(ctx workflow.Context) {
_ = workflow.ExecuteActivity(ctx, activities.RevertLoyaltyPoints, loyaltyUpdateID, orderDetails.OrderID).Get(ctx, nil)
})
// Step 4: Dispatch Shipping
var trackingID string
err = workflow.ExecuteActivity(ctx, activities.DispatchShipping, orderDetails).Get(ctx, &trackingID)
if err != nil {
return "", err
}
// No compensation for shipping, as it's the final step in this model.
// If it fails, the previous steps will be compensated.
return "Order completed successfully with tracking ID: " + trackingID, nil
}
Key Implementation Details:
* defer func() { ... }(): This is the heart of the pattern. The defer statement schedules a function call to be run immediately before the surrounding function returns. By checking if err != nil, we ensure saga.Compensate(ctx) is only called on a failure path.
* Named Return err: The err variable must be a named return value for the defer block to inspect its final value. The function signature is effectively ... (result string, err error). When an activity fails and we return "", err, the defer block executes and sees the non-nil err.
saga.AddCompensation(...): This is called immediately after* a forward step succeeds. It registers a function (a closure in this case) to be executed if saga.Compensate() is called. The compensations are added to a stack, so they will be executed in Last-In, First-Out (LIFO) order, which is the correct rollback sequence.
* Ignoring Compensation Errors: Inside the compensation functions, we deliberately ignore the error from Get(ctx, nil). Why? We'll cover this in the advanced edge cases section. For now, the key is that a failing compensation should not stop other compensations from running.
Advanced Edge Cases and Production Patterns
The simple implementation above is powerful, but production systems present more complex challenges.
1. Handling Compensation Failures
What if RefundPayment itself fails? A traditional orchestrator might get stuck in a catastrophic state. Temporal's resilience shines here.
Compensations are just closures that execute activities. Those activities are subject to the same retry policies as any other activity. If a RefundPayment activity fails due to a transient network issue, Temporal will automatically retry it according to its RetryPolicy.
Let's define a robust retry policy for our critical financial activities:
// In workflow.go, inside OrderFulfillmentWorkflow
// Retry policy for critical, idempotent operations like payment.
// Exponential backoff, max 5 attempts, non-retriable on specific business errors.
retryPolicy := &temporal.RetryPolicy{
InitialInterval: time.Second,
BackoffCoefficient: 2.0,
MaximumInterval: 100 * time.Second,
MaximumAttempts: 5,
NonRetryableErrorTypes: []string{"PaymentDeclinedError"}, // Custom error type
}
paymentActivityOptions := workflow.ActivityOptions{
StartToCloseTimeout: 30 * time.Second,
RetryPolicy: retryPolicy,
}
paymentCtx := workflow.WithActivityOptions(ctx, paymentActivityOptions)
// ... later in the workflow ...
// Step 2: Process Payment
var transactionID string
err = workflow.ExecuteActivity(paymentCtx, activities.ProcessPayment, orderDetails).Get(paymentCtx, &transactionID)
// ...
saga.AddCompensation(func(ctx workflow.Context) {
// Use the same robust context for the compensation!
_ = workflow.ExecuteActivity(paymentCtx, activities.RefundPayment, transactionID, orderDetails.OrderID).Get(ctx, nil)
})
By applying the same robust ActivityOptions to the compensation activity, we ensure that a refund will be retried just as vigorously as the initial charge. This is a massive simplification over manual retry loops and state management.
But what if it still fails after all retries? The saga.Compensate() method itself will return an error. Our defer block logs this critical failure. This is an exceptional state that requires manual intervention. The workflow will be marked as Failed, and the full history will be available in the Temporal UI for debugging. You should have monitoring and alerting set up for workflows that fail during compensation.
2. Parallel Execution within a Saga
Suppose reserving inventory and updating loyalty points can happen concurrently to speed up the process. We can use workflow.Go to achieve this, but we must be careful about how we register compensations.
// ... inside workflow ...
// Step 1 and 3 in parallel
var reservationID string
var loyaltyUpdateID string
errChan := workflow.NewChannel(ctx)
// Goroutine for inventory
workflow.Go(ctx, func(ctx workflow.Context) {
var invErr error
invErr = workflow.ExecuteActivity(ctx, activities.ReserveInventory, orderDetails).Get(ctx, &reservationID)
if invErr == nil {
saga.AddCompensation(func(ctx workflow.Context) {
_ = workflow.ExecuteActivity(ctx, activities.ReleaseInventory, reservationID, orderDetails.OrderID).Get(ctx, nil)
})
}
errChan.Send(ctx, invErr)
})
// Goroutine for loyalty
workflow.Go(ctx, func(ctx workflow.Context) {
var loyaltyErr error
loyaltyErr = workflow.ExecuteActivity(ctx, activities.UpdateLoyaltyPoints, orderDetails).Get(ctx, &loyaltyUpdateID)
if loyaltyErr == nil {
saga.AddCompensation(func(ctx workflow.Context) {
_ = workflow.ExecuteActivity(ctx, activities.RevertLoyaltyPoints, loyaltyUpdateID, orderDetails.OrderID).Get(ctx, nil)
})
}
errChan.Send(ctx, loyaltyErr)
})
// Wait for both to complete and check for errors
for i := 0; i < 2; i++ {
var activityErr error
errChan.Receive(ctx, &activityErr)
if activityErr != nil {
err = activityErr
}
}
if err != nil {
return "", err // This will trigger the deferred compensation
}
// Now proceed with Step 2 (Payment), which depends on the previous steps
// ...
Critical Insight: The saga.AddCompensation call is made inside the goroutine, conditional on that specific activity's success. This ensures that if the ReserveInventory activity succeeds but UpdateLoyaltyPoints fails, only the compensation for ReleaseInventory is registered before the workflow fails and triggers the saga.Compensate() call.
3. Idempotency in Activities and Compensations
Temporal guarantees at-least-once execution for activities. This means your activity logic must be idempotent. If an activity worker processes a task but crashes before it can acknowledge completion, Temporal will reschedule the task on another worker. Your system must be able to handle receiving the same request twice.
Common idempotency patterns include:
* Using a Transaction ID: Pass a unique ID (like workflow.GetInfo(ctx).RunID + an activity sequence number) with your request. The downstream service can then use this ID to de-duplicate requests.
* Pessimistic Locking: The service can lock the relevant record (e.g., SELECT ... FOR UPDATE on the order row) at the start of its transaction.
* Conditional Updates: Use database features to perform conditional updates (e.g., UPDATE orders SET status = 'PAID' WHERE order_id = ? AND status = 'PENDING').
This is even more critical for compensation activities. A RefundPayment call must be safe to execute multiple times. The payment service should check if the transaction ID has already been refunded and, if so, return a success response without processing it again.
4. Long-Running Sagas and `ContinueAsNew`
A saga might need to wait for an external event that could take days, such as waiting for a third-party shipment confirmation. A single, long-running workflow history can become very large, hitting Temporal's history size limits.
The workflow.ContinueAsNew feature is the solution. It allows a workflow to complete its execution and start a new one with the same Workflow ID, effectively resetting the history size while carrying over the state.
// A saga that waits for payment confirmation for up to 3 days
func AwaitingPaymentWorkflow(ctx workflow.Context, orderDetails shared.OrderDetails) (string, error) {
// ... initial saga steps ...
// Wait for a signal (e.g., from a webhook) for up to 72 hours
signalChan := workflow.GetSignalChannel(ctx, "payment-confirmation-signal")
timer := workflow.NewTimer(ctx, 72 * time.Hour)
var signalData PaymentConfirmation
selector := workflow.NewSelector(ctx)
selector.AddReceive(signalChan, func(c workflow.Channel, more bool) {
c.Receive(ctx, &signalData)
})
selector.AddFuture(timer, func(f workflow.Future) {
// Timeout occurred
})
selector.Select(ctx) // Blocks until signal or timer
if timer.IsReady() {
// Timeout: fail the workflow, which will trigger compensation
return "", errors.New("payment not confirmed within 72 hours")
}
// Check if history is getting large
if workflow.GetInfo(ctx).GetHistoryLength() > 20000 { // Threshold is configurable
// Preserve state and continue as new
return "", workflow.NewContinueAsNewError(ctx, AwaitingPaymentWorkflow, orderDetails)
}
// ... continue with the rest of the saga ...
}
Testing the Saga
One of the most significant advantages of Temporal is the testability of complex logic. The testsuite package allows for complete in-memory testing of your workflow, including failure scenarios.
// file: workflow/workflow_test.go
package workflow
import (
"errors"
"testing"
"github.com/stretchr/testify/mock"
"github.com/stretchr/testify/require"
"go.temporal.io/sdk/testsuite"
"your-project/shared"
)
type testActivities struct {
mock.Mock
}
// Implement mock activities that match the real signatures
func (a *testActivities) ReserveInventory(ctx context.Context, details shared.OrderDetails) (string, error) {
args := a.Called(details)
return args.String(0), args.Error(1)
}
// ... mock implementations for all other activities ...
func (a *testActivities) RefundPayment(ctx context.Context, transactionID string, orderID string) error {
args := a.Called(transactionID, orderID)
return args.Error(0)
}
func TestOrderFulfillment_PaymentFails_CompensationCalled(t *testing.T) {
suite := testsuite.WorkflowTestSuite{}
env := suite.NewTestWorkflowEnvironment()
activities := &testActivities{}
env.RegisterActivity(activities)
orderDetails := shared.OrderDetails{ /* ... */ }
// === Mock Setup ===
// 1. Inventory succeeds
activities.On("ReserveInventory", orderDetails).Return("INV-123", nil).Once()
// 2. Payment fails
activities.On("ProcessPayment", orderDetails).Return("", errors.New("payment declined")).Once()
// EXPECT COMPENSATIONS
// 3. Refund should be called because payment was attempted (even if failed, we assume a charge could be pending)
activities.On("RefundPayment", "PAY-123", orderDetails.OrderID).Return(nil).Once()
// 4. Release inventory should be called
activities.On("ReleaseInventory", "INV-123", orderDetails.OrderID).Return(nil).Once()
// === Execute Workflow ===
env.ExecuteWorkflow(OrderFulfillmentWorkflow, orderDetails)
// === Assertions ===
require.True(t, env.IsWorkflowCompleted())
require.Error(t, env.GetWorkflowError())
// Verify that all expected mock calls were made
activities.AssertExpectations(t)
}
This test deterministically verifies that when the ProcessPayment activity fails, the correct compensation activities (ReleaseInventory and RefundPayment) are called in the correct LIFO order. You can write similar tests for every conceivable failure path without any external dependencies, a feat that is nearly impossible with traditional message-based orchestrators.
Conclusion: Beyond State Machines
Implementing sagas with Temporal is a paradigm shift. You are no longer building a distributed state machine; you are writing business logic. The workflow.Saga pattern, combined with defer, provides a robust and readable way to manage complex compensation chains. The underlying Temporal platform handles the truly difficult parts of distributed systems: state persistence, retries, timeouts, and recovering from process failures.
By embracing this model, you trade a small amount of platform dependency for a massive reduction in application-level complexity. Your code becomes a direct expression of the business process, is resilient by default, and is fully testable. For senior engineers tasked with building reliable, mission-critical systems, this is not just a convenience—it's a strategic advantage.