Implementing the Saga Pattern with Temporal for Resilient Microservices
The Peril of Distributed Transactions and the Promise of Sagas
In a microservices architecture, maintaining data consistency across service boundaries is a constant source of complexity. The two-phase commit (2PC) protocol, the traditional tool for atomic commits across multiple resources, is brittle and scales poorly in this setting because it requires synchronous coordination and holds resource locks until every participant has voted. The Saga pattern is a widely used alternative for managing long-running transactions.
A Saga is a sequence of local transactions where each transaction updates data within a single service and publishes a message or event to trigger the next transaction in the chain. If a local transaction fails, a series of compensating transactions are executed to semantically undo the work of the preceding successful transactions.
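To make the definition concrete, here is a minimal in-process sketch of the idea in plain Go. The Step type and the RunSaga function are illustrative names, not part of any library, and the sketch deliberately has no persistence: if the process dies halfway through, the record of which steps completed is lost.

```go
package main

import (
	"errors"
	"fmt"
)

// Step pairs a local transaction with the compensation that semantically undoes it.
// The type is hypothetical and exists only for this illustration.
type Step struct {
	Name       string
	Action     func() error
	Compensate func()
}

// RunSaga executes steps in order. If a step fails, it runs the compensations
// of the steps that already succeeded, in reverse order, and returns the error.
func RunSaga(steps []Step) error {
	var done []Step
	for _, s := range steps {
		if err := s.Action(); err != nil {
			for i := len(done) - 1; i >= 0; i-- {
				done[i].Compensate()
			}
			return fmt.Errorf("step %q failed: %w", s.Name, err)
		}
		done = append(done, s)
	}
	return nil
}

// demo runs a three-step saga whose last step fails and returns the event log.
func demo() ([]string, error) {
	var log []string
	record := func(msg string) func() { return func() { log = append(log, msg) } }
	ok := func(msg string) func() error {
		return func() error { log = append(log, msg); return nil }
	}
	steps := []Step{
		{"reserve", ok("reserve inventory"), record("release inventory")},
		{"charge", ok("charge payment"), record("refund payment")},
		{"ship", func() error { return errors.New("carrier unavailable") }, record("cancel shipment")},
	}
	err := RunSaga(steps)
	return log, err
}

func main() {
	log, err := demo()
	fmt.Println(log) // prints [reserve inventory charge payment refund payment release inventory]
	fmt.Println(err) // prints step "ship" failed: carrier unavailable
}
```

Note that the failed step's own compensation is not run; only steps that completed are undone, newest first.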
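However, implementing a Saga from scratch is a significant engineering challenge. You become responsible for persisting the Saga's state, retrying failed steps, handling errors, and triggering compensations reliably, even when a process crashes midway.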
This is where Temporal.io fundamentally changes the game. Temporal is not a message queue or a simple task scheduler; it's a durable execution system. It allows you to write complex, long-running, and stateful logic as a single piece of code—a Workflow—while the Temporal cluster handles the persistence, retries, and state management. The state of your Saga lives within the durable context of a Temporal Workflow, not in a fragile, custom-built state machine in your database.
This article will demonstrate how to implement a robust Saga pattern in Go using Temporal, focusing on production-grade patterns that address idempotency, error handling, and long-term maintainability.
Our Scenario: An E-commerce Order Processing Saga
We will model a classic e-commerce order workflow, which is an ideal candidate for a Saga due to its multi-service, long-running nature:
1. An order is created in the OrderService.
2. InventoryService must reserve the items.
3. PaymentService must charge the customer.
4. ShippingService must arrange for delivery.
If the payment fails, we must compensate by un-reserving the inventory. If inventory reservation fails, we simply fail the workflow. This is a classic orchestration-based Saga.
Section 1: Workflow and Activity Definitions
First, let's define the structure of our workflow and the activities it will orchestrate. A Temporal Workflow is the orchestrator, and each step in the Saga is a Temporal Activity. Activities are where you interact with the outside world (databases, APIs, etc.).
Project Structure
/temporal-saga-project
├── activities
│   ├── inventory_activity.go
│   ├── payment_activity.go
│   └── shipping_activity.go
├── shared
│   └── types.go
├── workflow
│   └── order_workflow.go
├── worker
│   └── main.go
└── starter
    └── main.go
Shared Data Structures
We'll start with the data structures that will be passed between our workflow and activities.
shared/types.go
package shared

// OrderDetails contains all the information for an order.
type OrderDetails struct {
	OrderID    string
	UserID     string
	Items      []string
	TotalPrice float64
}

// PaymentDetails for processing payment.
type PaymentDetails struct {
	OrderID       string
	UserID        string
	Amount        float64
	TransactionID string
}

// ShippingDetails for shipping the order.
type ShippingDetails struct {
	OrderID string
	UserID  string
	Address string
}
The Saga Workflow Definition
The core of our Saga is the OrderWorkflow. Notice how it reads like straight-line procedural code. The magic of Temporal is that this code can effectively 'sleep' for days or weeks between steps, and if the worker process crashes, Temporal will seamlessly resume the execution on another worker from the exact same point.
workflow/order_workflow.go
package workflow

import (
	"fmt"
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"

	"temporal-saga-project/activities"
	"temporal-saga-project/shared"
)

func OrderWorkflow(ctx workflow.Context, orderDetails shared.OrderDetails) (string, error) {
	// Configure activity options with timeouts and retry policies.
	// RetryPolicy lives in the temporal package, not the workflow package.
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    time.Second,
			BackoffCoefficient: 2.0,
			MaximumInterval:    100 * time.Second,
			MaximumAttempts:    3,
		},
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	logger := workflow.GetLogger(ctx)
	logger.Info("Order workflow started", "OrderID", orderDetails.OrderID)

	// 1. Reserve Inventory
	var inventoryResult string
	err := workflow.ExecuteActivity(ctx, activities.ReserveInventory, orderDetails).Get(ctx, &inventoryResult)
	if err != nil {
		logger.Error("Failed to reserve inventory", "Error", err)
		return "", fmt.Errorf("inventory reservation failed: %w", err)
	}
	// Add compensation for inventory reservation.
	// This will execute if the workflow fails at any point after this line.
	defer func() {
		if err != nil {
			// Workflow failed, we need to compensate.
			logger.Warn("Workflow failed, compensating inventory reservation.")
			compensationCtx, _ := workflow.NewDisconnectedContext(ctx)
			if cerr := workflow.ExecuteActivity(compensationCtx, activities.CancelInventoryReservation, orderDetails).Get(compensationCtx, nil); cerr != nil {
				logger.Error("Inventory compensation failed", "Error", cerr)
			}
		}
	}()

	// 2. Process Payment
	paymentDetails := shared.PaymentDetails{
		OrderID: orderDetails.OrderID,
		UserID:  orderDetails.UserID,
		Amount:  orderDetails.TotalPrice,
	}
	var paymentResult string
	err = workflow.ExecuteActivity(ctx, activities.ProcessPayment, paymentDetails).Get(ctx, &paymentResult)
	if err != nil {
		logger.Error("Failed to process payment", "Error", err)
		return "", fmt.Errorf("payment processing failed: %w", err)
	}
	// Add compensation for payment.
	defer func() {
		if err != nil {
			// Workflow failed, we need to compensate.
			logger.Warn("Workflow failed, compensating payment.")
			compensationCtx, _ := workflow.NewDisconnectedContext(ctx)
			if cerr := workflow.ExecuteActivity(compensationCtx, activities.RefundPayment, paymentDetails).Get(compensationCtx, nil); cerr != nil {
				logger.Error("Payment compensation failed", "Error", cerr)
			}
		}
	}()

	// 3. Ship Order
	shippingDetails := shared.ShippingDetails{
		OrderID: orderDetails.OrderID,
		UserID:  orderDetails.UserID,
		Address: "123 Temporal Lane", // Example address
	}
	var shippingResult string
	err = workflow.ExecuteActivity(ctx, activities.ShipOrder, shippingDetails).Get(ctx, &shippingResult)
	if err != nil {
		logger.Error("Failed to ship order", "Error", err)
		return "", fmt.Errorf("shipping failed: %w", err)
	}
	// No compensation for shipping in this simple model once it's shipped.

	result := fmt.Sprintf("Order %s processed successfully! Inventory: %s, Payment: %s, Shipping: %s", orderDetails.OrderID, inventoryResult, paymentResult, shippingResult)
	logger.Info("Workflow completed successfully.")
	return result, nil
}
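The retry policy at the top of the workflow is easier to reason about once the waits are spelled out. The sketch below computes the schedule in plain Go, assuming the usual reading of these fields: the wait before each retry starts at InitialInterval, is multiplied by BackoffCoefficient each time, is capped at MaximumInterval, and MaximumAttempts counts the first attempt as well. RetrySchedule is a hypothetical helper written for this illustration, not an SDK function.

```go
package main

import (
	"fmt"
	"time"
)

// RetrySchedule returns the waits between attempts for a policy shaped like the
// one in OrderWorkflow. maxAttempts includes the first attempt, so a value of 3
// yields at most two waits.
func RetrySchedule(initial time.Duration, coefficient float64, maxInterval time.Duration, maxAttempts int) []time.Duration {
	var waits []time.Duration
	wait := initial
	for attempt := 1; attempt < maxAttempts; attempt++ {
		if wait > maxInterval {
			wait = maxInterval
		}
		waits = append(waits, wait)
		wait = time.Duration(float64(wait) * coefficient)
	}
	return waits
}

func main() {
	// Values copied from the ActivityOptions in OrderWorkflow.
	fmt.Println(RetrySchedule(time.Second, 2.0, 100*time.Second, 3)) // prints [1s 2s]
}
```

With these settings a failing activity is attempted three times with waits of one and two seconds in between, and each attempt is additionally bounded by the 10-second StartToCloseTimeout.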
The Durable `defer` Pattern for Compensations
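The most critical pattern in this workflow is the use of defer. As in regular Go, defer schedules a function call to be executed when the surrounding function returns; Temporal does not persist the defer stack itself. The durability comes from replay: if the worker crashes, another worker re-executes the workflow code against the recorded event history, which registers the same deferred functions in the same order before execution continues from where it left off.
By checking if err != nil inside the deferred function, we create a compensating transaction. The closure captures the workflow's err variable, which is nil if the workflow completes successfully and non-nil if any subsequent ExecuteActivity call fails. Because each defer is registered only after its step has succeeded, compensations run only for completed steps, and because deferred calls run in last-in, first-out order, they run in reverse order of the original steps. The pattern relies on every step assigning to the same err variable (err = ..., never a shadowing err := ... in a nested scope).
We also use workflow.NewDisconnectedContext(ctx). This is crucial for compensations. It creates a new context that is detached from the main workflow context's cancellation, so that if the workflow is cancelled, the cleanup/compensation logic can still be scheduled. It does not help with a workflow execution timeout: in that case the server closes the workflow and no further workflow code runs, so workflow timeouts should leave enough room for compensations.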
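The control flow of this pattern can be tried out without Temporal. The following sketch mirrors the shape of OrderWorkflow in plain Go; the process function and its step names are made up for the illustration, and a named err result plays the role of the workflow's err variable.

```go
package main

import (
	"errors"
	"fmt"
)

// process mimics the shape of OrderWorkflow without Temporal: each step that
// succeeds registers a deferred compensation guarded by the shared err value.
// failAt names the step that should fail ("" means every step succeeds).
func process(failAt string) (events []string, err error) {
	step := func(name string) error {
		if name == failAt {
			return errors.New(name + " failed")
		}
		events = append(events, name)
		return nil
	}

	if err = step("reserve"); err != nil {
		return events, err
	}
	defer func() {
		if err != nil {
			events = append(events, "undo reserve")
		}
	}()

	if err = step("charge"); err != nil {
		return events, err
	}
	defer func() {
		if err != nil {
			events = append(events, "undo charge")
		}
	}()

	if err = step("ship"); err != nil {
		return events, err
	}
	return events, nil
}

func main() {
	events, err := process("ship")
	fmt.Println(events, err) // prints [reserve charge undo charge undo reserve] ship failed
	events, err = process("")
	fmt.Println(events, err) // prints [reserve charge ship] <nil>
}
```

Calling process("charge") yields only "reserve" followed by "undo reserve": the failing step never registered its own compensation, which is the same behaviour the workflow shows when the payment activity fails.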
Section 2: Implementing Idempotent Activities
Activities are where side effects happen. A core requirement for a reliable Saga is that these side effects must be idempotent. Temporal guarantees at-least-once execution of activities. If an activity executes but the worker crashes before it can report completion back to the Temporal cluster, the cluster will time out and reschedule the activity. The downstream service must be able to handle this duplicate call without causing issues (e.g., charging a customer twice).
Idempotency is typically achieved by passing a unique key with each request. A combination of the WorkflowID and an activity-specific identifier is a good candidate. The workflow's RunID also works, but note that it changes when a workflow is retried, reset, or continued as new.
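As a rough sketch of what the receiving side of such a key might look like, the following plain-Go example keeps the first result per key and returns it for duplicates. IdempotencyStore, Do, and Key are hypothetical names, and the in-memory map stands in for what would normally be a database table with a unique constraint on the key.

```go
package main

import (
	"fmt"
	"sync"
)

// IdempotencyStore remembers the result of each request by key so that a
// retried call returns the first result instead of repeating the side effect.
type IdempotencyStore struct {
	mu      sync.Mutex
	results map[string]string
}

func NewIdempotencyStore() *IdempotencyStore {
	return &IdempotencyStore{results: make(map[string]string)}
}

// Do runs fn at most once per key and returns the stored result on repeats.
// The lock is held while fn runs, which keeps the sketch simple but serializes calls.
func (s *IdempotencyStore) Do(key string, fn func() (string, error)) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if res, ok := s.results[key]; ok {
		return res, nil
	}
	res, err := fn()
	if err != nil {
		return "", err // failures are not cached, so a retry can try again
	}
	s.results[key] = res
	return res, nil
}

// Key combines a workflow ID with a step name (an assumed naming scheme).
func Key(workflowID, step string) string {
	return workflowID + "/" + step
}

func main() {
	store := NewIdempotencyStore()
	charges := 0
	charge := func() (string, error) {
		charges++
		return fmt.Sprintf("txn-%d", charges), nil
	}
	key := Key("order-workflow-42", "process-payment")
	first, _ := store.Do(key, charge)
	second, _ := store.Do(key, charge) // simulated duplicate delivery
	fmt.Println(first, second, charges) // prints txn-1 txn-1 1
}
```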
Here are the mock implementations for our activities, demonstrating how to handle idempotency and simulate failures.
activities/inventory_activity.go
package activities

import (
	"context"
	"fmt"
	"sync"

	"temporal-saga-project/shared"
)

// In a real app, this would be a database or another service client.
// The mutex guards the map because a worker can run activities concurrently.
var (
	inventoryMu   sync.Mutex
	reservedItems = make(map[string][]string)
)

func ReserveInventory(ctx context.Context, order shared.OrderDetails) (string, error) {
	inventoryMu.Lock()
	defer inventoryMu.Unlock()
	// Idempotency Check: if already reserved for this order, just return success.
	if _, ok := reservedItems[order.OrderID]; ok {
		return fmt.Sprintf("Inventory for order %s already reserved (idempotent)", order.OrderID), nil
	}
	fmt.Printf("Reserving inventory for order: %s\n", order.OrderID)
	reservedItems[order.OrderID] = order.Items
	return fmt.Sprintf("Inventory reserved for order %s", order.OrderID), nil
}

func CancelInventoryReservation(ctx context.Context, order shared.OrderDetails) (string, error) {
	inventoryMu.Lock()
	defer inventoryMu.Unlock()
	fmt.Printf("Compensating: canceling inventory reservation for order: %s\n", order.OrderID)
	// delete is a no-op for a missing key, so a repeated cancellation is harmless.
	delete(reservedItems, order.OrderID)
	return fmt.Sprintf("Inventory reservation canceled for order %s", order.OrderID), nil
}
activities/payment_activity.go
package activities

import (
	"context"
	"errors"
	"fmt"
	"sync"

	"temporal-saga-project/shared"
)

// The mutex guards both maps because a worker can run activities concurrently.
var (
	paymentMu         sync.Mutex
	processedPayments = make(map[string]bool)
	refundedPayments  = make(map[string]bool)
)

func ProcessPayment(ctx context.Context, payment shared.PaymentDetails) (string, error) {
	paymentMu.Lock()
	defer paymentMu.Unlock()
	// Idempotency Check
	if processedPayments[payment.OrderID] {
		return fmt.Sprintf("Payment for order %s already processed (idempotent)", payment.OrderID), nil
	}
	fmt.Printf("Processing payment for order: %s for amount %.2f\n", payment.OrderID, payment.Amount)
	// Simulate a payment failure for a specific user to test compensation.
	// A plain error is retryable, so this activity is retried according to the
	// workflow's RetryPolicy before the workflow sees the failure.
	if payment.UserID == "user-who-fails-payment" {
		return "", errors.New("payment gateway declined: insufficient funds")
	}
	processedPayments[payment.OrderID] = true
	return fmt.Sprintf("Payment successful for order %s", payment.OrderID), nil
}

func RefundPayment(ctx context.Context, payment shared.PaymentDetails) (string, error) {
	paymentMu.Lock()
	defer paymentMu.Unlock()
	// Idempotency Check
	if refundedPayments[payment.OrderID] {
		return fmt.Sprintf("Payment for order %s already refunded (idempotent)", payment.OrderID), nil
	}
	fmt.Printf("Compensating: refunding payment for order: %s\n", payment.OrderID)
	delete(processedPayments, payment.OrderID)
	refundedPayments[payment.OrderID] = true
	return fmt.Sprintf("Payment refunded for order %s", payment.OrderID), nil
}
activities/shipping_activity.go
package activities

import (
	"context"
	"fmt"

	"temporal-saga-project/shared"
)

func ShipOrder(ctx context.Context, shipping shared.ShippingDetails) (string, error) {
	fmt.Printf("Shipping order %s to %s\n", shipping.OrderID, shipping.Address)
	// In a real scenario, this would call a shipping provider's API.
	return fmt.Sprintf("Order %s shipped", shipping.OrderID), nil
}
Section 3: The Worker and Starter
To run this, we need two more pieces: a Worker process that hosts the workflow and activity implementations, and a Starter process that initiates a workflow execution.
worker/main.go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"

	"temporal-saga-project/activities"
	"temporal-saga-project/workflow"
)

func main() {
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("Unable to create client", err)
	}
	defer c.Close()

	w := worker.New(c, "order-processing", worker.Options{})
	w.RegisterWorkflow(workflow.OrderWorkflow)
	w.RegisterActivity(activities.ReserveInventory)
	w.RegisterActivity(activities.CancelInventoryReservation)
	w.RegisterActivity(activities.ProcessPayment)
	w.RegisterActivity(activities.RefundPayment)
	w.RegisterActivity(activities.ShipOrder)

	err = w.Run(worker.InterruptCh())
	if err != nil {
		log.Fatalln("Unable to start worker", err)
	}
}
starter/main.go
package main

import (
	"context"
	"log"
	"os"

	"github.com/google/uuid"
	"go.temporal.io/sdk/client"

	"temporal-saga-project/shared"
	"temporal-saga-project/workflow"
)

func main() {
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("Unable to create client", err)
	}
	defer c.Close()

	// Determine if we should simulate a failure
	userID := "user-happy-path"
	if len(os.Args) > 1 && os.Args[1] == "fail" {
		userID = "user-who-fails-payment"
	}

	orderDetails := shared.OrderDetails{
		OrderID:    uuid.New().String(),
		UserID:     userID,
		Items:      []string{"item-1", "item-2"},
		TotalPrice: 120.50,
	}
	workflowOptions := client.StartWorkflowOptions{
		ID:        "order-workflow-" + orderDetails.OrderID,
		TaskQueue: "order-processing",
	}

	we, err := c.ExecuteWorkflow(context.Background(), workflowOptions, workflow.OrderWorkflow, orderDetails)
	if err != nil {
		log.Fatalln("Unable to execute workflow", err)
	}
	log.Println("Started workflow", "WorkflowID", we.GetID(), "RunID", we.GetRunID())

	// Wait for the workflow to complete.
	var result string
	err = we.Get(context.Background(), &result)
	if err != nil {
		log.Printf("Workflow failed: %v\n", err)
	} else {
		log.Printf("Workflow completed: %s\n", result)
	}
}
Running the Saga
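1. Start a local Temporal server (for example, temporal server start-dev).
2. Start the worker: go run worker/main.go
3. Run the happy path: go run starter/main.go
4. Run the failure path: go run starter/main.go fail
When you run the failure path, the payment activity fails on each of its three attempts, after which you will see the log from the inventory compensation (cancellation). No refund is attempted, because the refund compensation is registered only after a payment has succeeded. This demonstrates the Saga rollback in action.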
Section 4: Advanced Edge Cases and Production Patterns
While the above implementation is robust, real-world systems introduce more complexity. Let's discuss how to handle some advanced scenarios.
Non-Deterministic Logic Errors
A Temporal workflow must be deterministic. Its logic must produce the same sequence of commands given the same history of events. This is how Temporal can safely replay a workflow's history to recover its state. Using non-deterministic functions like time.Now() or iterating over a map (which has random iteration order in Go) can break this contract: on replay, the commands the code produces no longer match the recorded history, and the resulting non-determinism error blocks the workflow from making progress.
Incorrect (Non-Deterministic):
// In a workflow
if time.Now().Weekday() == time.Friday {
	// ... apply discount
}
// Ranging over a map
for k, v := range myMap {
	// ... this order is not guaranteed
}
Correct (Deterministic):
// Use Temporal's deterministic time function
if workflow.Now(ctx).Weekday() == time.Friday {
	// ... apply discount
}
// Get keys, sort them, then iterate
keys := make([]string, 0, len(myMap))
for k := range myMap {
	keys = append(keys, k)
}
sort.Strings(keys)
for _, k := range keys {
	v := myMap[k]
	// ... this order is now guaranteed
}
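To see why replay demands determinism, it can help to look at a toy model. The sketch below is not how the Temporal SDK is implemented; it only records the commands a function emits and then checks a second run against that record. The names workflowFunc, record, and replay are invented for the illustration.

```go
package main

import "fmt"

// workflowFunc is a toy workflow: it emits the names of the commands it would schedule.
type workflowFunc func(emit func(command string))

// record runs the workflow once and returns the commands it emitted.
func record(wf workflowFunc) []string {
	var history []string
	wf(func(c string) { history = append(history, c) })
	return history
}

// replay re-runs the workflow and checks each emitted command against history.
func replay(wf workflowFunc, history []string) error {
	var mismatch error
	next := 0
	wf(func(c string) {
		if mismatch == nil && (next >= len(history) || history[next] != c) {
			mismatch = fmt.Errorf("non-determinism: command %d %q does not match history", next, c)
		}
		next++
	})
	if mismatch == nil && next != len(history) {
		mismatch = fmt.Errorf("non-determinism: replay produced %d commands, history has %d", next, len(history))
	}
	return mismatch
}

func main() {
	stable := func(emit func(string)) {
		emit("ReserveInventory")
		emit("ProcessPayment")
	}
	runs := 0 // stands in for wall-clock time or random map order
	flaky := func(emit func(string)) {
		runs++
		if runs%2 == 1 {
			emit("ApplyDiscount")
		}
		emit("ProcessPayment")
	}
	fmt.Println(replay(stable, record(stable)))      // prints <nil>
	fmt.Println(replay(flaky, record(flaky)) != nil) // prints true
}
```

The flaky function takes a different branch on its second run, just as a workflow reading time.Now() might when it is replayed on a different day.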
Workflow Versioning for In-flight Sagas
What happens when you need to change the logic of your OrderWorkflow? If you deploy new worker code, any in-flight workflows that are 'sleeping' might be resumed by the new code, potentially breaking them if the history is no longer compatible.
Temporal provides a versioning API, workflow.GetVersion, to manage this. You can mark sections of your code with a version, allowing you to introduce new logic paths while maintaining compatibility for old workflows.
Example: Let's add a new fraud check step between inventory and payment.
// ... inside OrderWorkflow after inventory reservation
// Use a change ID and a version number
version := workflow.GetVersion(ctx, "add-fraud-check", workflow.DefaultVersion, 1)
if version == 1 {
	// This is the new logic path for workflows started on the new code.
	// CheckFraud is a new activity; it must be implemented and registered on the worker.
	var fraudResult string
	err = workflow.ExecuteActivity(ctx, activities.CheckFraud, orderDetails).Get(ctx, &fraudResult)
	if err != nil {
		logger.Error("Fraud check failed", "Error", err)
		return "", fmt.Errorf("fraud check failed: %w", err)
	}
}
// The rest of the workflow (payment, shipping) continues here...
Executions whose history already passed this point before the new code was deployed have no version marker recorded there, so on replay GetVersion returns workflow.DefaultVersion for the "add-fraud-check" change ID and the new activity is skipped. Executions that reach this point for the first time on the new code record version 1 and execute the fraud check. This allows for safe, graceful migration of long-running Sagas.
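The decision described above can be modelled in a few lines of plain Go. This is a simplified mental model under stated assumptions, not the SDK's implementation: the execution type, its markers map, and the getVersion method are hypothetical, and the DefaultVersion constant here is just a local placeholder for workflow.DefaultVersion.

```go
package main

import "fmt"

// DefaultVersion is a local placeholder meaning "code that ran before the change existed".
const DefaultVersion = -1

// execution is a toy stand-in for one workflow run's recorded history.
type execution struct {
	markers   map[string]int // change ID -> version recorded in history
	replaying bool           // true while re-running code that already executed
}

// getVersion sketches the decision made at a versioned change point.
func (e *execution) getVersion(changeID string, maxSupported int) int {
	if v, ok := e.markers[changeID]; ok {
		return v // history already decided which branch this run took
	}
	if e.replaying {
		return DefaultVersion // ran before the change existed: keep the old path
	}
	e.markers[changeID] = maxSupported // first time here: record the new path
	return maxSupported
}

// steps returns the activities a run would schedule up to the payment step.
func steps(e *execution) []string {
	out := []string{"ReserveInventory"}
	if e.getVersion("add-fraud-check", 1) == 1 {
		out = append(out, "CheckFraud")
	}
	return append(out, "ProcessPayment")
}

func main() {
	old := &execution{markers: map[string]int{}, replaying: true}
	fresh := &execution{markers: map[string]int{}}
	fmt.Println(steps(old))   // prints [ReserveInventory ProcessPayment]
	fmt.Println(steps(fresh)) // prints [ReserveInventory CheckFraud ProcessPayment]
	fresh.replaying = true
	fmt.Println(steps(fresh)) // prints [ReserveInventory CheckFraud ProcessPayment]
}
```

The last call shows the point of recording the marker: once a run has taken the new path, later replays of that run take it again.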
Handling Compensation Failures
Our current model assumes compensations will succeed. What if RefundPayment fails? The Saga is now in an inconsistent state—inventory was un-reserved, but the customer was not refunded.
This is a business logic problem, not a technical one, but Temporal gives you the tools to model the solution.
// Inside the defer block
// (RetryPolicy comes from the go.temporal.io/sdk/temporal package)
compensationAo := workflow.ActivityOptions{
	StartToCloseTimeout: 30 * time.Second,
	RetryPolicy: &temporal.RetryPolicy{
		MaximumAttempts: 0, // Infinite retries
	},
}
compensationCtx, _ := workflow.NewDisconnectedContext(ctx)
compensationCtx = workflow.WithActivityOptions(compensationCtx, compensationAo)
_ = workflow.ExecuteActivity(compensationCtx, activities.RefundPayment, paymentDetails).Get(compensationCtx, nil)
If retries alone cannot resolve the failure, the workflow can block on workflow.Await until a human signals it (using a Temporal Signal) that the issue has been resolved manually, at which point the Saga can conclude.
Scaling with Task Queues
In a high-volume system, you may find that payment processing is a bottleneck. You can scale the workers that handle payments independently from those that handle inventory or shipping. This is done by assigning activities to different Task Queues.
In the Workflow:
// Use a specific task queue for payment activities
paymentAo := workflow.ActivityOptions{
	TaskQueue:           "payment-processing-queue",
	StartToCloseTimeout: 10 * time.Second,
}
ctxForPayment := workflow.WithActivityOptions(ctx, paymentAo)
err = workflow.ExecuteActivity(ctxForPayment, activities.ProcessPayment, paymentDetails).Get(ctxForPayment, &paymentResult)
On the Worker:
// A dedicated worker pool for payments
paymentWorker := worker.New(c, "payment-processing-queue", worker.Options{})
paymentWorker.RegisterActivity(activities.ProcessPayment)
paymentWorker.RegisterActivity(activities.RefundPayment)
// Run this worker on a separate set of machines
err = paymentWorker.Run(worker.InterruptCh())
This allows you to provision more resources specifically for the payment-processing-queue without over-provisioning for the less-intensive inventory and shipping activities.
Conclusion
The Saga pattern is a powerful tool for maintaining data consistency in a distributed architecture. However, its manual implementation is a minefield of state management, error handling, and concurrency problems. By leveraging a durable execution engine like Temporal, you can implement complex, long-running Sagas as straightforward, readable code.
The key takeaway is to shift your mindset from building state machines and message handlers to writing durable business logic. By using defer-based compensations inside a durable workflow, ensuring activity idempotency, planning for workflow versioning, and thoughtfully handling compensation failures, you can build resilient and maintainable microservice applications. Temporal doesn't just make implementing Sagas easier; it makes them more robust by handling the persistence, retries, and state recovery for you.