Implementing Resilient Sagas with Temporal for Distributed Transactions
The Fallacy of Two-Phase Commit in Modern Architectures
In a monolithic world, ACID transactions managed by a single relational database were the bedrock of data consistency. As we've decomposed systems into distributed microservices, the seductive simplicity of BEGIN TRANSACTION...COMMIT has been replaced by the harsh reality of network partitions, service unavailability, and independent data stores. The classic solution for distributed transactions, the Two-Phase Commit (2PC) protocol, while theoretically sound, is practically unworkable in most high-throughput, loosely-coupled environments. Its requirements for locking resources across multiple services for the duration of a transaction introduce tight coupling, create performance bottlenecks, and drastically reduce system availability. A failure in the transaction coordinator or a single participant can stall the entire operation, a catastrophic outcome for business-critical processes.
This is where the Saga pattern emerges. A saga is a sequence of local transactions where each transaction updates data within a single service. The key distinction is that after each local transaction completes, it immediately publishes an event or calls the next service in the chain. If a local transaction fails, the saga executes a series of compensating transactions to semantically undo the preceding successful transactions. This approach favors availability and eventual consistency over the strict, immediate consistency of 2PC.
However, implementing a Saga is non-trivial. A choreographed saga, driven by event-passing between services, can quickly become an untraceable, un-debuggable web of dependencies. An orchestrated saga, which uses a central coordinator, is a more robust pattern, but building that orchestrator is a significant engineering challenge. You need to manage state, handle retries with exponential backoff, ensure idempotency, manage timeouts, and provide visibility into long-running processes. Building this from scratch using a database as a state machine is a path fraught with boilerplate, subtle bugs, and operational overhead.
This is the precise problem domain where a durable execution system like Temporal.io excels. Temporal provides a programming model that abstracts away the complexities of state management for distributed processes. A Temporal Workflow is a durable, resumable function whose entire state, including local variables and call stack, is persisted by the Temporal Cluster. This allows you to write complex orchestration logic, including sagas, as straightforward code, while the platform handles the underlying mechanics of retries, persistence, and recovery from failure. In this post, we will bypass the introductory concepts and dive straight into building a production-grade, resilient saga using Temporal, exploring the advanced patterns necessary to handle real-world complexity.
Core Implementation: An E-Commerce Order Saga
Let's model a canonical e-commerce order fulfillment process. When a user places an order, a series of actions must occur across different microservices:
If any of these steps fail, we must roll back the completed steps. For example, if payment fails, we must release the inventory reservation. If shipment creation fails after a successful payment, we must refund the payment and release the inventory. This is our saga.
The Forward and Compensating Actions:
| Step | Forward Action (Activity) | Compensation Action (Activity) |
|---|---|---|
| 1. Reserve Items | ReserveInventory | ReleaseInventory |
| 2. Process Payment | ProcessPayment | RefundPayment |
| 3. Create Shipment | CreateShipment | CancelShipment |
We will implement this orchestration using a Temporal Workflow. The core pattern for a saga in Temporal is to execute forward actions sequentially. As each action succeeds, we add its corresponding compensation action to a list. If a subsequent forward action fails, we iterate through our list of compensations in reverse order and execute them.
Here is a complete, production-ready implementation in Go.
Project Structure
/temporal-saga-ecommerce
├── activities.go // Implementations of our business logic (stubs for now)
├── shared.go // Shared data structures (OrderDetails, etc.)
├── worker/
│ └── main.go // The Temporal Worker process
├── workflow.go // The core Saga orchestration logic
└── starter/
└── main.go // A client to start the workflow
`shared.go`: Data Structures
package ecommerce
// OrderDetails contains all information for the order saga.
type OrderDetails struct {
OrderID string
Items []string
CardToken string
Amount float64
Address string
}
`activities.go`: Business Logic Stubs
In a real system, these functions would make gRPC or REST calls to other services. For this example, we'll simulate their behavior, including potential failures.
package ecommerce
import (
"context"
"errors"
"fmt"
"go.temporal.io/sdk/activity"
)
// Activities struct holds dependencies for our activities.
type Activities struct{
// In a real app, you'd have DB connections, clients, etc.
}
func (a *Activities) ReserveInventory(ctx context.Context, items []string) error {
logger := activity.GetLogger(ctx)
logger.Info("Reserving inventory for items", "Items", items)
// Simulate business logic
return nil
}
func (a *Activities) ReleaseInventory(ctx context.Context, items []string) error {
logger := activity.GetLogger(ctx)
logger.Info("Releasing inventory for items", "Items", items)
// This compensation should be designed to never fail or be idempotent.
return nil
}
func (a *Activities) ProcessPayment(ctx context.Context, cardToken string, amount float64) (string, error) {
logger := activity.GetLogger(ctx)
logger.Info("Processing payment", "Amount", amount)
// Simulate a failing payment gateway
if cardToken == "fail-payment" {
logger.Error("Payment processing failed due to invalid card token.")
return "", errors.New("invalid card token")
}
transactionID := fmt.Sprintf("txn-%d", activity.GetInfo(ctx).Attempt)
return transactionID, nil
}
func (a *Activities) RefundPayment(ctx context.Context, transactionID string) error {
logger := activity.GetLogger(ctx)
logger.Info("Refunding payment", "TransactionID", transactionID)
return nil
}
func (a *Activities) CreateShipment(ctx context.Context, address string) error {
logger := activity.GetLogger(ctx)
logger.Info("Creating shipment", "Address", address)
if address == "fail-shipment" {
logger.Error("Shipment creation failed.")
return errors.New("invalid address for shipment")
}
return nil
}
func (a *Activities) CancelShipment(ctx context.Context, address string) error {
logger := activity.GetLogger(ctx)
logger.Info("Cancelling shipment", "Address", address)
return nil
}
`workflow.go`: The Saga Orchestrator
This is the heart of our implementation. We use a defer block with a closure to manage the compensation logic. This is a clean and robust pattern. If the workflow function returns an error at any point, the deferred function will execute, running the compensations.
package ecommerce
import (
"time"
"go.temporal.io/sdk/workflow"
)
func OrderSagaWorkflow(ctx workflow.Context, order OrderDetails) error {
// Set up activity options. Timeouts are crucial for production systems.
ao := workflow.ActivityOptions{
StartToCloseTimeout: 10 * time.Second,
}
ctx = workflow.WithActivityOptions(ctx, ao)
// Use a compensation stack.
var compensations []func(workflow.Context)
var executionErr error
// The defer block ensures compensations run if an error occurs.
defer func() {
if executionErr != nil {
wfLogger := workflow.GetLogger(ctx)
wfLogger.Error("Workflow failed, starting compensation.", "Error", executionErr)
// Execute compensations in reverse order.
for i := len(compensations) - 1; i >= 0; i-- {
// ExecuteActivity does not block, so we need to wait on the future.
compensations[i](ctx)
}
}
}()
// Step 1: Reserve Inventory
executionErr = workflow.ExecuteActivity(ctx, "ReserveInventory", order.Items).Get(ctx, nil)
if executionErr != nil {
return executionErr
}
compensations = append(compensations, func(ctx workflow.Context) {
_ = workflow.ExecuteActivity(ctx, "ReleaseInventory", order.Items).Get(ctx, nil)
})
// Step 2: Process Payment
var transactionID string
executionErr = workflow.ExecuteActivity(ctx, "ProcessPayment", order.CardToken, order.Amount).Get(ctx, &transactionID)
if executionErr != nil {
return executionErr
}
compensations = append(compensations, func(ctx workflow.Context) {
_ = workflow.ExecuteActivity(ctx, "RefundPayment", transactionID).Get(ctx, nil)
})
// Step 3: Create Shipment
executionErr = workflow.ExecuteActivity(ctx, "CreateShipment", order.Address).Get(ctx, nil)
if executionErr != nil {
return executionErr
}
// No compensation for the last step if it's successful.
workflow.GetLogger(ctx).Info("Order workflow completed successfully.")
return nil
}
Running the System
The worker registers the workflow and activity functions and polls a task queue. The starter client submits a new workflow execution.
worker/main.go
package main
import (
"log"
"go.temporal.io/sdk/client"
"go.temporal.io/sdk/worker"
"temporal-saga-ecommerce"
)
func main() {
c, err := client.Dial(client.Options{})
if err != nil {
log.Fatalln("Unable to create client", err)
}
defer c.Close()
w := worker.New(c, "order-saga-task-queue", worker.Options{})
activities := &ecommerce.Activities{}
w.RegisterWorkflow(ecommerce.OrderSagaWorkflow)
w.RegisterActivity(activities)
err = w.Run(worker.InterruptCh())
if err != nil {
log.Fatalln("Unable to start worker", err)
}
}
starter/main.go
package main
import (
"context"
"log"
"github.com/google/uuid"
"go.temporal.io/sdk/client"
"temporal-saga-ecommerce"
)
func main() {
c, err := client.Dial(client.Options{})
if err != nil {
log.Fatalln("Unable to create client", err)
}
defer c.Close()
// Scenario 1: Successful order
// orderSuccess := ecommerce.OrderDetails{
// OrderID: uuid.New().String(),
// Items: []string{"item-1", "item-2"},
// CardToken: "valid-token",
// Amount: 199.99,
// Address: "123 Temporal Lane",
// }
// Scenario 2: Failed payment
orderFailPayment := ecommerce.OrderDetails{
OrderID: uuid.New().String(),
Items: []string{"item-3", "item-4"},
CardToken: "fail-payment",
Amount: 50.00,
Address: "456 Saga Street",
}
options := client.StartWorkflowOptions{
ID: "order-saga-workflow-" + orderFailPayment.OrderID,
TaskQueue: "order-saga-task-queue",
}
we, err := c.ExecuteWorkflow(context.Background(), options, ecommerce.OrderSagaWorkflow, orderFailPayment)
if err != nil {
log.Fatalln("Unable to execute workflow", err)
}
log.Println("Started workflow", "WorkflowID", we.GetID(), "RunID", we.GetRunID())
// Wait for the workflow to complete.
var result error
err = we.Get(context.Background(), &result)
if err != nil {
log.Printf("Workflow resulted in an error: %v", err)
} else {
log.Println("Workflow completed successfully.")
}
}
When you run the starter with the orderFailPayment scenario, you will see logs indicating that ReserveInventory was called, then ProcessPayment failed, and finally, ReleaseInventory was called as a compensation. The workflow correctly rolled back its state.
Advanced Patterns and Edge Cases
The basic implementation is solid, but production systems introduce more complexity. Let's address critical edge cases.
1. Activity Idempotency: The Cornerstone of Reliability
Temporal guarantees at-least-once execution for activities. This means an activity might run more than once due to worker crashes, network issues, or timeouts followed by a retry. If your ProcessPayment activity is not idempotent, you could double-charge a customer. This is unacceptable.
The solution is to design your downstream services to be idempotent. The standard pattern is to pass a unique idempotency key with each request. The workflow is the perfect place to generate this key.
Let's modify our ProcessPayment activity to accept an idempotency key.
Modified OrderSagaWorkflow:
// ... inside OrderSagaWorkflow
// Step 2: Process Payment
// Generate a deterministic idempotency key for this specific action.
// This ensures that if the activity retries, it uses the same key.
paymentIdempotencyKey := order.OrderID + "-payment"
var transactionID string
executionErr = workflow.ExecuteActivity(ctx, "ProcessPayment", paymentIdempotencyKey, order.CardToken, order.Amount).Get(ctx, &transactionID)
// ... rest of the logic
Modified ProcessPayment Activity and Service:
// activities.go
func (a *Activities) ProcessPayment(ctx context.Context, idempotencyKey, cardToken string, amount float64) (string, error) {
// ... this would be a gRPC/HTTP call
// paymentServiceClient.Charge(ctx, &pb.ChargeRequest{IdempotencyKey: idempotencyKey, ...})
// ...
}
// In your Payment Service (pseudocode)
func (s *PaymentService) Charge(ctx context.Context, req *pb.ChargeRequest) (*pb.ChargeResponse, error) {
// 1. Check if we've already processed this idempotency key
dbTx, err := s.db.BeginTx(ctx, nil)
// ... error handling
defer dbTx.Rollback()
var existingTxID string
err = dbTx.QueryRowContext(ctx, "SELECT transaction_id FROM processed_payments WHERE idempotency_key = $1", req.IdempotencyKey).Scan(&existingTxID)
if err == nil {
// Key found, return the result of the original operation
return &pb.ChargeResponse{TransactionId: existingTxID}, nil
}
if err != sql.ErrNoRows {
return nil, status.Error(codes.Internal, "database error")
}
// 2. Key not found, proceed with the operation
transactionID, err := s.paymentGateway.Charge(req.CardToken, req.Amount)
if err != nil {
return nil, status.Error(codes.Internal, "payment gateway failure")
}
// 3. Store the result and idempotency key atomically
_, err = dbTx.ExecContext(ctx, "INSERT INTO processed_payments (idempotency_key, transaction_id) VALUES ($1, $2)", req.IdempotencyKey, transactionID)
if err != nil {
return nil, status.Error(codes.Internal, "failed to save idempotency key")
}
err = dbTx.Commit()
if err != nil {
return nil, status.Error(codes.Internal, "commit failed")
}
return &pb.ChargeResponse{TransactionId: transactionID}, nil
}
This pattern ensures that no matter how many times the ProcessPayment activity is retried by Temporal, the customer is charged exactly once.
2. Asynchronous Activity Completion for Human-in-the-Loop
What if a step in your saga requires manual intervention? For example, an order over $10,000 might need to be reviewed by a fraud analyst. The workflow cannot simply wait for hours or days. This is a perfect use case for Asynchronous Activity Completion.
activity.ErrResultPending.- The worker reports to the Temporal server that the activity is pending. The workflow execution is now effectively paused at this step.
- The fraud analyst reviews the case. When they approve or deny it, their system uses the Temporal client to send a completion signal back to the specific activity, providing its result.
- The Temporal server receives this signal, unblocks the activity, and the workflow continues.
Implementation:
// activities.go
func (a *Activities) ManualFraudCheck(ctx context.Context, orderID string) (bool, error) {
logger := activity.GetLogger(ctx)
// Get the task token, which uniquely identifies this activity execution
taskToken := activity.GetInfo(ctx).TaskToken
logger.Info("Starting manual fraud check", "OrderID", orderID, "TaskToken", string(taskToken))
// In a real system, you would now call an external system (Jira, etc.)
// and pass it the taskToken. That system will use this token to complete the activity.
// jiraClient.CreateTicket("Fraud Check for "+orderID, string(taskToken))
// Tell the Temporal worker that this activity will be completed externally.
return false, activity.ErrResultPending
}
// A separate service/CLI tool used by the fraud analyst to complete the activity
func CompleteFraudCheck(temporalClient client.Client, taskToken []byte, approved bool) {
if approved {
err := temporalClient.CompleteActivity(context.Background(), taskToken, true, nil)
// ...
} else {
err := temporalClient.CompleteActivity(context.Background(), taskToken, nil, errors.New("fraud check denied"))
// ...
}
}
This pattern allows you to integrate long-running, human-driven processes directly into your automated sagas without complex polling or external state machines.
3. Workflow Versioning for In-Flight Migrations
Your business logic will change. What happens if you have a thousand in-flight order sagas and you need to deploy a new version of the workflow code that adds a new step, like NotifyAnalytics? If you simply deploy the new code, the replay of an existing workflow history will fail because the new code expects a NotifyAnalytics step that doesn't exist in the old history. This is a deterministic execution violation.
Temporal solves this with its workflow versioning API, workflow.GetVersion.
func OrderSagaWorkflowV2(ctx workflow.Context, order OrderDetails) error {
// ... setup ...
// ... ReserveInventory and ProcessPayment steps ...
// VERSIONING BLOCK: Introduce a new optional step
v := workflow.GetVersion(ctx, "AddAnalyticsNotification", workflow.DefaultVersion, 1)
if v == 1 {
// This is new code logic. It will only execute for new workflows
// or for existing workflows that reach this point after the code is deployed.
_ = workflow.ExecuteActivity(ctx, "NotifyAnalytics", order.OrderID).Get(ctx, nil)
}
// ... CreateShipment step ...
return nil
}
How it works:
When a new* workflow starts with this code, GetVersion will return 1, and the NotifyAnalytics activity will execute. This execution will be recorded in the workflow's history.
When an old* workflow (started before this code was deployed) replays, it will hit the GetVersion call. The Temporal server, seeing no such version marker in the existing history, will return workflow.DefaultVersion (which is 0). The if v == 1 block will be skipped, and the replay will proceed, perfectly matching the original history.
This powerful feature allows for the safe evolution of complex, long-running business processes without requiring a "big bang" migration or running multiple versions of your worker fleet.
Performance and Scalability Considerations
* Activity Timeouts are Mandatory: Never run activities without StartToClose timeouts. This timeout dictates how long a single attempt of an activity can run. If it's exceeded, Temporal will retry the activity according to its retry policy. Without it, a stuck activity can hold worker resources indefinitely. Also, configure ScheduleToClose (the total time for all retries) and ScheduleToStart (how long it can wait in the task queue before being picked up).
* Task Queue Tuning: A Temporal worker has several knobs for tuning concurrency. The most important is MaxConcurrentActivityExecutionSize. Setting this too low will bottleneck your throughput. Setting it too high can overwhelm your worker's resources (CPU, memory, DB connection pool). Profile your activities and tune this value based on real-world performance metrics. You can also run different workers for different task queues (e.g., a high-throughput queue for short activities, a low-concurrency queue for resource-intensive ones).
Compensation Logic Reliability: Your compensation activities must* be as reliable as possible. They should be idempotent and designed to not fail under normal circumstances. If a compensation fails, Temporal will retry it, but a persistently failing compensation can lead to an inconsistent state (e.g., inventory reserved, payment taken, but refund continuously failing). This often requires out-of-band alerting for manual intervention.
Conclusion: Beyond State Machines
By combining the Saga pattern with Temporal, we elevate the implementation from a brittle, homegrown state machine to a truly resilient and observable distributed process. The core logic of the business transaction remains clean and readable, while Temporal handles the messy realities of distributed systems: state persistence, retries, timeouts, and recovery.
We've moved beyond a simple implementation to address production-critical concerns: ensuring idempotency to prevent duplicate charges, integrating human-in-the-loop processes with asynchronous activity completion, and managing code evolution with workflow versioning. These advanced patterns are not edge cases; they are the requirements for building robust, long-lived business applications on a microservices architecture. By offloading this complex orchestration and fault-tolerance logic to a dedicated platform, your service-level code can focus purely on its business domain, leading to faster development, higher reliability, and clearer operational insight into your most critical processes.