Durable Transactional Sagas with Temporal in Polyglot Systems

16 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Inherent Challenge: Atomicity in Distributed Systems

In a monolithic architecture, maintaining data consistency is often achieved through ACID transactions provided by a single relational database. As we decompose systems into microservices, each service typically owns its data, residing in separate databases. This model improves scalability and team autonomy but shatters the simplicity of atomic operations. A single business process, like creating a customer order, might now span multiple services: Orders, Inventory, and Payments. If the payment fails after the inventory has been reserved, how do we ensure the system returns to a consistent state?

The traditional solution, Two-Phase Commit (2PC), introduces a transaction coordinator that orchestrates a 'prepare' and 'commit' phase across all participating services. While guaranteeing atomicity, 2PC is notoriously brittle in distributed environments. It's a synchronous, blocking protocol that reduces availability—the failure of the coordinator or any participant can lock resources for extended periods. For these reasons, 2PC is rarely a practical choice for high-throughput microservice architectures.

This leads us to the Saga pattern, a design for managing data consistency across microservices in the absence of distributed transactions. A Saga is a sequence of local transactions where each transaction updates data within a single service. When a local transaction completes, the Saga triggers the next one. If any local transaction fails, the Saga executes a series of compensating transactions to semantically undo the preceding successful transactions.

While the Saga pattern is conceptually sound, its implementation is fraught with peril. A naive implementation requires building a complex, stateful orchestration engine. You must track the Saga's progress, durably store its state, handle retries with exponential backoff, manage timeouts, and ensure compensating actions are reliably executed. Building this bespoke infrastructure is a significant engineering effort, distracting from core business logic.

This is where Temporal enters the picture. Temporal provides a durable virtual memory for your application's state, turning the complex challenge of Saga orchestration into a 'workflow-as-code' problem. By leveraging Temporal's primitives—durable Workflows and reliable Activities—we can implement complex Sagas that are resilient to process crashes, server failures, and network partitions, without building a custom orchestration engine. This article will demonstrate how to implement a production-grade, polyglot Saga using Temporal, focusing on the advanced patterns required for real-world systems.

The Temporal Primitives for Building Sagas

Assuming a working knowledge of Temporal, let's focus on why its primitives are perfectly suited for the Saga pattern. A Temporal Workflow is a deterministic, resumable function execution. The state of the workflow, including local variables and the current point of execution, is durably persisted by the Temporal Cluster.

  • Workflows as Saga Coordinators: The workflow code itself becomes the Saga's orchestrator or coordinator. Its sequence of Activity executions defines the forward-recovery path of the Saga.
  • Activities as Local Transactions: Each Activity represents a local transaction within a microservice. Temporal guarantees that an Activity will be executed at least once. This reliability is the foundation of the Saga's forward progress.
  • Durability as the Saga Log: The workflow's event history, managed by the Temporal Cluster, serves as the Saga's audit log. It's a persistent, append-only record of every step taken, every success, and every failure. We don't need a separate database table to track Saga state; it's an intrinsic property of the workflow execution.
  • Crucially, this model allows us to write orchestration logic as simple, sequential code while Temporal handles the underlying distributed state management. The most powerful pattern for implementing Sagas in Temporal, especially in Go, is leveraging the language's defer statement within the workflow code. A deferred function call is pushed onto a stack and executed when the surrounding function returns. In a Temporal Workflow, this stack is durably persisted. If the workflow fails at any point, the deferred functions are executed in Last-In, First-Out (LIFO) order—the exact behavior required for executing compensating actions.

    Core Implementation: A Polyglot E-commerce Order Saga

    Let's model a realistic e-commerce order creation process. The Saga involves three services:

  • Order Service (Go): Creates the initial order record in a 'PENDING' state.
  • Inventory Service (Python): Reserves the items specified in the order.
  • Payment Service (Go): Charges the customer's credit card.
  • If any step fails, we must execute compensating actions: if payment fails, we un-reserve the inventory. If inventory reservation fails, we mark the order as 'FAILED'. Our orchestrating workflow will be written in Go, but it will invoke an Activity implemented in Python, demonstrating Temporal's powerful polyglot capabilities.

    The Go Orchestrator (Workflow)

    We'll define a workflow CreateOrderSaga that executes the three activities sequentially. We will use defer to build our stack of compensating actions. Notice how clean this makes the business logic; the happy path is clear, and the failure path is handled implicitly.

    workflow.go

    go
    package order
    
    import (
    	"time"
    
    	"go.temporal.io/sdk/temporal"
    	"go.temporal.io/sdk/workflow"
    )
    
    // OrderDetails contains information about the new order.
    type OrderDetails struct {
    	OrderID    string
    	CustomerID string
    	ItemID     string
    	Quantity   int
    	Price      float64
    }
    
    // CreateOrderSaga is a workflow that orchestrates the creation of an order.
    func CreateOrderSaga(ctx workflow.Context, orderDetails OrderDetails) (string, error) {
    	wfLogger := workflow.GetLogger(ctx)
    	wfLogger.Info("Saga workflow started.", "OrderID", orderDetails.OrderID)
    
    	ao := workflow.ActivityOptions{
    		StartToCloseTimeout: 10 * time.Second,
    		RetryPolicy: &temporal.RetryPolicy{
    			InitialInterval:    time.Second,
    			BackoffCoefficient: 2.0,
    			MaximumInterval:    100 * time.Second,
    			MaximumAttempts:    3,
    		},
    	}
    	ctx = workflow.WithActivityOptions(ctx, ao)
    
    	// Use a map to store compensation functions. This is more flexible than direct defers
    	// if we need to conditionally add/remove compensations.
    	var compensations []func() error
    	var executionErr error
    
    	// Defer the execution of compensations. This will run when the workflow function exits,
    	// either successfully or with an error.
    	defer func() {
    		if executionErr != nil {
    			wfLogger.Error("Saga failed. Starting compensation logic.", "Error", executionErr)
    			// Execute compensations in LIFO order
    			for i := len(compensations) - 1; i >= 0; i-- {
    				// Compensations should be designed to be idempotent and retry indefinitely until success.
    				compensationRetryPolicy := &temporal.RetryPolicy{
    					InitialInterval:    time.Second,
    					BackoffCoefficient: 2.0,
    					MaximumInterval:    100 * time.Second,
    					// No max attempts for compensations
    				}
    				compensationCtx := workflow.WithActivityOptions(workflow.NewDisconnectedContext(ctx), workflow.ActivityOptions{
    					StartToCloseTimeout: 5 * time.Minute,
    					RetryPolicy:         compensationRetryPolicy,
    				})
    				if err := compensations[i](); err != nil {
    					wfLogger.Error("Compensation activity failed. This is critical.", "Error", err)
                        // In a real system, you might escalate this failure to a human operator.
    				}
    			}
    		}
    	}()
    
    	// 1. Create Order in DB
    	err := workflow.ExecuteActivity(ctx, CreateOrderActivity, orderDetails).Get(ctx, nil)
    	if err != nil {
    		executionErr = err
    		return "", err
    	}
    	// Add compensation for CreateOrder
    	compensations = append(compensations, func() error {
    		return workflow.ExecuteActivity(ctx, MarkOrderAsFailedActivity, orderDetails.OrderID).Get(ctx, nil)
    	})
    
    	// 2. Reserve Inventory (Python Activity)
    	// Note: we specify a different task queue for the Python worker
    	pyActivityOpts := workflow.ActivityOptions{
    		TaskQueue:           "inventory-task-queue",
    		StartToCloseTimeout: 10 * time.Second,
    	}
    	pyCtx := workflow.WithActivityOptions(ctx, pyActivityOpts)
    	err = workflow.ExecuteActivity(pyCtx, "ReserveInventoryActivity", orderDetails).Get(ctx, nil)
    	if err != nil {
    		executionErr = err
    		return "", err
    	}
    	// Add compensation for ReserveInventory
    	compensations = append(compensations, func() error {
    		return workflow.ExecuteActivity(pyCtx, "UnreserveInventoryActivity", orderDetails).Get(ctx, nil)
    	})
    
    	// 3. Process Payment
    	err = workflow.ExecuteActivity(ctx, ProcessPaymentActivity, orderDetails).Get(ctx, nil)
    	if err != nil {
    		executionErr = err
    		return "", err
    	}
    
    	// 4. Mark order as COMPLETED
    	err = workflow.ExecuteActivity(ctx, MarkOrderAsCompletedActivity, orderDetails.OrderID).Get(ctx, nil)
    	if err != nil {
            // This is a tricky state. Payment succeeded but we can't update our DB.
            // The compensation logic will run, refunding the payment. 
            // This highlights the need for robust, idempotent activities.
    		executionErr = err
    		return "", err
    	}
    
    	wfLogger.Info("Saga workflow completed successfully.", "OrderID", orderDetails.OrderID)
    	return "Order completed successfully", nil
    }
    

    This workflow demonstrates a robust Saga pattern. Instead of direct defer calls for activities, we append anonymous functions to a compensations slice. This gives us more control and makes the logic explicit. The final defer block checks if an error occurred (executionErr != nil) and, if so, iterates through the compensations in reverse order. Note the use of a DisconnectedContext for compensations, ensuring they run even if the parent workflow context is cancelled.

    The Go Activity Workers

    The Go workers implement the activities related to orders and payments. They would contain the actual business logic for interacting with databases or payment gateways.

    activities.go

    go
    package order
    
    import (
    	"context"
    	"fmt"
    )
    
    // Mock activities for demonstration
    
    func CreateOrderActivity(ctx context.Context, details OrderDetails) error {
    	fmt.Printf("Creating order %s in database with status PENDING.\n", details.OrderID)
    	// DB insert logic here
    	return nil
    }
    
    func MarkOrderAsFailedActivity(ctx context.Context, orderID string) error {
    	fmt.Printf("COMPENSATION: Marking order %s as FAILED.\n", orderID)
    	// DB update logic here
    	return nil
    }
    
    func ProcessPaymentActivity(ctx context.Context, details OrderDetails) error {
    	fmt.Printf("Processing payment of $%.2f for order %s.\n", details.Price, details.OrderID)
    	// Payment gateway integration here
        // To test failure, uncomment the line below:
        // return fmt.Errorf("payment gateway declined transaction")
    	return nil
    }
    
    func MarkOrderAsCompletedActivity(ctx context.Context, orderID string) error {
    	fmt.Printf("Marking order %s as COMPLETED.\n", orderID)
    	// DB update logic here
    	return nil
    }

    The Python Activity Worker

    This is where the polyglot nature shines. A separate worker, written in Python, can handle the inventory-related tasks. It connects to the same Temporal namespace and listens on a specific task queue (inventory-task-queue).

    inventory_worker.py

    python
    import asyncio
    from temporalio.client import Client
    from temporalio.worker import Worker
    from dataclasses import dataclass
    
    # Define the same data structure for interoperability
    @dataclass
    class OrderDetails:
        OrderID: str
        CustomerID: str
        ItemID: str
        Quantity: int
        Price: float
    
    # Define activities as simple functions
    async def reserve_inventory_activity(details: OrderDetails) -> None:
        print(f"Reserving {details.Quantity} of item {details.ItemID} for order {details.OrderID}")
        # Database logic to reserve inventory here
        # To test failure, uncomment the line below:
        # raise RuntimeError("Insufficient stock for item " + details.ItemID)
    
    async def unreserve_inventory_activity(details: OrderDetails) -> None:
        print(f"COMPENSATION: Un-reserving {details.Quantity} of item {details.ItemID} for order {details.OrderID}")
        # Database logic to release inventory reservation here
    
    async def main():
        client = await Client.connect("localhost:7233")
        
        # Create a worker that connects to the inventory-task-queue
        worker = Worker(
            client,
            task_queue="inventory-task-queue",
            activities=[
                reserve_inventory_activity,
                unreserve_inventory_activity,
            ],
        )
        print("Python Inventory worker started.")
        await worker.run()
    
    if __name__ == "__main__":
        asyncio.run(main())
    

    To run this system, you would start the Go worker (listening on the default order task queue) and the Python worker (listening on inventory-task-queue). When the Go workflow executes the ReserveInventoryActivity, Temporal will route the task to the Python worker. If it fails, the workflow will catch the error and trigger the compensation, UnreserveInventoryActivity, which will also be routed to the Python worker.

    Advanced Patterns and Edge Case Handling

    A basic Saga implementation is a good start, but production systems require handling more complex scenarios.

    Idempotency Deep Dive

    Temporal guarantees at-least-once execution for Activities. This means an Activity could run more than once if, for example, the worker completes the task but crashes before it can acknowledge completion to the Temporal server. The server will time out and reschedule the task on another worker. If your activity is not idempotent (i.e., safe to run multiple times), you could double-charge a customer or reserve inventory twice.

    The solution is to enforce idempotency at the service level, driven by the workflow.

  • Generate a unique key in the workflow. A workflow's RunId is unique per execution. We can combine it with an activity-specific identifier to create a perfect idempotency key.
  • Pass the key to the Activity.
  • The service checks the key before executing logic. This is typically done within a database transaction.
  • Modified Workflow Snippet:

    go
    // Inside CreateOrderSaga workflow
    
    runID := workflow.GetInfo(ctx).WorkflowExecution.RunID
    
    // For payment activity
    paymentIdempotencyKey := fmt.Sprintf("%s-payment", runID)
    err := workflow.ExecuteActivity(ctx, ProcessPaymentActivity, orderDetails, paymentIdempotencyKey).Get(ctx, nil)
    if err != nil {
        // ... handle error
    }

    Modified Activity Implementation (Conceptual):

    go
    func ProcessPaymentActivity(ctx context.Context, details OrderDetails, idempotencyKey string) error {
        // db is your database connection pool
        tx, err := db.BeginTx(ctx, nil)
        if err != nil {
            return err
        }
        defer tx.Rollback() // Rollback if anything fails
    
        // 1. Check if this operation has been processed already
        var result string
        err = tx.QueryRowContext(ctx, "SELECT result FROM processed_idempotency_keys WHERE key = $1", idempotencyKey).Scan(&result)
        if err == nil { // Key found
            log.Printf("Idempotent request for key %s already processed with result: %s", idempotencyKey, result)
            return nil // Or return the stored result if needed
        } else if err != sql.ErrNoRows {
            return err // A real DB error occurred
        }
    
        // 2. If not found, execute the business logic
        paymentGatewayErr := callPaymentGateway(details)
        if paymentGatewayErr != nil {
            return paymentGatewayErr
        }
    
        // 3. Store the result with the idempotency key
        _, err = tx.ExecContext(ctx, "INSERT INTO processed_idempotency_keys (key, result) VALUES ($1, $2)", idempotencyKey, "SUCCESS")
        if err != nil {
            return err
        }
    
        return tx.Commit()
    }

    Handling Non-Idempotent or Flaky Compensations

    What happens if a compensating action fails? This is the nightmare scenario for Sagas. If UnreserveInventory fails repeatedly, you have an order that was never paid for but is holding onto stock. The system is in an inconsistent state.

  • Pattern 1: Retry Forever. As shown in our defer block, the retry policy for compensations should be aggressive, often with no maximum attempts. The activity should be designed to eventually succeed (e.g., if a dependent service is down, it will eventually come back up).
  • Pattern 2: The Escalation Pattern. If a compensation is truly stuck, the Saga should escalate the problem. This is implemented as another Activity that creates a ticket in a system like JIRA or PagerDuty, alerting a human operator to manually intervene.
  • Workflow Snippet for Escalation:

    go
    // Inside the deferred compensation loop
    
    err := workflow.ExecuteActivity(compensationCtx, compensations[i]).Get(compensationCtx, nil)
    if err != nil {
        wfLogger.Error("CRITICAL: Compensation activity failed after multiple retries.", "Error", err)
        // Escalate to human
        escalationDetails := EscalationDetails{
            WorkflowID: workflow.GetInfo(ctx).WorkflowExecution.ID,
            FailingActivity: "UnreserveInventoryActivity", // Get this dynamically
            Error: err.Error(),
        }
        // This activity should also be highly reliable
        _ = workflow.ExecuteActivity(ctx, CreateSupportTicketActivity, escalationDetails).Get(ctx, nil)
    }

    Workflow Versioning

    Business requirements change. What if we need to add a new step to our Saga, like sending a customer notification email after payment? If we deploy this new workflow code, what happens to the thousands of Sagas already in flight? They will fail because their execution history is not compatible with the new code's definition.

    Temporal solves this with its workflow.GetVersion API. This API allows you to create checkpoints in your code. When a workflow that was started on an older version of the code reaches this API call, it will execute the 'default' branch. Workflows started on the new code will see the new version and execute the logic within the versioned block.

    Example: Adding a Notification Step

    go
    // ... after ProcessPaymentActivity succeeds
    
    // VERSIONING: Add a step to notify the customer
    
    v1 := workflow.GetVersion(ctx, "AddNotificationStep", workflow.DefaultVersion, 1)
    if v1 == 1 {
        // This code will only be executed by new workflows started after this change was deployed.
        // Old workflows will skip this block.
        notificationDetails := NotificationDetails{ /* ... */ }
        err = workflow.ExecuteActivity(ctx, SendNotificationActivity, notificationDetails).Get(ctx, nil)
        if err != nil {
            // This is a non-critical step, so we might just log the error and continue
            // rather than failing the whole Saga.
            wfLogger.Warn("Failed to send notification", "Error", err)
        }
    }
    
    // ... continue to MarkOrderAsCompletedActivity

    This powerful feature enables you to evolve complex, long-running business processes without downtime or complex data migrations.

    Performance and Scalability Considerations

  • Activity Task Queue Tuning: In our example, we routed inventory tasks to a separate queue. This is a critical pattern for scalability. You can have a fleet of Python workers on auto-scaling infrastructure dedicated to the high-throughput inventory-task-queue, while a smaller, more stable set of Go workers handles the order-task-queue. This isolates workloads and prevents a slow activity from blocking others.
  • Payload Size and Data Converters: The entire history of a workflow, including activity inputs and outputs, is stored. Large payloads can bloat this history, degrading performance. For sensitive data (like credit card numbers), you never want them stored in plaintext in the workflow history. Temporal's DataConverter API is the solution. You can implement a custom converter to:
  • 1. Compress payloads before sending them to the Temporal server (e.g., using zlib/gzip).

    2. Encrypt sensitive fields, decrypting them only within the worker process that has the keys.

  • Observability: In a distributed system, observability is non-negotiable. Temporal's SDKs emit a rich set of metrics (compatible with Prometheus) that are essential for production monitoring:
  • - temporal_activity_execution_latency: High latency here points to a slow downstream service or a problem within your activity code.

    - temporal_workflow_task_schedule_to_start_latency: High latency indicates your workflow workers are starved for resources—they can't keep up with the rate of work. This is a signal to scale up your workflow worker fleet.

    - Monitoring compensation activity execution counts and failure rates is critical for detecting systemic issues.

    Conclusion: Beyond Simple Sagas

    The Saga pattern is fundamental to building resilient distributed systems. However, a manual implementation is a complex and error-prone endeavor. By adopting Temporal, we transform the problem from one of building distributed systems infrastructure to one of writing business logic. The workflow-as-code model, combined with durable defer for compensations, idempotency patterns, and built-in versioning, provides a complete toolkit for implementing Sagas that are not only correct but also observable, scalable, and maintainable over time.

    This approach elevates the Saga from a theoretical pattern to a practical, production-ready tool. The ability to seamlessly orchestrate logic across services written in different languages unlocks true polyglot architectures, allowing teams to use the best tool for each job without sacrificing transactional integrity at the business process level.

    Found this article helpful?

    Share it with others who might benefit from it.

    More Articles