Implementing Resilient Sagas with Temporal for Distributed Transactions

14 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Inevitable Failure of 2PC in Distributed Architectures

As senior engineers designing scalable microservice ecosystems, we've all confronted the distributed transaction problem. The classical database solution, two-phase commit (2PC), guarantees atomicity but introduces toxic coupling and catastrophic failure modes in a distributed environment. A lock held by a slow or failed service can cascade, bringing down the entire transaction. The reliance on a central transaction coordinator creates a single point of failure and a performance bottleneck. In short, 2PC is an anti-pattern for modern, loosely-coupled services.

This leaves us with the challenge of maintaining data consistency across service boundaries without ACID guarantees. The dominant pattern to solve this is the Saga.

Saga Orchestration: Centralized Logic for Complex Workflows

This article assumes you are familiar with the Saga pattern. We won't rehash the basics of choreography vs. orchestration. Instead, we'll assert that for any non-trivial business process, orchestration is the superior architectural choice.

Choreography (services emitting and reacting to events) appears decoupled but often hides complex, implicit dependencies. Debugging becomes a nightmare of tracing events across multiple services, and understanding the overall state of a business transaction is nearly impossible.

Orchestration centralizes the workflow logic in a dedicated component, which calls services in sequence. This provides explicit control, clear visibility, and a single source of truth for the transaction's state. The challenge, however, shifts from choreography's implicit complexity to the explicit, and formidable, task of building a durable, fault-tolerant orchestrator.

The Minefield of Manual Saga Orchestration

Building a production-grade Saga orchestrator from scratch is a significant engineering effort fraught with peril. Consider the requirements for an order fulfillment Saga:

  • Charge Customer
  • Update Inventory
  • Initiate Shipping
  • If step 2 fails, we must compensate by refunding the customer. If step 3 fails, we must refund the customer and restock the inventory.

    Manually implementing this involves:

    * Durable State Management: The orchestrator must persist the state of the Saga at every step. If the orchestrator process crashes after charging the customer but before updating inventory, it must resume from exactly where it left off upon restart. This typically involves a database table with a state machine column (PENDING_PAYMENT, PAYMENT_COMPLETE, INVENTORY_DEBITED, etc.).

    * Complex Error Handling: You need to distinguish between transient network errors (which should be retried) and permanent business logic failures (which should trigger compensation). This requires implementing sophisticated retry logic with exponential backoff and jitter.

    Guaranteed Compensation: The compensation logic must* execute, even if the orchestrator crashes while attempting to run it. This adds another layer of state management for the compensation phase.

    * Idempotency: All participating service operations (and their compensations) must be idempotent to handle retries safely. The orchestrator must generate and pass a unique idempotency key for each operation.

    * Visibility and Debugging: When a Saga fails, you need to know exactly which step failed, why it failed, and what the state of the system is. This requires extensive, structured logging and tooling that is difficult to build.

    This infrastructure plumbing is undifferentiated heavy lifting. It distracts from the core business logic we are trying to implement. This is the problem that Temporal.io solves.

    Temporal.io: A Durable Execution Engine for Sagas

    Temporal is not a message queue or a simple state machine. It's a durable execution engine. It allows you to write your orchestration logic (which it calls a Workflow) in a standard programming language like TypeScript, Go, or Java. The Temporal cluster ensures that your workflow code executes durably from start to finish, even if the process running your code (the Worker) crashes or the Temporal cluster itself experiences outages.

    * Workflows: The orchestration logic. Workflow code is deterministic and sandboxed. It executes from start to finish, potentially over days or years, preserving its local variables and execution stack across any infrastructure failure.

    * Activities: The implementation of individual steps in your Saga. An Activity is a function that executes your business logic, such as calling a service API. Temporal manages the execution, retries, and timeouts of Activities.

    By using Temporal, all the manual orchestration problems disappear:

    * State Management: Handled automatically. The full state of the workflow, including local variables and call stack, is durably persisted by the Temporal cluster.

    * Retries: Configured with a simple policy object. Temporal handles the exponential backoff and retries for you.

    * Compensation: Implemented using standard language constructs like try...catch...finally or Go's defer. Temporal guarantees the execution of your compensation code.

    Let's build a production-grade Saga to demonstrate this.

    Core Implementation: A Production-Grade Order Fulfillment Saga

    Our scenario is an e-commerce order that involves three services: PaymentService, InventoryService, and ShippingService. The business logic is as follows:

    • Reserve inventory for the items in the order.
    • Charge the customer's credit card.
    • If payment is successful, confirm the inventory reservation and ship the order.
    • If payment fails, cancel the inventory reservation.
    • If shipping fails, refund the payment and cancel the inventory reservation.

    We will use TypeScript for our examples.

    Defining the Activities

    First, we define the interfaces for our activities. These are the functions that interact with the outside world.

    typescript
    // src/activities.ts
    
    // Mock external services
    async function paymentServiceCharge(idempotencyKey: string, amount: number): Promise<string> { /* ... */ }
    async function paymentServiceRefund(transactionId: string, amount: number): Promise<void> { /* ... */ }
    async function inventoryServiceReserve(idempotencyKey: string, items: any[]): Promise<string> { /* ... */ }
    async function inventoryServiceCancel(reservationId: string): Promise<void> { /* ... */ }
    async function shippingServiceShip(idempotencyKey: string, order: any): Promise<string> { /* ... */ }
    
    // These are the functions our workflow will call.
    // They are simple wrappers around our service clients.
    
    export async function reserveInventory(items: any[]): Promise<string> {
      console.log('Reserving inventory for items:', items);
      // In a real app, you would get the idempotency key from the activity context
      const reservationId = await inventoryServiceReserve(uuidv4(), items);
      return reservationId;
    }
    
    export async function chargeCustomer(amount: number): Promise<string> {
      console.log(`Charging customer ${amount}`);
      const transactionId = await paymentServiceCharge(uuidv4(), amount);
      return transactionId;
    }
    
    export async function shipOrder(order: any): Promise<string> {
      console.log('Shipping order:', order);
      const shippingId = await shippingServiceShip(uuidv4(), order);
      return shippingId;
    }
    
    // Compensation Activities
    export async function cancelInventoryReservation(reservationId: string): Promise<void> {
      console.log(`Cancelling inventory reservation: ${reservationId}`);
      await inventoryServiceCancel(reservationId);
    }
    
    export async function refundCustomer(transactionId: string, amount: number): Promise<void> {
      console.log(`Refunding customer for transaction: ${transactionId}`);
      await paymentServiceRefund(transactionId, amount);
    }

    Code Example 1: Full Workflow with Compensation

    Now, let's write the workflow logic. Notice how we use a compensations array and a try...finally block. The finally block is guaranteed to execute by Temporal, even if the worker process crashes in the middle of the try block. This is the core of durable compensation.

    typescript
    // src/workflows.ts
    import { proxyActivities, sleep } from '@temporalio/workflow';
    import type * as activities from './activities';
    
    // Configure activities with retry policies
    const { 
      reserveInventory,
      chargeCustomer,
      shipOrder,
      cancelInventoryReservation,
      refundCustomer
    } = proxyActivities<typeof activities>({
      startToCloseTimeout: '30s',
      retry: {
        initialInterval: '1s',
        backoffCoefficient: 2,
        maximumInterval: '10s',
        maximumAttempts: 3,
      },
    });
    
    interface Compensation { 
      fn: (...args: any[]) => Promise<any>;
      args: any[];
    }
    
    export async function orderFulfillmentSaga(order: any): Promise<string> {
      const compensations: Compensation[] = [];
    
      try {
        // 1. Reserve Inventory
        const reservationId = await reserveInventory(order.items);
        compensations.unshift({ fn: cancelInventoryReservation, args: [reservationId] });
    
        // 2. Charge Customer
        const transactionId = await chargeCustomer(order.totalAmount);
        compensations.unshift({ fn: refundCustomer, args: [transactionId, order.totalAmount] });
    
        // 3. Ship Order
        const shippingId = await shipOrder(order);
    
        // If we reach here, the Saga was successful. We can clear compensations.
        compensations.length = 0;
        return `Order ${order.id} completed successfully. Shipping ID: ${shippingId}`;
    
      } catch (err) {
        // If any activity fails after all retries, this block will execute.
        console.error('Saga failed, initiating compensation.', err);
    
        for (const comp of compensations) {
          // We execute compensations without awaiting them all to run in parallel.
          // In a real scenario, you might want to handle compensation failures too.
          comp.fn(...comp.args).catch(compErr => {
              // TODO: Add robust handling for compensation failures, e.g., push to a dead-letter queue.
              console.error('Compensation failed!', compErr);
          });
        }
        throw new Error(`Order ${order.id} failed and was compensated.`);
    
      } finally {
        // The finally block is not needed here if all compensation logic is in the catch block.
        // However, it's a useful pattern if you have cleanup that must run regardless of success or failure.
        console.log('Saga finished execution (success or failure).');
      }
    }

    This code is remarkably simple compared to a manual implementation. We are writing straightforward, sequential logic, and the compensations array acts as our compensation stack. We unshift compensation functions onto it as we complete forward steps. If an error occurs, the catch block iterates through the stack and executes them. The Temporal platform guarantees this logic will execute to completion.

    Advanced Patterns and Edge Cases

    Real-world systems are more complex. Let's explore how to handle advanced scenarios.

    Handling Non-Retryable Errors

    What if chargeCustomer fails not because of a network error, but because the credit card was declined? Retrying is pointless. We need to fail fast and trigger compensation immediately.

    Temporal allows Activities to throw an ApplicationFailure. This signals to the workflow that the failure is non-retriable.

    Code Example 2: Non-Retryable Failures

    First, modify the chargeCustomer activity.

    typescript
    // src/activities.ts (modified)
    import { ApplicationFailure } from '@temporalio/common';
    
    // ... inside chargeCustomer activity
    export async function chargeCustomer(amount: number): Promise<string> {
      console.log(`Charging customer ${amount}`);
      try {
        const transactionId = await paymentServiceCharge(uuidv4(), amount);
        return transactionId;
      } catch (error: any) {
        if (error.type === 'PaymentDeclinedError') {
          // This is a business logic failure, not a transient technical one.
          // Mark it as non-retryable.
          throw ApplicationFailure.create({
            message: 'Credit card was declined',
            type: 'PaymentError',
            nonRetryable: true,
          });
        } 
        // For other errors, let the default retry policy apply.
        throw error;
      }
    }

    The workflow code doesn't need to change. When chargeCustomer throws this ApplicationFailure, Temporal will immediately stop retrying and propagate the error to the workflow's catch block, triggering the compensation logic just as we designed.

    Long-Running Activities and Heartbeating

    Imagine the shipOrder activity involves a complex process at a warehouse that could take 30 minutes. The worker running this activity could crash or be restarted for a deployment during this time. By default, Temporal would time out the activity and retry it from the beginning.

    This is inefficient. Activity Heartbeating allows a long-running activity to periodically report its progress back to the Temporal cluster. If the worker crashes, Temporal will re-schedule the activity on another available worker, providing it with the details from the last successful heartbeat. The new worker can then resume the work from where the previous one left off.

    Code Example 3: Heartbeating for Warehouse Operations

    typescript
    // src/activities.ts (modified)
    import { Context } from '@temporalio/activity';
    
    // A long-running task that can be resumed
    async function performWarehousePicking(order: any, resumeFromStep?: string) { /* ... */ }
    
    export async function shipOrder(order: any): Promise<string> {
      console.log('Shipping order:', order);
      let lastCompletedStep = '';
    
      // Check if this activity is being resumed after a crash
      if (Context.current().info.isAttempting) {
        const lastHeartbeatDetails = Context.current().info.heartbeatDetails<string>();
        if (lastHeartbeatDetails) {
          lastCompletedStep = lastHeartbeatDetails;
          console.log(`Resuming warehouse process from step: ${lastCompletedStep}`);
        }
      }
    
      // Step 1: Pick items
      if (lastCompletedStep < 'picked') {
        await performWarehousePicking(order, lastCompletedStep);
        lastCompletedStep = 'picked';
        Context.current().heartbeat(lastCompletedStep);
      }
    
      // Step 2: Pack items
      if (lastCompletedStep < 'packed') {
        await performPacking(order);
        lastCompletedStep = 'packed';
        Context.current().heartbeat(lastCompletedStep);
      }
    
      // Step 3: Generate shipping label
      const shippingId = await generateLabelAndShip(order);
    
      return shippingId;
    }

    In the workflow, we must also configure a heartbeatTimeout for this activity, which should be longer than the interval between heartbeats.

    typescript
    // src/workflows.ts (modified proxyActivities call for shipOrder)
    const { shipOrder } = proxyActivities<typeof activities>({
      startToCloseTimeout: '2h', // This long-running activity can take up to 2 hours
      heartbeatTimeout: '1m',    // Must heartbeat at least once per minute
      retry: { maximumAttempts: 5 },
    });

    Asynchronous Activity Completion for Human-in-the-Loop

    What if a step requires manual approval, like a fraud check for a large order? The workflow can't block for days waiting for a human.

    The solution is the Asynchronous Activity Completion pattern. The activity starts, tells the external system (e.g., a fraud review dashboard) to begin its work, and then tells the Temporal cluster that it will be completed later. The activity worker is now free to process other tasks. When the human completes the review, the external system uses the Temporal client to send a signal back to the cluster to complete the activity, either with a result or an error.

    Code Example 4: Async Completion for Fraud Review

    First, the activity tells Temporal it will complete asynchronously.

    typescript
    // src/activities.ts (new activity)
    import { Context } from '@temporalio/activity';
    
    export async function requestFraudCheck(order: any): Promise<string> {
      // Get the unique task token for this activity execution
      const taskToken = Context.current().info.taskToken;
    
      // Make this activity complete asynchronously
      Context.current().doNotCompleteOnReturn();
    
      // Call the external fraud service, passing it the task token
      // This service will later use the token to complete the activity
      await fraudReviewService.initiateReview(order, taskToken.toString('base64'));
    
      // The function returns, but the activity is still 'running' in Temporal
      return 'Fraud check initiated. Awaiting external completion.';
    }

    An external service (e.g., an Express.js API endpoint that the fraud dashboard calls) would then use the Temporal client to complete the activity.

    typescript
    // src/external-completer.ts (e.g., in an API server)
    import { Client } from '@temporalio/client';
    
    app.post('/fraud-review/approve', async (req, res) => {
      const { taskToken } = req.body;
    
      const client = new Client();
      await client.activity.complete(
        Buffer.from(taskToken, 'base64'),
        'Approved'
      );
    
      res.status(200).send('Activity completed successfully.');
    });
    
    app.post('/fraud-review/reject', async (req, res) => {
      const { taskToken, reason } = req.body;
    
      const client = new Client();
      await client.activity.fail(
        Buffer.from(taskToken, 'base64'),
        ApplicationFailure.create({ message: reason, type: 'FraudRejection', nonRetryable: true })
      );
    
      res.status(200).send('Activity failed successfully.');
    });

    The workflow code that calls this activity simply awaits it like any other. It remains paused (durably) until the activity is completed externally.

    Performance and Scalability Considerations

    * Task Queue Tuning: Don't run all your Sagas on the same generic task queue. Create dedicated task queues for different types of workflows. For example, high-priority-orders can be served by a large pool of workers, while low-priority-reports can be served by a smaller, auto-scaling pool. This isolates workloads and ensures high-priority transactions are not starved.

    * Workflow History Size: Every event in a workflow's lifecycle (activity start, activity completion, timer fired, etc.) is recorded in its history. A Saga with thousands of steps or one that runs for years can generate a very large history, which can impact performance. For infinite-running Sagas, use Temporal's continueAsNew feature to periodically start a new workflow run with a fresh history. For very complex Sagas, consider breaking them down into parent and child workflows to partition the history.

    * Activity Design: Activities should be idempotent and, ideally, fast. They are the bridge to the outside, non-deterministic world. Keep complex, deterministic logic inside the workflow code itself, where it can be executed durably and without external network calls. Avoid large payloads for activity inputs and outputs, as they must be serialized and stored in the workflow history.

    Conclusion: From Fragile State Machines to Resilient Code

    The Saga pattern is essential for data consistency in microservices, but its manual implementation is a notorious source of complexity and bugs. By leveraging a durable execution engine like Temporal, we can externalize the most difficult parts of orchestration—state management, retries, timeouts, and crash recovery.

    This allows us to write Saga logic that is clean, readable, and directly expresses the business process. We use standard language constructs like try...catch to manage complex compensation chains that are guaranteed to execute. Advanced patterns like heartbeating and asynchronous completion, which are extremely difficult to implement manually, become straightforward. The result is a system that is not just notionally fault-tolerant, but demonstrably resilient, allowing senior engineers to focus on delivering business value instead of building and maintaining fragile infrastructure plumbing.

    Found this article helpful?

    Share it with others who might benefit from it.

    More Articles