Implementing the Saga Pattern with Temporal for Resilient Workflows

17 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Distributed Monolith's Ghost: State Management in Microservices

In the world of distributed systems, the siren song of microservices promises independent scalability, deployment, and team autonomy. Yet, it often lures us onto the rocks of a familiar problem: managing state and consistency across service boundaries. A simple user action, like placing an order, can trigger a cascade of operations across Payment, Inventory, and Shipping services. If any step fails, how do we roll back the preceding, already-committed transactions? The classic two-phase commit (2PC) protocol, a staple of monolithic architectures, is notoriously brittle and unscalable in a distributed, polyglot environment.

This is where the Saga pattern enters. A Saga is a sequence of local transactions where each transaction updates data within a single service and publishes an event or message to trigger the next transaction in the chain. If a local transaction fails, the Saga executes a series of compensating transactions to undo the preceding ones. While powerful, manually implementing a Saga, especially using a choreography-based approach, can lead to a distributed state machine nightmare. You end up with services tightly coupled through an implicit, hard-to-debug chain of events, and handling failures, retries, and timeouts becomes a monumental task.

This is the crux of our discussion: we will not be debating the merits of Sagas. We assume you've already hit the wall with manual implementations. Instead, we will dive deep into a production-ready, orchestration-based approach using Temporal, a durable execution engine that transforms this complex distributed systems problem into a far more manageable, testable, and observable workflow.

Temporal allows you to write your entire Saga logic in one place, as if it were a single, local function. The platform handles the durability, state management, retries, and queuing, ensuring your workflow code will execute to completion, even in the face of process crashes, network partitions, or server restarts. We will explore this by building a complex e-commerce order processing Saga, focusing on the advanced implementation details that separate a proof-of-concept from a production-ready system.


The Anatomy of Our E-commerce Saga

Let's define a realistic, multi-step business process that's ripe for failure. When a customer places an order, the following sequence of operations must occur across different microservices:

  • Order Service: Creates an Order entity with a PENDING status.
  • Payment Service: Charges the customer's credit card for the total amount.
  • Inventory Service: Reserves the specific items in the order to prevent them from being sold to someone else.
  • Shipping Service: Arranges for shipping and generates a tracking label.
  • Order Service: Updates the Order entity's status to CONFIRMED.
  • The failure domain is vast:

    * What if payment fails (insufficient funds)? The order must be marked FAILED.

    * What if payment succeeds, but inventory reservation fails (item just went out of stock)? The payment must be refunded, and the order marked FAILED.

    * What if inventory is reserved, but the shipping provider's API is down? The payment must be refunded, the inventory must be released, and the order marked FAILED.

    Each forward action has a corresponding backward, or compensating, action:

    Action (Activity)Compensating Action (Activity)
    chargeCardrefundPayment
    reserveInventoryreleaseInventory
    createShipmentcancelShipment

    Our goal is to implement this logic as a single, durable Temporal Workflow that orchestrates calls to these microservices.

    Project Setup and Core Components

    We'll use a TypeScript-based monorepo structure. The services themselves (e.g., Express APIs for Payment, Inventory) are outside the scope of this article; we will focus on the Temporal client, worker, and workflow definitions that interact with them.

    bash
    # Project Structure
    /my-ecommerce-app
    ├── package.json
    ├── tsconfig.json
    ├── /workflows
    │   ├── index.ts           # Workflow entry point
    │   ├── orderWorkflow.ts   # Our main Saga logic
    │   └── types.ts           # Shared types
    ├── /activities
    │   ├── paymentActivities.ts
    │   ├── inventoryActivities.ts
    │   └── shippingActivities.ts
    ├── /worker
    │   └── index.ts           # Temporal Worker process
    └── /client
        └── index.ts           # Temporal Client to start workflows

    Defining Activities: The Interface to the Real World

    Activities are simple functions that execute your business logic, like calling a third-party API or interacting with a database. They are where all non-deterministic code and side effects live. Crucially, Temporal guarantees an activity will be executed at least once. This has significant implications for idempotency, which we'll address later.

    Let's define the interfaces for our activities. We'll group them by domain.

    ./activities/paymentActivities.ts

    typescript
    import { OrderDetails } from '../workflows/types';
    
    // Pretend these are API clients for your microservices
    import { paymentServiceAPI } from '../services/paymentService';
    
    export async function chargeCard(details: OrderDetails): Promise<{ paymentId: string }> {
      console.log(`Charging card for order ${details.orderId} for amount ${details.amount}`);
      // In a real app, this would be an HTTP or gRPC call.
      const result = await paymentServiceAPI.charge(details.userId, details.amount, details.idempotencyKey);
      return { paymentId: result.transactionId };
    }
    
    export async function refundPayment(paymentId: string): Promise<void> {
      console.log(`Refunding payment ${paymentId}`);
      await paymentServiceAPI.refund(paymentId);
    }

    We would create similar files for inventoryActivities.ts and shippingActivities.ts. The key takeaway is that these are standard async functions. Temporal handles the boilerplate of calling them, retrying them on failure, and recording their results.

    Implementing the Saga Workflow: Orchestration and Compensation

    Now for the core of the implementation. The Workflow is a deterministic function that orchestrates the execution of Activities. It cannot contain any non-deterministic code (like Math.random(), Date.now(), or direct API calls) because Temporal re-executes the workflow code from its event history to rebuild its state upon a worker restart.

    We'll use the proxyActivities function from @temporalio/workflow to create a typed, callable interface to our activities. We must also configure robust retry policies and timeouts.

    ./workflows/orderWorkflow.ts

    typescript
    import * as wf from '@temporalio/workflow';
    import type * as activities from '../activities'; // Using type-only import
    import { OrderDetails } from './types';
    
    // Proxy activities for type-safe calls
    const { chargeCard, refundPayment } = wf.proxyActivities<typeof activities.payment>({ 
      startToCloseTimeout: '30s', 
      retry: {
        initialInterval: '1s',
        backoffCoefficient: 2,
        maximumInterval: '10s',
        nonRetryableErrorTypes: ['PaymentDeclinedError'], // Don't retry on a hard decline
      }
    });
    
    const { reserveInventory, releaseInventory } = wf.proxyActivities<typeof activities.inventory>({ 
      startToCloseTimeout: '30s',
      // ... similar retry policy
    });
    
    const { createShipment, cancelShipment } = wf.proxyActivities<typeof activities.shipping>({ 
      startToCloseTimeout: '60s',
      // ... similar retry policy
    });
    
    // Main workflow function
    export async function orderWorkflow(details: OrderDetails): Promise<string> {
      // Compensation stack: an array of functions to call on failure
      const compensations: (() => Promise<any>)[] = [];
    
      try {
        // 1. Charge the card
        const { paymentId } = await chargeCard(details);
        // If successful, push the compensation to the stack
        compensations.push(() => refundPayment(paymentId));
    
        // 2. Reserve inventory
        const { reservationId } = await reserveInventory(details.items);
        compensations.push(() => releaseInventory(reservationId));
    
        // 3. Create shipment
        const { shipmentId } = await createShipment(details.shippingAddress);
        compensations.push(() => cancelShipment(shipmentId));
    
        // All steps succeeded! The workflow can complete.
        // In a real app, we'd update the order status here via another activity.
        return `Order ${details.orderId} completed successfully.`;
    
      } catch (error) {
        // A step failed. Execute compensations in reverse order.
        wf.log.error('Workflow failed, starting compensation.', { error: (error as Error).message });
    
        for (const compensation of compensations.reverse()) {
          // Use continueAsNew on failure of compensation to handle it separately
          // Or simply log and alert for manual intervention.
          try {
            await compensation();
          } catch (compensationError) {
            wf.log.error('Compensation activity failed.', { error: (compensationError as Error).message });
            // TODO: Signal for manual intervention
          }
        }
    
        // Re-throw the original error to mark the workflow as failed.
        throw error;
      }
    }

    This is the magic of Temporal. This code looks like a simple, sequential program, but it's incredibly resilient. If the worker process crashes between chargeCard and reserveInventory, a new worker will pick up, re-hydrate the state from the workflow history, and see that chargeCard has already completed successfully. It will then proceed to call reserveInventory without re-charging the card.

    The compensation logic is straightforward: we maintain a stack of compensation functions. If any activity throws an error that isn't handled by the retry policy, the catch block is executed. We iterate through our stack in reverse (LIFO) order to undo what we've done. This is the core of an orchestrated Saga.


    Advanced Patterns and Production Edge Cases

    Simple success/failure paths are one thing, but production systems are messy. Let's tackle some advanced scenarios.

    1. Idempotency in Activities

    Temporal guarantees at-least-once execution for activities. A network blip could cause an activity to execute successfully, but the worker fails to report the completion back to the Temporal cluster. The cluster will time out and reschedule the activity on another worker. If your chargeCard activity isn't idempotent, you might double-charge the customer.

    The solution is to pass a unique identifier from the deterministic workflow to the non-deterministic activity. The workflow can generate this using wf.uuid4().

    Modify orderWorkflow.ts:

    typescript
    // Inside orderWorkflow function
    const idempotencyKey = wf.uuid4();
    const { paymentId } = await chargeCard({ ...details, idempotencyKey });

    Then, your Payment Service must be designed to handle this:

    services/paymentService.ts (conceptual)

    typescript
    class PaymentServiceAPI {
      async charge(userId: string, amount: number, idempotencyKey: string) {
        // 1. Check if a transaction with this idempotencyKey already exists
        const existingTx = await db.transactions.find({ idempotencyKey });
        if (existingTx) {
          return { status: 'SUCCESS', transactionId: existingTx.id };
        }
    
        // 2. If not, create a new transaction record with the key and a PENDING status
        const newTx = await db.transactions.create({
          idempotencyKey,
          userId,
          amount,
          status: 'PENDING',
        });
    
        // 3. Attempt the charge with the payment gateway
        try {
          const result = await paymentGateway.charge(amount, ...);
          // 4. Update our transaction to SUCCESS
          await db.transactions.update(newTx.id, { status: 'SUCCESS', gatewayId: result.id });
          return { status: 'SUCCESS', transactionId: newTx.id };
        } catch (e) {
          // 5. Update our transaction to FAILED
          await db.transactions.update(newTx.id, { status: 'FAILED' });
          throw e;
        }
      }
    }

    This transactional outbox-like pattern within the service ensures that even if the activity is retried, the charge operation itself happens only once.

    2. Asynchronous Activity Completion

    What if an activity depends on an external system that responds via a webhook? For example, a payment provider that takes several minutes to confirm a transaction and sends a webhook when it's done. The activity cannot simply await the result.

    This is a classic use case for Asynchronous Activity Completion. The activity tells the Temporal cluster, "I'm not done yet, I'll let you know when I am." It then returns its taskToken to the external system, which will use it to complete the activity later.

    ./activities/paymentActivities.ts

    typescript
    import { Context } from '@temporalio/activity';
    
    export async function chargeCardAsync(details: OrderDetails): Promise<{ paymentId: string }> {
      const taskToken = Context.current().info.taskToken;
    
      // This call doesn't wait for the payment to complete.
      // It just initiates it and passes our taskToken to the payment service.
      await paymentServiceAPI.initiateAsyncCharge(
        details.userId,
        details.amount,
        taskToken // The key to completing the activity later
      );
    
      // Tell the worker not to complete this activity yet.
      Context.current().doNotCompleteOnReturn();
    
      // This return value is ignored, but required by the function signature.
      return { paymentId: '' }; 
    }

    Your Payment Service would then have a webhook endpoint:

    payment-service/webhook.ts (conceptual Express route)

    typescript
    app.post('/webhooks/payment-complete', async (req, res) => {
      const { status, transactionId, taskToken } = req.body;
    
      // Create a connection to the Temporal frontend service
      const connection = await Connection.connect();
      const client = new WorkflowClient({ connection });
    
      if (status === 'SUCCESS') {
        // Complete the activity successfully
        await client.activity.complete(taskToken, { paymentId: transactionId });
      } else {
        // Fail the activity
        const error = new ApplicationFailure('Payment failed by webhook', 'PaymentDeclined');
        await client.activity.fail(taskToken, error);
      }
    
      res.status(200).send('OK');
    });

    The workflow code remains blissfully unaware of this asynchronicity. It simply awaits the chargeCardAsync activity, and Temporal pauses the workflow execution until it receives the completion signal via the client.

    3. Handling Non-Idempotent or Failing Compensations

    This is the nightmare scenario: what if your compensation fails? A refundPayment call might fail due to the payment provider's API being down. If the compensation is not idempotent, retrying it could lead to multiple refunds.

    This is where human-in-the-loop patterns become essential. If a compensation fails after several retries, the workflow should not fail entirely. Instead, it should move into a state that signals for manual intervention.

    We can use Temporal Signals for this. A signal is a way to send data to a running workflow.

    typescript
    // Inside the catch block of orderWorkflow.ts
    for (const compensation of compensations.reverse()) {
      try {
        await wf.executeActivity(compensation, { 
            // Short retry for compensations
            startToCloseTimeout: '60s', 
            retry: { maximumAttempts: 3 }
        });
      } catch (compensationError) {
        wf.log.error('CRITICAL: Compensation failed. Awaiting manual intervention.', { 
            activity: compensation.name,
            error: (compensationError as Error).message 
        });
        // 1. Set a state indicating we need help
        let needsManualIntervention = true;
        
        // 2. Define a signal handler
        wf.setHandler(manualCompensationSignal, ({ activityName, result }) => {
            wf.log.info(`Received manual compensation signal for ${activityName} with result ${result}`);
            needsManualIntervention = false;
        });
    
        // 3. Wait until the signal is received
        await wf.condition(() => !needsManualIntervention);
      }
    }

    This workflow will now pause indefinitely if a compensation fails, waiting for an operator to investigate the issue (e.g., in Datadog logs), manually perform the refund via the payment provider's dashboard, and then use the Temporal CLI or UI to send a signal to the workflow to tell it to proceed.

    bash
    # Operator sends a signal after manually fixing the refund
    temporal workflow signal --workflow-id 'order-xyz' \
      --name 'manualCompensation' \
      --input '{ "activityName": "refundPayment", "result": "completed_manually" }'

    Testing Your Saga Logic

    One of the most powerful features of Temporal is its testability. The @temporalio/testing package provides an in-memory test environment that lets you mock activities and test your workflow logic, including complex compensation paths, without any external dependencies.

    Here's a test case using Vitest that verifies our compensation logic.

    ./workflows/orderWorkflow.test.ts

    typescript
    import { TestWorkflowEnvironment } from '@temporalio/testing';
    import { Worker, DefaultLogger } from '@temporalio/worker';
    import { orderWorkflow } from './orderWorkflow';
    import { vi, test, expect, beforeEach, afterEach } from 'vitest';
    
    let testEnv: TestWorkflowEnvironment;
    
    beforeEach(async () => {
      testEnv = await TestWorkflowEnvironment.createTimeSkipping();
    });
    
    afterEach(async () => {
      await testEnv.teardown();
    });
    
    test('should execute all compensations if inventory reservation fails', async () => {
      const { client, nativeConnection } = testEnv;
    
      // Mock activities
      const activities = {
        payment: {
          chargeCard: vi.fn().mockResolvedValue({ paymentId: 'payment-123' }),
          refundPayment: vi.fn().mockResolvedValue(undefined),
        },
        inventory: {
          reserveInventory: vi.fn().mockRejectedValue(new Error('Out of stock')),
          releaseInventory: vi.fn().mockResolvedValue(undefined), // This should not be called
        },
        shipping: {
          createShipment: vi.fn(), // This should not be called
          cancelShipment: vi.fn(), // This should not be called
        }
      };
    
      const worker = await Worker.create({
        connection: nativeConnection,
        taskQueue: 'test',
        workflowsPath: require.resolve('./orderWorkflow'),
        activities,
      });
    
      const runPromise = worker.runUntil(async () => {
        await expect(
          client.workflow.execute(orderWorkflow, {
            workflowId: 'test-wf',
            taskQueue: 'test',
            args: [/* order details here */],
          })
        ).rejects.toThrow('Out of stock');
    
        // Verify the call sequence
        expect(activities.payment.chargeCard).toHaveBeenCalledOnce();
        expect(activities.inventory.reserveInventory).toHaveBeenCalledOnce();
        expect(activities.shipping.createShipment).not.toHaveBeenCalled();
        
        // CRITICAL: Verify compensation was called
        expect(activities.payment.refundPayment).toHaveBeenCalledOnce();
        expect(activities.payment.refundPayment).toHaveBeenCalledWith('payment-123');
    
        // Verify other compensations were NOT called
        expect(activities.inventory.releaseInventory).not.toHaveBeenCalled();
      });
    
      await runPromise;
    });

    This test deterministically simulates a failure in the reserveInventory activity. It then asserts that chargeCard was called, but createShipment was not. Most importantly, it verifies that the refundPayment compensation was triggered correctly, proving the Saga's rollback logic works as expected. This ability to test complex, long-running, and failure-prone distributed logic as a simple unit test is a game-changer.


    Conclusion: From Fragile Choreography to Durable Orchestration

    We've moved from the high-level concept of a Saga to a concrete, robust implementation capable of handling the messy reality of production environments. By leveraging a durable execution engine like Temporal, we elevated the problem from low-level state management, retries, and event plumbing to high-level business logic orchestration.

    The key takeaways for senior engineers are:

  • Orchestration over Choreography: For complex, multi-step processes, orchestration provides a single source of truth for the workflow's state, making it easier to reason about, debug, and modify.
  • Embrace Idempotency: The at-least-once execution guarantee of distributed systems necessitates designing your service endpoints and activity logic to be idempotent.
  • Plan for Asynchronicity: Real-world systems involve webhooks and callbacks. Asynchronous Activity Completion is a critical pattern for integrating with them without blocking worker threads.
  • Design for Failed Compensations: The most resilient systems plan for the 'unhappy path of the unhappy path'. A strategy for handling failing compensations, often involving a human-in-the-loop, is non-negotiable.
  • Test Your Failure Paths: The true value of a Saga is how it behaves on failure. The ability to unit test these failure and compensation paths with a framework like temporalio/testing is what instills confidence in your system's resilience.
  • The Saga pattern is not a silver bullet, but when implemented with the right tools, it provides a powerful model for maintaining business consistency across microservice boundaries. By abstracting the infrastructure concerns of durability and state management, Temporal allows us to focus on what matters: writing clean, robust, and testable business logic that can weather the storm of distributed computing.

    Found this article helpful?

    Share it with others who might benefit from it.

    More Articles