Implementing the Saga Pattern with Temporal for Fault-Tolerant Workflows

17 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Inherent Challenge of Distributed Transactions

In a distributed microservices architecture, maintaining data consistency across service boundaries for a single business process is a formidable challenge. A classic example is a travel booking system: a user's request to book a trip involves reserving a flight, a hotel, and a car rental. These are three separate transactions against three distinct services. If the flight and hotel bookings succeed but the car rental fails, how do we ensure the system returns to a consistent state? We cannot simply leave the user with a partial, unusable booking.

Traditional distributed transaction protocols like two-phase commit (2PC) are ill-suited for modern, loosely-coupled architectures. They rely on synchronous communication and long-lived locks, which introduce significant temporal coupling and reduce the availability and scalability of the entire system. A failure in one service can cascade and block resources across many others.

This is where the Saga pattern provides a more resilient alternative. A saga is a sequence of local transactions. Each transaction updates data within a single service and publishes an event or triggers the next transaction in the sequence. If a local transaction fails, the saga executes a series of compensating transactions to undo the preceding successful transactions. This ensures atomicity at the business process level without requiring distributed locks.

This article focuses on the Orchestration-based Saga, where a central coordinator is responsible for sequencing the steps and triggering compensations. We will demonstrate how Temporal, a durable execution system, provides an exceptionally robust and developer-friendly platform for implementing this pattern, abstracting away the complexities of state management, retries, and failure handling.


Why Temporal Excels as a Saga Orchestrator

While you can implement a saga orchestrator from scratch or use a state machine library, Temporal offers several foundational advantages that make it a superior choice for production systems.

  • Durable Execution & State Persistence: A Temporal Workflow's execution state, including local variables and call stack, is automatically and continuously persisted by the Temporal Cluster. Your workflow code is written as a standard, sequential function, but it's resilient to process crashes, server restarts, and deployments. If the worker running your workflow code goes down, another worker will pick up exactly where it left off. This eliminates the need for a dedicated database to track the saga's state.
  • Built-in Retries and Timeouts: Network failures and transient service unavailability are facts of life in distributed systems. Temporal Activities (the building blocks of workflows that execute business logic) come with configurable, built-in retry policies. You don't need to litter your business logic with complex retry loops; you declare the policy, and Temporal handles the rest.
  • Simplified Compensation Logic: The durable nature of workflows allows for a remarkably intuitive implementation of compensation logic. A standard try...catch...finally block in your workflow code becomes the backbone of your saga. The main business logic resides in the try block, and the compensation logic is placed in a catch or finally block, executing only when a failure occurs.
  • Visibility and Tooling: Unlike event-driven (choreographed) sagas, where tracking the state of a business process can be incredibly difficult, Temporal provides full visibility into every running and completed workflow. You can query a workflow's state, inspect its history, and manually signal it for interventions.
  • Core Implementation: A Travel Booking Saga

    Let's build a practical example: a BookTripWorkflow that orchestrates booking a flight, a hotel, and a car rental. If any step fails, it must cancel any previously successful bookings.

    Environment Setup

    First, you'll need a running Temporal cluster. The simplest way to get started is with docker-compose.

    yaml
    # docker-compose.yml
    version: '3.8'
    services:
      temporal:
        image: temporalio/auto-setup:1.21.3
        ports:
          - "7233:7233" # gRPC
          - "8233:8233" # Web UI
        environment:
          - DB=sqlite3
          - DB_PORT=5432
          - DYNAMIC_CONFIG_FILE_PATH=config/dynamic_config.yaml

    Run docker-compose up to start the cluster. You can access the Temporal Web UI at http://localhost:8233.

    Project Structure (TypeScript)

    We'll use the TypeScript SDK. Initialize a new Node.js project and install the necessary dependencies:

    bash
    npm init -y
    npm install @temporalio/client @temporalio/worker @temporalio/workflow uuid
    npm install -D typescript ts-node @types/node @types/uuid

    Our project structure will be:

    text
    .
    ├── src/
    │   ├── activities.ts     # Business logic implementations (API calls, etc.)
    │   ├── client.ts         # Starts the workflow
    │   ├── shared.ts         # Shared interfaces and task queue names
    │   ├── worker.ts         # Hosts and runs workflows and activities
    │   └── workflows.ts      # The saga orchestrator logic
    ├── package.json
    ├── tsconfig.json
    └── docker-compose.yml

    1. Defining Activities

    Activities are functions that execute your business logic. They are where you interact with the outside world (databases, APIs, etc.). For our saga, we need booking and cancellation activities for each service.

    typescript
    // src/shared.ts
    export const TASK_QUEUE_NAME = 'trip-booking-saga';
    
    export interface TripBookingInfo {
      flightBookingId?: string;
      hotelBookingId?: string;
      carRentalBookingId?: string;
    }
    
    export interface BookingInput {
      userId: string;
      tripId: string;
    }
    typescript
    // src/activities.ts
    import { TripBookingInfo, BookingInput } from './shared';
    
    // Dummy implementations simulating API calls
    // In a real application, these would make HTTP requests to other services.
    
    async function bookFlight(input: BookingInput): Promise<string> {
      console.log(`Booking flight for user ${input.userId} on trip ${input.tripId}...`);
      // Simulate a potential failure
      if (Math.random() < 0.2) { // 20% chance of failure
        throw new Error('Flight booking service is unavailable');
      }
      const flightBookingId = `FLIGHT-${Date.now()}`;
      console.log(`Flight booked successfully: ${flightBookingId}`);
      return flightBookingId;
    }
    
    async function bookHotel(input: BookingInput): Promise<string> {
      console.log(`Booking hotel for user ${input.userId} on trip ${input.tripId}...`);
      const hotelBookingId = `HOTEL-${Date.now()}`;
      console.log(`Hotel booked successfully: ${hotelBookingId}`);
      return hotelBookingId;
    }
    
    async function bookCar(input: BookingInput): Promise<string> {
      console.log(`Booking car for user ${input.userId} on trip ${input.tripId}...`);
      // Simulate a higher chance of failure for the last step
      if (Math.random() < 0.5) { // 50% chance of failure
        throw new Error('No cars available for the selected dates');
      }
      const carBookingId = `CAR-${Date.now()}`;
      console.log(`Car booked successfully: ${carBookingId}`);
      return carBookingId;
    }
    
    async function cancelFlight(bookingId: string): Promise<void> {
      console.log(`Cancelling flight booking: ${bookingId}...`);
      // In a real app, this might also fail and need its own retry policy.
      console.log('Flight cancelled.');
    }
    
    async function cancelHotel(bookingId: string): Promise<void> {
      console.log(`Cancelling hotel booking: ${bookingId}...`);
      console.log('Hotel cancelled.');
    }
    
    async function cancelCar(bookingId: string): Promise<void> {
      console.log(`Cancelling car rental booking: ${bookingId}...`);
      console.log('Car cancelled.');
    }
    
    export const activities = { bookFlight, bookHotel, bookCar, cancelFlight, cancelHotel, cancelCar };

    2. The Saga Orchestrator Workflow

    This is the core of our implementation. We'll use proxyActivities to get type-safe handles to our activity functions and orchestrate them within a try...finally block.

    typescript
    // src/workflows.ts
    import { proxyActivities, ApplicationFailure, defineSignal } from '@temporalio/workflow';
    import type * as activities from './activities';
    import { TripBookingInfo, BookingInput } from './shared';
    
    const { bookFlight, bookHotel, bookCar, cancelFlight, cancelHotel, cancelCar } = proxyActivities<typeof activities>(
      {
        startToCloseTimeout: '30 seconds',
        // Configure retries for transient failures
        retry: {
          initialInterval: '1 second',
          backoffCoefficient: 2,
          maximumInterval: '1 minute',
          nonRetryableErrorTypes: ['BookingValidationError'], // Example of a business logic error
        },
      }
    );
    
    export async function BookTripWorkflow(input: BookingInput): Promise<TripBookingInfo> {
      const successfulBookings: { type: string; id: string }[] = [];
      let workflowError: Error | undefined;
    
      try {
        // Step 1: Book Flight
        const flightBookingId = await bookFlight(input);
        successfulBookings.push({ type: 'flight', id: flightBookingId });
    
        // Step 2: Book Hotel
        const hotelBookingId = await bookHotel(input);
        successfulBookings.push({ type: 'hotel', id: hotelBookingId });
    
        // Step 3: Book Car
        const carRentalBookingId = await bookCar(input);
        successfulBookings.push({ type: 'car', id: carRentalBookingId });
    
        // All steps succeeded
        return {
          flightBookingId,
          hotelBookingId,
          carRentalBookingId,
        };
      } catch (err) {
        workflowError = err as Error;
        // Re-throw the error to fail the workflow after compensation
        throw ApplicationFailure.create({
          message: `Trip booking failed: ${(err as Error).message}`,
          nonRetryable: true,
        });
      } finally {
        // This block executes whether the try block succeeded or failed
        if (workflowError) {
          console.log('Starting compensation logic due to workflow failure.');
          // Iterate in reverse order of execution to unwind the saga
          for (let i = successfulBookings.length - 1; i >= 0; i--) {
            const booking = successfulBookings[i];
            try {
              switch (booking.type) {
                case 'flight':
                  await cancelFlight(booking.id);
                  break;
                case 'hotel':
                  await cancelHotel(booking.id);
                  break;
                case 'car':
                  await cancelCar(booking.id);
                  break;
              }
            } catch (compensationErr) {
              // This is a critical failure. The compensation itself failed.
              // We'll discuss how to handle this in the advanced section.
              console.error(`Failed to compensate for ${booking.type} booking ${booking.id}`, compensationErr);
            }
          }
        }
      }
    }

    Key Implementation Details:

    * State Management: The successfulBookings array acts as our local state within the workflow. Because the workflow is durable, this array's state is preserved across any worker failures.

    * Reverse Compensation: The finally block checks if an error occurred. If so, it iterates through the successfulBookings array in reverse order to execute the compensating transactions.

    * Error Propagation: We re-throw the original error wrapped in an ApplicationFailure at the end of the catch block. This ensures the workflow execution is ultimately marked as 'Failed', which is the correct final state for an unsuccessful saga.

    3. Worker and Client

    Now, we need a worker to host our workflow and activity code, and a client to start the workflow.

    typescript
    // src/worker.ts
    import { Worker } from '@temporalio/worker';
    import { TASK_QUEUE_NAME } from './shared';
    import * as activities from './activities';
    
    async function run() {
      const worker = await Worker.create({
        workflowsPath: require.resolve('./workflows'),
        activities,
        taskQueue: TASK_QUEUE_NAME,
      });
      await worker.run();
    }
    
    run().catch((err) => {
      console.error(err);
      process.exit(1);
    });
    typescript
    // src/client.ts
    import { Connection, Client } from '@temporalio/client';
    import { BookTripWorkflow } from './workflows';
    import { TASK_QUEUE_NAME, BookingInput } from './shared';
    import { v4 as uuidv4 } from 'uuid';
    
    async function run() {
      const connection = await Connection.connect();
      const client = new Client({ connection });
    
      const tripId = uuidv4();
      const input: BookingInput = { userId: 'user-123', tripId };
    
      console.log(`Starting trip booking workflow ${tripId}`);
    
      const handle = await client.workflow.start(BookTripWorkflow, {
        taskQueue: TASK_QUEUE_NAME,
        workflowId: `trip-booking-saga-${tripId}`,
        args: [input],
      });
    
      console.log(`Workflow started. Workflow ID: ${handle.workflowId}`);
      const result = await handle.result();
      console.log('Workflow finished. Result:', result);
    }
    
    run().catch((err) => {
      console.error(err);
      process.exit(1);
    });

    To run this, you'll need two terminals:

  • npx ts-node src/worker.ts
  • npx ts-node src/client.ts
  • Observe the output in the worker terminal. You will see successful runs, and you will also see runs where a booking fails, followed immediately by the compensation logic printing the 'Cancelling...' messages in reverse order.


    Advanced Patterns and Production Edge Cases

    The implementation above works for the happy path and simple failures, but production systems require handling more complex scenarios.

    1. Idempotency in Activities

    Temporal guarantees at-least-once execution for activities. This means an activity might run more than once if, for example, a worker successfully completes an activity but crashes before it can acknowledge completion to the Temporal server. The server will time out and re-schedule the activity on another worker.

    If your bookFlight activity is not idempotent, a retry could result in the user being charged twice. This is unacceptable.

    Solution: Pass a unique idempotency key from the deterministic workflow to the non-deterministic activity. The workflow is the source of truth for this key.

    First, let's modify the workflow to generate and pass these keys.

    typescript
    // src/workflows.ts (modified excerpt)
    import { randomUUID } from '@temporalio/workflow'; // Use the deterministic UUID from the workflow SDK!
    
    // ... inside BookTripWorkflow
      try {
        // Step 1: Book Flight
        const flightIdempotencyKey = randomUUID();
        const flightBookingId = await bookFlight({ ...input, idempotencyKey: flightIdempotencyKey });
        successfulBookings.push({ type: 'flight', id: flightBookingId });
    
        // ... repeat for other activities

    Now, the service that implements the bookFlight logic must handle this key. This is typically done by checking for the key in a database table before processing the request.

    typescript
    // Hypothetical implementation of the flight booking service (e.g., an Express.js route)
    
    // DB Schema (e.g., PostgreSQL)
    // CREATE TABLE processed_requests ( idempotency_key UUID PRIMARY KEY, response_body JSONB, created_at TIMESTAMPTZ DEFAULT NOW() );
    
    app.post('/flights/book', async (req, res) => {
      const { idempotencyKey, userId, tripId } = req.body;
    
      // 1. Check for existing key
      const existingRequest = await db.query('SELECT response_body FROM processed_requests WHERE idempotency_key = $1', [idempotencyKey]);
    
      if (existingRequest.rows.length > 0) {
        // Request already processed, return the stored result to ensure consistency
        return res.status(200).json(existingRequest.rows[0].response_body);
      }
    
      // 2. If not found, process the booking
      try {
        const bookingResult = await performBookingLogic(userId, tripId);
    
        // 3. Store the result and key in a single transaction before responding
        await db.query('INSERT INTO processed_requests (idempotency_key, response_body) VALUES ($1, $2)', [idempotencyKey, bookingResult]);
    
        return res.status(201).json(bookingResult);
      } catch (err) {
        // Handle processing error
        return res.status(500).json({ error: 'Booking failed' });
      }
    });

    This pattern ensures that even if the Temporal activity is retried, the actual business logic is executed only once.

    2. Handling Non-Compensatable Failures

    What happens if the saga fails, and then a compensating transaction also fails? For example, the cancelFlight service is down. The workflow is now in a critical state: a trip is partially booked, and the automatic rollback has failed.

    This is where a human-in-the-loop pattern becomes necessary.

    Solution: Implement a retry policy on the compensation activity. If it continues to fail after a certain number of attempts, the workflow should stop and wait for a manual signal from an operator.

    First, let's define a signal.

    typescript
    // src/workflows.ts (additions)
    
    export const manualCompensationSignal = defineSignal<[string]>('manualCompensationSignal');

    Next, let's create a dedicated, more resilient proxy for compensation activities.

    typescript
    // src/workflows.ts (modified excerpt)
    
    const { cancelFlight, cancelHotel, cancelCar } = proxyActivities<typeof activities>(
      {
        // A more aggressive retry policy for critical cleanup tasks
        startToCloseTimeout: '5 minutes',
        retry: {
          initialInterval: '10 seconds',
          maximumAttempts: 10, // Attempt 10 times before giving up
          nonRetryableErrorTypes: [], // Retry on all errors for compensation
        },
      }
    );
    
    // ... inside BookTripWorkflow's `finally` block
    
    // ... inside the loop
    const booking = successfulBookings[i];
    let compensated = false;
    
    while (!compensated) {
      try {
        switch (booking.type) {
          // ... cases for flight, hotel, car
        }
        compensated = true;
      } catch (compensationErr) {
        console.error(`Compensation for ${booking.type} ${booking.id} failed after retries. Awaiting manual intervention.`);
        // Workflow is now blocked here, waiting for the signal
        let intervention: string | null = null;
        await condition(() => {
            // setHandler will capture the signal and update the variable
            return intervention !== null;
        });
    
        if (intervention === 'markCompensated') {
            console.log(`Manually marked ${booking.type} ${booking.id} as compensated.`);
            compensated = true;
        } else {
            // 'retry' was signaled, loop will continue
            console.log(`Retrying compensation for ${booking.type} ${booking.id} based on manual signal.`);
        }
      }
    }
    // ... end of loop

    We also need to add a signal handler to the workflow to receive the manual intervention.

    typescript
    // src/workflows.ts (modified excerpt)
    
    export async function BookTripWorkflow(input: BookingInput): Promise<TripBookingInfo> {
      // ... existing variables
      let intervention: string | null = null;
    
      // Setup the signal handler
      setHandler(manualCompensationSignal, (signalValue) => {
        intervention = signalValue;
      });
    
      // ... rest of the workflow logic
    }

    An operator, alerted by monitoring tools that a workflow is stuck, can now use the Temporal CLI (or API) to unblock the workflow:

    bash
    # Signal the workflow to mark the failed compensation as complete
    temporal workflow signal --workflow-id 'trip-booking-saga-...' --name 'manualCompensationSignal' --input '"markCompensated"'
    
    # Or, signal it to retry the compensation logic again
    temporal workflow signal --workflow-id 'trip-booking-saga-...' --name 'manualCompensationSignal' --input '"retry"'

    This pattern provides a critical escape hatch, ensuring that automated processes have a path for manual resolution when they encounter truly exceptional failures.

    3. Performance and Scalability

    Activity Heartbeating

    If an activity can run for a very long time (e.g., a video processing step that takes hours), a worker crash could go undetected until the startToCloseTimeout is reached. This can leave the entire saga stalled.

    Solution: Use Activity Heartbeating. The activity periodically reports back to the Temporal Cluster that it is still alive and making progress.

    typescript
    // src/activities.ts (example of a long-running activity)
    import { Context } from '@temporalio/activity';
    
    async function processLargeFile(filePath: string): Promise<void> {
      // ... setup logic
      for (let i = 0; i < 100; i++) {
        // Process a chunk of the file
        await processChunk(i);
    
        // Report progress back to Temporal
        Context.current().heartbeat(`Processed chunk ${i + 1} of 100`);
      }
    }

    If the cluster does not receive a heartbeat within the configured heartbeatTimeout, it will time out the activity and re-schedule it, allowing for much faster failure detection.

    Task Queue Tuning

    As your system grows, you may have different types of workloads. Some activities might be short-lived and high-throughput (like our booking activities), while others might be long-running and resource-intensive (like processLargeFile). Lumping them all onto the same task queue can lead to head-of-line blocking, where a long-running task prevents shorter tasks from being picked up.

    Solution: Use separate task queues for different workload profiles. Create a fleet of workers for each queue, scaled appropriately for the workload.

    typescript
    // Worker for high-throughput tasks
    const highThroughputWorker = await Worker.create({
      ...commonOptions,
      taskQueue: 'high-throughput-tasks',
      maxConcurrentActivityTaskExecutions: 200,
    });
    
    // Worker for resource-intensive tasks
    const resourceIntensiveWorker = await Worker.create({
      ...commonOptions,
      taskQueue: 'resource-intensive-tasks',
      maxConcurrentActivityTaskExecutions: 10, // Fewer concurrent tasks
    });

    When calling the activity from the workflow, specify the target task queue:

    typescript
    // In workflow
    const { bookFlight } = proxyActivities<typeof activities>({ 
        taskQueue: 'high-throughput-tasks', 
        /* ...other options */ 
    });
    
    const { processLargeFile } = proxyActivities<typeof activities>({ 
        taskQueue: 'resource-intensive-tasks', 
        /* ...other options */ 
    });

    This architectural separation is crucial for maintaining performance and reliability in a complex, high-volume system.


    The Saga pattern is a powerful technique for managing consistency in distributed systems. However, its implementation details are fraught with challenges around state management, failure recovery, and idempotency. By leveraging a durable execution engine like Temporal, we can abstract away these infrastructure concerns. This allows us to write orchestration logic that is clear, resilient, and testable, transforming a complex distributed systems problem into a more manageable and robust software engineering task.

    Found this article helpful?

    Share it with others who might benefit from it.

    More Articles