Implementing the Saga Pattern with Temporal for Long-Running Workflows

16 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Peril of Distributed Transactions: From 2PC to Saga

In a monolithic architecture, ACID transactions are a safety net we often take for granted. When your services are decomposed into a distributed, microservices-based system, maintaining data consistency across service boundaries becomes a formidable challenge. The classic two-phase commit (2PC) protocol, while guaranteeing atomicity, introduces tight coupling and synchronous blocking, making it an anti-pattern for scalable, resilient systems. A failure in any participating service or the transaction coordinator can hold locks and stall the entire operation.

This is where the Saga pattern emerges as a superior alternative for managing long-running transactions. A Saga is a sequence of local transactions where each transaction updates data within a single service. If a local transaction fails, the Saga executes a series of compensating transactions to semantically undo the preceding successful transactions.

However, implementing a Saga from scratch is a significant engineering feat. You are responsible for:

  • State Management: Persistently tracking the Saga's progress. Which steps have completed? Which need to be compensated?
  • Idempotency: Ensuring that retried steps or compensation actions don't have unintended side effects.
  • Compensation Logic: Meticulously coding the reverse operation for every forward action.
  • Failure Recovery: Building a robust mechanism to resume a Saga from its last known state after a process crash or network partition.
  • This manual implementation often leads to a complex state machine managed in a database, a message queue with dead-letter queues, or a dedicated orchestration service—all of which introduce significant operational overhead and boilerplate code.

    Temporal: Shifting the Paradigm from State Machines to Code

    Temporal is a durable execution system that fundamentally changes how we build distributed applications. Instead of you managing state, Temporal manages the state of your code's execution. It provides a programming model where long-running, fault-tolerant processes, called Workflows, are written as ordinary code. The Temporal platform guarantees that your workflow code will execute to completion, regardless of process failures, server restarts, or network outages.

    This makes it an ideal platform for implementing Sagas. The complex state machine, retry logic, and persistence concerns are completely abstracted away. Your Saga becomes a straightforward sequence of function calls within a Temporal Workflow.

    Temporal's core components we'll leverage:

    * Workflow: The orchestrator of the Saga. The workflow code defines the sequence of operations and the compensation logic. Its state is durably persisted by the Temporal Cluster.

    * Activity: A single, well-defined business action that executes within the Saga (e.g., bookFlight, chargeCreditCard). Activities are where side effects (like database calls or API requests) happen. They are typically idempotent and can be retried independently.

    * Worker: A process that hosts and executes Workflow and Activity code. You scale your application by running more Worker processes.

    Using Temporal, the Saga's state is implicitly managed within the workflow's execution history. If a worker crashes mid-Saga, another worker will pick up exactly where the last one left off, with all local variables and execution state perfectly restored.

    Production Implementation: A Multi-Service Trip Booking Saga

    Let's model a realistic scenario: a trip booking service that must coordinate three separate microservices:

  • Flight Service: Books a flight.
  • Hotel Service: Reserves a hotel room.
  • Payment Service: Processes the payment.
  • If any of these steps fail, all previously completed steps must be compensated. For example, if payment fails, the booked flight and hotel must be canceled.

    Project Setup

    We'll use the Temporal TypeScript SDK. Ensure you have a Temporal development server running (e.g., via temporal server start-dev).

    bash
    npm init -y
    npm install @temporalio/client @temporalio/worker @temporalio/workflow nanoid
    # Dev dependencies
    npm install -D @temporalio/testing @types/node ts-node typescript

    Step 1: Defining the Activities

    Activities are the interface to the outside world. They are plain TypeScript functions that contain our business logic. We'll define them in src/activities.ts.

    typescript
    // src/activities.ts
    
    import { Context } from '@temporalio/activity';
    
    // Mock external service clients
    const flightService = {
      book: async (bookingId: string) => {
        console.log(`Booking flight for ${bookingId}`);
        // Simulate API call
        await new Promise(resolve => setTimeout(resolve, 500));
        if (Math.random() < 0.1) throw new Error('Flight service unavailable');
        return { flightId: `flight-${bookingId}` };
      },
      cancel: async (flightId: string) => {
        console.log(`Canceling flight ${flightId}`);
        await new Promise(resolve => setTimeout(resolve, 300));
      },
    };
    
    const hotelService = {
      reserve: async (bookingId: string) => {
        console.log(`Reserving hotel for ${bookingId}`);
        await new Promise(resolve => setTimeout(resolve, 500));
        if (Math.random() < 0.1) throw new Error('No rooms available');
        return { hotelId: `hotel-${bookingId}` };
      },
      cancel: async (hotelId: string) => {
        console.log(`Canceling hotel reservation ${hotelId}`);
        await new Promise(resolve => setTimeout(resolve, 300));
      },
    };
    
    const paymentService = {
      charge: async (bookingId: string, amount: number) => {
        console.log(`Charging ${amount} for booking ${bookingId}`);
        await new Promise(resolve => setTimeout(resolve, 500));
        // This service is more likely to fail
        if (Math.random() > 0.5) {
          throw new Error('Payment declined');
        }
        return { paymentId: `payment-${bookingId}` };
      },
      refund: async (paymentId: string) => {
        console.log(`Refunding payment ${paymentId}`);
        await new Promise(resolve => setTimeout(resolve, 300));
      },
    };
    
    // --- Activity Implementations ---
    
    export async function bookFlight(bookingId: string): Promise<{ flightId: string }> {
      // In a real app, you'd use dependency injection for service clients
      return await flightService.book(bookingId);
    }
    
    export async function cancelFlight(flightId: string): Promise<void> {
      await flightService.cancel(flightId);
    }
    
    export async function reserveHotel(bookingId: string): Promise<{ hotelId: string }> {
      return await hotelService.reserve(bookingId);
    }
    
    export async function cancelHotel(hotelId: string): Promise<void> {
      await hotelService.cancel(hotelId);
    }
    
    export async function chargePayment(bookingId: string, amount: number): Promise<{ paymentId: string }> {
      // Using activity info for idempotency token
      const { activityId } = Context.current().info;
      console.log(`Using idempotency key for payment: ${activityId}`);
      // Pass activityId to the payment service for idempotency
      return await paymentService.charge(bookingId, amount);
    }
    
    export async function refundPayment(paymentId: string): Promise<void> {
      await paymentService.refund(paymentId);
    }

    Key Production Pattern: Idempotency

    Notice in chargePayment, we access Context.current().info.activityId. Temporal guarantees that this ID is unique for every attempt of an activity execution. By passing this ID to your downstream service as an Idempotency-Key header or similar, you can prevent double charges if the activity is retried due to a transient failure.

    Step 2: Implementing the Saga Logic in a Workflow

    This is where the magic happens. We'll use a simple but powerful Saga execution pattern within the workflow. We maintain a list of compensation functions to be executed in case of failure.

    typescript
    // src/workflows.ts
    import { proxyActivities, sleep, workflowInfo } from '@temporalio/workflow';
    import type * as activities from './activities';
    
    // A simple Saga compensation stack
    interface Compensation {
      fn: () => Promise<any>;
      description: string;
    }
    
    const { 
      bookFlight, cancelFlight,
      reserveHotel, cancelHotel,
      chargePayment, refundPayment
    } = proxyActivities<typeof activities>({
      startToCloseTimeout: '10 seconds',
      // Set aggressive retry policies for activities
      retry: {
        initialInterval: '1 second',
        backoffCoefficient: 2.0,
        maximumInterval: '30 seconds',
        maximumAttempts: 5,
      },
    });
    
    export async function tripBookingSaga(bookingId: string, amount: number): Promise<string> {
      const compensations: Compensation[] = [];
    
      try {
        // 1. Book Flight
        const { flightId } = await bookFlight(bookingId);
        compensations.unshift({ 
          fn: () => cancelFlight(flightId),
          description: `Canceling flight ${flightId}`
        });
        console.log(`Flight booked: ${flightId}`);
    
        // 2. Reserve Hotel
        const { hotelId } = await reserveHotel(bookingId);
        compensations.unshift({ 
          fn: () => cancelHotel(hotelId),
          description: `Canceling hotel ${hotelId}`
        });
        console.log(`Hotel reserved: ${hotelId}`);
    
        // 3. Charge Payment
        const { paymentId } = await chargePayment(bookingId, amount);
        compensations.unshift({ 
          fn: () => refundPayment(paymentId),
          description: `Refunding payment ${paymentId}`
        });
        console.log(`Payment successful: ${paymentId}`);
    
        return `Trip booked successfully! WorkflowId: ${workflowInfo().workflowId}`;
    
      } catch (error) {
        console.error('Saga failed, starting compensation...', { error });
    
        for (const comp of compensations) {
          try {
            console.log(`Executing compensation: ${comp.description}`);
            await comp.fn();
          } catch (compensationError) {
            // In a real system, you'd want to handle compensation failures,
            // possibly by alerting a human operator.
            console.error(`Compensation failed: ${comp.description}`, { compensationError });
            // Continue to the next compensation
          }
        }
    
        throw new Error(`Saga failed and compensated: ${(error as Error).message}`);
      }
    }

    Analysis of the Workflow Code:

    * proxyActivities: This is how we invoke activities from a workflow. The configuration specifies timeouts and retry policies. This is crucial; we want individual steps to be resilient to transient issues without failing the entire Saga immediately.

    * compensations array: This acts as our compensation stack. After each successful forward action, we unshift (add to the beginning) its corresponding compensation action. This ensures that if a failure occurs, we execute compensations in the correct reverse order (LIFO - Last In, First Out).

    * try...catch block: This is the core of the Saga orchestration. If any activity throws an un-retryable error, execution jumps to the catch block.

    * Compensation Loop: The catch block iterates through the compensations stack and executes each function. The durability of the workflow guarantees that this compensation logic will run to completion, even if the worker executing it crashes.

    Step 3: Creating the Worker and Client

    Now, we need a Worker to host our code and a Client to start the workflow.

    typescript
    // src/worker.ts
    import { Worker } from '@temporalio/worker';
    import * as activities from './activities';
    
    async function run() {
      const worker = await Worker.create({
        workflowsPath: require.resolve('./workflows'),
        activities,
        taskQueue: 'trip-booking-saga',
      });
      await worker.run();
    }
    
    run().catch((err) => {
      console.error(err);
      process.exit(1);
    });
    typescript
    // src/client.ts
    import { Connection, Client } from '@temporalio/client';
    import { tripBookingSaga } from './workflows';
    import { nanoid } from 'nanoid';
    
    async function run() {
      const connection = await Connection.connect();
      const client = new Client({ connection });
    
      const bookingId = `trip-${nanoid()}`;
    
      console.log(`Starting trip booking saga for ID: ${bookingId}`);
    
      try {
        const result = await client.workflow.execute(tripBookingSaga, {
          taskQueue: 'trip-booking-saga',
          workflowId: `trip-booking-${bookingId}`,
          args: [bookingId, 5000], // bookingId, amount
        });
        console.log('Workflow Succeeded:', result);
      } catch (error) {
        console.error('Workflow Failed:', (error as Error).message);
      }
    }
    
    run().catch((err) => {
      console.error(err);
      process.exit(1);
    });

    To run this, you'll need three terminal windows:

  • temporal server start-dev
  • npx ts-node src/worker.ts
  • npx ts-node src/client.ts
  • Since our chargePayment activity is designed to fail frequently, you will often see the compensation logic kick in, printing the Canceling... and Refunding... logs in the worker output.

    Advanced Patterns and Edge Case Handling

    The simple implementation above is powerful, but real-world systems require more nuance.

    1. Parallel Sagas

    What if booking the flight and hotel can happen in parallel to speed up the process? The Saga logic must adapt.

    typescript
    // In src/workflows.ts
    export async function tripBookingSagaParallel(bookingId: string, amount: number): Promise<string> {
      const compensations: Compensation[] = [];
    
      try {
        // 1. Start Flight and Hotel booking in parallel
        const flightPromise = bookFlight(bookingId);
        const hotelPromise = reserveHotel(bookingId);
    
        const [{ flightId }, { hotelId }] = await Promise.all([flightPromise, hotelPromise]);
    
        // Add compensations AFTER they succeed
        compensations.unshift({ fn: () => cancelHotel(hotelId), description: `Canceling hotel ${hotelId}` });
        compensations.unshift({ fn: () => cancelFlight(flightId), description: `Canceling flight ${flightId}` });
        console.log(`Flight booked: ${flightId}, Hotel reserved: ${hotelId}`);
    
        // 2. Charge Payment (sequentially)
        const { paymentId } = await chargePayment(bookingId, amount);
        compensations.unshift({ fn: () => refundPayment(paymentId), description: `Refunding payment ${paymentId}` });
        console.log(`Payment successful: ${paymentId}`);
    
        return `Trip booked successfully (in parallel)! WorkflowId: ${workflowInfo().workflowId}`;
    
      } catch (error) {
        // ... compensation logic remains the same ...
        // It will correctly compensate for whatever completed before the failure.
        // If only the flight succeeded before the hotel failed, only the flight cancellation runs.
        console.error('Saga failed, starting compensation...', { error });
        for (const comp of compensations) {
          try {
            console.log(`Executing compensation: ${comp.description}`);
            await comp.fn();
          } catch (compensationError) {
            console.error(`Compensation failed: ${comp.description}`, { compensationError });
          }
        }
        throw new Error(`Saga failed and compensated: ${(error as Error).message}`);
      }
    }

    Here, Promise.all ensures we wait for both parallel operations to complete. If one fails, Promise.all rejects, and our catch block is triggered. The key is that we only add compensations to the stack after an operation is confirmed successful. The compensation logic correctly handles partial success within the parallel block.

    2. Non-Compensatable Actions

    Some actions cannot be undone, like sending a final confirmation email. The rule for Sagas is simple: execute non-compensatable actions only after all compensatable actions have succeeded.

    In our example, this action would be placed at the very end of the try block, just before the return statement. If it fails, the Saga has technically succeeded from a data consistency perspective, and the failure of the email can be handled separately (e.g., retried by a different system or logged for manual intervention).

    3. Testing Sagas with `@temporalio/testing`

    Testing distributed systems is notoriously difficult. Temporal's testing framework (@temporalio/testing) makes it remarkably straightforward to write deterministic unit tests for complex workflow logic.

    typescript
    // src/workflows.test.ts
    import { TestWorkflowEnvironment } from '@temporalio/testing';
    import { Worker } from '@temporalio/worker';
    import { tripBookingSaga } from './workflows';
    import * as activities from './activities';
    
    // Use jest.mock for fine-grained control
    jest.mock('./activities');
    const mockedActivities = activities as jest.Mocked<typeof activities>;
    
    describe('tripBookingSaga', () => {
      let testEnv: TestWorkflowEnvironment;
    
      beforeAll(async () => {
        testEnv = await TestWorkflowEnvironment.createTimeSkipping();
      });
    
      afterAll(async () => {
        await testEnv?.teardown();
      });
    
      it('should complete successfully when all activities succeed', async () => {
        mockedActivities.bookFlight.mockResolvedValue({ flightId: 'F123' });
        mockedActivities.reserveHotel.mockResolvedValue({ hotelId: 'H456' });
        mockedActivities.chargePayment.mockResolvedValue({ paymentId: 'P789' });
    
        const worker = await Worker.create({ ... }); // Worker setup
        const result = await worker.runUntil(async () => {
          return testEnv.client.workflow.execute(tripBookingSaga, {
            workflowId: 'test-success',
            taskQueue: 'test',
            args: ['booking-1', 100],
          });
        });
    
        expect(result).toContain('Trip booked successfully!');
        expect(mockedActivities.cancelFlight).not.toHaveBeenCalled();
        expect(mockedActivities.cancelHotel).not.toHaveBeenCalled();
        expect(mockedActivities.refundPayment).not.toHaveBeenCalled();
      });
    
      it('should compensate all steps if payment fails', async () => {
        mockedActivities.bookFlight.mockResolvedValue({ flightId: 'F123' });
        mockedActivities.reserveHotel.mockResolvedValue({ hotelId: 'H456' });
        mockedActivities.chargePayment.mockRejectedValue(new Error('Payment Declined'));
    
        const worker = await Worker.create({ ... }); // Worker setup
        await expect(
          worker.runUntil(async () => {
            return testEnv.client.workflow.execute(tripBookingSaga, {
              workflowId: 'test-failure',
              taskQueue: 'test',
              args: ['booking-2', 200],
            });
          })
        ).rejects.toThrow('Saga failed and compensated: Payment Declined');
    
        // Verify compensations were called in reverse order
        expect(mockedActivities.refundPayment).not.toHaveBeenCalled(); // It failed, so no payment to refund
        expect(mockedActivities.cancelHotel).toHaveBeenCalledWith('H456');
        expect(mockedActivities.cancelFlight).toHaveBeenCalledWith('F123');
      });
    });

    By mocking the activities, we can test the workflow's orchestration logic in isolation. We can simulate failures at any stage of the Saga and assert that the correct compensation functions are called in the correct order. The time-skipping environment allows us to test long-running workflows with timers and sleeps in milliseconds.

    Performance and Scalability Considerations

    * Activity Heartbeating: If an activity within your Saga can run for a long time (e.g., provisioning a resource that takes 30 minutes), it should perform heartbeating via Context.current().heartbeat(). This allows Temporal to detect if the worker running the activity has crashed much faster than waiting for the startToCloseTimeout, enabling quicker failure detection and recovery.

    * Worker Tuning: The throughput of your Sagas is determined by your Worker configuration. The maxConcurrentActivityTaskExecutions setting is critical. If your activities are I/O-bound (making network calls), you can set this number high (e.g., 100) to allow a single worker process to handle many concurrent Saga steps. If they are CPU-bound, you should keep it closer to the number of cores on the machine.

    * Task Queues: For very high-volume systems, consider separating different Sagas or even different steps of a single Saga onto different task queues. This allows you to scale worker fleets independently. For example, you could have a large worker fleet for the high-traffic chargePayment activity and a smaller one for less frequent booking activities.

    Conclusion

    The Saga pattern is a powerful solution for maintaining data consistency in distributed systems, but its manual implementation is fraught with complexity. By leveraging a durable execution system like Temporal, we elevate the implementation from a complex state management problem to a business logic problem.

    Temporal provides the durable persistence, guaranteed execution, and robust retry mechanisms out of the box, allowing senior engineers to write Sagas as straightforward, testable code. The patterns discussed here—parallel execution, idempotency, and deterministic testing—are not just theoretical but are the building blocks for creating truly resilient, scalable, and maintainable microservice architectures.

    Found this article helpful?

    Share it with others who might benefit from it.

    More Articles