Implementing the Saga Pattern with Temporal for Long-Running Workflows
The Peril of Distributed Transactions: From 2PC to Saga
In a monolithic architecture, ACID transactions are a safety net we often take for granted. When your services are decomposed into a distributed, microservices-based system, maintaining data consistency across service boundaries becomes a formidable challenge. The classic two-phase commit (2PC) protocol, while guaranteeing atomicity, introduces tight coupling and synchronous blocking, making it an anti-pattern for scalable, resilient systems. A failure in any participating service or the transaction coordinator can hold locks and stall the entire operation.
This is where the Saga pattern emerges as a superior alternative for managing long-running transactions. A Saga is a sequence of local transactions where each transaction updates data within a single service. If a local transaction fails, the Saga executes a series of compensating transactions to semantically undo the preceding successful transactions.
However, implementing a Saga from scratch is a significant engineering feat. You are responsible for:
This manual implementation often leads to a complex state machine managed in a database, a message queue with dead-letter queues, or a dedicated orchestration service—all of which introduce significant operational overhead and boilerplate code.
Temporal: Shifting the Paradigm from State Machines to Code
Temporal is a durable execution system that fundamentally changes how we build distributed applications. Instead of you managing state, Temporal manages the state of your code's execution. It provides a programming model where long-running, fault-tolerant processes, called Workflows, are written as ordinary code. The Temporal platform guarantees that your workflow code will execute to completion, regardless of process failures, server restarts, or network outages.
This makes it an ideal platform for implementing Sagas. The complex state machine, retry logic, and persistence concerns are completely abstracted away. Your Saga becomes a straightforward sequence of function calls within a Temporal Workflow.
Temporal's core components we'll leverage:
* Workflow: The orchestrator of the Saga. The workflow code defines the sequence of operations and the compensation logic. Its state is durably persisted by the Temporal Cluster.
* Activity: A single, well-defined business action that executes within the Saga (e.g., bookFlight, chargeCreditCard). Activities are where side effects (like database calls or API requests) happen. They are typically idempotent and can be retried independently.
* Worker: A process that hosts and executes Workflow and Activity code. You scale your application by running more Worker processes.
Using Temporal, the Saga's state is implicitly managed within the workflow's execution history. If a worker crashes mid-Saga, another worker will pick up exactly where the last one left off, with all local variables and execution state perfectly restored.
Production Implementation: A Multi-Service Trip Booking Saga
Let's model a realistic scenario: a trip booking service that must coordinate three separate microservices:
If any of these steps fail, all previously completed steps must be compensated. For example, if payment fails, the booked flight and hotel must be canceled.
Project Setup
We'll use the Temporal TypeScript SDK. Ensure you have a Temporal development server running (e.g., via temporal server start-dev).
npm init -y
npm install @temporalio/client @temporalio/worker @temporalio/workflow nanoid
# Dev dependencies
npm install -D @temporalio/testing @types/node ts-node typescript
Step 1: Defining the Activities
Activities are the interface to the outside world. They are plain TypeScript functions that contain our business logic. We'll define them in src/activities.ts.
// src/activities.ts
import { Context } from '@temporalio/activity';
// Mock external service clients
const flightService = {
book: async (bookingId: string) => {
console.log(`Booking flight for ${bookingId}`);
// Simulate API call
await new Promise(resolve => setTimeout(resolve, 500));
if (Math.random() < 0.1) throw new Error('Flight service unavailable');
return { flightId: `flight-${bookingId}` };
},
cancel: async (flightId: string) => {
console.log(`Canceling flight ${flightId}`);
await new Promise(resolve => setTimeout(resolve, 300));
},
};
const hotelService = {
reserve: async (bookingId: string) => {
console.log(`Reserving hotel for ${bookingId}`);
await new Promise(resolve => setTimeout(resolve, 500));
if (Math.random() < 0.1) throw new Error('No rooms available');
return { hotelId: `hotel-${bookingId}` };
},
cancel: async (hotelId: string) => {
console.log(`Canceling hotel reservation ${hotelId}`);
await new Promise(resolve => setTimeout(resolve, 300));
},
};
const paymentService = {
charge: async (bookingId: string, amount: number) => {
console.log(`Charging ${amount} for booking ${bookingId}`);
await new Promise(resolve => setTimeout(resolve, 500));
// This service is more likely to fail
if (Math.random() > 0.5) {
throw new Error('Payment declined');
}
return { paymentId: `payment-${bookingId}` };
},
refund: async (paymentId: string) => {
console.log(`Refunding payment ${paymentId}`);
await new Promise(resolve => setTimeout(resolve, 300));
},
};
// --- Activity Implementations ---
export async function bookFlight(bookingId: string): Promise<{ flightId: string }> {
// In a real app, you'd use dependency injection for service clients
return await flightService.book(bookingId);
}
export async function cancelFlight(flightId: string): Promise<void> {
await flightService.cancel(flightId);
}
export async function reserveHotel(bookingId: string): Promise<{ hotelId: string }> {
return await hotelService.reserve(bookingId);
}
export async function cancelHotel(hotelId: string): Promise<void> {
await hotelService.cancel(hotelId);
}
export async function chargePayment(bookingId: string, amount: number): Promise<{ paymentId: string }> {
// Using activity info for idempotency token
const { activityId } = Context.current().info;
console.log(`Using idempotency key for payment: ${activityId}`);
// Pass activityId to the payment service for idempotency
return await paymentService.charge(bookingId, amount);
}
export async function refundPayment(paymentId: string): Promise<void> {
await paymentService.refund(paymentId);
}
Key Production Pattern: Idempotency
Notice in chargePayment, we access Context.current().info.activityId. Temporal guarantees that this ID is unique for every attempt of an activity execution. By passing this ID to your downstream service as an Idempotency-Key header or similar, you can prevent double charges if the activity is retried due to a transient failure.
Step 2: Implementing the Saga Logic in a Workflow
This is where the magic happens. We'll use a simple but powerful Saga execution pattern within the workflow. We maintain a list of compensation functions to be executed in case of failure.
// src/workflows.ts
import { proxyActivities, sleep, workflowInfo } from '@temporalio/workflow';
import type * as activities from './activities';
// A simple Saga compensation stack
interface Compensation {
fn: () => Promise<any>;
description: string;
}
const {
bookFlight, cancelFlight,
reserveHotel, cancelHotel,
chargePayment, refundPayment
} = proxyActivities<typeof activities>({
startToCloseTimeout: '10 seconds',
// Set aggressive retry policies for activities
retry: {
initialInterval: '1 second',
backoffCoefficient: 2.0,
maximumInterval: '30 seconds',
maximumAttempts: 5,
},
});
export async function tripBookingSaga(bookingId: string, amount: number): Promise<string> {
const compensations: Compensation[] = [];
try {
// 1. Book Flight
const { flightId } = await bookFlight(bookingId);
compensations.unshift({
fn: () => cancelFlight(flightId),
description: `Canceling flight ${flightId}`
});
console.log(`Flight booked: ${flightId}`);
// 2. Reserve Hotel
const { hotelId } = await reserveHotel(bookingId);
compensations.unshift({
fn: () => cancelHotel(hotelId),
description: `Canceling hotel ${hotelId}`
});
console.log(`Hotel reserved: ${hotelId}`);
// 3. Charge Payment
const { paymentId } = await chargePayment(bookingId, amount);
compensations.unshift({
fn: () => refundPayment(paymentId),
description: `Refunding payment ${paymentId}`
});
console.log(`Payment successful: ${paymentId}`);
return `Trip booked successfully! WorkflowId: ${workflowInfo().workflowId}`;
} catch (error) {
console.error('Saga failed, starting compensation...', { error });
for (const comp of compensations) {
try {
console.log(`Executing compensation: ${comp.description}`);
await comp.fn();
} catch (compensationError) {
// In a real system, you'd want to handle compensation failures,
// possibly by alerting a human operator.
console.error(`Compensation failed: ${comp.description}`, { compensationError });
// Continue to the next compensation
}
}
throw new Error(`Saga failed and compensated: ${(error as Error).message}`);
}
}
Analysis of the Workflow Code:
* proxyActivities: This is how we invoke activities from a workflow. The configuration specifies timeouts and retry policies. This is crucial; we want individual steps to be resilient to transient issues without failing the entire Saga immediately.
* compensations array: This acts as our compensation stack. After each successful forward action, we unshift (add to the beginning) its corresponding compensation action. This ensures that if a failure occurs, we execute compensations in the correct reverse order (LIFO - Last In, First Out).
* try...catch block: This is the core of the Saga orchestration. If any activity throws an un-retryable error, execution jumps to the catch block.
* Compensation Loop: The catch block iterates through the compensations stack and executes each function. The durability of the workflow guarantees that this compensation logic will run to completion, even if the worker executing it crashes.
Step 3: Creating the Worker and Client
Now, we need a Worker to host our code and a Client to start the workflow.
// src/worker.ts
import { Worker } from '@temporalio/worker';
import * as activities from './activities';
async function run() {
const worker = await Worker.create({
workflowsPath: require.resolve('./workflows'),
activities,
taskQueue: 'trip-booking-saga',
});
await worker.run();
}
run().catch((err) => {
console.error(err);
process.exit(1);
});
// src/client.ts
import { Connection, Client } from '@temporalio/client';
import { tripBookingSaga } from './workflows';
import { nanoid } from 'nanoid';
async function run() {
const connection = await Connection.connect();
const client = new Client({ connection });
const bookingId = `trip-${nanoid()}`;
console.log(`Starting trip booking saga for ID: ${bookingId}`);
try {
const result = await client.workflow.execute(tripBookingSaga, {
taskQueue: 'trip-booking-saga',
workflowId: `trip-booking-${bookingId}`,
args: [bookingId, 5000], // bookingId, amount
});
console.log('Workflow Succeeded:', result);
} catch (error) {
console.error('Workflow Failed:', (error as Error).message);
}
}
run().catch((err) => {
console.error(err);
process.exit(1);
});
To run this, you'll need three terminal windows:
temporal server start-devnpx ts-node src/worker.tsnpx ts-node src/client.tsSince our chargePayment activity is designed to fail frequently, you will often see the compensation logic kick in, printing the Canceling... and Refunding... logs in the worker output.
Advanced Patterns and Edge Case Handling
The simple implementation above is powerful, but real-world systems require more nuance.
1. Parallel Sagas
What if booking the flight and hotel can happen in parallel to speed up the process? The Saga logic must adapt.
// In src/workflows.ts
export async function tripBookingSagaParallel(bookingId: string, amount: number): Promise<string> {
const compensations: Compensation[] = [];
try {
// 1. Start Flight and Hotel booking in parallel
const flightPromise = bookFlight(bookingId);
const hotelPromise = reserveHotel(bookingId);
const [{ flightId }, { hotelId }] = await Promise.all([flightPromise, hotelPromise]);
// Add compensations AFTER they succeed
compensations.unshift({ fn: () => cancelHotel(hotelId), description: `Canceling hotel ${hotelId}` });
compensations.unshift({ fn: () => cancelFlight(flightId), description: `Canceling flight ${flightId}` });
console.log(`Flight booked: ${flightId}, Hotel reserved: ${hotelId}`);
// 2. Charge Payment (sequentially)
const { paymentId } = await chargePayment(bookingId, amount);
compensations.unshift({ fn: () => refundPayment(paymentId), description: `Refunding payment ${paymentId}` });
console.log(`Payment successful: ${paymentId}`);
return `Trip booked successfully (in parallel)! WorkflowId: ${workflowInfo().workflowId}`;
} catch (error) {
// ... compensation logic remains the same ...
// It will correctly compensate for whatever completed before the failure.
// If only the flight succeeded before the hotel failed, only the flight cancellation runs.
console.error('Saga failed, starting compensation...', { error });
for (const comp of compensations) {
try {
console.log(`Executing compensation: ${comp.description}`);
await comp.fn();
} catch (compensationError) {
console.error(`Compensation failed: ${comp.description}`, { compensationError });
}
}
throw new Error(`Saga failed and compensated: ${(error as Error).message}`);
}
}
Here, Promise.all ensures we wait for both parallel operations to complete. If one fails, Promise.all rejects, and our catch block is triggered. The key is that we only add compensations to the stack after an operation is confirmed successful. The compensation logic correctly handles partial success within the parallel block.
2. Non-Compensatable Actions
Some actions cannot be undone, like sending a final confirmation email. The rule for Sagas is simple: execute non-compensatable actions only after all compensatable actions have succeeded.
In our example, this action would be placed at the very end of the try block, just before the return statement. If it fails, the Saga has technically succeeded from a data consistency perspective, and the failure of the email can be handled separately (e.g., retried by a different system or logged for manual intervention).
3. Testing Sagas with `@temporalio/testing`
Testing distributed systems is notoriously difficult. Temporal's testing framework (@temporalio/testing) makes it remarkably straightforward to write deterministic unit tests for complex workflow logic.
// src/workflows.test.ts
import { TestWorkflowEnvironment } from '@temporalio/testing';
import { Worker } from '@temporalio/worker';
import { tripBookingSaga } from './workflows';
import * as activities from './activities';
// Use jest.mock for fine-grained control
jest.mock('./activities');
const mockedActivities = activities as jest.Mocked<typeof activities>;
describe('tripBookingSaga', () => {
let testEnv: TestWorkflowEnvironment;
beforeAll(async () => {
testEnv = await TestWorkflowEnvironment.createTimeSkipping();
});
afterAll(async () => {
await testEnv?.teardown();
});
it('should complete successfully when all activities succeed', async () => {
mockedActivities.bookFlight.mockResolvedValue({ flightId: 'F123' });
mockedActivities.reserveHotel.mockResolvedValue({ hotelId: 'H456' });
mockedActivities.chargePayment.mockResolvedValue({ paymentId: 'P789' });
const worker = await Worker.create({ ... }); // Worker setup
const result = await worker.runUntil(async () => {
return testEnv.client.workflow.execute(tripBookingSaga, {
workflowId: 'test-success',
taskQueue: 'test',
args: ['booking-1', 100],
});
});
expect(result).toContain('Trip booked successfully!');
expect(mockedActivities.cancelFlight).not.toHaveBeenCalled();
expect(mockedActivities.cancelHotel).not.toHaveBeenCalled();
expect(mockedActivities.refundPayment).not.toHaveBeenCalled();
});
it('should compensate all steps if payment fails', async () => {
mockedActivities.bookFlight.mockResolvedValue({ flightId: 'F123' });
mockedActivities.reserveHotel.mockResolvedValue({ hotelId: 'H456' });
mockedActivities.chargePayment.mockRejectedValue(new Error('Payment Declined'));
const worker = await Worker.create({ ... }); // Worker setup
await expect(
worker.runUntil(async () => {
return testEnv.client.workflow.execute(tripBookingSaga, {
workflowId: 'test-failure',
taskQueue: 'test',
args: ['booking-2', 200],
});
})
).rejects.toThrow('Saga failed and compensated: Payment Declined');
// Verify compensations were called in reverse order
expect(mockedActivities.refundPayment).not.toHaveBeenCalled(); // It failed, so no payment to refund
expect(mockedActivities.cancelHotel).toHaveBeenCalledWith('H456');
expect(mockedActivities.cancelFlight).toHaveBeenCalledWith('F123');
});
});
By mocking the activities, we can test the workflow's orchestration logic in isolation. We can simulate failures at any stage of the Saga and assert that the correct compensation functions are called in the correct order. The time-skipping environment allows us to test long-running workflows with timers and sleeps in milliseconds.
Performance and Scalability Considerations
* Activity Heartbeating: If an activity within your Saga can run for a long time (e.g., provisioning a resource that takes 30 minutes), it should perform heartbeating via Context.current().heartbeat(). This allows Temporal to detect if the worker running the activity has crashed much faster than waiting for the startToCloseTimeout, enabling quicker failure detection and recovery.
* Worker Tuning: The throughput of your Sagas is determined by your Worker configuration. The maxConcurrentActivityTaskExecutions setting is critical. If your activities are I/O-bound (making network calls), you can set this number high (e.g., 100) to allow a single worker process to handle many concurrent Saga steps. If they are CPU-bound, you should keep it closer to the number of cores on the machine.
* Task Queues: For very high-volume systems, consider separating different Sagas or even different steps of a single Saga onto different task queues. This allows you to scale worker fleets independently. For example, you could have a large worker fleet for the high-traffic chargePayment activity and a smaller one for less frequent booking activities.
Conclusion
The Saga pattern is a powerful solution for maintaining data consistency in distributed systems, but its manual implementation is fraught with complexity. By leveraging a durable execution system like Temporal, we elevate the implementation from a complex state management problem to a business logic problem.
Temporal provides the durable persistence, guaranteed execution, and robust retry mechanisms out of the box, allowing senior engineers to write Sagas as straightforward, testable code. The patterns discussed here—parallel execution, idempotency, and deterministic testing—are not just theoretical but are the building blocks for creating truly resilient, scalable, and maintainable microservice architectures.