Temporal Sagas: Orchestrating Fault-Tolerant Microservice Transactions
The Inescapable Challenge of Consistency in Microservices
In a distributed microservices architecture, the dream of ACID-compliant, two-phase commit (2PC) transactions across service boundaries quickly evaporates under the harsh realities of the CAP theorem. Network partitions are not an anomaly; they are an inevitability. Attempting to enforce strict consistency with distributed locks and 2PC leads to brittle, tightly coupled systems with poor availability. When the PaymentService is down, you cannot afford for the OrderService to be completely blocked.
This is where the Saga pattern emerges as the canonical solution for managing data consistency across multiple services. A Saga is a sequence of local transactions where each transaction updates data within a single service. If a local transaction fails, the Saga executes a series of compensating transactions that revert the preceding successful transactions.
However, implementing Sagas is notoriously complex. The two primary approaches, Choreography and Orchestration, present their own challenges:
* Choreography: Services communicate through an event bus (e.g., Kafka). The OrderService emits an OrderCreated event, the PaymentService consumes it and emits PaymentProcessed, and so on. This is loosely coupled but suffers from poor visibility. Understanding the state of a business transaction requires correlating events across multiple topics, making debugging a nightmare. Error handling is distributed and often inconsistent.
* Orchestration: A central orchestrator explicitly calls each service in sequence and is responsible for invoking compensating actions upon failure. This provides clear visibility and centralized control but introduces the risk of the orchestrator itself becoming a single point of failure and a complex, stateful component to manage.
This is the precise problem domain where Temporal.io excels. Temporal is not just a workflow engine; it's a durable execution platform that turns the orchestrator from a liability into a highly available, fault-tolerant, and surprisingly simple component. It allows you to write complex, long-running, stateful logic as straightforward code, while the platform handles the persistence, retries, and state management.
This article will demonstrate how to implement a robust, production-grade Saga orchestrator using Temporal and TypeScript, focusing on the advanced patterns required to handle real-world failures.
A Production-Grade Order Fulfillment Saga
Let's model a realistic e-commerce order fulfillment process. A successful transaction involves these steps across different services:
PaymentService): Place a hold on the customer's credit card.InventoryService): Decrement the stock count for the ordered items.PaymentService): Finalize the charge.ShippingService): Generate a shipping label and tracking number.If any step fails, we must execute compensating actions in reverse order:
* If CreateShipment fails, we must Void Payment (not refund, as it was only authorized and captured) and Release Inventory.
* If CapturePayment fails, we must Void Payment Authorization and Release Inventory.
* If ReserveInventory fails, we must Void Payment Authorization.
With Temporal, this entire orchestration logic is captured in a single, durable function called a Workflow.
Defining the Activities
Activities are the building blocks of a Temporal Workflow. They represent the individual units of work that interact with the outside world (e.g., calling a microservice API, querying a database). They are designed to be retried and must be idempotent.
Here are the interfaces for our activities, including their compensating actions:
// src/activities.ts
// These would typically be clients for your microservices
import { paymentService } from './services/paymentService';
import { inventoryService } from './services/inventoryService';
import { shippingService } from './services/shippingService';
export interface OrderDetails {
userId: string;
itemId: string;
quantity: number;
amountCents: number;
idempotencyKey: string; // Critical for idempotency
}
// --- Forward Actions ---
export async function authorizePayment(details: OrderDetails): Promise<{ transactionId: string }> {
console.log(`Authorizing payment for ${details.amountCents} cents.`);
const result = await paymentService.authorize(details.idempotencyKey, details.amountCents);
return { transactionId: result.transactionId };
}
export async function reserveInventory(details: OrderDetails): Promise<{ reservationId: string }> {
console.log(`Reserving ${details.quantity} of item ${details.itemId}.`);
const result = await inventoryService.reserve(details.idempotencyKey, details.itemId, details.quantity);
return { reservationId: result.reservationId };
}
export async function capturePayment(transactionId: string, amountCents: number): Promise<void> {
console.log(`Capturing payment for transaction ${transactionId}.`);
await paymentService.capture(transactionId, amountCents);
}
export async function createShipment(details: OrderDetails): Promise<{ trackingNumber: string }> {
console.log(`Creating shipment for order.`);
const result = await shippingService.create(details.idempotencyKey, details.itemId, details.quantity);
return { trackingNumber: result.trackingNumber };
}
// --- Compensation Actions ---
export async function voidPaymentAuthorization(transactionId: string): Promise<void> {
console.log(`Voiding payment authorization for transaction ${transactionId}.`);
await paymentService.void(transactionId);
}
export async function releaseInventory(reservationId: string): Promise<void> {
console.log(`Releasing inventory for reservation ${reservationId}.`);
await inventoryService.release(reservationId);
}
export async function refundPayment(transactionId: string): Promise<void> {
console.log(`Refunding payment for transaction ${transactionId}.`);
// Note: In our flow, we'd void, but a full refund is another common compensation
await paymentService.refund(transactionId);
}
The Saga Workflow Implementation
The magic of Temporal is how it allows us to structure the Saga. We can use standard try...catch...finally blocks. The code inside the try block represents the forward path of the Saga. The catch block contains the compensation logic. Temporal guarantees that this logic will execute, even if the worker process crashes and restarts on another machine years later.
// src/workflows.ts
import * as wf from '@temporalio/workflow';
import type * as activities from './activities';
import { OrderDetails } from './activities';
// Proxy activities to the workflow
const {
authorizePayment,
reserveInventory,
capturePayment,
createShipment,
voidPaymentAuthorization,
releaseInventory,
refundPayment
} = wf.proxyActivities<typeof activities>({
startToCloseTimeout: '30 seconds',
// Standard retry policy for most activities
retry: {
initialInterval: '1 second',
backoffCoefficient: 2,
maximumInterval: '1 minute',
nonRetryableErrorTypes: ['ApplicationError'], // Don't retry on business logic failures
},
});
export async function orderFulfillmentSaga(orderDetails: OrderDetails): Promise<{ trackingNumber: string }> {
const compensationStack: (() => Promise<void>)[] = [];
let capturedPayment = false;
try {
// 1. Authorize Payment
const { transactionId } = await authorizePayment(orderDetails);
compensationStack.push(() => voidPaymentAuthorization(transactionId));
// 2. Reserve Inventory
const { reservationId } = await reserveInventory(orderDetails);
compensationStack.push(() => releaseInventory(reservationId));
// 3. Capture Payment
await capturePayment(transactionId, orderDetails.amountCents);
capturedPayment = true; // Mark payment as captured
// 4. Create Shipment
const { trackingNumber } = await createShipment(orderDetails);
// Saga completed successfully!
return { trackingNumber };
} catch (err) {
// If capture succeeded, the primary compensation is a refund.
if (capturedPayment) {
// A more robust implementation might use a different activity for refund vs void.
// This demonstrates a conditional compensation path.
const { transactionId } = await wf.getState('transactionId'); // Assuming we stored it
await refundPayment(transactionId);
} else {
// Standard compensation rollback
for (const compensation of compensationStack.reverse()) {
await compensation();
}
}
// Re-throw the error to fail the workflow
throw err;
}
}
This implementation is deceptively simple but incredibly powerful:
compensationStack array is a durable, workflow-managed variable. If the worker crashes after authorizePayment but before reserveInventory, Temporal will resume the workflow on another worker with compensationStack correctly populated.catch block iterates through the stack in reverse, executing the compensating activities. This is the core of the Saga pattern, implemented with familiar programming constructs.await calls can take seconds, minutes, or even days. Temporal transparently persists the workflow's state after each await, making it resilient to any infrastructure failure.Advanced Pattern: Absolute Idempotency in Activities
A common failure mode in distributed systems is the "at-least-once" delivery guarantee. A Temporal worker might successfully execute the authorizePayment activity, but crash before it can acknowledge completion to the Temporal server. The server, seeing no acknowledgment, will re-schedule the activity on another worker. If your PaymentService is not idempotent, you will double-charge the customer.
While Temporal offers de-duplication for workflow execution starts, it's a best practice to ensure your activities are fully idempotent at the application level.
The most robust pattern is to have the workflow generate a unique idempotency key for each logical operation and pass it to the activity.
Workflow Enhancement:
We can use wf.uuid4() to generate a unique key for each step. This key is deterministic within the context of a workflow re-execution, which is crucial.
// src/workflows.ts (enhanced)
// ... imports
export async function orderFulfillmentSaga(orderDetails: OrderDetails): Promise<{ trackingNumber: string }> {
// ... compensationStack setup
try {
// Generate a deterministic UUID for this specific operation within this workflow execution
const paymentIdempotencyKey = await wf.uuid4();
const { transactionId } = await authorizePayment({ ...orderDetails, idempotencyKey: paymentIdempotencyKey });
// ... rest of the workflow
} catch (err) {
// ... compensation logic
}
}
Service-Side Implementation:
Your downstream service (PaymentService in this case) must be designed to handle this key. This typically involves a database table to track processed idempotency keys.
// Example implementation in your PaymentService (e.g., Express.js)
// DB Schema: idempotency_keys (key TEXT PRIMARY KEY, response JSONB, created_at TIMESTAMPTZ)
async function handleAuthorization(req, res) {
const { idempotencyKey, amount } = req.body;
// 1. Check if we've seen this key before
const existing = await db.query('SELECT response FROM idempotency_keys WHERE key = $1', [idempotencyKey]);
if (existing.rows.length > 0) {
// Key already processed, return the saved response
return res.status(200).json(existing.rows[0].response);
}
let transactionResult;
try {
// 2. Perform the actual operation
transactionResult = await paymentGateway.authorize(amount);
// 3. Atomically store the key and the result
// Use a transaction to ensure you don't have a partial write
await db.query('BEGIN');
await db.query(
'INSERT INTO idempotency_keys (key, response) VALUES ($1, $2)',
[idempotencyKey, transactionResult]
);
await db.query('COMMIT');
return res.status(201).json(transactionResult);
} catch (error) {
await db.query('ROLLBACK');
// Handle error
}
}
This pattern provides ironclad protection against duplicate operations caused by retries.
Edge Case: Handling Non-Compensatable Failures
What happens when a compensation fails? Imagine the createShipment activity fails, and the Saga correctly calls refundPayment. But the PaymentService is now down, and refundPayment itself fails repeatedly. This is a business-critical failure that often requires human intervention.
We can model this in Temporal using more aggressive retry policies for compensations and a final escalation path.
1. Define a Separate, More Aggressive Activity Proxy for Compensations:
// src/workflows.ts
// ... regular activity proxy
const compensationActivities = wf.proxyActivities<typeof activities>({
startToCloseTimeout: '5 minutes', // Give more time for compensations
retry: {
initialInterval: '5 seconds',
backoffCoefficient: 2,
maximumInterval: '5 minutes',
// We might retry indefinitely until a human intervenes
maximumAttempts: 0,
},
});
2. Use a Signal for Human Intervention:
Temporal workflows can wait for external events called Signals. We can define a signal manualRefundProcessed that an operator can send to the workflow after they've manually refunded the customer through the payment provider's dashboard.
// src/workflows.ts (enhanced catch block)
// ... imports
// Define a signal
export const manualRefundSignal = wf.defineSignal<[string]>('manualRefundProcessed');
export async function orderFulfillmentSaga(orderDetails: OrderDetails): Promise<{ trackingNumber: string }> {
const compensationStack: (() => Promise<void>)[] = [];
let capturedPayment = false;
let transactionId: string | null = null;
let refundFailed = false;
// Define a handler for the signal
wf.setHandler(manualRefundSignal, (refundConfirmationId) => {
wf.log.info(`Manual refund confirmed with ID: ${refundConfirmationId}`);
refundFailed = false; // Unblock the workflow
});
try {
const paymentResult = await authorizePayment(orderDetails);
transactionId = paymentResult.transactionId;
// ... rest of try block
} catch (err) {
if (capturedPayment && transactionId) {
try {
await compensationActivities.refundPayment(transactionId);
} catch (refundErr) {
wf.log.error('CRITICAL: Payment refund failed. Awaiting manual intervention.');
refundFailed = true;
// This is a durable timer. The workflow will sleep here, potentially for days.
await wf.condition(() => !refundFailed);
wf.log.info('Manual refund signal received. Continuing compensation.');
}
}
// Execute remaining compensations
for (const compensation of compensationStack.reverse()) {
// ...
}
throw err;
}
}
In this advanced pattern:
* If refundPayment fails after all its retries, the workflow logs a critical error and sets a refundFailed flag.
* await wf.condition(() => !refundFailed) effectively pauses the workflow indefinitely. Your observability stack should trigger an alert from the critical log message.
* An engineer investigates, manually processes the refund, and then uses the Temporal CLI or SDK to send the manualRefundProcessed signal to the specific workflow execution.
* The signal handler updates the refundFailed flag, the condition becomes true, and the workflow unblocks to continue its compensation logic.
This demonstrates how Temporal can manage escalations and human-in-the-loop workflows, a notoriously difficult problem to solve with traditional systems.
Performance Pattern: The Claim Check for Large Payloads
Temporal's event history, which records every step of a workflow's execution, has a size limit (typically 50MB). If your Saga passes large data objects (e.g., a multi-megabyte user profile or a product image) between activities, you can quickly exhaust this limit, causing the workflow to fail.
The solution is the Claim Check pattern. Instead of passing the large payload directly, you:
- Create an activity that uploads the payload to an external blob store (like S3 or GCS).
- This activity returns a reference or URI (the "claim check
- Pass this small claim check string through the workflow.
- Any subsequent activity that needs the data uses another activity to download it from the blob store using the claim check.
Implementation:
// src/activities/claimCheck.ts
import { S3Client, PutObjectCommand, GetObjectCommand } from "@aws-sdk/client-s3";
const s3 = new S3Client({});
const BUCKET_NAME = process.env.S3_BUCKET!;
export async function upload(data: Buffer, key: string): Promise<string> {
const command = new PutObjectCommand({
Bucket: BUCKET_NAME,
Key: key,
Body: data,
});
await s3.send(command);
return `s3://${BUCKET_NAME}/${key}`;
}
export async function download(claimCheck: string): Promise<Buffer> {
const url = new URL(claimCheck);
const key = url.pathname.substring(1);
const command = new GetObjectCommand({
Bucket: BUCKET_NAME,
Key: key,
});
const response = await s3.send(command);
return Buffer.from(await response.Body.transformToByteArray());
}
Workflow Usage:
// src/workflows.ts
import * as claimCheckActivities from './activities/claimCheck';
// ...
const { upload, download } = wf.proxyActivities<typeof claimCheckActivities>({ ... });
async function processLargeDataWorkflow(largePayload: Buffer) {
const claimCheck = await upload(largePayload, `workflow-${wf.info().workflowId}/input.dat`);
// Now pass the small `claimCheck` string to other activities
await someOtherActivity(claimCheck);
}
async function someOtherActivity(claimCheck: string) {
// Activity implementation
const data = await download(claimCheck);
// ... process data
}
This pattern keeps your workflow history lean and performant, decoupling the workflow orchestration logic from the large data payloads it operates on.
Conclusion: Beyond State Machines
Implementing the Saga pattern correctly is a significant architectural hurdle in any microservices environment. Traditional approaches force engineers to build and maintain complex, ad-hoc state machines using databases, queues, and event streams. This infrastructure is brittle, hard to test, and lacks visibility.
Temporal shifts the paradigm. By providing a durable virtual memory for your code, it allows you to express complex, long-running, fault-tolerant business logic as a simple, cohesive function. The intricate machinery of state persistence, task queuing, retries, and timers is abstracted away by the platform.
By leveraging advanced patterns like application-level idempotency, human-in-the-loop escalations, and the Claim Check pattern, you can use Temporal to build highly resilient, observable, and scalable Sagas that form the backbone of your distributed application. The orchestrator is no longer a complex liability but a simple, powerful, and testable piece of your core business logic.