Exactly-Once Semantics in Kafka Sagas: An Idempotent Processor Deep Dive
The Fragility of At-Least-Once in Mission-Critical Sagas
In event-driven architectures, the choreographed saga is an elegant pattern for managing distributed transactions. By having services react to each other's events, we achieve high decoupling and resilience. An OrderService emits an OrderCreated event, a PaymentService consumes it and emits PaymentProcessed, and so on. It's a clean, reactive flow. However, the bedrock of most messaging systems, including Apache Kafka's default configuration, is at-least-once delivery. This guarantee, while preventing data loss, introduces a sinister problem in sagas: the high probability of duplicate processing.
Consider a PaymentService consumer that successfully processes a payment but crashes before committing its consumer offset. Upon restart, it will re-consume the same OrderCreated event and, without proper safeguards, charge the customer a second time. This isn't a theoretical edge case; it's an inevitability in any production system subject to deployments, network blips, or broker failures. For senior engineers, simply acknowledging this problem isn't enough. We must architect a solution that provides exactly-once semantics (EOS) from the perspective of our business logic.
This article will not rehash the basics of Kafka's EOS producer or transaction APIs. Instead, we will focus on the most critical and complex part: implementing a truly idempotent consumer using the Kafka Streams Processor API. We will build a stateful, idempotent PaymentProcessor that can safely re-process messages without causing duplicate side effects, leveraging Kafka's own state stores and transactional capabilities to achieve end-to-end exactly-once processing for our saga step.
Our Scenario: A Multi-Step E-Commerce Order Saga
Let's define the business process we'll be securing:
1. OrderService: emits OrderCreated to the orders.v1 topic.
2. PaymentService: consumes OrderCreated, processes payment via an external gateway, emits PaymentSucceeded or PaymentFailed to the payments.v1 topic.
3. InventoryService: consumes PaymentSucceeded, reserves stock, emits InventoryReserved or InventoryOutOfStock to the inventory.v1 topic.
4. OrderService: consumes InventoryReserved to mark the order as complete. It also consumes failure events (PaymentFailed, InventoryOutOfStock) to trigger compensation logic (e.g., marking the order as failed).

The Happy Path:
orders.v1 -> PaymentService -> payments.v1 -> InventoryService -> inventory.v1 -> OrderService
The Failure/Compensation Path (e.g., Payment Fails):
orders.v1 -> PaymentService -> payments.v1 (with PaymentFailed) -> OrderService (to cancel)
Our focus will be on hardening the PaymentService, as it involves a critical, non-retriable external side effect: charging a credit card.
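To keep the later code self-contained, here is a minimal sketch of the event payloads it assumes. The field names and types (eventId, a BigDecimal amount, a tokenized creditCard string) are illustrative assumptions, not prescribed by the saga itself:

package com.example.events;

import java.math.BigDecimal;

// Hypothetical domain events (sketch): Jackson-friendly POJOs with the
// getters referenced by the processor code later in this article.
public class OrderCreated {
    private String orderId;
    private String eventId;     // unique per event; doubles as the idempotency key
    private BigDecimal amount;
    private String creditCard;  // tokenized card reference, never a raw PAN

    public OrderCreated() { }   // required by Jackson
    public OrderCreated(String orderId, String eventId, BigDecimal amount, String creditCard) {
        this.orderId = orderId;
        this.eventId = eventId;
        this.amount = amount;
        this.creditCard = creditCard;
    }
    public String getOrderId() { return orderId; }
    public String getEventId() { return eventId; }
    public BigDecimal getAmount() { return amount; }
    public String getCreditCard() { return creditCard; }
}

// Emitted to payments.v1 on the two possible outcomes (sketch).
record PaymentSucceeded(String orderId, String transactionId) { }
record PaymentFailed(String orderId, String reason) { }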
Why Standard Consumers and Producer Settings Are Insufficient
Many engineers believe setting processing.guarantee=exactly_once_v2 in Kafka Streams (or using the transactional producer/consumer) is a silver bullet. It is not. This setting provides a powerful guarantee: writes to output topics and updates to state stores within a single stream processing task are atomic. It solves the classic "read-process-write" atomicity problem.
However, it does not make interactions with external systems (like a database or a payment gateway API) atomic within that same Kafka transaction. The failure mode remains:
1. The PaymentProcessor reads OrderCreated (Offset: 10).
2. The Kafka Streams transaction begins.
3. The processor makes a successful API call to Stripe, charging the customer.
4. The application crashes before it can write PaymentSucceeded to the output topic and update its internal state store.

Upon restart, the consumer group will resume from the last committed offset, which precedes offset 10. It will re-read the OrderCreated event, and without an idempotency check, it will call the Stripe API again, leading to a double charge.
This is where application-level idempotency, built on top of Kafka's EOS foundation, becomes non-negotiable.
The Idempotent Processor Pattern with Kafka Streams
The solution is to build a stateful processor that remembers which events it has already processed. We can leverage Kafka Streams' first-class support for state stores, which are fault-tolerant, changelog-backed key-value stores co-located with our stream processing tasks.
The Core Logic:
1. Derive a unique idempotency key from each event, e.g., {orderId}-{sagaStep}.
2. Use a persistent KeyValueStore to track the idempotency keys of successfully processed events.
3. For each incoming event:
a. Check: Look up the event's idempotency key in the state store.
b. Act: If the key is not found, perform the business logic (call the payment gateway).
c. Set: If the business logic succeeds, write the output event to the sink topic and store the idempotency key in the state store. This entire "Act-Set" operation must be atomic, which is precisely what exactly_once_v2 provides.
d. If the key is found, it means we've already processed this event. We can safely skip the business logic and, if necessary, re-emit the original result (which we can also store in the state store).
This pattern ensures that even if the event is delivered ten times, the critical side effect happens only once.
Deep Dive: Implementing the `PaymentProcessor`
Let's write the code. We'll use Java and the Kafka Streams Processor API for fine-grained control.
1. Project Setup (Maven Dependencies)
<dependencies>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-streams</artifactId>
<version>3.7.0</version>
</dependency>
<!-- For JSON serialization/deserialization -->
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.15.2</version>
</dependency>
<!-- Add logging framework -->
</dependencies>
2. The State Store Definition
First, we define the name and structure of our state store. This store will hold the idempotency keys we've already processed.
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;
import org.apache.kafka.streams.state.KeyValueStore;
public class PaymentTopology {
public static final String PROCESSED_EVENTS_STORE_NAME = "processed-payments-store";
public static StoreBuilder<KeyValueStore<String, String>> buildStateStore() {
return Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore(PROCESSED_EVENTS_STORE_NAME),
Serdes.String(),
Serdes.String() // Storing the result (e.g., "SUCCEEDED:tx123")
);
}
}
3. The Custom PaymentProcessor
This is the heart of our solution. It implements the Processor interface, giving us access to the ProcessorContext and the state store.
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
// Assume these are your domain objects
import com.example.events.OrderCreated;
import com.example.events.PaymentSucceeded;
import com.example.events.PaymentFailed;
public class IdempotentPaymentProcessor implements Processor<String, OrderCreated, String, Object> {
private static final Logger log = LoggerFactory.getLogger(IdempotentPaymentProcessor.class);
private KeyValueStore<String, String> processedEventsStore;
private ProcessorContext<String, Object> context;
private final PaymentGatewayClient paymentGatewayClient;
public IdempotentPaymentProcessor(PaymentGatewayClient paymentGatewayClient) {
this.paymentGatewayClient = paymentGatewayClient;
}
@Override
public void init(ProcessorContext<String, Object> context) {
this.context = context;
this.processedEventsStore = context.getStateStore(PaymentTopology.PROCESSED_EVENTS_STORE_NAME);
}
@Override
public void process(Record<String, OrderCreated> record) {
String orderId = record.key();
OrderCreated event = record.value();
String idempotencyKey = event.getEventId(); // Assuming a unique ID per event
String existingResult = processedEventsStore.get(idempotencyKey);
if (existingResult != null) {
log.warn("Duplicate event detected for idempotency key: {}. Skipping processing.", idempotencyKey);
// Optional: Re-forward the original result if downstream systems need it.
// For example, if existingResult was a serialized PaymentSucceeded event.
// context.forward(...);
return;
}
try {
// --- CRITICAL SECTION ---
// This is the external side effect.
PaymentGatewayClient.PaymentResult result = paymentGatewayClient.charge(orderId, event.getAmount(), event.getCreditCard());
if (result.isSuccess()) {
PaymentSucceeded successEvent = new PaymentSucceeded(orderId, result.getTransactionId());
// Forward the success event downstream
context.forward(new Record<>(orderId, successEvent, System.currentTimeMillis()));
// Atomically update the state store
processedEventsStore.put(idempotencyKey, "SUCCEEDED:" + result.getTransactionId());
log.info("Payment succeeded for order: {}", orderId);
} else {
throw new PaymentProcessingException(result.getErrorMessage());
}
} catch (PaymentProcessingException | NetworkException e) {
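            // Caveat: a NetworkException leaves the charge outcome ambiguous; the request
            // may have succeeded at the gateway. Recording FAILED here is a simplification;
            // see the note on gateway-side idempotency keys after this listing.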
log.error("Payment failed for order: {}. Emitting failure event.", orderId, e);
PaymentFailed failureEvent = new PaymentFailed(orderId, e.getMessage());
// Forward the failure event for compensation
context.forward(new Record<>(orderId, failureEvent, System.currentTimeMillis()));
// Atomically update the state store to prevent retries of a known failure
processedEventsStore.put(idempotencyKey, "FAILED:" + e.getMessage());
}
}
@Override
public void close() {
// Cleanup resources if needed
}
}
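The processor also assumes a thin gateway abstraction. Here is a minimal sketch whose signatures match the calls above and in the test further below; the interface, its nested result type, and the exception are illustrative, and a concrete implementation such as the hypothetical StripePaymentGatewayClient would wrap the vendor SDK:

import java.math.BigDecimal;

// Hypothetical gateway abstraction (sketch).
public interface PaymentGatewayClient {

    PaymentResult charge(String orderId, BigDecimal amount, String creditCard);

    final class PaymentResult {
        private final boolean success;
        private final String transactionId;
        private final String errorMessage;

        public PaymentResult(boolean success, String transactionId) {
            this(success, transactionId, null);
        }
        public PaymentResult(boolean success, String transactionId, String errorMessage) {
            this.success = success;
            this.transactionId = transactionId;
            this.errorMessage = errorMessage;
        }
        public boolean isSuccess() { return success; }
        public String getTransactionId() { return transactionId; }
        public String getErrorMessage() { return errorMessage; }
    }
}

// Thrown when the gateway reports a non-retriable decline (sketch).
class PaymentProcessingException extends RuntimeException {
    public PaymentProcessingException(String message) { super(message); }
}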
4. Assembling the Topology
Finally, we wire everything together in our main application class.
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import java.util.Properties;
public class PaymentServiceApplication {
public static void main(String[] args) {
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-service");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
// The magic property for EOS
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
// Configure Serdes, etc.
        Topology topology = new Topology();
        // 1. Define the source and sink topics
        String sourceTopic = "orders.v1";
        String sinkTopic = "payments.v1";
        // 2. Wire the source, processor, and sink into the topology
        topology.addSource("OrderSource", sourceTopic)
                .addProcessor(
                        "PaymentProcessor",
                        () -> new IdempotentPaymentProcessor(new StripePaymentGatewayClient()), // DI your client
                        "OrderSource"
                )
                .addSink(
                        "PaymentSink",
                        sinkTopic,
                        "PaymentProcessor"
                );
        // 3. Register the state store and connect it to the processor. Without this
        // connection, context.getStateStore() in init() would fail at startup.
        topology.addStateStore(PaymentTopology.buildStateStore(), "PaymentProcessor");
KafkaStreams streams = new KafkaStreams(topology, props);
streams.start();
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
}
}
With this setup, the put call to processedEventsStore and the forward call (which writes to the payments.v1 topic) are part of the same atomic transaction. If the application crashes after the transaction commits, on restart it will see the key in the store and skip the Stripe call. If it crashes after the Stripe call but before the transaction commits, on restart it will see the idempotency key is missing and re-attempt the Stripe call; this residual window is why the gateway call itself should also carry the idempotency key (Stripe, for example, deduplicates retried requests sent with the same Idempotency-Key header). With both layers in place, the window for duplicate processing is closed.
Advanced Edge Cases and Production Considerations
A senior engineer knows the job isn't done yet. This pattern introduces new complexities that must be managed.
1. State Store Growth and TTL
You cannot store every idempotency key forever. The state store would grow indefinitely. We need a strategy for eviction.
* The Problem: If we evict a key too early, a very late, duplicate message could be re-processed.
* The Solution: Use a TimestampedKeyValueStore and a Punctuator. The Punctuator is a callback that Kafka Streams can invoke on a schedule. We can schedule a daily or hourly punctuator to scan the store and remove keys older than our business SLA for message processing (e.g., older than 7 days).
Example Punctuator for TTL:
// Inside your Processor's init() method. Assumes the store was built as a
// TimestampedKeyValueStore (see below) and retrieved via:
// this.timestampedStore = context.getStateStore(PROCESSED_EVENTS_STORE_NAME);
context.schedule(Duration.ofHours(1), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
    // Scan the store and evict entries older than the retention cutoff.
    try (KeyValueIterator<String, ValueAndTimestamp<String>> iter = timestampedStore.all()) {
        long cutoff = timestamp - Duration.ofDays(7).toMillis();
        while (iter.hasNext()) {
            KeyValue<String, ValueAndTimestamp<String>> entry = iter.next();
            if (entry.value.timestamp() < cutoff) {
                timestampedStore.delete(entry.key);
            }
        }
    }
});
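For the punctuator to read timestamps back, the store itself must be built as a timestamped store. A sketch of the revised builder, reusing the store name and Serdes from earlier; note that writes then go through ValueAndTimestamp.make(value, timestamp) rather than a plain string put:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;
import org.apache.kafka.streams.state.TimestampedKeyValueStore;

public static StoreBuilder<TimestampedKeyValueStore<String, String>> buildTimestampedStateStore() {
    // Same store name as before; each value is now paired with a timestamp
    // that the TTL punctuator compares against its cutoff.
    return Stores.timestampedKeyValueStoreBuilder(
            Stores.persistentTimestampedKeyValueStore(PROCESSED_EVENTS_STORE_NAME),
            Serdes.String(),
            Serdes.String()
    );
}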
This adds operational complexity but is essential for long-running services.
2. Handling Out-of-Order Events
Choreographed sagas are susceptible to out-of-order events. What if an OrderUpdated event arrives before the OrderCreated event? Our current processor has no notion of ordering: the update's idempotency key would be unseen, so it would happily process the event even though its prerequisite never arrived.
* The Problem: A simple idempotent check doesn't handle state progression. You might need to process OrderCreated before you can process OrderUpdated.
* The Solution: Your state store needs to be more sophisticated. Instead of just storing a boolean processed flag, you should store the *state* of the saga for that order. When an event arrives, you check whether the current state allows that event's transition. If not, you can drop it, send it to a dead-letter queue, or buffer it in another state store until the prerequisite event arrives.
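As a concrete, purely illustrative sketch of such a check, assume each order's saga state is stored alongside the idempotency data and consulted before processing (the enum and transition table are assumptions, not from the code above):

import java.util.EnumMap;
import java.util.Map;
import java.util.Set;

// Hypothetical saga state machine: each state lists the event types it can accept.
enum SagaState { NONE, CREATED, PAID, RESERVED, COMPLETED, FAILED }

final class SagaTransitions {
    private static final Map<SagaState, Set<String>> ALLOWED = new EnumMap<>(SagaState.class);
    static {
        ALLOWED.put(SagaState.NONE, Set.of("OrderCreated"));
        ALLOWED.put(SagaState.CREATED, Set.of("OrderUpdated", "PaymentSucceeded", "PaymentFailed"));
        ALLOWED.put(SagaState.PAID, Set.of("InventoryReserved", "InventoryOutOfStock"));
        // ... remaining states elided
    }

    // True if the event may be applied now; otherwise the caller drops,
    // dead-letters, or buffers the event until its prerequisite arrives.
    static boolean canApply(SagaState current, String eventType) {
        return ALLOWED.getOrDefault(current, Set.of()).contains(eventType);
    }
}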
3. Idempotency in Compensation Logic
Your compensation logic must be just as idempotent as your forward-recovery logic. Imagine an InventoryOutOfStock event triggers a refund. If your PaymentService crashes during the refund process, it might restart and issue two refunds.
* The Problem: Compensation actions are also critical side effects.
* The Solution: Apply the exact same idempotent processor pattern to your compensation consumers. The consumer that processes InventoryOutOfStock should use an idempotency key (e.g., {orderId}-refund) and a state store to ensure the refund API is called only once.
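A minimal sketch of that check inside a hypothetical RefundProcessor; the refund() call, RefundResult, and RefundIssued event are assumed names, not part of the code above:

// Inside a hypothetical RefundProcessor.process(Record<String, InventoryOutOfStock> record)
String orderId = record.key();
String idempotencyKey = orderId + "-refund";

if (processedEventsStore.get(idempotencyKey) != null) {
    log.warn("Refund already issued for order: {}. Skipping.", orderId);
    return; // duplicate delivery; the refund side effect already happened
}

RefundResult result = paymentGatewayClient.refund(orderId); // assumed client method
context.forward(new Record<>(orderId, new RefundIssued(orderId, result.getRefundId()), System.currentTimeMillis()));
// Committed atomically with the forward() above under exactly_once_v2
processedEventsStore.put(idempotencyKey, "REFUNDED:" + result.getRefundId());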
4. Testing Stateful Processors
Unit testing this logic is impossible without the right tools. You need to simulate state stores and message flow.
* The Problem: You can't simply new up an IdempotentPaymentProcessor and call process(), because the context and state store will be null.
* The Solution: Use the kafka-streams-test-utils library. It provides a TopologyTestDriver that lets you run your topology in-memory, interact with state stores, and pipe records through your processors.
Example Test with TopologyTestDriver:
import org.apache.kafka.streams.TopologyTestDriver;
import org.apache.kafka.streams.TestInputTopic;
import org.apache.kafka.streams.TestOutputTopic;
import org.apache.kafka.streams.Topology;
import java.util.Properties;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import static org.assertj.core.api.Assertions.assertThat;
import static org.mockito.Mockito.*;
class IdempotentPaymentProcessorTest {
private TopologyTestDriver testDriver;
private TestInputTopic<String, OrderCreated> inputTopic;
private TestOutputTopic<String, Object> outputTopic;
private PaymentGatewayClient mockGateway;
@BeforeEach
void setup() {
// Build topology as in the main app
Topology topology = ...;
mockGateway = mock(PaymentGatewayClient.class);
// Inject mock into processor supplier
topology.addProcessor("PaymentProcessor", () -> new IdempotentPaymentProcessor(mockGateway), ...);
Properties props = ...; // Use mock schema registry if needed
testDriver = new TopologyTestDriver(topology, props);
// Setup test topics
inputTopic = testDriver.createInputTopic(...);
outputTopic = testDriver.createOutputTopic(...);
}

@AfterEach
void tearDown() {
testDriver.close(); // releases state store resources between tests
}
@Test
void shouldProcessEventOnlyOnce() {
OrderCreated event = new OrderCreated("order-123", "evt-abc", ...);
when(mockGateway.charge(any(), any(), any()))
.thenReturn(new PaymentGatewayClient.PaymentResult(true, "txn-xyz"));
// First time
inputTopic.pipeInput("order-123", event);
// Second time (duplicate)
inputTopic.pipeInput("order-123", event);
// Assertions
verify(mockGateway, times(1)).charge(any(), any(), any());
assertThat(outputTopic.getQueueSize()).isEqualTo(1);
assertThat(outputTopic.readRecord().value()).isInstanceOf(PaymentSucceeded.class);
}
}
Performance and Final Considerations
Enabling exactly_once_v2 is not free. It introduces overhead by using a two-phase commit protocol coordinated by the Kafka transaction coordinator. This will increase latency per message and may slightly reduce overall throughput compared to at_least_once.
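One concrete lever is the commit interval: when exactly_once_v2 is enabled, Kafka Streams drops the default commit.interval.ms from 30 seconds to 100 ms, so a transaction commits roughly every 100 ms. Raising it amortizes the two-phase commit over more records, trading end-to-end latency for throughput. A sketch (the 1000 ms value is illustrative, not a recommendation):

props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
// Commit (and thus transaction) interval: larger values batch more records
// per transaction at the cost of higher end-to-end latency.
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000);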
Is it worth it?
* Yes, for any saga step involving financial transactions, inventory management, or any other business process where a duplicate operation causes data corruption or real-world financial loss.
* Maybe not, for less critical operations like updating a read-model, sending notifications (where a duplicate is an annoyance but not a disaster), or high-volume analytics pipelines.
By combining Kafka's transactional primitives with application-level idempotency logic via the Processor API, we move from a fragile, best-effort saga implementation to a robust, fault-tolerant system that can correctly handle the inevitable failures of a distributed environment. This pattern is complex, but for mission-critical systems, it's the professional standard for building reliable, event-driven applications.