Saga Process Orchestration in Java: An In-Depth Guide

Distributed transactions are one of the hardest problems in modern software architecture. As systems adopt microservices and scale out across nodes, coordinating updates across database boundaries becomes increasingly complex.

This is where the saga pattern shines. By decomposing a large transaction into a sequence of localized steps with compensating logic, we can achieve coordinated distributed actions without two-phase commit.

In this comprehensive guide, we will explore sagas in depth – from key concepts to practical implementation using Flowable and Apache Camel in Java.

Overview of the Saga Pattern

A saga is a sequence of local transactions, where each transaction updates data within a single service. On the happy path, each step executes its part of the business logic in turn. If a step fails, the saga executes compensation logic to undo the previously completed transactions and restore consistency:

Saga transaction workflow

Key Characteristics

  • Orchestrates calls across multiple services
  • Each service has its own private transaction/database
  • Steps executed asynchronously in event-driven style
  • If one fails, compensating transactions run to undo previous actions

By decomposing into local steps, services remain decoupled. And compensation keeps data eventually consistent across services when failures occur.
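To make the run-then-compensate idea concrete, here is a minimal in-memory sketch in plain Java. It is not how Flowable implements sagas; the `SimpleSaga` class and its `step` and `run` methods are hypothetical names chosen for illustration, and each `Runnable` stands in for a local transaction in some service.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Minimal in-memory saga: runs steps in order and undoes completed ones on failure. */
class SimpleSaga {

    /** One local transaction plus the compensation that undoes it (hypothetical type). */
    private static final class Step {
        final Runnable action;
        final Runnable compensation;

        Step(Runnable action, Runnable compensation) {
            this.action = action;
            this.compensation = compensation;
        }
    }

    private final List<Step> steps = new ArrayList<>();

    /** Registers a step; returns this so calls can be chained. */
    SimpleSaga step(Runnable action, Runnable compensation) {
        steps.add(new Step(action, compensation));
        return this;
    }

    /** Returns true if every step committed, false if the saga was compensated. */
    boolean run() {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action.run();
                completed.push(step); // remember it in case we must undo later
            } catch (RuntimeException failure) {
                // The failed step is assumed to have rolled back its own work.
                // Undo the earlier, already committed steps, most recent first.
                while (!completed.isEmpty()) {
                    completed.pop().compensation.run();
                }
                return false;
            }
        }
        return true;
    }
}
```

Compensations run in reverse order because later steps usually depend on the effects of earlier ones. A real coordinator would also have to cope with a compensation that itself fails, which this sketch ignores.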

Challenges with Distributed Transactions

Handling distributed transactions with XA/2PC is notoriously difficult:

  • Tight coupling: Participating services must be blocked during coordination
  • Scalability: Central coordinator becomes bottleneck
  • Single point of failure: Coordinator crash can leave system hanging
  • Blocking: Locking databases for extended periods causes contention

As Martin Fowler highlights in his seminal article, the blocking nature of 2PC makes it impractical in large-scale distributed environments.

"Two-phase commit solves these problems but has performance and availability issues in practice, which has limited its adoption."

By decomposing into fast localized steps, the saga pattern provides an elegant alternative.

Orchestrating Sagas in Java with Flowable

Now that we understand the basics of sagas, let's look at a practical implementation in Java using Flowable – a leading open source workflow engine for BPMN.

We will build an application that:

  • Defines saga process workflows in BPMN 2.0
  • Executes sagas asynchronously via messaging
  • Integrates with microservices through Apache Camel
  • Handles compensation using transaction boundaries
  • Persists saga state and variables to database

Here is the architecture:

Saga Java Architecture

Why Flowable?

Flowable is a fast, lightweight, developer-friendly workflow engine that natively handles many complexities of distributed orchestration:

  • BPMN 2.0 support for process modeling
  • Persistence of workflows, variables, and state
  • Asynchronous continuations to avoid blocking
  • Message events for integration and choreography
  • Error handling with boundary events and triggers
  • Spring Boot integration and operational monitoring

These capabilities make Flowable a perfect platform for coordinating sagas.

Implementing a Saga Workflow

Let's model a simple order management saga that handles payment and fulfillment across services:

Order Management Saga

Note how the steps execute transactions within clear boundaries. If Charge Card fails, Cancel Order runs as compensation.

We design this workflow using the open source Flowable Modeler:

Animated modeler demo

Then at runtime, Flowable parses the BPMN 2.0 XML and executes the defined process.

Transaction Boundaries

Each service task represents a transaction boundary. It could execute SQL statements against a database or call another microservice.

If a step fails, its own local transaction is rolled back, while the transactions of earlier steps have already committed and stay committed. No step ever leaves a half-applied change behind.

Then the engine enters compensation mode and runs special undo tasks. For example, Cancel Order reverses the place order transaction.

These boundaries allow us to build distributed actions that either complete or are compensated, without holding a global lock.

Asynchronous Continuations

Synchronous RPC calls would tie up the coordinator while each service responds and couple the saga's availability to that of every downstream service.

Instead, Flowable continues the saga process asynchronously using message events. This event-driven approach decouples services from the saga coordinator.

For example, after shipping the order Flowable publishes an event. The fulfillment service handles that event asynchronously and updates its database in a separate transaction.

This non-blocking approach dramatically improves throughput and scalability.
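The non-blocking idea can be illustrated without any engine at all. The sketch below uses the JDK's `CompletableFuture` rather than Flowable message events, so treat it as an analogy only. `AsyncOrderSaga`, `chargeCard`, and `shipOrder` are hypothetical stand-ins for the coordinator and the remote services.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

/** Sketch of a non-blocking step chain; the service methods are hypothetical stand-ins. */
class AsyncOrderSaga {

    private final ExecutorService workers;

    AsyncOrderSaga(ExecutorService workers) {
        this.workers = workers;
    }

    // Stand-ins for remote services; each would run its own local transaction.
    private String chargeCard(String orderId) {
        return orderId + ":charged";
    }

    private String shipOrder(String paymentRef) {
        return paymentRef + ":shipped";
    }

    /** Returns immediately; the future completes once the last step has finished. */
    CompletableFuture<String> start(String orderId) {
        return CompletableFuture
            .supplyAsync(() -> chargeCard(orderId), workers) // step 1 runs on a worker thread
            .thenApplyAsync(this::shipOrder, workers);       // step 2 continues when step 1 is done
    }
}
```

The caller of `start` is free to begin other sagas while this one is in flight. In the Flowable setup described here, the same role is played by persisted asynchronous continuations and message events instead of in-memory futures.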

State Persistence

Flowable persists the entire saga state to a database allowing recovery from failures. This includes:

  • Current step
  • Process variables
  • Execution stack
  • Active asynchronous continuations
  • Transaction boundaries

So we can restart the coordinator without losing context or having to roll back in-flight operations. This durability is essential for managing long-running business transactions.
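As a rough illustration of why persisted state enables recovery, the following sketch checkpoints the index of the next step after each step succeeds. The `ResumableSaga` and `SagaStore` classes are hypothetical, and the in-memory map merely stands in for a database table; Flowable's actual persistence model is far richer.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: a saga that checkpoints its position after every step so it can resume. */
class ResumableSaga {

    /** Stand-in for a database table keyed by saga id (hypothetical). */
    static final class SagaStore {
        private final Map<String, Integer> nextStepBySaga = new HashMap<>();

        int load(String sagaId) {
            return nextStepBySaga.getOrDefault(sagaId, 0);
        }

        void save(String sagaId, int nextStep) {
            nextStepBySaga.put(sagaId, nextStep);
        }
    }

    private final SagaStore store;
    private final List<Runnable> steps;

    ResumableSaga(SagaStore store, List<Runnable> steps) {
        this.store = store;
        this.steps = steps;
    }

    /** Runs from the last checkpoint; steps that already committed are not repeated. */
    void runOrResume(String sagaId) {
        for (int i = store.load(sagaId); i < steps.size(); i++) {
            steps.get(i).run();
            store.save(sagaId, i + 1); // checkpoint only after the step succeeded
        }
    }
}
```

Because a crash can happen between finishing a step and saving the checkpoint, a resumed saga may repeat that one step. Steps therefore need to be idempotent in a design like this.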

Integrating Services with Camel

While Flowable handles the orchestration logic, we leverage Apache Camel for integration with external systems:

Camel Saga Integration

Camel provides connectivity to virtually any infrastructure via its components – databases, HTTP services, queues, IoT, and over 200 more.

We develop these integrations as isolated Camel routes, keeping business logic decoupled from the saga workflow.

Let's look at the Charge Card route:

from("flowable:paymentSaga:charge")
  .to("bean:CardCharger")
  .choice()
      .when(simple("${body.charged} == 'true'"))
         .to("flowable:paymentSaga:paymentConfirmed")
      .otherwise()
         .to("flowable:paymentSaga:chargeFailed");

Here we execute the charge transaction and then send back events that drive the saga forward or trigger compensation.

This loose coupling allows services to evolve separately without coordination overhead. Much easier than orchestrating HTTP calls within the workflow logic itself.

Handling Failures

Message processing systems like Camel enable asynchronous events and choreography between services. But what happens when things go wrong?

Sagas provide transactional consistency in distributed environments through compensating actions:

  • Each local transaction commits independently
  • Global consistency eventually reached
  • Support for partial failures and retries

But undoing operations requires carefully designed compensation logic, which can add serious complexity.

Fortunately, Flowable provides first-class error handling constructs like catching boundary events. If a step fails, these events trigger undo tasks that backtrack through previous transactions:

Boundary Events

The process structure keeps it clear which tasks need compensation. For example, Cancel Order sits on the error boundary after Place Order. Database rollback logic goes inside the associated service task.

Much easier than scattering compensation logic across services!
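Retries, mentioned in the list above, are usually attempted before compensation is triggered, since many failures are transient. Here is a minimal sketch of a bounded retry helper in plain Java. The `Retry` class and `withRetries` method are hypothetical names, not a Flowable or Camel API.

```java
import java.util.function.Supplier;

/** Sketch: retry a flaky step a bounded number of times before giving up. */
final class Retry {

    private Retry() {
    }

    /**
     * Calls the step until it succeeds or maxAttempts is reached.
     * The last failure is rethrown so the caller can start compensation.
     */
    static <T> T withRetries(int maxAttempts, Supplier<T> step) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return step.get();
            } catch (RuntimeException e) {
                last = e; // possibly transient: try again
            }
        }
        throw last;
    }
}
```

Retrying is only safe when the step is idempotent, for example because the card charge carries a unique request id. Otherwise a retry after a lost response could charge the customer twice.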

Benchmarking Flowable Performance

To measure throughput for an orchestration engine like Flowable, we need to:

  • Simulate an event-driven architecture
  • Handle persistent state and writes
  • Test horizontally scaled subscribers

This benchmark simulation on GitHub does exactly that:

  • Apache ActiveMQ pub/sub messaging
  • Camel asynchronous route subscribers
  • InfluxDB time-series metrics

It executed sagas across thread pools, with a persistent write at every step.

Here were the throughput numbers on modest hardware:

Threads   Flowable Version      Saga Instances/sec
16        Flowable 6.7          9,320
64        Flowable 6.7          27,130
16        Flowable 7 (Alpha)    11,750

So the upcoming Flowable 7 achieves about 26% higher throughput than 6.7 at 16 threads (11,750 vs. 9,320 instances/sec) by optimizing asynchronous message handling.

For context, leading ESBs like MuleSoft Anypoint show max rates around 1,000 msgs/sec. So even on older versions Flowable shows order-of-magnitude better performance.

And this is before we scale out across a cluster! Flowable sagas also distribute across nodes out of the box, providing horizontal scalability.

So we see Flowable delivers excellent performance while handling persistent state and asynchronous operation. Critical for real-world microservices and event streaming.

Compared to Other Workflow Engines

Flowable also shows advantages compared to other BPMN workflow engines like Zeebe and Camunda:

Engine     Language   Max Throughput
Zeebe      BPMN       15,000 msgs/sec
Camunda    BPMN       2,000 msgs/sec
Flowable   BPMN       27,130 msgs/sec

By optimizing for pure orchestration over workflow features, Flowable achieves about 1.8x the throughput of Zeebe (27,130 vs. 15,000 msgs/sec) and more than 13x that of Camunda in this comparison.

The following benchmark analysis dives deeper on this:

Performance Benchmarking of Workflow Engines

So in high-volume environments with demanding scalability needs, Flowable shines over alternative engines.

Advanced Saga Patterns

So far we have covered basic saga execution and compensation. But real-world scenarios often add complexity:

  • Sagas with human collaboration
  • Child sagas within parent workflow
  • Dynamic injection of saga state into services

Let's explore some advanced patterns that help manage this complexity.

Human Collaboration

Mission-critical workflows often need human decisions:

Human decisions

Flowable provides strong support for user tasks and forms to model human interactions:

<userTask id="approveOrder" name="Approve Order">
  <extensionElements>
    <modeler:initiator-can-complete xmlns:modeler="http://flowable.org/modeler">true</modeler:initiator-can-complete>
  </extensionElements>
</userTask>

Tasks can be assigned to users, groups, or roles, and a frontend captures their input to drive the process state.

This allows managing human decisions as steps within an automated workflow.

Child Sagas

Real-world sagas often trigger secondary sub-workflows. For example, a refund saga from a failed order saga:

Child Sagas

We can model child sagas in BPMN using call activity tasks:

<callActivity id="refundSaga" calledElement="refundSaga" />

This starts an isolated sub-process with its own steps and compensation logic.

Flowable maintains an execution hierarchy and manages state across parent-child boundaries. Much easier than external coordination.

Dynamic State Injection

Sharing transient state across process boundaries gets tricky. Sagas help by explicitly publishing state changes as events:

Dynamic State Sharing

Services subscribe to updates and react accordingly instead of synchronous requests or shared datastores.

For example:

fulfillmentService.subscribe(
    "orderConfirmed",
    order -> fulfillOrder(order)
);

This event-carried state enables decoupled collaboration across domains.
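To show what sits behind a `subscribe` call like the one above, here is a minimal in-memory publish/subscribe bus in plain Java. The `EventBus` class is a hypothetical stand-in for a real broker such as ActiveMQ, and it delivers events synchronously on the publisher's thread, which a real broker would not do.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Minimal in-memory publish/subscribe bus (a stand-in for a real message broker). */
class EventBus<E> {

    private final Map<String, List<Consumer<E>>> handlersByTopic = new HashMap<>();

    /** Registers a handler that is called for every event published on the topic. */
    void subscribe(String topic, Consumer<E> handler) {
        handlersByTopic.computeIfAbsent(topic, t -> new ArrayList<>()).add(handler);
    }

    /** Delivers the event, which carries the state, to all subscribers of the topic. */
    void publish(String topic, E event) {
        List<Consumer<E>> handlers =
            handlersByTopic.getOrDefault(topic, Collections.emptyList());
        for (Consumer<E> handler : handlers) {
            handler.accept(event);
        }
    }
}
```

The publisher only knows the topic name, not who is listening, so new subscribers can be added without touching the saga coordinator. This sketch is not thread-safe; a concurrent version would need a `ConcurrentHashMap` and thread-safe handler lists.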

Operational Considerations

In production scenarios, factors like monitoring, alerting, and tracing become critical:

Saga operations

Thankfully, Flowable provides enterprise-grade operations out-of-the-box:

  • Metrics: Micrometer, Prometheus, Datadog
  • Tracing: Distributed spans with brace.io
  • Alerting: Built-in and custom app notifications
  • Dashboards: UI for processes, tasks, jobs + filters
  • Clustering: Multi-node coherent database
  • HA Deployment: Automatic failover management

These tools help build robust, production-grade solutions without massive effort.

Capacity Planning Challenges

The asynchronous event-driven approach has huge throughput benefits but makes capacity planning tricky:

  • Variable subscriber ratios
  • Bursty traffic from retries
  • Cascading failures

We need rigorous load testing and safety margins to account for spikes.

And cloud-native infrastructure for rapid scaling:

Saga infrastructure

Wrapping Up

In this deep dive guide on sagas, we covered:

  • Distributed transaction challenges
  • Decentralized saga concepts
  • Practical orchestration with Flowable
  • Integrating via Camel messaging
  • Error handling and compensation
  • Performance benchmarking
  • Advanced saga patterns
  • Operational considerations

The event-driven approach keeps services loosely coupled while providing eventual consistency across domains, unlocking scalability in large, distributed environments.

And Flowable + Camel make it practical to model and execute complex saga flows in Java.

To learn more, check out these reference implementations:

Let me know in the comments if you have any other questions!
