Nguyen Le Phong

The Saga Pattern for Distributed Transactions

A practical explanation of the Saga pattern for distributed transactions: why one database transaction stops working across services, how orchestration and choreography differ, and why compensation, observability, and idempotency matter.

A teammate opens the order dashboard after lunch and points at one row with a tired smile. The payment succeeded, the inventory reservation failed, the email still went out, and support is now asking whether the customer owns the product or not. Nobody did anything careless. The work simply crossed too many service boundaries for one simple transaction to protect it.

Inside a single database, a transaction feels comforting. Either all the writes commit, or all of them roll back. An order row, a payment row, and a stock update can move together under one reliable boundary. In a distributed system, that boundary often disappears. Payment may live in one service, inventory in another, shipping in another, and notification somewhere else. Each service owns its own data, and asking one database transaction to cover all of them usually means adding tight coupling, slow locks, or an operational burden the team did not intend to carry.

The Saga pattern is one way to handle that reality. Instead of pretending the whole business process can be one atomic transaction, a saga breaks it into a sequence of local transactions. Each step commits inside its own service. If a later step fails, the system runs compensating actions to undo or reduce the effect of earlier steps. It is less like pressing one giant save button and more like coordinating a careful workflow where every participant knows what to do next and what to do if the plan changes.

A common example is order checkout. First the order service creates a pending order. Then the payment service charges the customer. Then inventory reserves stock. Then shipping prepares a delivery request. If inventory cannot reserve stock after payment succeeds, the system cannot simply roll back the payment database from the inventory service. It needs a compensating action, such as refunding or voiding the charge and marking the order as failed. The customer experience may still need a clear message, and support needs enough evidence to understand the path that happened.
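The checkout flow above can be sketched as a list of steps, each paired with a compensating action. This is a minimal in-memory sketch under stated assumptions, not a production implementation: `Step`, `run_saga`, and every step function are hypothetical names, and in a real system each action would be a call to a separate service rather than a local function.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]      # local transaction inside one service
    compensate: Callable[[dict], None]  # undoes or reduces the effect of action


def run_saga(steps: List[Step], ctx: dict) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action(ctx)
            completed.append(step)
        except Exception as exc:
            ctx["log"].append(f"{step.name} failed: {exc}")
            for done in reversed(completed):
                done.compensate(ctx)
            return False
    return True


# Hypothetical stand-ins for the order, payment, and inventory services.
def create_order(ctx: dict) -> None:
    ctx["order"] = "pending"
    ctx["log"].append("order created")


def fail_order(ctx: dict) -> None:
    ctx["order"] = "failed"
    ctx["log"].append("order marked failed")


def charge(ctx: dict) -> None:
    ctx["charged"] = True
    ctx["log"].append("payment charged")


def refund(ctx: dict) -> None:
    ctx["charged"] = False
    ctx["log"].append("payment refunded")


def reserve_stock(ctx: dict) -> None:
    # Simulates the failure described above: payment succeeded, stock did not.
    raise RuntimeError("out of stock")


def release_stock(ctx: dict) -> None:
    ctx["log"].append("stock released")


checkout = [
    Step("create_order", create_order, fail_order),
    Step("charge_payment", charge, refund),
    Step("reserve_stock", reserve_stock, release_stock),
]

ctx = {"log": []}
ok = run_saga(checkout, ctx)
```

Here the stock reservation fails after payment succeeds, so the sketch refunds the charge and marks the order as failed, in that order. It assumes compensations themselves never fail; a real saga would also need retries and a durable record of which steps completed.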

There are two common ways to coordinate a saga. In orchestration, one central coordinator tells each service what to do: create order, charge payment, reserve stock, arrange shipping, send email. The coordinator also decides what compensation to run when a step fails. This can be easier to understand because the workflow is visible in one place. The trade-off is that the coordinator can end up knowing too much about every service's internals and become a dependency in every release.

In choreography, there is no central conductor. Services publish events and react to events from others. The order service publishes OrderCreated. Payment listens and publishes PaymentCaptured. Inventory listens and publishes StockReserved or StockReservationFailed. This can reduce direct coupling, but it can also make the full business flow harder to see. A new engineer may need to follow several event handlers across several repositories before understanding why one customer email was sent.
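The event chain above can be sketched with a tiny in-memory event bus. This is only an illustration under assumptions: `EventBus`, the handler functions, the `STOCK` table, and the `PaymentRefunded` event are hypothetical and not part of any real library, and a real system would use a message broker with services in separate processes.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class EventBus:
    """Hypothetical synchronous bus; stands in for a real message broker."""

    def __init__(self) -> None:
        self.handlers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)
        self.history: List[str] = []  # every event type, in publish order

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.history.append(event_type)
        for handler in list(self.handlers[event_type]):
            handler(payload)


bus = EventBus()
STOCK = {"sku-1": 0}  # assumed inventory state: nothing left to reserve


def payment_on_order_created(event: dict) -> None:
    # Payment service: reacts to OrderCreated and captures the payment.
    bus.publish("PaymentCaptured", event)


def inventory_on_payment_captured(event: dict) -> None:
    # Inventory service: reserves stock or announces the failure.
    if STOCK.get(event["sku"], 0) > 0:
        STOCK[event["sku"]] -= 1
        bus.publish("StockReserved", event)
    else:
        bus.publish("StockReservationFailed", event)


def payment_on_reservation_failed(event: dict) -> None:
    # Payment service: compensates by refunding (assumed event name).
    bus.publish("PaymentRefunded", event)


bus.subscribe("OrderCreated", payment_on_order_created)
bus.subscribe("PaymentCaptured", inventory_on_payment_captured)
bus.subscribe("StockReservationFailed", payment_on_reservation_failed)

bus.publish("OrderCreated", {"order_id": "o-1", "sku": "sku-1"})
```

No single function describes the whole flow. The sequence only becomes visible in `bus.history`, which is the kind of trace a team needs once the handlers live in different repositories.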

Neither style is universally better. Orchestration is often useful when the process is business-critical, has many branches, and needs a clear owner. Choreography can work well when the steps are simpler, the services already communicate through events, and the team has strong observability around event flow. The better question is not which pattern sounds more elegant. It is which one your team can debug at 4 p.m. on a normal Tuesday, when one step failed quietly and a customer is waiting.

Compensation is the part that deserves the most honest design. Not every action can be undone cleanly. You can refund a payment, but you cannot unsend an email. You can release inventory, but you may not be able to erase the confusion caused by a confirmation message that arrived too early. This is why saga design is partly technical and partly product thinking. The team needs to decide which states are visible, which messages are delayed until the process is stable, and which failures require human review.

Idempotency is another quiet requirement. Distributed systems retry. Messages arrive twice. A timeout may hide a successful operation. If the payment step runs twice because the caller did not receive the first response, the system should not charge the customer twice. Each step needs a stable operation key, a way to recognize duplicate requests, and a predictable response when the same command arrives again. Without idempotency, a saga can turn a recoverable delay into a second incident.
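A minimal sketch of that idea, assuming the caller supplies a stable operation key with each charge request. `PaymentService`, its `charge` method, and the key format are hypothetical. The in-memory dictionary stands in for a durable store that a real service would update in the same local transaction as the charge itself.

```python
from typing import Dict


class PaymentService:
    """Hypothetical payment step that deduplicates requests by operation key."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}  # operation key -> stored response
        self.charges_made = 0

    def charge(self, operation_key: str, amount_cents: int) -> dict:
        # Duplicate request: return the stored response, do not charge again.
        if operation_key in self._results:
            return self._results[operation_key]

        self.charges_made += 1
        result = {
            "status": "captured",
            "amount_cents": amount_cents,
            "charge_no": self.charges_made,
        }
        self._results[operation_key] = result
        return result


payments = PaymentService()
first = payments.charge("order-42:charge", 1999)
# The caller timed out and retries with the same key.
retry = payments.charge("order-42:charge", 1999)
```

The retry receives the same response as the first call and `charges_made` stays at one, so a lost response does not become a double charge.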

Observability is what makes a saga livable. Every step should carry a correlation ID. Logs should show the saga ID, current state, command, event, retry count, compensation action, and failure reason. Metrics should show stuck sagas, compensation rate, average completion time, and the steps that fail most often. A small admin view can be more valuable than another diagram because it lets operators answer a simple question: where is this business process right now?
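One way to keep those log fields consistent is to emit a single structured record per saga step. The sketch below is an assumption about how that could look, not a prescribed schema: `SagaLogRecord`, `to_log_line`, and the example values are hypothetical, and only the Python standard library is used.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SagaLogRecord:
    saga_id: str
    correlation_id: str
    state: str
    command: Optional[str] = None
    event: Optional[str] = None
    retry_count: int = 0
    compensation: Optional[str] = None
    failure_reason: Optional[str] = None


def to_log_line(record: SagaLogRecord) -> str:
    """Serialize the record as one JSON line with a stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


line = to_log_line(
    SagaLogRecord(
        saga_id="saga-123",
        correlation_id="req-789",
        state="COMPENSATING",
        command="ReserveStock",
        event="StockReservationFailed",
        retry_count=2,
        compensation="RefundPayment",
        failure_reason="out of stock",
    )
)
```

Because every step writes the same fields, an operator can filter on `saga_id` and read the process history in order, and a metrics job can count states such as `COMPENSATING` without parsing free text.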

The Saga pattern is not a way to make distributed transactions feel simple. It is a way to make their complexity explicit enough to manage. The team trades one strong database boundary for a visible process, local commits, compensating actions, and careful failure handling. That trade is reasonable when service ownership and scaling matter. It is unnecessary ceremony when one database and one transaction would serve the product just fine.

The next time a business flow crosses several services, it helps to sketch the boring parts first. What commits locally? What can fail? What can be compensated? What must never run twice? What will support see when the process gets stuck? If you have lived through a half-finished distributed workflow, that memory is often the best teacher. It reminds us that architecture is not only about connecting services; it is also about giving people a clear path back when the connection breaks.
