The Outbox Pattern

A practical explanation of the Outbox Pattern: why dual writes fail, how a transactional outbox keeps database state and events aligned, and what teams still need to handle around retries, idempotency, ordering, cleanup, and observability.

By Nguyen Le Phong | May 15, 2026 | 6 min read

Software Architecture
Outbox Pattern
Distributed Systems
Event-Driven Architecture
Reliability
Microservices

The bug started with an order that existed in the database but never arrived anywhere else. The customer had clicked checkout. The order row was there. The payment service was waiting for an event that never came. The support team could see the shape of the problem before engineering had a name for it: one part of the system knew the truth, and the rest of the system was still living in yesterday.

This is the common dual-write problem. A service often needs to do two things for one business action: write to its own database and publish a message for other services. If the database commit succeeds but the broker publish fails, the system has a local truth that nobody else hears. If the publish happens first and the database write fails, other services may react to a fact that never became real. The gap between those two writes is small in code and large in production.

The Outbox Pattern is a calm way to remove that gap. Instead of writing business data and then publishing directly to a broker, the service writes the business data and a message record into an outbox table in the same database transaction. If the transaction commits, both the state change and the intent to publish commit together. If it rolls back, neither exists. A separate relay then reads the outbox table and publishes messages to the broker.
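To make the single-transaction write concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` and `outbox` tables, the `OrderPlaced` event type, and the `fail_before_commit` flag are hypothetical names chosen for illustration; a real service would use its own schema and database driver.

```python
import json
import sqlite3
import uuid


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical tables for illustration; both live in the same database.
    conn.executescript(
        """
        CREATE TABLE orders (
            id          TEXT PRIMARY KEY,
            customer_id TEXT NOT NULL,
            total_cents INTEGER NOT NULL
        );
        CREATE TABLE outbox (
            id           TEXT PRIMARY KEY,  -- event id, reusable by consumers for deduplication
            aggregate_id TEXT NOT NULL,
            event_type   TEXT NOT NULL,
            payload      TEXT NOT NULL,
            created_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            published_at TEXT               -- NULL until a relay has published the row
        );
        """
    )


def place_order(
    conn: sqlite3.Connection,
    customer_id: str,
    total_cents: int,
    fail_before_commit: bool = False,  # test hook that simulates a crash
) -> str:
    order_id = str(uuid.uuid4())
    event = {"order_id": order_id, "customer_id": customer_id, "total_cents": total_cents}
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so the order row and the outbox row share one fate.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, customer_id, total_cents) VALUES (?, ?, ?)",
            (order_id, customer_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox (id, aggregate_id, event_type, payload) VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), order_id, "OrderPlaced", json.dumps(event)),
        )
        if fail_before_commit:
            raise RuntimeError("simulated crash before commit")
    return order_id
```

Nothing here talks to a broker. The only promise the write path makes is that an order row never exists without its outbox row, and the reverse.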

The pattern works because it moves the unreliable boundary to a safer place. The database transaction stays local, where ACID is still available. Publishing to the broker becomes asynchronous and retryable. If the relay crashes after reading a row, it can read it again. If the broker is down, the outbox rows wait. The system may be delayed, but it is no longer silently split between saved state and missing event.
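A single relay pass might look like the sketch below. It assumes a hypothetical `outbox` table with a `seq` column for insertion order and an `attempts` counter, and it takes the broker client as a plain `publish` callable, so no particular broker API is implied.

```python
import sqlite3
from typing import Callable

# Hypothetical outbox table; `seq` records insertion order.
SCHEMA = """
CREATE TABLE outbox (
    seq          INTEGER PRIMARY KEY AUTOINCREMENT,
    id           TEXT NOT NULL UNIQUE,
    payload      TEXT NOT NULL,
    attempts     INTEGER NOT NULL DEFAULT 0,
    published_at TEXT
);
"""


def relay_once(
    conn: sqlite3.Connection,
    publish: Callable[[str, str], None],
    batch_size: int = 100,
) -> int:
    """Publish pending rows in insertion order and return how many were sent."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published_at IS NULL ORDER BY seq LIMIT ?",
        (batch_size,),
    ).fetchall()
    published = 0
    for event_id, payload in rows:
        try:
            publish(event_id, payload)
        except Exception:
            # Broker unavailable: record the attempt and stop. The row stays
            # pending, so the next pass picks it up again.
            with conn:
                conn.execute(
                    "UPDATE outbox SET attempts = attempts + 1 WHERE id = ?", (event_id,)
                )
            break
        # A crash between publish() and this UPDATE would leave the row pending,
        # and the next pass would publish it a second time.
        with conn:
            conn.execute(
                "UPDATE outbox SET published_at = CURRENT_TIMESTAMP WHERE id = ?",
                (event_id,),
            )
        published += 1
    return published
```

Stopping at the first failure is one design choice among several. It keeps the batch in order at the cost of letting one bad row hold up the rows behind it.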

There is one important trade-off: outbox delivery is usually at-least-once, not exactly-once. A relay may publish the same message twice if it crashes after publishing but before marking the row as sent. That means consumers must be idempotent. They need a message id, an event id, or a business key that lets them say, "I have already processed this." The outbox fixes the producer side of dual writes. It does not excuse consumers from careful retry design.
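One common way to make a consumer idempotent is to record each event id in the same transaction as the side effect. The sketch below assumes hypothetical `processed_events` and `payment_requests` tables on the consumer side and uses SQLite's `INSERT OR IGNORE`; other databases have their own equivalents.

```python
import sqlite3

# Hypothetical consumer-side tables for illustration.
SCHEMA = """
CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
CREATE TABLE payment_requests (
    order_id     TEXT PRIMARY KEY,
    amount_cents INTEGER NOT NULL
);
"""


def handle_order_placed(
    conn: sqlite3.Connection, event_id: str, order_id: str, amount_cents: int
) -> bool:
    """Apply the event once. Returns False when it was a duplicate delivery."""
    with conn:
        # The primary key makes a second insert of the same event id a no-op.
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)", (event_id,)
        )
        if cur.rowcount == 0:
            return False  # already processed: skip the side effect
        # The side effect commits in the same transaction as the dedup marker,
        # so a crash cannot leave one without the other.
        conn.execute(
            "INSERT INTO payment_requests (order_id, amount_cents) VALUES (?, ?)",
            (order_id, amount_cents),
        )
    return True
```

This only protects side effects that live in the consumer's own database. Calls to external systems need their own idempotency key.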

Ordering also deserves attention. If one aggregate emits several events, the relay should publish them in a predictable order for that aggregate, or the consumers should be designed not to depend on strict global order. Global ordering across the whole system is usually expensive and unnecessary. The useful question is smaller: which events must be observed in order for this business object to make sense?
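As a rough illustration of per-aggregate ordering without global ordering, the sketch below publishes rows by sequence number and, when one aggregate's event fails, holds back only that aggregate's later events. The `(seq, aggregate_id, payload)` row shape and the `publish` callable are assumptions made for this example.

```python
from typing import Callable, Iterable, List, Set, Tuple

Row = Tuple[int, str, str]  # (seq, aggregate_id, payload), a hypothetical row shape


def publish_in_aggregate_order(
    rows: Iterable[Row], publish: Callable[[str, str], None]
) -> List[int]:
    """Publish rows by sequence number and return the seq values that were sent.

    If publishing fails for one aggregate, its later events are held back so
    consumers never see them out of order. Other aggregates keep flowing.
    """
    blocked: Set[str] = set()
    published: List[int] = []
    for seq, aggregate_id, payload in sorted(rows, key=lambda r: r[0]):
        if aggregate_id in blocked:
            continue  # an earlier event for this aggregate is still pending
        try:
            publish(aggregate_id, payload)
        except Exception:
            blocked.add(aggregate_id)
            continue
        published.append(seq)
    return published
```

With a partitioned broker, using the aggregate id as the message key is a common way to carry the same guarantee through to consumers.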

The outbox table itself needs operational care. It needs status fields, timestamps, retry count, error details, and a cleanup policy. Old published rows should not grow forever. Failed rows should not disappear without a trace. A dashboard showing pending count, oldest unprocessed message, publish latency, and repeated failures turns the outbox from a hidden mechanism into something the team can operate.
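The operational fields and the numbers behind such a dashboard can be sketched as follows. The column names, the three status values, and the retention window are illustrative assumptions, and timestamps are stored as Unix seconds only to keep the arithmetic simple.

```python
import sqlite3
from typing import Optional

# Hypothetical outbox table with operational fields.
SCHEMA = """
CREATE TABLE outbox (
    id           TEXT PRIMARY KEY,
    payload      TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'pending',  -- pending, published, or failed
    attempts     INTEGER NOT NULL DEFAULT 0,
    last_error   TEXT,
    created_at   INTEGER NOT NULL,                 -- Unix seconds
    published_at INTEGER
);
CREATE INDEX outbox_status_created ON outbox (status, created_at);
"""


def outbox_health(conn: sqlite3.Connection, now: int) -> dict:
    """Return the numbers a dashboard or alert would typically watch."""
    pending, oldest = conn.execute(
        "SELECT COUNT(*), MIN(created_at) FROM outbox WHERE status = 'pending'"
    ).fetchone()
    failed = conn.execute(
        "SELECT COUNT(*) FROM outbox WHERE status = 'failed'"
    ).fetchone()[0]
    age: Optional[int] = None if oldest is None else now - oldest
    return {"pending": pending, "oldest_pending_age_s": age, "failed": failed}


def purge_published(conn: sqlite3.Connection, now: int, retention_s: int) -> int:
    """Delete published rows older than the retention window.

    Failed rows are deliberately left in place so they can be inspected.
    """
    with conn:
        cur = conn.execute(
            "DELETE FROM outbox WHERE status = 'published' AND published_at < ?",
            (now - retention_s,),
        )
    return cur.rowcount
```

The age of the oldest pending row is often the most useful single alert, because it rises whether the relay is down, the broker is down, or one row keeps failing.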

There are several implementation styles. Some teams use a polling relay that queries unsent rows every few seconds. Others use change data capture (CDC) to stream committed outbox rows out of the database. Polling is simpler and often enough. CDC can reduce latency and database load at higher scale, but it adds platform complexity. The best choice depends on volume, team maturity, and how quickly other services need to react.
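For the polling style, the loop around the relay can stay very small. The sketch below assumes a hypothetical `drain_once` callable that publishes one batch and returns how many rows it sent. The exponential backoff while idle is my own addition to limit database load, not something the pattern requires, and `sleep` and `max_polls` are injectable only to make the loop easy to test.

```python
import time
from typing import Callable, Optional


def run_polling_relay(
    drain_once: Callable[[], int],
    interval_s: float = 2.0,
    max_interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    max_polls: Optional[int] = None,  # None means run until the process stops
) -> None:
    delay = interval_s
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        try:
            published = drain_once()
        except Exception:
            published = 0  # treat errors like an idle poll and back off
        if published > 0:
            # Work was found: reset the delay and poll again right away,
            # since more rows may be waiting.
            delay = interval_s
            continue
        sleep(delay)
        # Idle or failing: wait longer next time, up to the cap.
        delay = min(delay * 2, max_interval_s)
```

The cap on the delay is also the worst-case added latency for a new event after a quiet period, so it should be chosen with the slowest acceptable reaction time in mind.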

The Outbox Pattern is not a reason to publish every internal detail as an event. A good event still needs a clear meaning, stable schema, versioning plan, and owner. If the domain event is vague, the outbox will only deliver vague messages more reliably. Architecture still has to decide what happened in the business, who needs to know, and which fields are safe to promise over time.

I like the outbox because it is not dramatic. It accepts that databases and brokers cannot share one easy transaction, then builds a small bridge between them with the tools each side can trust. The lesson is practical: when one action must be saved and announced, do not hope two separate writes stay together by luck. Put the announcement inside the same commit, relay it patiently, and make every receiver safe to hear it more than once.

If your team has seen an event go missing, a webhook arrive twice, or a workflow pause because one service knew something the others did not, the outbox is worth discussing. It may not make the system simpler, but it can make the failure mode more honest, visible, and repairable.
