Three parts ago, while splitting a monolith into microservices, we mentioned a bill we would eventually have to pay: the distributed-systems tax. Then we wired services together with events and gave their data a private home. This is the part where the bill comes due.
The tax is simple to name and hard to dodge: the moment a call leaves your process, it can be slow, it can fail, and it can happen twice — and none of that is an exception, it is Tuesday. Resilience is the discipline of building a system that stays standing anyway. Not a system that never fails — that does not exist — but one that fails small, fails fast, and recovers on its own.
The comforting lies we tell about the network
In 1994, engineers at Sun catalogued the fallacies of distributed computing — the false assumptions every team makes when they first spread work across machines:
- The network is reliable. It is not. Packets drop, connections reset.
- Latency is zero. It is not. A call across a data centre is glacial next to a function call.
- Bandwidth is infinite, the topology never changes, transport cost is zero. None of these hold.
Inside a monolith, a method call is effectively instant and never "half-happens." The instant you replace it with a network call, all of those fallacies become your problem. Everything below is a tool for treating the network as the unreliable thing it actually is.
Timeouts: never wait forever
This is the most important resilience tool and the most commonly forgotten one. A call with no timeout is an outage with a delay. Here is the chain reaction: a downstream service hangs; your call waits; the thread (or connection) handling that request is stuck; more requests arrive and grab more stuck threads; within seconds your healthy service has no threads left and falls over too. One slow dependency took down a service that was working fine.
// A call with no timeout can hang until your whole service starves
const unbounded = await fetch(url) // waits... forever?
// Bound every outbound call. A fast failure beats an endless wait.
const bounded = await fetch(url, { signal: AbortSignal.timeout(800) }) // rejects with a TimeoutError after 800 ms
The rule is absolute: every call that leaves your process gets a timeout. A request that fails in 800 ms is recoverable; one that hangs for 60 seconds takes the whole node down with it.
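Not every client accepts an abort signal the way `fetch` does. For those, the same bound can be approximated by racing the call against a timer. The sketch below is a minimal illustration; `withTimeout` is a hypothetical helper name, and racing only stops the caller from waiting, it does not cancel the work already under way.

```javascript
// Hypothetical helper: reject if `promise` has not settled within `ms`.
// Caveat: this frees the caller; it does not cancel the underlying operation.
function withTimeout(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer so it cannot leak.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```

Where a client does support cancellation, passing a signal as in the `fetch` example is preferable, because it also releases the connection.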
Retries: helpful, until they are a stampede
Many failures are transient — a blip, a brief overload, a node restarting. Retrying often works. But the naive retry is a loaded gun pointed at your own foot.
Picture a service that slows under load. Every caller times out and immediately retries. Now the struggling service receives double the traffic at the worst possible moment, slows further, triggers more retries… and you have a retry storm that turns a wobble into an outage. The fixes are well known:
- Exponential backoff. Wait longer between each attempt: 100 ms, 200, 400, 800. Give the dependency room to breathe.
- Jitter. Add randomness so a thousand clients do not retry in lockstep on the same tick — synchronised retries are their own stampede.
- A retry budget. Cap attempts (say, three) and then give up gracefully. Infinite retries just move the outage around.
- Only retry idempotent operations. Which is the whole next section.
Retries without backoff and jitter do not add resilience — they add a self-inflicted DDoS. The most damaging outages are often not the original failure but the retry storm the clients unleash trying to recover from it. Be gentle with a service that is already on its knees.
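The first three fixes fit in a few lines. The sketch below is one possible shape, not a definitive implementation: `sleep`, `backoffDelay` and `retry` are hypothetical names, and the jitter strategy assumed here is "full jitter" (a random delay between zero and the exponential ceiling), so actual waits are at most 100, 200, 400 ms and so on rather than exactly those values.

```javascript
// Hypothetical helpers; names and defaults are illustrative assumptions.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Full jitter: a random delay between 0 and the exponential ceiling.
function backoffDelay(attempt, baseMs = 100, capMs = 2000) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}

// Call `fn` up to `maxAttempts` times, then rethrow the last error.
// Only wrap operations that are safe to repeat.
async function retry(fn, { maxAttempts = 3, baseMs = 100, capMs = 2000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // No point sleeping after the final attempt.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelay(attempt, baseMs, capMs));
      }
    }
  }
  throw lastError;
}
```

A call site might look like `await retry(() => fetch(url, { signal: AbortSignal.timeout(800) }))`, which keeps the timeout on every individual attempt.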
Idempotency: the price of being allowed to retry
A retry is only safe if doing the operation twice has the same effect as doing it once. That property is idempotency, and it is the thread tying this whole series together: in part seven, brokers deliver at-least-once; in part eight, the outbox relay retries; here, your own client retries. In every case the same defence applies — make the operation safe to repeat.
The standard technique is an idempotency key: the caller attaches a unique id, and the receiver remembers which ids it has already processed.
// The caller sends a stable key; the server records it once.
async function charge(req) {
  if (await processed.has(req.idempotencyKey)) {
    return processed.get(req.idempotencyKey) // replay the first result, don't charge again
  }
  const result = await paymentGateway.charge(req)
  await processed.set(req.idempotencyKey, result)
  return result
}
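One gap in the check-then-record flow above: two copies of the same request arriving at nearly the same moment can both pass the `has` check before either result is stored. Within a single process, one way to close that window is to record the pending promise before any `await`. The sketch below is an in-memory illustration only; `chargeOnce`, `inFlightOrDone` and `doCharge` are hypothetical names, and a real service would need a shared, durable store with an atomic insert.

```javascript
// In-memory sketch: idempotencyKey -> Promise of the result.
const inFlightOrDone = new Map();

function chargeOnce(req, doCharge) {
  const existing = inFlightOrDone.get(req.idempotencyKey);
  if (existing) return existing; // duplicate: share the first call's outcome

  // Register the promise synchronously so a concurrent duplicate sees it.
  const pending = Promise.resolve()
    .then(() => doCharge(req))
    .catch((err) => {
      // Assumption: a rejection means the charge did not happen,
      // so forget the key and let a later retry try again.
      inFlightOrDone.delete(req.idempotencyKey);
      throw err;
    });
  inFlightOrDone.set(req.idempotencyKey, pending);
  return pending;
}
```

The choice to forget the key on failure is itself an assumption: if the gateway can fail ambiguously (for example, a timeout after the money moved), the safer design keeps the key and reconciles instead.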
You will see vendors promise exactly-once delivery. Across an unreliable network it is essentially unachievable in the general case; what real systems build is at-least-once delivery plus idempotent handling, which is indistinguishable from exactly-once from the outside — and far more honest about how the world works.
Circuit breakers: stop hammering what is already down
When a dependency is genuinely down — not blipping, down — retrying is worse than useless. Every doomed call wastes a timeout, ties up a thread, and delays the error your user is going to get anyway. A circuit breaker is a small piece of state that notices the dependency is failing and starts failing fast instead, giving the dependency room to recover and your callers an instant answer.
It borrows the metaphor from electrical wiring. Closed: current flows, calls go through. Too many failures and it trips Open: calls fail immediately without even trying. After a cooldown it goes Half-Open and lets a single trial call through — if that succeeds, it closes again; if it fails, it re-opens and waits longer. The point is humane on both ends: the sick service stops being flooded, and the caller stops burning resources on calls it knows will fail.
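The three states can be sketched as a small class. This is a minimal illustration under assumed names and defaults (`CircuitBreaker`, `failureThreshold`, `cooldownMs`, an injectable `now` clock), not a production library. It counts consecutive failures and, for brevity, uses a fixed cooldown rather than the lengthening one described above.

```javascript
// Hypothetical class; thresholds and cooldown are illustrative defaults.
class CircuitBreaker {
  constructor({ failureThreshold = 5, cooldownMs = 10000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock, handy for tests
    this.state = "closed";
    this.failures = 0; // consecutive failures
    this.openedAt = 0;
  }

  async call(fn) {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open: failing fast");
      }
      this.state = "half-open"; // cooldown elapsed: this call is the trial
    } else if (this.state === "half-open") {
      // A trial is already in flight; everyone else keeps failing fast.
      throw new Error("circuit half-open: trial in progress");
    }
    try {
      const result = await fn();
      if (this.state !== "open") {
        this.state = "closed";
        this.failures = 0;
      }
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open"; // trip, or re-open after a failed trial
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

Callers typically keep one breaker per dependency and treat the "circuit open" error as the cue to use a fallback instead of waiting.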
Bulkheads: contain the damage
A ship is divided into watertight compartments so a single hull breach floods one section, not the whole vessel. The bulkhead pattern applies the same idea to resources. If every outbound dependency draws from one shared pool of threads or connections, then one slow dependency can drain the entire pool and starve calls to healthy dependencies. Give each dependency its own bounded pool, and a failure in one is contained to one.
This is why a single struggling third-party API can, in a poorly isolated system, take down features that have nothing to do with it. Bulkheads turn "the whole app is down" into "one feature is degraded."
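In a runtime without thread pools, the same compartment idea can be approximated by capping concurrent calls per dependency. The sketch below is one simple variant that rejects overflow instead of queueing it; `Bulkhead`, the dependency names and the limits are hypothetical.

```javascript
// Hypothetical class: caps in-flight calls to one dependency, rejects the rest.
class Bulkhead {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.active = 0;
  }

  async run(fn) {
    if (this.active >= this.maxConcurrent) {
      throw new Error("bulkhead full: rejecting call");
    }
    this.active += 1;
    try {
      return await fn();
    } finally {
      this.active -= 1; // always release the slot
    }
  }
}

// One compartment per dependency, so a slow one cannot use up the others' slots.
const bulkheads = {
  payments: new Bulkhead(10),
  recommendations: new Bulkhead(2),
};
```

With this arrangement, a hanging recommendations service can occupy at most two slots, and calls through `bulkheads.payments` are unaffected.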
Graceful degradation: a worse answer beats no answer
Resilience is not only about staying up — it is about staying useful when a piece is missing. When a dependency fails, the question is: what can we still do?
- Serve stale data. A slightly out-of-date cached price beats a blank page.
- Fall back to a default. Recommendations engine down? Show the bestsellers list.
- Shed non-essential features. Drop the live-inventory badge so customers can still check out.
The mindset shift is to treat every dependency as optional until proven otherwise, and to decide the fallback before the incident, not during it. A checkout that works without the recommendation widget is resilient; one that white-screens because a side feature failed is not.
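Deciding the fallback ahead of time can be as small as a wrapper that tries the live call, then the last good value, then a default. The sketch below assumes hypothetical names (`withFallback`, `lastGood`) and an in-memory cache; a real system would also bound how stale a value is allowed to be.

```javascript
// Hypothetical helper: live result, else last good value, else a default.
const lastGood = new Map(); // cacheKey -> most recent successful result

async function withFallback(cacheKey, fetchLive, defaultValue) {
  try {
    const fresh = await fetchLive();
    lastGood.set(cacheKey, fresh);
    return { value: fresh, source: "live" };
  } catch (err) {
    if (lastGood.has(cacheKey)) {
      return { value: lastGood.get(cacheKey), source: "stale" };
    }
    return { value: defaultValue, source: "default" };
  }
}
```

Returning the `source` alongside the value lets the caller label stale data in the UI or count how often the fallback is being served.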
Observability: you cannot operate what you cannot see
Every pattern here fails silently by design. A call that times out and is quietly retried, a breaker that quietly opens, a queue that quietly backs up: each hides the symptom, which is exactly the danger. Without visibility, your first signal is an angry customer.
The three pillars are not optional in a distributed system: metrics (latency, error rate, breaker state), logs (what happened), and distributed tracing (following one request across every hop, the only way to find which of eight services added the latency). Build the dashboards, and the alerts on dead-letter queue depth and breaker state, before you go live, not after the first 2 a.m. page.
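For the metrics pillar, the raw material is a count, an error count and a duration per outbound call. The sketch below keeps those in memory under hypothetical names (`metrics`, `instrumented`); a real service would export them to whatever metrics system it already runs.

```javascript
// Hypothetical in-memory registry: dependency name -> { calls, errors, totalMs }.
const metrics = new Map();

async function instrumented(name, fn) {
  const m = metrics.get(name) ?? { calls: 0, errors: 0, totalMs: 0 };
  metrics.set(name, m);
  const start = Date.now();
  try {
    return await fn();
  } catch (err) {
    m.errors += 1; // count the failure, then let it propagate
    throw err;
  } finally {
    m.calls += 1;
    m.totalMs += Date.now() - start;
  }
}
```

Error rate is then `errors / calls` and mean latency is `totalMs / calls`, which is enough to drive a first dashboard or alert.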
The playbook, on one page
None of these patterns is exotic. The skill is matching the right one to the failure in front of you — and not reaching for all of them at once.
| The failure you will actually see | The pattern that answers it |
|---|---|
| A dependency is slow or hanging | A timeout on every outbound call — non-negotiable |
| A brief, transient blip | Retry with exponential backoff + jitter + a budget |
| A dependency is fully down | A circuit breaker — fail fast, stop hammering it |
| The same request arrives twice | An idempotency key — safe to repeat |
| One slow dependency starves everything | Bulkheads — isolated resource pools |
| A non-critical piece is unavailable | Graceful degradation — a pre-decided fallback |
| "Is this even happening?" | Observability — metrics, logs, distributed tracing |
The honest view by company size
- Solo / early startup. You need exactly two things: timeouts on every external call, and idempotency on anything that takes money or sends a message. That is most of the protection for very little effort. Skip the rest until you feel its absence.
- Growing scale-up. Add retries with backoff and jitter, and a circuit breaker around your flakiest dependencies. Adopt a tracing tool the day a bug takes more than an hour to locate. Decide fallbacks for your top user flows.
- Enterprise. Resilience becomes systematic: bulkheads and breakers as defaults in shared libraries or a service mesh, error budgets and SLOs, and chaos testing that deliberately breaks things in production to prove the patterns actually work. The goal is that no single dependency can ever take the whole system down.
Key takeaways
- The network lies. It is unreliable, slow, and will duplicate work. Resilience is designing for that as the normal case, not the exception.
- Timeouts first, always. An unbounded call is an outage waiting to happen; one slow dependency can starve a healthy service. Bound every outbound call.
- Retries need backoff, jitter, and a budget. Naive retries turn a wobble into a retry storm — a DDoS you aimed at yourself.
- Retry safely means idempotently. At-least-once plus idempotent handling is the honest, achievable version of "exactly once."
- Circuit breakers and bulkheads contain failure; graceful degradation keeps you useful; observability lets you see any of it. Decide the fallback before the incident, not during it.
That closes the distributed-systems arc of this Foundations series. We went from one box to many: structuring the code, drawing service boundaries, letting them talk through events, giving their data an honest home, and finally keeping the whole thing standing when the network misbehaves. The through-line was never "use the fancy pattern." It was the same north star as the very first part — buy complexity only when the problem makes you, and pay for it with discipline you can name out loud.