When a system was a monolith on one machine, debugging was almost cosy. One log file. One process you could attach a debugger to. One place where the truth lived. Reproduce the bug, read the stack trace, fix it. Distributed systems quietly took that comfort away. A single checkout request now fans out across a dozen services on a dozen machines, and when it is slow or broken, there is no single log file to read and no single machine to SSH into. You cannot fix what you cannot see, and at scale, seeing is a discipline of its own.
That discipline is worth separating from its older cousin. Monitoring is watching the dashboards you built in advance, for the failures you predicted — “CPU high,” “disk full.” Observability is being able to ask brand-new questions of a running system — including ones you never thought to predict — without shipping new code to answer them. Monitoring tells you that something is wrong; observability lets you find out why. At scale you need both, and both are built from three kinds of data.
The three pillars: metrics, logs, traces
Each pillar answers a different question. The classic mistake is trying to make one of them do all three jobs — drowning in logs to compute what a metric should, or building dashboards to chase what a trace would show in seconds.
- Metrics are numbers over time: request count, error count, p99 latency, queue depth. They are cheap to store because they are aggregated, which makes them perfect for dashboards and “is something wrong right now?” alerts. What they cannot tell you is which request or which user. Metrics are the smoke alarm, not the investigation.
- Logs are discrete, detailed events: “order 8123 failed validation: missing shipping address.” They answer “what exactly happened in this specific case?” Their cost is volume — at scale, logs are often the biggest line on the bill — and an unstructured string log is nearly useless to query. Structure them.
- Traces follow one request as it travels across services, broken into timed spans. This is the pillar that distributed systems forced into existence: it answers “where did the time go?” and “which hop failed?” when the answer is smeared across ten machines.
// Unstructured — fine on one machine, useless across fifty
console.log(`Order ${id} failed for ${userId}: ${err.message}`)
// Structured — queryable, and tied to the whole request by trace_id
log.error({
  event: "order_failed",
  trace_id: ctx.traceId, // the same id flows through every service
  order_id: id,
  user_id: userId,
  error: err.message,
})
The thread that ties them together: one trace ID
Here is the single highest-leverage habit in this whole article. Give every request one ID at the edge, and propagate it through every service it touches — stamping it on every log line and every span. This is context propagation, and it is the difference between ten separate, useless log files and one coherent story: “show me everything that happened to request abc-123.”
The open standard for generating and propagating this data is OpenTelemetry (OTel) — one vendor-neutral way to emit metrics, logs, and traces so you are not married to a single backend. Adopt it early; retrofitting trace propagation across a mature fleet is genuinely miserable work.
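To make context propagation concrete, here is a minimal sketch that assumes the W3C Trace Context `traceparent` header, which is the format OpenTelemetry propagates by default. The helper names (`contextFromHeaders`, `outgoingHeaders`, `logLine`) are hypothetical, and in a real service an OpenTelemetry SDK would normally do this work for you.

```javascript
// W3C Trace Context header: version-traceid-parentid-flags, all lowercase hex.
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Math.random keeps the sketch dependency-free; a production service should
// use a cryptographically strong source or let a tracing SDK generate IDs.
function randomHex(byteCount) {
  let out = "";
  for (let i = 0; i < byteCount; i++) {
    out += Math.floor(Math.random() * 256).toString(16).padStart(2, "0");
  }
  return out;
}

// At the edge of each service: reuse the caller's trace ID if one arrived,
// otherwise start a new trace. Every hop gets its own span ID.
function contextFromHeaders(headers) {
  const match = TRACEPARENT.exec(headers.traceparent || "");
  return {
    traceId: match ? match[1] : randomHex(16),
    spanId: randomHex(8),
  };
}

// On every outgoing call: forward the same trace ID, with this hop's span ID
// as the parent. The trailing "01" flag marks the trace as sampled.
function outgoingHeaders(ctx) {
  return { traceparent: `00-${ctx.traceId}-${ctx.spanId}-01` };
}

// On every log line: stamp the trace ID so logs from all services can be joined.
function logLine(ctx, event, fields = {}) {
  return JSON.stringify({ event, trace_id: ctx.traceId, span_id: ctx.spanId, ...fields });
}

// Usage: the edge service starts the trace, a downstream service continues it.
const edge = contextFromHeaders({});
const downstream = contextFromHeaders(outgoingHeaders(edge));
console.log(logLine(downstream, "order_failed", { order_id: 8123 }));
```

Both services log the same `trace_id`, which is what makes "show me everything that happened to this request" a single query rather than a search across ten files.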
Reading a distributed request: the trace waterfall
A trace draws every span of one request on a single timeline. Suddenly the slow span is obvious at a glance — “checkout is slow” becomes “payment-service is slow” in seconds, instead of an afternoon of cross-team Slack archaeology. This is the payoff that justifies the cost of instrumenting everything.
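A rough sketch of what a waterfall view does with span data can make this concrete. The span records below are invented for illustration (the service names other than payment-service, the timings, and the field names are all assumptions), and a real tracing backend draws this for you.

```javascript
// Hypothetical spans for one checkout request, as a tracing backend might return them.
const checkoutSpans = [
  { service: "api-gateway", name: "POST /checkout", startMs: 0, durationMs: 900 },
  { service: "cart-service", name: "load cart", startMs: 20, durationMs: 60 },
  { service: "payment-service", name: "charge card", startMs: 100, durationMs: 760 },
  { service: "email-service", name: "queue receipt", startMs: 870, durationMs: 20 },
];

// Draw each span as a bar on a shared timeline, scaled to `width` characters.
function renderWaterfall(spans, width = 40) {
  const totalMs = Math.max(...spans.map((s) => s.startMs + s.durationMs));
  return spans
    .map((s) => {
      const offset = Math.round((s.startMs / totalMs) * width);
      const length = Math.max(1, Math.round((s.durationMs / totalMs) * width));
      const bar = " ".repeat(offset) + "#".repeat(length);
      return `${s.service.padEnd(16)} ${bar} ${s.durationMs} ms`;
    })
    .join("\n");
}

// The first span is the root (the whole request), so look for the slowest hop beneath it.
function slowestChild(spans) {
  return spans.slice(1).reduce((a, b) => (b.durationMs > a.durationMs ? b : a));
}

console.log(renderWaterfall(checkoutSpans));
console.log("slowest hop:", slowestChild(checkoutSpans).service);
```

With this sample data the longest bar under the root belongs to payment-service, which is the "slow at a glance" effect the waterfall is meant to give.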
Metrics that scale: RED, USE, and the cardinality trap
You do not need a thousand dashboards; you need a few disciplined signals. Two well-worn recipes cover most of it:
- RED, for request-driven services: Rate (requests/sec), Errors (failures/sec), Duration (the latency distribution). Three numbers per service catch most user-facing pain.
- USE, for resources like CPU, disks, and connection pools: Utilization, Saturation, Errors.
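As an illustration of how little RED asks for, the sketch below derives all three signals from a window of request records. The record shape (`status`, `durationMs`), the `redSummary` name, and the choice to count only 5xx responses as errors are assumptions made for the example; a metrics library would normally maintain these as counters and histograms rather than recompute them from raw requests.

```javascript
// Nearest-rank percentile over an ascending array; returns undefined for an empty array.
function percentile(sortedValues, p) {
  const rank = Math.ceil((p / 100) * sortedValues.length);
  return sortedValues[Math.max(0, rank - 1)];
}

// Summarise one service's requests over a window of `windowSeconds`.
function redSummary(requests, windowSeconds) {
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  const errorCount = requests.filter((r) => r.status >= 500).length;
  return {
    ratePerSec: requests.length / windowSeconds, // Rate
    errorsPerSec: errorCount / windowSeconds,    // Errors
    p50Ms: percentile(durations, 50),            // Duration (median)
    p99Ms: percentile(durations, 99),            // Duration (tail)
  };
}

// Usage with a tiny hypothetical window of three requests over one second.
console.log(
  redSummary(
    [
      { status: 200, durationMs: 42 },
      { status: 200, durationMs: 55 },
      { status: 503, durationMs: 1200 },
    ],
    1
  )
);
```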
And then the trap that quietly bankrupts teams: cardinality. A metric label with unbounded values — user_id, request_id — multiplies into millions of distinct time series, and your metrics bill explodes. Keep metric labels low-cardinality (status code, endpoint, region). The high-cardinality detail — which user, which request — belongs in traces and logs, not on a metric.
Never put an unbounded value (a user ID, an email, a full URL with query params) into a metric label. Each unique value creates a brand-new time series that is stored and indexed for the full retention period. This is the most common way a small metrics bill becomes a frightening one overnight. High-cardinality questions go to traces and logs.
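The multiplication is easy to underestimate, so here is the arithmetic as a small sketch. The label counts are made up for illustration, and the result is a worst-case upper bound: a metrics backend only stores the label combinations it actually observes.

```javascript
// Upper bound on time series for one metric: the product of distinct values per label.
function seriesCount(labelCardinalities) {
  return Object.values(labelCardinalities).reduce((total, n) => total * n, 1);
}

// Hypothetical bounded labels: 5 status codes, 20 endpoints, 4 regions.
const bounded = { status: 5, endpoint: 20, region: 4 };

// The same metric after someone adds a user_id label for a million users.
const unbounded = { ...bounded, user_id: 1_000_000 };

console.log(seriesCount(bounded));   // 400
console.log(seriesCount(unbounded)); // 400000000
```

One extra label turns a few hundred series into a potential four hundred million, which is why the per-user detail belongs on traces and logs instead.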
From “is it up?” to “is it good enough?”: SLIs, SLOs, and error budgets
The real maturity leap is to stop measuring the machine and start measuring the user's experience — as a number you can act on.
- An SLI (Service Level Indicator) is a measured ratio of good events: the percentage of requests served under 300 ms, or the percentage that are not 5xx.
- An SLO (Service Level Objective) is the target you commit to: 99.9% over a rolling 30 days.
- The error budget is the magic that falls out of it: 100% − 99.9% = 0.1%, which over a month is roughly 43 minutes of “allowed” failure. Reliability becomes a budget you are permitted to spend.
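The budget arithmetic above can be written out as a short sketch. The function names and the request counts in the usage lines are hypothetical; the 99.9% target and 30-day window come from the text.

```javascript
// Minutes of total failure an SLO allows over its window.
function errorBudgetMinutes(sloTarget, windowDays) {
  return (1 - sloTarget) * windowDays * 24 * 60;
}

// Fraction of the budget still unspent, given request counts in the window so far.
// 1 means untouched, 0 means fully spent, negative means the SLO is already blown.
function budgetRemaining(sloTarget, totalRequests, badRequests) {
  const allowedBad = (1 - sloTarget) * totalRequests;
  return 1 - badRequests / allowedBad;
}

// 99.9% over 30 days allows about 43.2 minutes of failure.
console.log(errorBudgetMinutes(0.999, 30).toFixed(1));

// Hypothetical month so far: 1,000,000 requests, 250 of them bad.
// The budget allows about 1,000 bad requests, so roughly 75% remains.
console.log(budgetRemaining(0.999, 1_000_000, 250).toFixed(2));
```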
This reframes the eternal fight between product (“ship faster”) and engineering (“be reliable”) into a shared number instead of a shouting match. Budget healthy? Ship boldly, take risks. Budget nearly gone? Stop shipping features and stabilise. And, crucially, it tells you what is actually worth waking someone up for.
Alerting that wakes you for the right reasons
The fastest way to turn on-call into hell is to alert on causes. Picture “Disk at 80%” firing at 3 a.m., every night, for a disk that auto-expands: soon nobody reads the alerts, including on the night the real one fires. That is alert fatigue, and it is how genuine outages slip through.
# Alert on the SYMPTOM (users in pain), not the cause (one hot CPU).
# Page only when we are burning the monthly error budget fast.
alert: HighErrorBudgetBurn
expr: error_rate_5m > 14.4 * (1 - 0.999) # 99.9% SLO; a 14.4x burn spends ~2% of a 30-day budget per hour
for: 5m
then: page on-call · attach the runbook link
Two rules fix most of it. Alert on symptoms, not causes — page when users are actually in pain (error-budget burn rate, elevated error rate, latency past the SLO), not when one internal metric twitches. And every page must be urgent and actionable — if there is nothing to do this minute, it is a ticket or a dashboard, not a 3 a.m. page. Attach a runbook to every alert so the woken engineer has a first step, not just a fright.
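The 14.4 multiplier in the rule above is a burn rate: how many times faster than "exactly on budget" the service is failing. The sketch below shows the arithmetic, plus one common refinement that is an addition here rather than part of the rule above: requiring both a long and a short window to exceed the threshold, so a brief spike does not page anyone. The function names and the sample error rates are hypothetical.

```javascript
// Burn rate 1 means the budget lasts exactly the SLO window; 14.4 means 14.4x faster.
function burnRate(errorRate, sloTarget) {
  return errorRate / (1 - sloTarget);
}

// How long a full budget lasts if the current burn rate continues.
function hoursToExhaustion(rate, windowDays = 30) {
  return (windowDays * 24) / rate;
}

// Page only if both windows are burning fast: the long window shows the problem
// is sustained, the short window shows it is still happening right now.
function shouldPage(longWindowErrorRate, shortWindowErrorRate, sloTarget, threshold = 14.4) {
  return (
    burnRate(longWindowErrorRate, sloTarget) > threshold &&
    burnRate(shortWindowErrorRate, sloTarget) > threshold
  );
}

// At a 14.4x burn, a 30-day budget is gone in about 50 hours.
console.log(hoursToExhaustion(14.4).toFixed(0));

// Sustained 2% errors against a 99.9% SLO is a 20x burn: page.
console.log(shouldPage(0.02, 0.02, 0.999)); // true

// The long window is still elevated but the short window has recovered: do not page.
console.log(shouldPage(0.02, 0.0005, 0.999)); // false
```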
The bill is real: sampling and retention
Observability data can quietly cost more than the application it watches — traces and logs especially. Three levers keep it sane. Sample traces: keep 100% of errors and slow requests, but only a small fraction of the boring successful ones (“tail sampling” decides after seeing the outcome). Tier retention: hot and searchable for days, cheap and cold for months. And hold the cardinality line. Decide these on purpose — or your vendor will decide them for you, with an invoice.
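A tail-sampling decision can be sketched in a few lines. This assumes the decision runs once a trace is complete, that trace IDs are random hex strings, and that "slow" means one second or more; the `keepTrace` name, the field names, and the default thresholds are all hypothetical.

```javascript
// Decide whether to keep a finished trace.
function keepTrace(trace, { slowMs = 1000, baselineRate = 0.01 } = {}) {
  if (trace.hasError) return true;              // keep 100% of errors
  if (trace.durationMs >= slowMs) return true;  // keep 100% of slow requests

  // For boring successes, keep a fixed fraction. Deriving the decision from the
  // trace ID (instead of a fresh random number) keeps it repeatable, so any
  // component evaluating the same trace reaches the same answer.
  const bucket = parseInt(trace.traceId.slice(0, 8), 16) / 0x100000000; // in [0, 1)
  return bucket < baselineRate;
}

// Usage with hypothetical traces.
console.log(keepTrace({ traceId: "ffffffff00000000", hasError: true, durationMs: 40 }));   // true
console.log(keepTrace({ traceId: "ffffffff00000000", hasError: false, durationMs: 2500 })); // true
console.log(keepTrace({ traceId: "ffffffff00000000", hasError: false, durationMs: 40 }));   // false
```

The trade-off is that tail sampling has to buffer spans until the request finishes, which costs memory in whatever component makes the decision.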
What to reach for
| The question you're actually asking | The pillar that answers it |
|---|---|
| Is something wrong right now? | Metrics + alerts on SLO burn |
| Where is the time going across services? | Distributed traces (the waterfall) |
| What exactly happened to this one request? | Structured logs, found by trace ID |
| Are we reliable enough to ship, or should we stabilise? | SLOs + error budget |
| Why did this whole class of errors start at 14:02? | All three, correlated by trace ID |
The honest view by company size
- Solo / early startup. You do not need an observability platform. An error tracker (crash and exception alerts), structured logs you can search, and two or three uptime/latency alerts cover the vast majority of incidents. Do one cheap thing now, though: put a trace ID on every log line. It is nearly free today and you will be grateful the first time you grow a second service.
- Growing scale-up. Adopt OpenTelemetry before you are locked in, and propagate the trace ID through every service. Stand up RED metrics and a small set of dashboards. Write your first SLOs on the one or two journeys that truly matter (checkout, login). Start an on-call rotation and ruthlessly delete every alert that is not actionable. Begin sampling traces before the bill teaches you to.
- Enterprise. The three pillars are a platform with an owner. Error-budget policy formally gates releases. Each team owns its own SLOs and on-call. Sampling, retention, and cardinality are governed centrally, because the cost is now a real line item. By this point the hard part is no longer collecting data. It is culture: actually acting on the budgets, keeping alerts honest, and writing the runbooks before the incident, not during it.
Key takeaways
- You can't fix what you can't see. Distributed systems removed the single log file, so seeing the system became a discipline you have to build on purpose.
- Three pillars, three jobs. Metrics ask “is something wrong?”, logs ask “what exactly happened?”, traces ask “where did the time go?” Don't force one to do all three.
- One propagated trace ID is the highest-leverage habit. It stitches ten log files into one story — adopt OpenTelemetry early so you are not locked in.
- Measure the user, not the machine. SLIs and SLOs turn reliability into an error budget: ship boldly when it's healthy, stabilise when it's spent.
- Alert on symptoms, not causes. Every page must be urgent and actionable, or you train people to ignore the one alert that matters.
- Observability costs real money. Sample traces, tier retention, and guard cardinality on purpose — or the invoice will do it for you.
Resilience is building the system to survive failure; observability is the sense organ that lets you notice it failing, find the cause in minutes instead of hours, and learn from it afterwards. Build for failure, then make failure visible. Together, that is most of what “operating” a large system actually means — and it is the quiet difference between a 3 a.m. mystery and a 3 a.m. fix.