Nguyen Le Phong

Software Architecture Fundamentals: Part 11 of 11

You Can't Fix What You Can't See: Logs, Metrics, Traces, and SLOs at Scale

In a monolith, debugging was almost cosy — one log file, one process, one place the truth lived. Distributed systems quietly took that away: one request now fans out across a dozen services, and when it breaks there is no single log to read. A no-hype guide to seeing your system at scale: the three pillars (metrics, logs, traces) and the question each one answers, why a single propagated trace ID is the highest-leverage habit you can adopt, how SLOs turn reliability into an error budget you can spend, and how to alert on symptoms so on-call doesn't burn out.

When a system was a monolith on one machine, debugging was almost cosy. One log file. One process you could attach a debugger to. One place where the truth lived. Reproduce the bug, read the stack trace, fix it. Distributed systems quietly took that comfort away. A single checkout request now fans out across a dozen services on a dozen machines, and when it is slow or broken, there is no one log file and nobody to ssh into. You cannot fix what you cannot see — and at scale, seeing is a discipline of its own.

That discipline is worth separating from its older cousin. Monitoring is watching the dashboards you built in advance, for the failures you predicted — “CPU high,” “disk full.” Observability is being able to ask brand-new questions of a running system — including ones you never thought to predict — without shipping new code to answer them. Monitoring tells you that something is wrong; observability lets you find out why. At scale you need both, and both are built from three kinds of data.

[Figure: three views of one running system. Metrics answer "Is something wrong?" (numbers over time, cheap, aggregated). Logs answer "What exactly happened?" (detailed events, costly at volume). Traces answer "Where did the time go?" (one request, across services). One trace ID stamped on all three turns ten log files into one story.]
The three pillars answer three different questions — the mistake is forcing one to do all three jobs. A shared trace ID is the thread that stitches them into a single story.

The three pillars: metrics, logs, traces

Each pillar answers a different question. The classic mistake is trying to make one of them do all three jobs — drowning in logs to compute what a metric should, or building dashboards to chase what a trace would show in seconds.

  • Metrics are numbers over time: request count, error count, p99 latency, queue depth. They are cheap to store because they are aggregated, which makes them perfect for dashboards and “is something wrong right now?” alerts. What they cannot tell you is which request or which user. Metrics are the smoke alarm, not the investigation.
  • Logs are discrete, detailed events: “order 8123 failed validation: missing shipping address.” They answer “what exactly happened in this specific case?” Their cost is volume — at scale, logs are often the biggest line on the bill — and an unstructured string log is nearly useless to query. Structure them.
  • Traces follow one request as it travels across services, broken into timed spans. This is the pillar that distributed systems forced into existence: it answers “where did the time go?” and “which hop failed?” when the answer is smeared across ten machines.
// Unstructured — fine on one machine, useless across fifty
console.log(`Order ${id} failed for ${userId}: ${err.message}`)

// Structured — queryable, and tied to the whole request by trace_id
log.error({
  event: "order_failed",
  trace_id: ctx.traceId,   // the same id flows through every service
  order_id: id,
  user_id: userId,
  error: err.message,
})

The thread that ties them together: one trace ID

Here is the single highest-leverage habit in this whole article. Give every request one ID at the edge, and propagate it through every service it touches — stamping it on every log line and every span. This is context propagation, and it is the difference between ten separate, useless log files and one coherent story: “show me everything that happened to request abc-123.”

The open standard for generating and propagating this data is OpenTelemetry (OTel) — one vendor-neutral way to emit metrics, logs, and traces so you are not married to a single backend. Adopt it early; retrofitting trace propagation across a mature fleet is genuinely miserable work.
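To make context propagation concrete, here is a minimal sketch that assumes services pass the ID in an HTTP header using the W3C Trace Context `traceparent` format, which OpenTelemetry supports. The function names (`continueTrace`, `toTraceparent`, and so on) are hypothetical and exist only for illustration; in a real service the OpenTelemetry SDK would handle this rather than hand-rolled code.

```typescript
// Sketch only: hand-rolled `traceparent` handling to show the idea.
// Header shape: "00-<32 hex trace id>-<16 hex span id>-<2 hex flags>".

interface TraceContext {
  traceId: string; // shared by every hop of one request
  spanId: string;  // unique to the current hop
}

// Random lowercase hex string (illustrative, not cryptographically strong).
function randomHex(length: number): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// Parse an incoming header; return null when it is missing or malformed.
function parseTraceparent(header: string | undefined): TraceContext | null {
  if (!header) return null;
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$/.exec(header);
  if (!match) return null;
  return { traceId: match[1], spanId: match[2] };
}

// Reuse the caller's trace ID if present, otherwise mint one at the edge.
// Each hop gets a fresh span ID but keeps the same trace ID.
function continueTrace(incoming: string | undefined): TraceContext {
  const parent = parseTraceparent(incoming);
  return {
    traceId: parent ? parent.traceId : randomHex(32),
    spanId: randomHex(16),
  };
}

// Header value to attach to every outgoing call ("01" marks it as sampled).
function toTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-01`;
}
```

The edge service would call `continueTrace(undefined)`, log with `ctx.traceId`, and send `toTraceparent(ctx)` downstream; each downstream service calls `continueTrace` on the header it received and ends up with the same trace ID.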

Reading a distributed request: the trace waterfall

A trace draws every span of one request on a single timeline. Suddenly the slow span is obvious at a glance — “checkout is slow” becomes “payment-service is slow” in seconds, instead of an afternoon of cross-team Slack archaeology. This is the payoff that justifies the cost of instrumenting everything.

[Figure: a trace waterfall of one request on a shared 0 to 300 ms time axis. API Gateway calls auth-service, orders-service (with a db query), and payment-service. The payment-service span, at 200 ms, is by far the widest and is the bottleneck.]
A distributed trace draws every span of one request on a single timeline. The slow span is obvious at a glance — no cross-team archaeology required.
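What the eye does with a waterfall can be sketched in code: subtract each span's children from its own duration and see where the time really sits. This is an illustration under assumptions, not how any particular tracing backend works. The `Span` shape is hypothetical, the sketch assumes child spans run one after another rather than concurrently, and every duration except the 200 ms payment span is made up.

```typescript
interface Span {
  id: string;
  parentId: string | null; // null for the root span
  name: string;
  durationMs: number;
}

// Self time = a span's duration minus the time spent in its direct children.
function selfTimes(spans: Span[]): Record<string, number> {
  const self: Record<string, number> = {};
  for (const s of spans) self[s.id] = s.durationMs;
  for (const s of spans) {
    if (s.parentId !== null && s.parentId in self) {
      self[s.parentId] -= s.durationMs;
    }
  }
  return self;
}

// The span with the largest self time is the first place to look.
function bottleneck(spans: Span[]): Span | null {
  const self = selfTimes(spans);
  let worst: Span | null = null;
  for (const s of spans) {
    if (worst === null || self[s.id] > self[worst.id]) worst = s;
  }
  return worst;
}

// Hypothetical trace shaped like the waterfall in the figure.
const exampleTrace: Span[] = [
  { id: "a", parentId: null, name: "API Gateway", durationMs: 290 },
  { id: "b", parentId: "a", name: "auth-service", durationMs: 20 },
  { id: "c", parentId: "a", name: "orders-service", durationMs: 50 },
  { id: "d", parentId: "c", name: "db query", durationMs: 30 },
  { id: "e", parentId: "a", name: "payment-service", durationMs: 200 },
];
```

Here `bottleneck(exampleTrace)` picks out the payment-service span: the gateway's 290 ms is almost entirely time spent waiting on its children.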

Metrics that scale: RED, USE, and the cardinality trap

You do not need a thousand dashboards; you need a few disciplined signals. Two well-worn recipes cover most of it:

  • RED, for request-driven services: Rate (requests/sec), Errors (failures/sec), Duration (the latency distribution). Three numbers per service catch most user-facing pain.
  • USE, for resources like CPU, disks, and connection pools: Utilization, Saturation, Errors.
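As a rough illustration of the RED recipe, the sketch below derives the three signals from raw request records for one time window. It assumes a hypothetical `RequestRecord` shape and treats any 5xx status as an error. Real metrics systems usually compute this from counters and histograms rather than raw records, so read it as a definition of the numbers, not as a pipeline design.

```typescript
interface RequestRecord {
  durationMs: number;
  status: number; // HTTP status code
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Rate, Errors, Duration for one service over one window.
function redMetrics(requests: RequestRecord[], windowSeconds: number) {
  const errorCount = requests.filter((r) => r.status >= 500).length;
  const durations = requests.map((r) => r.durationMs);
  return {
    ratePerSec: requests.length / windowSeconds,
    errorsPerSec: errorCount / windowSeconds,
    p50Ms: percentile(durations, 50),
    p99Ms: percentile(durations, 99),
  };
}
```

Duration is reported as percentiles rather than an average because a handful of very slow requests can hide behind a healthy-looking mean.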

And then the trap that quietly bankrupts teams: cardinality. A metric label with unbounded values — user_id, request_id — multiplies into millions of distinct time series, and your metrics bill explodes. Keep metric labels low-cardinality (status code, endpoint, region). The high-cardinality detail — which user, which request — belongs in traces and logs, not on a metric.

The label that blows up the bill

Never put an unbounded value (a user ID, an email, a full URL with query params) into a metric label. Each unique value creates a brand-new time series that is stored and billed for the whole retention period. This is the most common way a small metrics bill becomes a frightening one overnight. High-cardinality questions go to traces and logs.
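The arithmetic behind the trap is a simple product: in the worst case a metric has one time series per combination of label values. The sketch below uses made-up label counts to show how a single unbounded label changes the total. `seriesCount` is a hypothetical helper, not part of any metrics library.

```typescript
interface LabelCardinality {
  label: string;
  distinctValues: number;
}

// Worst-case number of time series for one metric:
// the product of the distinct values of each label.
function seriesCount(labels: LabelCardinality[]): number {
  return labels.reduce((total, l) => total * l.distinctValues, 1);
}

// Hypothetical numbers for a single request-count metric.
const boundedLabels: LabelCardinality[] = [
  { label: "status_code", distinctValues: 5 },
  { label: "endpoint", distinctValues: 40 },
  { label: "region", distinctValues: 4 },
];

const withUserId: LabelCardinality[] = boundedLabels.concat([
  { label: "user_id", distinctValues: 100000 },
]);

const boundedSeries = seriesCount(boundedLabels); // 5 * 40 * 4 = 800
const explodedSeries = seriesCount(withUserId);   // 800 * 100000 = 80,000,000
```

Not every combination will occur in practice, so this is an upper bound, but the multiplier from one unbounded label is what matters.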

From “is it up?” to “is it good enough?”: SLIs, SLOs, and error budgets

The real maturity leap is to stop measuring the machine and start measuring the user's experience — as a number you can act on.

  • An SLI (Service Level Indicator) is a measured ratio of good events: the percentage of requests served under 300 ms, or the percentage that are not 5xx.
  • An SLO (Objective) is the target you commit to: 99.9% over a rolling 30 days.
  • The error budget is the magic that falls out of it: 100% − 99.9% = 0.1%, which over a month is roughly 43 minutes of “allowed” failure. Reliability becomes a budget you are permitted to spend.
[Figure: one month's error budget at a 99.9% SLO, drawn as a horizontal bar. Of roughly 43 minutes of "allowed" failure over 30 days, about 19 minutes are burned and about 24 minutes are left. Budget healthy: ship features. Budget gone: stop and stabilise.]
An error budget turns reliability into something you can spend. While it's healthy you ship boldly; when it's nearly gone, the same number tells you to stop and stabilise.
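The 43-minute figure is plain arithmetic, and it is worth seeing once. The sketch below treats the SLO as a time-based availability target, which is a simplifying assumption: a request-based SLI would count bad requests instead of bad minutes. The function names are hypothetical.

```typescript
// Total "allowed" bad minutes for an SLO over a rolling window.
function errorBudgetMinutes(slo: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return (1 - slo) * windowMinutes;
}

// Budget still available after some bad minutes have been spent.
function budgetRemainingMinutes(
  slo: number,
  windowDays: number,
  badMinutesSoFar: number
): number {
  return errorBudgetMinutes(slo, windowDays) - badMinutesSoFar;
}

const monthlyBudget = errorBudgetMinutes(0.999, 30);      // about 43.2 minutes
const remaining = budgetRemainingMinutes(0.999, 30, 19);  // about 24.2 minutes
```

Each extra nine is expensive: the same function gives about 4.3 minutes per 30 days at 99.99%.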

This reframes the eternal fight between product (“ship faster”) and engineering (“be reliable”) into a shared number instead of a shouting match. Budget healthy? Ship boldly, take risks. Budget nearly gone? Stop shipping features and stabilise. And, crucially, it tells you what is actually worth waking someone up for.

Alerting that wakes you for the right reasons

The fastest way to make on-call hell is to alert on causes. “Disk at 80%” at 3 a.m., every night, for a disk that auto-expands — and soon nobody reads the alerts, including the night the real one fires. That is alert fatigue, and it is how genuine outages slip through.

# Alert on the SYMPTOM (users in pain), not the cause (one hot CPU).
# Page only when we are burning the monthly error budget fast.
alert: HighErrorBudgetBurn
expr:  error_rate_5m > 14.4 * (1 - 0.999)   # 99.9% SLO, fast-burn
for:   5m
then: page on-call · attach the runbook link

Two rules fix most of it. Alert on symptoms, not causes — page when users are actually in pain (error-budget burn rate, elevated error rate, latency past the SLO), not when one internal metric twitches. And every page must be urgent and actionable — if there is nothing to do this minute, it is a ticket or a dashboard, not a 3 a.m. page. Attach a runbook to every alert so the woken engineer has a first step, not just a fright.
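The 14.4 in the rule above is a burn rate: how many times faster than "exactly on budget" the service is currently failing. A minimal sketch of the arithmetic follows, with hypothetical function names. At a burn rate of 14.4, one hour of failures uses 2% of a 30-day budget, which is why that multiplier is a common choice for a fast-burn page.

```typescript
// Burn rate 1.0 means the budget runs out exactly at the end of the window.
function burnRate(errorRatio: number, slo: number): number {
  return errorRatio / (1 - slo);
}

// Fraction of the window's whole budget used if a burn rate holds for `hours`.
function budgetFractionConsumed(
  rate: number,
  hours: number,
  windowDays: number
): number {
  return (rate * hours) / (windowDays * 24);
}

// Page only when the budget is burning faster than the threshold.
function shouldPage(
  errorRatio: number,
  slo: number,
  threshold: number = 14.4
): boolean {
  return burnRate(errorRatio, slo) > threshold;
}

const fastBurnShare = budgetFractionConsumed(14.4, 1, 30); // about 0.02, i.e. 2%
const pageAtTwoPercentErrors = shouldPage(0.02, 0.999);    // burn rate about 20: page
const pageAtBaseline = shouldPage(0.001, 0.999);           // burn rate about 1: no page
```

A 2% error ratio sounds small, but against a 99.9% SLO it empties the month's budget in about a day and a half, which is the kind of symptom worth a page.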

The bill is real: sampling and retention

Observability data can quietly cost more than the application it watches — traces and logs especially. Three levers keep it sane. Sample traces: keep 100% of errors and slow requests, but only a small fraction of the boring successful ones (“tail sampling” decides after seeing the outcome). Tier retention: hot and searchable for days, cheap and cold for months. And hold the cardinality line. Decide these on purpose — or your vendor will decide them for you, with an invoice.
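The tail-sampling rule described above fits in a few lines. This sketch assumes the decision runs once a trace has finished, so its outcome is known; the `FinishedTrace` and `SamplingPolicy` shapes and the threshold values are hypothetical. The random source is passed in only so the behaviour can be tested deterministically.

```typescript
interface FinishedTrace {
  traceId: string;
  durationMs: number;
  hadError: boolean;
}

interface SamplingPolicy {
  slowThresholdMs: number; // traces at or above this are always kept
  baselineRate: number;    // fraction of ordinary traces to keep, from 0 to 1
}

// Decide after the request completes: keep every error and every slow
// trace, and only a small random fraction of the uneventful ones.
function keepTrace(
  trace: FinishedTrace,
  policy: SamplingPolicy,
  random: () => number = Math.random
): boolean {
  if (trace.hadError) return true;
  if (trace.durationMs >= policy.slowThresholdMs) return true;
  return random() < policy.baselineRate;
}

// Hypothetical policy: keep anything over 1 s, plus 1% of the rest.
const examplePolicy: SamplingPolicy = { slowThresholdMs: 1000, baselineRate: 0.01 };
```

With this policy, storage for ordinary traffic drops by roughly a factor of a hundred while every trace you are likely to investigate is still there.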

What to reach for

| The question you're actually asking | The pillar that answers it |
| --- | --- |
| Is something wrong right now? | Metrics + alerts on SLO burn |
| Where is the time going across services? | Distributed traces (the waterfall) |
| What exactly happened to this one request? | Structured logs, found by trace ID |
| Are we reliable enough to ship, or should we stabilise? | SLOs + error budget |
| Why did this whole class of errors start at 14:02? | All three, correlated by trace ID |

The honest view by company size

  • Solo / early startup. You do not need an observability platform. An error tracker (crash and exception alerts), structured logs you can search, and two or three uptime/latency alerts cover the vast majority of incidents. Do one cheap thing now, though: put a trace ID on every log line. It is nearly free today and you will be grateful the first time you grow a second service.
  • Growing scale-up. Adopt OpenTelemetry before you are locked in, and propagate the trace ID through every service. Stand up RED metrics and a small set of dashboards. Write your first SLOs on the one or two journeys that truly matter (checkout, login). Start an on-call rotation and ruthlessly delete every alert that is not actionable. Begin sampling traces before the bill teaches you to.
  • Enterprise. The three pillars are a platform with an owner. Error-budget policy formally gates releases. Each team owns its own SLOs and on-call. Sampling, retention, and cardinality are governed centrally, because the cost is now a real line item. By here the hard part is no longer collecting data — it is culture: actually acting on the budgets, keeping alerts honest, and writing the runbooks before the incident, not during it.

Key takeaways

  • You can't fix what you can't see. Distributed systems removed the single log file, so seeing the system became a discipline you have to build on purpose.
  • Three pillars, three jobs. Metrics ask “is something wrong?”, logs ask “what exactly happened?”, traces ask “where did the time go?” Don't force one to do all three.
  • One propagated trace ID is the highest-leverage habit. It stitches ten log files into one story — adopt OpenTelemetry early so you are not locked in.
  • Measure the user, not the machine. SLIs and SLOs turn reliability into an error budget: ship boldly when it's healthy, stabilise when it's spent.
  • Alert on symptoms, not causes. Every page must be urgent and actionable, or you train people to ignore the one alert that matters.
  • Observability costs real money. Sample traces, tier retention, and guard cardinality on purpose — or the invoice will do it for you.

Resilience is building the system to survive failure; observability is the sense organ that lets you notice it failing, find the cause in minutes instead of hours, and learn from it afterwards. Build for failure, then make failure visible. Together, that is most of what “operating” a large system actually means — and it is the quiet difference between a 3 a.m. mystery and a 3 a.m. fix.
