What observability actually means
Monitoring asks predefined questions: "is CPU above 80%?" Observability is the stronger property: can you answer questions you didn't predict — "why are checkouts from Android in Mumbai slow since the 2pm deploy?" — from the system's outputs alone, without shipping new code to find out?
The distinction matters because distributed systems fail in combinations nobody pre-imagines. You can't dashboard your way to every answer; you can instrument your way to answerable.
Analogy: monitoring is the warning lights on a car's dashboard (predefined, binary, useful). Observability is being able to plug in a diagnostic reader and interrogate anything the engine ever did.
The three pillars (and what each is for)
Logs — what happened, in detail
Discrete events with context. The non-negotiable upgrade from print (Level 1's debugging) is structured logging, with machine-parseable fields instead of prose:
{"ts": "2026-06-11T14:02:11Z", "level": "error", "service": "payments",
"trace_id": "a8f3...", "user_id": 4471, "order_id": 88123,
"msg": "charge declined", "provider": "stripe", "decline_code": "insufficient_funds"}
Prose logs are grep-able; structured logs are queryable ("all declines by provider, last hour, grouped by code"). Rules that pay rent: log once per event at the right layer (throw low, catch high, log once); always carry the correlation/trace id (below); never log secrets or PII (tokens, passwords, card numbers — log redaction is a compliance control, not a style choice). Logs are expensive at volume — sample the routine, keep every error.
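The rules above can be sketched with nothing but Python's standard logging module. This is a minimal sketch, not a production logger: JsonFormatter, build_logger, the "fields" key passed through extra, and the REDACTED_KEYS deny-list are all hypothetical names chosen for illustration.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical deny-list; a real one would come from your compliance policy.
REDACTED_KEYS = {"password", "token", "card_number", "authorization"}


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "msg": record.getMessage(),
        }
        # Structured fields arrive via extra={"fields": {...}} on the log call.
        for key, value in getattr(record, "fields", {}).items():
            entry[key] = "[REDACTED]" if key in REDACTED_KEYS else value
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines from the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("payments")
    log.error(
        "charge declined",
        extra={"fields": {"trace_id": "demo-trace", "order_id": 88123,
                          "decline_code": "insufficient_funds",
                          "token": "tok_should_never_appear"}},
    )
```

Redacting by key name is the bluntest possible control; it only catches secrets that arrive under a name you thought of, so treat it as a floor, not a guarantee.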
Metrics — how much, how fast, numerically
Pre-aggregated numbers over time: cheap to store, fast to query, ideal for dashboards and alerts. The two acronyms that organize everything:
- RED (per service): Rate (requests/sec), Errors (failure %), Duration (latency — as percentiles, never averages: p50 is the typical user, p99 is your unlucky-1% user (the Node doc's metric), and averages hide exactly the pain you're looking for).
- USE (per resource): Utilization, Saturation (queue depth — queue depth is destiny), Errors — for hosts, DBs, pools, brokers.
A RED dashboard per service + USE per shared resource answers "is it broken and where" in one glance; that's the bar.
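To see why Duration is reported as percentiles, here is a small sketch with made-up numbers: 98 fast requests and 2 ten-second ones. The percentile helper uses the nearest-rank method, which is one of several definitions; real metrics backends typically estimate percentiles from histogram buckets instead.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number ranks stay exact.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests, 2 very slow ones.
latencies_ms = [50] * 98 + [10_000] * 2

if __name__ == "__main__":
    print("mean:", mean(latencies_ms))            # 249
    print("p50: ", percentile(latencies_ms, 50))  # 50
    print("p99: ", percentile(latencies_ms, 99))  # 10000
```

The mean of 249 ms describes no request that actually happened: the typical user saw 50 ms and the unlucky ones saw ten seconds.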
Traces — where the time went, across services
One request, six services: a distributed trace stitches its whole journey into a waterfall of spans (operation, service, start, duration, parent):
trace a8f3: gateway ──────────────────────────── 1,240ms
  └─ orders ──────────────────────── 1,180ms
       ├─ inventory ── 95ms
       ├─ payments ─────────────────── 1,020ms  ← there it is
       │    └─ stripe API ─────────── 980ms
       └─ notifications (async) 12ms
The mechanism is humble: the gateway mints a trace id, and every hop propagates it (HTTP headers, queue message headers) while reporting its spans — the standard is OpenTelemetry, the propagation rule is "every client and handler forwards the context," and the discipline must be universal: one service that drops the header cuts every trace that crosses it. The trace id doubles as the correlation id that turns six services' logs into one story (the microservices doc's lifeline).
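The mint-and-forward mechanism can be sketched by hand. This is an illustration of the idea only; in practice the OpenTelemetry SDK's propagators do this for you. The header layout follows the W3C Trace Context traceparent format (version, 32-hex trace id, 16-hex parent span id, flags); SpanContext, start_trace, inject, and extract are hypothetical names, and the sketch assumes header keys are already lowercase.

```python
import secrets
from dataclasses import dataclass
from typing import Optional

TRACEPARENT = "traceparent"  # header name from the W3C Trace Context format


@dataclass(frozen=True)
class SpanContext:
    trace_id: str                         # shared by every hop of one request
    span_id: str                          # unique to this hop
    parent_span_id: Optional[str] = None  # None at the root (the gateway)


def start_trace() -> SpanContext:
    """Mint a fresh trace, as the gateway would for an incoming request."""
    return SpanContext(trace_id=secrets.token_hex(16), span_id=secrets.token_hex(8))


def inject(ctx: SpanContext, headers: dict) -> dict:
    """Client side: attach the current context to an outgoing call."""
    headers[TRACEPARENT] = f"00-{ctx.trace_id}-{ctx.span_id}-01"
    return headers


def extract(headers: dict) -> SpanContext:
    """Handler side: continue the caller's trace, or start one if the header is missing."""
    parts = headers.get(TRACEPARENT, "").split("-")
    if len(parts) != 4 or len(parts[1]) != 32 or len(parts[2]) != 16:
        return start_trace()
    return SpanContext(trace_id=parts[1], span_id=secrets.token_hex(8),
                       parent_span_id=parts[2])


if __name__ == "__main__":
    gateway = start_trace()
    orders = extract(inject(gateway, {}))     # gateway calls orders
    payments = extract(inject(orders, {}))    # orders calls payments
    print(payments.trace_id == gateway.trace_id)   # True
    print(payments.parent_span_id == orders.span_id)  # True
```

Note what happens when a hop forgets inject: the next service's extract finds no header and quietly starts a new trace, which is exactly the "cut trace" failure described above.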
How the pillars compose in a real incident: alert fires on a metric → trace shows which hop → logs of that hop show why. Three tools, one workflow — build all three or the workflow has a hole.
SLOs and error budgets (the management layer)
Observability data becomes decisions through three letters:
- SLI — a measured indicator of user-visible health: "% of checkout requests under 500 ms, successful."
- SLO — the target you commit to: "99.9% over 30 days." Chosen deliberately — every extra nine multiplies cost (the availability math) and 100% is a lie; pick the number where users stop noticing.
- Error budget — the allowed failure: 99.9% = ~43 minutes of badness per month. The budget is spendable: plenty left → ship fast, take risks; budget burned → feature work pauses for reliability work. This single mechanism converts the eternal speed-vs-stability argument into arithmetic — it's SRE's (Google's Site Reliability Engineering practice) most exportable idea, and naming it well in interviews signals operational maturity.
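The budget arithmetic is worth doing once by hand. A minimal sketch, assuming a time-based SLO over a 30-day window (a request-based SLO would count failed requests instead of minutes); error_budget_minutes is a hypothetical helper name.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed badness for a time-based SLO over the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.2%} -> {error_budget_minutes(target):.1f} min / 30 days")
```

Each extra nine divides the budget by ten: roughly 432 minutes at 99%, 43 at 99.9%, and 4 at 99.99%, which is the whole "every extra nine multiplies cost" argument in one loop.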
Alerting philosophy (the part everyone gets wrong)
- Page on symptoms, not causes: page when users hurt (SLO burn rate, error spike, p99 through the ceiling) — not when CPU is high. High CPU with healthy latency is Tuesday; paging on it trains humans to ignore pages (alert fatigue — the failure mode that eventually swallows a real incident).
- Every page must be actionable and urgent. Not urgent → a ticket. Not actionable → a dashboard. The 2 AM test: would you act on this, right now? No → it doesn't page.
- Alert on absence too — "no orders processed in 10 minutes" beats every threshold; silence is the failure your error-rate alert can't see (the queue consumer that hung without erroring — lag is the metric).
- Burn-rate alerting is the modern default: page when the error budget is being consumed too fast (fast burn = page now; slow burn = ticket) — it self-calibrates to the SLO instead of needing per-metric thresholds.
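Burn rate is just the observed error ratio divided by the ratio the SLO allows, so a burn rate of 1 spends the budget exactly over the window. The sketch below is one possible shape for the fast/slow split. The thresholds and windows are assumptions for illustration, not values from this lesson: 14.4 over one hour is a commonly cited fast-burn threshold (it spends about 2% of a 30-day budget in that hour), and 1.0 over three days is used here as the slow-burn line.

```python
FAST_BURN = 14.4  # assumed threshold: ~2% of a 30-day budget spent in 1 hour
SLOW_BURN = 1.0   # assumed threshold: on pace to spend the whole budget


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return error_ratio / (1 - slo)


def decide(error_ratio_1h: float, error_ratio_3d: float, slo: float) -> str:
    """Map two observation windows to an action: page, ticket, or ok."""
    if burn_rate(error_ratio_1h, slo) >= FAST_BURN:
        return "page"    # budget is evaporating right now
    if burn_rate(error_ratio_3d, slo) >= SLOW_BURN:
        return "ticket"  # slow leak; fix it in working hours
    return "ok"


if __name__ == "__main__":
    slo = 0.999
    print(decide(0.02, 0.02, slo))      # page: 2% errors is ~20x the allowed rate
    print(decide(0.002, 0.0015, slo))   # ticket: ~1.5x over three days
    print(decide(0.0001, 0.0001, slo))  # ok: ~0.1x
```

Because the thresholds are expressed in budget terms, the same two rules work for any service once its SLO is known; no per-metric tuning is needed.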
Common mistakes
- Averages on dashboards — the mean of 99 fast and 1 ten-second request looks fine; the percentile screams. p50/p95/p99 or it didn't happen.
- Logging without correlation ids — six services, six private diaries, no story. "What happened to this one request?" has no answer without one.
- Instrumenting after the incident — observability is built in fair weather; mid-outage is too late (the modular-monolith rule: infrastructure you'll need is cheapest before you need it).
- Cardinality explosions — a metric labeled by user_id creates millions of time series and a dead metrics bill; high-cardinality questions belong to logs/traces, not metrics labels.
- Dashboards as wallpaper — 40 charts nobody reads; one RED per service, one USE per resource, one SLO page. Curate or drown.
- No runbooks — every alert links to "what to check, what to do"; a page without a runbook is a riddle at 2 AM.
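The cardinality explosion above is plain multiplication: a metric can produce up to one time series per combination of label values. A rough sketch with hypothetical label counts (the product is an upper bound, since not every combination occurs in practice):

```python
from math import prod


def max_series(label_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a request-duration metric.
sane = {"service": 20, "endpoint": 50, "status_class": 5}
exploded = {**sane, "user_id": 1_000_000}

if __name__ == "__main__":
    print(f"{max_series(sane):,}")      # 5,000
    print(f"{max_series(exploded):,}")  # 5,000,000,000
```

One innocent-looking label turns five thousand series into billions, which is why per-user questions belong in logs and traces.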
Interview perspective
Practice
- Structure your logs: take any practice API from Level 7 and convert prints to structured JSON logs with a request-scoped trace id middleware; verify one request's story is one grep.
- RED in 30 minutes: add Prometheus counters/histograms (requests, errors, duration) and a three-panel dashboard. Break the service; watch the panels tell you before the user does.
- Trace a chain: two services calling each other with OpenTelemetry auto-instrumentation + Jaeger; find a deliberately slow downstream in the waterfall.
- Write one SLO: for your project, define one SLI, pick a defensible target, compute the monthly budget in minutes, and write the two alerts (fast burn, slow burn) with their runbook stubs.
Level 10 is complete. Observability is the lens for everything else you've built — next on the roadmap: the remaining database internals and AI foundations.