Microservices

Splitting the backend honestly: what microservices buy, the distributed problems they import, sagas, and why monolith-first is the senior answer.

Tags: backend, microservices, distributed-systems, architecture

The actual definition

A monolith is one deployable: all features in one codebase, one process, one database. Microservices split the backend into independently deployable services — each owning one business capability and its own data — talking over the network.

The crucial honesty up front: microservices are not a performance technique or a modernity badge. They are an organizational scaling technique that costs technical complexity. Every "should we?" conversation is really about that trade.

What they buy — and the price

You gain | You pay
Independent deploys: checkout team ships 20×/day without coordinating with search team | Every function call you split becomes a network call: latency, timeouts, retries, partial failure
Independent scaling: 50 instances of search, 2 of invoicing (scalability) | No more ACID across features: the transaction that updated orders + inventory together is gone
Fault isolation: recommendations down ≠ checkout down | Debugging crosses machines: distributed tracing or blindness
Tech freedom per service: the right runtime per workload | Operational surface × N: deploys, monitoring, on-call, versioned contracts between teams
Team autonomy: Amazon's "two-pizza teams," each owning services end-to-end | A platform tax you pay before the first feature: service discovery, CI/CD per service, Level 10 machinery

The rule that survives contact with reality: split when team coordination costs exceed distributed-system costs, which is rarely before several teams are stepping on each other's deploys. "Monolith-first" (build modular inside one deployable; extract services along the seams that prove painful) is not a junior compromise; it's what the Amazon, Shopify and Stack Overflow stories actually teach.

Boundaries: the design problem

Wrong splits are worse than no splits. The unit is a business capability (orders, payments, inventory, notifications) — not a layer (controllers-service! database-service!) and not a table. Two tests for a proposed boundary:

  1. Does it own its data? Each service gets its own database — others may never reach in; they ask via API/events. Shared databases recreate the coupling you split to escape, minus the transactions.
  2. Can it change alone? If every feature touches three services, you've built a distributed monolith — all of the costs, none of the autonomy. (This is the modal failure, and naming it is interview gold.)

The encapsulation principle, at organization scale: services are objects, APIs/events are their public methods, databases are their private fields.

Communication: sync vs async

  • Synchronous (REST/gRPC): the caller needs the answer now; checkout must know payment succeeded. Simple to reason about, but it couples uptime (callee down = caller failing) and adds latency per hop. gRPC is a common choice for internal calls: binary, with typed contracts (the API-contract discipline, compiled).
  • Asynchronous (events via queues): caller announces a fact — OrderPlaced — and moves on; interested services react on their own time. Decouples uptime and teams (adding a consumer requires zero changes upstream); costs you eventual consistency and harder reasoning.

Default that earns senior nods: sync for queries and must-know-now commands; async events for everything that's a consequence. Most "service A calls B calls C calls D" latency chains were consequences wearing sync clothing.
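To make the sync/async default concrete, here is a minimal sketch. The `EventBus` class is a hypothetical in-process stand-in for a real queue, and `charge_payment`, `place_order` and the two consumers are illustrative names, not part of any library. A real broker would deliver asynchronously and at-least-once; this only shows the shape of the coupling.

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    """Hypothetical in-process stand-in for a message queue."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # A real broker would hand off and return; here handlers run inline.
        for handler in self._subscribers[event_type]:
            handler(payload)


def charge_payment(order: dict) -> bool:
    """Synchronous command: checkout needs this answer before continuing."""
    return order["total"] > 0  # placeholder for a real payment call


def place_order(bus: EventBus, order: dict) -> str:
    if not charge_payment(order):        # sync: must know now
        return "rejected"
    bus.publish("OrderPlaced", order)    # event: announce the fact, move on
    return "accepted"


bus = EventBus()
sent_emails: list[str] = []
loyalty_points: dict[str, int] = {}

bus.subscribe("OrderPlaced", lambda o: sent_emails.append(o["id"]))
# Adding a second consumer later needs no change to place_order.
bus.subscribe("OrderPlaced", lambda o: loyalty_points.update({o["id"]: o["total"] // 10}))

status = place_order(bus, {"id": "o-1", "total": 250})
```

The payment call stays synchronous because checkout cannot proceed without its result; the email and loyalty updates are consequences, so they hang off the event and the publisher never learns they exist.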

Surviving sync failure

Network calls fail; the patterns are vocabulary now: timeouts (always, on everything), retries with backoff + jitter (only on idempotent calls — idempotency keys exist for this), circuit breakers (after K failures, fail fast for a cooldown instead of hammering a dying service — preventing cascade failure), and fallbacks (recommendations down → show bestsellers, not a 500).
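A rough sketch of three of these patterns together, assuming a counter-based breaker and "full jitter" backoff. The class and function names, the thresholds, and the bestseller fallback values are all illustrative assumptions; production code would more likely use a maintained resilience library and per-call timeouts from the HTTP client.

```python
import random
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], object]) -> object:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.cooldown_s:
                raise CircuitOpenError("failing fast during cooldown")
            # Cooldown over: allow one trial call; a single failure reopens.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result


def retry_with_backoff(fn: Callable[[], object], attempts: int = 3,
                       base_delay_s: float = 0.1, max_delay_s: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> object:
    """Retry an idempotent call with exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, cap))  # jitter spreads out retry storms


def recommendations_with_fallback(breaker: CircuitBreaker,
                                  fetch: Callable[[], list]) -> list:
    """Fallback: a degraded answer beats a 500."""
    try:
        return breaker.call(fetch)
    except Exception:
        return ["bestseller-1", "bestseller-2"]  # hypothetical static list
```

Only wrap calls in `retry_with_backoff` when repeating them is safe; a non-idempotent charge retried blindly is how one order becomes two payments.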

The hard part: transactions across services

Order placement must create the order, charge the payment, and reserve inventory: three services, three databases, no shared transaction. What happens if inventory fails after the payment has already been charged?

The saga pattern: a sequence of local transactions, each publishing an event that triggers the next — and for every step, a defined compensating action that undoes it on later failure:

OrderCreated → PaymentCharged → InventoryReserved → OrderConfirmed
                     ↑                  ✗ fails
              RefundPayment  ←  ReservationFailed     (compensation flows back)
              CancelOrder    ←

Two flavors: choreography (services react to each other's events — no coordinator, but the flow lives nowhere and is hard to follow) vs orchestration (an order-saga coordinator commands each step — the flow is explicit, the coordinator is a dependency). Either way you've traded ACID's automatic rollback for explicit, designed rollback — and accepted eventual consistency: for a few seconds, the system is mid-saga. (CAP & consistency is the theory; this is where it bites.)
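The orchestration flavor can be sketched in a few lines. This is an in-process toy under stated assumptions: `SagaStep` and `run_saga` are hypothetical names, each `action` stands in for a local transaction in a separate service, and a real coordinator would persist its progress and retry compensations rather than assume they succeed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SagaStep:
    name: str
    action: Callable[[dict], None]       # local transaction in one service
    compensate: Callable[[dict], None]   # undoes that transaction


def run_saga(steps: list[SagaStep], ctx: dict) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: list[SagaStep] = []
    for step in steps:
        try:
            step.action(ctx)
        except Exception as exc:
            ctx["failed_at"] = step.name
            ctx["error"] = str(exc)
            for done in reversed(completed):
                done.compensate(ctx)
            return False
        completed.append(step)
    return True


log: list[str] = []


def reserve_inventory(ctx: dict) -> None:
    if ctx["sku"] not in ctx["stock"]:
        raise RuntimeError("ReservationFailed")
    log.append("InventoryReserved")


steps = [
    SagaStep("create_order",
             lambda c: log.append("OrderCreated"),
             lambda c: log.append("CancelOrder")),
    SagaStep("charge_payment",
             lambda c: log.append("PaymentCharged"),
             lambda c: log.append("RefundPayment")),
    SagaStep("reserve_inventory",
             reserve_inventory,
             lambda c: log.append("ReleaseInventory")),
]

# Empty stock forces the third step to fail after payment was charged.
ok = run_saga(steps, {"sku": "book-42", "stock": set()})
# log is now: OrderCreated, PaymentCharged, RefundPayment, CancelOrder
```

The failed step never completed, so it is not compensated; only the steps before it are undone, newest first, which matches the compensation arrows in the diagram above.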

Idempotency is the saga's load-bearing wall: events get redelivered (at-least-once), so every handler must tolerate duplicates — process-once semantics built from dedupe keys.
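A minimal sketch of a dedupe-key consumer, assuming every event carries a unique `event_id` field (an assumption about the event schema, not something the bus guarantees). `RefundHandler` is a hypothetical name, and the in-memory set stands in for a durable store that would be updated in the same local transaction as the side effect.

```python
class RefundHandler:
    def __init__(self) -> None:
        # In production: a durable table, written atomically with the refund.
        self._seen: set[str] = set()
        self.refunded_cents = 0

    def handle(self, event: dict) -> bool:
        """Apply the refund once; return False for a duplicate delivery."""
        key = event["event_id"]
        if key in self._seen:
            return False
        self.refunded_cents += event["amount_cents"]
        self._seen.add(key)
        return True


handler = RefundHandler()
event = {"event_id": "evt-123", "amount_cents": 4999}

first = handler.handle(event)    # processed
second = handler.handle(event)   # redelivery: ignored
```

The broker still delivers twice; the handler is what turns at-least-once delivery into an effect that happens once.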

Seeing anything: observability

A request now crosses 6 services; "it's slow" means where? Three pillars, non-negotiable at this scale: structured logs with a correlation id propagated through every hop (the one header that turns 6 services' logs into one story), metrics per service (rate/errors/duration — the load balancer's health checks plus dashboards), and distributed tracing (OpenTelemetry/Jaeger: one waterfall showing the request's path and where the 800 ms went). Level 10 operationalizes all three.
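As an illustration of the correlation-id idea only: the header name `X-Correlation-ID` is a common convention assumed here, not a standard, and the two "services" are plain functions standing in for network hops. Real deployments would usually let a tracing library such as OpenTelemetry propagate context instead of hand-rolling it.

```python
import json
import uuid

HEADER = "X-Correlation-ID"  # assumed header name (convention, not a standard)


def ensure_correlation_id(headers: dict) -> dict:
    """Reuse the incoming id, or mint one at the edge of the system."""
    out = dict(headers)
    out.setdefault(HEADER, str(uuid.uuid4()))
    return out


def log(records: list, service: str, message: str, headers: dict) -> None:
    # Structured log line: one JSON object, always carrying the id.
    records.append(json.dumps({
        "service": service,
        "correlation_id": headers[HEADER],
        "msg": message,
    }))


def payments_service(headers: dict, records: list) -> None:
    headers = ensure_correlation_id(headers)
    log(records, "payments", "charge accepted", headers)


def orders_service(headers: dict, records: list) -> None:
    headers = ensure_correlation_id(headers)
    log(records, "orders", "order received", headers)
    # Forward the same id on the outgoing call.
    payments_service({HEADER: headers[HEADER]}, records)


records: list[str] = []
orders_service({}, records)
ids = {json.loads(line)["correlation_id"] for line in records}
# Both services logged under a single shared id.
```

Filtering the combined logs on that one id reconstructs the request's path without any service knowing about the others' log formats.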

Common mistakes

  • Microservices while still searching for the product: a two-person startup paying Amazon's coordination tax with nobody to coordinate. Modular monolith first.
  • Shared database "just for now" — the distributed monolith's origin story.
  • Splitting by noun-table instead of capability — UserService, OrderService, and every request visiting both.
  • Sync chains five deep — availability multiplies down: if each service is up 99.9% of the time, a request needing all five in a row succeeds only 0.999⁵ ≈ 99.5% of the time — you made everything less reliable by wiring it in series. Worst-case latency stacks the same way (estimation instincts).
  • No timeouts/circuit breakers — one slow dependency exhausts your thread pool and the cascade takes the platform down.
  • Events without idempotent consumers — duplicate PaymentCharged, duplicate refund, real money.
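The series-availability arithmetic from the sync-chain item is easy to check. A small sketch, assuming failures are independent across services (real outages are often correlated, so treat this as an estimate); the function names and the 200 ms per-hop figure are illustrative.

```python
def series_availability(per_service: float, hops: int) -> float:
    """Probability that every service in a sync chain is up at once."""
    return per_service ** hops


def worst_case_latency_ms(per_hop_timeouts_ms: list[int]) -> int:
    """Sequential hops: worst-case latencies add up."""
    return sum(per_hop_timeouts_ms)


five_deep = series_availability(0.999, 5)           # about 0.995
chain_latency = worst_case_latency_ms([200] * 5)    # 1000 ms if each hop may take 200 ms
```

Each extra synchronous hop multiplies in another factor below 1, which is why turning consequences into events shortens the chain that availability depends on.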

Practice

  1. Paper-split BookMyShow (the LLD): propose services, mark each arrow sync or event, and find the saga (booking = seat-lock + payment + ticket + notification). Where does the compensation flow run when payment fails?
  2. Build a mini-saga: two FastAPI/Express services (orders, payments) + Redis as a bus: OrderPlaced → charge → PaymentResult → confirm/cancel. Kill the payment service mid-flow; make recovery work; then send a duplicate event and watch idempotency save you (or not).
  3. Break a chain: add a 5 s sleep to one downstream service and watch the caller's latency without a timeout; add timeout + fallback; then a counter-based circuit breaker. Three patterns, one afternoon.
  4. Connect: Praxivo's deep-dive — argue against microservices for it in three sentences. (Knowing when not to is the Level 7 graduation exam.)

Level 7 complete. The roadmap turns to what the user sees: Level 8 — Frontend Development.