Scaling, Load Balancing & Statelessness

Vertical vs horizontal scaling, why stateless services are the unlock, load-balancer types and algorithms, and how to handle sessions when any server can take any request.

scalability · load-balancing · availability

Two ways to scale

  • Vertical (scale up): a bigger box (more CPU/RAM). Simple, no app changes — but there's a hard ceiling and it's a single point of failure.
  • Horizontal (scale out): more boxes behind a load balancer. Near-unlimited and fault-tolerant, but requires your services to be stateless and your data layer to handle concurrency.

Real systems do both: scale up until it's uneconomical, then scale out.

Statelessness is the unlock

If a server stores session state in memory, every follow-up request from that user must hit the same server (sticky sessions). That skews load distribution, and the session is lost when that box dies. Fix: keep app servers stateless and push state to a shared store.

Now any server can serve any request; you add/remove servers freely, and a crash loses no session. This is exactly why StockVision puts JWT/session data in cookies + Redis rather than server memory — it can run behind N identical nodes.
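
A minimal Python sketch of that idea. The `SessionStore` class is a hypothetical in-memory stand-in for a shared store such as Redis, so the example runs without any server; `AppServer`, `login`, and `whoami` are illustrative names, not StockVision's actual code.

```python
import secrets


class SessionStore:
    """Hypothetical stand-in for a shared store such as Redis (in-memory here)."""

    def __init__(self):
        self._data = {}

    def put(self, session_id, value):
        self._data[session_id] = value

    def get(self, session_id):
        return self._data.get(session_id)


class AppServer:
    """Stateless app server: it keeps no session data of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store  # the same store object is shared by every server

    def login(self, user):
        session_id = secrets.token_hex(16)  # would travel back to the client in a cookie
        self.store.put(session_id, {"user": user})
        return session_id

    def whoami(self, session_id):
        session = self.store.get(session_id)
        return None if session is None else session["user"]


store = SessionStore()
servers = [AppServer(f"app-{i}", store) for i in range(3)]

# Request 1 lands on app-0; request 2 lands on app-2 and still finds the session.
sid = servers[0].login("alice")
user_seen_by_other_server = servers[2].whoami(sid)

# Simulate app-0 crashing: the session survives because it never lived there.
del servers[0]
user_after_crash = servers[0].whoami(sid)  # servers[0] is now app-1
```

Because the servers hold nothing between requests, removing one from the pool changes capacity but not correctness.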

Trace a request down the tiers: the load balancer round-robins to a stateless app server, which checks the cache first and only falls through to the database on a miss. A repeated key is then served from the warm cache instead of the database.

Request flow: Client → LB → stateless app (×3: servers 0, 1, 2) → Cache → DB. Time: a cache hit is ≈ O(1).

Step 1 of 11, request GET /u/7: the load balancer routes to app server 0 (round-robin). Servers are stateless, so any one can serve it. State: cached keys = 0, server = 0.
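
The same flow can be sketched in a few lines of Python. This is an illustration under assumptions, not the page's own implementation: `DATABASE`, `cache`, and `handle` are hypothetical names, and plain dicts stand in for the cache and database tiers.

```python
import itertools

DATABASE = {"/u/7": "user 7", "/u/8": "user 8"}  # hypothetical rows
cache = {}                                        # shared cache tier
servers = itertools.cycle([0, 1, 2])              # round-robin over the 3 app servers


def handle(path):
    """Return (server, source, value) for one GET request."""
    server = next(servers)      # the LB picks the next server in rotation
    if path in cache:           # cache hit: a dict lookup, O(1) on average
        return server, "cache", cache[path]
    value = DATABASE[path]      # miss: fall through to the database
    cache[path] = value         # warm the cache for the next request
    return server, "db", value


first = handle("/u/7")   # cold cache, served by server 0 from the database
second = handle("/u/7")  # same key, server 1, served from the warm cache
third = handle("/u/8")   # new key, server 2, database again
```

The second request lands on a different server yet still hits the cache, which only works because the cache is shared rather than held inside one app server.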

Load balancers

  • L4 (transport): routes by IP/port; fast, protocol-agnostic, no payload awareness.
  • L7 (application): routes by URL/header/cookie; enables path-based routing, TLS termination, and smarter health checks. Most web stacks use L7.
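
To make the difference concrete, here is a rough Python sketch of what each layer can base its decision on. The pool names, addresses, and the `X-Canary` header are all made up for illustration; real load balancers express this in their own configuration.

```python
# Hypothetical backend pools; names and ports are illustrative only.
L7_ROUTES = [
    ("/api/", ["api-1:8080", "api-2:8080"]),
    ("/static/", ["static-1:8080"]),
]
DEFAULT_POOL = ["web-1:8080", "web-2:8080"]


def l4_pool(dest_ip, dest_port):
    """L4 view: only address and port are visible, so one pool per listener."""
    return DEFAULT_POOL


def l7_pool(path, headers):
    """L7 view: the request is parsed, so path and headers can drive routing."""
    if headers.get("X-Canary") == "1":   # hypothetical header-based rule
        return ["canary-1:8080"]
    for prefix, pool in L7_ROUTES:       # path-based routing
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL


api_pool = l7_pool("/api/users", {})
canary_pool = l7_pool("/api/users", {"X-Canary": "1"})
home_pool = l7_pool("/", {})
```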

Algorithms: round-robin, least-connections, weighted, and consistent hashing (route the same key to the same node — vital for cache affinity). Always pair with health checks so dead nodes are pulled out.
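
Two of these algorithms sketched in Python, as an illustration rather than a production design. `least_connections` and `HashRing` are hypothetical names, the node labels are invented, and MD5 is used only as a convenient stable hash, not for security.

```python
import bisect
import hashlib


def least_connections(active, healthy):
    """Pick the healthy node with the fewest in-flight connections."""
    candidates = [node for node in active if node in healthy]
    return min(candidates, key=lambda node: active[node])


def _hash(key):
    # Stable across processes (the built-in hash() is randomised for str).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    """Consistent hashing: each node owns several points on a ring."""

    def __init__(self, nodes, replicas=100):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # First point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


# Least-connections with a health check: "b" is idle but failed its check.
active = {"a": 12, "b": 3, "c": 7}
pick = least_connections(active, healthy={"a", "c"})

# Consistent hashing: the same key always maps to the same node.
ring = HashRing(["a", "b", "c"])
keys = [f"user:{i}" for i in range(1000)]
before = {key: ring.node_for(key) for key in keys}

# Remove node "c": only the keys that lived on "c" are reassigned.
smaller = HashRing(["a", "b"])
moved = [key for key in keys if smaller.node_for(key) != before[key]]
```

The last step is the point of consistent hashing: keys on the surviving nodes stay put, so their caches stay warm when the pool changes.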

The LB itself can be a SPOF

A single load balancer is a single point of failure. Run it in an active-passive (or active-active) pair with a floating IP / DNS failover, or use a managed LB that's redundant by design.
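
A toy Python model of the active-passive case, assuming the passive node counts missed heartbeats and takes over the floating IP after a threshold. `LoadBalancerPair`, `max_missed`, and the node names are hypothetical; real deployments usually delegate this to a protocol such as VRRP or to the cloud provider.

```python
class LoadBalancerPair:
    """Active-passive pair sharing one floating IP (all names hypothetical)."""

    def __init__(self, floating_ip, active, passive, max_missed=3):
        self.floating_ip = floating_ip
        self.active = active
        self.passive = passive
        self.max_missed = max_missed
        self._missed = 0

    def heartbeat(self, active_alive):
        """Called once per heartbeat interval by the passive node."""
        if active_alive:
            self._missed = 0
            return
        self._missed += 1
        if self._missed >= self.max_missed:
            # Promote the standby: the floating IP now points at it.
            self.active, self.passive = self.passive, self.active
            self._missed = 0

    def owner(self):
        """Node currently answering on the floating IP."""
        return self.active


pair = LoadBalancerPair("203.0.113.10", active="lb-1", passive="lb-2")
pair.heartbeat(True)
pair.heartbeat(False)            # one missed beat is tolerated
owner_after_blip = pair.owner()
pair.heartbeat(True)             # lb-1 recovers, counter resets
for _ in range(3):               # lb-1 stops answering for three intervals
    pair.heartbeat(False)
owner_after_failure = pair.owner()
```

Requiring several consecutive misses before failing over is a deliberate trade-off: it avoids flapping on a single dropped heartbeat at the cost of a slightly longer outage.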

Don't forget the data tier

Scaling app servers is easy; the database is usually the real bottleneck. The standard ladder: cache reads → read replicas for read scale → shard when writes or storage exceed one box (covered in Databases at scale).
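
The three rungs can be sketched in Python as follows. This is a simplified model with hypothetical names (`DataTier`, `shard_for`): dicts stand in for the primary, replicas, and cache, and replication is done synchronously for clarity, whereas real replicas usually lag the primary.

```python
import itertools
import zlib


class DataTier:
    """Rungs 1 and 2: cache in front of read replicas, writes go to the primary."""

    def __init__(self, n_replicas=2):
        self.primary = {}
        self.replicas = [dict() for _ in range(n_replicas)]
        self.cache = {}
        self._next_replica = itertools.cycle(range(n_replicas))

    def write(self, key, value):
        self.primary[key] = value
        self.cache.pop(key, None)       # invalidate any stale cached copy
        for replica in self.replicas:   # synchronous here; real replication is often async
            replica[key] = value

    def read(self, key):
        if key in self.cache:
            return self.cache[key], "cache"
        replica = self.replicas[next(self._next_replica)]
        value = replica[key]
        self.cache[key] = value
        return value, "replica"


def shard_for(key, n_shards):
    """Rung 3: pick a shard with a stable hash of the key."""
    return zlib.crc32(key.encode()) % n_shards


tier = DataTier()
tier.write("u:7", "alice")
first_read = tier.read("u:7")    # served by a replica, then cached
second_read = tier.read("u:7")   # served by the cache
tier.write("u:7", "alicia")      # the write invalidates the cached value
third_read = tier.read("u:7")
shard = shard_for("u:7", 4)
```

Note that plain modulo sharding reshuffles most keys when the shard count changes; that cost is one reason sharding is the last rung rather than the first.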

Design drills

Scaling is a ladder you climb under pressure. Practise naming the next rung and its cost.


Whiteboard each one out loud for 5–10 minutes before you reveal what a strong answer covers; the gap between your sketch and the checklist is your study list.

Warm-up

One app server is at 80% CPU. What's the cheapest next move, and when does it stop working?

Core

Reads are the bottleneck at 50k QPS with a 100:1 read:write ratio. Walk the read-scaling ladder in order.

Core

The load balancer is now a single point of failure. How do you make the entry tier itself highly available?

Stretch

Writes now exceed what one primary can handle. What changes, and what do you lose?