Two ways to scale
- Vertical (scale up): a bigger box (more CPU/RAM). Simple, no app changes — but there's a hard ceiling and it's a single point of failure.
- Horizontal (scale out): more boxes behind a load balancer. Near-unlimited and fault-tolerant, but requires your services to be stateless and your data layer to handle concurrency.
Real systems do both: scale up until it's uneconomical, then scale out.
Statelessness is the unlock
If a server stores session state in memory, every later request in that session must hit the same server (sticky sessions), which skews load across servers and loses the session when that box dies. Fix: keep app servers stateless and push state to a shared store.
Now any server can serve any request; you add/remove servers freely, and a crash loses no session. This is exactly why StockVision puts JWT/session data in cookies + Redis rather than server memory — it can run behind N identical nodes.
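A minimal sketch of that idea, assuming a plain dict as a stand-in for an external store such as Redis. `SharedSessionStore`, `AppServer`, and the method names are hypothetical, chosen for illustration only; this is not StockVision's actual code.

```python
import secrets


class SharedSessionStore:
    """Stand-in for an external store such as Redis (assumption: a dict here)."""

    def __init__(self):
        self._data = {}

    def put(self, session_id, session):
        self._data[session_id] = session

    def get(self, session_id):
        return self._data.get(session_id)


class AppServer:
    """A stateless app server: it keeps no session data of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store  # the same store object is shared by every server

    def login(self, user):
        session_id = secrets.token_hex(8)  # would travel back to the client as a cookie
        self.store.put(session_id, {"user": user})
        return session_id

    def whoami(self, session_id):
        session = self.store.get(session_id)
        return session["user"] if session else None


store = SharedSessionStore()
server_a = AppServer("a", store)
server_b = AppServer("b", store)

sid = server_a.login("ada")              # request #1 lands on server a
user_on_b = server_b.whoami(sid)         # request #2 lands on server b
del server_a                             # server a "crashes"
user_after_crash = server_b.whoami(sid)  # the session is still in the shared store
```

Because neither server holds the session itself, the second request can land anywhere, and losing a server costs capacity rather than user state.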
Trace a request down the tier ladder: the load balancer round-robins to a stateless app server, which checks the cache first and only falls through to the database on a miss. A repeated key is then served from the warm cache instead.
For example, on GET /u/7 the load balancer routes to app server 0 (round-robin). Servers are stateless, so any one can serve it.
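The request path can be sketched in a few lines, assuming three app servers, a dict as the shared cache, and a dict as the database. The paths, values, and the `handle` function are made up for illustration.

```python
import itertools

DATABASE = {"/u/7": "user 7", "/u/8": "user 8"}  # hypothetical data
cache = {}                                        # shared cache tier, initially cold
servers = itertools.cycle([0, 1, 2])              # round-robin over three app servers


def handle(path):
    """Return (server, source, value) for one GET request."""
    server = next(servers)   # the load balancer picks the next server in turn
    if path in cache:        # cache hit: the database is not touched
        return server, "cache", cache[path]
    value = DATABASE[path]   # cache miss: fall through to the database
    cache[path] = value      # warm the cache for later requests
    return server, "db", value


trace = [handle(path) for path in ["/u/7", "/u/8", "/u/7"]]
# trace == [(0, "db", "user 7"), (1, "db", "user 8"), (2, "cache", "user 7")]
```

The third request repeats `/u/7`, lands on a different server, and is still a cache hit because the cache sits outside the app servers.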
Load balancers
- L4 (transport): routes by IP/port; fast, protocol-agnostic, no payload awareness.
- L7 (application): routes by URL/header/cookie; enables path-based routing, TLS termination, and smarter health checks. Most web stacks use L7.
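To make the difference concrete, here is a rough sketch of what each layer can base its decision on. The pool names, ports, and both functions are hypothetical; real load balancers express this as configuration rather than application code.

```python
def route_l4(dst_port, pools_by_port):
    """L4: only connection metadata (here the destination port) is visible."""
    return pools_by_port[dst_port]


def route_l7(path, pools_by_prefix, default_pool):
    """L7: the request is parsed, so the longest matching path prefix wins."""
    matches = [prefix for prefix in pools_by_prefix if path.startswith(prefix)]
    if not matches:
        return default_pool
    return pools_by_prefix[max(matches, key=len)]


l4_choice = route_l4(443, {443: "web-pool", 5432: "db-pool"})

l7_rules = {
    "/api/": "api-pool",
    "/api/reports/": "reports-pool",
    "/static/": "static-pool",
}
api_choice = route_l7("/api/users/7", l7_rules, "web-pool")
reports_choice = route_l7("/api/reports/q3", l7_rules, "web-pool")
fallback_choice = route_l7("/about", l7_rules, "web-pool")
```

The L4 function never sees the path, so it cannot send `/api/reports/` traffic to its own pool; that is the payload awareness L7 buys at the cost of parsing every request.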
Algorithms: round-robin, least-connections, weighted, and consistent hashing (route the same key to the same node — vital for cache affinity). Always pair with health checks so dead nodes are pulled out.
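Consistent hashing is the least obvious of these, so here is a minimal sketch of a hash ring with virtual nodes that also skips unhealthy nodes. The class name, the node names, the choice of MD5 as the hash, and the `vnodes=100` default are all assumptions for illustration.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a stable integer position on the ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node is placed at many points so keys spread out more evenly.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key, healthy=None):
        """Walk clockwise from the key's position to the first healthy node."""
        start = bisect.bisect(self._points, _hash(key))
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if healthy is None or node in healthy:
                return node
        raise RuntimeError("no healthy nodes")


ring = HashRing(["cache-a", "cache-b", "cache-c"])
first = ring.node_for("user:7")
again = ring.node_for("user:7")  # same key, same node

healthy = {"cache-a", "cache-b", "cache-c"} - {first}  # health check pulls one node
failover = ring.node_for("user:7", healthy=healthy)
```

When a node is pulled out, only the keys that lived on it move to a neighbour; keys on the surviving nodes keep their placement, which is what preserves cache affinity.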
A single load balancer is a single point of failure. Run it in an active-passive (or active-active) pair with a floating IP / DNS failover, or use a managed LB that's redundant by design.
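The active-passive idea reduces to a small state machine. The sketch below is a toy model only: `FloatingIP`, the node names, and the `is_healthy` callback are hypothetical, and in a real deployment this role is played by a failover daemon or the cloud provider rather than application code.

```python
class FloatingIP:
    """Tracks which load balancer currently owns the virtual IP (toy model)."""

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby

    def check(self, is_healthy):
        """Promote the standby if the active node fails its health check."""
        if not is_healthy(self.active) and is_healthy(self.standby):
            self.active, self.standby = self.standby, self.active
        return self.active


down = set()


def is_healthy(node):
    return node not in down


vip = FloatingIP("lb-1", "lb-2")
before = vip.check(is_healthy)       # lb-1 is healthy and keeps the IP
down.add("lb-1")
after_failure = vip.check(is_healthy)   # lb-2 takes over the IP
down.discard("lb-1")
after_recovery = vip.check(is_healthy)  # no automatic failback in this sketch
```

Clients keep talking to the same address throughout; only the owner of that address changes. Whether the recovered node should take the IP back is a design choice, and this sketch leaves it with the promoted node to avoid a second disruption.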
Don't forget the data tier
Scaling app servers is easy; the database is usually the real bottleneck. The standard ladder: cache reads → read replicas for read scale → shard when writes or storage exceed one box (covered in Databases at scale).
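As a rough illustration of the three rungs, the sketch below puts a cache in front of round-robin read replicas, sends writes to a single primary, and shows hash-based shard selection as a separate function. Everything here is an assumption for illustration: dicts stand in for databases, replication is synchronous, and `DataTier` and `shard_for` are hypothetical names.

```python
import itertools
import zlib


class DataTier:
    """Rungs 1 and 2: a cache in front of read replicas; writes go to the primary."""

    def __init__(self, replica_count=2):
        self.cache = {}
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._replica_ids = itertools.cycle(range(replica_count))
        self.served_by = []  # records which tier answered each read

    def write(self, key, value):
        self.primary[key] = value
        for replica in self.replicas:  # real replication is asynchronous and can lag
            replica[key] = value
        self.cache.pop(key, None)      # invalidate so readers do not see stale data

    def read(self, key):
        if key in self.cache:
            self.served_by.append("cache")
            return self.cache[key]
        replica_id = next(self._replica_ids)
        value = self.replicas[replica_id].get(key)
        self.served_by.append(f"replica-{replica_id}")
        if value is not None:
            self.cache[key] = value
        return value


def shard_for(key, shard_count):
    """Rung 3: pick a shard with a stable hash of the key."""
    return zlib.crc32(key.encode()) % shard_count


tier = DataTier()
tier.write("user:7", "ada")
values = [tier.read("user:7") for _ in range(3)]
shard = shard_for("user:7", 4)
```

Only the first read reaches a replica; the next two are cache hits. Note that plain modulo sharding remaps most keys when `shard_count` changes, which is one reason sharding is the last rung rather than the first.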
Design drills
Scaling is a ladder you climb under pressure. Practise naming the next rung and its cost.
Whiteboard each one out loud for 5–10 minutes before you check what a strong answer covers. The gap between your sketch and the checklist is your study list.
One app server is at 80% CPU. What's the cheapest next move, and when does it stop working?
Reads are the bottleneck at 50k QPS with a 100:1 read:write ratio. Walk the read-scaling ladder in order.
The load balancer is now a single point of failure. How do you make the entry tier itself highly available?
Writes now exceed what one primary can handle. What changes, and what do you lose?