Design Uber

The geospatial one: indexing moving drivers, the matching flow, trip state machines, and why this design is about location, not content.

system-designgeospatialmatchingcase-study

Prompt

Design ride-hailing: riders request a trip, nearby drivers are matched, both sides track each other live, pricing responds to demand. Millions of concurrent drivers and riders.

1. Requirements

Functional: rider requests with pickup/destination; match to a nearby available driver; live location tracking both ways; trip lifecycle (request → match → pickup → ride → payment); dynamic ("surge") pricing per area. Non-functional: matching in a few seconds; location freshness of a few seconds; a trip must never be double-assigned or lost (correctness over availability for the match itself); city-scale locality (Mumbai's traffic never needs Chicago's data).

The opening move: "Unlike the feed/video systems, the hard data here is a continuously moving geospatial index plus a transactional matching step. I'll deep-dive those two."

2. Estimation (shapes the design)

~5 M drivers online at peak, each phone reporting location every ~4 s → ~1.25 M location writes/sec — the single biggest stream in the system, and it's ephemeral (only the latest few points matter). Trip requests: ~1 M trips/hour peak ≈ ~300 matches/sec — tiny by comparison, but each one is a small distributed transaction. The asymmetry is the architecture: a firehose of disposable writes next to a trickle of precious ones.

3. High-level design

4. Deep dives

The geospatial index ("find drivers near me")

The naive query — WHERE distance(driver, rider) < 2km — computes distance to every driver per request. The fix is the same move as every index (B-trees, hashes): precompute structure that narrows the search. For geography, divide the world into cells:

  • Geohash / S2 / H3 — encode (lat, lng) into a cell id where nearby points share prefixes (geohash) or neighboring ids (S2/H3 hexagons). The index becomes an ordinary hash map: cell id → set of driver ids.
  • "Nearby drivers" = look up the rider's cell plus its 8 neighbors (the neighbor lookup is why purpose-built geo cells beat naive grid squares at the poles/meridians — say "H3" and move on).
  • A driver's 4-second update: remove from old cell's set, add to new — two O(1) operations on an in-memory store (Redis-class), sharded by region. The quadtree/k-d-tree alternative earns a mention for static points ("find restaurants"), but cells beat trees when everything moves — constant-time updates, no rebalancing.

Durability stance worth stating: locations are deliberately not durable — a crashed index shard rebuilds within seconds from the next wave of updates (the same "derived data" argument as feed caches). Only trips are precious.

Matching (the transactional trickle)

  1. Rider requests → matching service queries the geo index → ranks candidates (distance/ETA, driver rating, acceptance rate).
  2. Offer the trip to the top driver with a ~15 s TTL — the seat-lock pattern wearing a seatbelt: the offer reserves the driver (state: OFFERED), so they're excluded from other matches; expiry releases them automatically.
  3. Accept → atomic check-and-set on the trip (PENDING → MATCHED, driver AVAILABLE → ON_TRIP). Two matching services racing for one driver is the double-booking race; the driver's state transition is the serialization point.
  4. Decline/timeout → next candidate. Nobody available → expand radius, then surge (below).

The trip state machine (REQUESTED → MATCHED → ARRIVING → IN_PROGRESS → COMPLETED/CANCELLED) lives in a durable, replicated store — this is the system's BookMyShow booking / order saga: every transition is an event, payments hang off COMPLETED, and both apps render the same machine.

Live tracking during a trip

Rider watches the car crawl toward them: the driver's location stream is routed — during a trip, updates also push to the paired rider over the existing WebSocket machinery (session registry says where the rider's socket lives, the WhatsApp routing table). Client-side smoothing/map-snapping makes 4-second updates look continuous — a frontend trick saving 4× the backend load, worth one sentence in any interview.

Surge pricing

Per cell, per minute: demand (open requests) vs supply (available drivers) → a multiplier, computed by a streaming job over the two firehoses (the windowed-aggregation pattern again — it's everywhere). Published to riders before they request. Two honest notes: it's a control system (the multiplier exists to move supply toward demand), and it's lag-tolerant — a 60-second-stale multiplier is fine, which is why it's a streaming aggregate and not a synchronous query.

Think it through like the interview

Think it through: Design UberHLD Classic — geospatial + matching0/5 stages

PROBLEMRide-hailing: riders request, nearby drivers match, both track each other live, pricing responds to demand. Millions of concurrent drivers and riders.

  1. 1

    Find the asymmetry

    Run the estimate: 5M drivers reporting every 4s vs ~300 matches/sec. What does that tell you?

  2. 2

    Index moving points with cells, not trees

    'Drivers near me' — why not a quadtree, and why not WHERE distance() < 2km?

    unlocks after the stage above
  3. 3

    Matching = seat-lock with a TTL

    Two matching servers pick the same driver in the same second. Where does correctness live?

    unlocks after the stage above
  4. 4

    Defend non-durability

    You're proposing NOT persisting 1.25M writes/sec. Justify it before the interviewer pounces.

    unlocks after the stage above
  5. 5

    Close with locality

    What makes this system EASY to scale — and what are the designed relief valves for a demand spike?

    unlocks after the stage above

5. Bottlenecks & failure modes

  • Geo-index shard dies → that city's matching degrades for seconds; index rebuilds from live updates. Riders retry; trips in-flight are unaffected (they live in the durable store). This failure story — ephemeral layer crashes are non-events — is the payoff of the durability split.
  • Event spike (concert ends, rain starts) → request volume × 10 in one cell: matching queue grows, radius expansion + surge are the designed relief valves; the geo index itself doesn't care.
  • Offer storms on the best driver — serialization on driver state prevents double-assignment; ranking diversity (don't offer everyone the same top driver) prevents convoy effects.
  • Cross-city is embarrassingly shardable — Mumbai and Delhi share nothing at runtime; regional isolation is both a scaling and a blast-radius strategy (cellular architecture).

Design drills

Ride-hailing is geospatial matching + atomic assignment + data classification.

Design drills: Ride-hailing (Uber)0/4 done

Whiteboard each one out loud for 5–10 minutes before you reveal what a strong answer covers — the gap between your sketch and the checklist is your study list. Progress is saved on this device.

Warm-up

Match a rider to the nearest available driver. What data structure and flow?

Core

Driver locations update every few seconds at huge volume. How do you store and serve them?

Core

Two requests try to assign the same driver to different trips at once. Guarantee one wins.

Stretch

Classify the system's data by its useful lifetime and pay durability accordingly.