Design WhatsApp

The messaging system design: persistent connections, the online/offline delivery split, message ordering, receipts, and group fan-out.

Tags: system-design, messaging, websocket, case-study

Prompt

Design a 1:1 and group messaging app: real-time delivery, offline users get messages when they return, sent/delivered/read receipts, media. Billions of users.

1. Requirements

  • Functional: 1:1 chat; group chat (cap it: ≤1024 members); delivery states ✓/✓✓/blue; online/last-seen presence; media (images/video).
  • Non-functional: real-time feel (under 500 ms end-to-end); no message loss, ever (the invariant users trust); messages from one sender appear in order; works on flaky mobile networks; end-to-end encrypted.

Scope cut to say aloud: "calls, status/stories and multi-device out; deep-dive 1:1 delivery and group fan-out — OK?"

2. Estimation (shapes the design)

~2 B users, ~100 B messages/day ≈ 1.2 M messages/sec average, ~3–4 M peak. A text message is tiny (~100 bytes + metadata) → message volume is a connection problem, not a bandwidth problem. The number that actually designs the system: hundreds of millions of simultaneously open connections — you can't poll ("any messages for me?" × 2 B phones × every second), so the phones must hold persistent connections and servers must push. A tuned server holds ~1 M idle connections → a fleet of a few hundred chat gateways.
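
The arithmetic behind these figures can be checked in a few lines. This is a rough sketch: the message volume, message size, and per-gateway capacity come from the estimate above, while the 300 M concurrent-connection count (a stand-in for "hundreds of millions") and the 3x peak factor are assumptions chosen only to make the numbers concrete.

```python
# Back-of-envelope check of the estimate.
# ASSUMPTIONS (not from the text): 300 M concurrent connections as a
# stand-in for "hundreds of millions", and a 3x peak-to-average factor.
SECONDS_PER_DAY = 24 * 60 * 60

messages_per_day = 100e9
avg_msgs_per_sec = messages_per_day / SECONDS_PER_DAY
peak_msgs_per_sec = avg_msgs_per_sec * 3

bytes_per_message = 100  # text payload only, metadata excluded
avg_text_bandwidth_mb_s = avg_msgs_per_sec * bytes_per_message / 1e6

concurrent_connections = 300e6
connections_per_gateway = 1e6
gateways_needed = concurrent_connections / connections_per_gateway

print(f"average: {avg_msgs_per_sec / 1e6:.2f} M msg/s")
print(f"peak:    {peak_msgs_per_sec / 1e6:.2f} M msg/s")
print(f"text bandwidth: {avg_text_bandwidth_mb_s:.0f} MB/s fleet-wide")
print(f"gateways: {gateways_needed:.0f}")
```

Fleet-wide text traffic comes out on the order of 100 MB/s, which is small; the connection count is what sets the size of the gateway fleet.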

3. High-level design

  • Chat gateways — stateless-ish WebSocket terminators; their only job is holding sockets and forwarding frames. Scale by adding boxes.
  • Session registry: a user → gateway mapping (Redis), the routing table that answers "where is B connected right now?"
  • Chat service — the brain: persist, look up recipient, route or store-for-later.
  • Message store — write-heavy, append-mostly, queried as "recent messages per conversation" → wide-column store (Cassandra/HBase lineage), partition key = conversation, clustering by time (databases at scale).
  • Media — never through the chat path: upload to object storage, send the URL + key as the message; recipients fetch via CDN.

4. Deep dive — the delivery flow

The heart of the design is one decision repeated forever: is the recipient online?

  1. A sends; gateway acks to A's device (✓ sent — server has it, durably: persisted/enqueued before the ack, or the no-loss invariant is fiction).
  2. Chat service looks up B in the session registry.
    • Online → push down B's gateway socket. B's device acks → ✓✓ delivered (a receipt is just a tiny system message flowing back).
    • Offline → append to B's offline inbox (per-user durable queue). When B reconnects, the gateway drains the inbox, B acks each, ✓✓ fires late.
  3. Read (blue) is purely client-emitted when the chat is viewed — product semantics layered on the same receipt channel.
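
The three steps above can be sketched with in-memory stand-ins for the store, the session registry, and the offline inbox. Every class, method, and field name here is hypothetical. One deliberate simplification that the text does not state: every message sits in the recipient's inbox until the device acks it, so the online branch is "inbox plus an immediate push" and a lost push needs no special handling.

```python
from collections import defaultdict


class ChatService:
    """In-memory stand-in for the chat service. All names are hypothetical."""

    def __init__(self):
        self.store = {}                 # msg_id -> message (stand-in for the durable store)
        self.registry = {}              # user -> push callable (stand-in for the session registry)
        self.inbox = defaultdict(list)  # user -> msg_ids not yet acked by the device
        self.status = {}                # msg_id -> "sent" | "delivered" | "read"

    def send(self, msg_id, sender, recipient, body):
        # Step 1: persist before acking the sender, so "sent" means "durable".
        msg = {"id": msg_id, "from": sender, "to": recipient, "body": body}
        self.store[msg_id] = msg
        self.inbox[recipient].append(msg_id)
        self.status[msg_id] = "sent"
        # Step 2: the one decision. Online -> push now; offline -> the inbox waits.
        push = self.registry.get(recipient)
        if push is not None:
            push(msg)  # if this push is lost, the inbox still holds the message
        return "sent"

    def connect(self, user, push):
        # Reconnect: register the socket, then drain everything still unacked.
        self.registry[user] = push
        for msg_id in list(self.inbox[user]):
            push(self.store[msg_id])

    def disconnect(self, user):
        self.registry.pop(user, None)

    def ack_delivered(self, user, msg_id):
        # Device ack: only now does the message leave the inbox.
        if msg_id in self.inbox[user]:
            self.inbox[user].remove(msg_id)
            self.status[msg_id] = "delivered"

    def ack_read(self, msg_id):
        # Step 3: emitted by the client when the chat is viewed.
        self.status[msg_id] = "read"


svc = ChatService()
received = []
svc.send("m1", "A", "B", "hi")        # B is offline: the message waits in the inbox
svc.connect("B", received.append)     # reconnect drains the inbox
svc.ack_delivered("B", "m1")          # the delivered receipt fires late
```

Because the inbox is written on every send, a dead socket or stale registry entry costs latency but never a message.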

[Diagram: the whole flow, both branches of the one decision.]

No-loss mechanics: every hop is ack-or-retry with the message kept until acked — at-least-once delivery, so duplicates can occur (network blips, retries) → client dedupes by message id (sender-generated UUID). At-least-once + idempotent receive is the standard answer (messaging); exactly-once-on-the-wire is a trap to walk past, out loud.
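
A minimal sketch of at-least-once delivery paired with an idempotent receiver, assuming a single hop and a link that loses the first ack. `Receiver`, `send_with_retry`, and `flaky_link` are hypothetical names; a real client would also bound the set of remembered ids.

```python
import uuid


class Receiver:
    """Client-side idempotent receive. Names are hypothetical."""

    def __init__(self):
        self.seen_ids = set()
        self.displayed = []

    def on_message(self, msg):
        # Show a message only the first time its id is seen.
        if msg["id"] not in self.seen_ids:
            self.seen_ids.add(msg["id"])
            self.displayed.append(msg["body"])
        # Always ack, even a duplicate: the sender retries until it sees an ack.
        return {"ack": msg["id"]}


def send_with_retry(deliver, msg, max_attempts=5):
    """Resend until acked. `deliver` returns an ack dict, or None if the ack was lost."""
    for attempt in range(1, max_attempts + 1):
        ack = deliver(msg)
        if ack is not None and ack.get("ack") == msg["id"]:
            return attempt
    raise TimeoutError(f"no ack for {msg['id']} after {max_attempts} attempts")


receiver = Receiver()
acks_to_drop = [True, False]  # simulate losing the first ack in transit


def flaky_link(msg):
    ack = receiver.on_message(msg)  # the message itself arrives every time
    return None if acks_to_drop.pop(0) else ack


msg = {"id": str(uuid.uuid4()), "body": "hello"}  # sender-generated id
attempts = send_with_retry(flaky_link, msg)
```

The message crosses the wire twice, but the receiver shows it once: the retry supplies the no-loss half and the id check supplies the no-duplicate half.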

Ordering: global ordering is unnecessary and unscalable; ordering only matters within a conversation, and the requirement here is narrower still: messages from one sender appear in order. Use per-sender sequence numbers within each conversation; the receiver sorts and gap-detects. Wall-clock timestamps lie (clock skew); say "sequence numbers, not timestamps."
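
One way the receiver side could work is a small reorder buffer per sender per conversation. This is a sketch under assumptions: the `SenderStream` name is hypothetical, and sequence numbers are assumed to start at 1 and increase by 1.

```python
class SenderStream:
    """Receive-side reorder buffer for ONE sender in ONE conversation (hypothetical name)."""

    def __init__(self):
        self.next_seq = 1  # assumption: sequence numbers start at 1
        self.buffer = {}   # seq -> body, for messages that arrived early

    def receive(self, seq, body):
        """Return the messages that are now safe to display, in order."""
        if seq < self.next_seq or seq in self.buffer:
            return []  # duplicate: already displayed or already buffered
        self.buffer[seq] = body
        ready = []
        while self.next_seq in self.buffer:
            ready.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        return ready

    def missing(self):
        """Sequence numbers below the highest buffered one that have not arrived."""
        if not self.buffer:
            return []
        return [s for s in range(self.next_seq, max(self.buffer)) if s not in self.buffer]


stream = SenderStream()
first = stream.receive(1, "a")    # in order: displayed immediately
early = stream.receive(3, "c")    # arrived before 2: held back
gap = stream.missing()            # the client can re-request these
filled = stream.receive(2, "b")   # gap filled: 2 and 3 are released together
```

No clock appears anywhere: the gap is detected from the numbers alone, which is why skew between devices cannot reorder a sender's messages.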

Groups: sender uploads once; the chat service (or a group service with the member list) fans out server-side — N inbox-or-push deliveries, reusing the entire 1:1 machinery. The 1024-member cap exists precisely to bound fan-out cost; "what if groups were 1 M?" → that's a broadcast channel, a different design (pull/feed, not push — closer to Netflix's read path).
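
Server-side fan-out can be sketched as a loop over the member list that calls the existing 1:1 path once per recipient. `fan_out` and `deliver_one` are hypothetical names, and `deliver_one` stands in for whatever function performs the push-or-inbox delivery.

```python
MAX_GROUP_SIZE = 1024  # the cap from the requirements


def fan_out(group_members, sender, msg, deliver_one):
    """One upload from the sender, then one 1:1 delivery per other member.

    `deliver_one(recipient, msg)` stands in for the existing 1:1 path
    (push if online, offline inbox otherwise). Both names are hypothetical.
    """
    if len(group_members) > MAX_GROUP_SIZE:
        raise ValueError("group exceeds the fan-out cap; use a pull-based channel instead")
    recipients = [m for m in group_members if m != sender]
    for recipient in recipients:
        deliver_one(recipient, msg)
    return len(recipients)


delivered = []
count = fan_out(
    ["alice", "bob", "carol"],
    "alice",
    {"id": "g1", "body": "hi all"},
    lambda recipient, msg: delivered.append((recipient, msg["id"])),
)
```

The cost per group message is linear in group size, which is exactly what the cap bounds.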

E2EE in one line: with the Signal protocol, devices exchange keys and the server forwards ciphertext it cannot read — which forces server-side features (search, spam filtering) onto the client. One sentence of this earns the depth point; the crypto itself is out of scope.

Think it through like the interview


Problem: 1:1 and group messaging, with real-time delivery, offline users catching up on return, sent/delivered/read receipts, and media. Billions of users. No message loss, ever.

  1. Find the number that designs the system. 100B messages/day sounds scary. Run the estimate: what's the ACTUAL hard number?
  2. One decision, repeated forever. Strip the system to its core loop. What single question does it answer per message?
  3. No-loss is an ack chain. Where can a message die between two phones, and what plugs each gap?
  4. Ordering without a global clock. Messages must appear in order. Global ordering? Timestamps? What's actually needed?
  5. Groups, and the push→pull boundary. Group of 1024 → fine. Group of 1M → different design. Why, and where's the line?

5. Bottlenecks & failure modes

  • A gateway dies → its ~1 M sockets drop; phones auto-reconnect (with jittered backoff — avoid the thundering herd), land on other gateways, registry updates, inboxes drain. Messages survived because durability lives in the queue/store, never in the gateway.
  • Presence storms — last-seen updates from 2 B phones would melt a naive design: batch, debounce, and only fan out presence to users actively viewing that contact's chat.
  • Hot conversations (a 1024-member group going viral) → fan-out spikes; queue absorbs, workers scale (the queue's whole job).
  • Registry staleness — B's socket moved between lookup and push → push fails → fall back to offline inbox; reconnect drains it. Every online-path failure degrades to the offline path: the offline path is the system's safety net, design it first.
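
The jittered backoff mentioned in the gateway-failure bullet could look like the sketch below. It uses the "full jitter" variant (a uniformly random delay up to an exponentially growing ceiling), which is one common choice rather than anything the text prescribes; the function name and the `base` and `cap` values are assumptions.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Full-jitter exponential backoff.

    Returns a delay in seconds, uniform in [0, min(cap, base * 2**attempt)].
    `base` and `cap` are illustrative values, not tuned recommendations.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Delays a client might wait before reconnect attempts 0..7.
delays = [reconnect_delay(attempt) for attempt in range(8)]
```

Randomizing the whole interval spreads the ~1 M dropped clients across time, so the surviving gateways see a ramp instead of a spike.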

Design drills

Messaging is delivery guarantees + fan-out. Defend the hard calls.


Whiteboard each one out loud for 5–10 minutes before you reveal what a strong answer covers — the gap between your sketch and the checklist is your study list.

  • Warm-up: A message is sent to an offline recipient. Trace how it becomes visible exactly once when they reconnect.
  • Core: Design the acknowledgement chain behind the sent / delivered / read ticks.
  • Core: Group fan-out. When does push-per-member stop scaling, and what do you do then?
  • Stretch: End-to-end encryption changes the server's role. What can it no longer do, and what still works?