What we're building
When something happens in our platform (a payment succeeds, an order ships), we must call each subscribed merchant's HTTP endpoint with that event — and keep calling until they acknowledge it, without losing events, melting a slow merchant, or letting one noisy merchant starve the rest.
Scope it out loud:
- Functional: register endpoints + event subscriptions; deliver each event; retry failures; expose delivery status; let merchants replay.
- Non-functional: at-least-once delivery, durability (never lose an event), per-merchant ordering (best-effort), isolation between merchants, and security (signed, verifiable payloads).
- Scale (state an estimate): say 50k events/sec at peak, fan-out 1–3×, p99 delivery < a few seconds when merchants are healthy.
Architecture
The spine: durably record the event first, then deliver asynchronously. The producer's job ends at the durable write; delivery is a separate, retriable concern. That decoupling is what makes "guaranteed" delivery possible.
Guaranteed delivery & the outbox
The trap is the dual write: if a producer writes its DB row and publishes to
Kafka as two steps, a crash between them either loses the event or emits one for a
rolled-back change. Fix with the transactional outbox — write the event to an
outbox table in the same DB transaction as the business change, then a relay
(CDC/poller) publishes it to Kafka. The event is durable the instant the business
fact is, and the relay is free to retry publishing.
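A minimal sketch of the outbox pattern described above, using SQLite in place of Postgres so it runs standalone. The table layout, the function names (`create_schema`, `ship_order`, `relay_once`) and the `publish` callback standing in for a Kafka producer are all assumptions made for illustration, not part of any real schema.

```python
import json
import sqlite3
import uuid
from typing import Callable


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical tables: one business table and the outbox beside it.
    conn.executescript(
        """
        CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            event_id    TEXT PRIMARY KEY,
            merchant_id TEXT NOT NULL,
            payload     TEXT NOT NULL,
            published   INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def ship_order(conn: sqlite3.Connection, order_id: str, merchant_id: str) -> str:
    """Change the business row and record the event in ONE transaction."""
    event_id = str(uuid.uuid4())
    payload = json.dumps({"type": "order.shipped", "order_id": order_id})
    with conn:  # commits both statements together, or rolls both back
        conn.execute("UPDATE orders SET status = 'shipped' WHERE id = ?", (order_id,))
        conn.execute(
            "INSERT INTO outbox (event_id, merchant_id, payload) VALUES (?, ?, ?)",
            (event_id, merchant_id, payload),
        )
    return event_id


def relay_once(conn: sqlite3.Connection, publish: Callable[[str, str, str], None]) -> int:
    """Poll unpublished rows and hand them to `publish` (a stand-in for Kafka)."""
    rows = conn.execute(
        "SELECT event_id, merchant_id, payload FROM outbox "
        "WHERE published = 0 ORDER BY rowid"
    ).fetchall()
    for event_id, merchant_id, payload in rows:
        publish(merchant_id, event_id, payload)  # if this raises, the row stays unpublished
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,))
    return len(rows)
```

If the relay crashes after `publish` but before marking the row, the next poll publishes the same event again. That is acceptable here because everything downstream is at-least-once anyway.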
From Kafka onward we get at-least-once: a worker only commits its offset after
the merchant acknowledges (2xx). Crash before the ack → the event is
redelivered. Duplicates are therefore expected, which is why every payload
carries a unique event_id and we tell merchants to dedupe on it — the
same idempotency contract as a payment retry, now across a
service boundary.
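On the merchant side, the dedupe contract could look roughly like the sketch below. `IdempotentReceiver` is a hypothetical name, and the in-memory set is an assumption for brevity; a real receiver would more likely keep seen IDs in a durable store with a unique constraint.

```python
from typing import Callable, Dict, Set


class IdempotentReceiver:
    """Applies each event at most once, keyed by its event_id."""

    def __init__(self, apply: Callable[[Dict], None]) -> None:
        self._apply = apply
        self._seen: Set[str] = set()  # assumption: in-memory only, for illustration

    def handle(self, event: Dict) -> int:
        event_id = event["event_id"]
        if event_id in self._seen:
            # Duplicate delivery: acknowledge again, but do not re-apply.
            return 200
        self._apply(event)
        self._seen.add(event_id)
        return 200
```

Returning 2xx for a duplicate matters: it lets the sender commit its offset and stop retrying an event the merchant has already processed.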
Ordering
Strict global ordering doesn't scale and merchants rarely need it. Per-merchant
(or per-resource) ordering is the useful guarantee: partition the Kafka topic by
merchant_id (or merchant_id + resource_id) so all of one merchant's events
land on one partition and one consumer processes them in order.
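The routing idea can be sketched as a stable hash of the key modulo the partition count. Kafka clients apply their own hash to the record key; SHA-256 is used here purely for illustration, and `partition_key` and `partition_for` are hypothetical helpers.

```python
import hashlib
from typing import Optional


def partition_key(merchant_id: str, resource_id: Optional[str] = None) -> str:
    """Key by merchant, or by merchant + resource for finer-grained ordering."""
    return merchant_id if resource_id is None else f"{merchant_id}:{resource_id}"


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition; the same key always lands on the same partition."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Keying on merchant + resource spreads one large merchant across several partitions, at the cost of ordering only within each resource rather than across the whole merchant.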
If event #1 fails and you retry it while #2 succeeds, you've reordered. Options:
block the partition until #1 drains (strong order, head-of-line blocking) or
allow out-of-order with sequence numbers and let merchants reorder (more
throughput, weaker guarantee). Say the trade-off and pick per requirements — most
webhook products choose best-effort order + an event_id/timestamp.
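For the second option, a merchant-side reorder buffer might look like the sketch below. It assumes each event carries a gap-free, per-merchant sequence number starting at 1, which is an assumption beyond the text; `ReorderBuffer` is a hypothetical name.

```python
from typing import Dict, List


class ReorderBuffer:
    """Holds early arrivals and releases events in sequence order."""

    def __init__(self, first_seq: int = 1) -> None:
        self._next = first_seq
        self._pending: Dict[int, dict] = {}

    def accept(self, seq: int, event: dict) -> List[dict]:
        """Return the events that are now deliverable in order (possibly none)."""
        if seq < self._next or seq in self._pending:
            return []  # already released or already buffered: a duplicate
        self._pending[seq] = event
        ready: List[dict] = []
        while self._next in self._pending:
            ready.append(self._pending.pop(self._next))
            self._next += 1
        return ready
```

The buffer trades latency for order: if event #1 is stuck in retries, #2 waits in memory on the merchant's side instead of blocking the sender's partition.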
SQL vs NoSQL for the event store
- The outbox / source of truth → relational. It rides the business transaction (you need ACID with the order/payment write), and volume is bounded by your own writes. Postgres.
- The delivery log / status (every attempt, for every event, for every merchant) → high-write, time-series, queried by (merchant, time). This is the part that explodes: a wide-column store (Cassandra) or a partitioned/sharded table with TTL fits the append-heavy, range-scan pattern. "SQL for the truth, scale-out store for the firehose of attempts" is the mature answer.
Retries & the dead-letter queue
A merchant endpoint is flaky or down. Retry each failed attempt with exponential backoff + jitter (re-publish to a delay topic, don't busy-loop), up to a max attempt cap. Past the cap, the event goes to the dead-letter queue: it is parked with its failure history, never silently dropped, and a merchant can replay from it once they're healthy. A per-endpoint circuit breaker stops you wasting workers hammering a merchant who's been 500-ing for an hour: trip it, shed that merchant's deliveries to the retry path, and probe periodically to detect recovery.
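The retry policy and breaker can be sketched as follows. The "full jitter" variant (a uniform delay between zero and the exponential ceiling) is one common choice rather than the only one, and every name and default value here (`backoff_delay`, `next_step`, `CircuitBreaker`, 8 attempts, a 30-second probe interval) is an assumption for illustration.

```python
import random
from typing import Optional, Tuple


def backoff_delay(failed_attempts: int, base: float = 1.0, cap: float = 3600.0,
                  rng=random) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**failed_attempts)] seconds."""
    ceiling = min(cap, base * (2 ** failed_attempts))
    return rng.uniform(0.0, ceiling)


def next_step(failed_attempts: int, max_attempts: int = 8) -> Tuple[str, float]:
    """Decide what happens after a failure: schedule a retry or dead-letter."""
    if failed_attempts >= max_attempts:
        return ("dead_letter", 0.0)
    return ("retry", backoff_delay(failed_attempts))


class CircuitBreaker:
    """Opens after N consecutive failures; allows a probe after an interval."""

    def __init__(self, failure_threshold: int = 5, probe_interval: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.probe_interval = probe_interval
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True  # closed: deliveries flow normally
        return now - self.opened_at >= self.probe_interval  # open: only probes pass

    def record(self, success: bool, now: float) -> None:
        if success:
            self.consecutive_failures = 0
            self.opened_at = None  # a successful probe closes the breaker
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = now  # trip, or re-open after a failed probe
```

This sketch does not limit how many concurrent probes pass once the interval has elapsed; a production breaker would usually admit a single probe at a time.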
Rate limiting per merchant
A small merchant's server must not be flooded, and a backlog for one merchant must not consume all workers. Give each merchant a token bucket (R req/s, burst B); a delivery only proceeds if it takes a token, else it waits in that merchant's lane. This is exactly the rate-limiting primitive, keyed per merchant, and it produces burst-then-throttle behaviour.
For example, with R = 2/s and B = 5 the bucket starts full at 5 tokens. R bounds the long-run rate; B is the burst a quiet period earns you.
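A minimal token-bucket sketch with the same numbers (R = 2/s, B = 5). The clock is passed in explicitly so the behaviour is easy to trace; the class and method names are hypothetical.

```python
class TokenBucket:
    """Per-merchant limiter: refills at `rate` tokens/sec up to `burst`."""

    def __init__(self, rate: float, burst: float, now: float = 0.0) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # the bucket starts full
        self.updated = now

    def try_take(self, now: float) -> bool:
        """Take one token if available; the caller queues the delivery otherwise."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate=2.0, burst=5.0)
burst = [bucket.try_take(0.0) for _ in range(7)]
print(burst)  # [True, True, True, True, True, False, False]
```

Seven deliveries arriving at the same instant see five pass and two wait. Half a second later exactly one more token has accrued, so one of the waiting deliveries can proceed.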
Observability & alerting
You can't run delivery you can't see. Track, per merchant and global:
- Delivery success rate, attempt counts, and p50/p99 delivery latency.
- Queue lag (events waiting) and DLQ depth (the single best health signal — a rising DLQ means something is broken).
- Per-merchant dashboards + alerts: page on DLQ growth, a collapsed
success rate, or a tripped breaker. Emit structured logs keyed by
event_id so one delivery is traceable end to end.
Security
Sign each payload with an HMAC over the body using the merchant's secret
(X-Signature header) so they can verify authenticity and reject forgeries;
include a timestamp to stop replay. Deliver only over HTTPS, with a tight timeout
so a hanging merchant can't tie up a worker.
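A sketch of signing and verification with Python's standard `hmac` module. Binding the timestamp into the signed message (here as `timestamp.body`), the 300-second tolerance, and the function names are assumptions for illustration; the text only specifies an HMAC over the body plus a timestamp.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign(secret: bytes, body: bytes, timestamp: int) -> str:
    """HMAC-SHA256 over 'timestamp.body', hex-encoded for an X-Signature header."""
    message = str(timestamp).encode("ascii") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, timestamp: int, signature: str,
           now: Optional[int] = None, tolerance: int = 300) -> bool:
    """Merchant-side check: reject stale timestamps, then compare in constant time."""
    current = int(time.time()) if now is None else now
    if abs(current - timestamp) > tolerance:
        return False  # too old or too far in the future: possible replay
    expected = sign(secret, body, timestamp)
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

Covering the timestamp with the signature is what makes the replay check meaningful: an attacker who captures a request cannot refresh the timestamp without invalidating the signature.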
Think it through like the interview
PROBLEM: Deliver platform events to merchants' HTTP endpoints with guaranteed delivery, ordering, retries, and per-merchant isolation. ~45 minutes.
- 1. Pin the guarantees: “Before any boxes: which delivery and ordering guarantees are you promising?”
- 2. Make it durable before async: “A producer's DB commit succeeds but the process dies before publishing. What did you lose?”
- 3. Decouple delivery: “Why put Kafka between the event and the HTTP call at all?”
- 4. Fail without losing or flooding: “A merchant is down for an hour. What happens to their events — and to your workers?”
- 5. Prove it works: “How do you know delivery is healthy at 2am?”
Design drills
You've seen the architecture — now defend the calls an interviewer will push on.
Whiteboard each one out loud for 5–10 minutes before you reveal what a strong answer covers; the gap between your sketch and the checklist is your study list.
- Pin the delivery and ordering guarantees you'll promise a merchant, and the contract that makes them livable.
- The producer commits its DB change but crashes before publishing the event. Show how you lose nothing.
- A merchant is down for an hour. Walk through what happens to their events and to your workers.
- Choose the storage for (a) the outbox/source of truth and (b) the per-attempt delivery log, and justify.
- A merchant strictly requires events in order, but retries can reorder them. Resolve it.
- It's 2am and deliveries are failing. What metrics and alerts tell you what's wrong?