Design a Webhook Delivery System

Deliver events to merchants' URLs reliably: at-least-once delivery, per-merchant ordering, an append-only event store, retries with a dead-letter queue, Kafka for async fan-out, per-merchant rate limiting, signing, and the observability that proves it works.

What we're building

When something happens in our platform (a payment succeeds, an order ships), we must call each subscribed merchant's HTTP endpoint with that event — and keep calling until they acknowledge it, without losing events, melting a slow merchant, or letting one noisy merchant starve the rest.

Scope it out loud:

Functional: register endpoints + event subscriptions; deliver each event; retry failures; expose delivery status; let merchants replay.
Non-functional: at-least-once delivery, durability (never lose an event), per-merchant ordering (best-effort), isolation between merchants, and security (signed, verifiable payloads).
Scale (state an estimate): say 50k events/sec at peak with fan-out of 1–3×, so roughly 50k–150k delivery attempts/sec before retries; p99 delivery latency of a few seconds or less when merchants are healthy.

Architecture

The spine: durably record the event first, then deliver asynchronously. The producer's job ends at the durable write; delivery is a separate, retriable concern. That decoupling is what makes "guaranteed" delivery possible.

Guaranteed delivery & the outbox

The trap is the dual write: if a producer writes its DB row and publishes to Kafka as two steps, a crash between them either loses the event or emits one for a rolled-back change. Fix with the transactional outbox — write the event to an outbox table in the same DB transaction as the business change, then a relay (CDC/poller) publishes it to Kafka. The event is durable the instant the business fact is, and the relay is free to retry publishing.
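The outbox pattern above can be sketched in a few lines. This is a minimal illustration, not a production relay: in-memory SQLite stands in for the relational database, the `payments` and `outbox` tables and the function names are hypothetical, and `publish` is a placeholder for whatever hands the event to Kafka.

```python
import json
import sqlite3
import uuid


def open_db() -> sqlite3.Connection:
    # In-memory SQLite stands in for the production relational database.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments (id TEXT PRIMARY KEY, merchant_id TEXT, amount_cents INTEGER)"
    )
    conn.execute(
        "CREATE TABLE outbox ("
        " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " event_id TEXT UNIQUE, merchant_id TEXT, payload TEXT,"
        " published INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def record_payment(conn: sqlite3.Connection, payment_id: str,
                   merchant_id: str, amount_cents: int) -> str:
    event_id = str(uuid.uuid4())
    payload = json.dumps({"type": "payment.succeeded",
                          "payment_id": payment_id,
                          "amount_cents": amount_cents})
    # One transaction: the business row and the outbox row commit together
    # or not at all, so there is no dual write to get wrong.
    with conn:
        conn.execute("INSERT INTO payments VALUES (?, ?, ?)",
                     (payment_id, merchant_id, amount_cents))
        conn.execute(
            "INSERT INTO outbox (event_id, merchant_id, payload) VALUES (?, ?, ?)",
            (event_id, merchant_id, payload))
    return event_id


def relay_once(conn: sqlite3.Connection, publish) -> int:
    """Publish every unpublished outbox row in order; return how many were sent."""
    rows = conn.execute(
        "SELECT seq, event_id, merchant_id, payload FROM outbox "
        "WHERE published = 0 ORDER BY seq").fetchall()
    sent = 0
    for seq, event_id, merchant_id, payload in rows:
        # If publish raises, the row stays unpublished and the next pass retries it.
        publish(merchant_id, event_id, payload)
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
        sent += 1
    return sent
```

Note that a crash between `publish` and the `UPDATE` republishes the row on the next pass, so the relay itself is at-least-once; that is acceptable because downstream consumers dedupe on event_id.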

From Kafka onward we get at-least-once: a worker commits its offset only after the merchant acknowledges (2xx) or the event has been durably handed off to a retry topic or the DLQ. Crash before that point → the event is redelivered. Duplicates are therefore expected, which is why every payload carries a unique event_id and we tell merchants to dedupe on it: the same idempotency contract as a payment retry, now across a service boundary.
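On the merchant's side, the dedupe contract might look like the sketch below. The class name and the in-memory set are illustrative assumptions; a real receiver would more likely rely on a unique constraint on event_id in its own database.

```python
class IdempotentReceiver:
    """Hypothetical merchant-side handler that runs each event_id at most once."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # stand-in for a durable store keyed by event_id

    def receive(self, event: dict) -> int:
        event_id = event["event_id"]
        if event_id in self._seen:
            # Duplicate delivery: acknowledge again, but skip the side effects.
            return 200
        # If the handler raises, the id is not recorded, so a redelivery retries it.
        self._handler(event)
        self._seen.add(event_id)
        return 200
```

Returning 200 for a duplicate matters: answering with an error would make the sender retry an event the merchant has already processed.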

Ordering

Strict global ordering doesn't scale and merchants rarely need it. Per-merchant (or per-resource) ordering is the useful guarantee: partition the Kafka topic by merchant_id (or merchant_id + resource_id) so all of one merchant's events land on one partition and one consumer processes them in order.
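To make the partitioning idea concrete, here is a sketch of a stable key-to-partition mapping. It is illustrative only: in practice you would set the Kafka message key to merchant_id and let the client's partitioner hash it (the Java client's default uses murmur2, not the SHA-256 used here), and `partition_for` is a hypothetical name.

```python
import hashlib
from typing import Optional


def partition_for(merchant_id: str, num_partitions: int,
                  resource_id: Optional[str] = None) -> int:
    """Map a merchant (optionally merchant + resource) to a fixed partition."""
    key = merchant_id if resource_id is None else f"{merchant_id}:{resource_id}"
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # A stable hash means the same key always lands on the same partition,
    # so one consumer sees that key's events in the order they were written.
    return int.from_bytes(digest[:8], "big") % num_partitions
```

The mapping only stays stable while `num_partitions` is fixed; adding partitions remaps keys, which is one reason to choose the partition count generously up front.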

Ordering and retries fight each other

If event #1 fails and you retry it while #2 succeeds, you've reordered. Options: block the partition until #1 drains (strong order, head-of-line blocking) or allow out-of-order with sequence numbers and let merchants reorder (more throughput, weaker guarantee). State the trade-off and pick per requirements: most webhook products choose best-effort order + an event_id/timestamp.
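The "sequence numbers, merchant reorders" option can be sketched as a small buffer on the receiving side. It assumes each merchant's events carry a gapless, increasing sequence number starting at 1, which is an assumption of this example rather than something the design above mandates.

```python
class ReorderBuffer:
    """Hypothetical receiver-side buffer that releases events in sequence order."""

    def __init__(self, next_seq: int = 1):
        self._next = next_seq   # the sequence number we are waiting for
        self._pending = {}      # out-of-order events, keyed by sequence number

    def accept(self, seq: int, event) -> list:
        """Store one event and return every event that is now deliverable in order."""
        if seq < self._next or seq in self._pending:
            return []  # duplicate of something already released or already held
        self._pending[seq] = event
        ready = []
        while self._next in self._pending:
            ready.append(self._pending.pop(self._next))
            self._next += 1
        return ready
```

A real receiver would also need a timeout or skip rule, since an event that ends up dead-lettered would otherwise leave a gap that holds back everything behind it.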

SQL vs NoSQL for the event store

The outbox / source of truth → relational. It rides the business transaction (you need ACID with the order/payment write), and volume is bounded by your own writes. Postgres.
The delivery log / status (every attempt, for every event, for every merchant) → high-write, time-series, queried by (merchant, time). This is the part that explodes: a wide-column store (Cassandra) or a partitioned/sharded table with TTL fits the append-heavy, range-scan pattern. "SQL for the truth, scale-out store for the firehose of attempts" is the mature answer.
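One way to picture the (merchant, time) access pattern is as an attempt record plus the key it would be stored under. The field names and the one-day time bucket below are illustrative assumptions, not a prescribed schema; the point is only that the partition key groups one merchant's attempts for a bounded time window.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class DeliveryAttempt:
    merchant_id: str
    event_id: str
    attempt: int                  # 1-based attempt number
    attempted_at: datetime        # expected to be timezone-aware
    status_code: Optional[int]    # None when the request timed out or never connected
    latency_ms: int


def partition_key(a: DeliveryAttempt) -> Tuple[str, str]:
    """(merchant, UTC day): bounds partition size and matches the range-scan query."""
    day = a.attempted_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (a.merchant_id, day)
```

Within a partition, rows would typically be ordered by attempt time so that "recent attempts for merchant X" is a single range scan, and a TTL can expire whole days at once.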

Retries & the dead-letter queue

A merchant endpoint is flaky or down. On each failed attempt: exponential backoff + jitter (re-publish to a delay topic, don't busy-loop), a max attempt cap, and then the dead-letter queue: the event is parked with its failure history, never silently dropped, and a merchant can replay from it once they're healthy. A per-endpoint circuit breaker stops you wasting workers hammering a merchant who's been 500-ing for an hour: trip it, send that endpoint's events straight to the retry path without calling it, and probe periodically to detect recovery.
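The retry policy and breaker described above might be sketched as follows. All the numbers (1 s base, one-hour cap, 8 attempts, 5 failures to trip, 60 s cooldown) are placeholder assumptions, and the "full jitter" variant shown is one common choice among several.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 3600.0,
                  rng=random) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**(attempt - 1))] seconds."""
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    return rng.uniform(0.0, ceiling)


def next_step(attempt: int, max_attempts: int = 8):
    """After failed attempt number `attempt` (1-based), decide where the event goes."""
    if attempt >= max_attempts:
        return ("dlq", None)                  # park it with its failure history
    return ("retry", backoff_delay(attempt))  # re-publish to a delay topic


class CircuitBreaker:
    """Per-endpoint breaker: opens after consecutive failures, probes after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def allow(self, now: float) -> bool:
        if self._opened_at is None:
            return True
        # Open: let a probe through only once the cooldown has elapsed.
        return now - self._opened_at >= self.cooldown

    def record(self, success: bool, now: float) -> None:
        if success:
            self._failures = 0
            self._opened_at = None  # a successful call or probe closes the breaker
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = now   # open, or restart the cooldown after a failed probe
```

A worker would call `allow` before each HTTP attempt; when it returns False, the event goes to the retry path without touching the merchant, which is what keeps workers free for healthy endpoints. This simplified breaker may admit more than one probe after the cooldown; limiting it to a single in-flight probe is a straightforward extension.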

Rate limiting per merchant

A small merchant's server must not be flooded, and a backlog for one merchant must not consume all workers. Give each merchant a token bucket (R req/s, burst B); a delivery only proceeds if it takes a token, else it waits in that merchant's lane. This is exactly the rate-limiting primitive, keyed per merchant — watch the burst-then-throttle behaviour:

Token bucket demo (interactive widget; controls: rate R/s, capacity B, request times in seconds): burst, then throttle. Time O(1) per request, space O(1).

The bucket starts full at t = 0s with 5/5 tokens and refills at 2/s. R = 2/s bounds the long-run rate; B = 5 is the burst a quiet period earns you.
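A minimal token bucket with the same R = 2/s, B = 5 parameters could look like this. It is a sketch: the clock is passed in explicitly to keep the example deterministic, and the class name is an assumption.

```python
class TokenBucket:
    """Token bucket with O(1) time per request and O(1) space."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate          # R: tokens added per second
        self.capacity = capacity  # B: maximum burst size
        self.tokens = capacity    # the bucket starts full
        self.updated = now

    def try_acquire(self, now: float) -> bool:
        # Refill lazily, based on the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False              # the caller keeps the delivery in the merchant's lane


bucket = TokenBucket(rate=2.0, capacity=5.0)
burst = [bucket.try_acquire(0.0) for _ in range(7)]  # 5 pass, then 2 are throttled
later = bucket.try_acquire(0.5)                      # half a second refills one token
```

Keying it per merchant is then a dictionary from merchant_id to bucket, so one merchant's backlog never spends another merchant's tokens.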

Observability & alerting

You can't run delivery you can't see. Track, per merchant and global:

Delivery success rate, attempt counts, and p50/p99 delivery latency.
Queue lag (events waiting) and DLQ depth (the single best health signal — a rising DLQ means something is broken).
Per-merchant dashboards + alerts: page on DLQ growth, a collapsed success rate, or a tripped breaker. Emit structured logs keyed by event_id so one delivery is traceable end to end.
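A structured log line keyed by event_id might be emitted like this. The field names and the "delivered"/"failed" outcome labels are illustrative assumptions; what matters is that every attempt produces one machine-parseable record carrying the same event_id.

```python
import json
import time
from typing import Optional


def delivery_log_line(event_id: str, merchant_id: str, attempt: int,
                      status_code: Optional[int], latency_ms: int,
                      now: Optional[float] = None) -> str:
    """Build one JSON log line for a single delivery attempt."""
    delivered = status_code is not None and 200 <= status_code < 300
    record = {
        "ts": time.time() if now is None else now,
        "event_id": event_id,        # the key that ties all attempts together
        "merchant_id": merchant_id,
        "attempt": attempt,
        "status_code": status_code,  # None for timeouts and connection errors
        "latency_ms": latency_ms,
        "outcome": "delivered" if delivered else "failed",
    }
    return json.dumps(record, sort_keys=True)
```

Filtering the log stream on a single event_id then shows every attempt for that delivery, and the same records can feed the success-rate and latency metrics.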

Security

Sign each payload with an HMAC over the body using the merchant's secret (X-Signature header) so they can verify authenticity and reject forgeries; include a timestamp in the signed content and have merchants reject signatures older than a short tolerance window to stop replay. Deliver only over HTTPS, with a tight timeout so a hanging merchant can't tie up a worker.
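Signing and verification can be sketched with the standard library. The `timestamp.body` signing format, HMAC-SHA256, and the 300-second tolerance are assumptions made for this example; only the X-Signature header name comes from the text above.

```python
import hashlib
import hmac


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Hex HMAC-SHA256 over 'timestamp.body', sent in the X-Signature header."""
    signed = str(timestamp).encode("ascii") + b"." + body
    return hmac.new(secret, signed, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, body: bytes, signature: str,
           now: int, tolerance: int = 300) -> bool:
    """Merchant-side check: reject stale timestamps, then compare in constant time."""
    if abs(now - timestamp) > tolerance:
        return False  # too old or too far in the future: treat as a replay
    expected = sign(secret, timestamp, body)
    return hmac.compare_digest(expected.encode("ascii"), signature.encode("utf-8"))
```

Because the timestamp is covered by the signature, an attacker cannot replay an old payload with a fresh timestamp, and `hmac.compare_digest` avoids leaking how many leading characters matched.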

Think it through like the interview

Problem (HLD, reliable delivery; ~45 minutes): Deliver platform events to merchants' HTTP endpoints with guaranteed delivery, ordering, retries, and per-merchant isolation.

1. Pin the guarantees: “Before any boxes: which delivery and ordering guarantees are you promising?”
2. Make it durable before async: “A producer's DB commit succeeds but the process dies before publishing. What did you lose?”
3. Decouple delivery: “Why put Kafka between the event and the HTTP call at all?”
4. Fail without losing or flooding: “A merchant is down for an hour. What happens to their events, and to your workers?”
5. Prove it works: “How do you know delivery is healthy at 2am?”

Design drills

You've seen the architecture — now defend the calls an interviewer will push on.

Whiteboard each one out loud for 5–10 minutes before you reveal what a strong answer covers; the gap between your sketch and the checklist is your study list.

Warm-up: Pin the delivery and ordering guarantees you'll promise a merchant, and the contract that makes them livable.
Core: The producer commits its DB change but crashes before publishing the event. Show how you lose nothing.
Core: A merchant is down for an hour. Walk through what happens to their events and to your workers.
Stretch: Choose the storage for (a) the outbox/source of truth and (b) the per-attempt delivery log, and justify.
Stretch: A merchant strictly requires events in order, but retries can reorder them. Resolve it.
Stretch: It's 2am and deliveries are failing. What metrics and alerts tell you what's wrong?