Design a Webhook Delivery System

Deliver events to merchants' URLs reliably: at-least-once delivery, per-merchant ordering, an append-only event store, retries with a dead-letter queue, Kafka for async fan-out, per-merchant rate limiting, signing, and the observability that proves it works.

Tags: HLD, webhooks, kafka, delivery-guarantees, dead-letter-queue

What we're building

When something happens in our platform (a payment succeeds, an order ships), we must call each subscribed merchant's HTTP endpoint with that event — and keep calling until they acknowledge it, without losing events, melting a slow merchant, or letting one noisy merchant starve the rest.

Scope it out loud:

  • Functional: register endpoints + event subscriptions; deliver each event; retry failures; expose delivery status; let merchants replay.
  • Non-functional: at-least-once delivery, durability (never lose an event), per-merchant ordering (best-effort), isolation between merchants, and security (signed, verifiable payloads).
  • Scale (state an estimate): say 50k events/sec at peak, fan-out 1–3×, p99 delivery < a few seconds when merchants are healthy.
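The scale estimate can be turned into rough capacity numbers. The sketch below multiplies the peak event rate by the fan-out range and applies Little's law to estimate concurrent in-flight requests. The 200 ms merchant response time is an assumption made up for illustration, not a figure from the estimate, and retries are ignored.

```python
def peak_load(events_per_s=50_000, fanout=(1, 3), assumed_latency_ms=200):
    """Return peak delivery attempts/sec and in-flight requests.

    events_per_s and fanout come from the stated estimate;
    assumed_latency_ms is a hypothetical merchant response time.
    """
    low, high = (events_per_s * f for f in fanout)
    # Little's law: in-flight = arrival rate * time each request stays open.
    in_flight = high * assumed_latency_ms // 1000
    return {"attempts_low": low, "attempts_high": high, "in_flight_high": in_flight}


load = peak_load()
print(load)
```

Under these assumptions the delivery tier has to sustain up to 150k first attempts per second and hold about 30k requests open at once, before counting retries.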

Architecture

The spine: durably record the event first, then deliver asynchronously. The producer's job ends at the durable write; delivery is a separate, retriable concern. That decoupling is what makes "guaranteed" delivery possible.

Guaranteed delivery & the outbox

The trap is the dual write: if a producer writes its DB row and publishes to Kafka as two steps, a crash between them either loses the event or emits one for a rolled-back change. Fix with the transactional outbox — write the event to an outbox table in the same DB transaction as the business change, then a relay (CDC/poller) publishes it to Kafka. The event is durable the instant the business fact is, and the relay is free to retry publishing.
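A minimal sketch of the outbox pattern described above, using SQLite from the standard library as a stand-in for the relational store. The table names, column names, and the `publish` callable are hypothetical; a real relay would publish to Kafka and might use CDC rather than polling.

```python
import json
import sqlite3
import uuid

# Hypothetical schema: names are illustrative only.
SCHEMA = """
CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE outbox (
    seq         INTEGER PRIMARY KEY AUTOINCREMENT,
    event_id    TEXT NOT NULL UNIQUE,
    merchant_id TEXT NOT NULL,
    payload     TEXT NOT NULL,
    published   INTEGER NOT NULL DEFAULT 0
);
"""


def ship_order(conn, order_id, merchant_id):
    """Write the business change and its event in ONE transaction."""
    event_id = str(uuid.uuid4())
    payload = json.dumps({"type": "order.shipped", "order_id": order_id})
    with conn:  # commits on success, rolls back if either statement raises
        conn.execute("UPDATE orders SET status = 'shipped' WHERE id = ?", (order_id,))
        conn.execute(
            "INSERT INTO outbox (event_id, merchant_id, payload) VALUES (?, ?, ?)",
            (event_id, merchant_id, payload),
        )
    return event_id


def relay_once(conn, publish, batch_size=100):
    """Publish unpublished rows in insertion order.

    A row is marked published only after publish() returns, so a failure
    leaves it in place for the next pass (at-least-once publishing).
    """
    rows = conn.execute(
        "SELECT seq, event_id, merchant_id, payload FROM outbox "
        "WHERE published = 0 ORDER BY seq LIMIT ?",
        (batch_size,),
    ).fetchall()
    sent = 0
    for seq, event_id, merchant_id, payload in rows:
        publish(merchant_id, event_id, payload)  # may raise; row stays unpublished
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
        sent += 1
    return sent
```

If the relay crashes after `publish` returns but before the row is marked, the event is published again on the next pass. That is acceptable here because downstream consumers already dedupe on `event_id`.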

From Kafka onward we get at-least-once: a worker only commits its offset after the merchant acknowledges (2xx). Crash before the ack → the event is redelivered. Duplicates are therefore expected, which is why every payload carries a unique event_id and we tell merchants to dedupe on it — the same idempotency contract as a payment retry, now across a service boundary.
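The commit-after-ack rule and the merchant-side dedupe can be sketched together. This is an in-memory simulation, not Kafka client code: the list stands in for a partition, the integer for a committed offset, and `send` is a hypothetical callable that returns an HTTP status code.

```python
def deliver_partition(events, committed_offset, send):
    """Deliver from committed_offset onward; advance only after a 2xx.

    Returns the new committed offset. Anything at or after the returned
    offset will be delivered again on the next run.
    """
    offset = committed_offset
    while offset < len(events):
        status = send(events[offset])
        if not (200 <= status < 300):
            break      # no commit: this event is redelivered later
        offset += 1    # "commit" only after the merchant acknowledged
    return offset


class MerchantReceiver:
    """Merchant-side handler that dedupes on event_id (in-memory set for illustration)."""

    def __init__(self):
        self.seen = set()
        self.applied = []

    def handle(self, event):
        if event["event_id"] in self.seen:
            return 200  # duplicate: acknowledge again, do not re-apply
        self.seen.add(event["event_id"])
        self.applied.append(event["event_id"])
        return 200


events = [{"event_id": "e1"}, {"event_id": "e2"}]
receiver = MerchantReceiver()
# Simulate a worker crash after the merchant processed e1 but before the commit.
receiver.handle(events[0])
offset = deliver_partition(events, 0, receiver.handle)
```

After the simulated crash, `e1` is sent twice but applied once, which is the behaviour the idempotency contract is meant to guarantee.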

Ordering

Strict global ordering doesn't scale and merchants rarely need it. Per-merchant (or per-resource) ordering is the useful guarantee: partition the Kafka topic by merchant_id (or merchant_id + resource_id) so all of one merchant's events land on one partition and one consumer processes them in order.
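To illustrate why keying by merchant_id preserves per-merchant order, here is a sketch of a stable key-to-partition mapping. SHA-256 is a stand-in chosen for this example; real Kafka clients apply their own hash to the record key, so in practice you would set the record key and let the client choose the partition.

```python
import hashlib
from typing import Optional


def partition_for(merchant_id: str, num_partitions: int,
                  resource_id: Optional[str] = None) -> int:
    """Map a merchant (or merchant + resource) key to a stable partition."""
    key = merchant_id if resource_id is None else f"{merchant_id}:{resource_id}"
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # The same key always yields the same partition, so one consumer
    # sees all of that key's events in publish order.
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Keying on merchant_id + resource_id spreads a large merchant across partitions at the cost of ordering only within each resource.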

Ordering and retries fight each other

If event #1 fails and you retry it while #2 succeeds, you've reordered. Options: block the partition until #1 drains (strong order, head-of-line blocking) or allow out-of-order with sequence numbers and let merchants reorder (more throughput, weaker guarantee). Say the trade-off and pick per requirements — most webhook products choose best-effort order + an event_id/timestamp.
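For the out-of-order option, a merchant can rebuild order from sequence numbers. The sketch below assumes each merchant's events carry a contiguous, 1-based sequence number; that field is an assumption for illustration, and a real payload might only offer a timestamp.

```python
class ReorderBuffer:
    """Hold early arrivals until the gap before them is filled."""

    def __init__(self, next_seq=1):
        self.next_seq = next_seq
        self.pending = {}

    def accept(self, seq, event):
        """Return the events that can now be applied, in sequence order."""
        if seq < self.next_seq or seq in self.pending:
            return []  # duplicate delivery: nothing new to release
        self.pending[seq] = event
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


buf = ReorderBuffer()
first = buf.accept(2, "order.shipped")   # arrives early, held back
second = buf.accept(1, "order.paid")     # fills the gap, releases both
```

The cost moves to the merchant: they need memory for held events and a policy for a gap that never fills, which is why this is the weaker guarantee.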

SQL vs NoSQL for the event store

  • The outbox / source of truth → relational. It rides the business transaction (you need ACID with the order/payment write), and volume is bounded by your own writes. Postgres.
  • The delivery log / status (every attempt, for every event, for every merchant) → high-write, time-series, queried by (merchant, time). This is the part that explodes: a wide-column store (Cassandra) or a partitioned/sharded table with TTL fits the append-heavy, range-scan pattern. "SQL for the truth, scale-out store for the firehose of attempts" is the mature answer.

Retries & the dead-letter queue

A merchant endpoint is flaky or down. On each failed attempt, schedule the retry with exponential backoff plus jitter (re-publish to a delay topic, don't busy-loop) and enforce a max attempt cap. Once the cap is hit, the event goes to the dead-letter queue: it is parked with its failure history, never silently dropped, and a merchant can replay from it once they're healthy. A per-endpoint circuit breaker stops you wasting workers hammering a merchant who's been 500-ing for an hour: trip it, shed that merchant's deliveries to the retry path, and probe periodically.
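A sketch of the retry decision and the breaker under stated assumptions: full jitter over an exponentially growing ceiling, a cap of 8 attempts, and a breaker that opens after 5 consecutive failures and allows a probe every 60 seconds. All of those numbers and names are hypothetical defaults, not values from the design.

```python
import random


def backoff_delay(attempt, base=1.0, cap=3600.0, rng=random):
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def next_step(attempt, status, max_attempts=8):
    """Decide what follows a delivery attempt (attempt is 1-based)."""
    if 200 <= status < 300:
        return ("ack", None)
    if attempt >= max_attempts:
        return ("dead_letter", None)  # park with history, never drop
    return ("retry", backoff_delay(attempt))


class CircuitBreaker:
    """Per-endpoint breaker; the caller supplies the clock via `now`."""

    def __init__(self, failure_threshold=5, probe_interval=60.0):
        self.failure_threshold = failure_threshold
        self.probe_interval = probe_interval
        self.failures = 0
        self.opened_at = None

    def allow(self, now):
        if self.opened_at is None:
            return True
        # Open: let one probe through once the interval has passed.
        return now - self.opened_at >= self.probe_interval

    def record(self, success, now):
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now  # (re)open and restart the probe timer
```

In a Kafka-based setup the returned delay would pick which delay topic the event is re-published to, rather than making a worker sleep.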

Rate limiting per merchant

A small merchant's server must not be flooded, and a backlog for one merchant must not consume all workers. Give each merchant a token bucket (R req/s, burst B); a delivery only proceeds if it takes a token, else it waits in that merchant's lane. This is exactly the rate-limiting primitive, keyed per merchant; the behaviour to expect is burst, then throttle:

Token bucket: burst, then throttle. Time O(1) per request, space O(1).

Example with B = 5 and R = 2/s: the bucket starts full with 5 tokens. R = 2/s bounds the long-run rate; B = 5 is the burst a quiet period earns you.
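A minimal token bucket sketch using the same R = 2/s and B = 5 as the example. The class name and the explicit `now` argument are choices made here so the behaviour is deterministic; a production limiter would read a monotonic clock and keep one bucket per merchant, for instance in a dict keyed by merchant_id.

```python
class TokenBucket:
    """O(1) time and space per request: refill lazily on each call."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate            # R: tokens added per second
        self.burst = burst          # B: bucket capacity
        self.tokens = float(burst)  # bucket starts full
        self.last = now

    def try_take(self, now):
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller leaves the delivery waiting in the merchant's lane


bucket = TokenBucket(rate=2, burst=5)
burst_results = [bucket.try_take(0.0) for _ in range(6)]
# The first five succeed (the burst); the sixth is throttled.
```

After the burst is spent, one delivery gets through every half second, which is the long-run rate R.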

Observability & alerting

You can't run delivery you can't see. Track, per merchant and global:

  • Delivery success rate, attempt counts, and p50/p99 delivery latency.
  • Queue lag (events waiting) and DLQ depth (the single best health signal — a rising DLQ means something is broken).
  • Per-merchant dashboards + alerts: page on DLQ growth, a collapsed success rate, or a tripped breaker. Emit structured logs keyed by event_id so one delivery is traceable end to end.
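The signals above can be sketched as one structured log line per attempt plus a paging check. The field names and the 0.95 success-rate threshold are hypothetical; real thresholds depend on your traffic and would normally live in the alerting system rather than in application code.

```python
import json


def delivery_log_line(event_id, merchant_id, attempt, status, latency_ms):
    """One JSON line per attempt, keyed by event_id for end-to-end tracing."""
    return json.dumps(
        {
            "event_id": event_id,
            "merchant_id": merchant_id,
            "attempt": attempt,
            "status": status,
            "latency_ms": latency_ms,
        },
        sort_keys=True,
    )


def page_reasons(dlq_prev, dlq_now, success_rate, breaker_open,
                 min_success_rate=0.95):
    """Return the reasons to page for one merchant; empty means healthy."""
    reasons = []
    if dlq_now > dlq_prev:
        reasons.append("dlq_growth")
    if success_rate < min_success_rate:
        reasons.append("low_success_rate")
    if breaker_open:
        reasons.append("breaker_open")
    return reasons
```

Searching the logs for a single event_id then shows every attempt, status, and latency for that delivery in order.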

Security

Sign each payload with an HMAC over the body using the merchant's secret (X-Signature header) so they can verify authenticity and reject forgeries; include a timestamp in the signed content so a captured request can't be replayed later. Deliver only over HTTPS, with a tight timeout so a hanging merchant can't tie up a worker.
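A sketch of signing and verification with Python's standard hmac module. The message layout (timestamp, a dot, then the body), HMAC-SHA256, and the 300-second tolerance are assumptions for illustration; the text only fixes that the signature travels in X-Signature and that a timestamp is included.

```python
import hashlib
import hmac
import time


def sign(secret: bytes, body: bytes, timestamp: int) -> str:
    """Hex HMAC-SHA256 over "<timestamp>.<body>" (layout is an assumption)."""
    message = str(timestamp).encode("ascii") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, timestamp: int, signature: str,
           now=None, tolerance_s=300) -> bool:
    """Merchant-side check: reject stale timestamps, then compare signatures."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > tolerance_s:
        return False  # too old or too far in the future: possible replay
    expected = sign(secret, body, timestamp)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)


secret = b"merchant-secret"
body = b'{"event_id": "e1", "type": "payment.succeeded"}'
signature = sign(secret, body, 1_700_000_000)
```

The merchant must verify against the raw request bytes, not a re-serialized copy, since any change in whitespace or key order changes the HMAC.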

Think it through like the interview

Problem: Deliver platform events to merchants' HTTP endpoints with guaranteed delivery, ordering, retries, and per-merchant isolation. ~45 minutes.

  1. Pin the guarantees. Before any boxes: which delivery and ordering guarantees are you promising?
  2. Make it durable before async. A producer's DB commit succeeds but the process dies before publishing. What did you lose?
  3. Decouple delivery. Why put Kafka between the event and the HTTP call at all?
  4. Fail without losing or flooding. A merchant is down for an hour. What happens to their events, and to your workers?
  5. Prove it works. How do you know delivery is healthy at 2am?

Design drills

You've seen the architecture — now defend the calls an interviewer will push on.

Whiteboard each one out loud for 5–10 minutes before you check what a strong answer covers; the gap between your sketch and the checklist is your study list.

  • Warm-up: Pin the delivery and ordering guarantees you'll promise a merchant, and the contract that makes them livable.
  • Core: The producer commits its DB change but crashes before publishing the event. Show how you lose nothing.
  • Core: A merchant is down for an hour. Walk through what happens to their events and to your workers.
  • Stretch: Choose the storage for (a) the outbox/source of truth and (b) the per-attempt delivery log, and justify.
  • Stretch: A merchant strictly requires events in order, but retries can reorder them. Resolve it.
  • Stretch: It's 2am and deliveries are failing. What metrics and alerts tell you what's wrong?