Why this is really a correctness problem
A payment call crosses a network that lies: it can succeed while the response
is lost, fail cleanly, or hang ambiguously. Naive retries then either
double-charge the customer or lose the payment. So a "retry system" is
not a for loop with sleep — it's a small state machine wrapped around two
guarantees: every attempt is idempotent, and retries back off without
stampeding. Get those two right and the rest is bookkeeping.
Clarify scope first: "One payment, many attempts against a flaky gateway. I'll design the per-payment retry lifecycle, idempotency, the backoff policy, and where exhausted payments go (DLQ). Out of scope: the ledger and reconciliation, though I'll note where they hook in — OK?"
The lifecycle is a state machine
Model the attempt explicitly so illegal moves are unrepresentable: you can't
"retry" a Done payment, and a non-retryable rejection skips straight to the
dead-letter state. Walk a transient failure that recovers, then try an illegal
event and watch the machine refuse it:
Start in Pending. Each event is handled by the current state: the State pattern moves this branching out of one giant switch and into the state objects themselves.
- Pending → Sending (send): claim the work, attach the idempotency key.
- Sending → Done (ok): the gateway confirmed success.
- Sending → Backoff (fail): a transient error (timeout, 5xx, network); schedule a retry after a delay.
- Backoff → Sending (retry): the delay elapsed; try again with the same idempotency key.
- Sending → Dead (reject): a permanent error (4xx like "card declined"); retrying can never help, so don't.
- Backoff → Dead (exhaust): max attempts reached → dead-letter for a human or a separate recovery flow.
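The six transitions above can be sketched as a lookup table. This is a minimal illustration, not the State-pattern version the text describes: it swaps per-state objects for a single transition table, which is shorter to read. The names `PaymentAttempt`, `IllegalTransition` and `handle` are hypothetical.

```python
from enum import Enum


class State(Enum):
    PENDING = "Pending"
    SENDING = "Sending"
    BACKOFF = "Backoff"
    DONE = "Done"
    DEAD = "Dead"


# (current state, event) -> next state. Any pair not listed here is illegal.
TRANSITIONS = {
    (State.PENDING, "send"): State.SENDING,
    (State.SENDING, "ok"): State.DONE,
    (State.SENDING, "fail"): State.BACKOFF,
    (State.BACKOFF, "retry"): State.SENDING,
    (State.SENDING, "reject"): State.DEAD,
    (State.BACKOFF, "exhaust"): State.DEAD,
}


class IllegalTransition(Exception):
    """Raised when an event is not allowed in the current state."""


class PaymentAttempt:
    def __init__(self):
        self.state = State.PENDING

    def handle(self, event: str) -> State:
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise IllegalTransition(f"{event!r} is not legal in {self.state.value}")
        self.state = TRANSITIONS[key]
        return self.state


# A transient failure that recovers: Pending, Sending, Backoff, Sending, Done.
attempt = PaymentAttempt()
for event in ("send", "fail", "retry", "ok"):
    attempt.handle(event)
```

After this walk the attempt is in Done, and calling `attempt.handle("retry")` raises `IllegalTransition` because Done has no outgoing transitions in the table.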
The two hard parts
1. Idempotency across retries
Because delivery is at-least-once (you retry on ambiguity), the effect must be at-most-once. Every payment carries a unique idempotency key, generated once and reused on every retry. The gateway's idempotent endpoint either applies the charge once or returns the original result, so a retry after an ambiguous timeout is safe. This is the idempotency-key pattern Stripe's API uses, and the same idea behind the ATM and REST idempotency designs. Locally, journal the key before sending, so that recovery after a mid-flight crash knows an attempt may already be in flight and resends with the same key.
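A small sketch of why reusing the key makes an ambiguous timeout safe. Everything here is an assumption for illustration: `FakeGateway` stands in for a real idempotent endpoint, the `journal` dict stands in for durable storage, and the retry loop omits backoff to keep the focus on the key.

```python
import uuid


class GatewayTimeout(Exception):
    """The call timed out; the charge may or may not have been applied."""


class FakeGateway:
    """Hypothetical gateway that dedupes on the idempotency key."""

    def __init__(self):
        self.results = {}            # idempotency key -> original result
        self.charges_applied = 0
        self.drop_next_response = False

    def charge(self, key: str, amount_cents: int) -> dict:
        if key not in self.results:
            # First time this key is seen: apply the charge exactly once.
            self.charges_applied += 1
            self.results[key] = {"status": "succeeded", "amount": amount_cents}
        if self.drop_next_response:
            # Simulate "succeeded, but the response was lost".
            self.drop_next_response = False
            raise GatewayTimeout("response lost")
        return self.results[key]


journal = {}  # payment id -> idempotency key (stand-in for a durable journal)


def key_for(payment_id: str) -> str:
    # Generate the key once and record it BEFORE any send.
    if payment_id not in journal:
        journal[payment_id] = str(uuid.uuid4())
    return journal[payment_id]


def charge_once(gateway, payment_id: str, amount_cents: int, max_attempts: int = 3):
    key = key_for(payment_id)
    for _ in range(max_attempts):
        try:
            return gateway.charge(key, amount_cents)
        except GatewayTimeout:
            continue  # ambiguous outcome: retry with the SAME key
    raise GatewayTimeout("attempts exhausted")


gateway = FakeGateway()
gateway.drop_next_response = True   # the first call lands, but its response is lost
result = charge_once(gateway, "pay_123", 5000)
```

The first call applies the charge and then times out; the retry carries the same key, so the fake gateway returns the original result and `charges_applied` stays at 1.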
2. Exponential backoff with jitter
Retrying immediately hammers an already-struggling gateway. Back off
exponentially — base * 2^attempt — capped at a ceiling. Then add jitter
(randomness): without it, a thousand payments that failed in the same outage all
retry at the exact same instant and re-create the spike — a
thundering herd. Full jitter: sleep = random(0, min(cap, base * 2^attempt)).
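The full-jitter formula translates almost directly into code. This is a sketch under assumed defaults: the `base` of 0.5 seconds and `cap` of 30 seconds are illustrative values, not recommendations from the text, and `full_jitter_delay` is a hypothetical name.

```python
import random


def full_jitter_delay(attempt: int, base: float = 0.5, cap: float = 30.0, rng=random) -> float:
    """Return a delay in seconds drawn uniformly from [0, min(cap, base * 2^attempt)]."""
    ceiling = min(cap, base * 2 ** attempt)
    return rng.uniform(0.0, ceiling)


# The un-jittered ceilings for attempts 0..7 with the assumed defaults:
ceilings = [min(30.0, 0.5 * 2 ** n) for n in range(8)]
# [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]

# Each payment draws its own delay, so retries spread across the window
# instead of all landing at the same instant.
delays = [full_jitter_delay(3) for _ in range(5)]
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in tests.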
Backoff schedule, max attempts, and which errors are retryable all vary by
gateway and by payment type. Put them behind a RetryPolicy interface so a new
gateway is a new policy class, not edits to the retry engine (Open/Closed).
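One way to express that interface is a `typing.Protocol` with one class per gateway. The method names (`is_retryable`, `delay`), the two policy classes, their numbers, and the `decide` helper are all assumptions made for this sketch; the text only fixes the name RetryPolicy and what it varies.

```python
import random
from typing import Protocol


class TransientError(Exception):
    """Timeout, 5xx, or network failure."""


class PermanentError(Exception):
    """A rejection that retrying cannot fix."""


class RetryPolicy(Protocol):
    max_attempts: int

    def is_retryable(self, error: Exception) -> bool: ...
    def delay(self, attempt: int) -> float: ...


class CardGatewayPolicy:
    """Hypothetical policy: fast exponential backoff with full jitter."""
    max_attempts = 5

    def is_retryable(self, error: Exception) -> bool:
        return isinstance(error, TransientError)

    def delay(self, attempt: int) -> float:
        return random.uniform(0.0, min(30.0, 0.5 * 2 ** attempt))


class BankTransferPolicy:
    """Hypothetical policy: fewer attempts, fixed one-minute spacing."""
    max_attempts = 3

    def is_retryable(self, error: Exception) -> bool:
        return isinstance(error, TransientError)

    def delay(self, attempt: int) -> float:
        return 60.0


def decide(policy: RetryPolicy, error: Exception, attempts_made: int) -> str:
    """The engine asks the policy; it never inspects gateway-specific detail."""
    if not policy.is_retryable(error):
        return "reject"
    if attempts_made >= policy.max_attempts:
        return "exhaust"
    return "retry"
```

Supporting another gateway then means adding a class with the same three members; `decide` and the rest of the engine stay unchanged.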
Entities & responsibilities
- RetryEngine is the orchestrator. It owns no policy detail; it asks the injected RetryPolicy and drives the state machine.
- RetryPolicy (Strategy) decides delay, retryability and the attempt cap.
- PaymentGateway (port) is the boundary to the outside world; a StripeGateway / MockGateway adapter implements it (and tests use the mock).
- IdempotencyStore makes retries safe.
- Scheduler defers a Backoff attempt without blocking a thread.
- DeadLetterQueue catches the exhausted.
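A rough sketch of how these collaborators might be wired together. It is heavily simplified and every signature is an assumption: the IdempotencyStore is a plain dict, the DeadLetterQueue is a list, the key format is a placeholder, and `ImmediateScheduler` runs the callback at once where a real scheduler would defer it without blocking a thread.

```python
from typing import Callable, Protocol


class TransientError(Exception):
    """Timeout, 5xx, or network failure: worth retrying."""


class PermanentError(Exception):
    """A rejection such as a declined card: retrying cannot help."""


class PaymentGateway(Protocol):  # the port
    def charge(self, key: str, amount_cents: int) -> str: ...


class MockGateway:
    """Test adapter: fails transiently a fixed number of times, then succeeds."""

    def __init__(self, failures: int):
        self.failures = failures
        self.keys_seen = []

    def charge(self, key: str, amount_cents: int) -> str:
        self.keys_seen.append(key)
        if self.failures > 0:
            self.failures -= 1
            raise TransientError("timeout")
        return "ok"


class FixedPolicy:
    """Hypothetical policy; a real one would use exponential backoff with jitter."""
    max_attempts = 3

    def is_retryable(self, error: Exception) -> bool:
        return isinstance(error, TransientError)

    def delay(self, attempt: int) -> float:
        return 1.0


class ImmediateScheduler:
    """Records the requested delay and runs the callback immediately."""

    def __init__(self):
        self.delays = []

    def after(self, delay: float, callback: Callable[[], None]) -> None:
        self.delays.append(delay)
        callback()


class RetryEngine:
    """Orchestrator: asks the policy, drives the attempts, owns no policy detail."""

    def __init__(self, gateway, policy, keys, scheduler, dead_letters):
        self.gateway, self.policy = gateway, policy
        self.keys, self.scheduler, self.dead_letters = keys, scheduler, dead_letters
        self.done = []

    def submit(self, payment_id: str, amount_cents: int, attempt: int = 0) -> None:
        # Same key on every attempt; the key format is a placeholder.
        key = self.keys.setdefault(payment_id, f"key-{payment_id}")
        try:
            self.gateway.charge(key, amount_cents)
        except Exception as error:
            out_of_attempts = attempt + 1 >= self.policy.max_attempts
            if not self.policy.is_retryable(error) or out_of_attempts:
                self.dead_letters.append((payment_id, repr(error)))
                return
            self.scheduler.after(
                self.policy.delay(attempt),
                lambda: self.submit(payment_id, amount_cents, attempt + 1),
            )
            return
        self.done.append(payment_id)


gateway = MockGateway(failures=2)
scheduler = ImmediateScheduler()
dead = []
engine = RetryEngine(gateway, FixedPolicy(), {}, scheduler, dead)
engine.submit("pay_1", 5000)
```

With two transient failures and a cap of three attempts, the payment succeeds on the third try and every attempt carries the same key. Raising `failures` to 3 or more sends it to the dead-letter list instead.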
Think it through like the interview
Problem: A payment must be charged against a flaky third-party gateway. Calls can time out ambiguously. Design the retry lifecycle, idempotency, backoff, and where failed payments end up. ~40 minutes.
1. Find the real problem
   “Why isn't this just retry in a loop until it works?”
2. Model the lifecycle
   “What are the states, and which transitions are illegal?”
3. Idempotency
   “A charge times out. You don't know if it landed. What do you send?”
4. Backoff that doesn't stampede
   “An outage fails 10,000 payments at 12:00:00. They all retry. When?”
5. Give up well
   “Retries are exhausted. Now what? Drop it? Retry forever?”
Practice — level up
A payment retry engine is idempotent at-least-once delivery wrapped in a state machine. These drills train the pieces.
Climb in order; every rung assumes the one above it. Solve each on LeetCode.
Warm-up — explicit, legal-only states
- Design an ATM Machine (Medium)
A finite machine where actions are legal only in the right state — the retry lifecycle's backbone.
Core — idempotency & budgets
Dedupe an effect; expire a budget of attempts.
- Allow an action once per window: the dedupe behind an idempotency key.
- Issue / renew / expire with a TTL: the attempt budget and backoff clock.
Stretch — guarded transactions
- Simple Bank System (Medium)
Money operations that validate before mutating — the invariant a retry must never violate.