Design a Payment Retry System

The retry round payments teams love: a retry state machine, idempotency across attempts, exponential backoff with jitter, non-retryable vs transient failures, and a dead-letter queue — with clean OOP boundaries.

Tags: LLD, payments, idempotency, retries, state-machine

Why this is really a correctness problem

A payment call crosses a network that lies: it can succeed while the response is lost, fail cleanly, or hang ambiguously. Naive retries then either double-charge the customer or lose the payment. So a "retry system" is not a for loop with sleep — it's a small state machine wrapped around two guarantees: every attempt is idempotent, and retries back off without stampeding. Get those two right and the rest is bookkeeping.

Clarify scope first: "One payment, many attempts against a flaky gateway. I'll design the per-payment retry lifecycle, idempotency, the backoff policy, and where exhausted payments go (DLQ). Out of scope: the ledger and reconciliation, though I'll note where they hook in — OK?"

The lifecycle is a state machine

Model the attempt explicitly so illegal moves are unrepresentable: you can't "retry" a Done payment, and a non-retryable rejection skips straight to the dead-letter state. Walk a transient failure that recovers, then try an illegal event and watch the machine refuse it:

[Interactive diagram: Payment retry lifecycle. States: Pending, Sending, Backoff, Done, Dead. Events: send, ok, fail, retry, exhaust, reject. Time O(1) per event, space O(states). Sample event sequence: send, fail, retry, fail, retry, ok.]

Start in Pending. Each event is handled by the current state: the State pattern moves this branching out of one giant switch and into the state objects themselves.
  • Pending → Sending (send): claim the work, attach the idempotency key.
  • Sending → Done (ok): the gateway confirmed success.
  • Sending → Backoff (fail): a transient error (timeout, 5xx, network) — schedule a retry after a delay.
  • Backoff → Sending (retry): the delay elapsed; try again with the same idempotency key.
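  • Sending → Dead (reject): a permanent error, such as "card declined" or most other 4xx responses. Retrying can never help, so don't. (A few 4xx codes, such as 408 and 429, are transient and belong with fail instead.)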
  • Backoff → Dead (exhaust): max attempts reached → dead-letter for a human or a separate recovery flow.
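
The transitions listed above can be sketched as a lookup table. This is a minimal illustration, not the State-pattern class hierarchy the text describes: it swaps per-state objects for a dictionary to stay short, and the names `State`, `TRANSITIONS`, `IllegalTransition`, `step`, and `run` are assumptions made for this example.

```python
from enum import Enum


class State(Enum):
    PENDING = "Pending"
    SENDING = "Sending"
    BACKOFF = "Backoff"
    DONE = "Done"
    DEAD = "Dead"


# Only legal moves appear here; any (state, event) pair that is missing is illegal.
TRANSITIONS = {
    (State.PENDING, "send"): State.SENDING,
    (State.SENDING, "ok"): State.DONE,
    (State.SENDING, "fail"): State.BACKOFF,
    (State.SENDING, "reject"): State.DEAD,
    (State.BACKOFF, "retry"): State.SENDING,
    (State.BACKOFF, "exhaust"): State.DEAD,
}


class IllegalTransition(Exception):
    """Raised when an event is not allowed in the current state."""


def step(state, event):
    """Return the next state, or refuse the event if the move is illegal."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise IllegalTransition(f"{event!r} is not allowed in {state.value}") from None


def run(events, state=State.PENDING):
    """Feed a sequence of events through the machine and return the final state."""
    for event in events:
        state = step(state, event)
    return state


# A transient failure that recovers on the third send.
final = run(["send", "fail", "retry", "fail", "retry", "ok"])
```

Because Done and Dead have no outgoing entries in the table, calling `step(State.DONE, "retry")` raises `IllegalTransition` rather than silently re-sending a completed payment.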

The two hard parts

1. Idempotency across retries

Because delivery is at-least-once (you retry on ambiguity), the effect must be at-most-once. Every payment carries a unique idempotency key, generated once and reused on every retry. The gateway's idempotent endpoint either applies the charge once or returns the original result, so a retry after an ambiguous timeout is safe (the Stripe-style idempotency-key pattern, the same one behind ATM and REST idempotency). Locally you journal the key before sending, so recovery after a mid-flight crash knows an attempt may already be in flight and resends with the same key instead of minting a new one.
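
A rough sketch of the idea, assuming an in-memory stand-in for both sides. `FakeGateway` and `IdempotencyJournal` are hypothetical names invented for this example, not a real gateway SDK, and a production journal would be durable storage rather than a dict.

```python
import uuid


class FakeGateway:
    """Hypothetical idempotent charge endpoint: one effect per key."""

    def __init__(self):
        self._results = {}
        self.charges_applied = 0

    def charge(self, idempotency_key, amount_cents):
        # A key seen before returns the original result without charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_applied += 1
        result = {"status": "succeeded", "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


class IdempotencyJournal:
    """Records the key for a payment before the first send (in memory here)."""

    def __init__(self):
        self._keys = {}

    def key_for(self, payment_id):
        # Generate once, then reuse the same key on every retry.
        if payment_id not in self._keys:
            self._keys[payment_id] = str(uuid.uuid4())
        return self._keys[payment_id]


journal = IdempotencyJournal()
gateway = FakeGateway()

# First attempt: imagine the charge lands but the response is lost.
first = gateway.charge(journal.key_for("pay_1"), 1999)

# Retry after the ambiguous timeout: same key, so no second charge.
retry = gateway.charge(journal.key_for("pay_1"), 1999)
```

The retry returns the original result, and `gateway.charges_applied` stays at 1 no matter how many times the same key is replayed.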

2. Exponential backoff with jitter

Retrying immediately hammers an already-struggling gateway. Back off exponentially — base * 2^attempt — capped at a ceiling. Then add jitter (randomness): without it, a thousand payments that failed in the same outage all retry at the exact same instant and re-create the spike — a thundering herd. Full jitter: sleep = random(0, min(cap, base * 2^attempt)).
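
The full-jitter formula above, written out as a small sketch. The default `base` and `cap` values are placeholders chosen for illustration, and the `rng` parameter is an assumption added so the randomness can be seeded in tests.

```python
import random


def backoff_ceiling(attempt, base=0.5, cap=30.0):
    """Upper bound for the delay: exponential growth, clamped at the cap."""
    return min(cap, base * 2 ** attempt)


def full_jitter_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full jitter: pick uniformly between 0 and the capped exponential ceiling."""
    return rng.uniform(0, backoff_ceiling(attempt, base, cap))


# Ceilings for attempts 0..7 with base=0.5s and cap=30s.
ceilings = [backoff_ceiling(a) for a in range(8)]
```

With these placeholder values the ceiling doubles from 0.5 s up to 16 s and then stays pinned at the 30 s cap, while each actual sleep is drawn uniformly below that ceiling so payments that failed together spread out across the window.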

Make the policy a Strategy, not an if-ladder

Backoff schedule, max attempts, and which errors are retryable all vary by gateway and by payment type. Put them behind a RetryPolicy interface so a new gateway is a new policy class, not edits to the retry engine (Open/Closed).
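
One way the Strategy could look, as a sketch under assumptions: the method names on `RetryPolicy`, the two concrete policies, the error-code strings, and the `decide` helper are all invented for this example, and a real gateway defines its own error taxonomy.

```python
import random
from abc import ABC, abstractmethod


class RetryPolicy(ABC):
    """Strategy interface: everything that varies per gateway or payment type."""

    @abstractmethod
    def is_retryable(self, error_code: str) -> bool: ...

    @abstractmethod
    def max_attempts(self) -> int: ...

    @abstractmethod
    def delay_seconds(self, attempt: int) -> float: ...


class CardGatewayPolicy(RetryPolicy):
    # Hypothetical transient error codes for a card gateway.
    TRANSIENT = {"timeout", "http_5xx", "network"}

    def is_retryable(self, error_code):
        return error_code in self.TRANSIENT

    def max_attempts(self):
        return 5

    def delay_seconds(self, attempt):
        # Full jitter over an exponential ceiling capped at 60 seconds.
        return random.uniform(0, min(60.0, 1.0 * 2 ** attempt))


class BankTransferPolicy(RetryPolicy):
    # Hypothetical slower rail: fewer attempts, fixed five-minute delay.
    def is_retryable(self, error_code):
        return error_code in {"timeout", "bank_unavailable"}

    def max_attempts(self):
        return 3

    def delay_seconds(self, attempt):
        return 300.0


def decide(policy, error_code, attempt):
    """Engine-side decision; attempt is the zero-based index of the failed try."""
    if not policy.is_retryable(error_code):
        return ("dead", None)      # permanent error: never retry
    if attempt + 1 >= policy.max_attempts():
        return ("dead", None)      # attempt budget exhausted
    return ("backoff", policy.delay_seconds(attempt))
```

`decide` never inspects which policy it was given, so supporting a new gateway means adding another `RetryPolicy` subclass and leaving the engine code alone.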

Entities & responsibilities

  • RetryEngine is the orchestrator — it owns no policy detail; it asks the injected RetryPolicy and drives the state machine.
  • RetryPolicy (Strategy) decides delay, retryability and the attempt cap.
  • PaymentGateway (port) is the boundary to the outside world; a StripeGateway/MockGateway adapter implements it (and tests use the mock).
  • IdempotencyStore makes retries safe; Scheduler defers a Backoff attempt without blocking a thread; DeadLetterQueue catches the exhausted.
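
To show how these collaborators might be wired together, here is a compact sketch using constructor injection. Everything beyond the entity names in the list is an assumption: the method signatures, the `TransientError`/`PermanentError` exceptions, `FixedPolicy`, and `ImmediateScheduler` (a test double that runs the callback at once instead of deferring it) are hypothetical, and the dead-letter queue is a plain list.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


class TransientError(Exception):
    """Timeout, 5xx, network: worth retrying."""


class PermanentError(Exception):
    """Hard rejection: retrying cannot help."""


class PaymentGateway(Protocol):
    def charge(self, idempotency_key: str, amount_cents: int) -> str: ...


class Scheduler(Protocol):
    def call_later(self, delay: float, callback: Callable[[], None]) -> None: ...


@dataclass
class FixedPolicy:
    max_attempts: int = 3

    def delay_seconds(self, attempt: int) -> float:
        return 1.0


@dataclass
class MockGateway:
    failures_before_success: int
    calls: list = field(default_factory=list)

    def charge(self, idempotency_key, amount_cents):
        self.calls.append(idempotency_key)
        if len(self.calls) <= self.failures_before_success:
            raise TransientError("timeout")
        return "succeeded"


@dataclass
class ImmediateScheduler:
    delays: list = field(default_factory=list)

    def call_later(self, delay, callback):
        self.delays.append(delay)  # record the requested delay
        callback()                 # test double: run now instead of deferring


@dataclass
class RetryEngine:
    gateway: PaymentGateway
    policy: FixedPolicy
    scheduler: Scheduler
    dead_letters: list = field(default_factory=list)
    status: dict = field(default_factory=dict)

    def submit(self, payment_id, key, amount_cents, attempt=0):
        self.status[payment_id] = "Sending"
        try:
            self.gateway.charge(key, amount_cents)
            self.status[payment_id] = "Done"
        except PermanentError as exc:
            self._dead_letter(payment_id, str(exc))
        except TransientError as exc:
            if attempt + 1 >= self.policy.max_attempts:
                self._dead_letter(payment_id, f"exhausted: {exc}")
                return
            self.status[payment_id] = "Backoff"
            # Same key on every retry; the scheduler decides when it runs.
            self.scheduler.call_later(
                self.policy.delay_seconds(attempt),
                lambda: self.submit(payment_id, key, amount_cents, attempt + 1),
            )

    def _dead_letter(self, payment_id, reason):
        self.status[payment_id] = "Dead"
        self.dead_letters.append((payment_id, reason))


engine = RetryEngine(MockGateway(failures_before_success=2), FixedPolicy(), ImmediateScheduler())
engine.submit("pay_1", "key-abc", 1999)
```

In this run the mock fails twice and succeeds on the third call, so `engine.status["pay_1"]` ends as "Done" and the gateway saw the same key three times. Swapping `MockGateway` for a real adapter, or `ImmediateScheduler` for a timer-backed one, would not touch `RetryEngine`.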

Think it through like the interview

Problem: A payment must be charged against a flaky third-party gateway. Calls can time out ambiguously. Design the retry lifecycle, idempotency, backoff, and where failed payments end up. ~40 minutes.

  1. Find the real problem: Why isn't this just retry in a loop until it works?
  2. Model the lifecycle: What are the states, and which transitions are illegal?
  3. Idempotency: A charge times out. You don't know if it landed. What do you send?
  4. Backoff that doesn't stampede: An outage fails 10,000 payments at 12:00:00. They all retry. When?
  5. Give up well: Retries are exhausted. Now what? Drop it? Retry forever?

Practice — level up

A payment retry engine is idempotent at-least-once delivery wrapped in a state machine. These drills train the pieces.

Practice ladder: Idempotency, state & retries

Climb in order; every rung assumes the one above it.

Warm-up — explicit, legal-only states

  1. A finite machine where actions are legal only in the right state — the retry lifecycle's backbone.

Core — idempotency & budgets

Dedupe an effect; expire a budget of attempts.

  1. Allow an action once per window: the dedupe behind an idempotency key.

  2. Issue / renew / expire with a TTL: the attempt budget and backoff clock.

Stretch — guarded transactions

  1. Money operations that validate before mutating — the invariant a retry must never violate.