Design a Payment Retry System

The retry round payments teams love: a retry state machine, idempotency across attempts, exponential backoff with jitter, non-retryable vs transient failures, and a dead-letter queue — with clean OOP boundaries.

The real-world analogy: household circuit breakers and thundering stampedes

Imagine the electrical wiring system of your home:

The Circuit Breaker: You plug in three heavy space heaters and the current spikes. The metal strip inside the breaker heats up and trips open (OPEN state), cutting all electricity to prevent a house fire. You can't just flip it back instantly; you must let it cool down (HALF-OPEN state), check for remaining shorts, and finally reset it fully (CLOSED state).
Synchronized Retries (Stampeding): Imagine a city-wide power outage. At 3:00 PM the power comes back, and 10,000 refrigerators, air conditioners, and TVs start up at the same instant. This synchronized startup creates a second surge on the grid, tripping the neighborhood substations again. If each appliance delayed its startup by a random amount (jitter), the grid load would ramp up smoothly instead of collapsing.

In low-level design (LLD), we handle flaky network connections with a state-based circuit breaker that rejects calls early when the server is down, and we apply full jitter to the exponential backoff delay to prevent synchronized stampedes.

Why this is really a correctness problem

A payment call crosses a network that lies: it can succeed while the response is lost, fail cleanly, or hang ambiguously. Naive retries then either double-charge the customer or lose the payment. So a "retry system" is not a for loop with sleep — it's a small state machine wrapped around two guarantees: every attempt is idempotent, and retries back off without stampeding. Get those two right and the rest is bookkeeping.

Clarify scope first: "One payment, many attempts against a flaky gateway. I'll design the per-payment retry lifecycle, idempotency, the backoff policy, and where exhausted payments go (DLQ). Out of scope: the ledger and reconciliation, though I'll note where they hook in — OK?"

UML Class Diagram

The lifecycle is a state machine

Model the attempt explicitly so illegal moves are unrepresentable: you can't "retry" a Done payment, and a non-retryable rejection skips straight to the dead-letter state.

Payment retry lifecycle: time O(1) per event, space O(states).

Example event sequence: send, fail, retry, fail, retry, ok.

Start in Pending. Each event is handled by the current state: the State pattern moves this branching out of one giant switch and into the state objects themselves.

Pending → Sending (send): claim the work, attach the idempotency key.
Sending → Done (ok): the gateway confirmed success.
Sending → Backoff (fail): a transient error (timeout, 5xx, network) — schedule a retry after a delay.
Backoff → Sending (retry): the delay elapsed; try again with the same idempotency key.
Sending → Dead (reject): a permanent error (4xx like "card declined") — retrying can never help, so don't.
Backoff → Dead (exhaust): max attempts reached → dead-letter for a human or a separate recovery flow.
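
The transitions listed above can be sketched as a lookup table, so that any (state, event) pair not in the table is rejected instead of silently ignored. This is a table-driven variant rather than the State pattern with one class per state, chosen only to keep the sketch short. The names `Event`, `TRANSITIONS`, `IllegalTransition`, `next_state`, and `run` are illustrative assumptions and do not appear in the implementations later in this article.

```python
from enum import Enum, auto


class State(Enum):
    PENDING = auto()
    SENDING = auto()
    BACKOFF = auto()
    DONE = auto()
    DEAD = auto()


class Event(Enum):
    SEND = auto()
    OK = auto()
    FAIL = auto()
    RETRY = auto()
    REJECT = auto()
    EXHAUST = auto()


class IllegalTransition(Exception):
    """Raised when an event is not legal in the current state."""


# Only the six transitions from the list above are legal.
TRANSITIONS = {
    (State.PENDING, Event.SEND): State.SENDING,
    (State.SENDING, Event.OK): State.DONE,
    (State.SENDING, Event.FAIL): State.BACKOFF,
    (State.BACKOFF, Event.RETRY): State.SENDING,
    (State.SENDING, Event.REJECT): State.DEAD,
    (State.BACKOFF, Event.EXHAUST): State.DEAD,
}


def next_state(state: State, event: Event) -> State:
    """Return the successor state, or raise if the move is illegal."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise IllegalTransition(
            f"{event.name} is not allowed in {state.name}"
        ) from None


def run(events: list) -> State:
    """Replay a sequence of events starting from PENDING."""
    state = State.PENDING
    for event in events:
        state = next_state(state, event)
    return state
```

Because DONE and DEAD have no outgoing entries, calling `next_state(State.DONE, Event.RETRY)` raises `IllegalTransition`, which is the "you can't retry a Done payment" rule expressed as data.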

The two hard parts

1. Idempotency across retries

Because delivery is at-least-once (you retry on ambiguity), the effect must be made at-most-once. Every payment carries a unique idempotency key, generated once and reused on every retry. The gateway's idempotent endpoint either applies the charge once or returns the original result, so a retry after an ambiguous timeout is safe (the same Stripe-style pattern as in the ATM and REST idempotency designs). Locally, journal the key before sending, so that recovery after a mid-flight crash still knows an attempt may be in flight and retries with the same key.
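
A minimal sketch of that contract, under stated assumptions: `FakeGateway` is a hypothetical in-memory stand-in for an idempotent gateway endpoint, the journal is a plain list where a real system would need durable storage, and the amount is held in integer cents. None of these names come from a real gateway SDK.

```python
import uuid


class FakeGateway:
    """Hypothetical in-memory gateway that dedupes charges by idempotency key."""

    def __init__(self) -> None:
        self._results = {}          # idempotency key -> original result
        self.charges_applied = 0    # how many times money actually moved

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A repeated key returns the original result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_applied += 1
        result = f"charge-{self.charges_applied}"
        self._results[idempotency_key] = result
        return result


class Payment:
    def __init__(self, amount_cents: int) -> None:
        self.amount_cents = amount_cents
        # Generated once at creation and reused on every retry.
        self.idempotency_key = str(uuid.uuid4())


def send_with_journal(payment: Payment, gateway: FakeGateway, journal: list) -> str:
    # Record the key before sending, so a restart can see an attempt was in flight.
    journal.append(payment.idempotency_key)
    return gateway.charge(payment.idempotency_key, payment.amount_cents)


gateway = FakeGateway()
journal = []
payment = Payment(amount_cents=1999)

first = send_with_journal(payment, gateway, journal)
# Simulate an ambiguous timeout: the caller never saw `first`, so it retries.
second = send_with_journal(payment, gateway, journal)
```

Both calls return the same result and `gateway.charges_applied` stays at 1: delivery happened twice, the effect once.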

2. Exponential backoff with jitter

Retrying immediately hammers an already-struggling gateway. Back off exponentially — base * 2^attempt — capped at a ceiling. Then add jitter (randomness): without it, a thousand payments that failed in the same outage all retry at the exact same instant and re-create the spike — a thundering herd. Full jitter: sleep = random(0, min(cap, base * 2^attempt)).
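
The formula can be sketched in a few lines. The sketch assumes `attempt` counts from 0, a base of 1 second, and a cap of 60 seconds; the function names and the optional `rng` parameter (there so the randomness can be seeded) are illustrative.

```python
import random
from typing import Optional


def backoff_ceiling(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Upper bound of the sleep window: min(cap, base * 2^attempt)."""
    return min(cap, base * (2 ** attempt))


def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                      rng: Optional[random.Random] = None) -> float:
    """Pick a delay uniformly from [0, ceiling]: the full-jitter rule."""
    if rng is None:
        rng = random.Random()
    return rng.uniform(0.0, backoff_ceiling(attempt, base, cap))


# Ceilings for attempts 0..7 with base=1s and cap=60s.
ceilings = [backoff_ceiling(a) for a in range(8)]
# ceilings == [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

The ceiling doubles until it hits the cap at attempt 6 (64 would exceed 60), and each actual sleep is drawn anywhere between zero and that ceiling, which is what spreads a burst of simultaneous failures across the window.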

Make the policy a Strategy, not an if-ladder

Backoff schedule, max attempts, and which errors are retryable all vary by gateway and by payment type. Put them behind a RetryPolicy interface so a new gateway is a new policy class, not edits to the retry engine (Open/Closed).
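
One way this interface might look is sketched below. The method names (`is_retryable`, `max_attempts`, `next_delay`), the two concrete policies, and their numbers are all hypothetical; classifying errors by HTTP-style status code is an assumption made for the example.

```python
import random
from abc import ABC, abstractmethod


class RetryPolicy(ABC):
    """Strategy interface: what to retry, how many times, how long to wait."""

    @abstractmethod
    def is_retryable(self, status: int) -> bool: ...

    @abstractmethod
    def max_attempts(self) -> int: ...

    @abstractmethod
    def next_delay(self, attempt: int) -> float: ...


class CardGatewayPolicy(RetryPolicy):
    """Hypothetical policy: retry 5xx only, 3 attempts, 1s base, 60s cap."""

    def is_retryable(self, status: int) -> bool:
        return 500 <= status < 600

    def max_attempts(self) -> int:
        return 3

    def next_delay(self, attempt: int) -> float:
        return random.uniform(0.0, min(60.0, 1.0 * 2 ** attempt))


class BankTransferPolicy(RetryPolicy):
    """Hypothetical policy for a slower rail: also retry 429, wait longer."""

    def is_retryable(self, status: int) -> bool:
        return status == 429 or 500 <= status < 600

    def max_attempts(self) -> int:
        return 5

    def next_delay(self, attempt: int) -> float:
        return random.uniform(0.0, min(300.0, 5.0 * 2 ** attempt))


def decide(policy: RetryPolicy, status: int, attempts_made: int) -> str:
    """Engine-side decision; it only ever talks to the interface."""
    if not policy.is_retryable(status):
        return "dead"
    if attempts_made >= policy.max_attempts():
        return "dead"
    return "backoff"
```

Adding a gateway then means adding a class: `decide` is untouched, which is the Open/Closed point made above.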

Implementation

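Below are simplified implementations featuring state transitions, backoff math with full jitter, and a circuit breaker model. They compute the backoff delay but leave scheduling the retry, journaling, and the dead-letter queue to the caller.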

Python

import random
import time
from enum import Enum
from typing import Optional

class PaymentState(Enum):
    PENDING = 1
    SENDING = 2
    DONE = 3
    BACKOFF = 4
    DEAD = 5

class CircuitState(Enum):
    CLOSED = 1
    OPEN = 2
    HALF_OPEN = 3

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, recovery_time_sec: float = 10.0):
        self.failure_threshold = failure_threshold
        self.recovery_time_sec = recovery_time_sec
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0.0

    def record_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def record_failure(self):
        self.failure_count += 1
        # Monotonic clock: wall-clock time can jump (for example after an
        # NTP adjustment) and distort the cool-down interval.
        self.last_failure_time = time.monotonic()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

    def allow_execution(self) -> bool:
        now = time.monotonic()
        if self.state == CircuitState.OPEN:
            if now - self.last_failure_time > self.recovery_time_sec:
                # Cool-down elapsed: let a trial call through.
                self.state = CircuitState.HALF_OPEN
                return True
            return False
        return True

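class PaymentRequest:
    def __init__(self, request_id: str, amount: float):
        self.request_id = request_id
        # float keeps the example short; real money code should use
        # integer minor units or Decimal.
        self.amount = amount
        self.attempt = 0
        self.state = PaymentState.PENDING
        # Derived once from the request ID, so every retry reuses the same key.
        self.idempotency_key = f"{request_id}_key"
        # Delay chosen for the next retry; a scheduler would read this.
        self.retry_after_sec = 0.0

class RetryEngine:
    def __init__(self, breaker: CircuitBreaker, max_attempts: int = 3):
        self.breaker = breaker
        self.max_attempts = max_attempts
        self.base_delay = 1.0
        self.max_delay = 60.0

    def execute_payment(self, req: PaymentRequest, gateway_mock_send) -> None:
        if not self.breaker.allow_execution():
            req.state = PaymentState.BACKOFF
            print(f"Circuit open. Rejecting request {req.request_id} early.")
            return

        req.state = PaymentState.SENDING
        req.attempt += 1

        try:
            success, is_transient = gateway_mock_send(req.idempotency_key, req.amount)
        except Exception:
            # A timeout or network error is ambiguous, so treat it as transient:
            # retrying with the same idempotency key is safe.
            success, is_transient = False, True

        if success:
            req.state = PaymentState.DONE
            self.breaker.record_success()
        elif not is_transient:
            # Permanent rejection (e.g. card declined): the gateway is healthy,
            # so do not count it against the breaker, and do not retry.
            req.state = PaymentState.DEAD
        else:
            self.breaker.record_failure()
            if req.attempt < self.max_attempts:
                req.state = PaymentState.BACKOFF
                # Exponential backoff with full jitter.
                raw_delay = min(self.max_delay, self.base_delay * (2 ** req.attempt))
                req.retry_after_sec = random.uniform(0, raw_delay)
                print(f"Transient failure. Backing off for {req.retry_after_sec:.2f} seconds.")
            else:
                # Attempts exhausted: hand off to the dead-letter flow.
                req.state = PaymentState.DEAD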

Java

import java.util.Random;
import java.util.concurrent.locks.ReentrantLock;

enum PaymentState { PENDING, SENDING, DONE, BACKOFF, DEAD }
enum CircuitState { CLOSED, OPEN, HALF_OPEN }

class CircuitBreaker {
    private final int failureThreshold = 3;
    private final long recoveryTimeSec = 10;
    private CircuitState state = CircuitState.CLOSED;
    private int failureCount = 0;
    // System.nanoTime() reading taken at the last failure.
    private long lastFailureTime = 0;
    private final ReentrantLock lock = new ReentrantLock();

    public boolean allowExecution() {
        lock.lock();
        try {
            // nanoTime() is monotonic, so wall-clock adjustments cannot
            // distort the cool-down interval.
            long now = System.nanoTime();
            if (state == CircuitState.OPEN) {
                if (now - lastFailureTime > recoveryTimeSec * 1_000_000_000L) {
                    // Cool-down elapsed: let a trial call through.
                    state = CircuitState.HALF_OPEN;
                    return true;
                }
                return false;
            }
            return true;
        } finally {
            lock.unlock();
        }
    }

    public void recordSuccess() {
        lock.lock();
        try {
            failureCount = 0;
            state = CircuitState.CLOSED;
        } finally {
            lock.unlock();
        }
    }

    public void recordFailure() {
        lock.lock();
        try {
            failureCount++;
            lastFailureTime = System.nanoTime();
            if (failureCount >= failureThreshold) {
                state = CircuitState.OPEN;
            }
        } finally {
            lock.unlock();
        }
    }
}

class PaymentRequest {
    String requestId;
    double amount;
    int attempt = 0;
    PaymentState state = PaymentState.PENDING;
    String idempotencyKey;

    PaymentRequest(String id, double amt) {
        this.requestId = id;
        this.amount = amt;
        this.idempotencyKey = id + "_key";
    }
}

public class RetryEngine {
    private final CircuitBreaker breaker;
    private final double baseDelay = 1.0;
    private final double maxDelay = 60.0;
    private final Random random = new Random();

    public RetryEngine(CircuitBreaker breaker) { this.breaker = breaker; }

    public interface GatewayMock {
        boolean call(String key, double amt) throws Exception;
        boolean isTransient();
    }

    public void executePayment(PaymentRequest req, GatewayMock gateway) {
        if (!breaker.allowExecution()) {
            req.state = PaymentState.BACKOFF;
            return;
        }

        req.state = PaymentState.SENDING;
        req.attempt++;

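        boolean success;
        boolean transientFailure;
        try {
            success = gateway.call(req.idempotencyKey, req.amount);
            transientFailure = !success && gateway.isTransient();
        } catch (Exception e) {
            // A timeout or network error is ambiguous, so treat it as transient:
            // retrying with the same idempotency key is safe.
            success = false;
            transientFailure = true;
        }

        if (success) {
            req.state = PaymentState.DONE;
            breaker.recordSuccess();
        } else if (transientFailure) {
            breaker.recordFailure();
            if (req.attempt < 3) {
                req.state = PaymentState.BACKOFF;
                // Exponential backoff with full jitter.
                double rawDelay = Math.min(maxDelay, baseDelay * Math.pow(2, req.attempt));
                double jitterDelay = random.nextDouble() * rawDelay;
                System.out.println("Backing off for " + jitterDelay + " seconds.");
            } else {
                // Attempts exhausted: hand off to the dead-letter flow.
                req.state = PaymentState.DEAD;
            }
        } else {
            // Permanent rejection (e.g. card declined): the gateway is healthy,
            // so do not count it against the breaker, and do not retry.
            req.state = PaymentState.DEAD;
        }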
    }
}

C++

#include <algorithm>  // std::min
#include <iostream>
#include <string>
#include <chrono>
#include <mutex>
#include <cmath>
#include <random>
#include <stdexcept>
#include <memory>

enum class PaymentState { PENDING, SENDING, DONE, BACKOFF, DEAD };
enum class CircuitState { CLOSED, OPEN, HALF_OPEN };

class CircuitBreaker {
private:
    int failureThreshold = 3;
    long long recoveryTimeSec = 10;
    CircuitState state = CircuitState::CLOSED;
    int failureCount = 0;
    long long lastFailureTime = 0;
    std::mutex mtx;

public:
    bool allowExecution() {
        std::lock_guard<std::mutex> lock(mtx);
        // steady_clock is monotonic, so wall-clock adjustments cannot
        // distort the cool-down interval.
        long long now = std::chrono::duration_cast<std::chrono::seconds>(
            std::chrono::steady_clock::now().time_since_epoch()
        ).count();

        if (state == CircuitState::OPEN) {
            if (now - lastFailureTime > recoveryTimeSec) {
                // Cool-down elapsed: let a trial call through.
                state = CircuitState::HALF_OPEN;
                return true;
            }
            return false;
        }
        return true;
    }

    void recordSuccess() {
        std::lock_guard<std::mutex> lock(mtx);
        failureCount = 0;
        state = CircuitState::CLOSED;
    }

    void recordFailure() {
        std::lock_guard<std::mutex> lock(mtx);
        failureCount++;
        lastFailureTime = std::chrono::duration_cast<std::chrono::seconds>(
            std::chrono::steady_clock::now().time_since_epoch()
        ).count();
        if (failureCount >= failureThreshold) {
            state = CircuitState::OPEN;
        }
    }
};

class PaymentRequest {
public:
    std::string requestId;
    double amount;
    int attempt = 0;
    PaymentState state = PaymentState::PENDING;
    std::string idempotencyKey;

    PaymentRequest(std::string id, double amt) : requestId(id), amount(amt), idempotencyKey(id + "_key") {}
};

class RetryEngine {
private:
    std::shared_ptr<CircuitBreaker> breaker;
    double baseDelay = 1.0;
    double maxDelay = 60.0;
    std::mt19937 rng{std::random_device{}()};

public:
    RetryEngine(std::shared_ptr<CircuitBreaker> cb) : breaker(cb) {}

    void executePayment(std::shared_ptr<PaymentRequest> req, bool success, bool isTransient) {
        if (!breaker->allowExecution()) {
            req->state = PaymentState::BACKOFF;
            return;
        }

        req->state = PaymentState::SENDING;
        req->attempt++;

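        if (success) {
            req->state = PaymentState::DONE;
            breaker->recordSuccess();
        } else if (!isTransient) {
            // Permanent rejection (e.g. card declined): the gateway is healthy,
            // so do not count it against the breaker, and do not retry.
            req->state = PaymentState::DEAD;
        } else {
            breaker->recordFailure();
            if (req->attempt < 3) {
                req->state = PaymentState::BACKOFF;
                // Exponential backoff with full jitter.
                double rawDelay = std::min(maxDelay, baseDelay * std::pow(2, req->attempt));
                std::uniform_real_distribution<double> dist(0.0, rawDelay);
                double jitterDelay = dist(rng);
                std::cout << "Backing off for " << jitterDelay << " seconds.\n";
            } else {
                // Attempts exhausted: hand off to the dead-letter flow.
                req->state = PaymentState::DEAD;
            }
        }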
    }
};

Climb in order: every rung assumes the one above it. Solve each problem on LeetCode.

Warm-up — explicit, legal-only states

Design an ATM Machine (Medium)
A finite machine where actions are legal only in the right state: the retry lifecycle's backbone.

Core — idempotency & budgets

Dedupe an effect; expire a budget of attempts.

Logger Rate Limiter (Easy)
Allow an action once per window: the dedupe behind an idempotency key.
Design Authentication Manager (Medium)
Issue / renew / expire with a TTL: the attempt budget and backoff clock.

Stretch — guarded transactions

Simple Bank System (Medium)
Money operations that validate before mutating — the invariant a retry must never violate.