Rate Limiting

The five algorithms (fixed window, sliding log, sliding counter, token bucket, leaky bucket), how to make them work across many servers with Redis, and the trade-offs.

rate-limitingredisapi-gateway

Why & where

Rate limiting protects you from abuse, accidental floods, and cost blowups, and enforces fair use / billing tiers. Put it at the edge / API gateway so bad traffic dies before it touches your services. StockVision uses slowapi for this and meters its B2B API keys; StockStump's per-player trade lock is a specialized one-token limiter.

The algorithms

AlgorithmIdeaTrade-off
Fixed windowcount per clock window (per minute)cheap; burst at the boundary (2× at the edges)
Sliding logstore every request timestampexact; memory-heavy
Sliding window counterweighted blend of current + previous windowgreat accuracy/cost balance
Token buckettokens refill at rate R, each request takes oneallows controlled bursts; the usual default
Leaky bucketqueue drains at fixed ratesmooths output to a constant rate

Token bucket is the most common answer: it permits short bursts (up to the bucket size) while bounding the long-run rate.

The two knobs map directly to behavior: R bounds the long-run rate, B bounds the burst. A full bucket after a quiet period is the "controlled burst" — that's a feature, not a bug.

Python
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.ts = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.ts) * self.rate)
        self.ts = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

Here it is running. The bucket starts full, so an opening burst of 5 passes straight through; the 6th request in that burst is refused with a 429, and then the steady drip lets traffic through again. Change R, the capacity B, or the arrival times and watch the burst-versus-throttle trade-off play out:

Token bucket — burst, then throttletime O(1) per requestspace O(1)
↓ refill 2/st = 0s
bucket
5/5
request— (refilling)

1/11Bucket starts full: 5 tokens. R = 2/s bounds the long-run rate; B = 5 is the burst a quiet period earns you.

tokens = 5/5R = 2/s

Distributed rate limiting

With many app servers, an in-memory counter per server multiplies the real limit by the server count. So keep the counter in a shared store (Redis) and mutate it atomically — a Lua script or INCR + EXPIRE makes the read-modify-write a single atomic op, avoiding the same race that bites naive balance checks.

Always return the right signal

On limit, respond 429 Too Many Requests with a Retry-After header so good clients back off correctly. (LandAI's ingestion respects upstream Retry-After for the same reason — be a polite client and expect politeness back.)

Design drills

You know the five algorithms — now rehearse the system decisions an interviewer actually probes.

Design drills: Rate limiting0/5 done

Whiteboard each one out loud for 5–10 minutes before you reveal what a strong answer covers — the gap between your sketch and the checklist is your study list. Progress is saved on this device.

Warm-up

A public API sells a 1,000 requests/minute tier that must tolerate short bursts. Pick an algorithm and set its knobs — what are R and B, and what do you return on reject?

Core

Now run it behind 10 gateway nodes. Where does the counter live, and how do you avoid the 'real limit = limit × 10' bug?

Core

A customer complains they get throttled in bursts right at each minute boundary. Diagnose it and fix it without storing every timestamp.

Stretch

Redis — your shared counter — goes down. What happens to request handling, and what should happen?

Stretch

A multi-tenant SaaS needs per-user, per-organization, and global caps enforced at once. How do you layer them?