Why & where
Rate limiting protects you from abuse, accidental floods, and cost blowups, and
enforces fair use / billing tiers. Put it at the edge / API gateway so bad
traffic dies before it touches your services. StockVision uses slowapi for
this and meters its B2B API keys; StockStump's per-player trade lock is a
specialized one-token limiter.
The algorithms
| Algorithm | Idea | Trade-off |
|---|---|---|
| Fixed window | count per clock window (per minute) | cheap; burst at the boundary (2× at the edges) |
| Sliding log | store every request timestamp | exact; memory-heavy |
| Sliding window counter | weighted blend of current + previous window | great accuracy/cost balance |
| Token bucket | tokens refill at rate R, each request takes one | allows controlled bursts; the usual default |
| Leaky bucket | queue drains at fixed rate | smooths output to a constant rate |
Token bucket is the most common answer: it permits short bursts (up to the bucket size) while bounding the long-run rate.
The two knobs map directly to behavior: R bounds the long-run rate, B bounds the burst. A full bucket after a quiet period is the "controlled burst" — that's a feature, not a bug.
import time
class TokenBucket:
def __init__(self, rate, capacity):
self.rate, self.capacity = rate, capacity
self.tokens = capacity
self.ts = time.monotonic()
def allow(self, cost=1):
now = time.monotonic()
self.tokens = min(self.capacity, self.tokens + (now - self.ts) * self.rate)
self.ts = now
if self.tokens >= cost:
self.tokens -= cost
return True
return False
Here it is running. The bucket starts full, so an opening burst of 5 passes straight through; the 6th request in that burst is refused with a 429, and then the steady drip lets traffic through again. Change R, the capacity B, or the arrival times and watch the burst-versus-throttle trade-off play out:
1/11Bucket starts full: 5 tokens. R = 2/s bounds the long-run rate; B = 5 is the burst a quiet period earns you.
Distributed rate limiting
With many app servers, an in-memory counter per server multiplies the real
limit by the server count. So keep the counter in a shared store (Redis) and
mutate it atomically — a Lua script or INCR + EXPIRE makes the
read-modify-write a single atomic op, avoiding the same race that bites naive
balance checks.
On limit, respond 429 Too Many Requests with a Retry-After header so good
clients back off correctly. (LandAI's ingestion respects upstream Retry-After
for the same reason — be a polite client and expect politeness back.)
Design drills
You know the five algorithms — now rehearse the system decisions an interviewer actually probes.
Whiteboard each one out loud for 5–10 minutes before you reveal what a strong answer covers — the gap between your sketch and the checklist is your study list. Progress is saved on this device.
A public API sells a 1,000 requests/minute tier that must tolerate short bursts. Pick an algorithm and set its knobs — what are R and B, and what do you return on reject?
Now run it behind 10 gateway nodes. Where does the counter live, and how do you avoid the 'real limit = limit × 10' bug?
A customer complains they get throttled in bursts right at each minute boundary. Diagnose it and fix it without storing every timestamp.
Redis — your shared counter — goes down. What happens to request handling, and what should happen?
A multi-tenant SaaS needs per-user, per-organization, and global caps enforced at once. How do you layer them?