The problem crosses machines
You've met mutual exclusion in-process: a mutex around the parking spot, a lock per elevator dispatch. But your service now runs as N identical instances — and some work must happen exactly once across the fleet:
- The nightly settlement job — 12 instances each running the same cron means 12 settlements.
- "Refresh this cache entry on expiry" — without coordination, a popular key expiring triggers a stampede: 200 instances recompute the same value (cache stampede).
- "Process this user's import" — two workers grabbing the same job.
A process-local mutex protects one process's threads. None of these problems live in one process. Enter the distributed lock: a lease on a name, stored somewhere all instances can see.
The Redis lock (the workhorse)
One atomic command does it:
SET lock:settlement-2026-06-11 <my-unique-token> NX PX 30000
- NX (set only if not exists): the atomic check-and-set that makes it a lock; exactly one caller wins.
- PX 30000 (a 30-second TTL): if the holder crashes, the lock self-releases. Locks without TTLs are outages waiting for a holder to die (the seat-lock lesson, verbatim).
- <my-unique-token>: a random value identifying this holder, so release can be safe:
-- release: delete ONLY if it's still mine (atomic via Lua script)
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])  -- 1: lock released
else
  return 0  -- not ours (expired or taken over): leave it alone
end
Why the token check matters: if your work overran the TTL, the lock
expired and someone else acquired it — a bare DEL would release
their lock. This trio — NX, TTL, owner-checked release — is the
minimum competent answer.
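The trio can be modeled without a Redis server. The sketch below is an in-memory stand-in, not a Redis client: `LeaseStore`, its method names, and the explicit `now_ms` clock are all hypothetical, chosen so the overrun scenario can be replayed deterministically. It only illustrates the semantics of SET ... NX PX and of the owner-checked release.

```python
import secrets


class LeaseStore:
    """Hypothetical in-memory stand-in for the shared store (not a Redis client).

    Time is passed in explicitly as milliseconds so the scenario is repeatable.
    """

    def __init__(self):
        self._entries = {}  # name -> (token, expires_at_ms)

    def _live(self, name, now_ms):
        entry = self._entries.get(name)
        if entry is not None and entry[1] <= now_ms:
            del self._entries[name]  # TTL elapsed: behave as if the key is gone
            entry = None
        return entry

    def acquire(self, name, token, ttl_ms, now_ms):
        """Models SET name token NX PX ttl_ms: succeeds only if no live entry exists."""
        if self._live(name, now_ms) is not None:
            return False
        self._entries[name] = (token, now_ms + ttl_ms)
        return True

    def release(self, name, token, now_ms):
        """Owner-checked release: delete only if the stored token is ours."""
        entry = self._live(name, now_ms)
        if entry is not None and entry[0] == token:
            del self._entries[name]
            return True
        return False

    def holder(self, name, now_ms):
        entry = self._live(name, now_ms)
        return entry[0] if entry else None


store = LeaseStore()
token_a, token_b = secrets.token_hex(16), secrets.token_hex(16)

a_won = store.acquire("lock:settlement", token_a, ttl_ms=30_000, now_ms=0)
b_blocked = store.acquire("lock:settlement", token_b, ttl_ms=30_000, now_ms=1_000)
# A overruns its TTL, so B can acquire once the lease has expired.
b_won = store.acquire("lock:settlement", token_b, ttl_ms=30_000, now_ms=31_000)
# A finishes late and tries to release: the token check refuses.
a_released = store.release("lock:settlement", token_a, now_ms=45_000)
still_b = store.holder("lock:settlement", now_ms=45_000) == token_b

print(a_won, b_blocked, b_won, a_released, still_b)  # True False True False True
```

With a bare delete in place of the token comparison, the call at t=45s would have removed B's live lease, which is exactly the failure the owner check exists to prevent.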
The uncomfortable truth: TTL locks can't be perfect
Here is the failure every senior interviewer eventually asks about:
t=0 instance A acquires lock (TTL 30s), starts work
t=5 A hits a 40-second GC pause / network hang ← it doesn't KNOW it's paused
t=30 TTL expires; lock self-releases
t=31 instance B acquires the lock, starts the same work
t=45 A wakes up, BELIEVES it still holds the lock, finishes its write
→ two writers. The lock "worked" and you still lost.
A paused process cannot know it's been paused (GC pauses are real). TTL too short → spurious takeovers; too long → slow failure recovery. The fix is fencing tokens: the lock service hands out a monotonically increasing number with each grant (A gets 33, B gets 34), and the protected resource — the database, the storage layer — rejects writes carrying a stale token:
UPDATE jobs SET result = :r, fence = :token
WHERE id = :id AND fence < :token;
-- B (token 34) writes first: 1 row. A's late write (token 33) then finds fence = 34: 0 rows. Blocked.
Note what just happened: correctness moved from the lock to a conditional write at the resource — which quietly reveals the deepest truth on this page: if the resource can check a condition, the resource was the real serialization point all along.
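The conditional write can be tried end to end with Python's built-in sqlite3 module. This is a minimal sketch under assumptions: the `jobs` schema, the starting fence of 0, and the helper name `fenced_write` are illustrative, and the tokens 33 and 34 are hard-coded rather than issued by a real lock service.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: each row remembers the highest fence token it has accepted.
conn.execute(
    "CREATE TABLE jobs ("
    " id INTEGER PRIMARY KEY,"
    " result TEXT,"
    " fence INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO jobs (id) VALUES (1)")
conn.commit()


def fenced_write(conn, job_id, result, token):
    """Apply the write only if `token` is newer than the last accepted one."""
    cur = conn.execute(
        "UPDATE jobs SET result = ?, fence = ? WHERE id = ? AND fence < ?",
        (result, token, job_id, token),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated means the token was stale


b_ok = fenced_write(conn, 1, "written by B", 34)         # current holder
a_ok = fenced_write(conn, 1, "written by A (late)", 33)  # paused holder wakes up
final = conn.execute("SELECT result, fence FROM jobs WHERE id = 1").fetchone()

print(b_ok, a_ok, final)  # True False ('written by B', 34)
```

One caveat the sketch makes visible: the row can only reject token 33 after it has seen 34. If A's late write arrived before B had written anything, it would be accepted, and B's write would then replace it.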
For locks that must be highly available themselves, the coordination-service tier exists — ZooKeeper/etcd hold locks as ephemeral entries under consensus, with session-based liveness instead of bare TTLs (and Kubernetes' leader election rides on exactly this). Redlock — Redis's multi-node lock algorithm — is worth naming alongside the famous Kleppmann critique: without fencing, no lock survives the pause problem; with fencing, single-instance Redis is usually enough.
The senior move: not needing the lock
The strongest answer to "how would you lock this?" is often "I wouldn't." The alternatives, in the order to reach for them:
- Idempotency: if running twice is harmless, duplicates need no prevention. The settlement job writes results keyed by date (INSERT ... ON CONFLICT DO NOTHING); twelve starts, one effect (the universal toolkit).
- Unique constraints / conditional writes: let the database's atomicity be the arbiter. Whoever inserts the (job_id) row first owns the job (the BookMyShow backstop). This is a lock, one the database enforces perfectly, pauses and all.
- Single-consumer queues: partition work through a queue; a Kafka partition has exactly one consumer per group by construction. Ownership without locks.
- Leader election: for "exactly one instance runs the schedulers," elect one leader (etcd/Kubernetes lease) instead of locking per task.
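The first two alternatives can be sketched with sqlite3 from the standard library. The table, column names, and placeholder total below are assumptions made for illustration, and the twelve "instances" run one after another in a single process rather than concurrently; the point is only that the primary key, not a lock, decides who wins. The ON CONFLICT clause needs SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per settlement date, enforced by the primary key.
conn.execute(
    "CREATE TABLE settlements ("
    " settle_date TEXT PRIMARY KEY,"
    " total_cents INTEGER NOT NULL,"
    " run_by TEXT NOT NULL)"
)


def run_settlement(conn, settle_date, instance):
    """Try to record the settlement; return True if this instance's row was kept."""
    total_cents = 123_456  # placeholder for the real computation
    conn.execute(
        "INSERT INTO settlements (settle_date, total_cents, run_by)"
        " VALUES (?, ?, ?)"
        " ON CONFLICT(settle_date) DO NOTHING",
        (settle_date, total_cents, instance),
    )
    conn.commit()
    owner = conn.execute(
        "SELECT run_by FROM settlements WHERE settle_date = ?", (settle_date,)
    ).fetchone()[0]
    return owner == instance


# Twelve instances all fire the same cron for the same date.
winners = [i for i in range(12) if run_settlement(conn, "2026-06-11", f"instance-{i}")]
row_count = conn.execute("SELECT COUNT(*) FROM settlements").fetchone()[0]

print(winners, row_count)  # [0] 1
```

In this sketch every instance still computes the total before losing the insert, so the constraint buys correctness but not efficiency. Claiming the row first and computing afterwards is the variant that also saves the work.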
The honest hierarchy: locks for efficiency (avoiding duplicate expensive work — a stampede guard where a rare duplicate is just wasted CPU) need only the simple Redis pattern; locks for correctness (two writers corrupt data) need fencing at the resource — at which point consider whether the conditional write alone suffices. Classifying the need before choosing the mechanism is the interview.
Common mistakes
- No TTL — holder dies, system deadlocks until a human notices.
- Unchecked release — deleting a lock you no longer hold, cascading the overlap you tried to prevent.
- TTL shorter than the work — guaranteed double-execution under load, discovered in production. Estimate work time honestly; renew ("heartbeat") long tasks — and know renewal doesn't fix the pause problem either.
- Locking for correctness without fencing — the t=45 scenario above ships to prod in most codebases that "use Redis locks."
- One global lock for throughput-critical work — you've built a Stage-1 bottleneck on purpose; lock at the finest natural granularity (per user, per job), like per-show seat locks.
- Reaching for ZooKeeper for a cache stampede — efficiency locks don't deserve consensus-grade machinery; match the tool to the stakes.
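The heartbeat mentioned under "TTL shorter than the work" has the same ownership rule as release: extend the lease only if it is still live and still yours. The sketch below is an in-memory model with a hypothetical `Lease` class and explicit millisecond timestamps, not a Redis client. Against Redis the equivalent would presumably be a small script that compares the token and then calls PEXPIRE.

```python
class Lease:
    """Hypothetical single named lease held in memory; times are explicit ms."""

    def __init__(self):
        self.token = None
        self.expires_at = 0

    def acquire(self, token, ttl_ms, now_ms):
        """Succeeds only if nobody holds a live lease."""
        if self.token is not None and now_ms < self.expires_at:
            return False
        self.token, self.expires_at = token, now_ms + ttl_ms
        return True

    def renew(self, token, ttl_ms, now_ms):
        """Heartbeat: extend the TTL only if the lease is still live and still ours."""
        if self.token == token and now_ms < self.expires_at:
            self.expires_at = now_ms + ttl_ms
            return True
        return False


lease = Lease()
acquired = lease.acquire("A", ttl_ms=30_000, now_ms=0)
# A heartbeats every 10 s while healthy; each renewal pushes expiry out 30 s.
on_time = [lease.renew("A", 30_000, t) for t in (10_000, 20_000)]
# A then stalls from t=20s to t=65s. No heartbeats are sent; expiry stays at t=50s.
b_took_over = lease.acquire("B", 30_000, now_ms=51_000)
late_heartbeat = lease.renew("A", 30_000, now_ms=65_000)

print(acquired, on_time, b_took_over, late_heartbeat)  # True [True, True] True False
```

The failed renewal at t=65s tells A it lost the lease, but only once A is running again. Anything A wrote between waking and checking is still unprotected, which is why renewal does not replace fencing.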
Practice
- Build the lock: Redis locally — implement acquire (NX+PX), owner-checked release (Lua), and a heartbeat renewer. Two terminal processes contending; kill the holder and watch TTL recovery.
- Reproduce the pause bug: TTL 2 s, work = sleep(4), two processes; watch both "complete" the same job. Then add a fence column with a conditional UPDATE and watch the stale writer get 0 rows. (The most instructive 30 lines in distributed systems.)
- Delete the lock: redesign the cron problem with ON CONFLICT DO NOTHING only. Measure what the lock was actually buying you.
- Classify: for each of cache stampede guard, seat checkout, nightly billing, and "one scheduler per cluster", is it efficiency or correctness? Which mechanism from this page, and why?