Databases at Scale

SQL vs NoSQL with a real reason, leader-follower replication, the four sharding strategies, and the hotspot/rebalancing traps that consistent hashing solves.

SQL vs NoSQL — pick with a reason

Relational (Postgres/MySQL): strong schema, ACID transactions, joins, rich queries. Default when you have relationships and invariants (users, orders, balances). All three of my backends use Postgres.
NoSQL: document (Mongo), key-value (Redis/DynamoDB), wide-column (Cassandra), graph. Choose for a specific access pattern at scale — massive write throughput, flexible schema, or simple key lookups — accepting weaker guarantees.

The honest answer

"SQL unless I have a reason." Most apps are well served by Postgres until a clear pressure (write volume, scale-out, schema-flexibility, or a key-value access pattern) justifies a specialized store. Saying that signals maturity.

Replication — scale reads, survive failures

Leader-follower: writes go to the leader, which streams its log to followers that serve reads.

Wins: read scale + availability (promote a follower if the leader dies). Catch: replication lag → a follower read can be stale (a read-your-writes problem). Fixes: route a user's own reads to the leader briefly, or read from the leader for must-be-fresh queries.
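The first fix can be sketched as a small router that pins a user's reads to the leader for a short window after that user writes. This is a minimal sketch, not a production design: `ReadRouter`, `pin_seconds`, and the 5-second default are hypothetical, and the window is assumed to be longer than the typical replication lag.

```python
import time


class ReadRouter:
    """Route reads to the leader for a short window after a user's own write."""

    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds   # assumed upper bound on replication lag
        self.clock = clock               # injectable so the sketch can be tested
        self._last_write = {}            # user_id -> time of most recent write

    def record_write(self, user_id):
        self._last_write[user_id] = self.clock()

    def choose(self, user_id, must_be_fresh=False):
        """Return "leader" or "follower" for this user's next read."""
        if must_be_fresh:
            return "leader"              # must-be-fresh queries always hit the leader
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "leader"              # a follower may not have this write yet
        return "follower"
```

Other users keep reading from followers, so the leader only absorbs the extra reads of users who have just written.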

Sharding — when one box can't hold the writes/data

Partition data across nodes by a shard key:

Strategy	How	Watch out for
Range	by key ranges (A–M, N–Z)	hotspots on sequential keys (timestamps!)
Hash	`hash(key) % N`	resharding moves almost everything when N changes
Consistent hashing	keys + nodes on a ring	the standard; minimal movement on resize, but needs virtual nodes to even out skew
Directory	a lookup service maps key→shard	flexible, but the directory is a single point of failure (SPOF)

Consistent hashing is the one to know: adding/removing a node only remaps the keys between adjacent points on the ring (≈ 1/N of keys), not the whole dataset — and virtual nodes smooth out skew.

Here's why it beats hash(key) % N. Each key walks clockwise to the next node, so when you add a node (say N4 joins N1–N3), only the keys in its new arc move. With % N, changing N from 3 to 4 changes the result for most keys and forces a near-total remap.

[Interactive demo: consistent hashing. Add a node, move ~1/N keys. Time: O(log N) lookup. Space: O(N + keys).]

Step 1 of 8: 3 nodes on the ring. Each key belongs to the next node clockwise from its hash position; that's the whole rule.
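The ring rule can be checked with a short sketch that counts how many keys change owner when a fourth node joins, under both schemes. Everything here is illustrative: the `Ring` class, the `h` helper, the choice of SHA-256, 100 virtual nodes per node, and the 10,000 synthetic keys are assumptions, not a reference implementation.

```python
import bisect
import hashlib


def h(value: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._points = []                      # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node is placed at `vnodes` positions to smooth out skew.
        for i in range(self.vnodes):
            bisect.insort(self._points, (h(f"{node}#{i}"), node))

    def lookup(self, key):
        # Binary search for the first point clockwise from the key, wrapping around.
        i = bisect.bisect_right(self._points, (h(key), ""))
        return self._points[i % len(self._points)][1]


keys = [f"key-{i}" for i in range(10_000)]

ring = Ring(["N1", "N2", "N3"])
before = {k: ring.lookup(k) for k in keys}
ring.add("N4")
moved_ring = sum(before[k] != ring.lookup(k) for k in keys)

# Same keys under hash(key) % N when N goes from 3 to 4.
moved_mod = sum(h(k) % 3 != h(k) % 4 for k in keys)

print(f"ring: {moved_ring} moved, modulo: {moved_mod} moved")
```

On the ring, every key that moves lands on N4, and the expected share is about a quarter of the keys. Under modulo, a key keeps its shard only when `h % 3 == h % 4`, which holds for about a quarter of hashes, so roughly three quarters of the keys move.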

Choosing the shard key is the whole game

A bad key creates hotspots (one shard gets all the traffic) and cross-shard queries/joins (slow, hard to make transactional). Pick a key with high cardinality and even access — and design so most queries hit a single shard.
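To see what "high cardinality" buys, the sketch below spreads a hypothetical orders table over 8 shards twice: once keyed by a three-value `status` column and once by `customer_id`. The row shape, the shard count, and the `shard_for` helper are made up for illustration. Plain modulo placement is used only to keep the example short; it does not address resharding.

```python
import hashlib
from collections import Counter


def shard_for(value, num_shards):
    """Map a shard-key value to a shard with a stable hash."""
    digest = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def load_per_shard(rows, key, num_shards):
    """Count how many rows land on each shard for a given shard key."""
    counts = Counter(shard_for(row[key], num_shards) for row in rows)
    return [counts.get(s, 0) for s in range(num_shards)]


# Hypothetical data: 10,000 orders, 1,000 customers, 3 possible statuses.
statuses = ("open", "paid", "shipped")
orders = [
    {"order_id": i, "customer_id": i % 1000, "status": statuses[i % 3]}
    for i in range(10_000)
]

by_status = load_per_shard(orders, "status", 8)
by_customer = load_per_shard(orders, "customer_id", 8)

print("by status:     ", by_status)
print("by customer_id:", by_customer)
```

With only three distinct status values, at most three of the eight shards can receive any rows, so the rest sit idle while a few run hot. Keying by `customer_id` spreads the rows and also keeps each customer's orders on one shard, which is what lets a "my orders" query stay single-shard.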

Replication Topologies

To scale reads and writes and to survive hardware failures, databases replicate data across multiple nodes using one of three topologies:

1. Single-Leader Replication (Master-Follower)

How it works: Writes are sent only to the leader. The leader writes to its log and sends updates to the followers. Followers only serve reads.
Sync vs. Async vs. Semi-Sync:
- Synchronous: The leader waits for every follower to confirm before returning success. No acknowledged write is lost if the leader fails, but latency is high and writes block if any follower is down.
- Asynchronous: The leader returns success as soon as it has written locally. Latency is low, but writes that have not yet reached a follower are lost if the leader crashes.
- Semi-Synchronous: The leader waits for one follower to acknowledge while the others replicate asynchronously. This is a common compromise between safety and latency.
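The three modes differ only in how many follower acknowledgements the leader waits for, which the following toy model makes explicit. `required_acks` and `write_outcome` are hypothetical names, and the model ignores timeouts and retries that a real database would have.

```python
def required_acks(mode: str, num_followers: int) -> int:
    """Follower acknowledgements the leader waits for before confirming a write."""
    if mode == "sync":
        return num_followers
    if mode == "semi-sync":
        return min(1, num_followers)
    if mode == "async":
        return 0
    raise ValueError(f"unknown mode: {mode}")


def write_outcome(mode, follower_up):
    """Model one write. follower_up holds True for each follower that can acknowledge.

    Returns (confirmed, guaranteed_copies): whether the leader can report success,
    and how many nodes are guaranteed to hold the write at that moment
    (the leader plus the followers it waited for). Unconfirmed writes report 0.
    """
    needed = required_acks(mode, len(follower_up))
    if sum(follower_up) < needed:
        return (False, 0)          # the write blocks: not enough followers can ack
    return (True, 1 + needed)


print(write_outcome("sync", [True, False]))       # (False, 0): one dead follower blocks writes
print(write_outcome("semi-sync", [True, False]))  # (True, 2): leader plus one follower
print(write_outcome("async", [False, False]))     # (True, 1): only the leader has it
```

A confirmed write with a single guaranteed copy is exactly the asynchronous data-loss window: if the leader dies before shipping its log, that write is gone.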

2. Multi-Leader Replication (Active-Active / Master-Master)

How it works: Multiple nodes act as leaders, accepting both reads and writes. They replicate state updates to each other. Used across multiple geographical datacenters.
Conflict Resolution: If two leaders receive conflicting writes (e.g., updating the same row to different values at the same time):
- Last-Write-Wins (LWW): Uses timestamps to discard the older write. (Risky: with clock skew, the write that actually happened later can carry the earlier timestamp and be silently dropped.)
- Conflict-Free Replicated Data Types (CRDTs): Mathematical data structures (like G-Counters or PN-Counters) that merge automatically without conflicts.
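Both strategies fit in a few lines. The sketch below shows a G-Counter, whose merge is an element-wise maximum over per-replica counts, next to a last-write-wins merge over `(timestamp, value)` pairs. The class and function names and the replica ids `"us"` and `"eu"` are illustrative.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged with an element-wise max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}                      # replica_id -> increments seen from it

    def increment(self, amount=1):
        # A replica only ever bumps its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other):
        # max() is commutative, associative, and idempotent, so merges never conflict.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self):
        return sum(self.counts.values())


def lww_merge(a, b):
    """Last-write-wins over (timestamp, value) pairs: the older write is discarded."""
    return a if a[0] >= b[0] else b


us, eu = GCounter("us"), GCounter("eu")
us.increment(3)          # three increments accepted by the US leader
eu.increment(2)          # two accepted concurrently by the EU leader
us.merge(eu)
eu.merge(us)
print(us.value(), eu.value())                 # 5 5: no increment is lost

print(lww_merge((10, "x"), (12, "y")))        # (12, 'y'): the write stamped 10 is dropped
```

The counter keeps both leaders' updates, while LWW keeps only one of the two writes. That is the trade-off: LWW is simple but lossy, and a CRDT is lossless but only exists for data types with a well-defined merge.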

3. Leaderless Replication (Dynamo-Style)

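How it works: Clients send writes and reads directly to multiple nodes simultaneously. Popularized by Amazon's Dynamo (the internal key-value store described in Amazon's 2007 paper, which is a different system from the DynamoDB managed service) and used by Apache Cassandra.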
Data Consistency:
- Hinted Handoff: If a node is down, another node temporarily stores the write as a "hint" and delivers it when the target node recovers.
- Read Repair: When a client reads from multiple nodes and detects a version mismatch, the client (or the coordinating node) writes the newest value back to the stale nodes.
- Anti-Entropy: Background processes compare replicas using hash trees (Merkle trees) to find and fix out-of-sync data.
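Read repair is easiest to follow in a toy model. The sketch below assumes the usual Dynamo-style quorum rule: with N replicas, a write needs W acknowledgements and a read contacts R nodes, and W + R > N guarantees that every read overlaps at least one node holding the latest write. The integer version numbers, the `Replica` class, and both functions are simplifications for illustration; real systems use timestamps or vector clocks, and hinted handoff is not modeled.

```python
class Replica:
    def __init__(self):
        self.data = {}       # key -> (version, value)
        self.up = True


def quorum_write(replicas, key, version, value, w):
    """Send the write to every reachable replica; succeed if at least w acknowledge."""
    acks = 0
    for rep in replicas:
        if rep.up:
            rep.data[key] = (version, value)
            acks += 1
    return acks >= w


def quorum_read(replicas, key, r):
    """Read from r reachable replicas, return the newest value, repair stale ones."""
    contacted = [rep for rep in replicas if rep.up][:r]
    if len(contacted) < r:
        raise RuntimeError("not enough replicas for a read quorum")
    answers = [rep.data.get(key, (0, None)) for rep in contacted]
    newest = max(answers, key=lambda a: a[0])
    for rep, answer in zip(contacted, answers):
        if answer[0] < newest[0]:
            rep.data[key] = newest            # read repair
    return newest[1]


nodes = [Replica(), Replica(), Replica()]      # N = 3, W = 2, R = 2
quorum_write(nodes, "cart", 1, "a", w=2)
nodes[0].up = False                            # node 0 misses the next write
quorum_write(nodes, "cart", 2, "b", w=2)
nodes[0].up = True                             # node 0 is back, still holding version 1
print(quorum_read(nodes, "cart", r=2))         # b
print(nodes[0].data["cart"])                   # (2, 'b') after read repair
```

The read contacts nodes 0 and 1, sees versions 1 and 2, returns the newer value, and brings node 0 up to date as a side effect.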

CQRS & Event Sourcing

In high-throughput applications, serving reads and writes from the same tables can become a bottleneck, because the two workloads favor different schemas. CQRS and Event Sourcing decouple the read and write paths:

1. CQRS (Command Query Responsibility Segregation)

Instead of using a single schema for both reads and writes, CQRS splits them:

Write DB (Command): Highly normalized, optimized for insertions, updates, and enforcing transactions/business rules.
Read DB (Query): Highly denormalized (e.g., Elasticsearch or read-replicas), optimized for fast query execution and search.
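Synchronization: Commands write to the Command database, which triggers an async event message (via Kafka/RabbitMQ) to update the Read database. Because the update is asynchronous, the read side is eventually consistent and can briefly lag the write side.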
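The split can be sketched in memory with two dictionaries and a queue. All names here are hypothetical: `write_db` stands in for the normalized command store, `read_db` for the denormalized query store, and the `deque` for a broker such as Kafka or RabbitMQ.

```python
from collections import deque

write_db = {}          # normalized: order_id -> {"customer": ..., "total": ...}
read_db = {}           # denormalized: customer -> {"orders": n, "spent": total}
event_bus = deque()    # stand-in for a message broker


def handle_place_order(order_id, customer, total):
    """Command side: enforce business rules, write, then publish an event."""
    if total <= 0:
        raise ValueError("order total must be positive")
    if order_id in write_db:
        raise ValueError("duplicate order")
    write_db[order_id] = {"customer": customer, "total": total}
    event_bus.append({"type": "OrderPlaced", "customer": customer, "total": total})


def project_events():
    """Read side: drain the bus and update the query-optimized view."""
    while event_bus:
        event = event_bus.popleft()
        view = read_db.setdefault(event["customer"], {"orders": 0, "spent": 0})
        view["orders"] += 1
        view["spent"] += event["total"]


def query_customer_summary(customer):
    """Query side: a single lookup, no joins or aggregation at read time."""
    return read_db.get(customer, {"orders": 0, "spent": 0})


handle_place_order(1, "ana", 30)
handle_place_order(2, "ana", 20)
print(query_customer_summary("ana"))   # {'orders': 0, 'spent': 0}: projector has not run yet
project_events()
print(query_customer_summary("ana"))   # {'orders': 2, 'spent': 50}
```

The first query returns stale data because the events are still on the bus. That window is the price of the asynchronous update between the two stores.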

2. Event Sourcing

Rather than storing the current state of a row directly, Event Sourcing stores the state changes as a sequence of immutable events (an audit log):

Current State: Account Balance = $150

Event Store (Immutable Ledger):
  1. AccountOpened (ID=101)
  2. FundsDeposited (Amount=$200)
  3. FundsWithdrawn (Amount=$50)

Pros: Complete, unalterable historical audit trail (well suited to financial/medical systems). Reconstructing state at any historical point is straightforward: replay the events up to that point.
Cons: Reconstructing the current state requires replaying all historical events, which gets slower as the log grows. To bound the cost, periodically store a snapshot of the current state and replay only the events recorded after it.
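The ledger above can be replayed in a few lines. This is a minimal sketch: the tuple encoding of events and the `(events_already_applied, balance)` snapshot shape are assumptions made for the example.

```python
def apply(balance, event):
    """Apply one event to the running balance and return the new balance."""
    kind, amount = event
    if kind == "AccountOpened":
        return 0
    if kind == "FundsDeposited":
        return balance + amount
    if kind == "FundsWithdrawn":
        return balance - amount
    raise ValueError(f"unknown event: {kind}")


def replay(events, snapshot=None):
    """Rebuild the balance. snapshot is (events_already_applied, balance) or None."""
    start, balance = snapshot if snapshot else (0, 0)
    for event in events[start:]:
        balance = apply(balance, event)
    return balance


events = [
    ("AccountOpened", None),
    ("FundsDeposited", 200),
    ("FundsWithdrawn", 50),
]

print(replay(events))                       # 150: the current state
print(replay(events[:2]))                   # 200: the state before the withdrawal
print(replay(events, snapshot=(2, 200)))    # 150: only the third event is replayed
```

With a snapshot taken after the second event, the replay touches one event instead of three. On a long-lived account that difference is what keeps reads fast.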

Indexing & denormalization (one line each)

An index turns an O(n) scan into an O(log n) B-tree lookup at the cost of write speed and space. Denormalization duplicates data to avoid expensive joins on the read path — you trade write complexity (keep copies in sync) for read speed.
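The scan-versus-lookup difference shows up in a query plan. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Postgres, since SQLite indexes are also B-trees. The `orders` table, its columns, and the index name are invented for the example, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders (user_id, total) VALUES (?, ?)",
    [(i % 500, i * 1.5) for i in range(5_000)],
)


def plan(sql):
    """Return SQLite's query-plan description for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)   # the last column holds the detail text


query = "SELECT * FROM orders WHERE user_id = 42"

plan_before = plan(query)    # expected to describe a full scan of orders
conn.execute("CREATE INDEX idx_orders_user ON orders (user_id)")
plan_after = plan(query)     # expected to describe a search using idx_orders_user

print(plan_before)
print(plan_after)
```

The index is not free: every later insert or update on `orders` also has to maintain `idx_orders_user`, which is the write-speed and space cost mentioned above.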

Design drills

You can name the strategies — now make the calls a scaling system forces on you.

Whiteboard each one out loud for 5–10 minutes before you check what a strong answer covers; the gap between your sketch and the checklist is your study list.

Warm-up

Reads are crushing your single Postgres box, but writes are fine. What's the first move — and what new problem does it create?

Core

Writes now exceed one leader. Pick a shard key for an orders table and defend it against hotspots and cross-shard queries.

Core

You sharded with hash(key) % N. Now you must add a node. What breaks, and what should you have used?

Stretch

The leader dies. Walk through failover and the consistency risks.

Stretch

Pick SQL or NoSQL for (a) accounts + balances, (b) a 10M-row/day event firehose, (c) a product catalog with flexible attributes — and justify each.