Opening the black box
You've been using hash tables since Level 1 —
every Python dict/set, Java HashMap, C++ unordered_map is one. They
answer "what's the value for this key?" in O(1). This page is about how,
because interviewers ask, and because the failure modes only make sense once
you've seen the machinery.
The big idea: compute the location
An array finds element i instantly because the index is the location. A hash table extends that trick to arbitrary keys: run the key through a function that turns it into an index.
key "milk" → hash("milk") = 8,613,972,041 → % 8 buckets → index 5
buckets: 0 1 2 3 4 5 6 7
┌─────┬─────┬─────┬─────┬─────┬─────────┬─────┬─────┐
│ │bread│ │ │ │ milk:60 │ │eggs │
└─────┴─────┴─────┴─────┴─────┴─────────┴─────┴─────┘
A hash function converts any key into a number (the hash); modulo the array size gives a bucket index. Lookup repeats the same computation — no searching, just arithmetic, exactly like array indexing. That's the whole O(1).
A good hash function is: deterministic (same key → same hash, always), fast, and uniform (keys spread evenly across buckets — clumping ruins everything).
Collisions: the inevitable problem
Infinite possible keys, finite buckets — two keys will land in the same
bucket ("milk" and "yogurt" both → 5). This is a collision, and
handling it is most of hash-table design:
- Chaining (the classic): each bucket holds a small linked list of (key, value) pairs. Lookup hashes to the bucket, then walks the short chain comparing keys. Java's HashMap works this way (upgrading long chains to trees since Java 8).
- Open addressing: on collision, probe the next slot(s) until an empty one is found. Everything lives in the array itself — cache-friendly; Python's dict does a sophisticated version of this.
Load factor and resizing
How full the table is — entries / buckets — is the load factor. As it
rises, chains lengthen and probes multiply; too full, and O(1) decays toward
O(n). So tables resize: past a threshold (~0.75 for Java's HashMap),
allocate ~2× the buckets and re-hash every entry into its new position
(indices change because % size changed!). One expensive O(n) operation,
rare enough to stay amortized O(1) — the
dynamic-array argument again.
| Operation | Average | Worst case | Worst when |
|---|---|---|---|
| insert / lookup / delete | O(1) | O(n) | all keys collide into one bucket |
That worst case is real: adversaries who know your hash function can manufacture colliding keys and turn your API's dict into a linked list — a hash-flooding DoS attack. Languages randomize their string hashes per process precisely to prevent it. (Great security trivia that signals depth.)
Watch it run
Drawers, a placement rule, and collisions handled with little chains — insert a batch of keys, then watch a lookup jump straight to the right drawer. Change the drawer count and see chains shrink (more drawers) or pile up (fewer).
1/187 drawers, all empty. The rule: a key goes into drawer (key % 7) — the remainder after dividing by 7. Same key, same drawer, every time. That's all a hash function is.
Why keys must be immutable
The table files a key under the bucket computed from its contents. Mutate
the key afterwards and its hash changes — but it's still filed in the old
bucket. It becomes unfindable: lookups compute the new hash and walk the
wrong bucket. This is why Python rejects lists as dict keys
(tuples are fine) and why "don't mutate
something used as a HashMap key" is a real Java bug class. Java's contract:
equal objects must have equal hashes — override equals() and
hashCode() together, or your objects vanish inside HashMaps.
The interview superpower
The strategic fact: a hash table converts "have I seen X?" and "what goes with X?" from O(n) scans into O(1) asks. A huge share of "optimize this O(n²) solution" interview problems are solved by exactly one hash table:
# Two Sum — THE canonical example
# Brute force: try all pairs — O(n²)
# Hash table: one pass — O(n)
def two_sum(nums, target):
seen = {} # value → index
for i, x in enumerate(nums):
if target - x in seen: # have I seen my complement? O(1)
return [seen[target - x], i]
seen[x] = i
The reusable patterns, each O(n) where the naive way is O(n²):
- Complement lookup — two-sum and friends: store what you've seen, ask for what you need.
- Counting — frequency dicts (Level 1): anagrams, majority element, top-k (with a heap).
- Canonical form as key — group anagrams by sorted letters; the key is the insight.
- Seen-set — duplicates, cycle detection, visited nodes in graphs.
- Prefix-sum + hash — count subarrays summing to k: store running-sum frequencies (prefix sums from the arrays page meet complement lookup).
When an interviewer says "can you do better?", your first reflex should be: what would I need to look up in O(1) to avoid the inner loop?
Production perspective
- Hash tables are arguably the single most-used data structure in production software: every JSON object (Level 0), HTTP header map, database index of the hash variety, deduplication pass, and cache.
- Redis (Level 0's key-value store) is essentially a giant networked hash table — and Level 6's caching is hash-table thinking applied at datacenter scale, including consistent hashing, the distributed answer to "what happens when the number of buckets (servers!) changes."
- The resize pause is real in production too: latency-sensitive systems
pre-size their maps, just like
reservefor vectors.
Common mistakes
- Assuming order. Hash tables order by bucket, not by insertion or value. (Python ≥3.7 dicts do preserve insertion order as a language guarantee, but the moment you rely on order across languages, you're in bug country. If order is the point, sort or use the right structure.)
- Mutating keys — the unfindable-entry bug above.
dict[key]on missing keys → KeyError; use.get/getOrDefault(exceptions vs defaults — choose deliberately).- Using a hash table where an array does — keys 0…n-1? A plain array is faster, smaller, ordered. The dict is not a status symbol.
- Java: forgetting hashCode/equals symmetry when using your own class as a key — entries silently disappear.
Interview perspective
Practice
- Build it: implement a hash table with chaining —
put,get,delete— usinghash(key) % 8and lists of pairs. Then add resizing at load factor 0.75 and verify entries survive it. - The canon: two-sum (from memory, one pass); first non-repeating character in a string; group anagrams (what's your key?); contains-duplicate.
- Pattern transfer: count subarrays with sum exactly k using prefix-sums in a dict. Start by solving it O(n²), then ask the magic question: "what lookup kills the inner loop?"
- Reason: your service maps 10 million session-ids → user objects. Estimate the memory overhead vs a sorted array of pairs, and name what you gain and lose. (Numbers in the cheat sheet help.)
Next: Trees & BSTs — when data has hierarchy, and how sorted order becomes a shape.
Practice — climb the ladder
The decoder line for this whole rung: "have I seen this before?" / "group these by key" / "count in O(1)" — all of it is a hash map.
Climb in order — every rung assumes the one above it. Solve on LeetCode, then tick it here; progress is saved on this device.
Warm-up
seen-sets and countersThe seen-set in its purest form — set membership beats sorting here.
- Valid AnagramEasy
Frequency counting — the counter dict you will write a thousand times.
- Two SumEasy
Now fix your warm-up brute force: value → index map, one pass, check-then-store.
Core
maps as grouping and bookkeeping engines- Group AnagramsMedium
Canonical-form keys — choosing WHAT to hash is the whole problem.
- Top K Frequent ElementsMedium
Counter + bucket sort (or heap) — frequency problems end here.
Set lookups replacing sorting — only start counting at sequence starts.
- Subarray Sum Equals KMedium
Prefix sums in a map — the array/hash crossover interviewers love.
Stretch
design with O(1) contractsMap + array swap-delete — meeting THREE O(1) requirements at once.
Map inside a sliding window — preview of the window rung.