Hash Tables · PrepDeck

How dicts actually work — hash functions, buckets, collisions, resizing — and why O(1) lookup powers half of all interview solutions.

Why We Need Hash Tables & How to Think About It

Imagine you are running a student record room with 10,000 folders. If you keep them in an unsorted pile, finding the folder named "Nitesh" means scanning them one by one (linear search, $O(n)$). If you sort them alphabetically, you can binary-search in $O(\log n)$, but keeping the shelf sorted is slow because inserting a new student forces you to slide thousands of folders over.

What if there were a magic system where the name itself tells you exactly which drawer to open? For example, a rule: "Take the first letter of the name and map it to a drawer index: A=1, B=2, ... N=14." Now, to find "Nitesh", you walk straight to drawer 14.

This is a Hash Table. The rule that translates a key ("Nitesh") into a specific location is a Hash Function. It bypasses searching entirely, giving you average-case $O(1)$ access for arbitrary keys.

Opening the black box

You've been using hash tables since Level 1 — every Python dict/set, Java HashMap, C++ unordered_map is one. They answer "what's the value for this key?" in O(1). This page is about how, because interviewers ask, and because the failure modes only make sense once you've seen the machinery.

The big idea: compute the location

An array finds element i instantly because the index is the location. A hash table extends that trick to arbitrary keys: run the key through a function that turns it into an index.

key "milk" → hash("milk") = 8,613,972,045 → % 8 buckets → index 5

buckets:  0     1     2     3     4     5         6     7
        ┌─────┬─────┬─────┬─────┬─────┬─────────┬─────┬─────┐
        │     │bread│     │     │     │ milk:60 │     │eggs │
        └─────┴─────┴─────┴─────┴─────┴─────────┴─────┴─────┘

A hash function converts any key into a number (the hash); modulo the array size gives a bucket index. Lookup repeats the same computation — no searching, just arithmetic, exactly like array indexing. That's the whole O(1).

A good hash function is: deterministic (same key → same hash, always), fast, and uniform (keys spread evenly across buckets — clumping ruins everything).
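
The key-to-index step can be sketched in a few lines of Python. `toy_hash` and `bucket_index` are hypothetical names, and the polynomial hash is an illustrative choice, not what any real dict uses. Python's built-in `hash()` is avoided here because string hashes are randomized per process, which would make the printed indices change between runs.

```python
def toy_hash(key: str) -> int:
    """Hypothetical polynomial string hash: deterministic and cheap to compute."""
    h = 0
    for ch in key:
        # Multiply-and-add over the characters, kept within 32 bits.
        h = (h * 31 + ord(ch)) % (2**32)
    return h


def bucket_index(key: str, num_buckets: int) -> int:
    """Turn a key into a bucket index: hash it, then take the remainder."""
    return toy_hash(key) % num_buckets


if __name__ == "__main__":
    for key in ["milk", "bread", "eggs"]:
        print(key, toy_hash(key), bucket_index(key, 8))
```

Lookup would call `bucket_index` with the same key and the same bucket count, so it lands on the same slot without searching.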

Collisions: the inevitable problem

Infinite possible keys, finite buckets — two keys will land in the same bucket ("milk" and "yogurt" both → 5). This is a collision, and handling it is most of hash-table design:

Chaining (the classic): each bucket holds a small linked list of (key, value) pairs. Lookup hashes to the bucket, then walks the short chain comparing keys. Java's HashMap works this way (upgrading long chains to trees since Java 8).
Open addressing: on collision, probe the next slot(s) until an empty one is found. Everything lives in the array itself — cache-friendly; Python's dict does a sophisticated version of this.
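
Chaining is small enough to sketch directly. This is a minimal illustration, assuming a fixed bucket count and Python lists standing in for the linked lists; `ChainedTable` and its method names are hypothetical, and no real library implements its table this way line for line.

```python
class ChainedTable:
    """Toy hash table with separate chaining (fixed size, no resizing)."""

    def __init__(self, num_buckets: int = 8):
        # One chain (a list of (key, value) pairs) per bucket.
        self.buckets = [[] for _ in range(num_buckets)]

    def _chain(self, key):
        # Built-in hash() is used for brevity; any deterministic hash works.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value) -> None:
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:              # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))    # new key: append to the chain

    def get(self, key, default=None):
        for k, v in self._chain(key):  # walk the (hopefully short) chain
            if k == key:
                return v
        return default

    def delete(self, key) -> bool:
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False
```

With integer keys 5 and 13 and 8 buckets, both land in bucket 5 (in CPython a small int hashes to itself), so the two entries share one chain and `get` tells them apart by comparing keys.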

Load factor and resizing

How full the table is — entries / buckets — is the load factor. As it rises, chains lengthen and probes multiply; too full, and O(1) decays toward O(n). So tables resize: past a threshold (~0.75 for Java's HashMap), allocate ~2× the buckets and re-hash every entry into its new position (indices change because % size changed!). One expensive O(n) operation, rare enough to stay amortized O(1) — the dynamic-array argument again.
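
A rough sketch of resizing, assuming chaining, a 0.75 load-factor threshold, and doubling as described above. `ResizingTable` is a hypothetical class written for this example; integer keys are used because in CPython a small int hashes to itself, which makes the bucket positions easy to follow.

```python
class ResizingTable:
    """Toy chained hash table that doubles once load factor exceeds 0.75."""

    MAX_LOAD = 0.75

    def __init__(self, num_buckets: int = 8):
        self.buckets = [[] for _ in range(num_buckets)]
        self.size = 0

    def put(self, key, value) -> None:
        chain = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)
                return
        chain.append((key, value))
        self.size += 1
        # Load factor = entries / buckets; grow when it crosses the threshold.
        if self.size / len(self.buckets) > self.MAX_LOAD:
            self._resize(2 * len(self.buckets))

    def get(self, key, default=None):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        return default

    def _resize(self, new_count: int) -> None:
        old_buckets = self.buckets
        self.buckets = [[] for _ in range(new_count)]
        # Every entry is re-hashed: its index depends on the new bucket count.
        for chain in old_buckets:
            for key, value in chain:
                self.buckets[hash(key) % new_count].append((key, value))
```

Key 13 sits at index 13 % 8 = 5 before the resize and at 13 % 16 = 13 after it, which is why entries cannot simply be copied across slot for slot.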

Operation	Average	Worst case	Worst when
insert / lookup / delete	O(1)	O(n)	all keys collide into one bucket

That worst case is real: adversaries who know your hash function can manufacture colliding keys and turn your API's dict into a linked list, which is a hash-flooding DoS attack. Several languages (Python among them) randomize their string hashes per process precisely to prevent it; Java's HashMap instead limits the damage by converting long chains to trees. (Great security trivia that signals depth.)
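
The all-keys-collide case can be simulated without a real attack. The sketch below uses a hypothetical `CountingKey` class whose hash can be forced to a constant, then counts how many equality comparisons Python's dict performs while inserting. It only illustrates the degradation and is not how an attacker would actually find colliding strings.

```python
class CountingKey:
    """Hypothetical key that counts __eq__ calls and can force collisions."""

    eq_calls = 0  # class-wide counter, reset before each experiment

    def __init__(self, value: int, collide: bool):
        self.value = value
        self.collide = collide

    def __hash__(self):
        # collide=True sends every key to the same hash value.
        return 42 if self.collide else hash(self.value)

    def __eq__(self, other):
        CountingKey.eq_calls += 1
        return isinstance(other, CountingKey) and self.value == other.value


def comparisons_to_build(n: int, collide: bool) -> int:
    """Insert n distinct keys into a dict; return how many __eq__ calls it took."""
    CountingKey.eq_calls = 0
    table = {}
    for i in range(n):
        table[CountingKey(i, collide)] = i
    return CountingKey.eq_calls


if __name__ == "__main__":
    for n in (100, 200, 400):
        print(n, comparisons_to_build(n, False), comparisons_to_build(n, True))
```

With well-spread hashes the comparison count should stay near zero, while the colliding run should grow roughly quadratically with n, since each insert has to be compared against the keys already stored.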

Watch it run

Drawers, a placement rule, and collisions handled with little chains — insert a batch of keys, then watch a lookup jump straight to the right drawer. Change the drawer count and see chains shrink (more drawers) or pile up (fewer).

Hash table: insert & lookup (chaining) · time O(1) average · space O(n)

Starting state: 7 drawers, all empty. The rule: a key goes into drawer (key % 7), the remainder after dividing by 7. Same key, same drawer, every time. That's all a hash function is.

Why keys must be immutable

The table files a key under the bucket computed from its contents. Mutate the key afterwards and its hash changes — but it's still filed in the old bucket. It becomes unfindable: lookups compute the new hash and walk the wrong bucket. This is why Python rejects lists as dict keys (tuples are fine) and why "don't mutate something used as a HashMap key" is a real Java bug class. Java's contract: equal objects must have equal hashes — override equals() and hashCode() together, or your objects vanish inside HashMaps.
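
The unfindable-entry bug is easy to reproduce. A minimal sketch, assuming a hypothetical `Box` class whose hash and equality both depend on a mutable field (exactly the thing to avoid in real code):

```python
class Box:
    """Hypothetical mutable key: hash and equality follow a mutable field."""

    def __init__(self, value: int):
        self.value = value

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        return isinstance(other, Box) and self.value == other.value


box = Box(1)
table = {box: "student record"}
found_before = box in table      # filed and looked up under the same hash

box.value = 2                    # mutate the key while it sits in the table
found_after = box in table       # lookup now computes hash(2), wrong bucket
found_by_old = Box(1) in table   # right bucket, but the stored key no longer equals Box(1)
still_stored = len(table)        # the entry still occupies space in the table

try:
    {[1, 2]: "nope"}             # lists define no hash, so Python refuses up front
except TypeError as err:
    list_error = str(err)

print(found_before, found_after, found_by_old, still_stored)
```

The entry is never deleted; it just cannot be reached by any lookup, which is why the symptom looks like data silently vanishing.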

The interview superpower

The strategic fact: a hash table converts "have I seen X?" and "what goes with X?" from O(n) scans into O(1) asks. A huge share of "optimize this O(n²) solution" interview problems are solved by exactly one hash table:

Python

# Python — value → index
def two_sum(nums, target):
    seen = {}
    for i, x in enumerate(nums):
        complement = target - x
        if complement in seen:
            return [seen[complement], i]
        seen[x] = i
    return []

Java

// Java — value -> index
import java.util.HashMap;
import java.util.Map;

public int[] twoSum(int[] nums, int target) {
    Map<Integer, Integer> seen = new HashMap<>();
    for (int i = 0; i < nums.length; i++) {
        int complement = target - nums[i];
        if (seen.containsKey(complement)) {
            return new int[] { seen.get(complement), i };
        }
        seen.put(nums[i], i);
    }
    return new int[] {};
}

C++

// C++ — value -> index
#include <unordered_map>
#include <vector>

std::vector<int> twoSum(std::vector<int>& nums, int target) {
    std::unordered_map<int, int> seen;
    for (int i = 0; i < static_cast<int>(nums.size()); i++) {  // cast avoids a signed/unsigned comparison
        int complement = target - nums[i];
        if (seen.count(complement)) {
            return { seen[complement], i };
        }
        seen[nums[i]] = i;
    }
    return {};
}

The reusable patterns, each O(n) where the naive way is O(n²):

Complement lookup — two-sum and friends: store what you've seen, ask for what you need.
Counting — frequency dicts (Level 1): anagrams, majority element, top-k (with a heap).
Canonical form as key — group anagrams by sorted letters; the key is the insight.
Seen-set — duplicates, cycle detection, visited nodes in graphs.
Prefix-sum + hash — count subarrays summing to k: store running-sum frequencies (prefix sums from the arrays page meet complement lookup).
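
Three of these patterns fit in a few lines each. The sketch below is one possible Python rendering; the function names are illustrative and the inputs are made-up examples.

```python
from collections import Counter, defaultdict


def contains_duplicate(nums):
    """Seen-set: stop at the first value already encountered."""
    seen = set()
    for x in nums:
        if x in seen:
            return True
        seen.add(x)
    return False


def is_anagram(a: str, b: str) -> bool:
    """Counting: two strings are anagrams if their letter frequencies match."""
    return Counter(a) == Counter(b)


def group_anagrams(words):
    """Canonical form as key: sorted letters identify the anagram group."""
    groups = defaultdict(list)
    for word in words:
        groups["".join(sorted(word))].append(word)
    return list(groups.values())


if __name__ == "__main__":
    print(contains_duplicate([1, 2, 3, 1]))
    print(is_anagram("listen", "silent"))
    print(group_anagrams(["eat", "tea", "tan", "ate", "nat", "bat"]))
```

In each case the hash structure replaces an inner loop: one pass over the input, with one average-case O(1) lookup or update per element. (Sorting each word adds a per-word cost to `group_anagrams`.)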

When an interviewer says "can you do better?", your first reflex should be: what would I need to look up in O(1) to avoid the inner loop?

Production perspective

Hash tables are arguably the single most-used data structure in production software: every JSON object (Level 0), HTTP header map, database index of the hash variety, deduplication pass, and cache.
Redis (Level 0's key-value store) is essentially a giant networked hash table — and Level 6's caching is hash-table thinking applied at datacenter scale, including consistent hashing, the distributed answer to "what happens when the number of buckets (servers!) changes."
The resize pause is real in production too: latency-sensitive systems pre-size their maps, just like reserve for vectors.

Common mistakes

Assuming order. Hash tables order by bucket, not by insertion or value. (Python ≥3.7 dicts do preserve insertion order as a language guarantee, but the moment you rely on order across languages, you're in bug country. If order is the point, sort or use the right structure.)
Mutating keys — the unfindable-entry bug above.
dict[key] on missing keys → KeyError; use .get/getOrDefault (exceptions vs defaults — choose deliberately).
Using a hash table where an array does — keys 0…n-1? A plain array is faster, smaller, ordered. The dict is not a status symbol.
Java: forgetting hashCode/equals symmetry when using your own class as a key — entries silently disappear.
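
For the missing-key mistake, a short Python sketch of the three usual options: let it raise, supply a default at the call site, or use a container that creates defaults. The `prices` data is made up for illustration.

```python
from collections import defaultdict

prices = {"milk": 60}

# Option 1: indexing a missing key raises KeyError.
missing_raised = False
try:
    prices["eggs"]
except KeyError:
    missing_raised = True

# Option 2: .get returns a default and leaves the dict unchanged.
default_price = prices.get("eggs", 0)

# Option 3: defaultdict creates the default on first access (handy for counting).
counts = defaultdict(int)
for word in ["a", "b", "a"]:
    counts[word] += 1

print(missing_raised, default_price, dict(counts))
```

Note that a `defaultdict` inserts the key when it is read with `[]`, so it suits counters and groupings better than read-only lookups.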

Think it through

You saw Two Sum turn O(n²) into O(n) with one map. Now apply the same "what would I look up in O(1)?" reflex to a problem that also needs prefix sums — the array/hash crossover interviewers love. Think before revealing.

Think it through: Subarray Sum Equals K (Medium, LeetCode 560)

PROBLEM: Count the contiguous subarrays whose sum is exactly k. nums = [1,2,3], k = 3 → 2 (the subarrays [1,2] and [3]). Note: numbers can be negative.

1. Restate & edges: “What's the trap hidden in 'numbers can be negative'?”
2. Brute force first: “Dumbest correct solution and its cost?”
3. Find the pattern: “sum(nums[i..j]) = prefix[j+1] − prefix[i]. Set that equal to k and solve for prefix[i].”
4. Code the template: “Why seed the map with {0: 1} before the loop?”
5. Cost & edge check: “Cost, and why is this strictly more general than a window?”

Java

// Java — Subarray Sum Equals K
import java.util.HashMap;
import java.util.Map;

public int subarraySum(int[] nums, int k) {
    Map<Integer, Integer> count = new HashMap<>();
    count.put(0, 1); // seed for subarrays starting at index 0
    int running = 0;
    int total = 0;
    for (int x : nums) {
        running += x;
        total += count.getOrDefault(running - k, 0);
        count.put(running, count.getOrDefault(running, 0) + 1);
    }
    return total;
}

C++

// C++ — Subarray Sum Equals K
#include <unordered_map>
#include <vector>

int subarraySum(std::vector<int>& nums, int k) {
    std::unordered_map<int, int> count;
    count[0] = 1; // seed for subarrays starting at index 0
    int running = 0;
    int total = 0;
    for (int x : nums) {
        running += x;
        if (count.count(running - k)) {
            total += count[running - k];
        }
        count[running]++;
    }
    return total;
}
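
For parity with the Two Sum examples, the same template might look like this in Python. It is a sketch that mirrors the Java and C++ versions above; the function name follows Python naming conventions and is otherwise arbitrary.

```python
from collections import defaultdict


def subarray_sum(nums, k):
    """Count contiguous subarrays summing to k via prefix-sum frequencies."""
    count = defaultdict(int)
    count[0] = 1  # seed for subarrays starting at index 0
    running = 0
    total = 0
    for x in nums:
        running += x
        # A previous prefix equal to (running - k) closes a subarray summing to k.
        total += count.get(running - k, 0)  # .get avoids inserting missing keys
        count[running] += 1
    return total


if __name__ == "__main__":
    print(subarray_sum([1, 2, 3], 3))
```

The lookup happens before the current prefix is stored, so a prefix is never paired with itself (which would matter when k is 0).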

Check yourself

1. What makes hash-table lookup O(1) on average?

2. Resizing a hash table from 8 to 16 buckets re-hashes every entry. Why?

3. Why can't a Python list be used as a dict key?

4. Counting subarrays that sum to k where numbers can be NEGATIVE — why use prefix-sums-in-a-hash instead of a sliding window?

Interview perspective

Practice

Build it: implement a hash table with chaining — put, get, delete — using hash(key) % 8 and lists of pairs. Then add resizing at load factor 0.75 and verify entries survive it.
The canon: two-sum (from memory, one pass); first non-repeating character in a string; group anagrams (what's your key?); contains-duplicate.
Pattern transfer: count subarrays with sum exactly k using prefix-sums in a dict. Start by solving it O(n²), then ask the magic question: "what lookup kills the inner loop?"
Reason: your service maps 10 million session-ids → user objects. Estimate the memory overhead vs a sorted array of pairs, and name what you gain and lose. (Numbers in the cheat sheet help.)

Next: Trees & BSTs — when data has hierarchy, and how sorted order becomes a shape.

Practice — climb the ladder

The decoder line for this whole rung: "have I seen this before?" / "group these by key" / "count in O(1)" — all of it is a hash map.

Practice ladder: Hash Tables

Climb in order: every rung assumes the one above it.

Warm-up

seen-sets and counters

Contains Duplicate (Easy)
The seen-set in its purest form: set membership beats sorting here.
Valid Anagram (Easy)
Frequency counting: the counter dict you will write a thousand times.
Two Sum (Easy)
Now fix your warm-up brute force: value → index map, one pass, check-then-store.

Core

maps as grouping and bookkeeping engines

Group Anagrams (Medium)
Canonical-form keys: choosing WHAT to hash is the whole problem.
Top K Frequent Elements (Medium)
Counter + bucket sort (or heap): frequency problems end here.
Longest Consecutive Sequence (Medium)
Set lookups replacing sorting: only start counting at sequence starts.
Subarray Sum Equals K (Medium)
Prefix sums in a map: the array/hash crossover interviewers love.

Stretch

design with O(1) contracts

Insert Delete GetRandom O(1) (Medium)
Map + array swap-delete: meeting THREE O(1) requirements at once.
Longest Substring Without Repeating Characters (Medium)
Map inside a sliding window: preview of the window rung.