The problem: find a needle in a haystack
Pattern matching — does pattern p (length m) occur in text t
(length n)? — is "l".indexOf, in, Ctrl-F, grep, plagiarism
detection and DNA search. The naive approach tries every starting
position and compares:
def naive_find(t, p):
for i in range(len(t) - len(p) + 1): # every start position
if t[i:i+len(p)] == p: # compare m characters
return i
return -1
That's O(n·m) worst case — fine for everyday strings, painful at
scale, and built on pure waste: after matching "aaaab" almost to the
end and failing, it restarts from scratch one position over, forgetting
everything it just learned. Both classic algorithms below are schemes
for never forgetting.
Calibration up front: these appear in interviews occasionally and at senior/competitive levels — far less often than two pointers or DP. The realistic goals: rolling hashes you can use (they generalize beyond strings), KMP's idea you can explain (deriving it live is rare), and the judgment to say "in production I'd call the library."
Watch the obvious way struggle
First, feel the problem: the naive method slides the pattern one step per mismatch and re-compares letters it already confirmed. Count the comparisons on this repetitive input — then come back after the KMP section and run the same search with the toggle flipped.
1/30Find "ABABD" inside "ABABABCABABABD", the obvious way: line the pattern up at every position and compare letter by letter.
Rolling hash & Rabin-Karp — compare numbers, not strings
Comparing two m-character windows costs O(m). Comparing two numbers costs O(1). So: hash the pattern once, hash each text window, and only compare characters when the hashes match.
The trick that makes it fast is the rolling hash: treat the window as a number in base B, and when the window slides one position, update the hash in O(1) instead of recomputing —
window "abc" → hash = a·B² + b·B + c (everything mod a large prime M)
slide to "bcd":
subtract a·B² (drop the old left char)
multiply by B (shift everything up)
add d (bring in the new right char)
def rabin_karp(t, p):
n, m = len(t), len(p)
B, M = 256, (1 << 61) - 1 # base, big prime modulus
if m > n: return -1
ph = th = 0
for i in range(m): # initial hashes, O(m)
ph = (ph * B + ord(p[i])) % M
th = (th * B + ord(t[i])) % M
power = pow(B, m - 1, M) # B^(m-1), for removals
for i in range(n - m + 1):
if ph == th and t[i:i+m] == p: # verify on hash hit!
return i
if i + m < n: # roll the window
th = ((th - ord(t[i]) * power) * B + ord(t[i+m])) % M
return -1
Average O(n + m); the character-comparison on hash hits guards against collisions (two different windows, same hash — rare with a big prime, but possible; skipping the verify step is the classic bug).
Why this one earns its place even if you never search text:
- Many patterns at once — hash all patterns into a set; one scan of the text checks every window against all of them (the naive approach multiplies by the number of patterns). This is how plagiarism detectors and antivirus scanners work.
- The rolling idea generalizes — it's a sliding window over any incremental computation: rolling checksums in rsync (sync only changed file blocks), chunking in backup systems and git's packfiles, duplicate-detection over data streams. Saying "rolling hash" in a system-design room is often more valuable than in a coding round.
KMP — never re-read the text
Knuth-Morris-Pratt achieves guaranteed O(n + m) with zero hashing, via one observation: when a match fails partway, the characters you already matched are known — so you know exactly how far the pattern could shift without skipping a possible match.
The precomputation: for each prefix of the pattern, the failure
function f[i] = the length of the longest proper prefix of
p[:i+1] that is also its suffix:
pattern: a b a b c
f: 0 0 1 2 0
f[3] = 2 because "abab"’s longest prefix-that-is-also-suffix is "ab" (length 2)
Why that quantity: if you've matched "abab" and the next character
fails, the text you just consumed ends with "abab" — and since its
suffix "ab" equals the pattern's prefix "ab", the pattern can jump
to state 2 (as if "ab" were already matched) without moving
backward in the text. The text pointer only ever advances → O(n)
scanning; building f is O(m) by the same logic applied to the pattern
against itself.
def kmp_find(t, p):
# build failure table
f, k = [0] * len(p), 0
for i in range(1, len(p)):
while k and p[i] != p[k]:
k = f[k - 1] # fall back through shorter borders
if p[i] == p[k]:
k += 1
f[i] = k
# scan the text — the SAME loop shape
k = 0
for i, ch in enumerate(t):
while k and ch != p[k]:
k = f[k - 1]
if ch == p[k]:
k += 1
if k == len(p):
return i - k + 1 # full match ends here
return -1
The mental model that survives interviews: KMP is a tiny state
machine — the state is "how many pattern characters currently
match," each text character either advances the state or follows
fallback arrows (the failure function), and the text is read exactly
once, never backward. That sentence, plus the "abab" example, is the
expected depth; reciting the table-building loop from memory is not.
Two names to recognize, not implement: the Z-algorithm (computes, for each position, the length of the longest substring starting there that matches the string's own prefix — an alternative O(n) toolkit with the same applications), and Aho-Corasick (KMP generalized to many patterns simultaneously by building the failure links over a trie — the industrial-strength multi-pattern matcher).
Watch KMP never look back
Same search, two phases: first the pattern studies itself (the "survives" row — how many letters of progress outlive a mismatch), then the scan where the text pointer only ever moves forward. Compare the final comparison count against the naive run above.
1/25KMP prep — study the pattern alone. For each position, write down: "if I fail here, how much of the pattern's own beginning is still alive in what I matched?" (Computers hate redoing work; this table is the cure.)
Where the failure function sneaks into interviews
The table itself answers questions, no searching involved — these are the actual KMP interview appearances:
- Shortest Palindrome (LC 214) — longest palindromic prefix via the
failure function of
s + "#" + reverse(s). - Repeated Substring Pattern (LC 459) —
sis a repetition iffn % (n − f[n−1]) == 0withf[n−1] > 0; the failure function exposes the period of the string. - Longest happy prefix (LC 1392) — literally "compute f[n−1]."
Common mistakes
- Rabin-Karp without the verification compare — a hash collision returns a phantom match; with adversarial input this is exploitable (hash flooding's cousin). Verify on hit, or use double hashing.
- Modulus/overflow sloppiness — in Java/C++ the rolling hash must
consciously use
longand careful mod arithmetic; Python's big ints hide the issue and then the same code fails in another language. - Hand-rolling in production —
str.find,indexOf,std::searchare optimized (often SIMD-accelerated) and correct. These algorithms are for understanding and for the cases libraries don't cover (streams, many patterns, your own equality notion). - Reaching for KMP when n is small — naive matching's O(n·m) with tiny constants beats clever algorithms below thousands of characters. State the crossover, then write the simple thing.
- Confusing substring problems with subsequence problems — "substring" = contiguous → this page + sliding window; "subsequence" = gaps allowed → DP. Misfiling wastes ten minutes.
Interview perspective
Practice
- Build Rabin-Karp from this page; test on a text where naive
matching is visibly slow (
"a"*10**6searching"a"*10**3 + "b"), and time both. - Failure-function reps: compute f by hand for
"aabaaab"and"abcabcab", then verify with the code. (Hand-computing two tables teaches more than reading ten explanations.) - The canon: Implement strStr (LC 28 — any method, then KMP);
Repeated Substring Pattern (LC 459) via the period trick — and prove
the trick to yourself on
"abab"and"aba". - Generalize the roll: use a rolling hash to find the longest substring that appears at least twice (binary search the length + hash-set of window hashes — a beautiful binary-search-on-answer combo).
That completes the Level 3 pattern arc. On to applying everything: the problem bank and Blind 75.
Practice — climb the ladder
String problems split three ways: scanning (windows/counters), matching (KMP/rolling hash), and structure (palindromes). Know which one you are in.
Climb in order — every rung assumes the one above it. Solve on LeetCode, then tick it here; progress is saved on this device.
Warm-up
scanning and normalizingSubstring search naively first — the baseline KMP will later destroy.
Periodicity reasoning — the (s+s) trick and why it works.
Core
palindromes + parsingExpand-around-center — 2n−1 centers, odd and even handled cleanly.
- String to Integer (atoi)Medium
State-machine parsing — whitespace, sign, digits, overflow clamps; edge-case discipline.
- Zigzag ConversionMedium
Index simulation — direction flips as row bookkeeping.
Stretch
where KMP and 2D DP earn their keepKMP failure function applied to s + # + reverse(s) — prefix machinery for real.
Counting matches in a 2D table — LCS thinking, counting variant.