String Algorithms

Finding patterns in text fast: rolling hashes and Rabin-Karp, KMP's failure function demystified, and when to just call the library.

stringskmprabin-karphashingpattern-matching

The problem: find a needle in a haystack

Pattern matching — does pattern p (length m) occur in text t (length n)? — is "l".indexOf, in, Ctrl-F, grep, plagiarism detection and DNA search. The naive approach tries every starting position and compares:

Python
def naive_find(t, p):
    for i in range(len(t) - len(p) + 1):     # every start position
        if t[i:i+len(p)] == p:               # compare m characters
            return i
    return -1

That's O(n·m) worst case — fine for everyday strings, painful at scale, and built on pure waste: after matching "aaaab" almost to the end and failing, it restarts from scratch one position over, forgetting everything it just learned. Both classic algorithms below are schemes for never forgetting.

Calibration up front: these appear in interviews occasionally and at senior/competitive levels — far less often than two pointers or DP. The realistic goals: rolling hashes you can use (they generalize beyond strings), KMP's idea you can explain (deriving it live is rare), and the judgment to say "in production I'd call the library."

Watch the obvious way struggle

First, feel the problem: the naive method slides the pattern one step per mismatch and re-compares letters it already confirmed. Count the comparisons on this repetitive input — then come back after the KMP section and run the same search with the toggle flipped.

Substring search — the obvious waytime O(n · m)space O(1)
012345678910111213
text
A
B
A
B
A
B
C
A
B
A
B
A
B
D
pattern
A
B
A
B
D

1/30Find "ABABD" inside "ABABABCABABABD", the obvious way: line the pattern up at every position and compare letter by letter.

method

Rolling hash & Rabin-Karp — compare numbers, not strings

Comparing two m-character windows costs O(m). Comparing two numbers costs O(1). So: hash the pattern once, hash each text window, and only compare characters when the hashes match.

The trick that makes it fast is the rolling hash: treat the window as a number in base B, and when the window slides one position, update the hash in O(1) instead of recomputing —

window "abc" → hash = a·B² + b·B + c          (everything mod a large prime M)

slide to "bcd":
   subtract a·B²       (drop the old left char)
   multiply by B       (shift everything up)
   add d               (bring in the new right char)
Python
def rabin_karp(t, p):
    n, m = len(t), len(p)
    B, M = 256, (1 << 61) - 1                  # base, big prime modulus
    if m > n: return -1

    ph = th = 0
    for i in range(m):                          # initial hashes, O(m)
        ph = (ph * B + ord(p[i])) % M
        th = (th * B + ord(t[i])) % M
    power = pow(B, m - 1, M)                    # B^(m-1), for removals

    for i in range(n - m + 1):
        if ph == th and t[i:i+m] == p:          # verify on hash hit!
            return i
        if i + m < n:                           # roll the window
            th = ((th - ord(t[i]) * power) * B + ord(t[i+m])) % M
    return -1

Average O(n + m); the character-comparison on hash hits guards against collisions (two different windows, same hash — rare with a big prime, but possible; skipping the verify step is the classic bug).

Why this one earns its place even if you never search text:

  • Many patterns at once — hash all patterns into a set; one scan of the text checks every window against all of them (the naive approach multiplies by the number of patterns). This is how plagiarism detectors and antivirus scanners work.
  • The rolling idea generalizes — it's a sliding window over any incremental computation: rolling checksums in rsync (sync only changed file blocks), chunking in backup systems and git's packfiles, duplicate-detection over data streams. Saying "rolling hash" in a system-design room is often more valuable than in a coding round.

KMP — never re-read the text

Knuth-Morris-Pratt achieves guaranteed O(n + m) with zero hashing, via one observation: when a match fails partway, the characters you already matched are known — so you know exactly how far the pattern could shift without skipping a possible match.

The precomputation: for each prefix of the pattern, the failure function f[i] = the length of the longest proper prefix of p[:i+1] that is also its suffix:

pattern:  a  b  a  b  c
f:        0  0  1  2  0

f[3] = 2 because "abab"’s longest prefix-that-is-also-suffix is "ab" (length 2)

Why that quantity: if you've matched "abab" and the next character fails, the text you just consumed ends with "abab" — and since its suffix "ab" equals the pattern's prefix "ab", the pattern can jump to state 2 (as if "ab" were already matched) without moving backward in the text. The text pointer only ever advances → O(n) scanning; building f is O(m) by the same logic applied to the pattern against itself.

Python
def kmp_find(t, p):
    # build failure table
    f, k = [0] * len(p), 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = f[k - 1]                  # fall back through shorter borders
        if p[i] == p[k]:
            k += 1
        f[i] = k

    # scan the text — the SAME loop shape
    k = 0
    for i, ch in enumerate(t):
        while k and ch != p[k]:
            k = f[k - 1]
        if ch == p[k]:
            k += 1
        if k == len(p):
            return i - k + 1              # full match ends here
    return -1

The mental model that survives interviews: KMP is a tiny state machine — the state is "how many pattern characters currently match," each text character either advances the state or follows fallback arrows (the failure function), and the text is read exactly once, never backward. That sentence, plus the "abab" example, is the expected depth; reciting the table-building loop from memory is not.

Two names to recognize, not implement: the Z-algorithm (computes, for each position, the length of the longest substring starting there that matches the string's own prefix — an alternative O(n) toolkit with the same applications), and Aho-Corasick (KMP generalized to many patterns simultaneously by building the failure links over a trie — the industrial-strength multi-pattern matcher).

Watch KMP never look back

Same search, two phases: first the pattern studies itself (the "survives" row — how many letters of progress outlive a mismatch), then the scan where the text pointer only ever moves forward. Compare the final comparison count against the naive run above.

Substring search — KMPtime O(n + m)space O(m)
012345678910111213
text
A
B
A
B
A
B
C
A
B
A
B
A
B
D
pattern
A
B
A
B
D
survives
0

1/25KMP prep — study the pattern alone. For each position, write down: "if I fail here, how much of the pattern's own beginning is still alive in what I matched?" (Computers hate redoing work; this table is the cure.)

method

Where the failure function sneaks into interviews

The table itself answers questions, no searching involved — these are the actual KMP interview appearances:

  • Shortest Palindrome (LC 214) — longest palindromic prefix via the failure function of s + "#" + reverse(s).
  • Repeated Substring Pattern (LC 459) — s is a repetition iff n % (n − f[n−1]) == 0 with f[n−1] > 0; the failure function exposes the period of the string.
  • Longest happy prefix (LC 1392) — literally "compute f[n−1]."

Common mistakes

  • Rabin-Karp without the verification compare — a hash collision returns a phantom match; with adversarial input this is exploitable (hash flooding's cousin). Verify on hit, or use double hashing.
  • Modulus/overflow sloppiness — in Java/C++ the rolling hash must consciously use long and careful mod arithmetic; Python's big ints hide the issue and then the same code fails in another language.
  • Hand-rolling in productionstr.find, indexOf, std::search are optimized (often SIMD-accelerated) and correct. These algorithms are for understanding and for the cases libraries don't cover (streams, many patterns, your own equality notion).
  • Reaching for KMP when n is small — naive matching's O(n·m) with tiny constants beats clever algorithms below thousands of characters. State the crossover, then write the simple thing.
  • Confusing substring problems with subsequence problems — "substring" = contiguous → this page + sliding window; "subsequence" = gaps allowed → DP. Misfiling wastes ten minutes.

Interview perspective

Practice

  1. Build Rabin-Karp from this page; test on a text where naive matching is visibly slow ("a"*10**6 searching "a"*10**3 + "b"), and time both.
  2. Failure-function reps: compute f by hand for "aabaaab" and "abcabcab", then verify with the code. (Hand-computing two tables teaches more than reading ten explanations.)
  3. The canon: Implement strStr (LC 28 — any method, then KMP); Repeated Substring Pattern (LC 459) via the period trick — and prove the trick to yourself on "abab" and "aba".
  4. Generalize the roll: use a rolling hash to find the longest substring that appears at least twice (binary search the length + hash-set of window hashes — a beautiful binary-search-on-answer combo).

That completes the Level 3 pattern arc. On to applying everything: the problem bank and Blind 75.

Practice — climb the ladder

String problems split three ways: scanning (windows/counters), matching (KMP/rolling hash), and structure (palindromes). Know which one you are in.

Practice ladder: String Algorithms0/7 solved

Climb in order — every rung assumes the one above it. Solve on LeetCode, then tick it here; progress is saved on this device.

Warm-up

scanning and normalizing
  1. Substring search naively first — the baseline KMP will later destroy.

  2. Periodicity reasoning — the (s+s) trick and why it works.

Core

palindromes + parsing
  1. Expand-around-center — 2n−1 centers, odd and even handled cleanly.

  2. State-machine parsing — whitespace, sign, digits, overflow clamps; edge-case discipline.

  3. Index simulation — direction flips as row bookkeeping.

Stretch

where KMP and 2D DP earn their keep
  1. KMP failure function applied to s + # + reverse(s) — prefix machinery for real.

  2. Counting matches in a 2D table — LCS thinking, counting variant.