String Algorithms · PrepDeck

Finding patterns in text fast: rolling hashes and Rabin-Karp, KMP's failure function demystified, and when to just call the library.

The problem: find a needle in a haystack

Pattern matching (does pattern p of length m occur in text t of length n?) is the problem behind indexOf, Python's in, Ctrl-F, grep, plagiarism detection, and DNA search. The naive approach tries every starting position and compares:

Python

def naive_find(t, p):
    for i in range(len(t) - len(p) + 1):     # every start position
        if t[i:i+len(p)] == p:               # compare m characters
            return i
    return -1

That's O(n·m) worst case — fine for everyday strings, painful at scale, and built on pure waste: after matching "aaaab" almost to the end and failing, it restarts from scratch one position over, forgetting everything it just learned. Both classic algorithms below are schemes for never forgetting.

💡 Visualizing the algorithms:

Rabin-Karp (Bar-codes): Imagine inspecting thousands of boxes looking for a specific item. Opening every box to check its full contents (character comparison) is slow. Instead, you scan a bar-code on the outside (hash). If the bar-code doesn't match, you skip the box immediately. If it does match, you open the box to double-check that it's the correct item (guarding against bar-code collisions).

Knuth-Morris-Pratt (Never Forgetting): Imagine searching for the word underground in a text. You've matched undergro, but the next character is a mismatch. The naive method would move the text pointer back to the second matched letter (the n) and start over; KMP instead looks at what it has already read. The payoff shows on a repetitive pattern: if the pattern is ababac and you match ababa but fail on c, the end of what you matched (aba) equals the start of the pattern (aba). You slide the pattern over so those letters line up and resume comparing at the pattern's fourth character (index 3), never re-reading a single letter of the text.

Complexity Comparison

Algorithm	Time (Average)	Time (Worst)	Space	Best Use Case
Naive / Brute Force	$O(N \cdot M)$	$O(N \cdot M)$	$O(1)$	Simple cases, small strings
Rabin-Karp	$O(N + M)$	$O(N \cdot M)$	$O(1)$	Multi-pattern search, streaming data
KMP (Knuth-Morris-Pratt)	$O(N + M)$	$O(N + M)$	$O(M)$	Single pattern search with repetitiveness

Calibration up front: these appear in interviews occasionally, mostly at senior or competitive levels, and far less often than two pointers or DP. The realistic goals: rolling hashes you can use (they generalize beyond strings), KMP's idea you can explain (deriving it live is rare), and the judgment to say "in production I'd call the library."

Watch the obvious way struggle

First, feel the problem: the naive method slides the pattern one step per mismatch and re-compares letters it already confirmed. Count the comparisons on this repetitive input — then come back after the KMP section and run the same search with the toggle flipped.

[Interactive demo: Substring search, the obvious way. Time O(n · m), space O(1). It finds "ABABD" inside "ABABABCABABABD" by lining the pattern up at every position and comparing letter by letter. Inputs: method toggle, text (up to 16 characters), pattern (up to 8 characters).]
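To put a number on the waste, here is a minimal sketch that runs the naive search from above and counts single-character comparisons. The function name naive_find_counting is an illustration only, not something from this page.

```python
def naive_find_counting(t, p):
    """Naive search that also counts single-character comparisons."""
    comparisons = 0
    for i in range(len(t) - len(p) + 1):      # every start position
        for j in range(len(p)):               # compare left to right
            comparisons += 1
            if t[i + j] != p[j]:
                break                         # mismatch: give up on this start
        else:
            return i, comparisons             # inner loop never broke: full match
    return -1, comparisons


index, comparisons = naive_find_counting("ABABABCABABABD", "ABABD")
print(index, comparisons)                     # 9 28
```

Three of the failed alignments (starts 0, 2, and 7) each burn five comparisons on letters the search had effectively already seen; that repeated work is what the algorithms below remove.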

Rolling hash & Rabin-Karp — compare numbers, not strings

Comparing two m-character windows costs O(m). Comparing two numbers costs O(1). So: hash the pattern once, hash each text window, and only compare characters when the hashes match.

The trick that makes it fast is the rolling hash: treat the window as a number in base B, and when the window slides one position, update the hash in O(1) instead of recomputing —

window "abc" → hash = a·B² + b·B + c          (everything mod a large prime M)

slide to "bcd":
   subtract a·B²       (drop the old left char)
   multiply by B       (shift everything up)
   add d               (bring in the new right char)
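
The same arithmetic in everyday terms: in base 10 with digits as values, sliding the window 123 to 234 is (123 − 1·100)·10 + 4 = 234. A small sketch to check the three steps against a from-scratch hash; the helper names window_hash and roll, and the modulus chosen here, are assumptions for illustration.

```python
def window_hash(s, B, M):
    """Polynomial hash of s in base B, modulo M, computed from scratch."""
    h = 0
    for ch in s:
        h = (h * B + ord(ch)) % M
    return h


def roll(h, old, new, power, B, M):
    """Slide the window one step; power must equal B**(m-1) % M."""
    h = (h - ord(old) * power) % M   # drop the old left char
    h = (h * B) % M                  # shift everything up
    return (h + ord(new)) % M        # bring in the new right char


B, M = 256, 1_000_000_007
text, m = "abcdef", 3
power = pow(B, m - 1, M)
h = window_hash(text[:m], B, M)
for i in range(1, len(text) - m + 1):
    h = roll(h, text[i - 1], text[i + m - 1], power, B, M)
    assert h == window_hash(text[i:i + m], B, M)   # O(1) update agrees with O(m) recompute
```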

Python

def rabin_karp(t, p):
    n, m = len(t), len(p)
    B, M = 256, (1 << 61) - 1                  # base, big prime modulus
    if m == 0: return 0                        # empty pattern matches at index 0
    if m > n: return -1

    ph = th = 0
    for i in range(m):                          # initial hashes, O(m)
        ph = (ph * B + ord(p[i])) % M
        th = (th * B + ord(t[i])) % M
    power = pow(B, m - 1, M)                    # B^(m-1), for removals

    for i in range(n - m + 1):
        if ph == th and t[i:i+m] == p:          # verify on hash hit!
            return i
        if i + m < n:                           # roll the window
            th = ((th - ord(t[i]) * power) * B + ord(t[i+m])) % M
    return -1

Java

// Java — Rabin-Karp
public int rabinKarp(String t, String p) {
    int n = t.length(), m = p.length();
    if (m > n) return -1;
    if (m == 0) return 0;

    long B = 256;
    long M = 1000000007L; // mod to prevent overflow
    long ph = 0, th = 0;
    long power = 1;

    // Compute B^(m-1) % M
    for (int i = 0; i < m - 1; i++) {
        power = (power * B) % M;
    }

    // Compute initial hashes
    for (int i = 0; i < m; i++) {
        ph = (ph * B + p.charAt(i)) % M;
        th = (th * B + t.charAt(i)) % M;
    }

    for (int i = 0; i <= n - m; i++) {
        if (ph == th) {
            // Double-check to handle collisions
            if (t.substring(i, i + m).equals(p)) {
                return i;
            }
        }
        if (i + m < n) {
            // Roll the window: subtract left char, shift left, add right char
            long val = (th - (t.charAt(i) * power) % M) % M;
            if (val < 0) val += M;
            th = (val * B + t.charAt(i + m)) % M;
        }
    }
    return -1;
}

C++

// C++ — Rabin-Karp
#include <string>
#include <vector>

int rabinKarp(const std::string& t, const std::string& p) {
    int n = t.length(), m = p.length();
    if (m > n) return -1;
    if (m == 0) return 0;

    long long B = 256;
    long long M = 1000000007LL;
    long long ph = 0, th = 0;
    long long power = 1;

    // Compute B^(m-1) % M
    for (int i = 0; i < m - 1; i++) {
        power = (power * B) % M;
    }

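    // Compute initial hashes. Cast through unsigned char so bytes >= 128
    // are not treated as negative on platforms where char is signed.
    for (int i = 0; i < m; i++) {
        ph = (ph * B + static_cast<unsigned char>(p[i])) % M;
        th = (th * B + static_cast<unsigned char>(t[i])) % M;
    }

    for (int i = 0; i <= n - m; i++) {
        if (ph == th) {
            // Double-check to handle collisions
            if (t.compare(i, m, p) == 0) {
                return i;
            }
        }
        if (i + m < n) {
            // Roll the window: subtract left char, shift left, add right char
            long long val = (th - (static_cast<unsigned char>(t[i]) * power) % M) % M;
            if (val < 0) val += M;
            th = (val * B + static_cast<unsigned char>(t[i + m])) % M;
        }
    }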
    }
    return -1;
}

Average O(n + m); the character-comparison on hash hits guards against collisions (two different windows, same hash — rare with a big prime, but possible; skipping the verify step is the classic bug).
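
To see why the verify step matters, a sketch that shrinks the modulus until collisions are unavoidable. The tiny modulus 101 and the names poly_hash and find_collision are assumptions for the demonstration; real code would keep the large prime.

```python
from itertools import product
from string import ascii_lowercase


def poly_hash(s, B=256, M=101):
    """Same polynomial hash as above, but with a deliberately tiny modulus."""
    h = 0
    for ch in s:
        h = (h * B + ord(ch)) % M
    return h


def find_collision(length=2):
    """Return two different strings of the given length with equal hashes, or None."""
    seen = {}                                   # hash -> first string that produced it
    for letters in product(ascii_lowercase, repeat=length):
        s = "".join(letters)
        h = poly_hash(s)
        if h in seen:
            return seen[h], s
        seen[h] = s
    return None


a, b = find_collision()
print(a, b, poly_hash(a), poly_hash(b))         # two different strings, one hash
```

With 676 two-letter strings and only 101 possible hash values, a collision is guaranteed. A large modulus makes this rare rather than impossible, which is why the character compare stays in.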

Why this one earns its place even if you never search text:

Many patterns at once: hash all patterns of a given length into a set; one scan of the text then checks every window against all of them, where the naive approach multiplies the work by the number of patterns (patterns of different lengths need one set and one rolling window per length). Plagiarism detectors and signature scanners build on this idea.
The rolling idea generalizes: it's a sliding window over any incremental computation, such as rolling checksums in rsync (sync only changed file blocks), chunking in backup systems, delta compression in git's packfiles, and duplicate detection over data streams. Saying "rolling hash" in a system-design room is often more valuable than in a coding round.
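
A minimal sketch of the many-patterns idea, under the simplifying assumption that every pattern has the same length (otherwise you would keep one hash set and one rolling window per length). The function name find_any is hypothetical.

```python
def find_any(t, patterns):
    """Return (index, pattern) for the first window of t equal to any pattern,
    or (-1, None). Assumes all patterns share one non-zero length."""
    patterns = set(patterns)
    if not patterns:
        return -1, None
    m = len(next(iter(patterns)))
    if m == 0 or any(len(p) != m for p in patterns):
        raise ValueError("this sketch needs non-empty patterns of equal length")
    n = len(t)
    if m > n:
        return -1, None

    B, M = 256, (1 << 61) - 1

    def full_hash(s):
        h = 0
        for ch in s:
            h = (h * B + ord(ch)) % M
        return h

    targets = {full_hash(p) for p in patterns}   # one hash per pattern
    th = full_hash(t[:m])
    power = pow(B, m - 1, M)

    for i in range(n - m + 1):
        if th in targets:                        # O(1) check against ALL patterns
            window = t[i:i + m]
            if window in patterns:               # verify on hash hit
                return i, window
        if i + m < n:                            # roll the window
            th = ((th - ord(t[i]) * power) * B + ord(t[i + m])) % M
    return -1, None


print(find_any("the quick brown fox", ["fox", "own", "xyz"]))   # (12, 'own')
```

The scan cost stays at one pass over the text however many patterns are in the set; only the up-front hashing grows with the pattern count.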

KMP — never re-read the text

Knuth-Morris-Pratt achieves guaranteed O(n + m) with zero hashing, via one observation: when a match fails partway, the characters you already matched are known — so you know exactly how far the pattern could shift without skipping a possible match.

The precomputation: for each prefix of the pattern, the failure function f[i] = the length of the longest proper prefix of p[:i+1] that is also its suffix:

pattern:  a  b  a  b  c
f:        0  0  1  2  0

f[3] = 2 because the longest proper prefix of "abab" that is also its suffix is "ab" (length 2)

Why that quantity: if you've matched "abab" and the next character fails, the text you just consumed ends with "abab" — and since its suffix "ab" equals the pattern's prefix "ab", the pattern can jump to state 2 (as if "ab" were already matched) without moving backward in the text. The text pointer only ever advances → O(n) scanning; building f is O(m) by the same logic applied to the pattern against itself.

Python

def kmp_find(t, p):
    if not p:
        return 0                          # empty pattern matches at index 0
    # build failure table
    f, k = [0] * len(p), 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = f[k - 1]                  # fall back through shorter borders
        if p[i] == p[k]:
            k += 1
        f[i] = k

    # scan the text — the SAME loop shape
    k = 0
    for i, ch in enumerate(t):
        while k and ch != p[k]:
            k = f[k - 1]
        if ch == p[k]:
            k += 1
        if k == len(p):
            return i - k + 1              # full match ends here
    return -1

Java

// Java — KMP
public int kmpFind(String t, String p) {
    int m = p.length();
    if (m == 0) return 0;
    
    // Build failure table
    int[] f = new int[m];
    int k = 0;
    for (int i = 1; i < m; i++) {
        while (k > 0 && p.charAt(i) != p.charAt(k)) {
            k = f[k - 1];
        }
        if (p.charAt(i) == p.charAt(k)) {
            k++;
        }
        f[i] = k;
    }

    // Scan the text
    k = 0;
    for (int i = 0; i < t.length(); i++) {
        while (k > 0 && t.charAt(i) != p.charAt(k)) {
            k = f[k - 1];
        }
        if (t.charAt(i) == p.charAt(k)) {
            k++;
        }
        if (k == m) {
            return i - m + 1; // full match found
        }
    }
    return -1;
}

C++

// C++ — KMP
#include <string>
#include <vector>

int kmpFind(const std::string& t, const std::string& p) {
    int m = p.length();
    if (m == 0) return 0;

    // Build failure table
    std::vector<int> f(m, 0);
    int k = 0;
    for (int i = 1; i < m; i++) {
        while (k > 0 && p[i] != p[k]) {
            k = f[k - 1];
        }
        if (p[i] == p[k]) {
            k++;
        }
        f[i] = k;
    }

    // Scan the text
    k = 0;
    for (int i = 0; i < static_cast<int>(t.length()); i++) {
        while (k > 0 && t[i] != p[k]) {
            k = f[k - 1];
        }
        if (t[i] == p[k]) {
            k++;
        }
        if (k == m) {
            return i - m + 1; // full match found
        }
    }
    return -1;
}

The mental model that survives interviews: KMP is a tiny state machine — the state is "how many pattern characters currently match," each text character either advances the state or follows fallback arrows (the failure function), and the text is read exactly once, never backward. That sentence, plus the "abab" example, is the expected depth; reciting the table-building loop from memory is not.
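
The state-machine view can be made concrete with a short trace. This sketch records the state (number of pattern characters currently matched) after every text character; the names failure_table and kmp_states are illustrative, and the pattern is assumed non-empty.

```python
def failure_table(p):
    """f[i] = length of the longest proper prefix of p[:i+1] that is also its suffix."""
    f, k = [0] * len(p), 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = f[k - 1]
        if p[i] == p[k]:
            k += 1
        f[i] = k
    return f


def kmp_states(t, p):
    """State after each text character; a state of len(p) marks a full match.
    Assumes p is non-empty."""
    f, k, states = failure_table(p), 0, []
    for ch in t:
        if k == len(p):              # just completed a match: follow the fallback arrow
            k = f[k - 1]
        while k and ch != p[k]:      # mismatch: follow fallback arrows
            k = f[k - 1]
        if ch == p[k]:               # match: advance the state
            k += 1
        states.append(k)
    return states


print(kmp_states("ABABABCABABABD", "ABABD"))
# [1, 2, 3, 4, 3, 4, 0, 1, 2, 3, 4, 3, 4, 5]
```

Each text character produces exactly one entry: the state may drop (4 → 3, 4 → 0), but the position in the text never moves backward.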

Two names to recognize, not implement: the Z-algorithm (computes, for each position, the length of the longest substring starting there that matches the string's own prefix — an alternative O(n) toolkit with the same applications), and Aho-Corasick (KMP generalized to many patterns simultaneously by building the failure links over a trie — the industrial-strength multi-pattern matcher).

Watch KMP never look back

Same search, two phases: first the pattern studies itself (the "survives" row — how many letters of progress outlive a mismatch), then the scan where the text pointer only ever moves forward. Compare the final comparison count against the naive run above.

[Interactive demo: Substring search with KMP. Time O(n + m), space O(m). Phase one studies the pattern alone and fills the "survives" row: for each position, "if I fail here, how much of the pattern's own beginning is still alive in what I matched?" (Computers hate redoing work; this table is the cure.) Phase two scans the text. Inputs: method toggle, text (up to 16 characters), pattern (up to 8 characters).]
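
For a comparison count you can reproduce offline, here is a sketch of KMP instrumented to count character comparisons in both phases. It uses a while-loop formulation so that each iteration performs exactly one comparison; the function names are assumptions for illustration, and the pattern is assumed non-empty.

```python
def counted_failure_table(p):
    """Build the failure table and count the comparisons spent on it."""
    f, k, i, comparisons = [0] * len(p), 0, 1, 0
    while i < len(p):
        comparisons += 1
        if p[i] == p[k]:
            k += 1
            f[i] = k
            i += 1
        elif k:
            k = f[k - 1]             # fall back through shorter borders
        else:
            f[i] = 0
            i += 1
    return f, comparisons


def kmp_find_counting(t, p):
    """Return (index, table_comparisons, scan_comparisons). Assumes p is non-empty."""
    f, table_cmp = counted_failure_table(p)
    i = k = scan_cmp = 0
    while i < len(t):
        scan_cmp += 1
        if t[i] == p[k]:
            i += 1
            k += 1
            if k == len(p):
                return i - k, table_cmp, scan_cmp
        elif k:
            k = f[k - 1]             # reuse what was matched; i does not move
        else:
            i += 1
    return -1, table_cmp, scan_cmp


print(kmp_find_counting("ABABABCABABABD", "ABABD"))   # (9, 5, 18)
```

The text index i only ever increases, which is where the O(n) bound on the scan comes from.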

Where the failure function sneaks into interviews

The table itself answers questions, no searching involved — these are the actual KMP interview appearances:

Shortest Palindrome (LC 214) — longest palindromic prefix via the failure function of s + "#" + reverse(s).
Repeated Substring Pattern (LC 459) — s is a repetition iff n % (n − f[n−1]) == 0 with f[n−1] > 0; the failure function exposes the period of the string.
Longest happy prefix (LC 1392) — literally "compute f[n−1]."
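
A hedged sketch of all three, each a few lines on top of the failure table. The function names are illustrative, and shortest_palindrome assumes the separator "#" never occurs in s (true for lowercase-only inputs).

```python
def failure_table(s):
    """f[i] = length of the longest proper prefix of s[:i+1] that is also its suffix."""
    f, k = [0] * len(s), 0
    for i in range(1, len(s)):
        while k and s[i] != s[k]:
            k = f[k - 1]
        if s[i] == s[k]:
            k += 1
        f[i] = k
    return f


def longest_happy_prefix(s):
    """LC 1392: longest proper prefix that is also a suffix."""
    return s[:failure_table(s)[-1]] if s else ""


def is_repeated(s):
    """LC 459: true iff s is a shorter block repeated two or more times."""
    n = len(s)
    if n == 0:
        return False
    border = failure_table(s)[-1]
    return border > 0 and n % (n - border) == 0   # n - border is the period


def shortest_palindrome(s):
    """LC 214: shortest palindrome obtained by adding characters in front of s.
    Assumes "#" does not occur in s."""
    if not s:
        return s
    keep = failure_table(s + "#" + s[::-1])[-1]   # length of longest palindromic prefix
    return s[keep:][::-1] + s                     # mirror the rest in front


print(longest_happy_prefix("ababab"))    # abab
print(is_repeated("abcabcabc"))          # True
print(shortest_palindrome("aacecaaa"))   # aaacecaaa
```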

Common mistakes

Rabin-Karp without the verification compare — a hash collision returns a phantom match; with adversarial input this is exploitable (hash flooding's cousin). Verify on hit, or use double hashing.
Modulus/overflow sloppiness — in Java/C++ the rolling hash must consciously use long and careful mod arithmetic; Python's big ints hide the issue and then the same code fails in another language.
Hand-rolling in production — str.find, indexOf, std::search are optimized (often SIMD-accelerated) and correct. These algorithms are for understanding and for the cases libraries don't cover (streams, many patterns, your own equality notion).
Reaching for KMP when n is small — naive matching's O(n·m) with tiny constants beats clever algorithms below thousands of characters. State the crossover, then write the simple thing.
Confusing substring problems with subsequence problems — "substring" = contiguous → this page + sliding window; "subsequence" = gaps allowed → DP. Misfiling wastes ten minutes.

Think it through

The failure function answers questions with no searching at all. Reason through the cleanest one — detecting whether a string is just a smaller block repeated — and meet the famous s+s one-liner. Think before revealing.

Think it through: Repeated Substring Pattern (Easy, LeetCode 459)

PROBLEM: Does the string consist of one substring repeated two or more times? 'abab' → true (ab × 2); 'abcabcabc' → true; 'aba' → false.

1. Restate & brute force: “What candidate period lengths are even possible, and how would I check one?”
2. The elegant trick, s + s: “Concatenate s with itself, then chop off the first and last character. What does searching for s in that tell you?”
3. The KMP connection: “How does the failure function expose the period directly?”
4. Code it: “Write the one-liner, and note the KMP alternative.”
5. Cost & judgment: “Cost of each approach, and which would you say in the room?”
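
For stage 4, one possible version of the s + s one-liner, with a brute-force check over candidate periods to compare against. Both function names are illustrative, and s is assumed non-empty.

```python
def repeated_substring_pattern(s):
    """True iff s still occurs inside s + s after the first and last characters
    are removed. Assumes s is non-empty."""
    return s in (s + s)[1:-1]


def repeated_brute(s):
    """Try every period d that divides len(s), up to half the length."""
    n = len(s)
    return any(n % d == 0 and s[:d] * (n // d) == s for d in range(1, n // 2 + 1))


for word in ["abab", "abcabcabc", "aba", "a"]:
    print(word, repeated_substring_pattern(word), repeated_brute(word))
```

Chopping both ends removes the two trivial occurrences of s (at offsets 0 and len(s)), so any occurrence that remains starts at a proper shift, and that shift is a period of the string.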

Check yourself

1. What does the Rabin-Karp algorithm use to achieve an average O(N + M) time complexity?

2. In KMP, what does the failure table value f[i] represent?

3. For matching 50,000 distinct patterns in a single pass of a large log stream, which algorithm is best suited?

Practice

Build Rabin-Karp from this page; test it on an input where naive matching is visibly slow (search the text "a"*10**6 for the pattern "a"*10**3 + "b"), and time both.
Failure-function reps: compute f by hand for "aabaaab" and "abcabcab", then verify with the code. (Hand-computing two tables teaches more than reading ten explanations.)
The canon: Implement strStr (LC 28 — any method, then KMP); Repeated Substring Pattern (LC 459) via the period trick — and prove the trick to yourself on "abab" and "aba".
Generalize the roll: use a rolling hash to find the longest substring that appears at least twice (binary search the length + hash-set of window hashes — a beautiful binary-search-on-answer combo).
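
If you want something to compare your attempt at the last exercise against, here is one possible sketch of the binary-search-plus-rolling-hash combination. The function name and the choice to allow overlapping occurrences are assumptions; try the exercise before reading it.

```python
def longest_repeated_substring(s):
    """Longest substring occurring at least twice in s (occurrences may overlap)."""
    B, M = 256, (1 << 61) - 1

    def duplicate_of_length(length):
        """Return some substring of this length that occurs twice, or None."""
        h = 0
        for ch in s[:length]:
            h = (h * B + ord(ch)) % M
        power = pow(B, length - 1, M)
        seen = {h: [0]}                            # hash -> start indices seen so far
        for i in range(1, len(s) - length + 1):
            h = ((h - ord(s[i - 1]) * power) * B + ord(s[i + length - 1])) % M
            if h in seen:
                window = s[i:i + length]
                for j in seen[h]:
                    if s[j:j + length] == window:  # verify on hash hit
                        return window
            seen.setdefault(h, []).append(i)
        return None

    # If a duplicate of length L exists, one of length L - 1 exists too,
    # so "has a duplicate" is monotone in L and can be binary searched.
    lo, hi, best = 1, len(s) - 1, ""
    while lo <= hi:
        mid = (lo + hi) // 2
        found = duplicate_of_length(mid)
        if found is not None:
            best, lo = found, mid + 1
        else:
            hi = mid - 1
    return best


print(longest_repeated_substring("banana"))   # ana
```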

That completes the Level 3 pattern arc. On to applying everything: the problem bank and Blind 75.

Practice — climb the ladder

String problems split three ways: scanning (windows/counters), matching (KMP/rolling hash), and structure (palindromes). Know which one you are in.

Practice ladder: String Algorithms

Climb in order; every rung assumes the one above it. Solve each problem on LeetCode.

Warm-up

scanning and normalizing

Find the Index of the First Occurrence (Easy)
Substring search naively first: the baseline KMP will later destroy.
Repeated Substring Pattern (Easy)
Periodicity reasoning: the (s+s) trick and why it works.

Core

palindromes + parsing

Longest Palindromic Substring (Medium)
Expand-around-center: 2n−1 centers, odd and even handled cleanly.
String to Integer (atoi) (Medium)
State-machine parsing: whitespace, sign, digits, overflow clamps; edge-case discipline.
Zigzag Conversion (Medium)
Index simulation: direction flips as row bookkeeping.

Stretch

where KMP and 2D DP earn their keep

Shortest Palindrome (Hard)
KMP failure function applied to s + "#" + reverse(s): prefix machinery for real.
Distinct Subsequences (Hard)
Counting matches in a 2D table: LCS thinking, counting variant.