String Matching

String Matching covers algorithms for locating all occurrences of a pattern P (length m) within a text T (length n). The central challenge is avoiding redundant comparisons — every algorithm below exploits precomputed structure to skip work the naïve approach repeats.

Core Concepts

Pattern — the string P being searched for; length m. Text — the string T being searched in; length n. Occurrence / match — a position i in T where T[i..i+m-1] == P. Shift — how far right to advance P after a mismatch or match. Prefix function (failure function) — for KMP: π[i] = length of the longest proper prefix of P[0..i] that is also a suffix; drives shifts without backtracking. Z-array — Z[i] = length of the longest substring starting at position i in S that matches a prefix of S; computed on the concatenated string S = P + '$' + T; positions where Z[i] == m are match locations. Rolling hash — a hash of a fixed-length window that updates in O(1) by removing the outgoing character and adding the incoming one; used in Rabin-Karp. Failure link (Aho-Corasick) — for each trie node v, a pointer to the node representing the longest proper suffix of the string at v that is also a prefix of some pattern; enables multi-pattern matching in a single text pass. Bad character / good suffix (Boyer-Moore) — two independent shift heuristics computed from P; allow skipping large portions of T after a mismatch. Output link — in Aho-Corasick, a shortcut from a node to the nearest ancestor (via failure links) that is a terminal node; required to report all matches, not just the longest.

Types / Variants

Variant	What it is	When to prefer
Exact single-pattern	Find all positions where P appears verbatim in T	Default; KMP or Z-algorithm
Exact multi-pattern	Find all positions of any pattern from a set	Aho-Corasick; or Rabin-Karp with a hash set for short patterns
Approximate / fuzzy	Allow up to k mismatches or edit-distance k	DP over (text_pos × mismatch_count); bitap for small m
Wildcard pattern	'?' matches any single char, '*' matches any substring	DP (not KMP/Z); see dynamic-programming.md
Circular / rotation	P may appear as a cyclic rotation of T	Concatenate T + T, then search with KMP
Repeated-period finding	Find shortest period of a string	KMP failure function: period = m − π[m−1]
Anagram / permutation search	Find all windows of length m that are permutations of P	Sliding window + frequency array; not a string matching algorithm
2D pattern matching	Find a 2D pattern matrix inside a 2D text matrix	Rabin-Karp applied row-then-column

Key Algorithms

Naïve / Brute Force

Align P at every position i in T and compare character by character. O(nm) time, O(1) space. Acceptable for m ≤ 100 or n ≤ 10³; every algorithm below improves on this by exploiting previously seen information.

KMP (Knuth-Morris-Pratt)

Finds all occurrences of P in T in O(n + m) time, O(m) space.

Key insight: after a mismatch at pattern position j, the characters T[i−j..i−1] are already known to match P[0..j−1]. The prefix function encodes the longest proper prefix of P[0..j−1] that is also a suffix — that prefix is what we can resume matching from without moving the text pointer backwards.

Failure function construction:

j = 0
for i in range(1, m):
    while j > 0 and P[i] != P[j]: j = pi[j-1]
    if P[i] == P[j]: j += 1
    pi[i] = j

The while j > 0 fallback chain is the non-obvious part; it follows the failure links until a matching character is found or j reaches 0.

Search: maintain j = number of characters matched so far in P. On mismatch, fall back via π[j−1]. When j == m, record match at i − m + 1 and reset j = π[j−1] (not 0 — overlapping matches are valid).

Period finding: shortest period of P = m − π[m−1]; this is a valid repeating unit iff m % (m − π[m−1]) == 0.

Z-Algorithm

Computes the Z-array for S in O(|S|) time. For string matching, build S = P + '$' + T; every i > m where Z[i] == m is a match at text position i − m − 1. Same asymptotic complexity as KMP; the Z-array formulation is often more natural for prefix-matching problems.

Key insight: maintain the rightmost Z-box [l, r] (a substring S[l..r] that matches S[0..r−l]). For position i ≤ r, Z[i] ≥ min(r − i, Z[i − l]); extend naively beyond r and update [l, r] when a longer box is found.

if i < r:
    Z[i] = min(r - i, Z[i - l])
# then extend: while i + Z[i] < n and S[Z[i]] == S[i + Z[i]]: Z[i] += 1

The sentinel '$' prevents Z-values at positions in the text portion from exceeding m by crossing the boundary — it must not appear in P or T.

Rabin-Karp

Uses polynomial rolling hash. Computes hash(P) once, then slides a length-m window across T, updating the hash in O(1) per step. On a hash match, verify the actual substring. O(n + m) average, O(nm) worst case (all collisions). O(1) extra space.

Key insight: hash(T[i+1..i+m]) = (hash(T[i..i+m−1]) − T[i] · base^(m−1)) · base + T[i+m], all mod a prime.

Multi-pattern use: store hashes of all patterns in a set; a window match against the set in O(1) triggers verification. Effective when k patterns share length m.

Choose base > alphabet size (31 for lowercase, 131 for ASCII); use a large prime modulus (10⁹ + 7). Double hashing (two independent (base, mod) pairs) reduces collision probability to ~1/MOD².

Boyer-Moore

Two heuristics applied simultaneously:

Bad character: on mismatch at T[i+j] vs P[j], shift P right so the rightmost occurrence of T[i+j] in P[0..j−1] aligns with i+j (or shift past i+j entirely if T[i+j] is absent from P).
Good suffix: after matching P[j+1..m−1], shift P so a copy of that matched suffix (or the longest prefix of P that is a suffix of it) realigns.

O(nm) worst case (e.g., P = "aaa…a" in T = "aaa…a"), O(n/m) best case on large alphabets. Preprocessing O(m + σ). In practice the fastest algorithm for large alphabets and natural-language text; rarely used in competitive programming due to implementation complexity.

Aho-Corasick

Builds a trie of all k patterns, then augments it with failure links (exactly KMP's prefix function, but over the trie structure). Processes T once in O(n + total_matches) after O(Σ|patterns| · σ) preprocessing. Reports all occurrences of all patterns simultaneously.

Failure link construction (BFS):

Root's failure link points to root.
For node v reached from parent p via character c: fail[v] = trie.go(fail[p], c), where go(node, c) follows failure links from node until a c-transition exists or reaches root.
Output links: if fail[v] is a terminal node, output_link[v] = fail[v]; otherwise output_link[v] = output_link[fail[v]].

Text traversal: maintain current trie node; for each character in T, follow the trie (or failure links if no transition); at each node, traverse output links to emit all matches.

Advanced Techniques

Aho-Corasick + DP

Solves: count (or enumerate) strings of length n built from an alphabet that must contain, or must avoid, every word from a given dictionary. Mechanism: DP state = (characters placed so far, current Aho-Corasick node). Transition: append one character, advance through trie + failure links, mark states that pass through terminal nodes as forbidden or required. Prefer over running k separate KMP+DP passes when there are 2+ patterns and the patterns can co-occur; the AC automaton fuses them into a single state machine. Complexity: O(n · |trie_states| · σ) time, O(|trie_states|) space.

Binary Search + Rolling Hash for Repeated Substrings

Solves: longest substring appearing at least twice (or k times), longest common substring of two strings, longest substring with some frequency property. Mechanism: binary search on the answer length L; for a given L, slide a length-L hash window across the string and check for hash collision in a set. O(n log n) total with O(1) per window. Prefer over suffix array when the problem only needs existence, not all occurrences, and when implementing suffix arrays is not warranted. Complexity: O(n log n) with high-probability correctness; O(n log n · m) if verification is required per candidate.

Z-Function Structural Analysis

Solves: finding all positions where a prefix of P reappears within P itself, detecting the minimal period, splitting a string into its repeating unit, or building the failure function from the Z-array. Mechanism: compute Z on P alone (not concatenated with a text). Z[i] == m − i means P[i:] is a suffix that also equals a prefix → every such i is a "border" start. Period = smallest i > 0 where Z[i] + i == m (or fallback to m). Prefer over KMP's π when the problem asks for prefix-match lengths at every position simultaneously, or when the Z-array's semantics (absolute match length vs. relative suffix length) are cleaner for the formulation. Complexity: O(m).

Problem Taxonomy

By Problem Category

Category	What it asks	Key technique
Single occurrence check	Does P appear in T?	`in` / `str.find` for simple cases; KMP for custom implementations
All occurrences	Return every start index	KMP or Z-algorithm
Multi-pattern search	Which of k patterns appear in T, and where?	Aho-Corasick
Repeated / duplicate substring	Longest or most-frequent substring	Binary search + rolling hash; suffix array + LCP
Minimal period / rotation	Shortest repeating unit; is T a rotation of P?	KMP failure function; search in T+T
String construction avoiding words	Count length-n strings not containing any word	Aho-Corasick + DP
Approximate matching	Minimum edits to embed P in T	DP (edit distance); see dynamic-programming.md
2D matching	Find a 2D pattern in a 2D grid	Rabin-Karp row-then-column

Problem Signal → Technique

Signal in problem	Likely technique
"find all occurrences of pattern in text"	KMP or Z-algorithm
"shortest period" / "repeating unit" / "smallest period"	KMP failure function: period = m − π[m−1]
"is S a rotation of T?"	Check if S appears in T + T
"k patterns, find any/all in text"	Aho-Corasick
"how many strings of length n avoid these words"	Aho-Corasick + DP
"find duplicate / longest repeated substring"	Rabin-Karp binary search or suffix array
"match with wildcards '?' or '*'"	DP, not string matching algorithms
"find all anagrams / permutations of P in T"	Sliding window + frequency array
"longest common substring of two strings"	Suffix array + LCP or DP
"rolling hash" mentioned or implied by O(1) window comparison	Rabin-Karp
large alphabet, pattern much shorter than text, offline	Boyer-Moore
"minimum number of times you can search a text with multiple queries"	Aho-Corasick (build once, query once)

Common Patterns

Pattern	When it applies	Mechanism
Build on self (Z / π)	Need prefix-suffix overlap info for P alone	Run Z or KMP on P itself — no text concatenation
Sentinel concatenation	Search P in T without cross-boundary Z-values	S = P + '$' + T; use a char absent from both strings
Prefix hash array	O(1) substring hash after O(n) preprocessing	h[i] = h[i−1] · base + ord(T[i]); query via subtraction
Double hashing	Reduce collision probability to ~1/MOD²	Maintain two independent (base, mod) pairs; both must collide for a false positive
Binary search + hash set	Find longest substring satisfying a frequency/count condition	Binary search on length L; O(n) hash-window pass per candidate L
Failure-link BFS (Aho-Corasick)	Multiple patterns, one text pass	Build trie; BFS assigns fail[v] using parent's fail; propagate output links in same BFS
Period from π	Structural property derivable from failure function	period = m − π[m−1]; valid iff m % period == 0
KMP on reversed / concatenated string	Palindrome or rotation problems	Shortest palindrome: KMP on P + '$' + reverse(P) to find longest palindromic prefix

Edge Cases & Pitfalls

Empty pattern or empty text: KMP and Z return immediately; rolling hash has undefined window. Guard at entry — return 0 (empty pattern matches everywhere) or [] (empty text has no matches) per problem definition.
Pattern longer than text (m > n): no match possible; return early to avoid negative-length window arithmetic in rolling hash.
Overlapping matches: KMP handles this correctly only if you reset j = π[j−1] (not j = 0) after recording a match. Resetting to 0 misses the second "aa" in "aaa" when searching for "aa".
Sentinel must be absent from P and T: if '$' appears in the input (e.g., URL strings), choose a different sentinel or use integer concatenation (append -1 between integer-encoded arrays).
Rolling hash subtraction underflow: in languages with fixed-size integers, (a - b) % MOD can be negative. Always write (a - b % MOD + MOD) % MOD. Python's % is always non-negative, so this is a Java/C++ concern only.
Rabin-Karp single modulus collisions: with MOD = 10⁹+7 and n = 10⁵, expected ~10⁻⁴ false-positive rate per query — usually acceptable, but always verify on collision if correctness is critical.
KMP π off-by-one: π[0] = 0 always; the loop starts at i = 1 with j = 0. A common mistake is entering the loop at i = 0, which sets π[0] incorrectly.
Boyer-Moore bad-character initialisation: characters not in P should produce a shift that moves the alignment past the current position. Initialise the table to -1 (last occurrence = "before the pattern"), so shift = j - (-1) = j + 1.
Aho-Corasick missing output links: a node may match a short pattern reachable only via failure links, not directly. Skipping output-link traversal causes missed matches for patterns that are suffixes of longer matches.
Aho-Corasick on varying-length patterns: after a full match at a terminal node, still follow output links — a shorter pattern may also end at this text position.
Period validity check: m − π[m−1] is always a candidate period length, but it is the true period only when m % (m − π[m−1]) == 0. Otherwise the string has no period shorter than m.

Implementation Notes

Python's str.find() and in use an optimised internal search (a Sunday/Boyer-Moore-Horspool variant) — for simple existence checks on Python strings, don't reimplement KMP.
Python's re.search() with a fixed string uses full NFA simulation; str.find is faster for literal patterns.
KMP exists in both 0-indexed and 1-indexed forms in textbooks; the search phase index arithmetic differs between them — pick one and be consistent throughout.
collections.deque is required for Aho-Corasick BFS; list with pop(0) is O(n) per call and will TLE on large tries.
Polynomial hash base must exceed the alphabet size: ≥ 27 for lowercase letters, ≥ 128 for ASCII, ≥ 131 is a common safe choice.
KMP failure function is iterative by nature; do not implement it recursively — Python's default recursion limit (1000) is too small for m > 1000.
For Z-algorithm: after computing Z on S = P + '$' + T, match positions in T are those indices i in [m+1, |S|) where Z[i] == m; the corresponding text index is i − m − 1.
Aho-Corasick trie nodes are typically represented as dicts (sparse alphabet) or arrays of size σ (dense alphabet); array is faster but uses 26× more memory per node.

Cross-Topic Relations

Related topic	Connection
strings.md	String matching is a sub-domain of string problems; character manipulation, palindromes, and anagrams are covered there
suffix-array.md	Suffix arrays + LCP arrays solve repeated-substring and longest-common-substring problems that go beyond what KMP/Z handle; prerequisite for competitive-level string matching
trie.md	Aho-Corasick is a trie augmented with failure links; trie insert/search operations are a prerequisite
dynamic-programming.md	Approximate matching (edit distance, wildcard `*`/`?`) uses DP; Aho-Corasick + DP combines both paradigms
sliding-window.md	Anagram and permutation search — frequently confused with string matching — use sliding window + frequency array, not KMP
hash-table.md	Rabin-Karp is hashing applied to strings; collision analysis requires understanding modular arithmetic and hash set operations

Interview Reference

Must-Know Problems

KMP / Z-Algorithm

Find the Index of the First Occurrence in a String (LC 28) — baseline KMP implementation
Repeated Substring Pattern (LC 459) — period via failure function
Shortest Palindrome (LC 214) — KMP on P + '$' + reverse(P) to find longest palindromic prefix
Longest Happy Prefix (LC 1392) — directly constructs the failure function as the answer

Rabin-Karp / Rolling Hash

Longest Duplicate Substring (LC 1044) — binary search + rolling hash; high-frequency hard problem
Repeated DNA Sequences (LC 187) — hash-set on length-10 windows
Rabin-Karp String Matching (LC 28 variant) — implement Rabin-Karp explicitly

Multi-Pattern / Aho-Corasick

Word Search II (LC 212) — trie on board; failure-link mindset applies even without full AC
Multi String Search — find which of k patterns appear in T; canonical Aho-Corasick problem

Period / Rotation

Rotate String (LC 796) — check if S appears in S + S (one line with KMP)
String Compression (LC 443) — uses period/repetition structure

Additional LeetCode Coverage

Repeated String Match (LC 686)

Clarification Questions to Ask

Is the pattern fixed, or can it contain wildcards (?, *)? Wildcards require DP, not KMP.
Is the alphabet bounded (e.g., lowercase only) or arbitrary? Affects hash base and Boyer-Moore table size.
Do overlapping occurrences count as separate matches?
Is the text streamed character by character, or available all at once? Streaming favors KMP; offline allows suffix arrays.
Are there multiple patterns, or just one? One → KMP/Z; many → Aho-Corasick.
What defines a "match"? Exact / case-insensitive / with k mismatches?
Should the answer count positions or return the actual substrings?

Common Interview Mistakes

Using O(nm) naïve search for large inputs (n, m up to 10⁵) without recognising KMP is needed.
Implementing KMP but resetting j = 0 instead of j = π[j−1] after recording a match, making the algorithm miss overlapping occurrences.
Forgetting the sentinel when concatenating P and T for Z-algorithm, causing Z-values to cross the boundary and produce spurious match lengths.
Choosing a weak hash modulus (e.g., 10⁶) without double hashing and not verifying on collision, leading to wrong answers.
Reaching for KMP to solve "find all anagrams in a string" (LC 438) — the problem is order-insensitive and requires sliding window, not character-order matching.
Conflating "period length" (m − π[m−1]) with a valid period — omitting the divisibility check.

Typical Follow-Up Escalations

"Now allow k mismatches" → DP over (text_pos × mismatch_count), or bitap for small m.
"Now search for any of 1000 patterns simultaneously" → Aho-Corasick.
"Now find the longest repeated substring" → suffix array + LCP (interview-accessible: binary search + rolling hash).
"What if the text arrives as a stream, one character at a time?" → KMP handles this naturally; Z-algorithm requires the full string.
"Can you make the search phase use O(1) extra space?" → Boyer-Moore's search uses O(1) extra per character (the preprocessing tables are fixed); KMP uses O(m) for the π array.
"Now count distinct substrings" → suffix array + LCP: answer = n(n+1)/2 − Σ LCP[i].

Explore Unread

Great job! You've read all available articles

String Matching

Core Concepts

Types / Variants

Variant	What it is	When to prefer
Exact single-pattern	Find all positions where P appears verbatim in T	Default; KMP or Z-algorithm
Exact multi-pattern	Find all positions of any pattern from a set	Aho-Corasick; or Rabin-Karp with a hash set for short patterns
Approximate / fuzzy	Allow up to k mismatches or edit-distance k	DP over (text_pos × mismatch_count); bitap for small m
Wildcard pattern	'?' matches any single char, '*' matches any substring	DP (not KMP/Z); see dynamic-programming.md
Circular / rotation	P may appear as a cyclic rotation of T	Concatenate T + T, then search with KMP
Repeated-period finding	Find shortest period of a string	KMP failure function: period = m − π[m−1]
Anagram / permutation search	Find all windows of length m that are permutations of P	Sliding window + frequency array; not a string matching algorithm
2D pattern matching	Find a 2D pattern matrix inside a 2D text matrix	Rabin-Karp applied row-then-column

Key Algorithms

Naïve / Brute Force

KMP (Knuth-Morris-Pratt)

Finds all occurrences of P in T in O(n + m) time, O(m) space.

Failure function construction:

j = 0
for i in range(1, m):
    while j > 0 and P[i] != P[j]: j = pi[j-1]
    if P[i] == P[j]: j += 1
    pi[i] = j

The while j > 0 fallback chain is the non-obvious part; it follows the failure links until a matching character is found or j reaches 0.

Period finding: shortest period of P = m − π[m−1]; this is a valid repeating unit iff m % (m − π[m−1]) == 0.

Z-Algorithm

if i < r:
    Z[i] = min(r - i, Z[i - l])
# then extend: while i + Z[i] < n and S[Z[i]] == S[i + Z[i]]: Z[i] += 1

The sentinel '$' prevents Z-values at positions in the text portion from exceeding m by crossing the boundary — it must not appear in P or T.

Rabin-Karp

Key insight: hash(T[i+1..i+m]) = (hash(T[i..i+m−1]) − T[i] · base^(m−1)) · base + T[i+m], all mod a prime.

Multi-pattern use: store hashes of all patterns in a set; a window match against the set in O(1) triggers verification. Effective when k patterns share length m.

Choose base > alphabet size (31 for lowercase, 131 for ASCII); use a large prime modulus (10⁹ + 7). Double hashing (two independent (base, mod) pairs) reduces collision probability to ~1/MOD².

Boyer-Moore

Two heuristics applied simultaneously:

Bad character: on mismatch at T[i+j] vs P[j], shift P right so the rightmost occurrence of T[i+j] in P[0..j−1] aligns with i+j (or shift past i+j entirely if T[i+j] is absent from P).
Good suffix: after matching P[j+1..m−1], shift P so a copy of that matched suffix (or the longest prefix of P that is a suffix of it) realigns.

Aho-Corasick

Failure link construction (BFS):

Root's failure link points to root.
For node v reached from parent p via character c: fail[v] = trie.go(fail[p], c), where go(node, c) follows failure links from node until a c-transition exists or reaches root.
Output links: if fail[v] is a terminal node, output_link[v] = fail[v]; otherwise output_link[v] = output_link[fail[v]].

Text traversal: maintain current trie node; for each character in T, follow the trie (or failure links if no transition); at each node, traverse output links to emit all matches.

Advanced Techniques

Aho-Corasick + DP

Binary Search + Rolling Hash for Repeated Substrings

Z-Function Structural Analysis

Problem Taxonomy

By Problem Category

Category	What it asks	Key technique
Single occurrence check	Does P appear in T?	`in` / `str.find` for simple cases; KMP for custom implementations
All occurrences	Return every start index	KMP or Z-algorithm
Multi-pattern search	Which of k patterns appear in T, and where?	Aho-Corasick
Repeated / duplicate substring	Longest or most-frequent substring	Binary search + rolling hash; suffix array + LCP
Minimal period / rotation	Shortest repeating unit; is T a rotation of P?	KMP failure function; search in T+T
String construction avoiding words	Count length-n strings not containing any word	Aho-Corasick + DP
Approximate matching	Minimum edits to embed P in T	DP (edit distance); see dynamic-programming.md
2D matching	Find a 2D pattern in a 2D grid	Rabin-Karp row-then-column

Problem Signal → Technique

Signal in problem	Likely technique
"find all occurrences of pattern in text"	KMP or Z-algorithm
"shortest period" / "repeating unit" / "smallest period"	KMP failure function: period = m − π[m−1]
"is S a rotation of T?"	Check if S appears in T + T
"k patterns, find any/all in text"	Aho-Corasick
"how many strings of length n avoid these words"	Aho-Corasick + DP
"find duplicate / longest repeated substring"	Rabin-Karp binary search or suffix array
"match with wildcards '?' or '*'"	DP, not string matching algorithms
"find all anagrams / permutations of P in T"	Sliding window + frequency array
"longest common substring of two strings"	Suffix array + LCP or DP
"rolling hash" mentioned or implied by O(1) window comparison	Rabin-Karp
large alphabet, pattern much shorter than text, offline	Boyer-Moore
"minimum number of times you can search a text with multiple queries"	Aho-Corasick (build once, query once)

Common Patterns

Pattern	When it applies	Mechanism
Build on self (Z / π)	Need prefix-suffix overlap info for P alone	Run Z or KMP on P itself — no text concatenation
Sentinel concatenation	Search P in T without cross-boundary Z-values	S = P + '$' + T; use a char absent from both strings
Prefix hash array	O(1) substring hash after O(n) preprocessing	h[i] = h[i−1] · base + ord(T[i]); query via subtraction
Double hashing	Reduce collision probability to ~1/MOD²	Maintain two independent (base, mod) pairs; both must collide for a false positive
Binary search + hash set	Find longest substring satisfying a frequency/count condition	Binary search on length L; O(n) hash-window pass per candidate L
Failure-link BFS (Aho-Corasick)	Multiple patterns, one text pass	Build trie; BFS assigns fail[v] using parent's fail; propagate output links in same BFS
Period from π	Structural property derivable from failure function	period = m − π[m−1]; valid iff m % period == 0
KMP on reversed / concatenated string	Palindrome or rotation problems	Shortest palindrome: KMP on P + '$' + reverse(P) to find longest palindromic prefix

Edge Cases & Pitfalls

Empty pattern or empty text: KMP and Z return immediately; rolling hash has undefined window. Guard at entry — return 0 (empty pattern matches everywhere) or [] (empty text has no matches) per problem definition.
Pattern longer than text (m > n): no match possible; return early to avoid negative-length window arithmetic in rolling hash.
Overlapping matches: KMP handles this correctly only if you reset j = π[j−1] (not j = 0) after recording a match. Resetting to 0 misses the second "aa" in "aaa" when searching for "aa".
Sentinel must be absent from P and T: if '$' appears in the input (e.g., URL strings), choose a different sentinel or use integer concatenation (append -1 between integer-encoded arrays).
Rolling hash subtraction underflow: in languages with fixed-size integers, (a - b) % MOD can be negative. Always write (a - b % MOD + MOD) % MOD. Python's % is always non-negative, so this is a Java/C++ concern only.
Rabin-Karp single modulus collisions: with MOD = 10⁹+7 and n = 10⁵, expected ~10⁻⁴ false-positive rate per query — usually acceptable, but always verify on collision if correctness is critical.
KMP π off-by-one: π[0] = 0 always; the loop starts at i = 1 with j = 0. A common mistake is entering the loop at i = 0, which sets π[0] incorrectly.
Boyer-Moore bad-character initialisation: characters not in P should produce a shift that moves the alignment past the current position. Initialise the table to -1 (last occurrence = "before the pattern"), so shift = j - (-1) = j + 1.
Aho-Corasick missing output links: a node may match a short pattern reachable only via failure links, not directly. Skipping output-link traversal causes missed matches for patterns that are suffixes of longer matches.
Aho-Corasick on varying-length patterns: after a full match at a terminal node, still follow output links — a shorter pattern may also end at this text position.
Period validity check: m − π[m−1] is always a candidate period length, but it is the true period only when m % (m − π[m−1]) == 0. Otherwise the string has no period shorter than m.

Implementation Notes

Python's str.find() and in use an optimised internal search (a Sunday/Boyer-Moore-Horspool variant) — for simple existence checks on Python strings, don't reimplement KMP.
Python's re.search() with a fixed string uses full NFA simulation; str.find is faster for literal patterns.
KMP exists in both 0-indexed and 1-indexed forms in textbooks; the search phase index arithmetic differs between them — pick one and be consistent throughout.
collections.deque is required for Aho-Corasick BFS; list with pop(0) is O(n) per call and will TLE on large tries.
Polynomial hash base must exceed the alphabet size: ≥ 27 for lowercase letters, ≥ 128 for ASCII, ≥ 131 is a common safe choice.
KMP failure function is iterative by nature; do not implement it recursively — Python's default recursion limit (1000) is too small for m > 1000.
For Z-algorithm: after computing Z on S = P + '$' + T, match positions in T are those indices i in [m+1, |S|) where Z[i] == m; the corresponding text index is i − m − 1.
Aho-Corasick trie nodes are typically represented as dicts (sparse alphabet) or arrays of size σ (dense alphabet); array is faster but uses 26× more memory per node.

Cross-Topic Relations

Related topic	Connection
strings.md	String matching is a sub-domain of string problems; character manipulation, palindromes, and anagrams are covered there
suffix-array.md	Suffix arrays + LCP arrays solve repeated-substring and longest-common-substring problems that go beyond what KMP/Z handle; prerequisite for competitive-level string matching
trie.md	Aho-Corasick is a trie augmented with failure links; trie insert/search operations are a prerequisite
dynamic-programming.md	Approximate matching (edit distance, wildcard `*`/`?`) uses DP; Aho-Corasick + DP combines both paradigms
sliding-window.md	Anagram and permutation search — frequently confused with string matching — use sliding window + frequency array, not KMP
hash-table.md	Rabin-Karp is hashing applied to strings; collision analysis requires understanding modular arithmetic and hash set operations

Interview Reference

Must-Know Problems

KMP / Z-Algorithm

Find the Index of the First Occurrence in a String (LC 28) — baseline KMP implementation
Repeated Substring Pattern (LC 459) — period via failure function
Shortest Palindrome (LC 214) — KMP on P + '$' + reverse(P) to find longest palindromic prefix
Longest Happy Prefix (LC 1392) — directly constructs the failure function as the answer

Rabin-Karp / Rolling Hash

Longest Duplicate Substring (LC 1044) — binary search + rolling hash; high-frequency hard problem
Repeated DNA Sequences (LC 187) — hash-set on length-10 windows
Rabin-Karp String Matching (LC 28 variant) — implement Rabin-Karp explicitly

Multi-Pattern / Aho-Corasick

Word Search II (LC 212) — trie on board; failure-link mindset applies even without full AC
Multi String Search — find which of k patterns appear in T; canonical Aho-Corasick problem

Period / Rotation

Rotate String (LC 796) — check if S appears in S + S (one line with KMP)
String Compression (LC 443) — uses period/repetition structure

Additional LeetCode Coverage

Repeated String Match (LC 686)

Clarification Questions to Ask

Is the pattern fixed, or can it contain wildcards (?, *)? Wildcards require DP, not KMP.
Is the alphabet bounded (e.g., lowercase only) or arbitrary? Affects hash base and Boyer-Moore table size.
Do overlapping occurrences count as separate matches?
Is the text streamed character by character, or available all at once? Streaming favors KMP; offline allows suffix arrays.
Are there multiple patterns, or just one? One → KMP/Z; many → Aho-Corasick.
What defines a "match"? Exact / case-insensitive / with k mismatches?
Should the answer count positions or return the actual substrings?

Common Interview Mistakes

Using O(nm) naïve search for large inputs (n, m up to 10⁵) without recognising KMP is needed.
Implementing KMP but resetting j = 0 instead of j = π[j−1] after recording a match, making the algorithm miss overlapping occurrences.
Forgetting the sentinel when concatenating P and T for Z-algorithm, causing Z-values to cross the boundary and produce spurious match lengths.
Choosing a weak hash modulus (e.g., 10⁶) without double hashing and not verifying on collision, leading to wrong answers.
Reaching for KMP to solve "find all anagrams in a string" (LC 438) — the problem is order-insensitive and requires sliding window, not character-order matching.
Conflating "period length" (m − π[m−1]) with a valid period — omitting the divisibility check.

Typical Follow-Up Escalations

"Now allow k mismatches" → DP over (text_pos × mismatch_count), or bitap for small m.
"Now search for any of 1000 patterns simultaneously" → Aho-Corasick.
"Now find the longest repeated substring" → suffix array + LCP (interview-accessible: binary search + rolling hash).
"What if the text arrives as a stream, one character at a time?" → KMP handles this naturally; Z-algorithm requires the full string.
"Can you make the search phase use O(1) extra space?" → Boyer-Moore's search uses O(1) extra per character (the preprocessing tables are fixed); KMP uses O(m) for the π array.
"Now count distinct substrings" → suffix array + LCP: answer = n(n+1)/2 − Σ LCP[i].

Explore Unread

Great job! You've read all available articles

String Matching

Core Concepts

Types / Variants

Key Algorithms

Naïve / Brute Force

KMP (Knuth-Morris-Pratt)

Z-Algorithm

Rabin-Karp

Boyer-Moore

Aho-Corasick

Advanced Techniques

Aho-Corasick + DP

Binary Search + Rolling Hash for Repeated Substrings

Z-Function Structural Analysis

Problem Taxonomy

By Problem Category

Problem Signal → Technique

Common Patterns

Edge Cases & Pitfalls

Implementation Notes

Cross-Topic Relations

Interview Reference

Must-Know Problems

Clarification Questions to Ask

Common Interview Mistakes

Typical Follow-Up Escalations

Read Next

Explore Unread

String Matching

Core Concepts

Types / Variants

Key Algorithms

Naïve / Brute Force

KMP (Knuth-Morris-Pratt)

Z-Algorithm

Rabin-Karp

Boyer-Moore

Aho-Corasick

Advanced Techniques

Aho-Corasick + DP

Binary Search + Rolling Hash for Repeated Substrings

Z-Function Structural Analysis

Problem Taxonomy

By Problem Category

Problem Signal → Technique

Common Patterns

Edge Cases & Pitfalls

Implementation Notes

Cross-Topic Relations

Interview Reference

Must-Know Problems

Clarification Questions to Ask

Common Interview Mistakes

Typical Follow-Up Escalations

Read Next

Explore Unread