String Matching
String Matching covers algorithms for locating all occurrences of a pattern P (length m) within a text T (length n). The central challenge is avoiding redundant comparisons — every algorithm below exploits precomputed structure to skip work the naïve approach repeats.
Core Concepts
Pattern — the string P being searched for; length m. Text — the string T being searched in; length n. Occurrence / match — a position i in T where T[i..i+m-1] == P. Shift — how far right to advance P after a mismatch or match. Prefix function (failure function) — for KMP: π[i] = length of the longest proper prefix of P[0..i] that is also a suffix; drives shifts without backtracking. Z-array — Z[i] = length of the longest substring starting at position i in S that matches a prefix of S; computed on the concatenated string S = P + '$' + T; positions where Z[i] == m are match locations. Rolling hash — a hash of a fixed-length window that updates in O(1) by removing the outgoing character and adding the incoming one; used in Rabin-Karp. Failure link (Aho-Corasick) — for each trie node v, a pointer to the node representing the longest proper suffix of the string at v that is also a prefix of some pattern; enables multi-pattern matching in a single text pass. Bad character / good suffix (Boyer-Moore) — two independent shift heuristics computed from P; allow skipping large portions of T after a mismatch. Output link — in Aho-Corasick, a shortcut from a node to the nearest ancestor (via failure links) that is a terminal node; required to report all matches, not just the longest.
Types / Variants
| Variant | What it is | When to prefer |
|---|---|---|
| Exact single-pattern | Find all positions where P appears verbatim in T | Default; KMP or Z-algorithm |
| Exact multi-pattern | Find all positions of any pattern from a set | Aho-Corasick; or Rabin-Karp with a hash set for short patterns |
| Approximate / fuzzy | Allow up to k mismatches or edit-distance k | DP over (text_pos × mismatch_count); bitap for small m |
| Wildcard pattern | '?' matches any single char, '*' matches any substring | DP (not KMP/Z); see dynamic-programming.md |
| Circular / rotation | P may appear as a cyclic rotation of T | Concatenate T + T, then search with KMP |
| Repeated-period finding | Find shortest period of a string | KMP failure function: period = m − π[m−1] |
| Anagram / permutation search | Find all windows of length m that are permutations of P | Sliding window + frequency array; not a string matching algorithm |
| 2D pattern matching | Find a 2D pattern matrix inside a 2D text matrix | Rabin-Karp applied row-then-column |
Key Algorithms
Naïve / Brute Force
Align P at every position i in T and compare character by character. O(nm) time, O(1) space. Acceptable for m ≤ 100 or n ≤ 10³; every algorithm below improves on this by exploiting previously seen information.
KMP (Knuth-Morris-Pratt)
Finds all occurrences of P in T in O(n + m) time, O(m) space.
Key insight: after a mismatch at pattern position j, the characters T[i−j..i−1] are already known to match P[0..j−1]. The prefix function encodes the longest proper prefix of P[0..j−1] that is also a suffix — that prefix is what we can resume matching from without moving the text pointer backwards.
Failure function construction:
j = 0
for i in range(1, m):
while j > 0 and P[i] != P[j]: j = pi[j-1]
if P[i] == P[j]: j += 1
pi[i] = j
The while j > 0 fallback chain is the non-obvious part; it follows the failure links until a matching character is found or j reaches 0.
Search: maintain j = number of characters matched so far in P. On mismatch, fall back via π[j−1]. When j == m, record match at i − m + 1 and reset j = π[j−1] (not 0 — overlapping matches are valid).
Period finding: shortest period of P = m − π[m−1]; this is a valid repeating unit iff m % (m − π[m−1]) == 0.
Z-Algorithm
Computes the Z-array for S in O(|S|) time. For string matching, build S = P + '$' + T; every i > m where Z[i] == m is a match at text position i − m − 1. Same asymptotic complexity as KMP; the Z-array formulation is often more natural for prefix-matching problems.
Key insight: maintain the rightmost Z-box [l, r] (a substring S[l..r] that matches S[0..r−l]). For position i ≤ r, Z[i] ≥ min(r − i, Z[i − l]); extend naively beyond r and update [l, r] when a longer box is found.
if i < r:
Z[i] = min(r - i, Z[i - l])
# then extend: while i + Z[i] < n and S[Z[i]] == S[i + Z[i]]: Z[i] += 1
The sentinel '$' prevents Z-values at positions in the text portion from exceeding m by crossing the boundary — it must not appear in P or T.
Rabin-Karp
Uses polynomial rolling hash. Computes hash(P) once, then slides a length-m window across T, updating the hash in O(1) per step. On a hash match, verify the actual substring. O(n + m) average, O(nm) worst case (all collisions). O(1) extra space.
Key insight: hash(T[i+1..i+m]) = (hash(T[i..i+m−1]) − T[i] · base^(m−1)) · base + T[i+m], all mod a prime.
Multi-pattern use: store hashes of all patterns in a set; a window match against the set in O(1) triggers verification. Effective when k patterns share length m.
Choose base > alphabet size (31 for lowercase, 131 for ASCII); use a large prime modulus (10⁹ + 7). Double hashing (two independent (base, mod) pairs) reduces collision probability to ~1/MOD².
Boyer-Moore
Two heuristics applied simultaneously:
- Bad character: on mismatch at T[i+j] vs P[j], shift P right so the rightmost occurrence of T[i+j] in P[0..j−1] aligns with i+j (or shift past i+j entirely if T[i+j] is absent from P).
- Good suffix: after matching P[j+1..m−1], shift P so a copy of that matched suffix (or the longest prefix of P that is a suffix of it) realigns.
O(nm) worst case (e.g., P = "aaa…a" in T = "aaa…a"), O(n/m) best case on large alphabets. Preprocessing O(m + σ). In practice the fastest algorithm for large alphabets and natural-language text; rarely used in competitive programming due to implementation complexity.
Aho-Corasick
Builds a trie of all k patterns, then augments it with failure links (exactly KMP's prefix function, but over the trie structure). Processes T once in O(n + total_matches) after O(Σ|patterns| · σ) preprocessing. Reports all occurrences of all patterns simultaneously.
Failure link construction (BFS):
- Root's failure link points to root.
- For node v reached from parent p via character c: fail[v] = trie.go(fail[p], c), where
go(node, c)follows failure links fromnodeuntil a c-transition exists or reaches root. - Output links: if fail[v] is a terminal node, output_link[v] = fail[v]; otherwise output_link[v] = output_link[fail[v]].
Text traversal: maintain current trie node; for each character in T, follow the trie (or failure links if no transition); at each node, traverse output links to emit all matches.
Advanced Techniques
Aho-Corasick + DP
Solves: count (or enumerate) strings of length n built from an alphabet that must contain, or must avoid, every word from a given dictionary. Mechanism: DP state = (characters placed so far, current Aho-Corasick node). Transition: append one character, advance through trie + failure links, mark states that pass through terminal nodes as forbidden or required. Prefer over running k separate KMP+DP passes when there are 2+ patterns and the patterns can co-occur; the AC automaton fuses them into a single state machine. Complexity: O(n · |trie_states| · σ) time, O(|trie_states|) space.
Binary Search + Rolling Hash for Repeated Substrings
Solves: longest substring appearing at least twice (or k times), longest common substring of two strings, longest substring with some frequency property. Mechanism: binary search on the answer length L; for a given L, slide a length-L hash window across the string and check for hash collision in a set. O(n log n) total with O(1) per window. Prefer over suffix array when the problem only needs existence, not all occurrences, and when implementing suffix arrays is not warranted. Complexity: O(n log n) with high-probability correctness; O(n log n · m) if verification is required per candidate.
Z-Function Structural Analysis
Solves: finding all positions where a prefix of P reappears within P itself, detecting the minimal period, splitting a string into its repeating unit, or building the failure function from the Z-array. Mechanism: compute Z on P alone (not concatenated with a text). Z[i] == m − i means P[i:] is a suffix that also equals a prefix → every such i is a "border" start. Period = smallest i > 0 where Z[i] + i == m (or fallback to m). Prefer over KMP's π when the problem asks for prefix-match lengths at every position simultaneously, or when the Z-array's semantics (absolute match length vs. relative suffix length) are cleaner for the formulation. Complexity: O(m).
Problem Taxonomy
By Problem Category
| Category | What it asks | Key technique |
|---|---|---|
| Single occurrence check | Does P appear in T? | in / str.find for simple cases; KMP for custom implementations |
| All occurrences | Return every start index | KMP or Z-algorithm |
| Multi-pattern search | Which of k patterns appear in T, and where? | Aho-Corasick |
| Repeated / duplicate substring | Longest or most-frequent substring | Binary search + rolling hash; suffix array + LCP |
| Minimal period / rotation | Shortest repeating unit; is T a rotation of P? | KMP failure function; search in T+T |
| String construction avoiding words | Count length-n strings not containing any word | Aho-Corasick + DP |
| Approximate matching | Minimum edits to embed P in T | DP (edit distance); see dynamic-programming.md |
| 2D matching | Find a 2D pattern in a 2D grid | Rabin-Karp row-then-column |
Problem Signal → Technique
| Signal in problem | Likely technique |
|---|---|
| "find all occurrences of pattern in text" | KMP or Z-algorithm |
| "shortest period" / "repeating unit" / "smallest period" | KMP failure function: period = m − π[m−1] |
| "is S a rotation of T?" | Check if S appears in T + T |
| "k patterns, find any/all in text" | Aho-Corasick |
| "how many strings of length n avoid these words" | Aho-Corasick + DP |
| "find duplicate / longest repeated substring" | Rabin-Karp binary search or suffix array |
| "match with wildcards '?' or '*'" | DP, not string matching algorithms |
| "find all anagrams / permutations of P in T" | Sliding window + frequency array |
| "longest common substring of two strings" | Suffix array + LCP or DP |
| "rolling hash" mentioned or implied by O(1) window comparison | Rabin-Karp |
| large alphabet, pattern much shorter than text, offline | Boyer-Moore |
| "minimum number of times you can search a text with multiple queries" | Aho-Corasick (build once, query once) |
Common Patterns
| Pattern | When it applies | Mechanism |
|---|---|---|
| Build on self (Z / π) | Need prefix-suffix overlap info for P alone | Run Z or KMP on P itself — no text concatenation |
| Sentinel concatenation | Search P in T without cross-boundary Z-values | S = P + '$' + T; use a char absent from both strings |
| Prefix hash array | O(1) substring hash after O(n) preprocessing | h[i] = h[i−1] · base + ord(T[i]); query via subtraction |
| Double hashing | Reduce collision probability to ~1/MOD² | Maintain two independent (base, mod) pairs; both must collide for a false positive |
| Binary search + hash set | Find longest substring satisfying a frequency/count condition | Binary search on length L; O(n) hash-window pass per candidate L |
| Failure-link BFS (Aho-Corasick) | Multiple patterns, one text pass | Build trie; BFS assigns fail[v] using parent's fail; propagate output links in same BFS |
| Period from π | Structural property derivable from failure function | period = m − π[m−1]; valid iff m % period == 0 |
| KMP on reversed / concatenated string | Palindrome or rotation problems | Shortest palindrome: KMP on P + '$' + reverse(P) to find longest palindromic prefix |
Edge Cases & Pitfalls
- Empty pattern or empty text: KMP and Z return immediately; rolling hash has undefined window. Guard at entry — return
0(empty pattern matches everywhere) or[](empty text has no matches) per problem definition. - Pattern longer than text (m > n): no match possible; return early to avoid negative-length window arithmetic in rolling hash.
- Overlapping matches: KMP handles this correctly only if you reset
j = π[j−1](notj = 0) after recording a match. Resetting to 0 misses the second "aa" in "aaa" when searching for "aa". - Sentinel must be absent from P and T: if '$' appears in the input (e.g., URL strings), choose a different sentinel or use integer concatenation (append
-1between integer-encoded arrays). - Rolling hash subtraction underflow: in languages with fixed-size integers,
(a - b) % MODcan be negative. Always write(a - b % MOD + MOD) % MOD. Python's%is always non-negative, so this is a Java/C++ concern only. - Rabin-Karp single modulus collisions: with MOD = 10⁹+7 and n = 10⁵, expected ~10⁻⁴ false-positive rate per query — usually acceptable, but always verify on collision if correctness is critical.
- KMP π off-by-one: π[0] = 0 always; the loop starts at i = 1 with j = 0. A common mistake is entering the loop at i = 0, which sets π[0] incorrectly.
- Boyer-Moore bad-character initialisation: characters not in P should produce a shift that moves the alignment past the current position. Initialise the table to -1 (last occurrence = "before the pattern"), so shift = j - (-1) = j + 1.
- Aho-Corasick missing output links: a node may match a short pattern reachable only via failure links, not directly. Skipping output-link traversal causes missed matches for patterns that are suffixes of longer matches.
- Aho-Corasick on varying-length patterns: after a full match at a terminal node, still follow output links — a shorter pattern may also end at this text position.
- Period validity check: m − π[m−1] is always a candidate period length, but it is the true period only when m % (m − π[m−1]) == 0. Otherwise the string has no period shorter than m.
Implementation Notes
- Python's
str.find()andinuse an optimised internal search (a Sunday/Boyer-Moore-Horspool variant) — for simple existence checks on Python strings, don't reimplement KMP. - Python's
re.search()with a fixed string uses full NFA simulation;str.findis faster for literal patterns. - KMP exists in both 0-indexed and 1-indexed forms in textbooks; the search phase index arithmetic differs between them — pick one and be consistent throughout.
collections.dequeis required for Aho-Corasick BFS; list withpop(0)is O(n) per call and will TLE on large tries.- Polynomial hash base must exceed the alphabet size: ≥ 27 for lowercase letters, ≥ 128 for ASCII, ≥ 131 is a common safe choice.
- KMP failure function is iterative by nature; do not implement it recursively — Python's default recursion limit (1000) is too small for m > 1000.
- For Z-algorithm: after computing Z on S = P + '$' + T, match positions in T are those indices i in [m+1, |S|) where Z[i] == m; the corresponding text index is i − m − 1.
- Aho-Corasick trie nodes are typically represented as dicts (sparse alphabet) or arrays of size σ (dense alphabet); array is faster but uses 26× more memory per node.
Cross-Topic Relations
| Related topic | Connection |
|---|---|
| strings.md | String matching is a sub-domain of string problems; character manipulation, palindromes, and anagrams are covered there |
| suffix-array.md | Suffix arrays + LCP arrays solve repeated-substring and longest-common-substring problems that go beyond what KMP/Z handle; prerequisite for competitive-level string matching |
| trie.md | Aho-Corasick is a trie augmented with failure links; trie insert/search operations are a prerequisite |
| dynamic-programming.md | Approximate matching (edit distance, wildcard */?) uses DP; Aho-Corasick + DP combines both paradigms |
| sliding-window.md | Anagram and permutation search — frequently confused with string matching — use sliding window + frequency array, not KMP |
| hash-table.md | Rabin-Karp is hashing applied to strings; collision analysis requires understanding modular arithmetic and hash set operations |
Interview Reference
Must-Know Problems
KMP / Z-Algorithm
- Find the Index of the First Occurrence in a String (LC 28) — baseline KMP implementation
- Repeated Substring Pattern (LC 459) — period via failure function
- Shortest Palindrome (LC 214) — KMP on P + '$' + reverse(P) to find longest palindromic prefix
- Longest Happy Prefix (LC 1392) — directly constructs the failure function as the answer
Rabin-Karp / Rolling Hash
- Longest Duplicate Substring (LC 1044) — binary search + rolling hash; high-frequency hard problem
- Repeated DNA Sequences (LC 187) — hash-set on length-10 windows
- Rabin-Karp String Matching (LC 28 variant) — implement Rabin-Karp explicitly
Multi-Pattern / Aho-Corasick
- Word Search II (LC 212) — trie on board; failure-link mindset applies even without full AC
- Multi String Search — find which of k patterns appear in T; canonical Aho-Corasick problem
Period / Rotation
- Rotate String (LC 796) — check if S appears in S + S (one line with KMP)
- String Compression (LC 443) — uses period/repetition structure
Additional LeetCode Coverage
- Repeated String Match (LC 686)
Clarification Questions to Ask
- Is the pattern fixed, or can it contain wildcards (
?,*)? Wildcards require DP, not KMP. - Is the alphabet bounded (e.g., lowercase only) or arbitrary? Affects hash base and Boyer-Moore table size.
- Do overlapping occurrences count as separate matches?
- Is the text streamed character by character, or available all at once? Streaming favors KMP; offline allows suffix arrays.
- Are there multiple patterns, or just one? One → KMP/Z; many → Aho-Corasick.
- What defines a "match"? Exact / case-insensitive / with k mismatches?
- Should the answer count positions or return the actual substrings?
Common Interview Mistakes
- Using O(nm) naïve search for large inputs (n, m up to 10⁵) without recognising KMP is needed.
- Implementing KMP but resetting
j = 0instead ofj = π[j−1]after recording a match, making the algorithm miss overlapping occurrences. - Forgetting the sentinel when concatenating P and T for Z-algorithm, causing Z-values to cross the boundary and produce spurious match lengths.
- Choosing a weak hash modulus (e.g., 10⁶) without double hashing and not verifying on collision, leading to wrong answers.
- Reaching for KMP to solve "find all anagrams in a string" (LC 438) — the problem is order-insensitive and requires sliding window, not character-order matching.
- Conflating "period length" (m − π[m−1]) with a valid period — omitting the divisibility check.
Typical Follow-Up Escalations
- "Now allow k mismatches" → DP over (text_pos × mismatch_count), or bitap for small m.
- "Now search for any of 1000 patterns simultaneously" → Aho-Corasick.
- "Now find the longest repeated substring" → suffix array + LCP (interview-accessible: binary search + rolling hash).
- "What if the text arrives as a stream, one character at a time?" → KMP handles this naturally; Z-algorithm requires the full string.
- "Can you make the search phase use O(1) extra space?" → Boyer-Moore's search uses O(1) extra per character (the preprocessing tables are fixed); KMP uses O(m) for the π array.
- "Now count distinct substrings" → suffix array + LCP: answer = n(n+1)/2 − Σ LCP[i].
Read Next
Start reading to get personalized recommendations
Explore Unread
Great job! You've read all available articles