Rolling Hash
Rolling Hash is a technique for comparing fixed-length substrings in O(1) per step after O(n) preprocessing, by maintaining a hash value that updates incrementally as a window slides across a string or array. It converts exact substring comparison — naively O(m) per position — into an O(1) hash check with O(m) verification only on collision.
Core Concepts
Polynomial rolling hash — a hash of a string S[0..m−1] computed as S[0]·b^(m−1) + S[1]·b^(m−2) + … + S[m−1]·b^0 mod p, where b is a base larger than the alphabet size and p is a large prime. The polynomial structure enables O(1) removal of the leading character and addition of a trailing character.
Rolling update (slide by one) — given H = hash(S[i..i+m−1]), the hash of S[i+1..i+m] is (H − S[i] · b^(m−1)) · b + S[i+m], all mod p. The subtraction removes the outgoing character's contribution; multiplying by b shifts the remaining terms left; adding the new character appends it at degree 0.
Prefix hash array — an array H[0..n] where H[i] = S[0]·b^(i−1) + … + S[i−1]·b^0 mod p (with H[0] = 0). Enables O(1) hash of any substring S[l..r] as (H[r+1] − H[l] · b^(r−l+1)) mod p. Avoids recomputing from scratch for offline multi-length queries.
Hash collision — two distinct substrings producing the same hash value. Probability ≈ 1/p per comparison with a single modulus; with double hashing (two independent (base, mod) pairs) it drops to ≈ 1/(p₁·p₂).
Spurious match (false positive) — a window whose hash equals the pattern hash but whose content differs. Always verify with direct comparison when correctness is critical; count the expected number of spurious matches as n/p.
Base — must exceed the alphabet size: ≥ 27 for lowercase letters, ≥ 131 for general ASCII. Common choices: 31 (lowercase), 131 (ASCII).
Modulus — a large prime to minimise collisions. Common choices: 10⁹+7, 10⁹+9, 2³¹−1. Using a power-of-two modulus enables bitwise operations but increases collision risk with adversarial inputs.
Double hashing — maintaining two independent hash values (H₁ mod p₁, H₂ mod p₂) per window. A match requires both to agree, reducing false-positive rate to ≈ 1/(p₁·p₂) ≈ 10⁻¹⁸ for typical prime choices.
Power table — precomputed array pw[i] = b^i mod p. Required to evaluate b^(m−1) mod p in O(1) during the roll and to compute b^(r−l+1) for prefix hash queries.
Types / Variants
| Variant | What it is | When to use | Key property |
|---|---|---|---|
| Single hash (one base, one mod) | One polynomial hash per window | Default; fast, low implementation cost | False-positive rate ≈ 1/p; verify on match |
| Double hash | Two independent (base, mod) pairs maintained simultaneously | When false positives are unacceptable and verification is expensive | Rate ≈ 1/(p₁·p₂); virtually eliminates collisions in practice |
| Prefix hash array | Precomputed cumulative hash for offline substring queries | Multiple length queries on the same string; no sliding needed | O(n) build, O(1) per query of any substring |
| Fixed-window slide | Maintains one hash of a length-m window as it slides | Single pattern length, online left-to-right sweep | O(n) total; constant space beyond the power table |
| 2D rolling hash | Hash computed over rows then columns (or vice versa) | 2D pattern matching in a matrix | Compose two 1D hashes; collision rate is product of rates |
| Polynomial hash with negative values | Apply offset so all characters map to positive integers | Numeric arrays or mixed-sign inputs | Offset = abs(min_value) + 1; same formula applies |
Key Algorithms
Rabin-Karp Substring Search
Slides a length-m hash window across text T, comparing against the precomputed hash of pattern P. On hash match, verify the actual substring. O(n + m) average time (one hash computation + n slides), O(nm) worst case (all collisions with no verification short-circuit), O(1) extra space beyond the power table.
Key insight: the rolling update allows the hash to advance one character in O(1), making each position O(1) amortised instead of O(m).
h = pw = 0
for c in P: h = (h * BASE + ord(c)) % MOD; pw = pw * BASE % MOD # pw = BASE^m % MOD... (split: pw starts at 1, multiply m times)
After computing H_P and the initial H_W over T[0..m−1], the slide step is:
H_W = (H_W - ord(T[i]) * pw % MOD + MOD) * BASE % MOD + ord(T[i + m])
H_W %= MOD
The + MOD before the multiplication prevents negative intermediate results in languages without Python's arbitrary-precision integers.
Prefix Hash Array Build and Query
Precompute cumulative hashes so any substring hash is retrievable in O(1). O(n) build, O(1) per query, O(n) space.
H[0] = 0
for i, c in enumerate(S): H[i+1] = (H[i] * BASE + ord(c)) % MOD
Query hash of S[l..r] (0-indexed, inclusive):
hash_lr = (H[r+1] - H[l] * pw[r - l + 1]) % MOD
The pw array satisfies pw[k] = BASE^k % MOD. Negative results from subtraction are handled by Python's % automatically; in other languages add MOD before taking mod.
Double Hashing
Maintain two prefix hash arrays (or two rolling hashes) with different (BASE, MOD) pairs. A substring match requires both hashes to agree. No additional algorithmic steps — it is the same rolling or prefix structure applied twice in parallel. O(n) build, O(1) query, O(n) space (two arrays).
When to use over single hashing: when the input is adversarial (e.g., a judge with anti-hash tests), when the problem has large n (n ≥ 10⁵) and the expected number of single-hash collisions n/MOD is non-negligible, or when verification by direct comparison would cost O(m) and degrade worst-case performance.
Binary Search + Rolling Hash (Longest Duplicate Substring)
Binary search on the answer length L; for each candidate L, slide a length-L hash window over the string and check membership in a set. O(n log n) time, O(n) space.
Key insight: "does a substring of length L appear twice?" is decidable in O(n) with a hash set — binary search on L turns it into O(n log n). The same skeleton solves longest common substring of two strings (concatenate with a sentinel, track which string each hash came from).
lo, hi = 1, len(S) - 1
while lo <= hi:
mid = (lo + hi) // 2
if has_duplicate(S, mid): lo = mid + 1; best = mid
else: hi = mid - 1
The has_duplicate function runs a single O(n) hash-window pass. Use double hashing or verify candidates to avoid false positives causing incorrect binary search decisions.
Fixed-Window Anagram / Permutation Check
Slide a length-m window over T; maintain a rolling hash of the window's sorted characters — or, equivalently, use a character frequency vector instead of a positional hash. The frequency vector approach is O(1) per step with 26-element comparison. Rolling hash over sorted content requires re-sorting each step (not O(1)); the canonical solution uses frequency vectors.
Note: positional rolling hash is not suitable for order-insensitive matching (anagrams). Use frequency vectors or a frequency-based hash (sum of per-character hashes, order-independent). See sliding-window.md for the frequency-vector pattern.
2D Rolling Hash
Hash each row independently to produce a 1D array of row-hashes, then apply a second rolling hash over columns of that array. Enables O(1) hash of any submatrix after O(mn) preprocessing.
Mechanism: for a pattern of height r and width c, first hash every horizontal stripe of height r to a single value per column position, then slide a width-c window over those column values. Two independent hash dimensions keep the collision rate low.
Advanced Techniques
Binary Search + Rolling Hash for Longest Repeated / Common Substring
Solves: longest substring appearing ≥ k times; longest common substring of two or more strings; longest substring shared between two halves. Mechanism: binary search on length L; O(n) sliding hash pass per candidate stores all length-L hashes in a set (or checks for a repeat). Concatenate strings with a sentinel character (absent from both) for multi-string variants; filter set entries by source string. When to prefer over suffix array: when the problem only needs the length (not all positions), when implementation time is constrained, or when n ≤ 10⁶ makes the O(n log n) solution fast enough. Suffix arrays give all positions and support more complex queries; rolling hash binary search gives only the length/existence. Complexity: O(n log n) expected with O(1) verification probability; O(n log n · m) in the worst case if every candidate requires O(m) verification.
Hashing with Multiple Lengths (Multi-Pattern Rabin-Karp)
Solves: given k patterns of possibly different lengths, find all occurrences of any pattern in text T. Mechanism: group patterns by length; for each length group, compute a hash set of that group's hashes, then slide a window of that length over T and check membership. One pass per distinct pattern length. When to prefer over Aho-Corasick: when the number of distinct lengths is small (≤ log n), or when the patterns are given online and Aho-Corasick preprocessing is not feasible. For k patterns all of the same length, this is strictly O(n + km) vs. Aho-Corasick's O(Σ|patterns| + n); rolling hash wins when k is large and verification is rare. Complexity: O(n · distinct_lengths + total_pattern_length) expected.
See string-matching.md for Aho-Corasick and full multi-pattern coverage.
Problem Taxonomy
By Problem Category
| Category | What it asks | Key technique |
|---|---|---|
| Substring search | Does pattern P appear in text T? | Rabin-Karp single-pass hash; verify on match |
| Duplicate substring detection | Does any substring of length L appear twice? | Fixed-window hash set membership check |
| Longest duplicate substring | Maximum length L with a repeated substring | Binary search on L + rolling hash |
| Longest common substring | Longest substring shared by two strings | Concatenate with sentinel + binary search + hash |
| Repeated DNA / fixed-length repeats | All substrings of length k appearing ≥ 2 times | Hash set; collect hashes seen twice |
| Equal / matching windows | Are two windows of the same length identical? | Compare prefix hashes in O(1) |
| 2D pattern in matrix | Does a 2D pattern appear in a 2D grid? | 2D rolling hash (row hash then column hash) |
| Minimum window containing pattern | Smallest window whose hash matches | Rabin-Karp with multi-length scan |
Problem Signal → Technique
| Signal in problem | Likely technique |
|---|---|
| "find duplicate substrings of length k" | Rolling hash set over length-k windows |
| "longest substring appearing at least twice" | Binary search on length + rolling hash |
| "longest common substring of two strings" | Concatenate with sentinel + binary search + hash |
| "check if two substrings are equal" (multiple queries) | Prefix hash array, O(1) per query |
| "rolling hash" or "Rabin-Karp" mentioned explicitly | Polynomial rolling hash with slide update |
| "repeated DNA sequences" / fixed-length repeated window | Hash set on fixed-length windows |
| "string matching in O(n)" | Rabin-Karp (or KMP — check if pattern length matters) |
| "2D pattern in grid" | 2D rolling hash |
| multiple substring equality queries on the same string | Precompute prefix hash array; answer each in O(1) |
Common Patterns
| Pattern | When it applies | Mechanism |
|---|---|---|
| Precompute power table | Any prefix hash array usage | pw[0]=1; pw[i]=pw[i-1]*BASE%MOD; size n+1 |
| Sentinel-padded prefix hash | Offline multi-query substring comparison | H[0]=0; H[i+1]=(H[i]*BASE+ord(S[i]))%MOD; query via subtraction |
| Double hash maintenance | Adversarial inputs or critical correctness | Carry two (H1, H2) tuples through every operation; match only when both agree |
| Hash set for window membership | Detecting any repeat at a fixed length | Insert each window hash; check before inserting; seen = set() |
Rolling slide with +MOD guard | Single-modulus subtraction | H = (H - outgoing_contrib + MOD) * BASE % MOD + incoming |
| Binary search shell | Longest / minimum length satisfying a hash-checkable property | Binary search on L; call O(n) has_property(L) each iteration |
| Per-character hash for order-independent windows | Anagram hashing (not positional) | Assign each character a random value; window hash = XOR or sum of character hashes |
Edge Cases & Pitfalls
-
Negative subtraction result in rolling update. When
H - outgoing_contribgoes negative (in fixed-width integer languages), taking mod gives a negative result. The fix is(H - outgoing_contrib % MOD + MOD) % MOD. Python's%is always non-negative, so this is only a concern in Java/C++, but the+MODguard is a good habit regardless. -
Power table off-by-one. The rolling subtraction requires
BASE^m mod MOD(the power of the outgoing position, which ismcharacters back). A common mistake is usingBASE^(m−1). Derive the formula from scratch: ifH = S[i]·B^(m−1) + … + S[i+m−1]·B^0, then subtractingS[i]·B^(m−1)and multiplying by B givesS[i+1]·B^(m−1) + … + S[i+m−1]·B^1. -
Base too small for the alphabet. Using
BASE = 26for lowercase letters makesS[0]·26^kindistinguishable fromS[0]+1·26^kwhen carries wrap around mod p. Always chooseBASE > max(ord(c)) - ord('a'): 31 for lowercase, 131 for ASCII. -
Forgetting the sentinel in multi-string concatenation. When concatenating
S1 + '$' + S2to find common substrings, the sentinel must not appear in either string. A window spanning the boundary would produce a spurious match if no sentinel is present. Use a character outside the input alphabet. -
Binary search boundary on length. Searching for the longest duplicate substring:
lo = 1,hi = len(S) - 1(notlen(S)). A string of length n cannot have a duplicate substring of length n (it would require the string to appear in itself at a different offset, which is only possible for periodic strings). Settinghi = len(S)causes the loop to check an impossible length, wasting one O(n) call. -
False positive causing wrong binary search decision. If
has_duplicate(L)returnsTruedue to a hash collision when no real duplicate exists, the binary search movesloright and may report a longer length than actually exists. Double hashing or verifying the candidate on match eliminates this. -
Collision probability grows with n. With a single modulus of 10⁹+7 and n = 10⁵ windows, expected false positives ≈ 10⁵ / 10⁹ ≈ 0.0001 per run. For n = 10⁶ it's ≈ 0.001. Still small, but anti-hash test cases on judges can exploit a fixed (base, mod) pair — use random bases or double hashing for robustness.
-
Applying rolling hash to order-insensitive windows. Polynomial rolling hash is positional;
hash("ab") ≠ hash("ba"). For anagram detection, use a frequency-based hash (XOR or sum of random per-character values) or a frequency vector. Do not use positional rolling hash for anagram problems. -
Prefix hash query with wrong power index. The query
hash(S[l..r]) = H[r+1] - H[l] * BASE^(r-l+1)usespw[r-l+1], notpw[r-l]. The length of the substring isr - l + 1; that is the power needed to shiftH[l]to align withH[r+1].
Implementation Notes
- Python's arbitrary-precision integers mean subtraction never underflows, but intermediate values can grow before the
% MODis applied. Always reduce modulo at each step to avoid large intermediate products:(a * b) % MODinstead of computinga * bfirst for largea. pow(BASE, k, MOD)in Python computesBASE^k mod MODin O(log k) using fast modular exponentiation — use this to initialise the power table or compute a single power rather than looping.- Precompute
pw = [1] * (n + 1)withpw[i] = pw[i-1] * BASE % MODonce per string; do not recompute inside the sliding loop. - For the prefix hash array pattern, allocate
Hof sizelen(S) + 1to avoid index-shifting in the query formula. - Choosing a random base at runtime (e.g.,
BASE = random.randint(131, 10**6)) defeats fixed anti-hash tests on competitive programming judges without requiring double hashing. - Python's
setstores integers efficiently; hashing 64-bit integers into a set of size n uses approximately 8n bytes — feasible for n ≤ 10⁶. - For double hashing, store tuples
(h1, h2)as the set key; Python hashes tuples correctly and the collision rate is effectively zero.
Cross-Topic Relations
| Related topic | Connection |
|---|---|
| string-matching.md | Rabin-Karp is the rolling-hash-based string matching algorithm; KMP and Z-algorithm are deterministic alternatives without hash collisions |
| sliding-window.md | Rolling hash is a specialised sliding window where the "aggregate" is a hash value; anagram detection uses a frequency-vector sliding window instead |
| binary-search.md | Binary search on answer length is the primary hard-problem technique combining with rolling hash |
| prefix-sum.md | Prefix hash array mirrors the prefix sum construction: same sentinel-padded build, same subtraction-based range query |
| suffix-array.md | Suffix arrays solve the same repeated/common substring problems deterministically in O(n log n); rolling hash binary search is the probabilistic alternative with simpler implementation |
| hash-table.md | Rolling hash produces the keys stored in hash sets; understanding modular arithmetic and collision probability requires hash table foundations |
Interview Reference
Must-Know Problems
Repeated / duplicate substrings
- Repeated DNA Sequences (LC 187) — hash set on length-10 windows; canonical rolling hash warm-up
- Longest Duplicate Substring (LC 1044) — binary search + rolling hash; high-frequency hard problem
- Longest Repeating Substring (LC 1062) — same skeleton as 1044, premium variant
Substring equality queries
- Longest Common Subpath (LC 1923) — binary search + hash; multiple arrays
- String Hashing / rolling hash template problems (various OJ problems explicitly testing the technique)
Pattern search / matching
- Find the Index of the First Occurrence in a String (LC 28) — implement Rabin-Karp as an alternative to KMP
- Implement strStr() variants asking for O(n) expected time
Multi-string / 2D
- Equal Rational Numbers (LC 972) — hash-based string normalization
- Longest Common Substring of Two Strings (not a standalone LeetCode problem but appears as a sub-problem in many hards)
Clarification Questions to Ask
- Is the input alphabet bounded (lowercase letters, ASCII, arbitrary Unicode)? Determines the base choice.
- Are there multiple queries on the same string, or a single scan? Multiple queries → prefix hash array; single scan → sliding window.
- Is the answer a length, a count, or actual positions? Positions require storing indices alongside hashes.
- Does "substring" mean contiguous? (Rolling hash only applies to contiguous windows.)
- What is n? For n ≥ 10⁶, double hashing or random base selection is advisable to avoid anti-hash tests.
- Is an exact answer required, or is probabilistic correctness acceptable? (Affects whether verification is needed.)
Common Interview Mistakes
- Forgetting to add
MODbefore the subtraction step in the rolling update, producing negative hash values and wrong membership checks in non-Python languages. - Using a fixed (base, mod) pair from a textbook example, which is exploitable by anti-hash test cases on competitive programming judges — randomise the base or use two moduli.
- Applying positional rolling hash to anagram / permutation problems, where character order should not matter — the correct structure is a frequency vector or order-independent hash.
- Setting the binary search upper bound to
len(S)instead oflen(S) - 1, causing one unnecessary O(n) call and potential edge-case bugs. - Not verifying on hash match in problems where a single false positive changes the answer — the verification step is O(m) in the worst case but O(1) amortised across the full scan.
- Computing
BASE^(m-1)instead ofBASE^mfor the outgoing-character subtraction, shifting all subsequent hashes by a factor of BASE and causing every comparison to fail silently.
Typical Follow-Up Escalations
- "Now find all positions of the duplicate substring, not just the length." → Store
{hash: [start_indices]}in the hash set; collect all positions with the winning hash. - "Now handle up to 1000 pattern queries on the same text." → Build a prefix hash array once; answer each query in O(m) (hash the pattern, compare against the prefix array window).
- "What if the text arrives as a stream, one character at a time?" → Use the sliding rolling hash (not the prefix array); maintain the current window hash and the outgoing-character power incrementally.
- "Prove your solution is correct, not just probably correct." → Switch to Z-algorithm or KMP for deterministic O(n + m); rolling hash is a probabilistic technique.
- "Now find the longest substring appearing in all k strings." → Binary search on L; for each L build a hash set per string; intersect the sets; O(nk log n) total.
- "Can you reduce space to O(1)?" → Not possible with the prefix hash array; the sliding window approach uses O(1) space beyond the power value, but loses the ability to answer arbitrary substring queries.
Read Next
Start reading to get personalized recommendations
Explore Unread
Great job! You've read all available articles