Suffix Array
Type: Data Structure LeetCode problems: ~20–30 (< 50)
Core Concepts
Suffix — a substring starting at position i and ending at the last character; string s of length n has exactly n suffixes.
Suffix Array (SA) — integer array of length n where SA[i] is the starting index of the i-th lexicographically smallest suffix; encodes all suffixes in sorted order without storing them explicitly.
Rank Array (RA / ISA) — inverse of SA: RA[SA[i]] = i, so RA[j] is the sorted rank of the suffix starting at position j; enables O(1) suffix comparison.
LCP Array — integer array where LCP[i] = length of the longest common prefix between SA[i-1] and SA[i] for i ≥ 1; LCP[0] is undefined (0 by convention). Critical for substring queries.
LCP between arbitrary suffixes — lcp(SA[i], SA[j]) = min(LCP[i+1 .. j]) when i < j; this is a range minimum query on the LCP array and is the key operation enabling most advanced applications.
Suffix Tree — a compressed trie of all suffixes that encodes the same information as SA + LCP but with higher constant factors. SA + LCP is preferred in practice.
See trie.md for trie background.
Suffix Automaton (SAM) — a DAG of states representing all substrings; supports online (incremental) construction unlike SA.
See suffix-automaton.md for full coverage.
Structural Invariants
| Property | Invariant | What breaks if violated |
|---|---|---|
| Sorted order | s[SA[0]:] < s[SA[1]:] < … < s[SA[n-1]:] lexicographically | Binary search for pattern matching returns a wrong range |
| LCP consistency | LCP[i] equals the exact shared prefix length of s[SA[i-1]:] and s[SA[i]:] | All LCP-based queries return wrong answers |
| Rank is inverse of SA | RA[SA[i]] = i for all i | O(1) rank lookups are wrong; prefix doubling produces incorrect order |
| Completeness | Every index 0..n-1 appears exactly once in SA | Some suffixes uncounted; occurrence queries miss entries |
Internal Representation
SA: integer array of length n storing starting indices. Storing suffixes explicitly would cost O(n²) space; SA uses O(n).
LCP array: integer array of length n (LCP[0] unused). Always built alongside SA using Kasai's algorithm.
Rank array: integer array of length n; maintained as a working buffer during construction, then rebuilt from the final SA by a single pass.
Sparse table on LCP: O(n log n) preprocessing for O(1) range minimum queries on LCP, needed for arbitrary lcp(i, j) queries.
Construction only retains SA and LCP as outputs; the rank array is a construction artifact rebuilt on demand.
Types / Variants
| Variant | Time | Space | Use when |
|---|---|---|---|
| Prefix Doubling (Manber-Myers) | O(n log n) with radix sort | O(n) | Default choice; straightforward to implement and fast in practice |
| DC3 / Skew | O(n) | O(n) | Theoretical O(n) needed; complex, almost never used in interviews |
| SA-IS | O(n) | O(n) | Best practical O(n) implementation; used in libdivsufsort and CP libraries |
Prefix doubling is the standard choice for interviews and most competitive programming.
Topic-Specific Operations
Construction — Prefix Doubling
Sort suffixes by doubling the comparison key length each round. Round 0: rank by single character. Round k: sort by (rank[i], rank[i + 2^k]). After log n rounds all ranks are unique. Use radix sort per round for O(n log n) total; comparison sort gives O(n log² n).
LCP Array Construction — Kasai's Algorithm
Given SA and RA, compute LCP in O(n). Key insight: if lcp(SA[RA[i]-1], SA[RA[i]]) = h, then lcp(SA[RA[i+1]-1], SA[RA[i+1]]) ≥ h-1. Walk suffixes in string order; reuse the previous LCP value minus 1 as a starting point, amortizing total work to O(n).
Pattern Matching
Binary search on SA for the leftmost and rightmost position where pattern p is a prefix of s[SA[k]:]. All occ occurrences are at indices SA[lo], SA[lo+1], …, SA[hi-1].
Time: O(m log n) where m = len(p).
Arbitrary LCP Query
lcp(i, j) = RMQ(LCP, RA[i]+1, RA[j]) (assuming RA[i] < RA[j]). Requires a sparse table on the LCP array.
Time: O(1) per query after O(n log n) preprocessing.
Number of Distinct Substrings
Total substrings = n*(n+1)/2. Each LCP[i] counts how many substrings of SA[i] are already represented by SA[i-1].
Formula: n*(n+1)/2 − sum(LCP[1..n-1]).
Key Algorithms
Prefix Doubling (Manber-Myers)
Iteratively refines suffix ranks using the observation that a length-2^k comparison can be assembled from two length-2^(k-1) ranks. After log n rounds, ranks uniquely identify each suffix and SA is derived directly.
- O(n log n) with radix sort; the invariant after round k is that
rank[i]reflects the sorted order ofs[i : i + 2^k]
Key line: sort by pair (rank[i], rank[i + gap] if i + gap < n else -1), then reassign integer ranks by comparing adjacent pairs.
Kasai's LCP Algorithm
Constructs the LCP array in O(n) by iterating over suffixes in string order rather than sorted order. Starting LCP from the previous iteration, decrement by at most 1 each step, so total increments across all iterations ≤ n.
- O(n) time; run immediately after building SA
h = 0
for i in range(n):
if ra[i] > 0:
j = sa[ra[i] - 1]
while i + h < n and j + h < n and s[i+h] == s[j+h]:
h += 1
lcp[ra[i]] = h
if h: h -= 1
Longest Common Substring of Two Strings
Concatenate s1 + sep + s2 where sep is a character below both alphabets (e.g., chr(0)). Build SA + LCP on the combined string. Scan LCP array: the maximum LCP[i] where SA[i-1] and SA[i] come from different original strings is the answer.
- O((n+m) log(n+m)) construction + O(n+m) scan
Longest Repeated Substring
Maximum value in the LCP array after construction. The substring is s[SA[i] : SA[i] + LCP[i]] at the maximizing index.
- O(n) after SA + LCP construction
Problem Taxonomy
By Problem Category
| Category | What it asks | Key technique |
|---|---|---|
| Pattern search | All occurrences of pattern p | Binary search on SA |
| Distinct substrings | Count unique substrings | n*(n+1)/2 − sum(LCP) |
| Longest repeated substring | Longest substring appearing ≥ 2 times | Max of LCP array |
| Longest common substring | Longest shared substring of 2+ strings | Concatenate with separators, scan LCP |
| Lexicographic k-th substring | Find the k-th distinct substring | Walk SA using (n − SA[i]) − LCP[i] new-substring counts |
| Suffix comparison | Compare two suffixes in O(1) | RA lookup; ties impossible since all suffixes are distinct |
| Shortest unique substring | Shortest substring appearing exactly once | SA + LCP; for each suffix find min gap to both neighbors |
Problem Signal → Technique
| Signal in problem statement | Likely technique |
|---|---|
| "all occurrences of pattern" | Binary search on SA |
| "number of distinct substrings" | n*(n+1)/2 − sum(LCP) |
| "longest repeated substring" | Max LCP value |
| "longest common substring" of two strings | Concatenate + SA + LCP scan |
| "compare suffixes efficiently" | RA array + sparse table RMQ |
| "lexicographically smallest/largest rotation" | SA on s + s |
| "k-th lexicographically smallest substring" | Walk SA with LCP-gap counting |
| "shortest unique substring" | SA + LCP, find suffix with largest min-neighbor-LCP |
Common Patterns
| Pattern | When it applies | Mechanism |
|---|---|---|
| Separator concatenation | Query spans two strings | Join s1 + chr(0) + s2; separator sorts before all real chars, never matches |
| Source-check LCP scan | Find longest common substring | Walk sorted SA, track which original string each SA entry belongs to, record max LCP between adjacent different-source entries |
| Sparse table RMQ on LCP | Arbitrary suffix LCP needed | Precompute sparse table; lcp(i,j) = min(LCP[RA[i]+1 .. RA[j]]) |
| Binary search on SA | Pattern matching | Binary search for leftmost and rightmost suffix with p as prefix |
| k-th substring via LCP gaps | Enumerate distinct substrings in order | SA[i] contributes (n − SA[i]) − LCP[i] new substrings; walk until cumulative count ≥ k |
| Sliding window over SA | Common substring across k strings | Window over sorted SA ensuring all k string sources are represented; shrink window when min LCP in range exceeds boundary |
Edge Cases & Pitfalls
-
Separator character: must be strictly smaller than every character in the input alphabet.
chr(0)is safe for any printable string;'#'fails if the input can contain'#'. Wrong separator causes cross-string matches to be counted as within-string repeats. -
LCP[0] access:
LCP[0]is undefined (no preceding suffix). All LCP scans must start at index 1; reading LCP[0] as a real value produces a spurious result. -
Stale rank array: the working rank buffer used during prefix doubling reflects intermediate state, not the final sorted order. Rebuild
RAfrom the finalSAwith a single O(n) pass before answering any rank-based query. -
O(n log² n) TLE: using Python's
sorted()in prefix doubling gives O(n log² n). For n > 10^5, implement counting sort per round or switch to SA-IS. -
Empty string or single character: SA =
[0], LCP =[0]. The distinct-substrings formula and max-LCP queries both return 0 — correct, but guard againstmax([])errors in code. -
All identical characters: LCP array values are
n-1, n-2, …, 1, 0; sum is O(n²). The distinct-substrings formula still works (result = n), but intermediatesum(LCP)can overflow 32-bit integers for large n. -
Two strings with same length: when checking "different source" in the separator-concatenation pattern, the separator occupies index
len(s1), so an index belongs tos2if it is> len(s1), not≥ len(s1). -
Rank vs. SA confusion:
SA[rank] = positionandRA[position] = rank. Swapping them in RMQ or binary search produces silently wrong results with no obvious runtime error.
Implementation Notes
- Python has no built-in suffix array. For n ≤ 2×10^4, the O(n log² n) prefix-doubling with
sorted()is acceptable. Above that, implement radix sort or use a C extension. sys.stdout.reconfigure(encoding="utf-8")at the top of any script printing Unicode output (project standard).- Kasai's algorithm requires a fully finalized SA and a correctly derived RA. Passing a stale rank array from mid-construction is a common silent bug.
- Sparse table for RMQ: precompute
log2as an integer array to avoid floating-pointmath.logcalls inside the query loop. - SA indices are 0-based in almost all implementations; mixing 0-based and 1-based indexing between SA and LCP is the most common source of off-by-one bugs.
- When concatenating strings with a separator, convert to a list of integers (ord values) rather than strings; this avoids Python string slicing overhead and makes the separator comparison explicit.
Cross-Topic Relations
| Related topic | Connection |
|---|---|
| String Hashing / Rolling Hash | Alternative O(n log n) approach for single-pattern search; SA is strictly more powerful for multi-query scenarios |
| Trie | Suffix trie is the conceptual precursor; SA + LCP encodes an implicit suffix tree in O(n) space |
| Segment Tree / Sparse Table | RMQ on LCP array for O(1) arbitrary LCP queries; see segment-tree.md |
| Binary Search | Pattern matching on SA is a direct application; see binary-search.md |
| Z-Algorithm / KMP | Linear-time alternatives for single-pattern matching; SA generalizes to arbitrary substring queries and multiple patterns |
| Dynamic Programming | Some SA problems reduce to DP over the sorted suffix sequence (e.g., k-th distinct substring) |
| Sliding Window | Used in the k-string common-substring window scan over sorted SA |
Interview Reference
Must-Know Problems
Longest Repeated / Distinct Substrings
- Longest Duplicate Substring (LC 1044) — binary search + rolling hash is the standard approach; SA + max(LCP) is the direct construction
- Number of Distinct Substrings (CSES 2402) —
n*(n+1)/2 − sum(LCP)
Multi-String
- Longest Common Substring of Two Strings (LC 718 is solved by DP, but the SA approach generalizes to k strings)
- Longest Common Substring of k strings — concatenate with distinct separators, sliding window over SA maintaining coverage of all k sources
Pattern Matching
- Find all occurrences of a pattern — binary search on SA, return
SA[lo..hi-1] - Shortest Unique Substring — for each suffix, compute
min(LCP with left neighbor, LCP with right neighbor) + 1
Lexicographic
- Lexicographically Smallest Rotation (LC 1163) — SA on
s + s, first entry withSA[i] < n - k-th Lexicographically Smallest Distinct Substring — walk SA with LCP-gap counting
Clarification Questions to Ask
- What is the character set? Affects separator character choice and whether radix sort can be tuned.
- Single query or multiple queries on the same string? Multiple queries justify O(n log n) preprocessing + O(1) or O(log n) per query.
- Are the strings static? SA is a static structure; updates require full O(n log n) reconstruction.
- What are the length constraints? n ≤ 10³: brute force fine; n ≤ 10^5: O(n log n) SA; n > 10^6: SA-IS required.
- Does "substring" mean contiguous? Always verify — some problems intend "subsequence."
Common Interview Mistakes
- Building SA but not the LCP array, then computing LCP by direct string comparison (O(n) per query instead of O(1) with sparse table).
- Using
'#'as a separator when the input alphabet is not guaranteed to exclude it; the separator must be provably outside the alphabet. - Confusing SA and RA:
SA[i]gives the starting position of the rank-i suffix;RA[j]gives the rank of the suffix at position j. Swapping them produces silent wrong answers. - Off-by-one:
LCP[i]compares SA[i-1] and SA[i]; valid fori ∈ [1, n-1]. ReadingLCP[0]as a valid comparison crashes or returns garbage. - Not rebuilding RA after the final SA pass; the working rank buffer from prefix doubling may not equal the inverse of the final SA.
Typical Follow-Up Escalations
- "Find all occurrences in O(m + log n) instead of O(m log n)" → use LCP-guided binary search to avoid re-scanning the matched prefix at each step.
- "Now handle k strings" → concatenate all with distinct separators; sliding window over SA tracks how many distinct string sources appear in the current LCP-bounded window.
- "Support online updates" → SA requires full reconstruction; suggest suffix automaton, which supports incremental O(n) amortized construction.
- "What if the alphabet is huge (Unicode)?" → replace character-based radix sort with comparison sort or hash-based rank assignment; O(n log² n) becomes acceptable.
Read Next
Start reading to get personalized recommendations
Explore Unread
Great job! You've read all available articles