Hash Function
A hash function is a deterministic mapping h: key → integer used to assign keys to buckets (in hash tables), compare substrings in O(1), detect duplicates without storing originals, and create composite keys from multi-field objects. This file covers the design and use of hash functions as a coding technique. For the hash table data structure built on top of them, see hash-table.md. For rolling hash applied to string search algorithms (KMP, Rabin-Karp), see string-matching.md.
Core Concepts
Hash function — a deterministic function h(key) → [0, m) mapping a key to a bucket index. Determinism (same input always produces same output) is the single mandatory invariant; everything else is a quality measure.
Uniformity — a hash function distributes keys evenly across all m buckets. Poor uniformity causes clustering: O(n) lookup in the worst bucket even though average is O(1).
Avalanche effect — a small change in the input (one bit, one character) causes roughly half the output bits to change. Hash functions without this property cluster similar keys near the same bucket.
Collision — two distinct keys produce the same hash value. Collisions are unavoidable by the pigeonhole principle whenever the key space exceeds m. The hash function controls how often they occur; the hash table controls how to handle them.
Collision rate — for a uniform hash over m buckets with n keys, the expected number of collisions is n(n-1)/(2m) (birthday paradox). Keeping n << m (low load factor) keeps this small.
Modular arithmetic — standard technique to reduce an arbitrary integer hash to [0, m): h(k) mod m. Choosing m as a prime reduces patterns introduced by the key's binary structure.
Polynomial hash — for a string or sequence s[0..n-1], the value s[0]·p^(n-1) + s[1]·p^(n-2) + … + s[n-1] mod M where p is a base prime and M is a large prime modulus. Polynomial hashes are the standard hash function for strings and tuples in competitive programming and interview problems.
Rolling hash — a polynomial hash of a fixed-length window that updates in O(1) by subtracting the outgoing term and adding the incoming term. Enables O(n) scanning of all length-k substrings.
Prefix hash array — a precomputed array h[0..n] where h[i] is the polynomial hash of s[0..i-1]. Allows O(1) computation of the hash of any substring s[l..r-1] via the formula (h[r] - h[l] * p^(r-l)) % M.
Hash of a composite key — when a key is a tuple, record, or multi-field object, its hash is typically computed by combining field hashes (e.g., polynomial combination). In Python, hash(tuple(fields)) is the idiomatic approach.
Custom hashable type — a user-defined class that implements __hash__ and __eq__ consistently. Python requires that objects that compare equal must have equal hashes (a == b → hash(a) == hash(b)); the converse need not hold.
Anti-hash (adversarial) input — an input set constructed so that all keys collide in a fixed-seed hash table, degrading every operation to O(n). Competitive programming judges sometimes use this to break solutions relying on Python's built-in set/dict.
Random base hashing — defense against adversarial input: choose the polynomial base and/or modulus randomly at runtime, making it infeasible for an adversary to predict the hash values.
Double hashing — computing two independent hashes with different (base, modulus) pairs. A false positive (collision in one hash) requires collision in both; probability drops from ~1/M to ~1/(M₁·M₂).
Perfect hash — a hash function with zero collisions for a known, fixed key set. Constructible in O(n) expected time using FKS two-level hashing or CHD algorithm. Gives O(1) worst-case lookup for static tables.
Minimal perfect hash — a perfect hash where m = n (no wasted buckets). Used in compiler symbol tables and read-only lookup tables.
Types / Variants
| Variant | What it is | When to use it over alternatives | Key property |
|---|---|---|---|
| Division hash | h(k) = k mod m | Integer keys; m is a known prime | Simplest; m must be prime and not near a power of 2 |
| Polynomial hash (string) | Horner-evaluated polynomial over characters mod M | String/sequence keys; need prefix hash array | O(n) build, O(1) substring query; p=31 or p=131 |
| Rolling / sliding hash | Polynomial hash updated in O(1) as window slides | Fixed-length window comparisons; anagram, duplicate substring | Avoids recomputing from scratch; remove old + add new |
| Double hash | Two independent (base, mod) pairs | When collision probability must be near 0; competitive programming | P(false positive) ≈ 1/(M₁·M₂) ≈ 10⁻¹⁸ |
| Tuple hash | hash(tuple(fields)) | Composite keys (coordinate, multi-field object) | Built-in to Python; O(1) lookup with frozenset/tuple as key |
| FNV / djb2 hash | Bitwise mixing hash (XOR + multiply or add + shift) | Custom hash for non-string primitives outside Python | Fast; no modular reduction needed for 32/64-bit output |
| Universal hash | ((a·k + b) mod p) mod m; a, b chosen randomly | Prevent adversarial inputs; provable O(1) expected | Any two distinct keys collide with probability ≤ 1/m |
| Zobrist hash | XOR of random values assigned to (position, piece) pairs | Board game states; incremental hash of set operations | O(1) update when one element changes; XOR is its own inverse |
Key Algorithms
Polynomial Hash Construction
Build a prefix hash array over string or sequence s in O(n). Allows O(1) substring hash queries.
The base p must exceed the alphabet size (31 for lowercase, 131 for ASCII). M must be a large prime (10⁹+7 or 10⁹+9 are standard choices).
h[0] = 0
for i, c in enumerate(s):
h[i+1] = (h[i] * p + ord(c)) % M
Substring hash for s[l:r]: (h[r] - h[l] * pow(p, r-l, M)) % M. Time O(n) build, O(1) per query.
Rolling Hash Update (Sliding Window)
Maintain the hash of a fixed-length window of size k without recomputing from scratch. O(1) per slide after O(k) initial computation.
Remove the leftmost character (multiply its contribution by p^(k-1) and subtract), then add the new rightmost character. The precomputed value p_k = pow(p, k, M) stays constant throughout.
window = (window * p + ord(s[i])) % M # add right
window = (window - ord(s[i-k]) * p_k) % M # remove left
Correctness: (window - x * p_k) % M can be negative in languages with signed modulo; add M before taking mod if needed. Python's % is always non-negative.
Double Hashing for Near-Zero Collision Probability
Run two independent polynomial hashes simultaneously with different (base, modulus) pairs. Combine as a tuple (h1, h2) used as the dictionary key or compared for equality.
Probability of a false positive drops from 1/M to 1/(M₁·M₂). For M₁ = M₂ = 10⁹+7, this is ~10⁻¹⁸ per comparison — negligible for n ≤ 10⁶ strings.
Use when a single hash causes wrong answers on adversarial test data, or in competitive programming where judges are known to include anti-hash inputs.
Universal Hashing
Choose a, b uniformly at random at program start: h(k) = ((a*k + b) % p) % m where p is a prime ≥ universe size. The family {h_{a,b}} is 2-universal: for any two distinct keys x, y, P[h(x) == h(y)] ≤ 1/m over the random choice of a, b.
In Python: random.randint to pick the base at import time; use as the polynomial base for string hashing to defeat adversarial inputs against fixed-base solutions.
Zobrist Hashing
Assign a uniformly random 64-bit integer to each (position, value) pair at program start. The hash of a state is the XOR of all random values for occupied (position, value) pairs.
table[pos][val] = random.getrandbits(64) # initialise once
h ^= table[pos][val] # toggle: XOR again to remove
Key insight: XOR is its own inverse, so removing an element is the same operation as adding it. O(1) update when one element changes; no recomputation over the whole state.
Applies to: board game hashing (chess, Go), set-membership hashing, detecting duplicate board positions during search.
Designing __hash__ for Custom Classes
Implement __hash__ to return a deterministic integer for any instance; implement __eq__ to define equality. Python's contract: a == b must imply hash(a) == hash(b).
Simplest correct approach: combine all fields that participate in __eq__ into a tuple and delegate to the built-in hash.
def __hash__(self): return hash((self.x, self.y, self.label))
def __eq__(self, o): return (self.x, self.y, self.label) == (o.x, o.y, o.label)
Mutable classes should not implement __hash__ (Python sets __hash__ = None by default when __eq__ is defined without __hash__, making instances unhashable).
Hash Map from Scratch (Chaining)
An array of m buckets, each a list of (key, value) pairs. Hash determines the bucket; linear scan within the bucket handles collisions.
The non-trivial parts: choosing m as a prime, resizing when load factor exceeds 0.75 (double m and rehash all entries), and using the modulus-prime trick h(k) = hash(k) % m where hash is Python's built-in (or a custom polynomial hash for strings).
See hash-table.md for the full hash table data structure with all collision strategies, invariants, and operations.
Hash Set from Scratch (Open Addressing / Linear Probing)
Array of m slots, each None (empty), a tombstone sentinel, or a key. find_slot(k) probes (h(k) + i) % m for i = 0, 1, … until empty or matching slot.
Critical: deletions must write a tombstone (not set to empty) to preserve the probe chain for keys that were inserted past the deleted slot.
See hash-table.md for detailed open addressing variants and the tombstone invariant.
Advanced Techniques
Prefix Hash + Binary Search for Longest Repeated Substring
Solves: "find the longest substring that appears at least twice" (or k times). O(n log n). Mechanism: binary search on the answer length L; for each candidate L, compute all n−L+1 substring hashes using the prefix array and check for duplicates via a hash set. Hash match triggers string comparison to rule out false positives. When to use over suffix arrays: when implementing a full suffix array + LCP array is too complex for an interview context, or when the problem only asks for existence or count, not all occurrence positions. Suffix arrays give O(n) and handle all repeated-substring queries; rolling hash binary search is simpler but O(n log n). Complexity: O(n log n) time expected; O(n) space.
See string-matching.md for the full Rabin-Karp and suffix-array context.
Polynomial Hash for Anagram / Permutation Equivalence
Solves: "are these two strings permutations of each other?", "find all windows that are permutations of P".
Mechanism: compute a order-independent hash by summing (not polynomial-chaining) the hashes of individual characters: h(s) = Σ hash(c) for c in s. Equal multisets produce equal sums. Use with sliding window: subtract outgoing character's hash, add incoming. Verify with frequency map on collision.
When to use over frequency array: frequency array is O(26) to compare two windows — fast enough for ASCII. Use hash when the "alphabet" is large (arbitrary integers, Unicode) and a 26-slot array is insufficient.
Complexity: O(n) time, O(1) space per window update.
Composite Key Hashing for Multi-Dimensional Grouping
Solves: group elements by a composite property (e.g., same difference, same character-count fingerprint, same bounding box). Mechanism: reduce the multi-dimensional property to a single hashable key — a tuple, frozenset, or polynomial hash of sorted fields. Use as the dict key directly. When to use over sorting: sorting by the composite key is O(n log n) and requires a total order that may not be natural (e.g., sorting by character-frequency vector). Hashing is O(n) and works for any equivalence class, not just orderable ones. Complexity: O(n) time (assuming O(1) key computation per element), O(n) space.
Problem Taxonomy
By Problem Category
| Category | What it asks | Key technique |
|---|---|---|
| Substring comparison | Are two substrings equal in O(1)? | Prefix hash array; compare h(l1, r1) == h(l2, r2) |
| Duplicate substring detection | Does any substring of length k appear twice? | Rolling hash over length-k windows; hash set |
| Longest repeated substring | Longest substring appearing ≥2 times | Binary search on length + rolling hash |
| Anagram / permutation window | Find all windows in T that are permutations of P | Sum-hash or frequency array; sliding window |
| Composite key grouping | Group by multi-field equivalence | Tuple or frozenset key; hash map of lists |
| Custom hashable object | Use class instance as dict/set key | Implement __hash__ + __eq__; delegate to hash(tuple(fields)) |
| Hash map / hash set design | Implement from scratch | Array of buckets + chaining or open addressing |
| Anti-adversarial hashing | Avoid O(n) worst-case from bad inputs | Random base; universal hash family |
| Board / state deduplication | Have we visited this game state? | Zobrist hash; incremental XOR update |
Problem Signal → Technique
| Signal in problem | Likely technique |
|---|---|
| "compare substrings in O(1)" / "check if substrings are equal" | Prefix hash array |
| "find duplicate / repeated substring of length k" | Rolling hash + hash set |
| "longest duplicate substring" | Binary search on length + rolling hash |
| "find all windows that are anagrams of P" | Sliding window with frequency map or sum-hash |
| "group by some equivalence that is hard to compare directly" | Tuple/frozenset canonical key |
| "design a custom object to be used as a dictionary key" | __hash__ + __eq__ on immutable tuple of fields |
| "implement hash map / hash set from scratch" | Array + chaining (simpler) or linear probing (cache-friendly) |
| "adversarial test cases" / "hack" / "random seed" mentioned | Universal or random-base hashing |
| "detect repeated board states" / "cycle detection in game tree" | Zobrist hash |
| "O(1) check if multiset has changed" | Sum of hashed elements; XOR hash |
Common Patterns
| Pattern | When it applies | Mechanism |
|---|---|---|
| Prefix hash array | Static string; need many O(1) substring comparisons | Build h[i+1] = (h[i]*p + ord(s[i])) % M; query via subtraction |
| Rolling window hash | Sliding fixed-length window over a sequence | Subtract old left term * p^k; add new right term; maintain running hash |
| Double hash pair | Single hash has unacceptable collision risk | Two independent (p, M) pairs; use (h1, h2) as the map key |
| Sum hash for order-independence | Comparing multisets (anagrams, permutations) | h = Σ hash(c); commutative, so order does not matter |
| Tuple/frozenset canonical key | Grouping by unordered or multi-field equivalence | key = tuple(sorted(fields)) or frozenset(chars); use as dict key |
| Random base at runtime | Defeating adversarial fixed-hash attacks | p = random.choice([large_prime_1, large_prime_2]) at module load |
| Zobrist incremental update | State-space search with expensive full recomputation | XOR in new piece, XOR out old piece; O(1) per move |
hash(tuple(fields)) delegation | Custom __hash__ for a multi-field class | Delegate to built-in hash on a tuple of the fields used in __eq__ |
Edge Cases & Pitfalls
-
Negative modulo result in rolling hash —
(window - outgoing * p_k) % Mis always non-negative in Python but can be negative in C++/Java. Failing to add M before taking mod produces a negative "hash" that never matches positive hashes, causing all lookups to miss. Write((window - outgoing * p_k) % M + M) % Min non-Python code. -
Base too small — using
p = 1(trivially) orp = 2(power of 2, maps all characters to few values) causes many collisions even for short strings. Base must strictly exceed the alphabet size and be prime. -
Modulus too small — M = 10⁶ gives a ~50% birthday-paradox collision for n ≈ 1000 strings. Use M ≥ 10⁹; use two hashes when M² headroom is needed.
-
Defining
__eq__without__hash__— Python automatically sets__hash__ = Noneon any class that defines__eq__without__hash__, making all instances unhashable. Both methods must be defined together. -
Inconsistent
__eq__and__hash__— ifa == bbuthash(a) != hash(b), the object is found in equality checks but will never be located in a dict or set (it is hashed to a different bucket). This causes silent logical errors, not exceptions. -
Mutable keys — modifying an object after using it as a dict key changes its hash (if
__hash__uses mutable state) but does not update its bucket placement. The object becomes permanently unreachable in the dict. Never mutate keys; use immutable types (tuple, frozenset, str) as keys. -
Hash of
None/False/0/""conflicts — Python:hash(0) == hash(False) == hash(0.0)and0 == False == 0.0. Using mixed-type keys with the same value in a dict merges them unintentionally. Coerce keys to a consistent type. -
Polynomial hash collision in competitive programming — a determined adversary can construct strings that all hash to the same value for any fixed (p, M). Randomise p at runtime; avoid publicising the base.
-
Substring hash formula off-by-one — the correct formula for
s[l:r](0-indexed, exclusive r) is(h[r] - h[l] * pow(p, r-l, M)) % M. Usingh[r-1]orh[l+1]instead ofh[r]orh[l]shifts the window by one character and produces wrong hashes. Verify on a length-1 substring:h[i+1] - h[i] * p^1must equalord(s[i]). -
Forgetting to verify on hash collision (Rabin-Karp) — hash equality implies equal strings only with high probability, not certainty. Always compare the actual substrings character by character when a hash match is found. Omitting this verification causes wrong answers on collision inputs.
-
pow(p, k, M)vsp**k % M— for large k,p**koverflows memory before modular reduction in Python (Python integers are arbitrary precision but very slow). Always use the three-argumentpow(p, k, M)form, which applies modular exponentiation in O(log k).
Implementation Notes
- Python's
hash()is randomised per process by default (PYTHONHASHSEEDenvironment variable). Never storehash(x)to disk or compare across processes; use a custom polynomial hash for persistent or cross-process hashing. hash(tuple)in Python combines field hashes using an internal SipHash-based mixing function; it is safe for use as__hash__in custom classes.frozensetis hashable and can be a dict key; its hash is order-independent (XOR-based internally). Use it when the key is a set of values rather than a sequence.collections.defaultdictandCounterboth use Python's built-in dict internally; the O(1) average guarantee comes from the underlying hash table, not the wrapper.- Python's
pow(base, exp, mod)uses fast modular exponentiation (O(log exp)); use it for computingp^k mod M— never write(p**k) % M. - For competitive programming in Python, consider using a random prime from a precomputed list as the modulus to avoid hash-cracking. A list of large primes:
[10**9+7, 10**9+9, 998244353]. - When implementing a hash map from scratch, initialise the bucket array to
None(not[]) and create lists lazily to avoid allocating O(m) empty lists upfront. - Python
setanddictresize when 2/3 full (load factor threshold 0.67). Preallocate withdict.fromkeys(keys)if the final size is known to avoid mid-run resizing.
Cross-Topic Relations
| Related topic | Specific connection |
|---|---|
| hash-table.md | Hash function feeds directly into hash table; collision handling, load factor, and rehashing are covered there |
| string-matching.md | Rabin-Karp rolling hash applied to pattern search; double hashing and collision analysis covered there |
| strings.md | Polynomial hash enables O(n) solutions to many string comparison, palindrome, and duplicate problems |
| prefix-sum.md | Prefix hash array is the string analogue of prefix sums — same precompute-once, query-O(1) structure |
| sliding-window.md | Rolling hash is a sliding window hash; pairs with two-pointer technique for window scanning |
| suffix-array.md | Suffix arrays are the O(n) alternative to binary search + rolling hash for repeated-substring problems |
| number-theory.md | Prime selection for base and modulus; modular inverse for hash formula; birthday paradox collision analysis |
| arrays.md | Underlying storage for hash tables; index arithmetic for open addressing |
Interview Reference
Must-Know Problems
Polynomial / Prefix Hash
- Longest Duplicate Substring (LC 1044, Hard) — binary search on length + rolling hash; canonical hard problem for this technique
- Repeated DNA Sequences (LC 187, Medium) — rolling hash or sliding set over length-10 windows
- Find Duplicate File in System (LC 609, Medium) — hash file content as key; group by hash
- Longest Happy Prefix (LC 1392, Medium) — uses KMP failure function, but polynomial hash gives an alternative O(n) approach
Anagram / Permutation Detection via Hashing
- Find All Anagrams in a String (LC 438, Medium) — sum-hash or frequency array; sliding window
- Group Anagrams (LC 49, Medium) — sorted-tuple key or character-count-tuple key; multimap grouping
- Valid Anagram (LC 242, Easy) — Counter equality or 26-int frequency array
Custom Hashable Keys and Composite Grouping
- Line Reflection (LC 356, Medium) — group points by mid-x using a hash set of (mid_x, y) tuples
- Number of Boomerangs (LC 447, Medium) — hash map keyed by distance from pivot
- Isomorphic Strings (LC 205, Easy) / Word Pattern (LC 290, Easy) — bidirectional hash maps
- 4Sum II (LC 454, Medium) — hash map of pair sums; composite key lookup
Designing Hash Structures from Scratch
- Design HashMap (LC 706, Easy) — implement with array + chaining
- Design HashSet (LC 705, Easy) — implement with array + chaining or bitset
- LRU Cache (LC 146, Medium) — hash map + doubly linked list; tests combined data structure design
Zobrist / State Hashing
- Find Winner on a Tic Tac Toe Game (LC 1275, Easy) — simple state encoding; foundation for Zobrist
- (Competitive) Detecting repeated board states in combinatorial game search — canonical Zobrist use case
Clarification Questions to Ask
- Are the keys strings, integers, or composite objects? (Determines hash strategy: polynomial, division, or tuple delegation.)
- Is the alphabet bounded (e.g., lowercase only) or arbitrary Unicode? (Affects base choice for polynomial hash.)
- Must duplicate detection be exact (no false positives), or is probabilistic acceptable? (Exact → verify on collision; probabilistic → single hash is fine.)
- Are keys mutable? (Mutable keys cannot safely be used in a hash set/map — need to convert to immutable form.)
- Is the input adversarially constructed? (If yes, use random base or double hashing.)
- Do we need O(1) worst-case, or is O(1) average sufficient? (Worst-case requires perfect hashing or cuckoo hashing; average is standard chaining/open addressing.)
- Is the hash required to be stable across runs or machines? (Python's built-in
hashis not; use custom polynomial hash for portability.)
Common Interview Mistakes
- Using Python's built-in
hash()for persistent storage or cross-process comparison, not realising it changes across interpreter runs due toPYTHONHASHSEEDrandomisation. - Writing
p**k % Minstead ofpow(p, k, M), which is correct but extremely slow for large k due to Python computing the fullp**kbignum before reducing. - Forgetting to verify after a hash collision in Rabin-Karp, causing wrong answers on adversarial inputs designed to collide the hash.
- Defining
__eq__on a custom class without defining__hash__, silently making all instances unhashable and triggeringTypeErroronly when the object is inserted into a set or used as a dict key. - Using a list or dict as a dict key, getting
TypeError: unhashable type: 'list'; fix by converting totupleorfrozenset. - Treating
hash(a) == hash(b)as proof thata == b(hash equality is necessary but not sufficient for object equality). - Choosing a polynomial base less than the alphabet size (e.g., p = 26 for lowercase letters), which maps multiple characters to the same hash value and inflates collision rate dramatically.
Typical Follow-Up Escalations
| Follow-up | Expected answer |
|---|---|
| "Your solution has hash collisions on this test case — fix it" | Switch to double hashing with two independent (base, mod) pairs; or randomise the base at runtime |
| "Now make the hash persistent across runs / machines" | Replace hash() with a custom polynomial hash using a fixed seed; hash() is PYTHONHASHSEED-dependent |
| "What if keys are adversarially chosen to all collide?" | Universal hashing: pick a, b uniformly at random; any two keys collide with probability ≤ 1/m |
| "Can you do this with O(1) space per window?" | Rolling hash — O(1) update per step after O(k) initial computation; only the running hash and p^k need to be stored |
| "How would you handle a very large alphabet (e.g., Unicode)?" | Polynomial hash still works; increase base beyond 128; use ord(c) directly as the coefficient |
| "Design a hash function for a Point(x, y) class" | hash((x, y)) delegates to Python's built-in tuple hash; or hash(x * 31 + y) for a manual polynomial combination |
| "What if the same board state can be reached by different move sequences?" | Zobrist hash: XOR-based incremental state hash; collision probability ~1/2⁶⁴ per comparison |
Read Next
Start reading to get personalized recommendations
Explore Unread
Great job! You've read all available articles