Hash Function

A hash function is a deterministic mapping h: key → integer used to assign keys to buckets (in hash tables), compare substrings in O(1), detect duplicates without storing originals, and create composite keys from multi-field objects. This file covers the design and use of hash functions as a coding technique. For the hash table data structure built on top of them, see hash-table.md. For rolling hash applied to string search algorithms (KMP, Rabin-Karp), see string-matching.md.

Core Concepts

Hash function — a deterministic function h(key) → [0, m) mapping a key to a bucket index. Determinism (same input always produces same output) is the single mandatory invariant; everything else is a quality measure.

Uniformity — a hash function distributes keys evenly across all m buckets. Poor uniformity causes clustering: O(n) lookup in the worst bucket even though average is O(1).

Avalanche effect — a small change in the input (one bit, one character) causes roughly half the output bits to change. Hash functions without this property cluster similar keys near the same bucket.

Collision — two distinct keys produce the same hash value. Collisions are unavoidable by the pigeonhole principle whenever the key space exceeds m. The hash function controls how often they occur; the hash table controls how to handle them.

Collision rate — for a uniform hash over m buckets with n keys, the expected number of collisions is n(n-1)/(2m) (birthday paradox). Keeping n << m (low load factor) keeps this small.

Modular arithmetic — standard technique to reduce an arbitrary integer hash to [0, m): h(k) mod m. Choosing m as a prime reduces patterns introduced by the key's binary structure.

Polynomial hash — for a string or sequence s[0..n-1], the value s[0]·p^(n-1) + s[1]·p^(n-2) + … + s[n-1] mod M where p is a base prime and M is a large prime modulus. Polynomial hashes are the standard hash function for strings and tuples in competitive programming and interview problems.

Rolling hash — a polynomial hash of a fixed-length window that updates in O(1) by subtracting the outgoing term and adding the incoming term. Enables O(n) scanning of all length-k substrings.

Prefix hash array — a precomputed array h[0..n] where h[i] is the polynomial hash of s[0..i-1]. Allows O(1) computation of the hash of any substring s[l..r-1] via the formula (h[r] - h[l] * p^(r-l)) % M.

Hash of a composite key — when a key is a tuple, record, or multi-field object, its hash is typically computed by combining field hashes (e.g., polynomial combination). In Python, hash(tuple(fields)) is the idiomatic approach.

Custom hashable type — a user-defined class that implements __hash__ and __eq__ consistently. Python requires that objects that compare equal must have equal hashes (a == b → hash(a) == hash(b)); the converse need not hold.

Anti-hash (adversarial) input — an input set constructed so that all keys collide in a fixed-seed hash table, degrading every operation to O(n). Competitive programming judges sometimes use this to break solutions relying on Python's built-in set/dict.

Random base hashing — defense against adversarial input: choose the polynomial base and/or modulus randomly at runtime, making it infeasible for an adversary to predict the hash values.

Double hashing — computing two independent hashes with different (base, modulus) pairs. A false positive (collision in one hash) requires collision in both; probability drops from ~1/M to ~1/(M₁·M₂).

Perfect hash — a hash function with zero collisions for a known, fixed key set. Constructible in O(n) expected time using FKS two-level hashing or CHD algorithm. Gives O(1) worst-case lookup for static tables.

Minimal perfect hash — a perfect hash where m = n (no wasted buckets). Used in compiler symbol tables and read-only lookup tables.

Types / Variants

Variant	What it is	When to use it over alternatives	Key property
Division hash	`h(k) = k mod m`	Integer keys; m is a known prime	Simplest; m must be prime and not near a power of 2
Polynomial hash (string)	Horner-evaluated polynomial over characters mod M	String/sequence keys; need prefix hash array	O(n) build, O(1) substring query; p=31 or p=131
Rolling / sliding hash	Polynomial hash updated in O(1) as window slides	Fixed-length window comparisons; anagram, duplicate substring	Avoids recomputing from scratch; remove old + add new
Double hash	Two independent (base, mod) pairs	When collision probability must be near 0; competitive programming	P(false positive) ≈ 1/(M₁·M₂) ≈ 10⁻¹⁸
Tuple hash	`hash(tuple(fields))`	Composite keys (coordinate, multi-field object)	Built-in to Python; O(1) lookup with frozenset/tuple as key
FNV / djb2 hash	Bitwise mixing hash (XOR + multiply or add + shift)	Custom hash for non-string primitives outside Python	Fast; no modular reduction needed for 32/64-bit output
Universal hash	`((a·k + b) mod p) mod m`; a, b chosen randomly	Prevent adversarial inputs; provable O(1) expected	Any two distinct keys collide with probability ≤ 1/m
Zobrist hash	XOR of random values assigned to (position, piece) pairs	Board game states; incremental hash of set operations	O(1) update when one element changes; XOR is its own inverse

Key Algorithms

Polynomial Hash Construction

Build a prefix hash array over string or sequence s in O(n). Allows O(1) substring hash queries.

The base p must exceed the alphabet size (31 for lowercase, 131 for ASCII). M must be a large prime (10⁹+7 or 10⁹+9 are standard choices).

h[0] = 0
for i, c in enumerate(s):
    h[i+1] = (h[i] * p + ord(c)) % M

Substring hash for s[l:r]: (h[r] - h[l] * pow(p, r-l, M)) % M. Time O(n) build, O(1) per query.

Rolling Hash Update (Sliding Window)

Maintain the hash of a fixed-length window of size k without recomputing from scratch. O(1) per slide after O(k) initial computation.

Remove the leftmost character (multiply its contribution by p^(k-1) and subtract), then add the new rightmost character. The precomputed value p_k = pow(p, k, M) stays constant throughout.

window = (window * p + ord(s[i])) % M          # add right
window = (window - ord(s[i-k]) * p_k) % M      # remove left

Correctness: (window - x * p_k) % M can be negative in languages with signed modulo; add M before taking mod if needed. Python's % is always non-negative.

Double Hashing for Near-Zero Collision Probability

Run two independent polynomial hashes simultaneously with different (base, modulus) pairs. Combine as a tuple (h1, h2) used as the dictionary key or compared for equality.

Probability of a false positive drops from 1/M to 1/(M₁·M₂). For M₁ = M₂ = 10⁹+7, this is ~10⁻¹⁸ per comparison — negligible for n ≤ 10⁶ strings.

Use when a single hash causes wrong answers on adversarial test data, or in competitive programming where judges are known to include anti-hash inputs.

Universal Hashing

Choose a, b uniformly at random at program start: h(k) = ((a*k + b) % p) % m where p is a prime ≥ universe size. The family {h_{a,b}} is 2-universal: for any two distinct keys x, y, P[h(x) == h(y)] ≤ 1/m over the random choice of a, b.

In Python: random.randint to pick the base at import time; use as the polynomial base for string hashing to defeat adversarial inputs against fixed-base solutions.

Zobrist Hashing

Assign a uniformly random 64-bit integer to each (position, value) pair at program start. The hash of a state is the XOR of all random values for occupied (position, value) pairs.

table[pos][val] = random.getrandbits(64)   # initialise once
h ^= table[pos][val]   # toggle: XOR again to remove

Key insight: XOR is its own inverse, so removing an element is the same operation as adding it. O(1) update when one element changes; no recomputation over the whole state.

Applies to: board game hashing (chess, Go), set-membership hashing, detecting duplicate board positions during search.

Designing `hash` for Custom Classes

Implement __hash__ to return a deterministic integer for any instance; implement __eq__ to define equality. Python's contract: a == b must imply hash(a) == hash(b).

Simplest correct approach: combine all fields that participate in __eq__ into a tuple and delegate to the built-in hash.

def __hash__(self): return hash((self.x, self.y, self.label))
def __eq__(self, o): return (self.x, self.y, self.label) == (o.x, o.y, o.label)

Mutable classes should not implement __hash__ (Python sets __hash__ = None by default when __eq__ is defined without __hash__, making instances unhashable).

Hash Map from Scratch (Chaining)

An array of m buckets, each a list of (key, value) pairs. Hash determines the bucket; linear scan within the bucket handles collisions.

The non-trivial parts: choosing m as a prime, resizing when load factor exceeds 0.75 (double m and rehash all entries), and using the modulus-prime trick h(k) = hash(k) % m where hash is Python's built-in (or a custom polynomial hash for strings).

See hash-table.md for the full hash table data structure with all collision strategies, invariants, and operations.

Hash Set from Scratch (Open Addressing / Linear Probing)

Array of m slots, each None (empty), a tombstone sentinel, or a key. find_slot(k) probes (h(k) + i) % m for i = 0, 1, … until empty or matching slot.

Critical: deletions must write a tombstone (not set to empty) to preserve the probe chain for keys that were inserted past the deleted slot.

See hash-table.md for detailed open addressing variants and the tombstone invariant.

Advanced Techniques

Prefix Hash + Binary Search for Longest Repeated Substring

Solves: "find the longest substring that appears at least twice" (or k times). O(n log n). Mechanism: binary search on the answer length L; for each candidate L, compute all n−L+1 substring hashes using the prefix array and check for duplicates via a hash set. Hash match triggers string comparison to rule out false positives. When to use over suffix arrays: when implementing a full suffix array + LCP array is too complex for an interview context, or when the problem only asks for existence or count, not all occurrence positions. Suffix arrays give O(n) and handle all repeated-substring queries; rolling hash binary search is simpler but O(n log n). Complexity: O(n log n) time expected; O(n) space.

See string-matching.md for the full Rabin-Karp and suffix-array context.

Polynomial Hash for Anagram / Permutation Equivalence

Solves: "are these two strings permutations of each other?", "find all windows that are permutations of P". Mechanism: compute a order-independent hash by summing (not polynomial-chaining) the hashes of individual characters: h(s) = Σ hash(c) for c in s. Equal multisets produce equal sums. Use with sliding window: subtract outgoing character's hash, add incoming. Verify with frequency map on collision. When to use over frequency array: frequency array is O(26) to compare two windows — fast enough for ASCII. Use hash when the "alphabet" is large (arbitrary integers, Unicode) and a 26-slot array is insufficient. Complexity: O(n) time, O(1) space per window update.

Composite Key Hashing for Multi-Dimensional Grouping

Solves: group elements by a composite property (e.g., same difference, same character-count fingerprint, same bounding box). Mechanism: reduce the multi-dimensional property to a single hashable key — a tuple, frozenset, or polynomial hash of sorted fields. Use as the dict key directly. When to use over sorting: sorting by the composite key is O(n log n) and requires a total order that may not be natural (e.g., sorting by character-frequency vector). Hashing is O(n) and works for any equivalence class, not just orderable ones. Complexity: O(n) time (assuming O(1) key computation per element), O(n) space.

Problem Taxonomy

By Problem Category

Category	What it asks	Key technique
Substring comparison	Are two substrings equal in O(1)?	Prefix hash array; compare `h(l1, r1) == h(l2, r2)`
Duplicate substring detection	Does any substring of length k appear twice?	Rolling hash over length-k windows; hash set
Longest repeated substring	Longest substring appearing ≥2 times	Binary search on length + rolling hash
Anagram / permutation window	Find all windows in T that are permutations of P	Sum-hash or frequency array; sliding window
Composite key grouping	Group by multi-field equivalence	Tuple or frozenset key; hash map of lists
Custom hashable object	Use class instance as dict/set key	Implement `__hash__` + `__eq__`; delegate to `hash(tuple(fields))`
Hash map / hash set design	Implement from scratch	Array of buckets + chaining or open addressing
Anti-adversarial hashing	Avoid O(n) worst-case from bad inputs	Random base; universal hash family
Board / state deduplication	Have we visited this game state?	Zobrist hash; incremental XOR update

Problem Signal → Technique

Signal in problem	Likely technique
"compare substrings in O(1)" / "check if substrings are equal"	Prefix hash array
"find duplicate / repeated substring of length k"	Rolling hash + hash set
"longest duplicate substring"	Binary search on length + rolling hash
"find all windows that are anagrams of P"	Sliding window with frequency map or sum-hash
"group by some equivalence that is hard to compare directly"	Tuple/frozenset canonical key
"design a custom object to be used as a dictionary key"	`__hash__` + `__eq__` on immutable tuple of fields
"implement hash map / hash set from scratch"	Array + chaining (simpler) or linear probing (cache-friendly)
"adversarial test cases" / "hack" / "random seed" mentioned	Universal or random-base hashing
"detect repeated board states" / "cycle detection in game tree"	Zobrist hash
"O(1) check if multiset has changed"	Sum of hashed elements; XOR hash

Common Patterns

Pattern	When it applies	Mechanism
Prefix hash array	Static string; need many O(1) substring comparisons	Build `h[i+1] = (h[i]*p + ord(s[i])) % M`; query via subtraction
Rolling window hash	Sliding fixed-length window over a sequence	Subtract old left term * p^k; add new right term; maintain running hash
Double hash pair	Single hash has unacceptable collision risk	Two independent (p, M) pairs; use `(h1, h2)` as the map key
Sum hash for order-independence	Comparing multisets (anagrams, permutations)	`h = Σ hash(c)`; commutative, so order does not matter
Tuple/frozenset canonical key	Grouping by unordered or multi-field equivalence	`key = tuple(sorted(fields))` or `frozenset(chars)`; use as dict key
Random base at runtime	Defeating adversarial fixed-hash attacks	`p = random.choice([large_prime_1, large_prime_2])` at module load
Zobrist incremental update	State-space search with expensive full recomputation	XOR in new piece, XOR out old piece; O(1) per move
`hash(tuple(fields))` delegation	Custom `__hash__` for a multi-field class	Delegate to built-in `hash` on a tuple of the fields used in `__eq__`

Edge Cases & Pitfalls

Negative modulo result in rolling hash — (window - outgoing * p_k) % M is always non-negative in Python but can be negative in C++/Java. Failing to add M before taking mod produces a negative "hash" that never matches positive hashes, causing all lookups to miss. Write ((window - outgoing * p_k) % M + M) % M in non-Python code.
Base too small — using p = 1 (trivially) or p = 2 (power of 2, maps all characters to few values) causes many collisions even for short strings. Base must strictly exceed the alphabet size and be prime.
Modulus too small — M = 10⁶ gives a ~50% birthday-paradox collision for n ≈ 1000 strings. Use M ≥ 10⁹; use two hashes when M² headroom is needed.
Defining __eq__ without __hash__ — Python automatically sets __hash__ = None on any class that defines __eq__ without __hash__, making all instances unhashable. Both methods must be defined together.
Inconsistent __eq__ and __hash__ — if a == b but hash(a) != hash(b), the object is found in equality checks but will never be located in a dict or set (it is hashed to a different bucket). This causes silent logical errors, not exceptions.
Mutable keys — modifying an object after using it as a dict key changes its hash (if __hash__ uses mutable state) but does not update its bucket placement. The object becomes permanently unreachable in the dict. Never mutate keys; use immutable types (tuple, frozenset, str) as keys.
Hash of None/False/0/"" conflicts — Python: hash(0) == hash(False) == hash(0.0) and 0 == False == 0.0. Using mixed-type keys with the same value in a dict merges them unintentionally. Coerce keys to a consistent type.
Polynomial hash collision in competitive programming — a determined adversary can construct strings that all hash to the same value for any fixed (p, M). Randomise p at runtime; avoid publicising the base.
Substring hash formula off-by-one — the correct formula for s[l:r] (0-indexed, exclusive r) is (h[r] - h[l] * pow(p, r-l, M)) % M. Using h[r-1] or h[l+1] instead of h[r] or h[l] shifts the window by one character and produces wrong hashes. Verify on a length-1 substring: h[i+1] - h[i] * p^1 must equal ord(s[i]).
Forgetting to verify on hash collision (Rabin-Karp) — hash equality implies equal strings only with high probability, not certainty. Always compare the actual substrings character by character when a hash match is found. Omitting this verification causes wrong answers on collision inputs.
pow(p, k, M) vs p**k % M — for large k, p\*\*koverflows memory before modular reduction in Python (Python integers are arbitrary precision but very slow). Always use the three-argumentpow(p, k, M) form, which applies modular exponentiation in O(log k).

Implementation Notes

Python's hash() is randomised per process by default (PYTHONHASHSEED environment variable). Never store hash(x) to disk or compare across processes; use a custom polynomial hash for persistent or cross-process hashing.
hash(tuple) in Python combines field hashes using an internal SipHash-based mixing function; it is safe for use as __hash__ in custom classes.
frozenset is hashable and can be a dict key; its hash is order-independent (XOR-based internally). Use it when the key is a set of values rather than a sequence.
collections.defaultdict and Counter both use Python's built-in dict internally; the O(1) average guarantee comes from the underlying hash table, not the wrapper.
Python's pow(base, exp, mod) uses fast modular exponentiation (O(log exp)); use it for computing p^k mod M — never write (p**k) % M.
For competitive programming in Python, consider using a random prime from a precomputed list as the modulus to avoid hash-cracking. A list of large primes: [10**9+7, 10**9+9, 998244353].
When implementing a hash map from scratch, initialise the bucket array to None (not []) and create lists lazily to avoid allocating O(m) empty lists upfront.
Python set and dict resize when 2/3 full (load factor threshold 0.67). Preallocate with dict.fromkeys(keys) if the final size is known to avoid mid-run resizing.

Cross-Topic Relations

Related topic	Specific connection
hash-table.md	Hash function feeds directly into hash table; collision handling, load factor, and rehashing are covered there
string-matching.md	Rabin-Karp rolling hash applied to pattern search; double hashing and collision analysis covered there
strings.md	Polynomial hash enables O(n) solutions to many string comparison, palindrome, and duplicate problems
prefix-sum.md	Prefix hash array is the string analogue of prefix sums — same precompute-once, query-O(1) structure
sliding-window.md	Rolling hash is a sliding window hash; pairs with two-pointer technique for window scanning
suffix-array.md	Suffix arrays are the O(n) alternative to binary search + rolling hash for repeated-substring problems
number-theory.md	Prime selection for base and modulus; modular inverse for hash formula; birthday paradox collision analysis
arrays.md	Underlying storage for hash tables; index arithmetic for open addressing

Interview Reference

Must-Know Problems

Polynomial / Prefix Hash

Longest Duplicate Substring (LC 1044, Hard) — binary search on length + rolling hash; canonical hard problem for this technique
Repeated DNA Sequences (LC 187, Medium) — rolling hash or sliding set over length-10 windows
Find Duplicate File in System (LC 609, Medium) — hash file content as key; group by hash
Longest Happy Prefix (LC 1392, Medium) — uses KMP failure function, but polynomial hash gives an alternative O(n) approach

Anagram / Permutation Detection via Hashing

Find All Anagrams in a String (LC 438, Medium) — sum-hash or frequency array; sliding window
Group Anagrams (LC 49, Medium) — sorted-tuple key or character-count-tuple key; multimap grouping
Valid Anagram (LC 242, Easy) — Counter equality or 26-int frequency array

Custom Hashable Keys and Composite Grouping

Line Reflection (LC 356, Medium) — group points by mid-x using a hash set of (mid_x, y) tuples
Number of Boomerangs (LC 447, Medium) — hash map keyed by distance from pivot
Isomorphic Strings (LC 205, Easy) / Word Pattern (LC 290, Easy) — bidirectional hash maps
4Sum II (LC 454, Medium) — hash map of pair sums; composite key lookup

Designing Hash Structures from Scratch

Design HashMap (LC 706, Easy) — implement with array + chaining
Design HashSet (LC 705, Easy) — implement with array + chaining or bitset
LRU Cache (LC 146, Medium) — hash map + doubly linked list; tests combined data structure design

Zobrist / State Hashing

Find Winner on a Tic Tac Toe Game (LC 1275, Easy) — simple state encoding; foundation for Zobrist
(Competitive) Detecting repeated board states in combinatorial game search — canonical Zobrist use case

Clarification Questions to Ask

Are the keys strings, integers, or composite objects? (Determines hash strategy: polynomial, division, or tuple delegation.)
Is the alphabet bounded (e.g., lowercase only) or arbitrary Unicode? (Affects base choice for polynomial hash.)
Must duplicate detection be exact (no false positives), or is probabilistic acceptable? (Exact → verify on collision; probabilistic → single hash is fine.)
Are keys mutable? (Mutable keys cannot safely be used in a hash set/map — need to convert to immutable form.)
Is the input adversarially constructed? (If yes, use random base or double hashing.)
Do we need O(1) worst-case, or is O(1) average sufficient? (Worst-case requires perfect hashing or cuckoo hashing; average is standard chaining/open addressing.)
Is the hash required to be stable across runs or machines? (Python's built-in hash is not; use custom polynomial hash for portability.)

Common Interview Mistakes

Using Python's built-in hash() for persistent storage or cross-process comparison, not realising it changes across interpreter runs due to PYTHONHASHSEED randomisation.
Writing p**k % M instead of pow(p, k, M), which is correct but extremely slow for large k due to Python computing the full p**k bignum before reducing.
Forgetting to verify after a hash collision in Rabin-Karp, causing wrong answers on adversarial inputs designed to collide the hash.
Defining __eq__ on a custom class without defining __hash__, silently making all instances unhashable and triggering TypeError only when the object is inserted into a set or used as a dict key.
Using a list or dict as a dict key, getting TypeError: unhashable type: 'list'; fix by converting to tuple or frozenset.
Treating hash(a) == hash(b) as proof that a == b (hash equality is necessary but not sufficient for object equality).
Choosing a polynomial base less than the alphabet size (e.g., p = 26 for lowercase letters), which maps multiple characters to the same hash value and inflates collision rate dramatically.

Typical Follow-Up Escalations

Follow-up	Expected answer
"Your solution has hash collisions on this test case — fix it"	Switch to double hashing with two independent (base, mod) pairs; or randomise the base at runtime
"Now make the hash persistent across runs / machines"	Replace `hash()` with a custom polynomial hash using a fixed seed; `hash()` is PYTHONHASHSEED-dependent
"What if keys are adversarially chosen to all collide?"	Universal hashing: pick a, b uniformly at random; any two keys collide with probability ≤ 1/m
"Can you do this with O(1) space per window?"	Rolling hash — O(1) update per step after O(k) initial computation; only the running hash and `p^k` need to be stored
"How would you handle a very large alphabet (e.g., Unicode)?"	Polynomial hash still works; increase base beyond 128; use `ord(c)` directly as the coefficient
"Design a hash function for a Point(x, y) class"	`hash((x, y))` delegates to Python's built-in tuple hash; or `hash(x * 31 + y)` for a manual polynomial combination
"What if the same board state can be reached by different move sequences?"	Zobrist hash: XOR-based incremental state hash; collision probability ~1/2⁶⁴ per comparison

Explore Unread

Great job! You've read all available articles

Hash Function

Core Concepts

Uniformity — a hash function distributes keys evenly across all m buckets. Poor uniformity causes clustering: O(n) lookup in the worst bucket even though average is O(1).

Collision rate — for a uniform hash over m buckets with n keys, the expected number of collisions is n(n-1)/(2m) (birthday paradox). Keeping n << m (low load factor) keeps this small.

Modular arithmetic — standard technique to reduce an arbitrary integer hash to [0, m): h(k) mod m. Choosing m as a prime reduces patterns introduced by the key's binary structure.

Rolling hash — a polynomial hash of a fixed-length window that updates in O(1) by subtracting the outgoing term and adding the incoming term. Enables O(n) scanning of all length-k substrings.

Random base hashing — defense against adversarial input: choose the polynomial base and/or modulus randomly at runtime, making it infeasible for an adversary to predict the hash values.

Minimal perfect hash — a perfect hash where m = n (no wasted buckets). Used in compiler symbol tables and read-only lookup tables.

Types / Variants

Variant	What it is	When to use it over alternatives	Key property
Division hash	`h(k) = k mod m`	Integer keys; m is a known prime	Simplest; m must be prime and not near a power of 2
Polynomial hash (string)	Horner-evaluated polynomial over characters mod M	String/sequence keys; need prefix hash array	O(n) build, O(1) substring query; p=31 or p=131
Rolling / sliding hash	Polynomial hash updated in O(1) as window slides	Fixed-length window comparisons; anagram, duplicate substring	Avoids recomputing from scratch; remove old + add new
Double hash	Two independent (base, mod) pairs	When collision probability must be near 0; competitive programming	P(false positive) ≈ 1/(M₁·M₂) ≈ 10⁻¹⁸
Tuple hash	`hash(tuple(fields))`	Composite keys (coordinate, multi-field object)	Built-in to Python; O(1) lookup with frozenset/tuple as key
FNV / djb2 hash	Bitwise mixing hash (XOR + multiply or add + shift)	Custom hash for non-string primitives outside Python	Fast; no modular reduction needed for 32/64-bit output
Universal hash	`((a·k + b) mod p) mod m`; a, b chosen randomly	Prevent adversarial inputs; provable O(1) expected	Any two distinct keys collide with probability ≤ 1/m
Zobrist hash	XOR of random values assigned to (position, piece) pairs	Board game states; incremental hash of set operations	O(1) update when one element changes; XOR is its own inverse

Key Algorithms

Polynomial Hash Construction

Build a prefix hash array over string or sequence s in O(n). Allows O(1) substring hash queries.

The base p must exceed the alphabet size (31 for lowercase, 131 for ASCII). M must be a large prime (10⁹+7 or 10⁹+9 are standard choices).

h[0] = 0
for i, c in enumerate(s):
    h[i+1] = (h[i] * p + ord(c)) % M

Substring hash for s[l:r]: (h[r] - h[l] * pow(p, r-l, M)) % M. Time O(n) build, O(1) per query.

Rolling Hash Update (Sliding Window)

Maintain the hash of a fixed-length window of size k without recomputing from scratch. O(1) per slide after O(k) initial computation.

Remove the leftmost character (multiply its contribution by p^(k-1) and subtract), then add the new rightmost character. The precomputed value p_k = pow(p, k, M) stays constant throughout.

window = (window * p + ord(s[i])) % M          # add right
window = (window - ord(s[i-k]) * p_k) % M      # remove left

Correctness: (window - x * p_k) % M can be negative in languages with signed modulo; add M before taking mod if needed. Python's % is always non-negative.

Double Hashing for Near-Zero Collision Probability

Run two independent polynomial hashes simultaneously with different (base, modulus) pairs. Combine as a tuple (h1, h2) used as the dictionary key or compared for equality.

Probability of a false positive drops from 1/M to 1/(M₁·M₂). For M₁ = M₂ = 10⁹+7, this is ~10⁻¹⁸ per comparison — negligible for n ≤ 10⁶ strings.

Use when a single hash causes wrong answers on adversarial test data, or in competitive programming where judges are known to include anti-hash inputs.

Universal Hashing

In Python: random.randint to pick the base at import time; use as the polynomial base for string hashing to defeat adversarial inputs against fixed-base solutions.

Zobrist Hashing

Assign a uniformly random 64-bit integer to each (position, value) pair at program start. The hash of a state is the XOR of all random values for occupied (position, value) pairs.

table[pos][val] = random.getrandbits(64)   # initialise once
h ^= table[pos][val]   # toggle: XOR again to remove

Key insight: XOR is its own inverse, so removing an element is the same operation as adding it. O(1) update when one element changes; no recomputation over the whole state.

Applies to: board game hashing (chess, Go), set-membership hashing, detecting duplicate board positions during search.

Designing `hash` for Custom Classes

Implement __hash__ to return a deterministic integer for any instance; implement __eq__ to define equality. Python's contract: a == b must imply hash(a) == hash(b).

Simplest correct approach: combine all fields that participate in __eq__ into a tuple and delegate to the built-in hash.

def __hash__(self): return hash((self.x, self.y, self.label))
def __eq__(self, o): return (self.x, self.y, self.label) == (o.x, o.y, o.label)

Mutable classes should not implement __hash__ (Python sets __hash__ = None by default when __eq__ is defined without __hash__, making instances unhashable).

Hash Map from Scratch (Chaining)

An array of m buckets, each a list of (key, value) pairs. Hash determines the bucket; linear scan within the bucket handles collisions.

See hash-table.md for the full hash table data structure with all collision strategies, invariants, and operations.

Hash Set from Scratch (Open Addressing / Linear Probing)

Array of m slots, each None (empty), a tombstone sentinel, or a key. find_slot(k) probes (h(k) + i) % m for i = 0, 1, … until empty or matching slot.

Critical: deletions must write a tombstone (not set to empty) to preserve the probe chain for keys that were inserted past the deleted slot.

See hash-table.md for detailed open addressing variants and the tombstone invariant.

Advanced Techniques

Prefix Hash + Binary Search for Longest Repeated Substring

See string-matching.md for the full Rabin-Karp and suffix-array context.

Polynomial Hash for Anagram / Permutation Equivalence

Composite Key Hashing for Multi-Dimensional Grouping

Problem Taxonomy

By Problem Category

Category	What it asks	Key technique
Substring comparison	Are two substrings equal in O(1)?	Prefix hash array; compare `h(l1, r1) == h(l2, r2)`
Duplicate substring detection	Does any substring of length k appear twice?	Rolling hash over length-k windows; hash set
Longest repeated substring	Longest substring appearing ≥2 times	Binary search on length + rolling hash
Anagram / permutation window	Find all windows in T that are permutations of P	Sum-hash or frequency array; sliding window
Composite key grouping	Group by multi-field equivalence	Tuple or frozenset key; hash map of lists
Custom hashable object	Use class instance as dict/set key	Implement `__hash__` + `__eq__`; delegate to `hash(tuple(fields))`
Hash map / hash set design	Implement from scratch	Array of buckets + chaining or open addressing
Anti-adversarial hashing	Avoid O(n) worst-case from bad inputs	Random base; universal hash family
Board / state deduplication	Have we visited this game state?	Zobrist hash; incremental XOR update

Problem Signal → Technique

Signal in problem	Likely technique
"compare substrings in O(1)" / "check if substrings are equal"	Prefix hash array
"find duplicate / repeated substring of length k"	Rolling hash + hash set
"longest duplicate substring"	Binary search on length + rolling hash
"find all windows that are anagrams of P"	Sliding window with frequency map or sum-hash
"group by some equivalence that is hard to compare directly"	Tuple/frozenset canonical key
"design a custom object to be used as a dictionary key"	`__hash__` + `__eq__` on immutable tuple of fields
"implement hash map / hash set from scratch"	Array + chaining (simpler) or linear probing (cache-friendly)
"adversarial test cases" / "hack" / "random seed" mentioned	Universal or random-base hashing
"detect repeated board states" / "cycle detection in game tree"	Zobrist hash
"O(1) check if multiset has changed"	Sum of hashed elements; XOR hash

Common Patterns

Pattern	When it applies	Mechanism
Prefix hash array	Static string; need many O(1) substring comparisons	Build `h[i+1] = (h[i]*p + ord(s[i])) % M`; query via subtraction
Rolling window hash	Sliding fixed-length window over a sequence	Subtract old left term * p^k; add new right term; maintain running hash
Double hash pair	Single hash has unacceptable collision risk	Two independent (p, M) pairs; use `(h1, h2)` as the map key
Sum hash for order-independence	Comparing multisets (anagrams, permutations)	`h = Σ hash(c)`; commutative, so order does not matter
Tuple/frozenset canonical key	Grouping by unordered or multi-field equivalence	`key = tuple(sorted(fields))` or `frozenset(chars)`; use as dict key
Random base at runtime	Defeating adversarial fixed-hash attacks	`p = random.choice([large_prime_1, large_prime_2])` at module load
Zobrist incremental update	State-space search with expensive full recomputation	XOR in new piece, XOR out old piece; O(1) per move
`hash(tuple(fields))` delegation	Custom `__hash__` for a multi-field class	Delegate to built-in `hash` on a tuple of the fields used in `__eq__`

Edge Cases & Pitfalls

Negative modulo result in rolling hash — (window - outgoing * p_k) % M is always non-negative in Python but can be negative in C++/Java. Failing to add M before taking mod produces a negative "hash" that never matches positive hashes, causing all lookups to miss. Write ((window - outgoing * p_k) % M + M) % M in non-Python code.
Base too small — using p = 1 (trivially) or p = 2 (power of 2, maps all characters to few values) causes many collisions even for short strings. Base must strictly exceed the alphabet size and be prime.
Modulus too small — M = 10⁶ gives a ~50% birthday-paradox collision for n ≈ 1000 strings. Use M ≥ 10⁹; use two hashes when M² headroom is needed.
Defining __eq__ without __hash__ — Python automatically sets __hash__ = None on any class that defines __eq__ without __hash__, making all instances unhashable. Both methods must be defined together.
Inconsistent __eq__ and __hash__ — if a == b but hash(a) != hash(b), the object is found in equality checks but will never be located in a dict or set (it is hashed to a different bucket). This causes silent logical errors, not exceptions.
Mutable keys — modifying an object after using it as a dict key changes its hash (if __hash__ uses mutable state) but does not update its bucket placement. The object becomes permanently unreachable in the dict. Never mutate keys; use immutable types (tuple, frozenset, str) as keys.
Hash of None/False/0/"" conflicts — Python: hash(0) == hash(False) == hash(0.0) and 0 == False == 0.0. Using mixed-type keys with the same value in a dict merges them unintentionally. Coerce keys to a consistent type.
Polynomial hash collision in competitive programming — a determined adversary can construct strings that all hash to the same value for any fixed (p, M). Randomise p at runtime; avoid publicising the base.
Substring hash formula off-by-one — the correct formula for s[l:r] (0-indexed, exclusive r) is (h[r] - h[l] * pow(p, r-l, M)) % M. Using h[r-1] or h[l+1] instead of h[r] or h[l] shifts the window by one character and produces wrong hashes. Verify on a length-1 substring: h[i+1] - h[i] * p^1 must equal ord(s[i]).
Forgetting to verify on hash collision (Rabin-Karp) — hash equality implies equal strings only with high probability, not certainty. Always compare the actual substrings character by character when a hash match is found. Omitting this verification causes wrong answers on collision inputs.
pow(p, k, M) vs p**k % M — for large k, p\*\*koverflows memory before modular reduction in Python (Python integers are arbitrary precision but very slow). Always use the three-argumentpow(p, k, M) form, which applies modular exponentiation in O(log k).

Implementation Notes

Python's hash() is randomised per process by default (PYTHONHASHSEED environment variable). Never store hash(x) to disk or compare across processes; use a custom polynomial hash for persistent or cross-process hashing.
hash(tuple) in Python combines field hashes using an internal SipHash-based mixing function; it is safe for use as __hash__ in custom classes.
frozenset is hashable and can be a dict key; its hash is order-independent (XOR-based internally). Use it when the key is a set of values rather than a sequence.
collections.defaultdict and Counter both use Python's built-in dict internally; the O(1) average guarantee comes from the underlying hash table, not the wrapper.
Python's pow(base, exp, mod) uses fast modular exponentiation (O(log exp)); use it for computing p^k mod M — never write (p**k) % M.
For competitive programming in Python, consider using a random prime from a precomputed list as the modulus to avoid hash-cracking. A list of large primes: [10**9+7, 10**9+9, 998244353].
When implementing a hash map from scratch, initialise the bucket array to None (not []) and create lists lazily to avoid allocating O(m) empty lists upfront.
Python set and dict resize when 2/3 full (load factor threshold 0.67). Preallocate with dict.fromkeys(keys) if the final size is known to avoid mid-run resizing.

Cross-Topic Relations

Related topic	Specific connection
hash-table.md	Hash function feeds directly into hash table; collision handling, load factor, and rehashing are covered there
string-matching.md	Rabin-Karp rolling hash applied to pattern search; double hashing and collision analysis covered there
strings.md	Polynomial hash enables O(n) solutions to many string comparison, palindrome, and duplicate problems
prefix-sum.md	Prefix hash array is the string analogue of prefix sums — same precompute-once, query-O(1) structure
sliding-window.md	Rolling hash is a sliding window hash; pairs with two-pointer technique for window scanning
suffix-array.md	Suffix arrays are the O(n) alternative to binary search + rolling hash for repeated-substring problems
number-theory.md	Prime selection for base and modulus; modular inverse for hash formula; birthday paradox collision analysis
arrays.md	Underlying storage for hash tables; index arithmetic for open addressing

Interview Reference

Must-Know Problems

Polynomial / Prefix Hash

Longest Duplicate Substring (LC 1044, Hard) — binary search on length + rolling hash; canonical hard problem for this technique
Repeated DNA Sequences (LC 187, Medium) — rolling hash or sliding set over length-10 windows
Find Duplicate File in System (LC 609, Medium) — hash file content as key; group by hash
Longest Happy Prefix (LC 1392, Medium) — uses KMP failure function, but polynomial hash gives an alternative O(n) approach

Anagram / Permutation Detection via Hashing

Find All Anagrams in a String (LC 438, Medium) — sum-hash or frequency array; sliding window
Group Anagrams (LC 49, Medium) — sorted-tuple key or character-count-tuple key; multimap grouping
Valid Anagram (LC 242, Easy) — Counter equality or 26-int frequency array

Custom Hashable Keys and Composite Grouping

Line Reflection (LC 356, Medium) — group points by mid-x using a hash set of (mid_x, y) tuples
Number of Boomerangs (LC 447, Medium) — hash map keyed by distance from pivot
Isomorphic Strings (LC 205, Easy) / Word Pattern (LC 290, Easy) — bidirectional hash maps
4Sum II (LC 454, Medium) — hash map of pair sums; composite key lookup

Designing Hash Structures from Scratch

Design HashMap (LC 706, Easy) — implement with array + chaining
Design HashSet (LC 705, Easy) — implement with array + chaining or bitset
LRU Cache (LC 146, Medium) — hash map + doubly linked list; tests combined data structure design

Zobrist / State Hashing

Find Winner on a Tic Tac Toe Game (LC 1275, Easy) — simple state encoding; foundation for Zobrist
(Competitive) Detecting repeated board states in combinatorial game search — canonical Zobrist use case

Clarification Questions to Ask

Are the keys strings, integers, or composite objects? (Determines hash strategy: polynomial, division, or tuple delegation.)
Is the alphabet bounded (e.g., lowercase only) or arbitrary Unicode? (Affects base choice for polynomial hash.)
Must duplicate detection be exact (no false positives), or is probabilistic acceptable? (Exact → verify on collision; probabilistic → single hash is fine.)
Are keys mutable? (Mutable keys cannot safely be used in a hash set/map — need to convert to immutable form.)
Is the input adversarially constructed? (If yes, use random base or double hashing.)
Do we need O(1) worst-case, or is O(1) average sufficient? (Worst-case requires perfect hashing or cuckoo hashing; average is standard chaining/open addressing.)
Is the hash required to be stable across runs or machines? (Python's built-in hash is not; use custom polynomial hash for portability.)

Common Interview Mistakes

Using Python's built-in hash() for persistent storage or cross-process comparison, not realising it changes across interpreter runs due to PYTHONHASHSEED randomisation.
Writing p**k % M instead of pow(p, k, M), which is correct but extremely slow for large k due to Python computing the full p**k bignum before reducing.
Forgetting to verify after a hash collision in Rabin-Karp, causing wrong answers on adversarial inputs designed to collide the hash.
Defining __eq__ on a custom class without defining __hash__, silently making all instances unhashable and triggering TypeError only when the object is inserted into a set or used as a dict key.
Using a list or dict as a dict key, getting TypeError: unhashable type: 'list'; fix by converting to tuple or frozenset.
Treating hash(a) == hash(b) as proof that a == b (hash equality is necessary but not sufficient for object equality).
Choosing a polynomial base less than the alphabet size (e.g., p = 26 for lowercase letters), which maps multiple characters to the same hash value and inflates collision rate dramatically.

Typical Follow-Up Escalations

Follow-up	Expected answer
"Your solution has hash collisions on this test case — fix it"	Switch to double hashing with two independent (base, mod) pairs; or randomise the base at runtime
"Now make the hash persistent across runs / machines"	Replace `hash()` with a custom polynomial hash using a fixed seed; `hash()` is PYTHONHASHSEED-dependent
"What if keys are adversarially chosen to all collide?"	Universal hashing: pick a, b uniformly at random; any two keys collide with probability ≤ 1/m
"Can you do this with O(1) space per window?"	Rolling hash — O(1) update per step after O(k) initial computation; only the running hash and `p^k` need to be stored
"How would you handle a very large alphabet (e.g., Unicode)?"	Polynomial hash still works; increase base beyond 128; use `ord(c)` directly as the coefficient
"Design a hash function for a Point(x, y) class"	`hash((x, y))` delegates to Python's built-in tuple hash; or `hash(x * 31 + y)` for a manual polynomial combination
"What if the same board state can be reached by different move sequences?"	Zobrist hash: XOR-based incremental state hash; collision probability ~1/2⁶⁴ per comparison

Explore Unread

Great job! You've read all available articles

Hash Function

Core Concepts

Types / Variants

Key Algorithms

Polynomial Hash Construction

Rolling Hash Update (Sliding Window)

Double Hashing for Near-Zero Collision Probability

Universal Hashing

Zobrist Hashing

Designing __hash__ for Custom Classes

Hash Map from Scratch (Chaining)

Hash Set from Scratch (Open Addressing / Linear Probing)

Advanced Techniques

Prefix Hash + Binary Search for Longest Repeated Substring

Polynomial Hash for Anagram / Permutation Equivalence

Composite Key Hashing for Multi-Dimensional Grouping

Problem Taxonomy

By Problem Category

Problem Signal → Technique

Common Patterns

Edge Cases & Pitfalls

Implementation Notes

Cross-Topic Relations

Interview Reference

Must-Know Problems

Clarification Questions to Ask

Common Interview Mistakes

Typical Follow-Up Escalations

Read Next

Explore Unread

Hash Function

Core Concepts

Types / Variants

Key Algorithms

Polynomial Hash Construction

Rolling Hash Update (Sliding Window)

Double Hashing for Near-Zero Collision Probability

Universal Hashing

Zobrist Hashing

Designing __hash__ for Custom Classes

Hash Map from Scratch (Chaining)

Hash Set from Scratch (Open Addressing / Linear Probing)

Advanced Techniques

Prefix Hash + Binary Search for Longest Repeated Substring

Polynomial Hash for Anagram / Permutation Equivalence

Composite Key Hashing for Multi-Dimensional Grouping

Problem Taxonomy

By Problem Category

Problem Signal → Technique

Common Patterns

Edge Cases & Pitfalls

Implementation Notes

Cross-Topic Relations

Interview Reference

Must-Know Problems

Clarification Questions to Ask

Common Interview Mistakes

Typical Follow-Up Escalations

Read Next

Explore Unread

Designing `hash` for Custom Classes

Designing `hash` for Custom Classes