Hash Table
Core Concepts
Hash table — a data structure that maps keys to values using a hash function to compute an index (bucket) into an array of slots. Average O(1) insert, lookup, and delete.
Hash function — a deterministic function h(key) → index that maps a key to a bucket index in [0, m) where m is the table size. A good hash function distributes keys uniformly to minimize collisions.
Collision — when two distinct keys hash to the same bucket. All hash table implementations must handle collisions; the strategy is the primary design axis.
Load factor (α) — n / m where n = number of stored entries, m = number of buckets. Controls the collision probability and memory/speed tradeoff. Typical resize threshold: α > 0.75 for chaining, α > 0.5 for open addressing.
Rehashing — when the load factor threshold is exceeded, allocate a new array (typically 2× size) and reinsert all entries by recomputing hashes. Amortizes to O(1) per insertion over a sequence of n insertions.
Amortized O(1) — individual operations may trigger O(n) rehashing, but the average cost per operation across n operations is O(1).
Separate chaining — each bucket holds a linked list (or tree) of all entries that hash to it. Allows load factor > 1.
Open addressing — all entries stored inside the array itself; on collision, probe adjacent slots according to a fixed sequence until an empty slot is found. Load factor must stay below 1.
Tombstone — a sentinel value left in a slot when a key is deleted under open addressing, so probing chains are not broken for subsequent lookups.
Perfect hashing — a hash function with zero collisions for a known static set of keys. Used in compilers and read-only tables; O(1) worst-case lookup.
Universal hashing — a family of hash functions where for any two distinct keys, the probability a randomly chosen function from the family collides them is ≤ 1/m. Prevents adversarial input from causing worst-case behavior.
Structural Invariants
| Invariant | What it guarantees | What breaks if violated |
|---|---|---|
| Each key appears in exactly one bucket | Lookup finds the unique entry | Duplicate keys yield inconsistent get/put results |
| Load factor stays below threshold | Expected O(1) operations | Chains or probe sequences degrade to O(n) |
| After rehash, every entry recomputed in new array | No entry is lost or unreachable | Silent data loss on lookup |
| Open addressing: deleted slots marked as tombstones, not empty | Probe chains remain intact | Lookups terminate early and falsely report key-not-found |
| Hash function is deterministic for equal keys | get(k) always reaches the same bucket put(k) wrote to | Infinite loop or miss on lookup |
Internal Representation
Separate chaining — array of size m, each slot a pointer to a linked list of (key, value) pairs. Java HashMap uses this; switches the chain to a red-black tree when a bucket's chain length exceeds 8 (treeification threshold), reverting to a list when it shrinks below 6.
Open addressing with linear probing — array of size m, probe sequence (h(k) + i) mod m. Exhibits primary clustering: long runs of occupied slots form and attract more collisions. Cache-friendly because accesses are sequential.
Open addressing with quadratic probing — probe sequence (h(k) + i²) mod m. Reduces primary clustering but introduces secondary clustering (keys with the same initial hash follow the same probe path). Requires m to be prime or a power of 2 with specific offsets to guarantee all slots are visited.
Double hashing — probe sequence (h₁(k) + i·h₂(k)) mod m where h₂(k) ≠ 0 and is coprime to m. Eliminates secondary clustering; best distribution among open addressing methods. Slightly more expensive per probe.
Python dict — compact hash table using open addressing with a pseudo-random probe: i = (5*i + 1 + perturb) mod m, perturb >>= 5. Preserves insertion order since Python 3.7. Initial size 8; resize when 2/3 full.
Robin Hood hashing — open addressing variant where an incoming key can "steal" a slot from a key with a smaller displacement (distance from its ideal slot). Reduces variance in probe length; used in many modern implementations (Rust HashMap, C++ Abseil).
Types / Variants
| Variant | What it is | When to use it | Key property |
|---|---|---|---|
| HashMap | Unordered key→value map | Default choice; O(1) avg all ops | No ordering guarantee |
| LinkedHashMap | HashMap + doubly linked list of insertion order | Need iteration in insertion order; LRU cache | O(1) ops, O(n) ordered iteration |
| TreeMap / SortedMap | BST-backed map (Red-Black) | Need sorted keys, floor/ceil, range queries | O(log n) all ops; maintains sort order |
| Frequency / Counter map | HashMap<key, int> tracking counts | Character frequency, word count, sliding window | Conceptual pattern, not a separate class |
| Multimap | One key → multiple values | Grouping problems (anagram buckets) | Values stored as list per key |
| Bidirectional map (Bimap) | Two HashMaps mirroring each other | Bijection problems; two-way lookup | O(1) lookup by key or value; maintain sync manually |
| Bloom filter | Probabilistic set membership | Check if element might be in a large set | False positives possible; no false negatives; O(1) but lossy |
See trees.md for TreeMap internals (Red-Black Tree).
Topic-Specific Operations
put(key, value) — O(1) amortized
Compute h(key) → bucket. In chaining: walk the chain to update if key exists, else prepend. In open addressing: probe until empty or matching slot found, write there. If load factor exceeded after insert, trigger rehash.
get(key) — O(1) average
Compute h(key) → bucket. In chaining: walk the chain comparing keys. In open addressing: probe the sequence until key found or empty slot (not tombstone) encountered. Return null/None/default if not found.
remove(key) — O(1) average
Chaining: find in chain, unlink node. Open addressing: find slot, mark it as a tombstone (not empty) to preserve probe chain integrity. Periodically rehash to reclaim tombstone slots if deletions are frequent.
containsKey(key) — O(1) average
Same as get(key) but return boolean rather than value.
Iteration — O(n + m)
Must scan all m buckets, including empty ones. Iterating n entries costs O(n + m); if m >> n (over-allocated table), iteration is expensive despite few entries. Python dict uses compact storage so iteration is O(n).
Key Algorithms
Division Method Hashing
h(k) = k mod m. Fast; m should be a prime not close to a power of 2 to avoid patterns from binary keys. Avoid powers of 2 because they only use the low-order bits of k.
Polynomial Rolling Hash (strings)
h(s) = (s[0]·p^(n-1) + s[1]·p^(n-2) + … + s[n-1]) mod M
where p is a small prime (31 for lowercase letters, 131 for all ASCII), M a large prime (1e9+7). Key insight: substring hash computable in O(1) from prefix hashes. Time O(n) preprocessing, O(1) per query.
prefix[i+1] = (prefix[i] * p + ord(s[i])) % M
h(s[l:r]) = (prefix[r] - prefix[l] * p**(r-l)) % M
Rabin-Karp Sliding Window Hash
Maintain a rolling hash of a fixed-length window: remove the outgoing character's contribution, shift remaining by p, add the incoming character. O(1) per slide after O(k) initial hash. Used for substring search and duplicate substring detection.
Double Hashing (open addressing probe)
h(k, i) = (h₁(k) + i · h₂(k)) mod m
h₂(k) = q − (k mod q) for prime q < m ensures h₂(k) ≠ 0 and is coprime to m, guaranteeing all m slots probed. Better distribution than linear or quadratic probing; slower due to two hash computations.
Universal Hashing (adversarial input defense)
Choose h_{a,b}(k) = ((a·k + b) mod p) mod m where p is prime ≥ universe size, a ∈ [1, p-1], b ∈ [0, p-1] chosen randomly. Guarantees expected O(1) regardless of input. Python randomizes hash seeds per process start (PYTHONHASHSEED) for this reason.
Advanced Techniques
Prefix Sum + Hash Map (subarray problems)
What it solves: "Find subarray with sum/XOR/property equal to target" in O(n).
Mechanism: Store prefixVal → earliest_index in a hash map. At each index i, check if prefix[i] - target exists in the map; if so, the subarray (map[prefix[i]-target], i] satisfies the condition.
When to use over sliding window: When the array contains negative numbers (sliding window requires non-negative for monotone window expansion). If all values are positive, sliding window is simpler.
Complexity: O(n) time, O(n) space.
Rolling Hash for Duplicate Substring Detection
What it solves: "Is there a repeated substring of length k?" in O(n) average (vs O(n·k) naive). Mechanism: Hash all substrings of length k using a sliding window hash; store hashes in a set. On collision, verify with string comparison (O(k)) to rule out false positives. When to use over suffix arrays: Competitive/interview settings without suffix array library; simpler to implement. Use suffix arrays when you need all duplicate substrings or exact counts. Complexity: O(n) average time; O(n) space.
Two Hash Maps for Bijection / Bidirectional Lookup
What it solves: Isomorphism, word pattern, encoding/decoding — problems requiring "every A maps to exactly one B and vice versa."
Mechanism: Maintain A→B and B→A simultaneously; on each pair (a, b), check both maps before writing.
When to use over single map: Whenever uniqueness must hold in both directions. A single map catches only one-to-one from A side; the reverse map catches many-to-one from B side.
Complexity: O(n) time and space.
Sliding Window + Frequency Map (at-most-k distinct)
What it solves: "Longest subarray/substring with at most k distinct characters/values."
Mechanism: Maintain a char→count map inside a two-pointer window. When distinct count exceeds k, shrink left pointer and decrement counts (remove entry when count hits 0).
When to use over sorting: When you need the actual window or when k is small relative to n. Sorting destroys order and costs O(n log n).
Complexity: O(n) time, O(k) space (at most k+1 entries in map at once).
Problem Taxonomy
By Problem Category
| Category | What it asks | Key technique |
|---|---|---|
| Frequency counting | Count occurrences, find majority, most frequent k | Counter map; then heap or sort |
| Complement lookup | Two elements summing/XORing to target | Hash set for "seen"; one-pass |
| Subarray with target sum | Contiguous subarray summing to k | Prefix sum + hash map |
| Grouping / categorization | Group strings by anagram, by property | Hash map keyed by canonical form (sorted string, prime product) |
| Duplicate detection | Has duplicate, find duplicate, first duplicate | Hash set; or value-as-index trick |
| Isomorphism / pattern | Word pattern, string isomorphism | Two hash maps (bidirectional) |
| Substring matching | Find all anagrams, repeated substring | Sliding window frequency map; rolling hash |
| LRU / LFU Cache | Evict least recently / frequently used | LinkedHashMap or dict + doubly linked list |
| Graph / union-find auxiliary | Map node labels to integers | HashMap for label normalization |
| Memoization (top-down DP) | Cache recursive subproblem results | HashMap when states are sparse or non-integer |
Problem Signal → Technique
| Signal in problem | Likely technique |
|---|---|
| "two numbers sum to target" | Hash set, one-pass complement check |
| "subarray sum equals k" | Prefix sum + hash map |
| "longest subarray / substring with at most k distinct" | Sliding window + frequency map |
| "group anagrams" / "find all anagrams" | Sort-key or character-count-key hash map |
| "isomorphic" / "word pattern" | Two bidirectional hash maps |
| "find duplicate in O(n) time O(1) space" | Floyd's cycle; if O(n) space allowed → hash set |
| "first unique / first repeated" | LinkedHashMap or ordered dict + count |
| "design cache with eviction" | LinkedHashMap (Java) or OrderedDict (Python) |
| "count distinct values in window" | Sliding window + hash map size |
| "repeated DNA sequence" / "repeated substring" | Rolling hash or sliding window hash set |
| "check if two strings are permutations" | Frequency map equality or 26-int array |
Common Patterns
| Pattern | When it applies | Mechanism |
|---|---|---|
| Counter map | Count occurrences of any hashable | d[k] = d.get(k, 0) + 1; Python Counter |
| Complement lookup | Pair-sum or pair-XOR to target | Insert seen values; check target - x before inserting x |
| Canonical key grouping | Group items that are "equivalent" | Map each item to a canonical form (sorted tuple, frozenset, prime encoding) as the key |
| Prefix aggregate + hash map | Subarray/substring with aggregate equal to target | Store prefix[i] → index in map; query prefix[i] - target |
| Sliding window frequency map | Fixed or variable window with at-most-k constraint | Two pointers; map tracks counts; shrink when invariant breaks |
| Set for O(1) membership | "Have we seen x before?" without value storage | seen = set(); x in seen |
| Index map | Reconstruct order or find span | val → index map; compute i - map[val] |
| Two maps for bijection | Enforce one-to-one mapping in both directions | fwd[a] = b and rev[b] = a; check both before writing |
Edge Cases & Pitfalls
-
Mutable keys — using a list or dict as a key raises
TypeErrorin Python because mutable objects are not hashable. Convert to tuple or frozenset before inserting. In interviews, ask "are keys guaranteed to be hashable/immutable?" -
Default value trap —
dict[key]raisesKeyErrorif key absent; usedict.get(key, default)orcollections.defaultdict. Forgetting this causes runtime crashes on first-time insertions in frequency maps. -
Removing entry vs. setting count to zero — in a sliding window frequency map, failing to
del map[k]whenmap[k]reaches 0 causes the distinct-count (vialen(map)) to be wrong. Always delete the entry rather than leaving a zero. -
Hash collision in custom objects — if you define
__eq__without defining__hash__, Python sets__hash__ = None, making the object unhashable. Both must be defined consistently: equal objects must have equal hashes. -
Integer overflow in polynomial hash — without modular arithmetic, intermediate values overflow 64-bit integers. Always reduce modulo a large prime at each step, not only at the end.
-
Rabin-Karp false positives — hash equality does not guarantee string equality. Always verify with a character-by-character comparison on hash match.
-
Concurrent modification — iterating over a dict while inserting or deleting keys causes
RuntimeErrorin Python. Collect keys to modify first, then modify. -
Load factor and worst-case — adversarial inputs can cause many collisions in a fixed-hash table, degrading every operation to O(n). In competitive programming, use randomized hash seeds or polynomial hashing with a random base to avoid this.
-
isvs==for keys — Python interns small integers and short strings;x is ymay coincidentally be true. Always use==for key comparison logic. -
Prefix sum index off-by-one — initialize
prefix_map = {0: -1}(sum 0 seen before index 0) to handle subarrays starting at index 0. Forgetting this misses subarrays that start from the beginning.
Implementation Notes
- Python
dictandsetare backed by hash tables;incheck is O(1) average, not O(n) — never use a list for membership tests when a set suffices. - Python
Counteris a subclass ofdict;Counter(a) == Counter(b)compares frequency maps directly and works for anagram checks. - Python
collections.defaultdict(int)eliminatesget(k, 0)boilerplate;defaultdict(list)for grouping. - Python
collections.OrderedDictmaintains insertion order with O(1) move-to-end (move_to_end(k)), enabling O(1) LRU cache operations. - Java
HashMapis not thread-safe; useConcurrentHashMapfor concurrent access. - Java
HashMapallows onenullkey;Hashtabledoes not — a common interview distinction. - Python
dictpreserves insertion order since 3.7 (implementation detail since 3.6); do not rely on this in older codebases. - Recursion depth limit (Python default 1000) matters only if you implement chaining traversal recursively — use iteration instead.
frozensetis hashable and can be used as a dict key; useful for grouping by set of characters without sorting.- When implementing a custom hash for a composite key,
hash(tuple(fields))is idiomatic and correct in Python.
Cross-Topic Relations
| Related topic | Specific connection |
|---|---|
| Arrays | Hash table's underlying storage; bucket array is the physical structure |
| Linked Lists | Chaining uses singly linked lists per bucket; LRU cache uses a doubly linked list alongside a hash map |
| Strings | Polynomial and rolling hash are the primary tools for O(n) substring algorithms |
| Prefix Sum | Combines with hash map to solve subarray-sum problems in O(n); the two techniques are almost always paired |
| Sliding Window | Frequency map tracks window contents; hash map enables O(1) window state updates |
| Two Pointers | Complement pairs (two-sum) use a hash set; two pointers use sorted order — choose based on whether sorted order is needed |
| Dynamic Programming | Hash map as memoization cache for top-down DP with non-integer or sparse states |
| Graph | Hash map normalizes node labels; adjacency list is often dict[node] → list[neighbor] |
| Heap | Combined with frequency map for top-k frequent elements: count with map, then heapify |
| Trie | Alternative to hash map for prefix-based string lookups; Trie wins on prefix queries, hash map wins on exact lookups |
See sliding-window.md for patterns that pair with frequency maps. See prefix-sum.md for the full prefix-sum + hash map subarray technique.
Interview Reference
Must-Know Problems
Two-Sum / Complement Lookup
- Two Sum (Easy) — canonical; one-pass hash map
- Three Sum (Medium) — sort + two pointers; hash map variant exists
- Subarray Sum Equals K (Medium) — prefix sum + hash map
- Continuous Subarray Sum (Medium) — prefix sum mod k + hash map
Frequency and Counting
- Top K Frequent Elements (Medium) — Counter + heap or bucket sort
- First Unique Character in String (Easy)
- Find All Anagrams in a String (Medium) — sliding window + frequency array
- Group Anagrams (Medium) — sorted key or character-count tuple key
Sliding Window with Hash Map
- Longest Substring Without Repeating Characters (Medium)
- Longest Substring with At Most K Distinct Characters (Medium)
- Minimum Window Substring (Hard) — need/have counter pattern
- Subarrays with K Different Integers (Hard) — at-most-k minus at-most-(k-1)
Design Problems
- LRU Cache (Medium) — OrderedDict or dict + doubly linked list
- LFU Cache (Hard) — two hash maps + doubly linked list per frequency
- Design HashMap (Easy) — implement from scratch with chaining
Isomorphism and Bijection
- Isomorphic Strings (Easy) — two maps
- Word Pattern (Easy) — two maps
- Word Pattern II (Hard) — backtracking + two maps
Hashing Strings / Substrings
- Repeated DNA Sequences (Medium) — rolling hash or sliding window set
- Longest Duplicate Substring (Hard) — binary search + rolling hash
- Implement Rabin-Karp (not on LC, but asked in system design rounds)
Clarification Questions to Ask
- Are keys guaranteed to be hashable? (Lists, dicts as keys require conversion)
- Can values be negative or zero? (Affects complement-lookup and prefix-sum logic)
- Are there duplicate keys in the input? (Frequency map vs. existence set)
- Is the problem asking for any valid answer or all answers? (Early return vs. full traversal)
- Is "subarray" contiguous? (Prefix-sum technique applies only to contiguous)
- What is the character set? (26 lowercase, ASCII, Unicode — affects array vs. map choice)
- Can the answer be empty / zero-length? (Edge case for sliding window lower bound)
Common Interview Mistakes
- Using
dict[key]without a default, crashing on first encounter of a key — usegetordefaultdict. - Forgetting
prefix_map = {0: -1}initialization in prefix-sum problems, missing subarrays that start at index 0. - Checking
if map[char] == 0: del map[char]after decrement but after already usinglen(map)for the distinct count — must delete before the length check matters. - Using a list instead of a set for "have we seen this?" checks, making the solution O(n²) instead of O(n).
- Building the map first, then doing a second pass — most problems are solvable in one pass where you check before inserting.
- Forgetting that
Counter({'a': 0})is not equal toCounter()— stale zero-count entries cause incorrect equality checks. - Implementing isomorphism with only one map direction and missing the many-to-one violation from the target side.
Typical Follow-Up Escalations
| Follow-up | Expected answer |
|---|---|
| "Now solve it with O(1) space" | Usually forces a different algorithm (e.g., Floyd's cycle detection instead of hash set for duplicates) |
| "What if the input stream is too large to fit in memory?" | Approximate counting (Count-Min Sketch), bloom filters, or reservoir sampling |
| "Can you make the cache thread-safe?" | Add locking around get/put; or use ConcurrentHashMap; discuss lock granularity |
| "What if keys are adversarially chosen to cause collisions?" | Universal hashing with random seed; Java/Python both do this by default |
| "How would you extend Two Sum to a stream?" | Maintain hash set of seen values; check target - x before inserting each element |
| "Can you do Top-K without sorting?" | Quickselect O(n) average; or bucket sort by frequency O(n) |
Read Next
Start reading to get personalized recommendations
Explore Unread
Great job! You've read all available articles