Comprehensive Guide to Hash Tables and Hashing
A complete deep-dive into hash tables, hash functions, collision resolution strategies, and their applications in problem-solving.
1. Foundational Understanding
1.1 What, Why, and When Questions
Basic Definition and Scope
What is a Hash Table?
A hash table (also called hash map) is a data structure that implements an associative array abstract data type—a structure that maps keys to values. It uses a hash function to compute an index (hash code) into an array of buckets or slots, from which the desired value can be found.
Precise Definition:
- A hash table is a combination of:
- An array (the underlying storage)
- A hash function (maps keys to array indices)
- A collision resolution strategy (handles when multiple keys hash to same index)
Operations Supported:
- Insert/Put: Add a key-value pair to the table - O(1) average
- Delete/Remove: Remove a key-value pair from the table - O(1) average
- Search/Get: Retrieve the value associated with a key - O(1) average
- Update: Modify the value for an existing key - O(1) average
- Contains: Check if a key exists in the table - O(1) average
- Keys/Values/Entries: Iterate through all keys, values, or key-value pairs - O(n)
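As a concrete illustration, all of these operations map directly onto Python's built-in dict, which is implemented as a hash table:

```python
# Python's built-in dict is a hash table; each operation below
# runs in O(1) average time.
inventory = {}

inventory["apple"] = 5          # Insert/Put
inventory["apple"] = 7          # Update (key exists, value replaced)
print(inventory.get("apple"))   # Search/Get → 7
print("pear" in inventory)      # Contains → False

inventory["pear"] = 3
del inventory["apple"]          # Delete/Remove

for key, value in inventory.items():   # Iterate entries: O(n)
    print(key, value)
```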
What Problems Does It Solve That Simpler Structures Cannot?
- Fast lookups with arbitrary keys: Unlike arrays (which need integer indices), hash tables allow O(1) lookup with any hashable key type (strings, custom objects, etc.)
- Efficient membership testing: Checking if an element exists is O(1) average, compared to O(n) for unsorted arrays or O(log n) for sorted arrays
- Frequency counting and grouping: Perfect for counting occurrences or grouping related items
- Caching and memoization: Store computed results for fast retrieval
- Set operations: Efficient union, intersection, difference operations
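A minimal sketch of the frequency-counting and membership-testing patterns from the list above:

```python
# Frequency counting: O(n) total, O(1) average per element.
words = ["apple", "banana", "apple", "cherry", "banana", "apple"]
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1
print(counts["apple"])   # 3

# Membership testing with a hash set: O(1) average per query.
seen = set(words)
print("banana" in seen)  # True
print("durian" in seen)  # False
```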
Boundaries of the Concept:
Included in Hash Table Family:
- Hash maps (key-value pairs)
- Hash sets (keys only, no values)
- Multisets/bags (allow duplicate keys)
- Concurrent hash tables (thread-safe versions)
- Persistent hash tables (immutable versions)
- Distributed hash tables (DHT)
Excluded:
- Trees (balanced BSTs, tries) - different structure despite some overlap in use cases
- Direct-address tables - no hashing, just direct indexing
- Bloom filters - probabilistic, don't store actual data
Mathematical and Computational Foundations:
Hash tables build upon:
- Array indexing: O(1) random access
- Hash functions: Mapping from large key space to small index space
- Modular arithmetic: Computing indices within array bounds
- Probability theory: Load factor analysis, collision probability
- Amortized analysis: Understanding dynamic resizing costs
What Inspired Its Creation?
Hash tables emerged from the need to:
- Store and retrieve data faster than O(log n) binary search
- Use meaningful keys (strings, objects) instead of just integer indices
- Handle large symbol tables in compilers efficiently
- Implement associative memory in early AI systems
The concept evolved from:
- Scatter storage (1950s): Early ideas of computed indexing
- Symbol tables in compilers: Need for fast identifier lookup
- Database indexing: Efficient record retrieval
- Cache memory: Hardware-level associative lookup
Purpose and Context
Why Was the Hash Table Invented?
In the 1950s, as computers began processing larger datasets, researchers faced a fundamental challenge:
- Arrays required integer indices and had no semantic meaning
- Linked lists required O(n) search time
- Sorted arrays required O(log n) search but O(n) insertion
- Binary search trees weren't yet well-developed
Hans Peter Luhn (IBM, 1953) and others independently developed hashing to achieve near-constant-time operations using a mathematical transformation (hash function) to convert keys into array indices.
Why Is It Preferred Over Simpler Alternatives?
| Operation | Array (unsorted) | Array (sorted) | Linked List | Hash Table |
| --------- | ---------------- | -------------- | ----------- | ---------- |
| Search    | O(n)             | O(log n)       | O(n)        | O(1) avg   |
| Insert    | O(1)             | O(n)           | O(1)        | O(1) avg   |
| Delete    | O(n)             | O(n)           | O(n)        | O(1) avg   |
Hash tables win when:
- You need fast lookups with non-integer keys
- Insert/delete frequency is high
- You can afford extra memory (load factor < 1)
- You don't need ordered traversal
Real-World Motivating Problems:
- Compiler symbol tables (1950s-1960s):
  - Problem: During compilation, need to look up variable names thousands of times
  - Solution: Hash table maps identifier strings → memory locations/types
- Database indexing (1970s):
  - Problem: Retrieve customer records by ID/email instantly
  - Solution: Hash index for O(1) retrieval
- Spell checkers (1980s):
  - Problem: Check if a word exists in a 100,000+ word dictionary
  - Solution: Hash set for O(1) membership test
- Network routing (1990s):
  - Problem: Map IP addresses to MAC addresses (ARP tables)
  - Solution: Hash table for fast address resolution
Limitations Addressed:
- Linear search inefficiency: Hash tables provide O(1) average lookup vs O(n)
- Binary search insertion cost: Hash tables avoid the O(n) insertion time of maintaining sorted order
- Memory rigidity of arrays: Hash tables grow dynamically
- Lack of semantic keys: Arrays use indices; hash tables use meaningful keys
Historical Context
Evolution and Variants:
- 1950s - Birth of Hashing:
  - Hans Peter Luhn (IBM): Internal memorandum on hashing (1953)
  - Arnold Dumey: "Indexing for rapid random access memory" (1956)
  - W.W. Peterson: Analysis of open addressing (1957)
  - Early collision strategies: linear probing
- 1960s - Theoretical Foundation:
  - Bell Labs: Development of chaining
  - Donald Knuth: Systematic analysis in "The Art of Computer Programming"
- 1970s - Refinement:
  - Double hashing: Better probe sequences
  - Universal hashing theory (Carter & Wegman, 1979)
- 1980s-1990s - Practical Variants:
  - Robin Hood hashing (1986): Improved open addressing
  - Perfect hashing: No collisions for static datasets
  - Concurrent hash tables: Lock-free data structures
- 2000s-Present - Modern Innovations:
  - Cuckoo hashing (2001): Worst-case O(1) lookup
  - Swiss Tables (Google, 2017): SIMD-optimized hash tables
  - Consistent hashing: Distributed systems
  - Cryptographic hash tables: Secure against collision attacks
Current Variants:
| Variant         | Key Feature                 | Use Case                       |
| --------------- | --------------------------- | ------------------------------ |
| Chaining        | Linked lists at each bucket | General purpose, variable load |
| Open Addressing | Probing sequence            | Cache-friendly, bounded memory |
| Robin Hood      | Variance reduction          | Predictable performance        |
| Cuckoo          | Worst-case O(1) lookup      | Real-time systems              |
| Hopscotch       | Bounded search              | High performance               |
| Swiss Tables    | SIMD optimization           | Abseil's flat_hash_map (C++)   |
Key Milestones:
- 1953: First hash table implementation (Luhn)
- 1963: Knuth's analysis of linear probing (published analysis by Konheim & Weiss, 1966)
- 1979: Universal hashing proved (Carter & Wegman)
- 1995: SHA-1 standardized (FIPS 180-1)
- 2001: Cuckoo hashing introduced
- 2012: MurmurHash3 widely adopted
- 2017: Google's Swiss Tables (absl::flat_hash_map) set a new bar for hash table performance
Computational Problems Made Solvable:
- Real-time symbol lookup in compilers: Enabled faster compilation
- Large-scale database indexing: Made relational databases practical
- Distributed caching: Memcached, Redis rely on hash tables
- Deduplication at scale: Git uses SHA-1 hashing for content addressing
- Password verification: Cryptographic hashes (bcrypt, scrypt)
Historical Failures and Lessons
Famous Bugs and Failures:
- Perl Hash Collision DoS (2003):
  - Issue: Predictable hash functions allowed attackers to craft inputs that all hashed to the same bucket
  - Impact: POST requests with crafted keys caused O(n²) behavior, enabling DoS attacks
  - Lesson: Hash functions need randomization (salt) to prevent algorithmic complexity attacks
  - Fix: Randomized hash seeds; many languages later switched to SipHash
- Java HashMap Infinite Loop (pre-Java 8):
  - Issue: Concurrent modification during resize could create circular linked lists in buckets
  - Impact: get() operations would loop infinitely
  - Lesson: Need proper synchronization, or use ConcurrentHashMap
  - Fix: Java 8 switched to trees for large buckets and added better resize logic
- Redis Hash Collision Attack (2012):
  - Issue: Predictable hash function allowed attackers to degrade performance
  - Impact: Redis instances became unresponsive
  - Lesson: Even in-memory databases are vulnerable to hash attacks
  - Fix: Implemented hash function randomization
- Git Hash Collision (SHA-1, 2017):
  - Issue: Researchers demonstrated a SHA-1 collision (the SHAttered attack)
  - Impact: Theoretical ability to forge Git commits
  - Lesson: Cryptographic hash functions have a finite lifespan
  - Fix: Git is migrating to SHA-256
- Birthday Paradox Underestimation:
  - Mistake: Developers often underestimate collision probability
  - Example: A 32-bit hash has a 50% collision chance with only 77,163 elements
  - Lesson: The birthday paradox means collisions become likely after about √m insertions into a hash space of size m, not m
  - Fix: Use sufficient hash bit width (64-bit or 128-bit)
Common Historical Misconceptions:
- "Hash tables are always O(1)"
  - Reality: O(1) is the average case; the worst case is O(n) with all keys colliding
  - Modern understanding: Amortized analysis, probabilistic guarantees
- "Any hash function will do"
  - Reality: Poor hash functions cause clustering and performance degradation
  - Modern understanding: Need avalanche effect, good distribution
- "Larger tables always perform better"
  - Reality: Cache misses can make sparse tables slower than denser ones
  - Modern understanding: Balance load factor vs cache locality
- "Chaining is always better than open addressing"
  - Reality: Open addressing has better cache performance on modern CPUs
  - Modern understanding: Hardware matters; Swiss Tables use open addressing
"Obvious" Optimizations That Were Wrong:
- Using multiplication instead of modulo:
  - Thought: Modulo is slow; use (hash * 2654435761) >> shift instead
  - Reality: The multiply-shift result is only as good as the hash's high-order bits; with a weak hash it can distribute worse than modulo
  - Truth: It depends on CPU architecture and hash function
- Always using prime-sized tables:
  - Thought: Prime sizes avoid patterns in the modulo operation
  - Reality: Power-of-2 sizes enable a fast bitwise AND instead of modulo
  - Truth: Prime sizes help only with poor hash functions; good hash functions work with any size
- Pre-allocating huge tables to avoid resizing:
  - Thought: Eliminate resize cost by making the table oversized from the start
  - Reality: Memory waste and cache misses dominate; often slower than resizing
  - Truth: Dynamic resizing with a good load factor threshold (0.7-0.75) is optimal
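The power-of-2 point can be checked directly: when the table size m is a power of 2, `hash % m` is identical to `hash & (m - 1)`, a single AND instruction. A small sketch:

```python
m = 16          # table size, a power of 2
mask = m - 1    # 0b1111

# For any non-negative hash, modulo by a power of 2 equals a bitmask.
for h in [5, 21, 1_000_003, 0xDEADBEEF]:
    assert h % m == h & mask
    print(h, "→ bucket", h & mask)
```

This is exactly why good hash functions remove the need for prime table sizes: the mask keeps only the low-order bits, which is fine as long as those bits are well mixed.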
Real-World Incidents:
- Algorithmic Complexity Attacks (2011):
  - Multiple languages (PHP, Python, Ruby, Java) were vulnerable simultaneously
  - Crafted POST data caused hash collisions → O(n²) behavior
  - Web servers crashed under attack
  - Led to widespread adoption of randomized hashing
- Cloudflare Bug (2017):
  - A buffer overrun in edge cases caused memory disclosure
  - Leaked sensitive data (Cloudbleed)
  - Lesson: Careful bounds checking even in "safe" operations
Evolution of Best Practices:
| Era   | Practice                  | Rationale                               |
| ----- | ------------------------- | --------------------------------------- |
| 1960s | Simple division method    | Fast on limited hardware                |
| 1970s | Multiplication method     | Better distribution                     |
| 1980s | Prime table sizes         | Avoid patterns with poor hash functions |
| 1990s | Cryptographic hashes      | Security-conscious hashing              |
| 2000s | Randomized seeds          | Prevent DoS attacks                     |
| 2010s | SIMD-optimized structures | Leverage modern CPU features            |
| 2020s | Hardware-aware design     | Cache-oblivious algorithms              |
Philosophical and Conceptual
Fundamental Trade-offs:
- Time vs. Space:
  - Lower load factor → faster operations, more memory
  - Higher load factor → more collisions, less memory
  - Sweet spot: 0.7-0.75 load factor for most applications
- Determinism vs. Security:
  - Predictable hash functions → reproducible behavior, vulnerable to attacks
  - Randomized hash functions → secure, non-deterministic iteration order
- Simplicity vs. Performance:
  - Chaining → simple to implement, pointer overhead
  - Open addressing → complex deletion, better cache locality
- Worst-case vs. Average-case:
  - Standard hash table → O(1) average, O(n) worst
  - Cuckoo hashing → O(1) worst-case lookup, more complex insertion
Core Assumptions:
- Good hash function: Keys distribute uniformly across buckets
- Hashability: Keys can be hashed consistently (immutable, implement equality)
- Independence: Hash values are independent random variables (under simple uniform hashing assumption)
- Sufficient memory: Can maintain load factor below threshold
- Amortized cost acceptable: Occasional expensive resize is okay
Paradigm:
Hash tables follow the randomized algorithm paradigm:
- Use randomness (hash function) to achieve expected-case performance
- Trade worst-case guarantees for better average performance
- Rely on probabilistic analysis rather than deterministic bounds
Core Insight:
The fundamental insight of hashing is trading uniqueness for speed:
- We can't guarantee unique locations for keys (collisions happen)
- But we can make collisions rare enough that average-case is constant time
- The hash function compresses a large key space into a small index space
- Mathematically: We accept the pigeonhole principle (|Keys| > |Buckets|) and handle collisions efficiently
Philosophical Connections:
- Pigeonhole Principle: If you have more keys than buckets, collisions are inevitable
- Birthday Paradox: Collisions happen much sooner than intuition suggests
- Locality of Reference: Modern implementations optimize for cache performance
- Probabilistic Reasoning: Performance guarantees are statistical, not deterministic
1.2 Classification and Categorization
Type and Family
Data Structure Type:
- Primary Classification: Non-linear, random access data structure
- Abstract Data Type: Implements the Map/Dictionary ADT and Set ADT
- Storage Category: Array-based with computed indexing
Algorithm Family (for hashing):
- Searching Algorithms: Hash-based search
- String Algorithms: Rolling hash (Rabin-Karp)
- Cryptographic Algorithms: Secure hash functions (SHA-256, bcrypt)
Major Variations:
- By Collision Resolution:
  - Separate Chaining: Each bucket contains a linked list/array/tree
  - Open Addressing: All elements stored in the array itself
    - Linear probing
    - Quadratic probing
    - Double hashing
  - Robin Hood Hashing: Variant of open addressing with fairness
  - Cuckoo Hashing: Multiple hash functions, relocate on collision
  - Hopscotch Hashing: Bounded search neighborhood
- By Purpose:
  - Hash Map: Stores key-value pairs
  - Hash Set: Stores unique keys only
  - Multimap: Allows duplicate keys
  - Multiset/Bag: Allows duplicate elements
- By Implementation:
  - Fixed-size hash table: Static capacity
  - Dynamic hash table: Grows/shrinks with load factor
  - Perfect hash table: No collisions (for static datasets)
  - Minimal perfect hash table: No collisions, minimal space
- By Thread-Safety:
  - Single-threaded: Standard hash table
  - Concurrent hash table: Lock-free or fine-grained locking
  - Distributed hash table (DHT): Spread across multiple machines
- By Persistence:
  - Ephemeral: Standard mutable hash table
  - Persistent: Immutable, preserves old versions
  - Cache-oblivious: Optimized for unknown cache parameters
Most Common Variant:
Separate Chaining with Dynamic Resizing is most common because:
- Simple to implement correctly
- Handles variable load factors gracefully
- Deletion is straightforward
- Predictable performance
However, Open Addressing is gaining popularity in modern systems (e.g., Google's Swiss Tables) due to:
- Better cache locality
- No pointer overhead
- Smaller memory footprint
Distinguishing Factors:
| Aspect             | Chaining               | Open Addressing      | Cuckoo          |
| ------------------ | ---------------------- | -------------------- | --------------- |
| Collision Strategy | Linked structures      | Probe sequence       | Multiple tables |
| Cache Performance  | Poor (pointer chasing) | Excellent            | Good            |
| Deletion           | Easy                   | Complex (tombstones) | Easy            |
| Load Factor        | Can exceed 1.0         | Must stay < 1.0      | Typically < 0.5 |
| Worst-case Lookup  | O(n)                   | O(n)                 | O(1)            |
Domain and Application
Where Most Commonly Used:
- Compilers and Interpreters:
  - Symbol tables: Map identifiers to types, scopes, addresses
  - Constant pools: Deduplicate string/numeric literals
  - Example: Python's dict for variable namespaces
- Databases:
  - Hash indexes: Fast equality lookups (not range queries)
  - Join operations: Hash join algorithm
  - Example: PostgreSQL hash indexes
- Operating Systems:
  - Page tables: Virtual → physical address mapping
  - File systems: Directory structures (ext4, NTFS)
  - Process tables: PID → process structure
  - Buffer cache: Inode hashing
- Networking:
  - Routing tables: IP address lookup
  - ARP cache: IP → MAC address mapping
  - NAT tables: Connection tracking
  - DNS caching
- Caching Systems:
  - Web caches: URL → cached response
  - CDNs: Content distribution
  - Example: Redis, Memcached
- Security:
  - Password storage: Hash + salt
  - Digital signatures: Document integrity
  - Certificate validation: Key fingerprinting
Industries Relying Heavily:
| Industry           | Application       | Specific Use                      |
| ------------------ | ----------------- | --------------------------------- |
| Finance            | Trading systems   | Order matching, position tracking |
| E-commerce         | Product catalogs  | SKU → product info                |
| Social Media       | User data         | User ID → profile                 |
| Gaming             | Game state        | Entity ID → game object           |
| Telecommunications | Call routing      | Phone number → routing info       |
| Bioinformatics     | Sequence analysis | k-mer counting, genome assembly   |
Scale Suitability:
- Small Datasets (< 1,000 elements):
  - Hash tables work but may be overkill
  - Simple linear search might be comparable
  - Memory overhead not justified
- Medium Datasets (1,000 - 1,000,000):
  - Ideal range for hash tables
  - O(1) lookup clearly beats O(log n) trees
  - Memory footprint reasonable
- Large Datasets (1,000,000 - 1,000,000,000):
  - Hash tables excel if they fit in RAM
  - Cache performance becomes critical
  - Consider Swiss Tables or other cache-optimized variants
- Massive Datasets (> 1 billion):
  - May not fit in the memory of a single machine
  - Use distributed hash tables (DHT): Chord, Kademlia
  - Or disk-based hash structures: extendible hashing
- Streaming Data:
  - Use Count-Min Sketch or Bloom filters for approximate answers
  - Standard hash tables for exact tracking of recent items
Problem Type Excellence:
Hash tables excel at:
- Membership testing: "Does this element exist?"
- Counting/frequency: "How many times does X appear?"
- Grouping/partitioning: "Group elements by property"
- Deduplication: "Remove duplicates"
- Caching: "Have I computed this before?"
- Complement lookups: "Find a pair summing to a target" (Two Sum) via O(1) lookups
Hash tables are poor at:
- Ordered operations: Can't iterate in sorted order
- Range queries: Can't find "all keys between X and Y"
- Nearest neighbor: Can't find "closest key to X"
- Prefix searches: Can't find "all keys starting with X"
- (Use tries or prefix trees instead)
2. Theory and Principles
2.1 Core Concepts and Structure
Fundamental Ideas
Underpinning Concepts:
- Hashing Concept:
  - Transform an arbitrary key into a fixed-range integer
  - Key → Hash Function → Hash Code → Index
  - Hash code typically much larger than table size
  - Compression: index = hash_code % table_size
- Simple Uniform Hashing Assumption (SUHA):
  - Each key is equally likely to hash to any bucket
  - Hash values are uniformly distributed
  - Independence: Hash values of different keys are independent
  - This is an ideal; real hash functions approximate it
- Load Factor (α):
  - α = n / m (n = number of elements, m = table size)
  - Critical threshold: typically resize when α > 0.75
  - Determines average collision chain length
- Collision:
  - When two different keys hash to the same index
  - Inevitable by the pigeonhole principle when |keys| > |buckets|
  - Must be handled by a collision resolution strategy
Underlying Mathematical Model:
Hash tables are modeled as:
- Balls and Bins: Throwing n balls (keys) into m bins (buckets)
- Probability Theory: Analyze collision rates using birthday paradox
- Random Variables: Chain length is random variable; analyze expected value
Expected chain length under SUHA:
- E[chain length] = α (load factor)
- With chaining, performance degrades linearly as α grows
- With open addressing, must keep α < 1
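The balls-and-bins model is easy to check empirically. This sketch (assuming uniform random hashing, per SUHA) throws n keys into m buckets and compares the observed fraction of empty buckets with the prediction (1 - 1/m)^n ≈ e^(-α):

```python
import math
import random

def empty_fraction(n, m, seed=1):
    """Throw n balls into m bins; return the fraction of empty bins."""
    rng = random.Random(seed)
    buckets = [0] * m
    for _ in range(n):
        buckets[rng.randrange(m)] += 1
    return buckets.count(0) / m

n, m = 75_000, 100_000            # load factor α = 0.75
observed = empty_fraction(n, m)
expected = math.exp(-n / m)       # P(a given bucket stays empty) ≈ e^(-α)
print(observed, expected)         # the two values are very close
assert abs(observed - expected) < 0.01
```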
Invariants
Must Always Be True:
- Hash Consistency:
  - If key1.equals(key2), then hash(key1) == hash(key2)
  - Violation → the same key could be stored in multiple locations
  - Critical for correctness of get/put/delete
- Capacity Invariant:
  - For open addressing: num_elements < table_size (α < 1)
  - For chaining: no strict limit, but typically num_elements < threshold * table_size
- Probe Sequence Covers All Slots (Open Addressing):
  - Every position must be reachable from any starting position
  - Ensures we can always find an empty slot (when α < 1)
  - E.g., with linear probing: covers all slots trivially
  - E.g., with double hashing: the step size must be coprime to the table size
- Key Immutability:
  - Keys must not change after insertion (or their hash must not change)
  - If a key mutates → its hash changes → the element can't be found
  - Languages like Java enforce this for String, Integer, etc.
Boundary Conditions:
- Empty table: All operations should handle gracefully
- Full table (open addressing): Must trigger resize before truly full
- Load factor threshold: Resize at specific α to maintain performance
- Single element: Edge case for iteration, min/max operations
What Happens If Invariant Violated:
- Hash consistency broken:

  # Bad: a key whose hash depends on mutable state
  # (a raw list isn't even hashable in Python; a custom class shows the hazard)
  class Key:
      def __init__(self, v): self.v = v
      def __hash__(self): return hash(self.v)
      def __eq__(self, other): return self.v == other.v

  d = {}
  k = Key(1)
  d[k] = "value"   # Works initially
  k.v = 2          # Mutates key → hash changes
  print(k in d)    # False: the entry can no longer be found

- Capacity exceeded (open addressing):
  - Infinite loop in the probe sequence looking for an empty slot
  - Modern implementations prevent this by resizing proactively
Restoration Mechanisms:
- Resize operation: When load factor exceeds threshold, create new larger table and rehash all elements
- Rehashing: Recompute all indices for new table size
- Tombstones (open addressing): Mark deleted slots to maintain probe sequences
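The resize-and-rehash mechanism can be sketched in a few lines for a chaining table (bucket lists of (key, value) pairs; the function name is illustrative):

```python
def rehash(buckets):
    """Double the bucket array and reinsert every entry at its new index."""
    new_size = len(buckets) * 2
    new_buckets = [[] for _ in range(new_size)]
    for chain in buckets:
        for key, value in chain:
            # Indices change because the modulus changed.
            new_buckets[hash(key) % new_size].append((key, value))
    return new_buckets

buckets = [[("a", 1), ("b", 2)], [("c", 3)]]   # 2 buckets, α = 1.5
buckets = rehash(buckets)                       # 4 buckets, α = 0.75
# Every entry is still present, just possibly in a different bucket.
assert sorted(kv for chain in buckets for kv in chain) == [("a", 1), ("b", 2), ("c", 3)]
```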
Architecture and Design
Internal Organization:
Hash Table Structure:
┌─────────────────────────────────────┐
│ Hash Function (key → hash code) │
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│ Compression (hash code % size) │
└─────────────────────────────────────┘
↓
┌───────────────────────────────────────┐
│ Bucket Array │
│ ┌───┬───┬───┬───┬───┬───┬───┬───┐ │
│ │ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ │
│ └───┴───┴───┴───┴───┴───┴───┴───┘ │
└───────────────────────────────────────┘
↓
Collision Resolution
(Chaining or Open Addressing)
Components:
- Bucket Array:
  - Fixed-size array, replaced by a larger one when resizing
  - Each slot can hold one or more key-value pairs
  - Size typically a power of 2 (enables fast modulo via bitwise AND)
- Hash Function:
  - Maps a key to a hash code (typically a 32-bit or 64-bit integer)
  - Should distribute keys uniformly
  - Example: hash(key) = (a * key.hashCode() + b) % table_size
- Collision Resolution Mechanism:
  - Chaining: Each bucket points to a linked list/array/tree
  - Open Addressing: Probe sequence to find the next available slot
- Metadata:
  - size: Number of elements currently stored
  - capacity: Size of the underlying array
  - load_factor_threshold: When to resize (e.g., 0.75)
  - hash_seed (optional): Randomization to prevent attacks
Memory Layout:
Separate Chaining:
Bucket Array (contiguous memory):
┌─────┬─────┬─────┬─────┐
│ ptr │ ptr │ ptr │ ptr │ (pointers to chains)
└──┬──┴──┬──┴──┬──┴──┬──┘
│ │ │ │
v v v v
Node Node Node NULL
│ │ │
v v v
Node NULL Node
│ │
v v
NULL NULL
Each Node: [key | value | next_ptr]
Open Addressing (Linear Probing):
Single Array (contiguous memory):
┌──────────┬──────────┬──────────┬──────────┐
│ (k1, v1) │ (k2, v2) │ NULL │ (k3, v3) │
└──────────┴──────────┴──────────┴──────────┘
Probe sequence →
Logical vs. Physical Representation:
- Logical: Abstract map {key1: value1, key2: value2, ...}
- Physical: Array of buckets with collision chains or probe sequences
Pointers and Links:
- Chaining: Each bucket has pointer to first node in chain; nodes link to each other
- Open Addressing: No pointers; use arithmetic to find next probe location
- Cuckoo Hashing: Multiple arrays, each with own hash function
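Putting the components above together, here is a minimal separate-chaining map — an illustrative sketch, not production code (class and method names are my own):

```python
class ChainedHashMap:
    def __init__(self, capacity=8, load_factor=0.75):
        self.buckets = [[] for _ in range(capacity)]
        self.size = 0
        self.load_factor = load_factor

    def _index(self, key):
        # Compression: hash code → bucket index.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                 # key exists: update in place
                chain[i] = (key, value)
                return
        chain.append((key, value))       # collision handled by chaining
        self.size += 1
        if self.size > self.load_factor * len(self.buckets):
            self._resize()

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def _resize(self):
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]
        for chain in old:
            for k, v in chain:
                self.buckets[self._index(k)].append((k, v))

hm = ChainedHashMap()
hm.put("x", 1)
hm.put("x", 2)      # update, not a new entry
hm.put("y", 3)
assert hm.get("x") == 2 and hm.get("y") == 3 and hm.size == 2
```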
Mathematical Foundations
Key Formulas:
- Division Method: h(k) = k mod m
  - Simple and fast
  - m should be prime to avoid clustering
  - Poor if m is a power of 2: only the low-order bits of k are used
- Multiplication Method: h(k) = ⌊m * (k * A mod 1)⌋
  - A is a constant (the golden ratio conjugate ≈ 0.6180339887 works well)
  - k * A mod 1 extracts the fractional part
  - Works well for any m (a power of 2 is fine)
- Universal Hashing: h_ab(k) = ((a * k + b) mod p) mod m
  - Choose random a, b from [0, p-1] (with a ≠ 0), where p is a prime > |U| (the universe of keys)
  - Guarantee: for any two keys k1 ≠ k2, Pr[h(k1) = h(k2)] ≤ 1/m
  - Provides theoretical protection against adversarial inputs
- Load Factor: α = n / m
  - n = number of elements
  - m = table size
  - α determines average chain length (chaining) or probe length (open addressing)
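The three hashing methods above, sketched in code (the constants and prime p are illustrative choices):

```python
import random

def division_hash(k, m):
    """Division method: h(k) = k mod m."""
    return k % m

def multiplication_hash(k, m):
    """Multiplication method with A = golden ratio conjugate."""
    A = 0.6180339887
    return int(m * ((k * A) % 1))   # fractional part, scaled to [0, m)

def make_universal_hash(m, p=2_147_483_647):   # p = 2^31 - 1, a Mersenne prime
    """Universal family: pick random a, b once; return h_ab."""
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

m = 16
h = make_universal_hash(m)
for k in [7, 42, 123_456]:
    assert 0 <= division_hash(k, m) < m
    assert 0 <= multiplication_hash(k, m) < m
    assert 0 <= h(k) < m
```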
Collision Probability (Birthday Paradox):
For m buckets and n keys:
P(no collision) ≈ e^(-n² / 2m)
P(collision) ≈ 1 - e^(-n² / 2m)
For 50% collision chance: n ≈ 1.177√m
Example: 32-bit hash (m = 2³²) has 50% collision with only ~77,000 keys
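The birthday bound is easy to evaluate directly; this sketch solves 1 - e^(-n²/2m) = p for n:

```python
import math

def keys_for_collision_probability(m, p=0.5):
    """Approximate key count for collision probability p among m hash values."""
    return math.sqrt(2 * m * math.log(1 / (1 - p)))

n = keys_for_collision_probability(2**32)
print(round(n))   # → 77163, matching the figure quoted above
assert 77_000 < n < 77_500
```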
Expected Chain Length (Chaining):
Under SUHA, expected number of keys in a bucket:
E[chain length] = α = n/m
Expected Probes (Linear Probing):
Successful search:
E[probes] ≈ 1/2 * (1 + 1/(1-α))
Unsuccessful search:
E[probes] ≈ 1/2 * (1 + 1/(1-α)²)
As α → 1, probes → ∞ (table becomes full)
Theorems Supporting Correctness:
- Universal Hashing Theorem (Carter & Wegman, 1979):
  - For any two distinct keys, the probability of collision is at most 1/m
  - Guarantees good average-case performance even with adversarial input
- Chernoff Bounds:
  - The probability that a chain length exceeds its expected value by a large amount is exponentially small
  - Makes hash tables reliable in practice
Deriving Complexity:
Chaining - Successful Search:
- Expected chain length = α
- Expected comparisons = 1 + α/2
- With constant α, this is O(1)
Chaining - Unsuccessful Search:
- Must traverse entire chain
- Expected comparisons = α
- With constant α, this is O(1)
Open Addressing - Linear Probing:
- Clustering causes longer probe sequences
- Expected probes ≈ 1/(1-α)
- With α = 0.75, about 4 probes on average
Models and Representations
Visualization:
Separate Chaining:
Index Bucket (Chains)
0 → NULL
1 → [("apple", 5)] → NULL
2 → [("banana", 3)] → [("berry", 7)] → NULL
3 → NULL
4 → [("cherry", 9)] → NULL
5 → [("date", 2)] → [("durian", 4)] → NULL
6 → NULL
7 → [("fig", 6)] → NULL
Open Addressing (Linear Probing):
Index Key Value
0 "apple" 5
1 "banana" 3
2 "cherry" 9
3 <empty>
4 "date" 2
5 "fig" 6
6 <empty>
7 "grape" 8
If inserting "avocado" and hash("avocado") = 0:
- Bucket 0 occupied → probe bucket 1
- Bucket 1 occupied → probe bucket 2
- Bucket 2 occupied → probe bucket 3
- Bucket 3 empty → insert here
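The probing walkthrough above can be sketched directly (a smaller illustrative table; `None` marks an empty slot, and the hash function is forced to 0 to trigger the collision):

```python
def probe_insert(table, key, value, hash_fn):
    """Linear probing: scan from hash_fn(key) until an empty slot or the key."""
    m = len(table)
    start = hash_fn(key) % m
    for step in range(m):
        slot = (start + step) % m
        if table[slot] is None or table[slot][0] == key:
            table[slot] = (key, value)
            return slot
    raise RuntimeError("table full")

table = [("apple", 5), ("banana", 3), ("cherry", 9), None, ("date", 2)]
slot = probe_insert(table, "avocado", 1, hash_fn=lambda k: 0)  # collides at 0
assert slot == 3   # probes buckets 0, 1, 2 (occupied), lands in bucket 3
```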
Cuckoo Hashing:
Table 1 (hash1): Table 2 (hash2):
┌───┬─────────┐ ┌───┬─────────┐
│ 0 │ "apple" │ │ 0 │ "date" │
│ 1 │ <empty> │ │ 1 │ "fig" │
│ 2 │ "banana"│ │ 2 │ <empty> │
│ 3 │ "cherry"│ │ 3 │ "grape" │
└───┴─────────┘ └───┴─────────┘
Key exists if found in table1[hash1(key)] OR table2[hash2(key)]
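That two-slot lookup rule can be sketched as follows. The hash functions are toy choices for illustration, and the placement loop is simplified (real cuckoo insertion evicts and relocates on collision; here the keys happen to fit without eviction):

```python
def cuckoo_contains(t1, t2, key, h1, h2):
    """A key may live in exactly one of two fixed slots — check both."""
    return t1[h1(key)] == key or t2[h2(key)] == key

# Toy hash functions for the sketch.
h1 = lambda k: sum(map(ord, k)) % 4
h2 = lambda k: (sum(map(ord, k)) // 4) % 4

t1, t2 = [None] * 4, [None] * 4
# Simplified placement: table-1 slot if free, else table-2 slot.
for key in ["apple", "banana", "cherry", "fig"]:
    if t1[h1(key)] is None:
        t1[h1(key)] = key
    else:
        t2[h2(key)] = key

assert cuckoo_contains(t1, t2, "apple", h1, h2)
assert not cuckoo_contains(t1, t2, "grape", h1, h2)
```

Because lookup touches at most two slots, it is worst-case O(1) regardless of how insertions went.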
Alternative Mental Models:
-
Phone Book Analogy:
- Hash function = "Turn to the section for first letter of last name"
- Bucket = "Page of names starting with that letter"
- Collision = "Multiple people with same first letter"
-
Library Catalog:
- Hash function = "Dewey Decimal System category"
- Bucket = "Shelf for that category"
- Collision = "Multiple books in same category"
-
Post Office Boxes:
- Hash function = "PO Box number assignment"
- Bucket = "Individual box"
- Collision = "Shared box for family members"
Notation:
- T[i]: Bucket at index i
- h(k): Hash value of key k
- h'(k): Secondary hash function (double hashing)
- k ↦ v: Key k maps to value v
- α = n/m: Load factor
- ⟨k, v⟩: Key-value pair
2.2 Logical Framework
Invariants and Properties
Core Invariants:
- Hash Determinism:
  - The same key always produces the same hash value
  - Formally: ∀x, h(x) is a function (not a relation)
- Equality Consistency:
  - If keys are equal, hashes must be equal
  - Formally: k₁ = k₂ ⇒ h(k₁) = h(k₂)
  - Contrapositive: h(k₁) ≠ h(k₂) ⇒ k₁ ≠ k₂
- Uniqueness Invariant (for maps):
  - Each key appears at most once
  - Inserting a duplicate key updates the existing value
- Capacity Constraint (open addressing):
  - n < m always (number of elements < table size)
Boundary Conditions:
- α = 0: Empty table, all operations trivial
- α → 1 (open addressing): Performance degrades severely, must resize
- α ≫ 1 (chaining): Long chains, approaching O(n) search
Invariant Violation Consequences:
- Hash changes for same key:

  import random

  class BadKey:
      def __init__(self, val):
          self.val = val
      def __hash__(self):
          return random.randint(0, 100)  # WRONG: non-deterministic

  d = {}
  key = BadKey(5)
  d[key] = "value"
  print(d[key])  # May fail: hash changed between put and get

- Mutable key:

  d = {}
  key = [1, 2, 3]      # Lists are mutable, can't be hashed
  d[key] = "value"     # Raises TypeError in Python
Restoration After Violation:
- Resize operation: Rehash all elements to restore load factor invariant
- Deletion with tombstones: Maintain probe sequence integrity in open addressing
- Rebuild hash table: If corruption detected, reconstruct from scratch
Correctness and Proofs
Algorithm Correctness:
Theorem: Hash table correctly implements the Map ADT under SUHA.
Proof Sketch:
- Insertion: Key k hashed to index i, stored at T[i] (or chain at T[i])
- Retrieval: Key k hashed to same index i, found at T[i]
- Deletion: Key k hashed to index i, removed from T[i]
- Correctness: Since h(k) is deterministic, same key always maps to same location
Loop Invariant (for searching in chaining):
def search(table, key):
index = hash(key) % len(table)
node = table[index]
while node is not None:
if node.key == key:
return node.value
node = node.next
return None
Invariant: At the start of each iteration, if key exists in the chain, it's either at node or later in the chain.
Initialization: True before first iteration (key is either at first node or later). Maintenance: If key not at current node, move to next; invariant still holds. Termination: Either found key or reached end of chain (key doesn't exist).
Proof of Termination:
- Chains are finite (acyclic linked lists)
- Each iteration advances to next node
- Eventually reaches end (NULL)
- Therefore, loop terminates
Open Addressing Termination:
For linear probing with α < 1:
- There exists at least one empty slot
- Probe sequence covers all slots
- Therefore, will eventually find empty slot or key
Optimal Substructure:
Hash tables don't directly exhibit optimal substructure (not a DP problem), but collision resolution has recursive structure:
Chaining: Insert into chain is a linked list insertion subproblem Cuckoo hashing: Eviction creates subproblem of inserting evicted element
Assumptions and Requirements
Preconditions:
- Keys must be hashable:
  - Immutable (or hash based only on immutable properties)
  - Implement __hash__() and __eq__() (Python) or hashCode() and equals() (Java)
- Sufficient memory:
  - Enough memory to keep the load factor below the threshold
  - Can allocate a new array during resize
- Hash function available:
  - Efficient to compute
  - Good distribution properties
Postconditions:
After put(key, value):
- Key exists in the table
- get(key) returns value
- Size incremented by 1 (if the key was new)
After delete(key):
- Key no longer in the table
- get(key) returns NULL/None
- Size decremented by 1
After resize():
- Load factor below threshold
- All keys still retrievable
- Indices may have changed (rehashing occurred)
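These postconditions can be expressed as executable checks against Python's dict, which is a hash table (resizing happens implicitly as the dict grows):

```python
d = {}

# After put(key, value): key exists, get returns value, size grew by 1.
d["k"] = 42
assert "k" in d and d.get("k") == 42 and len(d) == 1

# After resize (triggered internally by growth): all keys still retrievable.
for i in range(1000):                 # forces several internal resizes
    d[i] = i * i
assert all(d[i] == i * i for i in range(1000))

# After delete(key): key gone, get returns None, size shrank by 1.
del d["k"]
assert "k" not in d and d.get("k") is None and len(d) == 1000
```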
Assumption Violations:
- Mutable keys:

  # Violation
  my_dict = {}
  key = [1, 2]            # Mutable
  my_dict[key] = "value"  # Python raises: TypeError: unhashable type: 'list'

- Inconsistent equals/hash:

  // Java example
  class BadKey {
      int val;
      @Override
      public boolean equals(Object o) { return ((BadKey) o).val == this.val; }
      // Missing hashCode() override
      // Violates contract: equal objects must have equal hash codes
  }

- Hash function returns different values:

  import random

  class BadHash:
      def __hash__(self):
          return random.randint(0, 1000)  # WRONG: non-deterministic
3. Implementation and Mechanics
3.1 How It Works
Internal Workings
Data Flow During Operations:
- Insertion Flow:

Key + Value
  ↓
Compute hash code: hash(key)
  ↓
Compress to index: hash(key) % table_size
  ↓
Check bucket at index:
  - Empty? → Insert directly
  - Occupied? → Handle collision

- Search Flow:

Key
  ↓
Compute hash code: hash(key)
  ↓
Compress to index: hash(key) % table_size
  ↓
Check bucket:
  - Chaining: Traverse linked list, compare keys
  - Open Addressing: Probe sequence, compare keys
  ↓
Found? → Return value
Not found? → Return NULL/None
Element Storage:
Chaining:
- Bucket array stores pointers to first node of chain
- Each node stores: (key, value, next_pointer)
- Nodes allocated dynamically on heap
Open Addressing:
- Bucket array stores (key, value) pairs directly
- Empty slots marked as NULL or special sentinel
- Deleted slots marked with tombstone (DELETED marker)
Modification:
- Update: Find key, replace value in place
- Delete: Remove entry, potentially leave tombstone
- Resize: Allocate new array, rehash all entries
Bit/Byte Level:
Hash Computation (for integers):
def hash_int(key, table_size):
    # Step 1: Mix bits using multiplication
    key = ((key >> 16) ^ key) * 0x45d9f3b
    key = ((key >> 16) ^ key) * 0x45d9f3b
    key = (key >> 16) ^ key
    # Step 2: Reduce to table range
    index = key % table_size
    return index
String Hashing (polynomial rolling hash):
def hash_string(s, table_size):
    hash_val = 0
    prime = 31  # Small prime
    for char in s:
        hash_val = (hash_val * prime + ord(char)) % (2**32)
    return hash_val % table_size
At byte level:
- Each character converted to ASCII/Unicode value
- Accumulated into hash using multiplication and addition
- Final modulo ensures result fits in table
Operation Mechanics
Insert Operation (Separate Chaining):
Step-by-step:
- Compute index = hash(key) % table_size
- Access bucket at table[index]
- Traverse chain checking for existing key
- If found: Update value
- If not found: Create new node, prepend to chain
- Increment size counter
- If size/capacity > load_threshold: resize and rehash
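The insertion steps above can be sketched as a minimal chaining `put` (the `Node` class and fixed bucket count are illustrative assumptions; the size counter and resize step are omitted for brevity):

```python
class Node:
    def __init__(self, key, value, nxt=None):
        self.key, self.value, self.next = key, value, nxt

def put(table, key, value):
    """Insert or update (key, value) in a chained bucket array."""
    index = hash(key) % len(table)          # Step 1: compute index
    node = table[index]                     # Step 2: access bucket
    while node is not None:                 # Step 3: scan for existing key
        if node.key == key:
            node.value = value              # Found: update in place
            return
        node = node.next
    table[index] = Node(key, value, table[index])  # Not found: prepend

table = [None] * 7
put(table, "apple", 5)
put(table, "apple", 6)   # Updates the existing node, no duplicate
```

Prepending to the chain keeps insertion O(1) once the duplicate check is done; appending would require walking to the tail.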
Delete Operation (Separate Chaining):
- Compute index = hash(key) % table_size
- Access bucket at table[index]
- Traverse chain with two pointers (prev, current)
- If key found:
- If at head: table[index] = current.next
- Else: prev.next = current.next
- Decrement size counter
- If not found: No-op or raise exception
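The two-pointer deletion above, as a minimal sketch (the `Node` class and the single-bucket table that forces collisions are illustrative assumptions):

```python
class Node:
    def __init__(self, key, value, nxt=None):
        self.key, self.value, self.next = key, value, nxt

def delete(table, key):
    """Unlink key from its chain using prev/current pointers."""
    index = hash(key) % len(table)
    prev, current = None, table[index]
    while current is not None:
        if current.key == key:
            if prev is None:                  # Key at head of chain
                table[index] = current.next
            else:                             # Key mid-chain
                prev.next = current.next
            return True
        prev, current = current, current.next
    return False                              # Not found: no-op

# One bucket, so every key collides and forms a chain: c -> b -> a
table = [None]
for k in ("a", "b", "c"):
    table[0] = Node(k, ord(k), table[0])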
Search Operation:
- Compute index = hash(key) % table_size
- Access bucket at table[index]
- Chaining: Traverse linked list, compare each key
- Open Addressing: Probe sequence until found or empty slot
- Return value if found, else None
Insert Operation (Open Addressing - Linear Probing):
1. index = hash(key) % table_size
2. i = 0
3. while True:
4. probe_index = (index + i) % table_size
5. if table[probe_index] is EMPTY or DELETED:
6. table[probe_index] = (key, value)
7. size++
8. break
9. if table[probe_index].key == key:
10. table[probe_index].value = value # Update
11. break
12. i++
13. if size/capacity > threshold: resize()
Delete Operation (Open Addressing - with Tombstones):
1. index = hash(key) % table_size
2. i = 0
3. while True:
4. probe_index = (index + i) % table_size
5. if table[probe_index] is EMPTY:
6. return # Not found
7. if table[probe_index].key == key:
8. table[probe_index] = DELETED # Tombstone
9. size--
10. break
11. i++
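The two probe loops above translate into a small runnable table (the sentinel objects and fixed capacity are illustrative assumptions; like the pseudocode, this `put` reuses the first tombstone it meets, so a production table would keep probing to rule out a duplicate key past the tombstone):

```python
EMPTY, DELETED = object(), object()   # Sentinels for slot states

class ProbingTable:
    def __init__(self, capacity=8):
        self.slots = [EMPTY] * capacity

    def put(self, key, value):
        index = hash(key) % len(self.slots)
        for i in range(len(self.slots)):
            probe = (index + i) % len(self.slots)
            slot = self.slots[probe]
            if slot is EMPTY or slot is DELETED:
                self.slots[probe] = (key, value)
                return
            if slot[0] == key:
                self.slots[probe] = (key, value)  # Update existing key
                return
        raise RuntimeError("table full")

    def get(self, key):
        index = hash(key) % len(self.slots)
        for i in range(len(self.slots)):
            probe = (index + i) % len(self.slots)
            slot = self.slots[probe]
            if slot is EMPTY:                 # Stop at EMPTY, not DELETED
                return None
            if slot is not DELETED and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key):
        index = hash(key) % len(self.slots)
        for i in range(len(self.slots)):
            probe = (index + i) % len(self.slots)
            slot = self.slots[probe]
            if slot is EMPTY:
                return
            if slot is not DELETED and slot[0] == key:
                self.slots[probe] = DELETED   # Tombstone keeps probe chains intact
                return
```

Note how `get` skips over DELETED slots but halts at EMPTY: that asymmetry is exactly why tombstones are needed.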
Resize Operation:
1. old_table = table
2. table = new array of size (2 * old_capacity) or next prime
3. size = 0
4. for each bucket in old_table:
5. if chaining:
6. for each node in chain:
7. insert(node.key, node.value) # Rehash into new table
8. else:
9. if bucket not EMPTY and not DELETED:
10. insert(bucket.key, bucket.value)
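The rehash loop above, sketched with Python lists of (key, value) pairs standing in for linked-node chains (an assumption made for brevity):

```python
def resize(table):
    """Double the bucket count and rehash every entry."""
    new_table = [[] for _ in range(2 * len(table))]
    for bucket in table:
        for key, value in bucket:
            # Indices change: different table size → different % result
            new_table[hash(key) % len(new_table)].append((key, value))
    return new_table

table = [[] for _ in range(4)]
for k in range(10):
    table[hash(k) % 4].append((k, k * k))
table = resize(table)
```

Every entry survives the move, but its bucket index is recomputed against the new capacity.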
Algorithm Execution
Typical Execution Flow (Chaining):
# Example: Inserting multiple elements
hash_table = HashTable(size=7)
hash_table.put("apple", 5)
# 1. hash("apple") = 47897 → 47897 % 7 = 3
# 2. table[3] is empty
# 3. Create node ("apple", 5), set table[3] = node
hash_table.put("banana", 3)
# 1. hash("banana") = 52383 → 52383 % 7 = 2
# 2. table[2] is empty
# 3. Create node ("banana", 3), set table[2] = node
hash_table.put("apricot", 7)
# 1. hash("apricot") = 47904 → 47904 % 7 = 3
# 2. table[3] is occupied (collision!)
# 3. Create node ("apricot", 7)
# 4. Prepend to chain: node.next = table[3], table[3] = node
Phases of Hash Table Lifecycle:
-
Initialization:
- Allocate array of specified size
- Initialize all buckets to NULL/empty
- Set size = 0, capacity = initial_size
-
Growth Phase:
- Elements inserted
- Load factor increases
- When α > threshold: resize (typically at α = 0.75)
-
Steady State:
- Mix of insertions, deletions, lookups
- Occasional resizes maintain load factor
- Performance remains O(1) average
-
Shrink Phase (optional):
- If many deletions, may resize down to save memory
- Less common in practice
Iteration Lifecycle:
For linear probing insertion:
Iteration 0: Check table[hash(key) % size]
Iteration 1: Check table[(hash(key) + 1) % size]
Iteration 2: Check table[(hash(key) + 2) % size]
...
Until: Empty slot found or key matched
State and Transitions
Possible States:
- Empty: No elements, all buckets NULL
- Partially Full: 0 < α < threshold
- Resize Triggered: α ≥ threshold, needs expansion
- Resizing: In process of allocating new array and rehashing
- Full (open addressing): α → 1, imminent performance collapse
State Transitions:
Empty --[insert]--> Partially Full
Partially Full --[insert (α > threshold)]--> Resize Triggered
Resize Triggered --[resize()]--> Resizing
Resizing --[rehash complete]--> Partially Full
Partially Full --[delete all]--> Empty
Triggers for State Changes:
- Insert: May trigger resize if α exceeds threshold
- Delete: May trigger shrink if α falls below lower threshold (rare)
- Resize: Triggered automatically by insert/delete
Internal Bookkeeping:
- size: Current number of elements
- capacity: Size of underlying array
- threshold: Size at which to resize (e.g., 0.75 * capacity)
- mod_count (optional): For fail-fast iterators, to detect concurrent modification
Open Addressing Specific:
- Track number of DELETED slots to decide when to rebuild (remove tombstones)
3.2 Dependencies and Interactions
Internal Dependencies
Component Interactions:
-
Hash Function ↔ Compression:
- Hash function produces hash code (any integer)
- Compression function (% table_size) maps to valid index
- Quality of hash function determines distribution
-
Collision Resolution ↔ Bucket Structure:
- Chaining requires linked list or dynamic array per bucket
- Open addressing requires probe sequence calculation
- Choice of resolution strategy determines memory layout
-
Load Factor ↔ Resize:
- Load factor monitored after each insert/delete
- Triggers resize when threshold exceeded
- Resize involves rehashing, which uses hash function again
Operation Dependencies:
- Resize depends on Insert/Delete: Size changes trigger load factor check
- Insert depends on Search: Must check if key exists before inserting
- Delete depends on Search: Must find key before deletion
Concurrent Modifications:
- Problem: Two threads inserting simultaneously can corrupt structure
- Chaining: May lose elements if both modify same chain
- Open Addressing: Probe sequences can be broken, elements lost
Example Failure (Java):
// Thread 1: Inserting during resize
HashMap<String, Integer> map = new HashMap<>();
// Thread 1 calls put() → triggers resize
// Thread 2 calls put() or get() during resize
// Result: lost entries, corrupted buckets, or (in pre-Java-8 HashMap)
// an infinite loop caused by a cycle in a rehashed chain
Solution: Use ConcurrentHashMap with fine-grained locking or lock-free algorithms
Causal Chains
Single Operation Cascading Effects:
Insert causing resize:
Insert "key" with α = 0.75 (at threshold)
→ size++ makes α = 0.76
→ Triggers resize()
→ Allocate new array (2x size)
→ Rehash ALL existing elements
→ Hash codes recomputed
→ Indices change (different table size → different % result)
→ All elements move to new locations
→ Old array garbage collected
Delete in open addressing:
Delete "key" at index i
→ Place DELETED tombstone at i
→ Future searches must continue past tombstone
→ Future inserts can reuse tombstone slot
→ If too many tombstones: degrade to O(n) search
→ Eventually triggers rebuild to remove tombstones
Hash collision propagation:
Poor hash function causes clustering
→ Multiple keys hash to same region
→ Collisions increase
→ Chains grow longer (chaining) or probes increase (open addressing)
→ Search time increases
→ More likely to trigger resize
→ Resize computes new hashes (hopefully better distribution)
Feedback Loops:
-
Negative Feedback (resize):
- High load factor → slow operations
- Slow operations → trigger resize
- Resize → lower load factor
- Lower load factor → faster operations
-
Positive Feedback (clustering in poor hash functions):
- Cluster forms → more collisions in that region
- Collisions → longer probe sequences
- Longer probes → more likely to add to cluster
- Cluster grows → even worse performance
Conditional Factors
External Factors Affecting Behavior:
-
Input Size:
- Small n: Hash table overhead may dominate, simple array faster
- Large n: Hash table O(1) advantage clear
-
Input Distribution:
- Uniformly random keys → excellent performance
- Clustered keys (e.g., sequential IDs) → depends on hash function quality
- Adversarial input (all hash to same bucket) → O(n) worst case
-
Key Types:
- Integers: Fast to hash (bitwise operations)
- Strings: Slower to hash (must process each character)
- Custom objects: Hash quality depends on implementation
Performance Variation:
| Input Pattern              | Chaining   | Open Addressing |
| -------------------------- | ---------- | --------------- |
| Random keys                | O(1)       | O(1)            |
| Sequential keys, good hash | O(1)       | O(1)            |
| Sequential keys, bad hash  | O(n) worst | O(n) worst      |
| Many duplicates            | O(1)       | O(1)            |
| All collide                | O(n)       | O(n)            |
Environmental Factors:
-
Cache Size:
- Small cache: Chaining suffers (pointer chasing)
- Large cache: Chaining benefits from prefetching
- Open addressing generally better cache performance
-
Processor:
- Branch predictor quality affects conditional-heavy code
- SIMD capabilities: Swiss Tables use SIMD for faster lookup
-
Memory Allocator:
- Chaining: Many small allocations (node per element)
- Open addressing: Few large allocations (array only)
- Allocator speed affects chaining more
-
Programming Language:
- Python: dict is built-in, optimized open addressing
- Java: HashMap uses chaining; since Java 8, buckets whose chains exceed 8 nodes are converted to red-black trees
- C++: std::unordered_map uses chaining (effectively required by the standard's bucket API); open-addressing alternatives such as Abseil's Swiss Tables (absl::flat_hash_map) are popular drop-in replacements
Emergent Properties
Complex Behaviors from Simple Rules:
-
Clustering (Linear Probing):
- Simple rule: "If occupied, try next slot"
- Emergence: Long contiguous runs of occupied slots form
- Consequence: "Rich get richer" - clusters attract more collisions
-
Load Factor Sweet Spot:
- Simple rule: "Resize when α > 0.75"
- Emergence: Trade-off between memory and speed naturally optimizes
- Consequence: Industry standard of 0.75 emerges across implementations
-
Hash Function Sensitivity:
- Simple rule: "Map key to integer"
- Emergence: Tiny changes in hash function drastically affect distribution
- Example: Changing prime in polynomial hash changes entire distribution
Local Operations → Global Properties:
-
Local: Insert single element into bucket
-
Global: Distribution of chain lengths follows Poisson distribution (under SUHA)
-
Local: Single key hashes to index
-
Global: Birthday paradox means collisions likely with √m elements
Patterns from Repeated Operations:
-
Pattern: Insert, insert, insert → resize → insert, insert, insert → resize
-
Emergence: Geometric growth pattern (table size: 16, 32, 64, 128, ...)
-
Pattern: Insert many elements with same hash prefix
-
Emergence: Primary clustering in linear probing (long runs of occupied slots)
5. Complexity Analysis
5.1 Time Complexity
Operation-Wise Analysis
Hash Table Operations - Separate Chaining:
| Operation | Best Case | Average Case | Worst Case | Amortized |
| --------- | --------- | ------------ | ---------- | --------- |
| Search    | O(1)      | O(1 + α)     | O(n)       | O(1)      |
| Insert    | O(1)      | O(1 + α)     | O(n)       | O(1)      |
| Delete    | O(1)      | O(1 + α)     | O(n)       | O(1)      |
| Resize    | -         | O(n)         | O(n)       | O(1)      |
Where α = n/m (load factor), n = number of elements, m = table size.
Hash Table Operations - Open Addressing (Linear Probing):
| Operation             | Best Case | Average Case | Worst Case |
| --------------------- | --------- | ------------ | ---------- |
| Search (successful)   | O(1)      | O(1/(1-α))   | O(n)       |
| Search (unsuccessful) | O(1)      | O(1/(1-α)²)  | O(n)       |
| Insert                | O(1)      | O(1/(1-α)²)  | O(n)       |
| Delete                | O(1)      | O(1/(1-α))   | O(n)       |
Complexity Analysis by Hash Table Topic:
Topic 58: Hash Table Fundamentals
- Create empty table: O(m) where m is initial capacity
- Hash function computation: O(k) where k is key size
- Index computation: O(1)
Topic 59: Hash Functions and Collision Handling
- Division method: O(1)
- Multiplication method: O(1)
- Universal hashing: O(1) for computation
- Polynomial rolling hash: O(k) for string of length k
- Collision resolution (chaining): O(chain length) = O(1 + α) average
- Collision resolution (probing): O(probes) = O(1/(1-α)) average
Topic 60: Chaining vs Open Addressing
Chaining:
- Insert: O(1) if inserting at head, O(n) if checking duplicates
- Search: O(1 + α) average
- Delete: O(1 + α) average
- Supports α > 1 (load factor can exceed 1)
Open Addressing (Linear Probing):
- Insert: O(1/(1-α)²) average
- Search: O(1/(1-α)) average
- Delete: O(1/(1-α)) with tombstones
- Must maintain α < 1
Open Addressing (Double Hashing):
- Better distribution than linear probing
- Insert: O(1/(1-α)) average
- Search: O(1/(1-α)) average
Topic 61: Load Factor and Rehashing
- Load factor computation: O(1)
- Rehashing operation: O(n + m) where n = elements, m = old capacity
- Amortized cost per insert with rehashing: O(1)
- Typical threshold: α = 0.75
Topic 62: HashMap Implementation
- get(key): O(1) average, O(n) worst
- put(key, value): O(1) average, O(n) worst
- remove(key): O(1) average, O(n) worst
- containsKey(key): O(1) average, O(n) worst
- keySet/values/entries iteration: O(n + m)
Topic 63: HashSet Implementation
- add(element): O(1) average, O(n) worst
- remove(element): O(1) average, O(n) worst
- contains(element): O(1) average, O(n) worst
- size(): O(1)
- isEmpty(): O(1)
Topic 64: Two Sum Problem and Variations
Two Sum:
Time: O(n) - single pass with hash map
Space: O(n) - store n elements in hash map
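The single-pass hash-map approach described above, sketched:

```python
def two_sum(nums, target):
    """Return indices of two numbers summing to target, else None."""
    seen = {}                          # value -> index of earlier element
    for i, x in enumerate(nums):
        if target - x in seen:         # Complement already visited: O(1) lookup
            return [seen[target - x], i]
        seen[x] = i
    return None
```

Storing each value's index as we go means every pair is examined exactly once, giving O(n) time for the n lookups and inserts.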
Two Sum II (sorted array):
Time: O(n) - two pointer approach (no hash table needed)
Space: O(1) - two pointers only
Three Sum:
Time: O(n²) - O(n) for outer loop, O(n) for two-sum per iteration
Space: O(n) - hash set for duplicates
Four Sum:
Time: O(n³) - three nested loops
Space: O(n) - hash set
Topic 65: Group Anagrams Using Hash Map
Time: O(n * k log k) where n = number of strings, k = max string length
- Sorting each string: O(k log k)
- n strings total: O(n * k log k)
Alternative with counting:
Time: O(n * k) where k is string length
- Count chars for each string: O(k)
- n strings: O(n * k)
Space: O(n * k) - storing all strings in groups
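The sorted-key grouping above can be sketched in a few lines:

```python
from collections import defaultdict

def group_anagrams(strs):
    """Group words sharing the same sorted-character key: O(n * k log k)."""
    groups = defaultdict(list)
    for s in strs:
        groups["".join(sorted(s))].append(s)   # Anagrams share one key
    return list(groups.values())
```

Swapping the sorted key for a 26-entry character-count tuple gives the O(n * k) counting variant.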
Topic 66: Longest Consecutive Sequence
Time: O(n) - add all to set O(n), check sequences O(n)
Space: O(n) - hash set storing n elements
Key insight: Each sequence only checked once from its start
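A minimal sketch of that key insight — only an element with no predecessor in the set starts a count:

```python
def longest_consecutive(nums):
    """O(n): each sequence is walked exactly once, from its smallest element."""
    num_set = set(nums)
    best = 0
    for x in num_set:
        if x - 1 not in num_set:          # x is the start of a sequence
            length = 1
            while x + length in num_set:  # Extend while successors exist
                length += 1
            best = max(best, length)
    return best
```

Without the `x - 1 not in num_set` guard the inner walk would restart inside every sequence, degrading to O(n²).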
Topic 67: Subarray Sum Equals K
Time: O(n) - single pass through array
Space: O(n) - hash map storing prefix sums
Optimization: Prefix sum + hash map approach
Without hash map: O(n²) nested loops
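The prefix-sum + hash-map optimization above, sketched:

```python
def subarray_sum(nums, k):
    """Count subarrays summing to k via prefix sums in a hash map: O(n)."""
    count = 0
    prefix = 0
    seen = {0: 1}                          # prefix sum -> times seen so far
    for x in nums:
        prefix += x
        count += seen.get(prefix - k, 0)   # Earlier prefix closing a k-sum window
        seen[prefix] = seen.get(prefix, 0) + 1
    return count
```

Each earlier prefix equal to `prefix - k` marks the left edge of a subarray summing to k, so one pass replaces the O(n²) nested loops.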
Topic 68: Designing Hash Map from Scratch
- Initial allocation: O(m) where m = capacity
- All operations: Same as standard hash map above
- Implementing custom hash function: Depends on key type
- Implementing collision resolution: Depends on strategy
Topic 69: Rolling Hash (Rabin-Karp Algorithm)
Pattern matching in text:
Time: O(n + m) average, O(nm) worst
where n = text length, m = pattern length
Rolling hash computation: O(1) per character
Initial hash: O(m) for pattern
Hash update: O(1) per slide
Space: O(1) - only store hash values
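A minimal Rabin-Karp sketch showing the O(m) initial hash and the O(1) roll per slide (the base and modulus are illustrative choices):

```python
def rabin_karp(text, pattern, base=256, mod=10**9 + 7):
    """Return all starting indices where pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)             # Weight of the outgoing character
    p_hash = t_hash = 0
    for i in range(m):                       # Initial hashes: O(m)
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        # Verify on hash match to rule out spurious collisions
        if t_hash == p_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:                        # Roll the window: O(1)
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return matches
```

The explicit string comparison on a hash match is what keeps the worst case O(nm) while the average stays O(n + m).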
Overall Algorithm Complexity
Deriving Hash Table Complexity from First Principles:
Successful Search Analysis (Chaining):
Under Simple Uniform Hashing Assumption (SUHA):
- Each of n keys equally likely to hash to any of m buckets
- Expected number of keys in a bucket = α = n/m
Expected search time:
E[search time] = 1 + E[chain length]
= 1 + α
= O(1 + α)
With constant load factor (α ≤ 0.75), this is O(1).
Insert Analysis (Open Addressing - Linear Probing):
Probability that the first i probed slots are all occupied: ≈ α^i (uniform-hashing approximation, ignoring clustering)
Expected number of probes:
E[probes] = Σ(i=0 to ∞) α^i
= 1/(1-α) (geometric series)
For α = 0.75: E[probes] ≈ 4
Rehashing Amortized Analysis:
Suppose we double table size when α > 0.75:
- Rehash at n = 1, 2, 4, 8, 16, ..., n
- Cost of i-th rehash: O(2^i)
- Total cost for n insertions:
Σ(i=0 to log n) 2^i = 2^(log n + 1) - 1 = 2n - 1 = O(n)
- Amortized cost per insert: O(n)/n = O(1)
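The aggregate argument can be checked numerically: counting every element copy made by a table that doubles when full shows the total rehash work stays linear in n.

```python
def insert_cost_with_doubling(n):
    """Total element copies for n inserts into a table that doubles when full."""
    capacity, size, copies = 1, 0, 0
    for _ in range(n):
        if size == capacity:
            copies += size        # Rehash: move every existing element
            capacity *= 2
        size += 1
    return copies

total = insert_cost_with_doubling(1_000_000)
```

For n = 8 the rehashes happen at sizes 1, 2, and 4, matching the 1 + 2 + 4 = 7 predicted by the geometric series.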
Birthday Paradox and Collision Probability:
For m buckets and n keys:
P(no collision) ≈ e^(-n(n-1)/(2m))
P(collision) ≈ 1 - e^(-n²/(2m))
For 50% collision probability:
n ≈ 1.177 * √m
Example: 32-bit hash (m = 2^32) has 50% collision with only ~77,163 keys.
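Plugging the numbers from the text into the approximation confirms the 50% figure (using the more precise constant √(2 ln 2) ≈ 1.1774):

```python
import math

def collision_probability(n, m):
    """P(at least one collision) among n uniform keys in m buckets."""
    return 1 - math.exp(-n * (n - 1) / (2 * m))

m = 2 ** 32
n = int(1.1774 * math.sqrt(m))   # ≈ 77,162 keys
p = collision_probability(n, m)  # ≈ 0.5
```

This is why 32-bit hashes are inadequate whenever tens of thousands of distinct keys are expected.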
Conditional Complexity
Input Characteristics Affecting Performance:
-
Key Distribution:
- Uniform random: O(1) average case holds
- Clustered keys: Depends on hash function quality
- Sequential integers with poor hash: O(n) worst case (all collide)
- Adversarial input: O(n) if hash function predictable
-
Load Factor:
α = 0.25: Very fast, memory wasteful
α = 0.50: Fast, balanced
α = 0.75: Standard threshold, good balance
α = 0.90: Slower, memory efficient (chaining only)
α → 1.0: Very slow for open addressing
α > 1.0: Only possible with chaining; lookups degrade toward linear search
-
Hash Function Quality:
- Good hash (uniform distribution): O(1) average
- Poor hash (clustering): O(n) worst case
- Cryptographic hash: O(k) where k = key size, very uniform
-
Table Size:
- Prime size: Better with poor hash functions
- Power of 2: Faster modulo (bitwise AND), requires good hash
- Small table: More collisions, slower
- Large table: Fewer collisions, cache misses
Best Case Performance:
- Empty table or perfect hash function (no collisions)
- All operations: O(1)
- Occurs when: Hash function perfectly distributes keys
Worst Case Performance:
- All keys hash to same bucket
- All operations: O(n)
- Occurs when:
- Adversarial input with predictable hash function
- Poor hash function (e.g., always returns 0)
- Birthday paradox with insufficient table size
Average Case Performance:
- Assumes Simple Uniform Hashing Assumption (SUHA)
- Keys distribute uniformly across buckets
- Load factor α is constant (typically ≤ 0.75)
- All operations: O(1)
Amortized Analysis for Dynamic Resizing:
Using aggregate method:
Operation sequence: n insertions
Regular inserts: n * O(1) = O(n)
Resizes: O(1) + O(2) + O(4) + ... + O(n) = O(2n) = O(n)
Total: O(n) + O(n) = O(n)
Amortized per operation: O(1)
Using accounting method:
- Charge 3 units per insert:
- 1 unit for immediate insert
- 1 unit credit for itself during future resize
- 1 unit credit for one old element during resize
- Always have enough credit for resize
- Amortized: O(1)
Complexity Trade-offs
Time vs Space:
- Lower α (more space): Faster operations, less collisions
- Higher α (less space): Slower operations, more collisions
- Optimal α ≈ 0.75: Empirically best balance
Chaining vs Open Addressing:
| Aspect              | Chaining                | Open Addressing        |
| ------------------- | ----------------------- | ---------------------- |
| Best/avg complexity | O(1)                    | O(1/(1-α))             |
| Space overhead      | Pointers (8 bytes each) | None, but need α < 1   |
| Cache performance   | Poor (pointer chasing)  | Excellent (contiguous) |
| Supports α > 1      | Yes                     | No                     |
Hash Function Complexity:
- Fast hash (multiplication): O(1), possibly poor distribution
- Moderate hash (polynomial): O(k), good distribution
- Cryptographic hash (SHA-256): O(k), excellent distribution, slower
Collision Resolution:
- Linear probing: Fast, clustering issues
- Quadratic probing: Moderate, reduces clustering
- Double hashing: Slower per probe, best distribution
5.2 Space Complexity
Memory Usage
Basic Space Requirements:
Separate Chaining:
Total space = Array + Nodes + Overhead
Array: O(m) pointers (m = table size)
- 8 bytes per pointer on 64-bit system
- Total: 8m bytes
Nodes: O(n) nodes (n = number of elements)
- Each node: key + value + next pointer
- Assuming 8-byte key, 8-byte value, 8-byte pointer
- Total: 24n bytes
Metadata: O(1)
- size, capacity, threshold, hash seed
- ~32 bytes
Total: O(m + n)
In terms of n (with α = n/m):
- m = n/α
- Space = O(n/α + n) = O(n) with constant α
Open Addressing:
Total space = Array + Overhead
Array: O(m) entries
- Each entry: key + value + status (empty/occupied/deleted)
- Assuming 8-byte key, 8-byte value, 1-byte status
- Total: 17m bytes
Must maintain α < 1, typically α ≤ 0.75
So m ≥ n/0.75 = 1.33n
Metadata: O(1)
- size, capacity, threshold
- ~24 bytes
Total: O(m) = O(n) with constant α
Space Comparison:
For n = 1,000,000 elements, α = 0.75:
Chaining:
- Array: 8 * 1,333,333 ≈ 10.7 MB
- Nodes: 24 * 1,000,000 = 24 MB
- Total: ~34.7 MB
Open Addressing:
- Array: 17 * 1,333,333 ≈ 22.7 MB
- Total: ~22.7 MB
Open addressing uses ~35% less space (no pointer overhead).
Memory Layout
Separate Chaining Memory Layout:
Hash Table Object:
├─ table: pointer to array (8 bytes)
├─ size: integer (4 bytes)
├─ capacity: integer (4 bytes)
├─ threshold: float (4 bytes)
├─ hash_seed: integer (4 bytes)
└─ padding (4 bytes) → Total: 28 bytes
Array (heap):
┌────────────────────────────────────┐
│ ptr₀ │ ptr₁ │ ptr₂ │ ... │ ptrₘ₋₁│ (m * 8 bytes)
└──┬───┴──┬───┴──┬───┴─────┴──┬─────┘
│ │ │ │
↓ ↓ ↓ ↓
Node Node Node Node (each 24 bytes)
↓ ↓
Node Node
↓
Node
Memory Fragmentation:
- Chaining: Many small allocations (nodes) → fragmentation
- Open addressing: Few large allocations → less fragmentation
Cache Line Analysis:
Modern CPUs have 64-byte cache lines.
Chaining:
- Each node access: potential cache miss
- Following linked list: poor spatial locality
- Average cache misses per search: α/2 (half the chain)
Open Addressing (Linear Probing):
- Contiguous array: excellent spatial locality
- Probe sequence stays in same cache line (if cluster small)
- Average cache misses per search: ≈1 (if probe sequence < 8 entries)
Memory Overhead Comparison:
| Structure       | Per-Element Overhead    | Total Overhead |
| --------------- | ----------------------- | -------------- |
| Chaining        | 8 bytes (next pointer)  | O(n)           |
| Open Addressing | 1 byte (status) + slack | O(m)           |
| Cuckoo Hashing  | 0 bytes, but 2 tables   | O(2m)          |
Space-Time Trade-offs
Load Factor Trade-off:
# Lower α: More space, faster operations
α = 0.5:
Space: 2n entries
Search time: O(1 + 0.5) = O(1.5)
Cache efficiency: Excellent
# Standard α: Balanced
α = 0.75:
Space: 1.33n entries
Search time: O(1 + 0.75) = O(1.75)
Cache efficiency: Good
# Higher α: Less space, slower operations
α = 1.0:
Space: n entries (chaining only)
Search time: O(1 + 1.0) = O(2)
Cache efficiency: Moderate
Memoization with Hash Tables:
Classic time-space trade-off in dynamic programming:
# Without memoization: Exponential time, O(1) space
def fib(n):
    if n <= 1:
        return n
    return fib(n-1) + fib(n-2)
# Time: O(2^n), Space: O(n) recursion stack
# With hash table memoization: Linear time, linear space
def fib_memo(n, memo={}):
    if n <= 1:
        return n
    if n not in memo:
        memo[n] = fib_memo(n-1, memo) + fib_memo(n-2, memo)
    return memo[n]
# Time: O(n), Space: O(n) memo table + O(n) stack
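Python ships this pattern in the standard library: functools.lru_cache wraps a function with a hash-table-backed memo keyed by the call arguments.

```python
from functools import lru_cache

@lru_cache(maxsize=None)      # Unbounded hash-table cache keyed by arguments
def fib_cached(n):
    if n <= 1:
        return n
    return fib_cached(n - 1) + fib_cached(n - 2)
```

The decorator also avoids the mutable-default-argument idiom and exposes hit/miss statistics via `fib_cached.cache_info()`.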
Auxiliary Space for Operations:
Search:
- Iterative: O(1) auxiliary space
- Recursive (rare): O(log n) stack space
Insert with Resize:
- O(m) temporary space for new array
- Old array freed after rehashing
Iteration:
- O(1) auxiliary space for iterator state
Sorting hash table values:
- Extract to array: O(n) auxiliary space
- Sort: O(log n) stack space (quicksort)
Recursion Depth (where applicable)
Hash tables rarely use recursion, but some scenarios involve it:
Recursive Operations:
-
Cuckoo Hashing Insert:
def insert(key, value, depth=0):
    if depth > max_depth:
        rehash()  # Prevent infinite eviction loop
        return insert(key, value, 0)
    old_key, old_value = table[hash(key)]
    table[hash(key)] = (key, value)
    if old_key is not None:
        insert(old_key, old_value, depth + 1)  # Recursive eviction
- Worst-case depth: O(n) (but extremely rare)
- Average depth: O(1)
- Stack space: O(depth)
-
Java 8 HashMap (Tree Nodes): When chain length > 8, converts to red-black tree:
- Tree operations: O(log n) recursion depth
- Stack space: O(log n)
Stack Space Analysis:
- Iterative implementations (standard): O(1) stack space
- Recursive cuckoo hashing: O(log n) expected, O(n) worst
- Hybrid tree-based buckets: O(log n) stack space
5.3 Other Complexity Measures
Cache Efficiency
Cache Hierarchy:
L1 Cache: 32-64 KB, ~4 cycles latency
L2 Cache: 256-512 KB, ~10 cycles latency
L3 Cache: 8-16 MB, ~40 cycles latency
RAM: GBs, ~200 cycles latency
Chaining Cache Performance:
# Poor cache locality
def search_chaining(table, key):
    index = hash(key) % len(table)
    node = table[index]       # Cache miss 1: Array access
    while node:
        if node.key == key:   # Cache miss 2: Node access
            return node.value
        node = node.next      # Cache miss 3: Next node (pointer chasing)
    return None
Cache misses:
- Array access: 1 miss (if not in cache)
- Each node: 1 miss (nodes scattered in memory)
- Average misses: 1 + α/2 (1 array + half the chain)
Open Addressing Cache Performance:
# Excellent cache locality
def search_linear_probing(table, key):
    index = hash(key) % len(table)
    i = 0
    while i < len(table):
        probe = (index + i) % len(table)  # Wraps around; accesses stay sequential
        if table[probe] is EMPTY:
            return None
        if table[probe].key == key:
            return table[probe].value
        i += 1
    return None
Cache misses:
- Initial access: 1 miss (if not in cache)
- Probe sequence: Same cache line if cluster small
- Average misses: ~1 (entire probe sequence often in 1-2 cache lines)
Cache Line Utilization:
64-byte cache line holds:
- 8 pointers (chaining)
- ~3-4 key-value pairs (open addressing, assuming 16 bytes each)
Linear probing benefits:
- Probing within same cache line: 0 additional misses
- Sequential prefetching by CPU
Swiss Tables (Google's Abseil):
Optimized for cache efficiency:
Group of 16 slots + 16-byte control byte array
Control bytes: Empty, Deleted, or H2 hash (top 7 bits)
Search:
1. Compute H2 (7 bits of hash)
2. SIMD compare: 16 control bytes at once
3. Only check matching slots
Cache efficiency:
- Control bytes: 16 bytes = fits in cache line
- SIMD: 16 comparisons in parallel
- Fewer memory accesses
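A toy, pure-Python rendition of the control-byte idea can make the H2-tag trick concrete. All names here are illustrative, and the loop over 16 control bytes stands in for what real implementations do with a single SIMD comparison; overflow probing to other groups is omitted:

```python
GROUP = 16
EMPTY_CTRL = 0x80            # Sentinel control byte: no tag is >= 0x80

def h1_h2(key, num_groups):
    """Split the hash into a group index (H1) and a 7-bit tag (H2)."""
    h = hash(key) & 0xFFFFFFFFFFFFFFFF
    return (h >> 7) % num_groups, h & 0x7F

def find(groups, ctrls, key):
    """Probe one group: tag-match on control bytes, then verify keys."""
    g, tag = h1_h2(key, len(groups))
    for i in range(GROUP):               # SIMD would test all 16 bytes at once
        if ctrls[g][i] == tag and groups[g][i] is not None:
            k, v = groups[g][i]
            if k == key:                 # Tag match is only a filter; confirm key
                return v
    return None

# Build a single group and insert one key into its slot
groups = [[None] * GROUP]
ctrls = [[EMPTY_CTRL] * GROUP]
g, tag = h1_h2("hello", 1)
groups[g][0] = ("hello", 42)
ctrls[g][0] = tag
```

The payoff is that most non-matching slots are rejected by a one-byte compare without ever touching the key-value array.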
Practical Performance
Constant Factors:
Big-O hides constant factors that matter in practice:
Hash Function Overhead:
# Simple hash: Fast but poor distribution
def simple_hash(key):
    return key  # O(1), constant = 1

# Polynomial hash: Moderate speed, good distribution
def poly_hash(s):
    h = 0
    for c in s:
        h = (h * 31 + ord(c)) % (2**32)
    return h  # O(k), constant = ~5 operations per char

# Cryptographic hash: Slow, excellent distribution
import hashlib

def crypto_hash(key):
    return hashlib.sha256(key.encode()).digest()  # O(k), constant = ~50 ops/char
Collision Resolution Overhead:
Linear probing:
- Each probe: 1 comparison + 1 array access
- Constant: ~2 operations per probe
Chaining:
- Each node: 1 comparison + 1 pointer dereference
- Constant: ~3 operations per node (worse due to cache misses)
Small Input Performance:
For n < 100:
- Hash table overhead: Hash computation + collision handling
- Linear search: Simple, cache-friendly
Breakeven point (empirical):
- n ≈ 10-20: Hash table starts to win
- n < 10: Linear search often faster
Real-World Benchmarks:
Python dict (open addressing):
n = 1,000: Insert 0.05 µs, Search 0.03 µs
n = 1,000,000: Insert 0.06 µs, Search 0.04 µs
n = 100,000,000: Insert 0.08 µs, Search 0.06 µs
Nearly constant time, slight degradation due to cache misses
Java HashMap (chaining → trees):
n = 1,000: Insert 0.08 µs, Search 0.05 µs
n = 1,000,000: Insert 0.10 µs, Search 0.07 µs
Slightly slower than Python due to pointer overhead
C++ (chaining vs Swiss Tables):
std::unordered_map (chaining):
n = 1,000,000: Insert 0.12 µs, Search 0.09 µs
absl::flat_hash_map (Swiss Tables, open addressing):
n = 1,000,000: Insert 0.06 µs, Search 0.04 µs
~2x speedup from cache-friendly design
Practical Bottlenecks:
-
Hash Function:
- Slow hash → dominates for small keys
- Optimize: Use fast hash for performance-critical code
-
Memory Allocation (Chaining):
- malloc/free overhead for each node
- Optimize: Use memory pool, allocate nodes in batches
-
Cache Misses (Chaining):
- Pointer chasing → random memory access
- Optimize: Use open addressing or array-based buckets
-
Resize Operation:
- Occasional O(n) spike
- Optimize: Incremental resizing, pre-allocate if size known
-
Load Factor:
- Too high → many collisions
- Too low → memory waste, cache misses
- Optimize: Tune based on profiling (typically 0.75)
When Theory Meets Practice:
Theory predicts: O(1) average case Practice shows:
- Constant factor varies 10x based on implementation
- Cache effects dominate for small-medium datasets
- Hash function quality matters more than collision resolution
- Open addressing outperforms chaining in modern systems
Practical Complexity Guidelines:
Tiny (n < 20): Linear search may be faster
Small (n < 1,000): Hash table good, simple implementation fine
Medium (n < 1,000,000): Hash table excellent, optimize hash function
Large (n < 1,000,000,000): Hash table great, optimize for cache
Massive (n > 1 billion): Distributed hash table or approximate structures
6. Optimization and Trade-offs
6.1 Performance Optimization
Optimization Techniques
Hash Function Optimization:
- Fast Integer Hashing:
# Slow: Modulo operation
def hash_slow(key, size):
    return key % size

# Fast: Bitwise AND (requires size = power of 2)
def hash_fast(key, size):
    return key & (size - 1)  # Equivalent to key % size when size = 2^k

# Example: size = 16     (binary: 10000)
#          size - 1 = 15 (binary: 01111)
#          key = 42      (binary: 101010)
#          42 & 15 = 10  (binary: 001010)
# Speed: ~5x faster on modern CPUs
- Multiplication Method (Faster than Division):
# Knuth's multiplicative hashing
def hash_multiplicative(key, size):
    # Golden ratio: (√5 - 1) / 2 ≈ 0.6180339887
    A = 0.6180339887498948
    # Or use the integer version: multiply by 2^32 * A
    A_int = 2654435769  # ≈ 2^32 * golden ratio
    return ((key * A_int) >> (32 - size.bit_length())) & (size - 1)
# Fast: One multiplication, one shift, one AND
- String Hashing Optimization:
# Standard polynomial hash (slow for long strings)
def hash_poly_slow(s):
    h = 0
    for c in s:
        h = (h * 31 + ord(c)) % (2**32)
    return h

# Optimized: Process multiple characters at once
def hash_poly_fast(s):
    h = 0
    # Process 4 characters at a time (32-bit chunks)
    for i in range(0, len(s), 4):
        chunk = 0
        for j in range(4):
            if i + j < len(s):
                chunk = (chunk << 8) | ord(s[i + j])
        h = h * 31 + chunk
    return h & 0xFFFFFFFF
# Speed: ~4x faster for long strings
- Cache Hash Values:
# For immutable keys, cache hash value
class CachedHashKey:
    def __init__(self, key):
        self.key = key
        self._hash = None  # Lazily computed

    def __hash__(self):
        if self._hash is None:
            self._hash = hash(self.key)
        return self._hash

    def __eq__(self, other):
        return self.key == other.key
# Python strings do this automatically
# Java String.hashCode() is cached after first call
Collision Resolution Optimization:
- Robin Hood Hashing (Variance Reduction):
def robin_hood_insert(table, key, value):
    """
    Insert with Robin Hood hashing: "rich" entries (close to their ideal
    slot) give way to "poor" ones (far from theirs).
    Distance from ideal position = probe distance.
    """
    index = hash(key) % len(table)
    distance = 0
    while table[index] is not EMPTY:
        existing_key, existing_value, existing_dist = table[index]
        if existing_key == key:
            table[index] = (key, value, distance)  # Update
            return
        # Robin Hood: if the resident entry is richer (smaller distance),
        # park the incoming entry here and carry the evicted entry forward
        if distance > existing_dist:
            table[index] = (key, value, distance)
            key, value, distance = existing_key, existing_value, existing_dist
        index = (index + 1) % len(table)
        distance += 1
    table[index] = (key, value, distance)
# Benefit: Reduces variance in probe lengths
# Average probe length: Same as linear probing
# Worst-case probe length: Much better (more predictable)
- Quadratic Probing (Reduce Clustering):
def quadratic_probe(table, key, value):
"""
Probe sequence: h(k), h(k)+1^2, h(k)+2^2, h(k)+3^2, ...
Reduces primary clustering of linear probing.
"""
index = hash(key) % len(table)
i = 0
while table[index] is not EMPTY:
if table[index].key == key:
table[index].value = value # Update
return
i += 1
index = (hash(key) + i * i) % len(table)
table[index] = (key, value)
# Benefit: Reduces primary clustering compared to linear probing
# Caveat: May miss empty slots; full coverage is guaranteed only with a
# prime table size and load factor < 0.5 (or triangular-number probing
# with a power-of-2 size)
- Double Hashing (Best Probe Distribution):
def double_hash_probe(table, key, value):
"""
Use second hash function for step size.
Probe sequence: h1(k), h1(k)+h2(k), h1(k)+2*h2(k), ...
"""
h1 = hash(key) % len(table)
h2 = 1 + (hash(key) // len(table)) % (len(table) - 1) # Step size
# h2 must be coprime to table size (prime table size ensures this)
index = h1
i = 0
while table[index] is not EMPTY:
if table[index].key == key:
table[index].value = value
return
i += 1
index = (h1 + i * h2) % len(table)
table[index] = (key, value)
# Benefit: Best distribution, minimal clustering
# Cost: Two hash computations instead of one
Resize Optimization:
- Incremental Resizing:
class IncrementalHashTable:
"""
Spread resize cost across multiple operations.
Avoid O(n) spike.
"""
def __init__(self):
    self.old_table = [EMPTY] * 8  # Active table in steady state
    self.new_table = None  # Allocated when a resize starts
    self.resize_index = 0
    self.resize_in_progress = False
def insert(self, key, value):
# Do a bit of incremental resizing
if self.resize_in_progress:
self._rehash_some_entries()
# Normal insert
if self.should_resize():
self._start_resize()
# Insert into new table if resizing
table = self.new_table if self.resize_in_progress else self.old_table
self._insert_into(table, key, value)
def _rehash_some_entries(self):
"""Rehash a small batch of entries from old to new table."""
batch_size = 10 # Amortize cost
for _ in range(batch_size):
if self.resize_index < len(self.old_table):
entry = self.old_table[self.resize_index]
if entry is not EMPTY:
self._insert_into(self.new_table, entry.key, entry.value)
self.resize_index += 1
else:
# Resize complete
self.old_table = self.new_table
self.resize_in_progress = False
break
# Benefit: No single slow operation
# Cost: More complex implementation, dual-table lookup during resize
- Pre-allocate if Size Known:
# Slow: Start small, resize multiple times
hash_map = {}
for i in range(1000000):
hash_map[i] = i
# Grows and rehashes several times along the way (exact thresholds are implementation-specific)
# Fast: Pre-allocate to avoid resizes
hash_map = dict.fromkeys(range(1000000)) # Python
# C++ equivalent: std::unordered_map plus reserve(1000000); Python dicts have no reserve()
for i in range(1000000):
hash_map[i] = i
# No resizes: ~2-3x faster
- Lazy Deletion (Avoid Rehashing):
class LazyDeleteHashTable:
"""
Mark deleted slots, rebuild only when too many tombstones.
"""
def __init__(self):
self.table = []
self.size = 0
self.deleted_count = 0
def delete(self, key):
index = self._find(key)
if index is not None and self.table[index] is not DELETED:
self.table[index] = DELETED
self.size -= 1
self.deleted_count += 1
# Rebuild if too many tombstones
if self.deleted_count > len(self.table) // 4:
self._rebuild()
def _rebuild(self):
"""Remove tombstones without increasing table size."""
old_table = self.table
self.table = [EMPTY] * len(old_table)
self.deleted_count = 0
for entry in old_table:
if entry is not EMPTY and entry is not DELETED:
self._insert(entry.key, entry.value)
# Benefit: Fast deletion, infrequent rebuilds
Lazy Evaluation and Caching:
- Lazy Bucket Initialization:
class LazyHashTable:
"""
Don't allocate bucket arrays until needed.
Saves memory for sparse hash tables.
"""
def __init__(self, size):
self.buckets = [None] * size # Array of None, not empty lists
def insert(self, key, value):
index = hash(key) % len(self.buckets)
# Lazy init: Create bucket list only when first item inserted
if self.buckets[index] is None:
self.buckets[index] = []
self.buckets[index].append((key, value))
# Benefit: Save memory for sparse tables
# Use case: Large table with few entries
- Memoization Pattern:
from functools import lru_cache
# Automatic memoization with hash table
@lru_cache(maxsize=None) # Unbounded cache
def expensive_function(x):
# Complex computation
return sum(i**2 for i in range(x))
# Or manual:
def memoize(func):
cache = {}
def wrapper(*args):
if args not in cache:
cache[args] = func(*args)
return cache[args]
return wrapper
@memoize
def fibonacci(n):
if n <= 1:
return n
return fibonacci(n-1) + fibonacci(n-2)
# Time: O(2^n) → O(n) with memoization
Bottleneck Analysis
Identifying Bottlenecks:
- Hash Function Profiling:
import hashlib
import time
def profile_hash_function(hash_func, keys):
start = time.perf_counter()
for key in keys:
_ = hash_func(key)
end = time.perf_counter()
return end - start
# Test different hash functions
keys = [str(i) for i in range(1000000)]
print("Simple sum:", profile_hash_function(lambda s: sum(ord(c) for c in s), keys))
print("Polynomial:", profile_hash_function(lambda s: hash(s), keys))
print("SHA-256:", profile_hash_function(lambda s: hashlib.sha256(s.encode()).hexdigest(), keys))
# Results (example):
# Simple sum: 0.15s
# Polynomial: 0.08s (built-in, optimized)
# SHA-256: 2.5s (cryptographic, slow)
- Collision Rate Analysis:
def analyze_collisions(hash_table):
"""
Measure collision rate and chain length distribution.
"""
chain_lengths = []
for bucket in hash_table.buckets:
if bucket is not None:
chain_lengths.append(len(bucket))
else:
chain_lengths.append(0)
total_collisions = sum(max(0, length - 1) for length in chain_lengths)
avg_chain = sum(chain_lengths) / len(chain_lengths)
max_chain = max(chain_lengths)
print(f"Average chain length: {avg_chain:.2f}")
print(f"Max chain length: {max_chain}")
print(f"Total collisions: {total_collisions}")
print(f"Load factor: {avg_chain:.2f} (equals average chain length under chaining)")
# Histogram
from collections import Counter
histogram = Counter(chain_lengths)
print("Chain length distribution:")
for length, count in sorted(histogram.items()):
print(f" Length {length}: {count} buckets")
# Ideal: Most buckets have 0-1 elements
# Problem: Many long chains indicate poor hash function or high load factor
- Memory Access Patterns:
def measure_cache_misses(operation, iterations=1000000):
"""
Estimate cache efficiency by timing.
More cache misses = slower per operation.
"""
# Warm up
for _ in range(1000):
operation()
# Measure
start = time.perf_counter()
for _ in range(iterations):
operation()
end = time.perf_counter()
return (end - start) / iterations * 1e9 # nanoseconds per op
# Compare chaining vs open addressing
chaining_time = measure_cache_misses(lambda: chaining_table[random_key()])
open_addr_time = measure_cache_misses(lambda: open_table[random_key()])
print(f"Chaining: {chaining_time:.1f} ns/op")
print(f"Open addressing: {open_addr_time:.1f} ns/op")
# Expected: Open addressing ~2-3x faster due to better cache locality
Common Bottlenecks:
-
Slow Hash Function:
- Symptom: Insertion/search time dominated by hash computation
- Detection: Profile shows >50% time in hash function
- Solution: Use faster hash (multiplication method, cached hash)
-
High Collision Rate:
- Symptom: Long chains or probe sequences
- Detection: Average chain length > 2, max chain length > 10
- Solution: Improve hash function, reduce load factor, resize table
-
Frequent Resizing:
- Symptom: Periodic slow insertions
- Detection: Profiler shows time in resize/rehash
- Solution: Pre-allocate if size known, incremental resizing
-
Cache Misses (Chaining):
- Symptom: Slower than expected for low load factor
- Detection: Performance doesn't match O(1) prediction
- Solution: Switch to open addressing, use memory pool for nodes
-
Poor Load Factor:
- Too high: Many collisions
- Too low: Memory waste, cache misses
- Solution: Tune threshold (typically 0.75)
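The detection thresholds above can be bundled into one quick diagnostic pass over a chaining table (a sketch; the bucket layout — a list of buckets, each a list of key-value pairs — and the `diagnose` name are illustrative, not from any library):

```python
def diagnose(buckets):
    """Apply the rule-of-thumb thresholds above to a chaining table.

    `buckets` is a list of buckets, each a list of (key, value) pairs.
    Returns a list of warnings; an empty list means the table looks healthy.
    """
    lengths = [len(b) for b in buckets]
    n = sum(lengths)
    load_factor = n / len(buckets)
    nonempty = sum(1 for length in lengths if length > 0)
    avg_chain = n / nonempty if nonempty else 0.0
    warnings = []
    if max(lengths, default=0) > 10:
        warnings.append("max chain > 10: suspect the hash function or adversarial keys")
    if avg_chain > 2:
        warnings.append("avg chain > 2: improve the hash function or grow the table")
    if load_factor > 0.75:
        warnings.append(f"load factor {load_factor:.2f} > 0.75: resize the table")
    return warnings
```

Run it periodically in tests or behind a debug flag; an empty result means none of the bottleneck signatures above are present.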
Ideal Conditions
Optimal Configuration:
-
Load Factor:
- Chaining: α = 0.75 (standard)
- Open addressing: α = 0.5-0.7 (lower for better performance)
- Hash-flooding (DoS) attacks a concern: α = 0.5 (limits chain growth under attack)
-
Initial Capacity:
- Known size: Set to expected_size / load_factor
- Unknown size: Start with 16-32 (small overhead, quick to grow)
- Large datasets: Start with 1024+ (avoid early resizes)
-
Hash Function:
- Integers: Multiplication method or identity (if well-distributed)
- Strings: Polynomial hash (built-in) or MurmurHash
- Security-critical: SipHash or Blake2
-
Collision Resolution:
- General purpose: Chaining (simple, robust)
- Performance-critical: Open addressing with linear probing (cache-friendly)
- Predictability-critical: Robin Hood hashing or cuckoo hashing
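The load-factor and initial-capacity guidelines above combine into a small sizing helper: round the needed capacity up to a power of two so the bitwise-AND modulo trick applies (a sketch; `initial_capacity` is an illustrative name, not a library function):

```python
import math

def initial_capacity(expected_size, load_factor=0.75):
    """Smallest power-of-2 capacity that keeps the table at or below
    `load_factor` after `expected_size` insertions."""
    needed = math.ceil(expected_size / load_factor)
    if needed <= 1:
        return 1
    return 1 << (needed - 1).bit_length()

initial_capacity(12)         # 16
initial_capacity(1000)       # 2048
initial_capacity(1000, 0.5)  # 2048
```

Passing the result as the table's initial capacity avoids every intermediate resize during the initial bulk load.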
Best Practices:
class OptimizedHashTable:
"""
Production-ready hash table with best practices.
"""
def __init__(self, initial_capacity=16, load_factor=0.75):
# Use power of 2 for fast modulo (bitwise AND)
self.capacity = self._next_power_of_2(initial_capacity)
self.size = 0
self.load_factor = load_factor
self.threshold = int(self.capacity * load_factor)
# Randomized hash seed to prevent DoS attacks
import random
self.hash_seed = random.getrandbits(32)
# Initialize table
self.table = [None] * self.capacity
def _next_power_of_2(self, n):
"""Round up to next power of 2."""
return 1 << (n - 1).bit_length()
def _hash(self, key):
"""
High-quality hash function.
"""
# Use built-in hash + randomized seed
h = hash(key) ^ self.hash_seed
# Additional mixing (optional, for extra quality)
h = ((h >> 16) ^ h) * 0x45d9f3b
h = ((h >> 16) ^ h) * 0x45d9f3b
h = (h >> 16) ^ h
return h & (self.capacity - 1) # Fast modulo
def put(self, key, value):
"""Insert with automatic resizing."""
if self.size >= self.threshold:
self._resize()
# ... insert logic ...
def _resize(self):
"""Double capacity and rehash."""
old_table = self.table
self.capacity *= 2
self.threshold = int(self.capacity * self.load_factor)
self.table = [None] * self.capacity
self.size = 0
# Rehash all entries
for entry in old_table:
if entry is not None:
self.put(entry.key, entry.value)
# Usage (assuming the expected number of entries is known up front):
hash_table = OptimizedHashTable(
    initial_capacity=int(expected_size / 0.75),  # Avoid resizes
    load_factor=0.75  # Standard trade-off
)
Performance Tuning Guide:
| Scenario | Recommended Settings |
| -------------------- | -------------------------------------------------- |
| General purpose | α=0.75, chaining, polynomial hash |
| Performance-critical | α=0.5, open addressing (linear probing), fast hash |
| Memory-constrained | α=1.0, chaining, lazy deletion |
| Predictable latency | α=0.5, Robin Hood or cuckoo hashing |
| Security-critical | α=0.5, SipHash, randomized seed |
| Known size | Pre-allocate, α=0.75 |
| Unknown size | Start small (16), α=0.75 |
6.2 Design Trade-offs
Fundamental Trade-offs
Time vs Space:
# Space-efficient but slower: High load factor
class DenseHashTable:
def __init__(self):
self.load_factor = 0.95 # Pack entries densely
# Pros: Minimal memory overhead
# Cons: More collisions, slower operations
# Use when: Memory is limited
# Fast but space-intensive: Low load factor
class SparseHashTable:
def __init__(self):
self.load_factor = 0.5 # Use more space
# Pros: Fewer collisions, faster operations
# Cons: ~2x memory usage
# Use when: Speed is critical, memory available
Simplicity vs Performance:
# Simple: Chaining with list
class SimpleHashTable:
def __init__(self):
self.buckets = [[] for _ in range(capacity)]
def put(self, key, value):
bucket = self.buckets[hash(key) % len(self.buckets)]
for i, (k, v) in enumerate(bucket):
if k == key:
bucket[i] = (key, value)
return
bucket.append((key, value))
# Pros: Easy to implement, easy to debug
# Cons: Pointer overhead, cache misses
# Complex but fast: Open addressing with Robin Hood hashing
class ComplexHashTable:
def __init__(self):
self.table = [EMPTY] * capacity
self.distances = [0] * capacity
def put(self, key, value):
# Robin Hood logic: swap if current entry is "richer"
# ... complex insertion logic with distance tracking ...
pass
# Pros: Better performance, cache-friendly
# Cons: Complex implementation, harder to debug
Determinism vs Security:
# Deterministic: Predictable, vulnerable to attacks
class DeterministicHashTable:
def __init__(self):
self.hash_seed = 0 # Fixed seed
def hash(self, key):
return hash(key) # Same output every run
# Pros: Reproducible behavior, easier debugging
# Cons: Vulnerable to hash collision DoS attacks
# Randomized: Secure, non-deterministic
class RandomizedHashTable:
def __init__(self):
import random
self.hash_seed = random.getrandbits(64) # Random per instance
def hash(self, key):
return hash((key, self.hash_seed)) # Different output each run
# Pros: Resistant to DoS attacks
# Cons: Non-deterministic iteration order, harder to debug
Worst-case vs Average-case:
# Average-case optimized: Standard hash table
class StandardHashTable:
# O(1) average case, O(n) worst case
# Most operations fast, rare slow operations acceptable
pass
# Worst-case optimized: Cuckoo hashing
class CuckooHashTable:
# O(1) worst-case lookup, O(1) amortized insert
# Guaranteed fast lookups, insertions occasionally expensive
def __init__(self):
self.table1 = [None] * capacity
self.table2 = [None] * capacity
def get(self, key):
# Check both tables: guaranteed O(1)
h1 = hash1(key) % len(self.table1)
if self.table1[h1] and self.table1[h1][0] == key:
return self.table1[h1][1]
h2 = hash2(key) % len(self.table2)
if self.table2[h2] and self.table2[h2][0] == key:
return self.table2[h2][1]
return None # Only 2 lookups, guaranteed
# Use when: Predictable latency critical (real-time systems)
Alternative Approaches
Comparison Table:
| Approach | Insert | Search | Delete | Space | Cache | Use Case |
| ---------------------------- | ------------ | ------------ | ------------ | ------- | --------- | -------------------------- |
| Hash Table (Chaining) | O(1) avg | O(1) avg | O(1) avg | O(n) | Poor | General purpose |
| Hash Table (Open Addr) | O(1) avg | O(1) avg | O(1) avg | O(n) | Good | Performance-critical |
| Binary Search Tree | O(log n) | O(log n) | O(log n) | O(n) | Poor | Ordered iteration |
| Balanced BST (Red-Black) | O(log n) | O(log n) | O(log n) | O(n) | Poor | Ordered + balanced |
| Sorted Array | O(n) | O(log n) | O(n) | O(n) | Excellent | Static data, range queries |
| Unsorted Array | O(1) | O(n) | O(n) | O(n) | Excellent | Small datasets |
| Trie | O(k) | O(k) | O(k) | O(n*k) | Good | Prefix searches |
| Skip List | O(log n) avg | O(log n) avg | O(log n) avg | O(n) | Good | Concurrent access |
| Bloom Filter | O(k) | O(k) | No | O(m) | Good | Approximate membership |
Choosing the Right Structure:
def choose_data_structure(requirements):
"""
Decision tree for selecting data structure.
"""
if requirements.need_fast_lookup:
if requirements.need_ordered:
return "Balanced BST (TreeMap)"
elif requirements.need_range_queries:
return "Balanced BST or Segment Tree"
elif requirements.need_prefix_search:
return "Trie"
else:
if requirements.memory_limited:
return "Bloom Filter (approximate)"
else:
return "Hash Table" # ← Most common choice
elif requirements.need_fast_insertion:
if requirements.data_static:
return "Sorted Array + Binary Search"
else:
return "Hash Table"
elif requirements.need_ordered_iteration:
return "Balanced BST (Red-Black, AVL)"
else:
if requirements.dataset_small:
return "Unsorted Array"
else:
return "Hash Table (default choice)"
When to Prefer Alternatives:
-
Use Binary Search Tree when:
- Need ordered iteration
- Need range queries (find all keys between X and Y)
- Need min/max operations
- Example: Event scheduling, interval queries
-
Use Sorted Array when:
- Data is static or rarely changes
- Need fast lookups (binary search)
- Memory is tight (no overhead)
- Example: Dictionary lookup, spell checker
-
Use Trie when:
- Keys are strings
- Need prefix searches
- Need autocomplete functionality
- Example: Autocomplete, IP routing, dictionary
-
Use Bloom Filter when:
- Need membership testing
- Can tolerate false positives
- Memory is very limited
- Example: Web cache, spell checker, malware detection
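For the static-data case above, the sorted-array alternative needs only the standard library (a sketch; the sample words are made up):

```python
import bisect

def contains(sorted_keys, key):
    """O(log n) membership test on a sorted list: no per-entry pointer
    overhead, at the cost of O(n) insertion for dynamic data."""
    i = bisect.bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key

words = sorted(["banana", "apple", "cherry"])
contains(words, "apple")   # True
contains(words, "durian")  # False
```

Because the list stores only the keys themselves, memory overhead is minimal, which is exactly the trade-off that favors sorted arrays for rarely changing data.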
Balancing Competing Factors
Fast Reads vs Fast Writes:
# Optimize for reads: Lower load factor
class ReadOptimizedHashTable:
def __init__(self):
self.load_factor = 0.5 # Fewer collisions
# Reads: Very fast (avg 1.5 probes)
# Writes: Slower (more frequent resizes)
# Memory: 2x overhead
# Use when: Read-heavy workload (90% reads, 10% writes)
# Optimize for writes: Higher load factor
class WriteOptimizedHashTable:
def __init__(self):
self.load_factor = 0.9 # Rare resizes
# Reads: Slower (avg 10 probes for open addressing)
# Writes: Faster (infrequent resizes)
# Memory: 1.1x overhead
# Use when: Write-heavy workload or memory-constrained
Consistency vs Performance:
# Strong consistency: Lock entire table
class ConsistentHashTable:
def __init__(self):
self.lock = threading.Lock()
def put(self, key, value):
with self.lock: # Exclusive access
# ... insert logic ...
pass
# Pros: Simple, correct
# Cons: No concurrent writes, single point of contention
# Weak consistency: Fine-grained locking
class ConcurrentHashTable:
def __init__(self):
self.segment_locks = [threading.Lock() for _ in range(16)]
def put(self, key, value):
segment = hash(key) % len(self.segment_locks)
with self.segment_locks[segment]: # Lock only this segment
# ... insert logic ...
pass
# Pros: Concurrent writes to different segments
# Cons: More complex, potential for stale reads
Heuristics for Tuning:
class AdaptiveHashTable:
"""
Hash table that tunes itself based on workload.
"""
def __init__(self):
self.load_factor = 0.75
self.resize_count = 0
self.collision_count = 0
self.operation_count = 0
def put(self, key, value):
self.operation_count += 1
# Track metrics
if self._collision_occurred():
self.collision_count += 1
# Adapt load factor based on collision rate
if self.operation_count % 10000 == 0:
self._adjust_load_factor()
# ... normal insert ...
def _adjust_load_factor(self):
collision_rate = self.collision_count / self.operation_count
if collision_rate > 0.3: # Too many collisions
self.load_factor = max(0.5, self.load_factor - 0.1)
print(f"Reducing load factor to {self.load_factor}")
elif collision_rate < 0.1: # Too sparse
self.load_factor = min(0.9, self.load_factor + 0.1)
print(f"Increasing load factor to {self.load_factor}")
# Use when: Workload characteristics unknown or changing
6.3 Advanced Optimizations
Amortization
Amortized Analysis for Resizing:
class AmortizedHashTable:
"""
Demonstrate amortized O(1) insertion with resizing.
"""
def __init__(self):
self.capacity = 8
self.size = 0
self.table = [None] * self.capacity
self.resize_costs = [] # Track resize costs
def put(self, key, value):
# Resize when 75% full (at size = 6, 12, 24, 48, ...)
if self.size >= self.capacity * 0.75:
    old_capacity = self.capacity
    self._resize()
    self.resize_costs.append(old_capacity)  # Track resize cost
# Normal insert: O(1) (collision handling omitted to keep the demo short)
index = hash(key) % self.capacity
self.table[index] = (key, value)
self.size += 1
def _resize(self):
"""Resize and rehash all elements."""
old_table = self.table
self.capacity *= 2
self.table = [None] * self.capacity
# Rehash: O(n) for this operation
for entry in old_table:
if entry is not None:
index = hash(entry[0]) % self.capacity
self.table[index] = entry
def analyze_amortized_cost(self):
"""
Analyze amortized cost.
For n insertions:
- Regular inserts: n * O(1) = O(n)
- Resizes at sizes: 8, 16, 32, ..., n
- Resize costs: 8 + 16 + 32 + ... + n < 2n = O(n)
- Total: O(n) + O(n) = O(2n) = O(n)
- Amortized per insert: O(n) / n = O(1)
"""
total_resize_cost = sum(self.resize_costs)
total_cost = self.size + total_resize_cost
amortized = total_cost / self.size if self.size > 0 else 0
print(f"Total insertions: {self.size}")
print(f"Total resize cost: {total_resize_cost}")
print(f"Total cost: {total_cost}")
print(f"Amortized cost per insert: {amortized:.2f}")
# Output: Amortized cost ≈ 2 (constant!)
# Demonstration
ht = AmortizedHashTable()
for i in range(1000):
ht.put(i, i)
ht.analyze_amortized_cost()
# Shows amortized O(1) despite occasional O(n) resize
Accounting Method:
class AmortizedHashTableWithCredits:
"""
Using accounting method: charge extra "credits" per operation.
"""
def __init__(self):
self.capacity = 8
self.size = 0
self.table = [None] * self.capacity
self.credits = 0
def put(self, key, value):
# Charge 3 units per insert (amortized cost)
cost = 3
# 1 unit: Actual insert work
self._insert(key, value)
actual_cost = 1
# 2 units: Credits for future resize
# - 1 credit for moving this element during resize
# - 1 credit for moving one old element
self.credits += (cost - actual_cost)
# When resizing, use accumulated credits
if self.size >= self.capacity * 0.75:
resize_cost = self.size # Cost to rehash all elements
assert self.credits >= resize_cost, "Not enough credits!"
self.credits -= resize_cost
self._resize()
return cost # Return amortized cost (always 3)
def _insert(self, key, value):
index = hash(key) % self.capacity
self.table[index] = (key, value)
self.size += 1
def _resize(self):
old_table = self.table
self.capacity *= 2
self.table = [None] * self.capacity
for entry in old_table:
if entry is not None:
index = hash(entry[0]) % self.capacity
self.table[index] = entry
# Key insight: By charging 3 units per insert instead of 1,
# we always have enough credits saved to pay for resize operations.
# Amortized cost: O(1) per insert.
Potential Method:
def potential_function(hash_table):
"""
Potential Φ(table) = 2 * size - capacity
Intuition: Potential increases as table fills up,
releasing "stored energy" during resize.
"""
return 2 * hash_table.size - hash_table.capacity
def amortized_cost_insert(hash_table):
"""
Amortized cost = Actual cost + ΔΦ
"""
old_potential = potential_function(hash_table)
# Actual insert cost
if hash_table.size < hash_table.capacity * 0.75:
actual_cost = 1 # Just insert
hash_table.size += 1
else:
# Resize + insert
actual_cost = hash_table.size + 1 # Rehash all + insert
hash_table.capacity *= 2
hash_table.size += 1
new_potential = potential_function(hash_table)
delta_potential = new_potential - old_potential
amortized_cost = actual_cost + delta_potential
return amortized_cost
# Analysis:
# Regular insert:
# Actual = 1
# ΔΦ = (2(n+1) - m) - (2n - m) = 2
# Amortized = 1 + 2 = 3
#
# Resize insert (when n = m*0.75):
# Actual = n + 1
# ΔΦ = (2(n+1) - 2m) - (2n - m) = 2 - m = 2 - (n/0.75) ≈ -n/0.75
# Amortized = n + 1 - n/0.75 ≈ n - 1.33n = -0.33n (negative!)
# But potential released pays for expensive operation
#
# Result: Amortized O(1) per insert
Parallelization
Concurrent Hash Table Design:
import threading
from concurrent.futures import ThreadPoolExecutor
class ConcurrentHashTable:
"""
Thread-safe hash table with fine-grained locking.
"""
def __init__(self, num_segments=16):
self.num_segments = num_segments
self.segments = [[] for _ in range(num_segments)]
self.locks = [threading.Lock() for _ in range(num_segments)]
def _get_segment(self, key):
"""Map key to segment."""
return hash(key) % self.num_segments
def put(self, key, value):
"""Thread-safe insert."""
segment = self._get_segment(key)
with self.locks[segment]:
# Only lock this segment, others can be accessed concurrently
for i, (k, v) in enumerate(self.segments[segment]):
if k == key:
self.segments[segment][i] = (key, value)
return
self.segments[segment].append((key, value))
def get(self, key):
"""Thread-safe read."""
segment = self._get_segment(key)
with self.locks[segment]:
for k, v in self.segments[segment]:
if k == key:
return v
return None
def parallel_build(self, items):
"""Build hash table in parallel."""
with ThreadPoolExecutor(max_workers=self.num_segments) as executor:
# Partition items by segment
segment_items = [[] for _ in range(self.num_segments)]
for key, value in items:
segment = self._get_segment(key)
segment_items[segment].append((key, value))
# Insert each segment in parallel (no lock contention)
def insert_segment(segment_id):
for key, value in segment_items[segment_id]:
self.put(key, value)
executor.map(insert_segment, range(self.num_segments))
# Scalability: with S segments and T threads (T ≤ S):
# - Threads touching different segments never contend for a lock
# - Best case: T = S, no lock contention, ~T times faster
Lock-Free Hash Table (Advanced):
import threading
class LockFreeHashTable:
"""
Lock-free hash table using Compare-And-Swap (CAS).
Note: Python doesn't have true CAS; this is conceptual.
Real implementation would use C/C++ with atomic operations.
"""
def __init__(self, capacity):
# Each entry: (key, value, tombstone_flag)
self.table = [(None, None, False) for _ in range(capacity)]
self.capacity = capacity
def put(self, key, value):
"""
Lock-free insert using CAS.
"""
index = hash(key) % self.capacity
while True:
# Read current value
current = self.table[index]
current_key, current_value, is_deleted = current
# Case 1: Empty slot or deleted
if current_key is None or is_deleted:
# Try to CAS: if still empty, insert
new_entry = (key, value, False)
if self._cas(index, current, new_entry):
return # Success
# CAS failed, retry
# Case 2: Key already exists
elif current_key == key:
# Try to CAS: update value
new_entry = (key, value, False)
if self._cas(index, current, new_entry):
return # Success
# CAS failed, retry
# Case 3: Collision
else:
index = (index + 1) % self.capacity # Linear probe
def _cas(self, index, expected, new_value):
    """
    Compare-And-Swap operation (atomic).
    if table[index] == expected:
        table[index] = new_value
        return True
    else:
        return False
    """
    # In a real implementation this would be a single atomic CPU instruction.
    # In Python we simulate it with one lock shared by the whole table: a
    # fresh threading.Lock() created per call would guard nothing, because
    # every caller would lock a different object.
    if not hasattr(self, "_cas_lock"):
        self._cas_lock = threading.Lock()  # Real code would create this in __init__
    with self._cas_lock:
        if self.table[index] == expected:
            self.table[index] = new_value
            return True
        return False
# Benefits of lock-free:
# - No deadlock possible
# - Always makes progress (some thread succeeds)
# - Better worst-case latency (no lock waiting)
#
# Drawbacks:
# - Complex to implement correctly
# - May have higher contention under high load
# - Requires hardware support (CAS instruction)
# Real-world examples:
# - Java's ConcurrentHashMap (CAS-based bins since Java 8; the class itself dates to Java 5)
# - C++ folly::AtomicHashMap
# - Rust's dashmap
Embarrassingly Parallel Operations:
def parallel_hash_operations(keys, values):
"""
Some hash table operations are embarrassingly parallel.
"""
# 1. Parallel building (no dependencies)
def parallel_build(items):
hash_tables = [HashTable() for _ in range(num_threads)]
with ThreadPoolExecutor(max_workers=num_threads) as executor:
# Each thread builds its own partition
def build_partition(partition_id):
for key, value in items[partition_id::num_threads]:
hash_tables[partition_id].put(key, value)
executor.map(build_partition, range(num_threads))
# Merge tables (sequential, but fast)
final_table = HashTable()
for ht in hash_tables:
for key, value in ht.items():
final_table.put(key, value)
return final_table
# 2. Parallel lookups (read-only, no locks needed)
def parallel_lookups(hash_table, keys):
with ThreadPoolExecutor(max_workers=num_threads) as executor:
results = list(executor.map(hash_table.get, keys))
return results
# 3. Parallel frequency counting (merge phase)
def parallel_word_count(documents):
with ThreadPoolExecutor(max_workers=num_threads) as executor:
# Each thread counts its partition
partial_counts = executor.map(count_words, documents)
# Merge counts
total_count = {}
for partial in partial_counts:
for word, count in partial.items():
total_count[word] = total_count.get(word, 0) + count
return total_count
def count_words(doc):
counts = {}
for word in doc.split():
counts[word] = counts.get(word, 0) + 1
return counts
# Speedup: Near-linear with number of cores for read-heavy workloads
Approximation
Approximate Data Structures:
- Bloom Filter (Approximate Membership):
import mmh3 # MurmurHash3 (third-party: pip install mmh3)
class BloomFilter:
"""
Space-efficient probabilistic data structure for membership testing.
Trade-off: Small false positive rate for large space savings.
"""
def __init__(self, size, num_hash_functions):
self.size = size
self.num_hashes = num_hash_functions
self.bit_array = [False] * size
def add(self, item):
"""Add item to set."""
for seed in range(self.num_hashes):
index = mmh3.hash(item, seed) % self.size
self.bit_array[index] = True
def contains(self, item):
"""Check if item might be in set."""
for seed in range(self.num_hashes):
index = mmh3.hash(item, seed) % self.size
if not self.bit_array[index]:
return False # Definitely NOT in set
return True # Probably in set (might be false positive)
def false_positive_rate(self, n):
"""
Calculate false positive probability.
FPR ≈ (1 - e^(-k*n/m))^k
where k = num_hashes, n = items inserted, m = size
"""
import math
k, m = self.num_hashes, self.size
return (1 - math.exp(-k * n / m)) ** k
# Example: 1 million items
# Hash table: ~24 MB (with pointers)
# Bloom filter: ~1.2 MB (1% false positive rate)
# Space savings: 20x smaller!
# Use cases:
# - Web crawlers: Check if URL already visited
# - Databases: Avoid disk lookups for non-existent keys
# - CDNs: Check if content in cache before fetching
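The memory figures above fall out of the standard Bloom filter sizing formulas, which pick the bit-array size and hash count for a target false-positive rate (a sketch; `bloom_parameters` is an illustrative name):

```python
import math

def bloom_parameters(n, fpr):
    """Optimal Bloom filter parameters for n expected items and a target
    false positive rate fpr:
      m = -n * ln(fpr) / (ln 2)^2   bits
      k = (m / n) * ln 2            hash functions
    """
    m = math.ceil(-n * math.log(fpr) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

bloom_parameters(1_000_000, 0.01)
# ≈ 9.6 million bits (~1.2 MB) and 7 hash functions
```

This reproduces the 1-million-item example: at a 1% false positive rate the filter needs about 9.6 bits per element, i.e. roughly 1.2 MB.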
- Count-Min Sketch (Approximate Frequency):
class CountMinSketch:
"""
Space-efficient probabilistic data structure for frequency counting.
Trade-off: Approximate counts (overestimation) for space savings.
"""
def __init__(self, width, depth):
self.width = width # Number of buckets per row
self.depth = depth # Number of hash functions
self.table = [[0] * width for _ in range(depth)]
def add(self, item, count=1):
"""Increment count for item."""
for i in range(self.depth):
index = hash((item, i)) % self.width
self.table[i][index] += count
def estimate(self, item):
"""Estimate frequency of item."""
return min(
self.table[i][hash((item, i)) % self.width]
for i in range(self.depth)
)
def error_bound(self, total_count):
"""
Error bound: ε * total_count with probability 1 - δ
where ε = e / width, δ = (1/2)^depth
"""
import math
epsilon = math.e / self.width
delta = (1/2) ** self.depth
return epsilon * total_count, delta
# Example: Count 1 billion elements
# Hash table: ~16 GB (assuming 16 bytes per entry)
# Count-Min Sketch: ~1 MB (1% error, 99% confidence)
# Space savings: 16,000x smaller!
# Use cases:
# - Network traffic analysis: Count packets per IP
# - Stream processing: Top-K frequent items
# - Database query optimization: Estimate join sizes
- HyperLogLog (Approximate Cardinality):
import math
class HyperLogLog:
"""
Space-efficient cardinality estimation.
Trade-off: ~2% error for ~1.5 KB space (for billions of elements).
"""
def __init__(self, precision=14):
self.precision = precision
self.m = 1 << precision # 2^precision buckets
self.buckets = [0] * self.m
# Bias correction constant
if self.m == 16:
self.alpha = 0.673
elif self.m == 32:
self.alpha = 0.697
elif self.m == 64:
self.alpha = 0.709
else:
self.alpha = 0.7213 / (1 + 1.079 / self.m)
def add(self, item):
"""Add item to set."""
# Hash item to 64 bits
h = hash(item) & ((1 << 64) - 1)
# First 'precision' bits determine bucket
bucket = h >> (64 - self.precision)
# Count leading zeros in remaining bits (+ 1)
remaining = h & ((1 << (64 - self.precision)) - 1)
leading_zeros = self._leading_zeros(remaining) + 1
# Update bucket with maximum leading zeros seen
self.buckets[bucket] = max(self.buckets[bucket], leading_zeros)
def cardinality(self):
"""Estimate number of unique items."""
# Harmonic mean of 2^bucket values
raw_estimate = self.alpha * self.m * self.m / sum(
2 ** (-bucket) for bucket in self.buckets
)
# Bias correction for small/large cardinalities
if raw_estimate <= 2.5 * self.m:
# Small range correction
zeros = self.buckets.count(0)
if zeros != 0:
return self.m * math.log(self.m / zeros)
if raw_estimate <= (1/30) * (1 << 32):
return raw_estimate
else:
# Large range correction
return -1 * (1 << 32) * math.log(1 - raw_estimate / (1 << 32))
def _leading_zeros(self, num):
"""Count leading zeros in binary representation."""
if num == 0:
return 64 - self.precision
return (64 - self.precision) - num.bit_length()
# Example: Count unique users from 1 billion events
# Hash set: ~16 GB
# HyperLogLog: ~12 KB (for 14-bit precision, ~0.8% standard error)
# Space savings: >1,000,000x smaller!
# Use cases:
# - Analytics: Count unique visitors
# - Databases: Estimate distinct values for query optimization
# - Ad tech: Count unique ad impressions
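The ~0.8% standard error quoted for precision 14 follows from the well-known HyperLogLog error formula, roughly 1.04/√m; a quick sketch of the space/accuracy trade-off:

```python
import math

# HyperLogLog standard error ≈ 1.04 / sqrt(m), where m = 2^precision buckets
for precision in (10, 12, 14, 16):
    m = 1 << precision
    print(f"precision={precision}: m={m}, stderr ≈ {1.04 / math.sqrt(m):.2%}")
# precision=14 gives m=16384 and stderr ≈ 0.81%
```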
When to Use Approximation:
| Exact Structure        | Approximate Alternative | Space Savings | Use When                                         |
| ---------------------- | ----------------------- | ------------- | ------------------------------------------------ |
| Hash Set               | Bloom Filter            | 10-100x       | Membership testing, can tolerate false positives |
| Hash Map (counts)      | Count-Min Sketch        | 1000-10000x   | Frequency estimation, overestimate acceptable    |
| Hash Set (cardinality) | HyperLogLog             | 1000000x      | Counting unique items, ~2% error acceptable      |
def choose_approximate_structure(use_case):
"""Decision guide for approximate structures."""
if use_case == "membership_testing":
if can_tolerate_false_positives():
return BloomFilter(size=optimal_size(), num_hashes=optimal_k())
else:
return HashSet() # Need exact answers
elif use_case == "frequency_counting":
if need_exact_counts():
return HashMap() # Exact counts
elif can_tolerate_overestimation():
return CountMinSketch(width=1000, depth=5) # Approximate
else:
return HashMap() # Need exact
elif use_case == "cardinality_estimation":
if need_exact_cardinality():
return HashSet() # Exact unique count
elif dataset_massive():
return HyperLogLog(precision=14) # Approximate, tiny space
else:
return HashSet() # Small enough for exact
# Rule of thumb:
# - Small datasets (< 1M elements): Use exact structures
# - Large datasets (> 100M elements): Consider approximate
# - Massive datasets (> 1B elements): Approximate almost required
4. Problem-Solving and Applications
4.1 Use Cases and Applications
Where and When to Use
Most Effective Scenarios:
-
Fast Lookup with Non-Integer Keys:
- Need O(1) retrieval using strings, objects, or complex keys
- Example: User ID → User profile mapping
-
Frequency Counting:
- Count occurrences of elements in stream or array
- Example: Word frequency in document
-
Deduplication:
- Remove or detect duplicates efficiently
- Example: Find unique elements in array
-
Caching/Memoization:
- Store computed results for reuse
- Example: Dynamic programming with complex state
-
Grouping/Partitioning:
- Group elements by property
- Example: Group anagrams, group by category
When to Choose Hash Table Over Alternatives:
| Requirement             | Choose             | Over                          |
| ----------------------- | ------------------ | ----------------------------- |
| Fast lookup by key      | Hash Table         | Binary Search Tree            |
| Don't need sorted order | Hash Table         | Balanced BST                  |
| Membership testing      | Hash Set           | Array or List                 |
| Counting frequencies    | Hash Map (Counter) | Sorted Map                    |
| Caching results         | Hash Map           | Dictionary with linear search |
When NOT to Use:
-
Need Ordered Iteration:
- Hash tables don't maintain insertion or sorted order
- Use: TreeMap, OrderedDict, or sorted array instead
-
Range Queries:
- Can't find "all keys between X and Y"
- Use: Balanced BST (TreeMap) or sorted array with binary search
-
Prefix Searches:
- Can't find "all keys starting with X"
- Use: Trie or prefix tree
-
Small Datasets (< 10-20 elements):
- Hash table overhead not justified
- Simple array with linear search may be faster
-
Memory Constrained:
- Hash tables need extra space (load factor < 1)
- Use: In-place algorithms or compact data structures
Problem Statement Signals:
Keywords/phrases that suggest hash tables:
- "Find if pair/triplet sums to target" → Hash set for O(1) lookup
- "Count frequency/occurrences" → Hash map for counting
- "Group by property" → Hash map with list values
- "Check if exists" → Hash set for membership
- "First non-repeating character" → Hash map with counts
- "Detect duplicates" → Hash set
- "Longest substring without repeating" → Hash set for seen characters
- "Two arrays intersection" → Hash set for O(n) intersection
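For instance, the last signal above — two-array intersection — becomes O(n + m) once one array lives in a hash set (intersection is a hypothetical helper name):

```python
def intersection(a, b):
    """Unique common elements of two arrays via hash sets."""
    set_b = set(b)  # O(m) to build, O(1) per membership test
    return sorted(x for x in set(a) if x in set_b)  # sorted only for stable output

print(intersection([4, 9, 5], [9, 4, 9, 8, 4]))  # [4, 9]
```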
Problem Categories
Classic Problem Types:
-
Two-Sum Pattern:
- Find pair of numbers summing to target
- Use hash map to store complements
- Example: Two Sum, Three Sum, Four Sum
-
Frequency/Counting:
- Count element occurrences
- Example: Most frequent element, k most frequent
-
Grouping/Partitioning:
- Group elements by property
- Example: Group anagrams, group shifted strings
-
Subarray/Substring Problems:
- Find subarray with property (sum, distinct elements)
- Example: Subarray sum equals K, longest substring without repeating
-
Deduplication:
- Remove or identify duplicates
- Example: Contains duplicate, longest consecutive sequence
-
Graph Problems:
- Represent graph as adjacency list (hash map)
- Example: Clone graph, course schedule
Canonical/Classic Problems:
- Two Sum (Topic 64)
- Group Anagrams (Topic 65)
- Longest Consecutive Sequence (Topic 66)
- Subarray Sum Equals K (Topic 67)
- Design HashMap from Scratch (Topic 68)
- Rolling Hash - Rabin-Karp (Topic 69)
Also classic:
- First non-repeating character
- Isomorphic strings
- Valid Sudoku
- LRU Cache (hash map + doubly linked list)
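As a sketch of the first of these, first non-repeating character takes two passes with a frequency map (first_unique_char is a hypothetical name):

```python
from collections import Counter

def first_unique_char(s):
    counts = Counter(s)         # Pass 1: frequency of each character
    for i, ch in enumerate(s):  # Pass 2: first character with count 1
        if counts[ch] == 1:
            return i
    return -1

print(first_unique_char("leetcode"))      # 0  ('l' appears once)
print(first_unique_char("loveleetcode"))  # 2  ('v' is the first unique)
```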
Real-World Applications
Production Systems:
-
Database Indexing:
- Hash indexes for equality searches
- PostgreSQL, MySQL support hash indexes
- Example:
SELECT * FROM users WHERE email = 'user@example.com'
-
Caching Layers:
- Redis: In-memory key-value store using hash tables
- Memcached: Distributed hash table for caching
- CDN caching: Hash URLs to cached responses
-
Compiler/Interpreter Symbol Tables:
- Map variable names to memory locations/types
- Python: Global/local variable namespaces are dicts
- JavaScript: Object properties are hash maps
-
Web Frameworks:
- Routing: Map URL patterns to handlers
- Session management: Session ID → session data
- Cookies: Key-value pairs
-
Blockchain:
- Bitcoin: Map transaction ID → transaction details
- Ethereum: Account address → account state
- Merkle trees use cryptographic hashes
Famous Systems Using Hash Tables:
| System        | Hash Table Usage                       |
| ------------- | -------------------------------------- |
| Google Search | Index: keyword → document IDs          |
| Facebook      | Social graph: User ID → friend list    |
| Amazon        | Product catalog: SKU → product details |
| DNS           | Domain name → IP address               |
| Git           | Blob hash → file content               |
| Linux Kernel  | Process table: PID → process structure |
| Chrome V8     | Object property maps (hidden classes)  |
Domain-Specific Applications:
Finance:
- Order books: Order ID → Order details
- Position tracking: Ticker → positions
- Risk calculations: Portfolio ID → risk metrics
E-commerce:
- Inventory: SKU → stock count
- Shopping carts: User ID → cart items
- Recommendation engines: User ID → recommended products
Social Media:
- Friend graphs: User ID → friends
- News feeds: User ID → feed posts
- Hashtag tracking: Tag → posts
Gaming:
- Entity management: Entity ID → game object
- Player data: Player ID → stats
- Leaderboards: Player → score (sorted by score)
Bioinformatics:
- K-mer counting: k-mer → count
- Sequence alignment: Hash subsequences for fast matching
- Gene databases: Gene ID → gene info
Scale and Performance Requirements
Dataset Size Suitability:
Small (< 1,000):
- Hash table works but may be overkill
- Linear search can be competitive
- Memory overhead noticeable
Medium (1,000 - 1,000,000):
- Sweet spot for hash tables
- Clear advantage over O(log n) structures
- Fits comfortably in RAM
Large (1,000,000 - 1,000,000,000):
- Hash tables excellent if fit in memory
- Watch for cache misses
- Consider cache-optimized variants (Swiss Tables)
Massive (> 1 billion):
- May not fit in single machine memory
- Use distributed hash tables (Chord, Kademlia)
- Or approximate data structures (Bloom filter, Count-Min Sketch)
Performance Guarantees:
| Operation | Average | Worst | Space |
| --------- | ------- | ----- | ----- |
| Search    | O(1)    | O(n)  | O(n)  |
| Insert    | O(1)    | O(n)  | O(n)  |
| Delete    | O(1)    | O(n)  | O(n)  |
Amortized insert/delete: O(1), since the occasional O(n) resize is spread across many operations
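A quick way to see the amortized bound: with capacity doubling, the total resize work over n inserts stays below 2n element moves (total_copies is a hypothetical helper; real tables resize at a load-factor threshold rather than when completely full):

```python
def total_copies(n, initial_capacity=8):
    """Total elements moved by resizes (double when full) over n inserts."""
    copies = 0
    cap = initial_capacity
    size = 0
    for _ in range(n):
        if size == cap:
            copies += size  # Rehash every existing element into the new array
            cap *= 2
        size += 1
    return copies

n = 1_000_000
print(total_copies(n) / n)  # ≈ 1.05: fewer than 2 moves per insert on average
```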
When Hash Table is Overkill:
- Problem solvable in one pass without lookups
- Data already sorted (can use binary search)
- Only need to iterate once (no repeated lookups)
When Hash Table is Insufficient:
- Need sorted order iteration
- Need min/max in O(1) (use heap instead)
- Need range queries (use segment tree or BST)
- Memory extremely limited (use bit arrays or compact structures)
4.2 Problem-Solving Approaches
Recognition Patterns
How to Recognize Hash Table Problems:
-
Explicit Counting:
- "Count the number of..." → Hash map
- "Find the most/least frequent..." → Hash map with counts
-
Existence Checks:
- "Check if X exists" → Hash set
- "Find if there is a pair..." → Hash set
-
Complement/Pair Finding:
- "Find two elements that sum to target"
- Pattern: For each element x, check if (target - x) exists
-
Grouping by Property:
- "Group items that share X"
- Pattern: Hash map where key = property, value = list of items
-
Tracking Seen Elements:
- "Find first duplicate"
- "Longest substring without repeating characters"
- Pattern: Hash set to track what we've seen
-
Index Mapping:
- "Find indices of two numbers that sum to target"
- Pattern: Hash map storing value → index
Input/Output Patterns:
-
Input: Array/list + target/condition
-
Output: Pair/subset satisfying condition
-
Approach: Hash set/map for O(1) complement lookup
-
Input: Array of strings
-
Output: Groups of related strings
-
Approach: Hash map with canonical form as key
Constraint Indicators:
- "In O(n) time" or "Linear time" → Likely hash table
- "Find in one pass" → Hash table for seen elements
- "Can you do better than O(n log n)?" → Hash table to achieve O(n)
Application Strategy
General Problem-Solving Templates:
# Template 1: Existence/Complement Check
def two_sum_pattern(arr, target):
seen = {} # or set() if only checking existence
for i, num in enumerate(arr):
complement = target - num
if complement in seen:
return [seen[complement], i] # Found pair
seen[num] = i
return None
# Template 2: Frequency Counting
def frequency_pattern(arr):
freq = {}
for item in arr:
freq[item] = freq.get(item, 0) + 1
# Process frequencies
return max(freq, key=freq.get) # Example: most frequent
# Template 3: Grouping by Property
def grouping_pattern(items):
groups = {}
for item in items:
key = compute_key(item) # Canonical form
if key not in groups:
groups[key] = []
groups[key].append(item)
return list(groups.values())
# Template 4: Sliding Window with Hash Map
def sliding_window_pattern(arr, k):
    """Example payload: number of distinct elements in each window of size k."""
    window = {}
    result = []
    # Initialize first window
    for i in range(k):
        window[arr[i]] = window.get(arr[i], 0) + 1
    result.append(len(window))
    # Slide window
    for i in range(k, len(arr)):
        # Add new element
        window[arr[i]] = window.get(arr[i], 0) + 1
        # Remove old element
        window[arr[i-k]] -= 1
        if window[arr[i-k]] == 0:
            del window[arr[i-k]]
        result.append(len(window))
    return result
Adapting to Problem-Specific Constraints:
-
Limited Memory:
- Use Count-Min Sketch instead of hash map for approximate counts
- Use Bloom filter for approximate membership
-
Need Sorted Output:
- Use hash map during computation
- Sort results afterward
-
Multiple Lookups:
- Use hash map with tuple keys
- Or nested hash maps
-
Custom Key Types:
- Implement __hash__ and __eq__ methods
- Or convert to canonical string/tuple form
Common Variations:
- Two Sum → Three Sum: Add nested loop, check if -(arr[i] + arr[j]) exists
- Find pair → Find all pairs: Don't return on first find, collect all
- Exact count → Top K frequent: Use heap with hash map
- Single array → Two arrays: Two hash sets, find intersection/difference
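As an example of the find pair → find all pairs variation (all_pairs_with_sum is a hypothetical name), keep collecting matches instead of returning on the first:

```python
def all_pairs_with_sum(arr, target):
    """Every index pair (i, j), i < j, with arr[i] + arr[j] == target."""
    seen = {}   # value -> indices where it has appeared so far
    pairs = []
    for j, num in enumerate(arr):
        for i in seen.get(target - num, []):
            pairs.append((i, j))
        seen.setdefault(num, []).append(j)
    return pairs

print(all_pairs_with_sum([1, 1, 2, 2], 3))  # [(0, 2), (1, 2), (0, 3), (1, 3)]
```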
Combination with Other Techniques
Hash Table + Two Pointers:
Problem: Three Sum - find all triplets summing to zero
def three_sum(nums):
nums.sort() # Sort for two-pointer technique
result = []
seen_triplets = set() # Hash set to avoid duplicates
for i in range(len(nums) - 2):
left, right = i + 1, len(nums) - 1
while left < right:
total = nums[i] + nums[left] + nums[right]
if total == 0:
triplet = (nums[i], nums[left], nums[right])
if triplet not in seen_triplets:
result.append(list(triplet))
seen_triplets.add(triplet)
left += 1
right -= 1
elif total < 0:
left += 1
else:
right -= 1
return result
Hash Table + Sliding Window:
Problem: Longest Substring Without Repeating Characters
def length_of_longest_substring(s):
char_index = {} # char → last seen index
max_length = 0
start = 0
for end in range(len(s)):
if s[end] in char_index and char_index[s[end]] >= start:
start = char_index[s[end]] + 1 # Move start past duplicate
char_index[s[end]] = end
max_length = max(max_length, end - start + 1)
return max_length
Hash Table + BFS/DFS:
Problem: Clone Graph
def clone_graph(node):
if not node:
return None
cloned = {} # Original node → cloned node mapping
def dfs(node):
if node in cloned:
return cloned[node]
clone = Node(node.val)
cloned[node] = clone # Store before recursing (handles cycles)
for neighbor in node.neighbors:
clone.neighbors.append(dfs(neighbor))
return clone
return dfs(node)
Hash Table + Prefix Sum:
Problem: Subarray Sum Equals K
def subarray_sum(nums, k):
prefix_sum = 0
sum_count = {0: 1} # prefix_sum → count of occurrences
count = 0
for num in nums:
prefix_sum += num
# If (prefix_sum - k) exists, found subarray
if prefix_sum - k in sum_count:
count += sum_count[prefix_sum - k]
sum_count[prefix_sum] = sum_count.get(prefix_sum, 0) + 1
return count
Hash Table + Heap:
Problem: Top K Frequent Elements
import heapq
from collections import Counter
def top_k_frequent(nums, k):
# Hash map for frequency counting
freq = Counter(nums)
    # heapq.nlargest maintains a min-heap of size k internally: O(n log k)
    return heapq.nlargest(k, freq.keys(), key=freq.get)
Hash Table + Trie:
Problem: Word Search II (find all words from dictionary in 2D board)
class TrieNode:
def __init__(self):
self.children = {} # Hash map for children
self.word = None
def find_words(board, words):
# Build trie from words
root = TrieNode()
for word in words:
node = root
for char in word:
if char not in node.children:
node.children[char] = TrieNode()
node = node.children[char]
node.word = word
    # DFS on board using trie (defined here since the walk carries a trie node)
    def dfs(board, i, j, node, result):
        char = board[i][j]
        if char not in node.children:
            return
        node = node.children[char]
        if node.word:
            result.append(node.word)
            node.word = None  # Report each word once
        board[i][j] = '#'  # Mark cell as visited ('#' matches no trie child)
        for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
            if 0 <= ni < len(board) and 0 <= nj < len(board[0]):
                dfs(board, ni, nj, node, result)
        board[i][j] = char  # Restore cell on backtrack
    result = []
    for i in range(len(board)):
        for j in range(len(board[0])):
            dfs(board, i, j, root, result)
    return result
Preprocessing Patterns:
-
Sort then Hash:
- Group anagrams: Sort each string, use sorted string as key
-
Normalize then Hash:
- Case-insensitive comparison: Lowercase before hashing
- Canonical form: Convert to standard representation
-
Accumulate then Hash:
- Prefix sums: Store cumulative sum at each index as key
Postprocessing Patterns:
-
Hash to Collect, Sort to Output:
freq = Counter(arr)
return sorted(freq.items(), key=lambda x: x[1], reverse=True)
-
Hash to Deduplicate, List to Output:
seen = set()
result = []
for item in arr:
    if item not in seen:
        result.append(item)
        seen.add(item)
7. Comparative Analysis
7.1 Comparison with Alternatives
Feature-by-Feature Comparison
Hash Table vs Binary Search Tree (BST):
| Feature               | Hash Table                         | Binary Search Tree                        |
| --------------------- | ---------------------------------- | ----------------------------------------- |
| Search                | O(1) average, O(n) worst           | O(log n) average, O(n) worst (unbalanced) |
| Insert                | O(1) average, O(n) worst           | O(log n) average, O(n) worst (unbalanced) |
| Delete                | O(1) average, O(n) worst           | O(log n) average, O(n) worst (unbalanced) |
| Min/Max               | O(n)                               | O(log n) or O(1) with pointer             |
| Ordered Iteration     | No                                 | Yes (in-order traversal)                  |
| Range Query           | No                                 | O(log n + k) where k = results            |
| Predecessor/Successor | No                                 | O(log n)                                  |
| Memory Overhead       | Load factor dependent              | Pointer overhead                          |
| Cache Performance     | Good (open addr) / Poor (chaining) | Poor (pointer chasing)                    |
When to choose:
- Hash Table: Need O(1) lookups, don't need ordering
- BST: Need ordering, range queries, or min/max operations
Hash Table vs Balanced BST (Red-Black, AVL):
| Feature                   | Hash Table         | Balanced BST                         |
| ------------------------- | ------------------ | ------------------------------------ |
| Search                    | O(1) average       | O(log n) guaranteed                  |
| Insert                    | O(1) average       | O(log n) guaranteed                  |
| Delete                    | O(1) average       | O(log n) guaranteed                  |
| Worst-case guarantee      | No (O(n) possible) | Yes (O(log n))                       |
| Ordered operations        | No                 | Yes                                  |
| Implementation complexity | Simple (chaining)  | Complex (rotations)                  |
| Memory overhead           | Moderate           | High (color/balance bits + pointers) |
When to choose:
- Hash Table: Performance-critical, no ordering needed
- Balanced BST: Need guaranteed O(log n), ordered operations
Hash Table vs Sorted Array:
| Feature           | Hash Table         | Sorted Array                  |
| ----------------- | ------------------ | ----------------------------- |
| Search            | O(1) average       | O(log n) binary search        |
| Insert            | O(1) average       | O(n) maintain sorted order    |
| Delete            | O(1) average       | O(n) shift elements           |
| Space             | O(n / load_factor) | O(n)                          |
| Range Query       | No                 | O(log n + k)                  |
| Cache Performance | Varies             | Excellent (contiguous)        |
| Dynamic resizing  | Efficient          | Expensive                     |
| Best for          | Dynamic data       | Static or rarely-changed data |
When to choose:
- Hash Table: Frequent insertions/deletions
- Sorted Array: Static data with many queries
Hash Table vs Trie:
| Feature               | Hash Table   | Trie                                |
| --------------------- | ------------ | ----------------------------------- |
| Search                | O(1) average | O(k) where k = key length           |
| Insert                | O(1) average | O(k)                                |
| Delete                | O(1) average | O(k)                                |
| Space                 | O(n)         | O(ALPHABET_SIZE · k · n) worst case |
| Prefix search         | No           | O(p) where p = prefix length        |
| Autocomplete          | No           | Excellent                           |
| Longest common prefix | No           | Natural                             |
| Best for strings      | Yes (hash)   | Yes (prefix operations)             |
When to choose:
- Hash Table: Exact match lookups, any key type
- Trie: Prefix searches, autocomplete, string-specific operations
Hash Table (Chaining) vs Hash Table (Open Addressing):
| Feature           | Chaining                   | Open Addressing                  |
| ----------------- | -------------------------- | -------------------------------- |
| Load factor       | Can exceed 1.0             | Must stay < 1.0                  |
| Cache performance | Poor                       | Excellent                        |
| Deletion          | Easy                       | Complex (tombstones)             |
| Memory overhead   | Pointer per node           | Slack space in array             |
| Implementation    | Simple                     | Moderate                         |
| Worst-case search | O(n)                       | O(n)                             |
| Best for          | Variable load, simple code | Performance-critical, fixed load |
When to choose:
- Chaining: Simple implementation, variable load factor
- Open Addressing: Performance-critical, cache-friendly
Quantitative Performance Comparison:
import time
import random
def benchmark_structures(n=100000):
"""
Benchmark different data structures for common operations.
"""
keys = list(range(n))
random.shuffle(keys)
# Hash Table (dict)
start = time.perf_counter()
hash_table = {}
for key in keys:
hash_table[key] = key
insert_time_hash = time.perf_counter() - start
start = time.perf_counter()
for key in keys:
_ = hash_table[key]
search_time_hash = time.perf_counter() - start
# Sorted List (bisect)
import bisect
start = time.perf_counter()
sorted_list = []
for key in keys:
bisect.insort(sorted_list, key)
insert_time_sorted = time.perf_counter() - start
start = time.perf_counter()
for key in keys:
bisect.bisect_left(sorted_list, key)
search_time_sorted = time.perf_counter() - start
# Unsorted List
start = time.perf_counter()
unsorted_list = []
for key in keys:
unsorted_list.append(key)
insert_time_unsorted = time.perf_counter() - start
start = time.perf_counter()
for key in keys:
key in unsorted_list
search_time_unsorted = time.perf_counter() - start
print(f"Results for n={n}:")
print(f"Hash Table - Insert: {insert_time_hash:.3f}s, Search: {search_time_hash:.3f}s")
print(f"Sorted List - Insert: {insert_time_sorted:.3f}s, Search: {search_time_sorted:.3f}s")
print(f"Unsorted List - Insert: {insert_time_unsorted:.3f}s, Search: {search_time_unsorted:.3f}s")
# Typical output for n=100,000:
# Hash Table - Insert: 0.008s, Search: 0.005s
# Sorted List - Insert: 8.523s, Search: 0.012s
# Unsorted List - Insert: 0.003s, Search: 12.456s
Contextual Comparison
Use Case Decision Matrix:
| Use Case               | Best Choice  | Runner-up              | Avoid         |
| ---------------------- | ------------ | ---------------------- | ------------- |
| Exact match lookup     | Hash Table   | -                      | Trie          |
| Prefix search          | Trie         | -                      | Hash Table    |
| Range query            | Balanced BST | Sorted Array           | Hash Table    |
| Frequent insert/delete | Hash Table   | -                      | Sorted Array  |
| Static data            | Sorted Array | Hash Table             | Linked List   |
| Ordered iteration      | Balanced BST | Sorted Array           | Hash Table    |
| Memory-constrained     | Sorted Array | Hash Table (high load) | Trie          |
| Guaranteed worst-case  | Balanced BST | Cuckoo Hash            | Standard Hash |
| String prefix ops      | Trie         | -                      | Hash Table    |
| Approximate membership | Bloom Filter | -                      | Hash Set      |
Context-Specific Recommendations:
-
Compiler Symbol Table:
- Best: Hash Table
- Reason: Fast O(1) lookups, identifiers are unique, no need for ordering
- Alternative: Balanced BST if scoping requires ordered traversal
-
Database Index:
- Best: B-tree (Balanced BST variant)
- Reason: Range queries, ordered access, disk-friendly
- Alternative: Hash index for exact-match-only columns
-
Phone Directory:
- Best: Trie (for name prefix search)
- Reason: Autocomplete, prefix matching
- Alternative: Hash Table for exact name lookup
-
Spell Checker:
- Best: Hash Set
- Reason: Membership test, exact word match
- Alternative: Trie for suggestions, Bloom filter if memory tight
-
URL Cache:
- Best: Hash Table (LRU Cache)
- Reason: Fast lookup, combine with doubly-linked list for LRU
- Alternative: None suitable
-
Event Scheduler:
- Best: Balanced BST or Heap
- Reason: Need ordered by time, min/max operations
- Alternative: Hash Table if only lookup by event ID
Decision Criteria:
def choose_data_structure(requirements):
"""
Comprehensive decision tree.
"""
# Priority 1: Ordering requirements
if requirements.need_ordered_iteration:
return "Balanced BST"
if requirements.need_range_queries:
return "Balanced BST or Segment Tree"
# Priority 2: Operation type
if requirements.primary_operation == "prefix_search":
return "Trie"
if requirements.primary_operation == "exact_lookup":
if requirements.performance_critical:
return "Hash Table (Open Addressing)"
else:
return "Hash Table (Chaining)"
# Priority 3: Data characteristics
if requirements.data_static:
return "Sorted Array"
if requirements.data_very_dynamic:
return "Hash Table"
# Priority 4: Guarantees
if requirements.need_worst_case_guarantee:
return "Balanced BST or Cuckoo Hash Table"
if requirements.can_tolerate_approximation:
return "Bloom Filter or Count-Min Sketch"
# Default
return "Hash Table"
7.2 Evolution and Variants
Historical Progression
Timeline of Hash Table Evolution:
-
1950s - Birth:
- 1953: Hans Peter Luhn (IBM) - First hash table proposal
- 1956: Arnold Dumey - Formalized scatter storage
- Technique: Simple division method, linear probing
- Limitation: Poor collision handling, no load factor management
-
1960s - Theoretical Foundation:
- 1963: Gene Amdahl - Analyzed chaining
- 1963: Konheim & Weiss - Linear probing mathematical analysis
- Innovation: Separate chaining introduced
- Impact: Theoretical complexity bounds established
-
1970s - Refinement:
- 1973: Knuth - Comprehensive analysis in TAOCP Vol. 3
- 1979: Carter & Wegman - Universal hashing theory
- Innovation: Load factor management, better hash functions
- Impact: Provable average-case guarantees
-
1980s - Practical Optimizations:
- 1982: Dynamic hashing (Larson)
- 1985: Linear hashing
- 1986: Robin Hood hashing (Celis)
- Innovation: Incremental resizing, better memory management
- Impact: Production systems could scale gracefully
-
1990s - Security and Variants:
- 1993: Multiplicative hashing gains popularity
- 1996: SHA-1 becomes standard secure hash
- 1997: Perfect hashing for static data
- Innovation: Cryptographic hashes, security awareness
- Impact: Protection against collision attacks
-
2000s - Advanced Variants:
- 2001: Cuckoo hashing (Pagh & Rodler)
- 2004: Hopscotch hashing
- 2007: MurmurHash (fast, good distribution)
- Innovation: Worst-case O(1) lookup, cache-friendly designs
- Impact: Real-time systems could use hash tables
-
2010s - Modern Optimization:
- 2012: SipHash (defense against hash-flooding DoS)
- 2016: Rust's HashMap uses Robin Hood hashing
- 2017: Google Swiss Tables (SIMD optimization)
- Innovation: Hardware-aware design, SIMD parallelization
- Impact: 2-3x speedup on modern CPUs
-
2020s - Current State:
- 2020: xxHash3 (extremely fast non-cryptographic hash)
- 2021: wyhash optimization
- Present: Cache-oblivious algorithms, GPU hashing
- Innovation: AI/ML integration, quantum-resistant hashing research
- Impact: Handles big data and real-time analytics
What Each Era Replaced:
| Era   | Old Approach                | New Approach             | Improvement             |
| ----- | --------------------------- | ------------------------ | ----------------------- |
| 1960s | Linear search               | Simple hash table        | O(n) → O(1) average     |
| 1970s | Poor hash functions         | Universal hashing        | Provable guarantees     |
| 1980s | Static size                 | Dynamic resizing         | No manual tuning        |
| 1990s | Predictable hashes          | Randomized hashes        | DoS protection          |
| 2000s | Chaining (pointer overhead) | Open addressing variants | Better cache use        |
| 2010s | Scalar operations           | SIMD operations          | 2-4x speedup            |
| 2020s | General-purpose             | Hardware-specialized     | 10x+ for specific tasks |
Variant Analysis
Major Hash Table Variants:
-
Separate Chaining:
class ChainingHashTable:
    def __init__(self, capacity=16):
        self.buckets = [[] for _ in range(capacity)]
    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
- Pros: Simple, load factor can exceed 1, easy deletion
- Cons: Pointer overhead, cache misses
- Best for: General purpose, variable load
-
Open Addressing (Linear Probing):
class LinearProbingHashTable:
    def __init__(self, capacity=16):
        self.table = [None] * capacity
    def put(self, key, value):
        i = hash(key) % len(self.table)
        while self.table[i] is not None and self.table[i][0] != key:
            i = (i + 1) % len(self.table)
        self.table[i] = (key, value)
- Pros: Cache-friendly, no pointers
- Cons: Clustering, complex deletion
- Best for: Performance-critical, low load factor
-
Robin Hood Hashing:
class RobinHoodHashTable:
    def __init__(self, capacity=16):
        self.table = [None] * capacity
        self.distances = [0] * capacity
    def put(self, key, value):
        i = hash(key) % len(self.table)
        distance = 0
        while self.table[i] is not None:
            if self.distances[i] < distance:
                # Swap: current entry is "richer" (closer to home) than new entry
                key, value, distance, self.table[i], self.distances[i] = \
                    self.table[i][0], self.table[i][1], self.distances[i], \
                    (key, value), distance
            i = (i + 1) % len(self.table)
            distance += 1
        self.table[i] = (key, value)
        self.distances[i] = distance
- Pros: Reduces variance, better worst-case
- Cons: More complex than linear probing
- Best for: Predictable latency
-
Cuckoo Hashing:
class CuckooHashTable:
    def __init__(self, capacity=16):
        self.table1 = [None] * capacity
        self.table2 = [None] * capacity
        self.max_evictions = 10
    def _hash1(self, key):
        return hash(key)
    def _hash2(self, key):
        return hash((key, 0x9E3779B9))  # Salted second hash, independent-ish
    def _rehash(self):
        # Grow both tables and reinsert every stored entry
        entries = [e for e in self.table1 + self.table2 if e is not None]
        self.table1 = [None] * (2 * len(self.table1))
        self.table2 = [None] * (2 * len(self.table2))
        for k, v in entries:
            self.put(k, v)
    def put(self, key, value):
        return self._insert(key, value, 0)
    def _insert(self, key, value, evictions):
        if evictions > self.max_evictions:
            self._rehash()
            return self.put(key, value)
        # Try table 1
        i1 = self._hash1(key) % len(self.table1)
        if self.table1[i1] is None:
            self.table1[i1] = (key, value)
            return
        # Evict from table 1, insert in table 2
        old_key, old_value = self.table1[i1]
        self.table1[i1] = (key, value)
        i2 = self._hash2(old_key) % len(self.table2)
        if self.table2[i2] is None:
            self.table2[i2] = (old_key, old_value)
        else:
            evicted = self.table2[i2]
            self.table2[i2] = (old_key, old_value)
            self._insert(evicted[0], evicted[1], evictions + 1)
    def get(self, key):
        # Check both tables: guaranteed O(1)
        i1 = self._hash1(key) % len(self.table1)
        if self.table1[i1] and self.table1[i1][0] == key:
            return self.table1[i1][1]
        i2 = self._hash2(key) % len(self.table2)
        if self.table2[i2] and self.table2[i2][0] == key:
            return self.table2[i2][1]
        return None
- Pros: O(1) worst-case lookup
- Cons: Complex insertion, requires two tables
- Best for: Real-time systems needing guaranteed latency
-
Hopscotch Hashing:
class HopscotchHashTable:
    def __init__(self, capacity=16, hop_range=32):
        self.table = [None] * capacity
        self.hop_range = hop_range
        self.hop_info = [0] * capacity  # Bitmap of occupied slots per home bucket
    def put(self, key, value):
        i = hash(key) % len(self.table)
        # Find free slot within hop range
        free_slot = i
        while free_slot < i + self.hop_range and self.table[free_slot] is not None:
            free_slot += 1
        if free_slot >= i + self.hop_range:
            # Move entries to make room (complex logic)
            free_slot = self._make_room(i)
        self.table[free_slot] = (key, value)
        self.hop_info[i] |= (1 << (free_slot - i))
- Pros: Bounded search, cache-friendly
- Cons: Complex implementation
- Best for: High-performance concurrent hash tables
-
Swiss Tables (Google Abseil):
class SwissTable:
    def __init__(self, capacity=16):
        self.groups = capacity // 16
        self.ctrl = bytearray(capacity)  # Control bytes
        self.slots = [None] * capacity
    def put(self, key, value):
        h1 = hash(key)
        h2 = h1 >> 57  # Top 7 bits for H2 hash
        group = (h1 % self.groups) * 16
        # SIMD: Compare H2 against 16 control bytes at once
        matches = self._simd_match(self.ctrl[group:group+16], h2)
        for offset in matches:
            if self.slots[group + offset] and self.slots[group + offset][0] == key:
                self.slots[group + offset] = (key, value)
                return
        # Insert in first empty slot
        for offset in range(16):
            if self.ctrl[group + offset] == 0:  # Empty
                self.ctrl[group + offset] = h2
                self.slots[group + offset] = (key, value)
                return
- Pros: SIMD optimization, excellent cache performance
- Cons: Requires modern CPU, complex implementation
- Best for: Modern C++ applications (C++17+)
Variant Comparison Table:
| Variant        | Lookup     | Insert      | Delete     | Space  | Cache     | Complexity | Use Case        |
| -------------- | ---------- | ----------- | ---------- | ------ | --------- | ---------- | --------------- |
| Chaining       | O(1) avg   | O(1) avg    | O(1) avg   | High   | Poor      | Low        | General purpose |
| Linear Probing | O(1/(1-α)) | O(1/(1-α)²) | O(1/(1-α)) | Medium | Excellent | Low        | Performance     |
| Robin Hood     | O(1/(1-α)) | O(1/(1-α)²) | O(1/(1-α)) | Medium | Excellent | Medium     | Predictable     |
| Cuckoo         | O(1) worst | O(1) amort  | O(1)       | High   | Good      | High       | Real-time       |
| Hopscotch      | O(1)       | O(1)        | O(1)       | Medium | Excellent | High       | Concurrent      |
| Swiss Tables   | O(1)       | O(1)        | O(1)       | Medium | Excellent | Very High  | Modern C++      |
Choosing the Right Variant:
def choose_hash_table_variant(requirements):
"""
Choose hash table variant based on requirements.
"""
if requirements.need_worst_case_guarantee:
if requirements.can_afford_space:
return "Cuckoo Hashing"
else:
return "Hopscotch Hashing"
if requirements.need_concurrent:
return "Hopscotch or Lock-Free"
if requirements.performance_critical:
if requirements.modern_cpu:
return "Swiss Tables"
else:
return "Robin Hood or Linear Probing"
if requirements.simple_implementation:
return "Separate Chaining"
if requirements.low_memory:
return "Open Addressing (Linear Probing)"
# Default
return "Separate Chaining" # Most common, simplest
7.3 Related Concepts
Closely Related Structures/Algorithms
Hash Table Family Tree:
Hash-Based Data Structures
│
├── Exact Structures
│ ├── Hash Map (key-value)
│ │ ├── HashMap (Java)
│ │ ├── unordered_map (C++)
│ │ ├── dict (Python)
│ │ └── Map (JavaScript)
│ │
│ ├── Hash Set (keys only)
│ │ ├── HashSet (Java)
│ │ ├── unordered_set (C++)
│ │ ├── set (Python)
│ │ └── Set (JavaScript)
│ │
│ ├── Multi-Map/Multi-Set
│ │ └── Allows duplicate keys
│ │
│ └── Bidirectional Map
│ └── Key ↔ Value both indexed
│
├── Approximate Structures
│ ├── Bloom Filter (membership)
│ ├── Count-Min Sketch (frequency)
│ ├── HyperLogLog (cardinality)
│ └── Quotient Filter (membership)
│
├── Distributed Structures
│ ├── Consistent Hashing
│ ├── Distributed Hash Table (DHT)
│ │ ├── Chord
│ │ ├── Kademlia
│ │ └── Pastry
│ │
│ └── Rendezvous Hashing
│
├── Concurrent Structures
│ ├── Lock-Free Hash Table
│ ├── ConcurrentHashMap (Java)
│ ├── TBB concurrent_hash_map (C++)
│ └── Striped Hash Table
│
└── Specialized Structures
├── Perfect Hash Table
├── Minimal Perfect Hash Table
├── Locality-Sensitive Hashing (LSH)
└── Geometric Hashing
Relationships:
-
Hash Table ↔ Hash Function:
- Hash table depends on hash function
- Quality of hash function determines collision rate
- Example hash functions:
- Division method
- Multiplication method
- Universal hashing
- Cryptographic hashes (MD5, SHA-256)
-
Hash Table ↔ Array:
- Hash table built on top of array
- Hash function maps keys to array indices
- Array provides O(1) random access
-
Hash Table ↔ Linked List (Chaining):
- Chaining uses linked lists for collision resolution
- Each bucket is a linked list
- Poor cache locality but simple
-
Hash Table ↔ BST (Java 8 HashMap):
- When chain length > 8, convert to tree
- Guarantees O(log n) even with poor hash
- Best of both worlds
-
Hash Set ↔ Hash Map:
- Hash set is hash map with dummy values
- Implementation: HashSet uses HashMap internally
- Only stores keys, values are placeholder
-
Hash Table ↔ Bloom Filter:
- Bloom filter is probabilistic hash set
- Multiple hash functions
- False positives possible, no false negatives
- Much smaller than hash set
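To make the Bloom-filter relationship concrete, here is a minimal sketch. The class name, bit-array size, hash count, and the double-hashing derivation of the k indexes are illustrative choices, not tuned parameters:

```python
import hashlib

class BloomFilter:
    """Probabilistic set: false positives possible, no false negatives."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # A Python int works as an arbitrary-length bit array

    def _indexes(self, item):
        # Derive k indexes from two hash values (Kirsch-Mitzenmacher trick)
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], 'big')
        h2 = int.from_bytes(digest[8:16], 'big')
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for idx in self._indexes(item):
            self.bits |= (1 << idx)

    def might_contain(self, item):
        # True may be a false positive; False is always correct
        return all(self.bits & (1 << idx) for idx in self._indexes(item))

bf = BloomFilter()
bf.add("apple")
print(bf.might_contain("apple"))   # True
print(bf.might_contain("banana"))  # False (with overwhelming probability)
```

The entire filter fits in 1024 bits regardless of how many items are added; the price is a growing false-positive rate as more bits get set.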
Hybrid Solutions:
-
LRU Cache (Hash Map + Doubly Linked List):
class Node:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.prev = None
        self.next = None

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = {}  # Hash map: key -> node
        self.head = Node(0, 0)  # Dummy head
        self.tail = Node(0, 0)  # Dummy tail
        self.head.next = self.tail
        self.tail.prev = self.head

    def get(self, key):
        if key in self.cache:
            node = self.cache[key]
            self._move_to_front(node)
            return node.value
        return -1

    def put(self, key, value):
        if key in self.cache:
            node = self.cache[key]
            node.value = value
            self._move_to_front(node)
        else:
            if len(self.cache) >= self.capacity:
                # Evict LRU
                lru = self.tail.prev
                self._remove(lru)
                del self.cache[lru.key]
            node = Node(key, value)
            self.cache[key] = node
            self._add_to_front(node)

# Hash map: O(1) lookup
# Doubly linked list: O(1) move to front, O(1) remove tail
# Combined: O(1) get and put
-
TreeMap with Hashing (for buckets):
class HashTreeMap:
    """
    Combines hash table (for speed) with BST (for ordering within buckets).
    """
    def __init__(self, capacity=16):
        self.buckets = [None] * capacity

    def put(self, key, value):
        bucket_idx = hash(key) % len(self.buckets)
        if self.buckets[bucket_idx] is None:
            self.buckets[bucket_idx] = BST()
        self.buckets[bucket_idx].insert(key, value)

    def range_query(self, low, high):
        """
        Get all keys in range [low, high].
        Hash to relevant buckets, then BST range query.
        """
        results = []
        for bucket in self.buckets:
            if bucket:
                results.extend(bucket.range_query(low, high))
        return results

# Benefits: Fast exact lookup (hash) + range queries (BST)
-
Trie with Hash Maps (for children):
class TrieNode:
    def __init__(self):
        self.children = {}  # Hash map instead of array
        self.is_end = False

# Array-based trie: Fixed alphabet size (e.g., 26 for a-z)
# Hash map-based trie: Any characters, dynamic
# Trade-off: Speed vs flexibility
Cross-Domain Connections
Hash Tables in Different Fields:
-
Databases:
- Hash Indexes: Fast equality lookups
- Hash Joins: Join two tables by hashing smaller table
- Query Optimization: Cardinality estimation with HyperLogLog
- Caching: Query result cache using hash map
-
Cryptography:
- Password Storage: Hash passwords with salt
- Digital Signatures: Hash document, sign hash
- Blockchain: Hash pointers in Merkle trees
- Integrity: File hashing (checksums)
-
Networking:
- Load Balancing: Consistent hashing for server selection
- ARP Tables: IP address → MAC address
- DNS Caching: Domain name → IP address
- NAT: Map internal IP:port to external IP:port
-
Compilers:
- Symbol Tables: Identifier → type/address
- String Interning: Deduplicate string literals
- Constant Folding: Cache computed constants
- Macro Expansion: Macro name → definition
-
Operating Systems:
- Page Tables: Virtual address → physical address
- File Systems: Inode hashing, directory structures
- Process Tables: PID → process control block
- Buffer Cache: Block number → cached block
-
Graphics/Computer Vision:
- Geometric Hashing: Object recognition
- Locality-Sensitive Hashing: Similar image search
- Texture Caching: Hash texture coordinates
- Collision Detection: Spatial hashing
-
Machine Learning:
- Feature Hashing: High-dimensional sparse features
- Hashing Trick: Reduce dimensionality
- MinHash: Similar document detection
- SimHash: Near-duplicate detection
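The feature-hashing (hashing trick) idea above fits in a few lines. This sketch maps a variable-length token set into a fixed-size count vector; the function name and bucket count are arbitrary choices for illustration:

```python
def hash_features(tokens, num_buckets=16):
    """Map arbitrary string features into a fixed-size vector via the
    'hashing trick'; collisions trade a little accuracy for bounded memory."""
    vec = [0] * num_buckets
    for tok in tokens:
        # Built-in hash() is fine for a demo; production systems use a
        # seeded hash like MurmurHash so buckets are stable across runs
        vec[hash(tok) % num_buckets] += 1
    return vec

vec = hash_features(["the", "cat", "sat", "the"])
print(len(vec))   # 16: dimensionality is fixed regardless of vocabulary
print(sum(vec))   # 4: every token lands in some bucket
```

Note that no vocabulary dictionary is stored at all; the hash function itself is the feature index, which is why this scales to unbounded feature spaces.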
Mathematical Connections:
-
Probability Theory:
- Birthday paradox → Collision probability
- Poisson distribution → Chain length distribution
- Load factor → Expected performance
-
Number Theory:
- Prime numbers → Table size selection
- Modular arithmetic → Index computation
- Universal hashing → Prime field arithmetic
-
Information Theory:
- Entropy → Hash distribution quality
- Collision resistance → Information loss
- Perfect hashing → Bijective function
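The birthday-paradox connection can be quantified with the standard approximation P(collision) ≈ 1 − e^(−n(n−1)/2m) for n keys hashed uniformly into m slots; a quick sketch:

```python
import math

def collision_probability(n_keys, table_size):
    """Birthday-paradox estimate of P(at least one collision) when
    hashing n_keys uniformly into table_size slots:
    approximately 1 - exp(-n(n-1) / (2m))."""
    return 1 - math.exp(-n_keys * (n_keys - 1) / (2 * table_size))

# The classic result: 23 'people', 365 'days' gives ~50%
print(f"{collision_probability(23, 365):.3f}")       # ≈ 0.500
# Even a sparse table collides: 1,000 keys in 1,000,000 slots
print(f"{collision_probability(1000, 10**6):.3f}")   # ≈ 0.393
```

This is why collision resolution is mandatory, not optional: collisions appear long before the table is anywhere near full.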
7.4 Practical Considerations
Library Implementations
Standard Library Implementations Across Languages:
-
Python dict:
# Internal: Open addressing with perturbation-based (pseudo-random) probing
d = {}  # or dict()
d[key] = value               # O(1) average
value = d[key]               # Raises KeyError if not found
value = d.get(key, default)  # Returns default if not found

# Advanced features:
d.setdefault(key, default)   # Get or insert with default
d.pop(key, default)          # Remove and return, or default
d.keys()    # dict_keys view (dynamic)
d.values()  # dict_values view
d.items()   # dict_items view (key-value pairs)

# Python 3.7+: Maintains insertion order
# Implementation: Combined hash/index table
-
Java HashMap:
// Internal: Chaining, converts a chain to a red-black tree when length > 8
HashMap<K, V> map = new HashMap<>();
map.put(key, value);      // O(1) average
V value = map.get(key);   // null if not found
map.remove(key);

// Advanced features:
map.putIfAbsent(key, value);
map.computeIfAbsent(key, k -> computeValue(k));
map.getOrDefault(key, defaultValue);
map.forEach((k, v) -> process(k, v));

// Java 8+: Streams support
map.entrySet().stream()
   .filter(e -> e.getValue() > 10)
   .collect(Collectors.toMap(...));
-
C++ std::unordered_map:
// Internal: Chaining -- the standard's bucket API and pointer-stability
// guarantees effectively mandate it. Swiss Tables live in third-party
// containers such as absl::flat_hash_map, not std::unordered_map.
std::unordered_map<K, V> map;
map[key] = value;           // O(1) average
auto it = map.find(key);    // Returns iterator
if (it != map.end()) {
    V value = it->second;
}

// Advanced features:
map.emplace(key, value);           // Construct in-place
map.insert_or_assign(key, value);  // C++17
map.try_emplace(key, args...);     // C++17

// Custom hash function:
struct CustomHash {
    size_t operator()(const K& k) const {
        return std::hash<K>{}(k);
    }
};
std::unordered_map<K, V, CustomHash> custom_map;
-
JavaScript Map:
// Internal: Varies by engine (V8 backs Map with a deterministic
// ordered hash table)
const map = new Map();
map.set(key, value);         // O(1) average
const value = map.get(key);  // undefined if not found
map.has(key);                // boolean
map.delete(key);

// Maintains insertion order
for (const [key, value] of map) {
    console.log(key, value);
}

// Advanced:
map.forEach((value, key) => process(key, value));
const keys = Array.from(map.keys());

// Note: Objects can also be used as maps, but Map is better
// Map vs Object:
// - Map: Any key type, maintains order, .size property
// - Object: String/Symbol keys only, may have prototype pollution
-
Go map:
// Internal: Bucketized chaining (8-slot buckets with overflow buckets);
// Go 1.24 moved to a Swiss Table design
m := make(map[string]int)
m[key] = value           // O(1) average
value, exists := m[key]  // Check existence
delete(m, key)

// Iteration (order deliberately randomized):
for key, value := range m {
    fmt.Println(key, value)
}

// Pre-allocation:
m := make(map[string]int, capacity)

// Note: Maps are not safe for concurrent use
// Use sync.Map for concurrent access
Hidden Optimizations in Libraries:
-
Python dict (3.6+):
- Compact dict: Separate hash/index table from key/value table
- Memory saving: 20-25% less memory than old implementation
- Insertion order: Free with new design
- Fast iteration: Keys/values stored contiguously
-
Java HashMap (8+):
- Treeification: Chains → Red-Black trees when length > 8
- Untreeification: Trees → chains when shrinking below 6
- Prevents: O(n) worst-case from hash collision attacks
- Trade-off: More complex, but guaranteed O(log n)
-
C++ Swiss Tables (absl::flat_hash_map; std::unordered_map itself is locked into chaining by the standard's bucket API and pointer-stability guarantees):
- Swiss Tables: Group-based open addressing with SIMD
- Control bytes: Metadata for each slot (empty/deleted/H2 hash)
- SIMD matching: 16 parallel comparisons per group
- Performance: Often 2-3x faster than chaining-based std::unordered_map
-
JavaScript V8:
- Swiss Tables: Used for dictionary-mode object properties (SwissNameDictionary)
- Inline caching: JIT optimizations for repeated access patterns
- Hidden classes: Optimize property access on objects used as maps
When to Use Library vs Implement:
| Scenario                  | Use Library   | Implement Custom            |
| ------------------------- | ------------- | --------------------------- |
| General purpose           | ✓             |                             |
| Production code           | ✓             |                             |
| Learning/education        |               | ✓                           |
| Special requirements      |               | ✓ (e.g., lock-free)         |
| Language built-in         | ✓             |                             |
| Specialized hash function | Partial       | ✓ (custom hash)             |
| Memory-constrained        | Maybe         | ✓ (control layout)          |
| Performance-critical      | ✓ (try first) | ✓ (if library insufficient) |
API Differences Across Languages:
# Key operations across languages
# Python
value = d.get(key, default) # Safe get with default
# Java
V value = map.getOrDefault(key, defaultValue);
# C++
auto it = map.find(key);
V value = (it != map.end()) ? it->second : defaultValue;
# JavaScript
const value = map.get(key) ?? defaultValue;
# Go
value, exists := m[key]
if !exists {
value = defaultValue
}
# Common pattern: Safe get with default
# Python/Java/JS: Built-in method
# C++/Go: Manual check required
Production vs Academic
Real-World vs Interview Differences:
| Aspect             | Interview                   | Production                               |
| ------------------ | --------------------------- | ---------------------------------------- |
| Implementation     | From scratch                | Use library                              |
| Hash function      | Simple (e.g., % table_size) | Robust (MurmurHash, SipHash)             |
| Collision handling | Chaining or linear probing  | Sophisticated (Robin Hood, Swiss Tables) |
| Error handling     | Minimal                     | Comprehensive                            |
| Thread safety      | Single-threaded             | Concurrent, lock-free                    |
| Testing            | Few test cases              | Extensive testing, fuzzing               |
| Edge cases         | Basic                       | Comprehensive (overflow, attacks, etc.)  |
| Documentation      | None                        | Required                                 |
| Performance        | Asymptotic analysis         | Profiling, benchmarking                  |
| Security           | Ignored                     | Critical (DoS prevention)                |
Production Concerns Not in Interviews:
-
Thread Safety:
# Interview: Simple hash table (not thread-safe)
class HashTable:
    def put(self, key, value):
        ...  # simple insertion

# Production: Thread-safe with locks
import threading

class ThreadSafeHashTable:
    def __init__(self):
        self.lock = threading.RLock()
        self.table = {}

    def put(self, key, value):
        with self.lock:
            self.table[key] = value

# Or use built-in shared structures:
from multiprocessing import Manager
manager = Manager()
shared_dict = manager.dict()  # Safe to share across processes
-
Memory Management:
# Interview: Ignore memory details
# Production: Monitor memory usage
import tracemalloc

tracemalloc.start()
hash_table = {}
for i in range(1000000):
    hash_table[i] = i
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")
print(f"Peak memory usage: {peak / 10**6}MB")
tracemalloc.stop()
-
Security (Hash Collision Attacks):
# Interview: Use simple hash
def simple_hash(key):
    return sum(ord(c) for c in str(key))

# Production: Randomized hash to prevent DoS
import hashlib
import secrets

class SecureHashTable:
    def __init__(self):
        self.salt = secrets.token_bytes(16)  # Random per instance

    def _hash(self, key):
        return hashlib.sha256(
            str(key).encode() + self.salt
        ).digest()
-
Monitoring and Metrics:
# Production: Track performance metrics
class MonitoredHashTable:
    def __init__(self, capacity=64):
        self.table = {}
        self.capacity = capacity  # Nominal capacity for the load-factor metric
        self.hit_count = 0
        self.miss_count = 0
        self.collision_count = 0

    def get(self, key):
        if key in self.table:
            self.hit_count += 1
            return self.table[key]
        else:
            self.miss_count += 1
            return None

    def metrics(self):
        total = self.hit_count + self.miss_count
        hit_rate = self.hit_count / total if total > 0 else 0
        return {
            'hit_rate': hit_rate,
            'collision_rate': self.collision_count / total if total > 0 else 0,
            'load_factor': len(self.table) / self.capacity
        }
Constant Factors in Practice:
Despite identical Big-O bounds, implementations can vary 10x in speed:
import time
def benchmark_hash_implementations():
n = 1000000
keys = list(range(n))
# Simple division method
start = time.perf_counter()
hash1 = [hash(k) % n for k in keys]
time1 = time.perf_counter() - start
# Multiplication method
start = time.perf_counter()
A = 0.6180339887 # Golden ratio
hash2 = [int(n * (k * A % 1)) for k in keys]
time2 = time.perf_counter() - start
# Cryptographic hash (overkill for hash table)
import hashlib
start = time.perf_counter()
hash3 = [int(hashlib.md5(str(k).encode()).hexdigest(), 16) % n for k in keys]
time3 = time.perf_counter() - start
print(f"Division method: {time1:.3f}s")
print(f"Multiplication method: {time2:.3f}s")
print(f"Cryptographic hash: {time3:.3f}s")
# Typical output:
# Division method: 0.15s
# Multiplication method: 0.18s (slightly slower, better distribution)
# Cryptographic hash: 8.50s (much slower, unnecessary for hash table)
When Asymptotic Guarantees Don't Match Reality:
-
Small datasets (n < 100):
- O(1) hash table vs O(n) linear search
- Linear search often faster due to:
- No hash computation overhead
- Better cache locality
- No resizing
-
Cache effects dominate:
- Chaining: O(1 + α) average
- Linear probing: O(1 / (1-α)) average
- But linear probing 2-3x faster due to cache locality
-
Dynamic resizing:
- Amortized O(1) insertion
- But occasional O(n) spike
- Can cause tail latency issues in real-time systems
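A quick harness for checking the small-dataset claim on your own hardware; the names here are arbitrary, and the crossover point varies by machine and interpreter, so measure rather than assume:

```python
import timeit

# For tiny collections, a linear scan over a list can compete with a
# dict lookup: no hash call, contiguous memory, no resize machinery.
pairs = [("alpha", 1), ("beta", 2), ("gamma", 3)]
as_dict = dict(pairs)

def scan(key):
    """O(n) linear search over the list of tuples."""
    for k, v in pairs:
        if k == key:
            return v
    return None

t_list = timeit.timeit(lambda: scan("gamma"), number=100_000)
t_dict = timeit.timeit(lambda: as_dict.get("gamma"), number=100_000)
print(f"list scan: {t_list:.4f}s, dict get: {t_dict:.4f}s")
```

On CPython the dict often still wins even at n = 3 because its lookup path is heavily optimized; the linear scan tends to pull ahead mainly in languages where the hash call itself is expensive, which is exactly why profiling beats intuition here.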
Best Practices from Production:
-
Pre-size if possible:
# Bad: Multiple resizes
d = {}
for i in range(1000000):
    d[i] = i

# Good: Pre-allocate
d = dict.fromkeys(range(1000000))
for i in range(1000000):
    d[i] = i
-
Use appropriate data structure:
# Counting: Use Counter
from collections import Counter
counts = Counter(items)  # Optimized for frequency counting

# Ordered: Use OrderedDict or dict (Python 3.7+)
from collections import OrderedDict
ordered = OrderedDict()  # Explicit ordering guarantee

# Default values: Use defaultdict
from collections import defaultdict
groups = defaultdict(list)  # Auto-initialize lists
for item in items:
    groups[get_key(item)].append(item)
-
Profile before optimizing:
import cProfile
import pstats

def hash_intensive_function():
    d = {}
    for i in range(100000):
        d[i] = i
        _ = d.get(i)

profiler = cProfile.Profile()
profiler.enable()
hash_intensive_function()
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)  # Top 10 functions
8. Edge Cases and Limitations
8.1 Known Limitations
Structural Limitations
Fundamental Limitations of Hash Tables:
-
No Ordering:
# Hash tables don't maintain order (except Python 3.7+ dict insertion order)
hash_table = {3: "three", 1: "one", 2: "two"}

# Can't efficiently:
# - Iterate in sorted order
# - Find minimum/maximum key
# - Get range of keys [a, b]
# - Find predecessor/successor

# Workaround: Extract and sort
sorted_items = sorted(hash_table.items())  # O(n log n)

# Better alternative: Use a sorted structure if ordering is the
# primary requirement
from sortedcontainers import SortedDict
sorted_dict = SortedDict({3: "three", 1: "one", 2: "two"})
print(list(sorted_dict.keys()))  # Always sorted: [1, 2, 3]
-
No Range Queries:
# Can't efficiently find all keys in range [low, high]
def range_query(hash_table, low, high):
    # Must check every key: O(n)
    return {k: v for k, v in hash_table.items() if low <= k <= high}

# BST alternative: O(log n + k) where k = number of results
# Segment tree: O(log n + k)
-
Rehashing Cost:
# Occasional O(n) spike during resize
import time

hash_table = {}
times = []
for i in range(100000):
    start = time.perf_counter()
    hash_table[i] = i
    elapsed = time.perf_counter() - start
    if elapsed > 0.001:  # More than 1ms
        times.append((i, elapsed))
        print(f"Slow insertion at i={i}: {elapsed*1000:.2f}ms")

# Output shows spikes at resize thresholds (capacity × load factor,
# doubling each time), e.g.:
# Slow insertion at i=12: 1.2ms
# Slow insertion at i=24: 2.3ms
# Slow insertion at i=48: 4.1ms
# Slow insertion at i=96: 7.8ms
-
Hash Function Dependency:
# Poor hash function → O(n) worst case
class BadHashKey:
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 42  # All objects hash to the same value!

hash_table = {}
for i in range(1000):
    key = BadHashKey(i)
    hash_table[key] = i

# All keys collide → effectively a linked list → O(n) operations
-
Memory Overhead:
# Hash table keeps load factor < 1 (usually <= 0.75),
# so 25-33% of slots are empty by design
import sys

# Sorted array: No slot overhead
sorted_array = list(range(1000000))
print(f"Sorted array: {sys.getsizeof(sorted_array) / 10**6:.2f}MB")

# Hash table: Load factor overhead plus per-entry metadata
hash_table = {i: i for i in range(1000000)}
print(f"Hash table: {sys.getsizeof(hash_table) / 10**6:.2f}MB")

# Representative output (container overhead only; sys.getsizeof does
# not count the element objects themselves):
# Sorted array: 8.00MB
# Hash table: 36.00MB (4.5x more!)
Problems Hash Tables Cannot Solve Efficiently:
| Problem           | Hash Table      | Better Alternative     | Reason                 |
| ----------------- | --------------- | ---------------------- | ---------------------- |
| Find kth smallest | O(n log n) sort | Quickselect O(n) avg   | Need ordering          |
| Range sum query   | O(n) scan       | Segment tree O(log n)  | No range support       |
| Prefix matching   | O(n) scan       | Trie O(k)              | Need prefix structure  |
| Nearest neighbor  | O(n) scan       | K-D tree O(log n)      | Need spatial structure |
| Interval queries  | O(n) scan       | Interval tree O(log n) | Need range support     |
| Sorted iteration  | O(n log n) sort | BST O(n)               | No ordering            |
| Median finding    | O(n log n) sort | Two heaps O(log n)     | Need ordering          |
Scalability Limits:
# Maximum practical size depends on available memory
import sys
def estimate_max_size(available_memory_gb=8):
"""
Estimate maximum hash table size.
"""
# Assume 64-byte per entry (key + value + overhead)
bytes_per_entry = 64
load_factor = 0.75
available_bytes = available_memory_gb * 10**9
max_entries = int((available_bytes / bytes_per_entry) * load_factor)
print(f"Available memory: {available_memory_gb}GB")
print(f"Estimated max entries: {max_entries:,}")
print(f"At 1M ops/sec, would take {max_entries / 10**6:.1f} seconds to build")
return max_entries
estimate_max_size(8)
# Output:
# Available memory: 8GB
# Estimated max entries: 93,750,000
# At 1M ops/sec, would take 93.8 seconds to build
# For larger datasets: Use distributed hash tables or approximate structures
When NOT to Use
Anti-Patterns and Failure Modes:
-
Small datasets (< 20 elements):
# Bad: Hash table overhead not worth it
user_types = {"admin": 1, "user": 2, "guest": 3}

# Better: Simple list of tuples
user_types = [("admin", 1), ("user", 2), ("guest", 3)]

# Lookup: O(n) but often faster in practice for small n
def get_type(name):
    for n, t in user_types:
        if n == name:
            return t
    return None
-
Need sorted/ordered access:
# Bad: Hash table + sorting
events = {}
for event in event_stream:
    events[event.timestamp] = event
sorted_events = sorted(events.items())  # O(n log n) every time!

# Better: Use a sorted structure from the start
from sortedcontainers import SortedDict
events = SortedDict()
for event in event_stream:
    events[event.timestamp] = event
# Already sorted: O(n) to iterate in order, no repeated sorting
-
Mutable keys:
# Bad: Using a mutable object as key
key = [1, 2, 3]
hash_table = {}
hash_table[key] = "value"  # TypeError: unhashable type: 'list'

# Also bad: Hashable object whose hash changes after insertion
class MutableKey:
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        return self.value == other.value

key = MutableKey(5)
hash_table = {}
hash_table[key] = "value"
print(hash_table[key])  # Works
key.value = 10          # Mutate key
print(hash_table[key])  # May not find it! Hash changed.

# Correct: Use immutable keys
key = (1, 2, 3)  # Tuple
hash_table[key] = "value"  # Works correctly
-
Real-time systems with strict latency:
# Bad: Resizing causes unpredictable latency spikes
hash_table = {}
for i in range(1000000):
    hash_table[i] = i  # Occasional O(n) spike during resize

# Better for real-time: Pre-allocate or use Cuckoo hashing
hash_table = dict.fromkeys(range(1000000))  # Pre-allocate
# Or use Cuckoo hashing: O(1) worst-case lookup
-
Prefix/fuzzy matching:
# Bad: Hash table for prefix search
dictionary = {"apple", "application", "apply", "banana"}
hash_set = set(dictionary)

def find_prefix(prefix):
    # O(n) scan
    return [word for word in hash_set if word.startswith(prefix)]

# Better: Use a Trie
class Trie:
    # O(k) search where k = prefix length
    pass
Extreme Conditions
Handling Extreme Inputs:
-
Maximum integer keys:
# Large integers import sys hash_table = {} max_int = sys.maxsize # 2^63 - 1 on 64-bit hash_table[max_int] = "max" hash_table[-max_int] = "min" # Hash function should handle full range print(hash(max_int)) # Works in Python # Potential issue: Hash collision with large integers # Python's hash is identity for small integers, but hashes large ones -
Memory pressure:
# What happens when running out of memory? import sys hash_table = {} try: for i in range(10**9): # Try to insert 1 billion elements hash_table[i] = [0] * 1000 # Large values if i % 1000000 == 0: print(f"Inserted {i:,} elements") except MemoryError: print(f"Out of memory at {i:,} elements") print(f"Table size: {sys.getsizeof(hash_table) / 10**9:.2f}GB") # Graceful degradation: # 1. Use approximate structures (Bloom filter, Count-Min Sketch) # 2. Distributed hash tables # 3. Disk-based hash tables -
Adversarial input (Hash collision attacks):
# Attacker crafts keys that all hash to same value # → All collide → O(n) operations → DoS # Example: Predictable hash function def bad_hash(s): return sum(ord(c) for c in s) % 1000 # Attacker generates many strings with same hash: # "abc", "bac", "cab" all have same character sum # Defense 1: Randomized hashing import random import hashlib class SecureHashTable: def __init__(self): self.salt = random.randbytes(16) def _hash(self, key): return hashlib.sha256(str(key).encode() + self.salt).digest() # Defense 2: Switch to tree when chain too long (Java 8 HashMap) class TreeifyingHashTable: TREEIFY_THRESHOLD = 8 def _handle_collision(self, bucket): if len(bucket) > self.TREEIFY_THRESHOLD: # Convert linked list to red-black tree bucket = RedBlackTree(bucket) # Now O(log n) even with all collisions -
Numerical overflow in hash function:
# Multiplication can overflow in languages without arbitrary precision
def hash_with_overflow(key):
    prime = 31
    hash_val = 0
    for char in str(key):
        hash_val = hash_val * prime + ord(char)
        # In C/Java: this can overflow!
        # Python: no problem, arbitrary precision
    return hash_val

# C/Java-style solution: Use modulo to keep the value in bounds
def safe_hash(key, modulo=2**32):
    prime = 31
    hash_val = 0
    for char in str(key):
        hash_val = (hash_val * prime + ord(char)) % modulo
    return hash_val
8.2 Edge Cases
Input Edge Cases
Comprehensive Edge Case Testing:
class HashTableTester:
"""
Test hash table with all edge cases.
"""
def test_empty_table(self, hash_table):
"""Empty table operations."""
assert hash_table.size() == 0
assert hash_table.get("nonexistent") is None
assert hash_table.delete("nonexistent") is False
assert list(hash_table.keys()) == []
def test_single_element(self, hash_table):
"""Single element edge case."""
hash_table.put("only", "value")
assert hash_table.size() == 1
assert hash_table.get("only") == "value"
hash_table.delete("only")
assert hash_table.size() == 0
def test_duplicate_keys(self, hash_table):
"""Duplicate insertions should update."""
hash_table.put("key", "value1")
hash_table.put("key", "value2")
assert hash_table.size() == 1
assert hash_table.get("key") == "value2"
def test_null_none_keys(self, hash_table):
"""None/null as key."""
# Python allows None as key
hash_table.put(None, "null_value")
assert hash_table.get(None) == "null_value"
# Empty string as key
hash_table.put("", "empty")
assert hash_table.get("") == "empty"
def test_special_values(self, hash_table):
"""Special numeric values."""
hash_table.put(0, "zero")
hash_table.put(-0, "negative_zero") # Same as 0 in Python
assert hash_table.size() == 1 # -0 == 0
# Infinity
hash_table.put(float('inf'), "infinity")
assert hash_table.get(float('inf')) == "infinity"
# NaN (tricky: NaN != NaN)
nan = float('nan')
hash_table.put(nan, "not_a_number")
# May not be retrievable: hash(nan) works but nan != nan
def test_maximum_values(self, hash_table):
"""Maximum integer values."""
import sys
max_int = sys.maxsize
hash_table.put(max_int, "max")
hash_table.put(-max_int, "min")
assert hash_table.get(max_int) == "max"
def test_unicode_keys(self, hash_table):
"""Unicode and special characters."""
hash_table.put("hello", "english")
hash_table.put("こんにちは", "japanese")
hash_table.put("🔥", "emoji")
assert hash_table.size() == 3
def test_collision_scenario(self, hash_table):
"""Force collisions (if using simple hash)."""
# Keys that may collide with simple hash
keys = ["abc", "bac", "cab"] # Same character sum
for i, key in enumerate(keys):
hash_table.put(key, i)
# All should be retrievable
assert hash_table.get("abc") == 0
assert hash_table.get("bac") == 1
assert hash_table.get("cab") == 2
def test_resize_boundary(self, hash_table):
"""Test around resize threshold."""
# Assuming load factor = 0.75, capacity = 16
# Resize should occur at 12th element
for i in range(15):
hash_table.put(i, i)
print(f"Size: {hash_table.size()}, Capacity: {hash_table.capacity()}")
def test_delete_nonexistent(self, hash_table):
"""Delete key that doesn't exist."""
result = hash_table.delete("nonexistent")
assert result is False or result is None
def test_update_during_iteration(self, hash_table):
"""Concurrent modification detection."""
hash_table.put(1, "one")
hash_table.put(2, "two")
try:
for key in hash_table.keys():
hash_table.put(3, "three") # Modify during iteration
# Some implementations throw ConcurrentModificationException
except RuntimeError:
pass # Expected in some implementations
Edge Case Categories:
-
Size edge cases:
- Empty table (size = 0)
- Single element (size = 1)
- At resize threshold (size = capacity * load_factor)
- Maximum size (limited by memory)
-
Key edge cases:
- None/null keys
- Empty string keys
- Very long keys (10KB+ strings)
- Duplicate keys (should update)
- Non-existent keys (lookups)
-
Value edge cases:
- None/null values
- Large values (stress memory)
- Mutable values (should be allowed)
-
Numeric edge cases:
- Zero (0, -0, 0.0)
- Negative numbers
- Maximum/minimum integers
- Infinity (float('inf'))
- NaN (float('nan'))
-
String edge cases:
- Empty string ("")
- Unicode strings
- Very long strings
- Strings with special characters
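Among these, NaN deserves a short demonstration: hash(NaN) is well-defined, but NaN != NaN. CPython's dict compares identity before equality, so only the very same NaN object finds the entry:

```python
# NaN as a dict key: hash() works, but nan != nan.
# CPython checks `is` before `==` during lookup, so retrieval succeeds
# only with the identical NaN object.
nan = float('nan')
d = {nan: "stored"}

print(d[nan])             # "stored" -- identity match with the stored key
print(float('nan') in d)  # False -- a different NaN object never matches
```

This is a good reason to normalize or reject NaN keys at the API boundary rather than let them silently become unreachable entries.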
8.3 Error Handling and Robustness
Failure Modes
Common Failure Scenarios:
class RobustHashTable:
"""
Hash table with comprehensive error handling.
"""
def __init__(self, capacity=16, load_factor=0.75):
if capacity <= 0:
raise ValueError(f"Capacity must be positive, got {capacity}")
if not (0 < load_factor < 1):
raise ValueError(f"Load factor must be in (0, 1), got {load_factor}")
self.capacity = capacity
self.load_factor = load_factor
self.size = 0
self.table = [[] for _ in range(capacity)]
def put(self, key, value):
"""
Insert with error handling.
"""
# Validate key
if key is None:
raise KeyError("None is not a valid key")
# Check if key is hashable
try:
hash_val = hash(key)
except TypeError as e:
raise TypeError(f"Key {key} is not hashable: {e}")
# Check memory availability before resize
if self._should_resize():
try:
self._resize()
except MemoryError:
raise MemoryError("Cannot resize: out of memory")
# Insert
index = hash_val % self.capacity
bucket = self.table[index]
for i, (k, v) in enumerate(bucket):
if k == key:
bucket[i] = (key, value) # Update
return
bucket.append((key, value))
self.size += 1
def get(self, key, default=None):
"""
Safe get with default value.
"""
try:
hash_val = hash(key)
except TypeError:
return default # Unhashable key
index = hash_val % self.capacity
bucket = self.table[index]
for k, v in bucket:
if k == key:
return v
return default
def delete(self, key):
"""
Delete with error handling.
"""
try:
hash_val = hash(key)
except TypeError as e:
raise TypeError(f"Cannot delete unhashable key: {e}")
index = hash_val % self.capacity
bucket = self.table[index]
for i, (k, v) in enumerate(bucket):
if k == key:
del bucket[i]
self.size -= 1
return True
# Key not found
return False
def _should_resize(self):
"""Check if resize needed."""
return self.size >= self.capacity * self.load_factor
def _resize(self):
"""
Resize with error handling.
"""
old_table = self.table
self.capacity *= 2
try:
self.table = [[] for _ in range(self.capacity)]
except MemoryError:
# Restore old state
self.capacity //= 2
raise
self.size = 0
for bucket in old_table:
for key, value in bucket:
self.put(key, value)
Error Categories:
-
Invalid Input:
- Non-hashable keys
- None as key (if not allowed)
- Invalid capacity/load factor
-
Resource Exhaustion:
- Out of memory during resize
- Too many elements (platform limits)
-
Concurrent Modification:
- Modifying table during iteration
- Thread safety violations
-
Corruption:
- Hash function changes for same key
- Mutable key modified after insertion
Recovery Mechanisms:
class ResilientHashTable:
"""
Hash table with recovery mechanisms.
"""
def __init__(self):
self.table = {}
self.backup = None
self.operation_log = []
def put(self, key, value):
"""
Transactional put with rollback.
"""
# Create backup before modification
old_value = self.table.get(key)
self.operation_log.append(('put', key, old_value))
try:
self.table[key] = value
except Exception as e:
# Rollback on error
self.rollback()
raise
def batch_put(self, items):
"""
Batch insert with all-or-nothing semantics.
"""
self.create_checkpoint()
try:
for key, value in items:
self.table[key] = value
except Exception as e:
# Rollback all changes
self.restore_checkpoint()
raise Exception(f"Batch put failed, rolled back: {e}")
finally:
self.clear_checkpoint()
def create_checkpoint(self):
"""Create backup for rollback."""
import copy
self.backup = copy.deepcopy(self.table)
def restore_checkpoint(self):
"""Restore from backup."""
if self.backup is not None:
self.table = self.backup
self.backup = None
def clear_checkpoint(self):
"""Clear backup."""
self.backup = None
def rollback(self):
"""Rollback last operation."""
if self.operation_log:
op, key, old_value = self.operation_log.pop()
if op == 'put':
if old_value is None:
del self.table[key]
else:
self.table[key] = old_value
Validation
Data Integrity Checks:
class ValidatedHashTable:
"""
Hash table with validation.
"""
def __init__(self):
self.table = {}
self.checksum = 0
def put(self, key, value):
"""Insert with validation."""
# Validate inputs
self._validate_key(key)
self._validate_value(value)
# Update checksum (xor out the old pair first on update,
# otherwise verify_integrity fails after overwrites)
old_checksum = self.checksum
if key in self.table:
self.checksum ^= hash((key, self.table[key]))
self.checksum ^= hash((key, value))
try:
self.table[key] = value
except Exception:
# Restore checksum on failure
self.checksum = old_checksum
raise
def verify_integrity(self):
"""
Verify hash table integrity.
"""
# Recompute checksum from scratch
computed_checksum = 0
for key, value in self.table.items():
computed_checksum ^= hash((key, value))
if computed_checksum != self.checksum:
raise ValueError("Hash table corrupted: checksum mismatch")
return True
def _validate_key(self, key):
"""Validate key."""
if key is None:
raise ValueError("Key cannot be None")
try:
hash(key)
except TypeError:
raise TypeError(f"Key must be hashable, got {type(key)}")
def _validate_value(self, value):
"""Validate value (optional)."""
# Can add custom validation logic
pass
9. Practical Mastery
9.1 Implementation
Coding from Scratch
Complete Hash Table Implementation (All 12 Topics Integrated)
This section provides production-ready Python implementations for all 12 hash table topics.
Topic 58 & 62: Hash Table Fundamentals & HashMap Implementation
class HashMap:
"""
Complete HashMap implementation with separate chaining.
Demonstrates hash table fundamentals with dynamic resizing.
"""
class _Node:
"""Internal node for chaining."""
__slots__ = ['key', 'value', 'next']
def __init__(self, key, value, next_node=None):
self.key = key
self.value = value
self.next = next_node
def __init__(self, initial_capacity=16, load_factor_threshold=0.75):
"""
Initialize hash map.
Args:
initial_capacity: Initial size of bucket array
load_factor_threshold: Resize when n/m exceeds this
"""
if initial_capacity < 1:
raise ValueError("Capacity must be positive")
if not (0 < load_factor_threshold < 1):
raise ValueError("Load factor must be between 0 and 1")
self._capacity = self._next_power_of_2(initial_capacity)
self._buckets = [None] * self._capacity
self._size = 0
self._load_factor_threshold = load_factor_threshold
self._modification_count = 0 # For fail-fast iterators
def _next_power_of_2(self, n):
"""Find next power of 2 >= n."""
power = 1
while power < n:
power <<= 1
return power
def _hash(self, key):
"""
Compute hash for key.
Uses Python's built-in hash with perturbation for better distribution.
"""
h = hash(key)
# Additional mixing for better distribution
h ^= (h >> 16)
return h
def _get_index(self, key):
"""Get bucket index for key."""
return self._hash(key) & (self._capacity - 1) # Fast modulo for power of 2
def put(self, key, value):
"""
Insert or update key-value pair.
Time: O(1) average, O(n) worst case
Space: O(1)
"""
if key is None:
raise ValueError("Key cannot be None")
index = self._get_index(key)
node = self._buckets[index]
# Search for existing key
while node:
            if node.key == key:
# Update existing
node.value = value
return
node = node.next
# Insert new node at head of chain
new_node = self._Node(key, value, self._buckets[index])
self._buckets[index] = new_node
self._size += 1
self._modification_count += 1
# Check if resize needed
if self._size / self._capacity > self._load_factor_threshold:
self._resize(self._capacity * 2)
def get(self, key, default=None):
"""
Get value for key.
Time: O(1) average, O(n) worst case
Space: O(1)
"""
if key is None:
return default
index = self._get_index(key)
node = self._buckets[index]
while node:
if node.key == key:
return node.value
node = node.next
return default
def remove(self, key):
"""
Remove key-value pair.
Time: O(1) average, O(n) worst case
Space: O(1)
Returns:
Value that was removed, or None if key not found
"""
if key is None:
return None
index = self._get_index(key)
node = self._buckets[index]
prev = None
while node:
if node.key == key:
# Found it
if prev:
prev.next = node.next
else:
self._buckets[index] = node.next
self._size -= 1
self._modification_count += 1
return node.value
prev = node
node = node.next
return None
    def contains_key(self, key):
        """Check if key exists."""
        sentinel = object()
        return self.get(key, sentinel) is not sentinel
def _resize(self, new_capacity):
"""
Resize and rehash all elements.
Time: O(n) where n = number of elements
Space: O(n) for new array
"""
old_buckets = self._buckets
self._capacity = new_capacity
self._buckets = [None] * self._capacity
self._size = 0
# Rehash all elements
for bucket in old_buckets:
node = bucket
while node:
self.put(node.key, node.value)
node = node.next
def keys(self):
"""Return list of all keys."""
result = []
for bucket in self._buckets:
node = bucket
while node:
result.append(node.key)
node = node.next
return result
def values(self):
"""Return list of all values."""
result = []
for bucket in self._buckets:
node = bucket
while node:
result.append(node.value)
node = node.next
return result
def items(self):
"""Return list of (key, value) tuples."""
result = []
for bucket in self._buckets:
node = bucket
while node:
result.append((node.key, node.value))
node = node.next
return result
def __len__(self):
return self._size
def __contains__(self, key):
return self.contains_key(key)
    def __getitem__(self, key):
        sentinel = object()
        value = self.get(key, sentinel)
        if value is sentinel:
            raise KeyError(key)
        return value
def __setitem__(self, key, value):
self.put(key, value)
def __delitem__(self, key):
if self.remove(key) is None:
raise KeyError(key)
def __iter__(self):
"""Iterate over keys."""
return iter(self.keys())
def __repr__(self):
items = ', '.join(f'{k!r}: {v!r}' for k, v in self.items())
return f'{{{items}}}'
def get_load_factor(self):
"""Get current load factor."""
return self._size / self._capacity if self._capacity > 0 else 0
def get_stats(self):
"""Get statistics about hash table."""
chain_lengths = []
for bucket in self._buckets:
length = 0
node = bucket
while node:
length += 1
node = node.next
chain_lengths.append(length)
non_empty = sum(1 for l in chain_lengths if l > 0)
max_chain = max(chain_lengths) if chain_lengths else 0
avg_chain = sum(chain_lengths) / len(chain_lengths) if chain_lengths else 0
return {
'size': self._size,
'capacity': self._capacity,
'load_factor': self.get_load_factor(),
'non_empty_buckets': non_empty,
'max_chain_length': max_chain,
'avg_chain_length': avg_chain,
}
# Example usage and testing
if __name__ == "__main__":
# Create hash map
hm = HashMap()
# Insert elements
hm.put("apple", 5)
hm.put("banana", 3)
hm.put("cherry", 7)
hm.put("date", 2)
# Retrieve elements
print(f"apple: {hm.get('apple')}") # 5
print(f"banana: {hm['banana']}") # 3
# Update
hm["apple"] = 10
print(f"apple updated: {hm['apple']}") # 10
# Check existence
print(f"Contains 'cherry': {hm.contains_key('cherry')}") # True
print(f"Contains 'grape': {'grape' in hm}") # False
# Remove
hm.remove("banana")
print(f"After removing banana: {'banana' in hm}") # False
# Iterate
print("Keys:", hm.keys())
print("Values:", hm.values())
print("Items:", hm.items())
# Statistics
print("Stats:", hm.get_stats())
Topic 63: HashSet Implementation
class HashSet:
"""
HashSet implementation - stores unique elements only.
Built on top of HashMap, values are ignored.
"""
_PRESENT = object() # Dummy value
def __init__(self, initial_capacity=16):
"""Initialize hash set using internal hash map."""
self._map = HashMap(initial_capacity)
def add(self, element):
"""
Add element to set.
Time: O(1) average
Space: O(1)
"""
self._map.put(element, self._PRESENT)
def remove(self, element):
"""
Remove element from set.
Time: O(1) average
Returns: True if element was present, False otherwise
"""
return self._map.remove(element) is not None
def contains(self, element):
"""Check if element exists in set."""
return self._map.contains_key(element)
def size(self):
"""Return number of elements."""
return len(self._map)
def is_empty(self):
"""Check if set is empty."""
return len(self._map) == 0
def clear(self):
"""Remove all elements."""
self._map = HashMap()
def to_list(self):
"""Convert to list."""
return self._map.keys()
def union(self, other):
"""Return union of this set and other set."""
result = HashSet()
for elem in self.to_list():
result.add(elem)
for elem in other.to_list():
result.add(elem)
return result
def intersection(self, other):
"""Return intersection of this set and other set."""
result = HashSet()
for elem in self.to_list():
if other.contains(elem):
result.add(elem)
return result
def difference(self, other):
"""Return elements in this set but not in other."""
result = HashSet()
for elem in self.to_list():
if not other.contains(elem):
result.add(elem)
return result
def is_subset(self, other):
"""Check if this is subset of other."""
for elem in self.to_list():
if not other.contains(elem):
return False
return True
def __len__(self):
return self.size()
def __contains__(self, element):
return self.contains(element)
def __iter__(self):
return iter(self.to_list())
def __repr__(self):
return '{' + ', '.join(repr(e) for e in self.to_list()) + '}'
# Example usage
if __name__ == "__main__":
# Create set
s1 = HashSet()
s1.add(1)
s1.add(2)
s1.add(3)
s1.add(2) # Duplicate ignored
print(f"Set: {s1}") # {1, 2, 3}
print(f"Size: {len(s1)}") # 3
print(f"Contains 2: {2 in s1}") # True
# Set operations
s2 = HashSet()
s2.add(2)
s2.add(3)
s2.add(4)
print(f"Union: {s1.union(s2)}") # {1, 2, 3, 4}
print(f"Intersection: {s1.intersection(s2)}") # {2, 3}
print(f"Difference: {s1.difference(s2)}") # {1}
Topic 59: Hash Functions and Collision Handling
class HashFunctions:
"""Collection of hash function implementations."""
@staticmethod
def division_method(key, table_size):
"""
Division method: h(k) = k mod m
Works well when m is prime and not close to power of 2.
"""
return key % table_size
@staticmethod
def multiplication_method(key, table_size):
"""
Multiplication method: h(k) = floor(m * (k * A mod 1))
        A is a constant in (0, 1), typically the golden ratio conjugate (√5 - 1)/2 ≈ 0.618.
Works well for any table size.
"""
A = 0.6180339887 # Golden ratio
return int(table_size * ((key * A) % 1))
@staticmethod
def string_hash_polynomial(s, table_size, base=31):
"""
Polynomial rolling hash for strings.
h(s) = (s[0] * base^(n-1) + s[1] * base^(n-2) + ... + s[n-1]) mod table_size
Common bases: 31, 33, 37 (small primes)
"""
hash_value = 0
for char in s:
hash_value = (hash_value * base + ord(char))
return hash_value % table_size
@staticmethod
def string_hash_djb2(s):
"""
DJB2 hash function by Dan Bernstein.
Simple and effective for strings.
hash(i) = hash(i-1) * 33 + str[i]
"""
hash_value = 5381
for char in s:
hash_value = ((hash_value << 5) + hash_value) + ord(char)
return hash_value & 0xFFFFFFFF # Keep it 32-bit
    @staticmethod
    def integer_hash_knuth(key):
        """
        Knuth's multiplicative hash.
        Multiply by 2654435761 (a prime close to 2^32/phi) and keep 32 bits.
        """
        return (key * 2654435761) & 0xFFFFFFFF
@staticmethod
def universal_hash(key, a, b, prime, table_size):
"""
Universal hashing: h_ab(k) = ((a*k + b) mod p) mod m
Choose random a, b from [1, p-1] where p is prime > universe size.
Guarantees collision probability <= 1/m for any pair of keys.
"""
return ((a * key + b) % prime) % table_size
@staticmethod
def mix_bits(key):
"""
Bit mixing for better distribution.
Used in Java's HashMap.
"""
key ^= (key >> 20) ^ (key >> 12)
return key ^ (key >> 7) ^ (key >> 4)
class CollisionResolutionStrategies:
"""Implementations of different collision resolution strategies."""
@staticmethod
def linear_probing_sequence(index, i, table_size):
"""
Linear probing: h(k, i) = (h(k) + i) mod m
Probe sequence: index, index+1, index+2, ...
Simple but causes primary clustering.
"""
return (index + i) % table_size
@staticmethod
def quadratic_probing_sequence(index, i, table_size, c1=1, c2=1):
"""
Quadratic probing: h(k, i) = (h(k) + c1*i + c2*i^2) mod m
Probe sequence: index, index+1, index+4, index+9, ...
Reduces primary clustering but can cause secondary clustering.
"""
return (index + c1 * i + c2 * i * i) % table_size
@staticmethod
def double_hashing_sequence(key, i, table_size, hash1_val):
"""
Double hashing: h(k, i) = (h1(k) + i * h2(k)) mod m
Uses two hash functions.
h2(k) must be relatively prime to m (use h2(k) = 1 + (k mod (m-1))).
Best open addressing method - minimal clustering.
"""
hash2_val = 1 + (key % (table_size - 1))
return (hash1_val + i * hash2_val) % table_size
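The three probe sequences above can be compared side by side. A small sketch using the same formulas, with an arbitrarily chosen table size and key:

```python
# First five probe indices each strategy visits for one colliding key
table_size = 11                      # prime table size
key = 47
h1 = key % table_size                # initial slot: 3
h2 = 1 + (key % (table_size - 1))   # second hash for double hashing: 8

linear    = [(h1 + i) % table_size for i in range(5)]
quadratic = [(h1 + i + i * i) % table_size for i in range(5)]  # c1 = c2 = 1
double    = [(h1 + i * h2) % table_size for i in range(5)]

print("linear:   ", linear)     # [3, 4, 5, 6, 7] -- adjacent slots (clustering)
print("quadratic:", quadratic)  # [3, 5, 9, 4, 1]
print("double:   ", double)     # [3, 0, 8, 5, 2]
```

Linear probing walks consecutive slots, which is exactly the primary clustering the docstrings warn about; quadratic and double hashing spread the probes across the table.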
# Example: Hash table with different collision resolution
class OpenAddressingHashMap:
"""Hash map using open addressing with linear probing."""
class _Entry:
__slots__ = ['key', 'value', 'is_deleted']
def __init__(self, key, value):
self.key = key
self.value = value
self.is_deleted = False
def __init__(self, capacity=16):
self._capacity = capacity
self._table = [None] * capacity
self._size = 0
def _hash(self, key):
return hash(key) % self._capacity
    def put(self, key, value):
        """Insert with linear probing.
        Remembers the first tombstone seen so a re-inserted key can reuse it
        without ever ending up in the table twice."""
        if self._size >= self._capacity * 0.7:
            self._resize(self._capacity * 2)
        index = self._hash(key)
        first_deleted = None
        for i in range(self._capacity):
            probe_index = (index + i) % self._capacity
            entry = self._table[probe_index]
            if entry is None:
                # Key is absent; insert into the earliest tombstone if any
                target = probe_index if first_deleted is None else first_deleted
                self._table[target] = self._Entry(key, value)
                self._size += 1
                return
            elif entry.is_deleted:
                if first_deleted is None:
                    first_deleted = probe_index
            elif entry.key == key:
                # Update existing
                entry.value = value
                return
        if first_deleted is not None:
            self._table[first_deleted] = self._Entry(key, value)
            self._size += 1
            return
        raise Exception("Hash table full")
def get(self, key):
"""Search with linear probing."""
index = self._hash(key)
i = 0
while i < self._capacity:
probe_index = (index + i) % self._capacity
entry = self._table[probe_index]
if entry is None:
return None # Not found
elif not entry.is_deleted and entry.key == key:
return entry.value
i += 1
return None
def remove(self, key):
"""Delete with tombstone marking."""
index = self._hash(key)
i = 0
while i < self._capacity:
probe_index = (index + i) % self._capacity
entry = self._table[probe_index]
if entry is None:
return None
elif not entry.is_deleted and entry.key == key:
entry.is_deleted = True
self._size -= 1
return entry.value
i += 1
return None
def _resize(self, new_capacity):
"""Resize and rehash."""
old_table = self._table
self._capacity = new_capacity
self._table = [None] * new_capacity
self._size = 0
for entry in old_table:
if entry and not entry.is_deleted:
self.put(entry.key, entry.value)
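The tombstone marking in `remove` matters because setting a slot back to `None` would terminate later searches too early. A minimal standalone illustration, independent of the class above:

```python
# Slots hold None (never used), TOMBSTONE (deleted), or a (key, value) pair
TOMBSTONE = object()
table = [None] * 8

def probe_insert(key, value):
    i = key % 8
    while table[i] is not None and table[i] is not TOMBSTONE:
        i = (i + 1) % 8
    table[i] = (key, value)

def probe_get(key):
    i = key % 8
    while table[i] is not None:      # only a never-used slot ends the search
        if table[i] is not TOMBSTONE and table[i][0] == key:
            return table[i][1]
        i = (i + 1) % 8
    return None

probe_insert(1, 'a')     # lands in slot 1
probe_insert(9, 'b')     # 9 % 8 == 1 collides, probes to slot 2
table[1] = TOMBSTONE     # delete key 1 with a tombstone, not None
print(probe_get(9))      # 'b' -- the search walks past the tombstone;
                         # writing None into slot 1 would have lost key 9
```

This is the trade-off tombstones buy: deletions stay O(1), at the cost of probe chains that only shrink again on the next resize/rehash.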
Topic 60: Chaining vs Open Addressing - Detailed Comparison
class ChainingVsOpenAddressingDemo:
"""
Demonstration comparing chaining and open addressing.
"""
@staticmethod
def benchmark_insertion(n=10000):
"""Compare insertion performance."""
import time
# Chaining
chaining_map = HashMap()
start = time.time()
for i in range(n):
chaining_map.put(i, i * 2)
chaining_time = time.time() - start
# Open addressing
open_map = OpenAddressingHashMap(capacity=n * 2)
start = time.time()
for i in range(n):
open_map.put(i, i * 2)
open_time = time.time() - start
print(f"Insertion of {n} elements:")
print(f" Chaining: {chaining_time:.4f}s")
print(f" Open Addressing: {open_time:.4f}s")
return chaining_time, open_time
@staticmethod
def compare_memory_usage(n=1000):
"""
Compare memory overhead.
Chaining: Each node has key, value, next pointer
Open addressing: Only key, value in array slots
"""
import sys
# Chaining
chaining_map = HashMap()
for i in range(n):
chaining_map.put(i, i)
# Open addressing
open_map = OpenAddressingHashMap(capacity=n * 2)
for i in range(n):
open_map.put(i, i)
# Estimate memory (rough approximation)
chaining_size = sys.getsizeof(chaining_map._buckets)
open_size = sys.getsizeof(open_map._table)
print(f"Memory for {n} elements:")
print(f" Chaining: ~{chaining_size} bytes")
print(f" Open Addressing: ~{open_size} bytes")
return chaining_size, open_size
@staticmethod
def demonstrate_clustering():
"""Show clustering effect in linear probing."""
# Create small table with linear probing
table_size = 10
table = [None] * table_size
# Insert keys that cause clustering
keys_to_insert = [0, 10, 20, 1, 11] # All hash to 0 or 1
print("Inserting keys with linear probing:")
for key in keys_to_insert:
index = key % table_size
i = 0
while i < table_size:
probe_index = (index + i) % table_size
if table[probe_index] is None:
table[probe_index] = key
print(f" Key {key} (hash={index}) → slot {probe_index} (probes={i+1})")
break
i += 1
print(f"\nFinal table: {table}")
print("Notice clustering at indices 0-4!")
# Run demonstrations
if __name__ == "__main__":
demo = ChainingVsOpenAddressingDemo()
print("=== Performance Comparison ===")
demo.benchmark_insertion(10000)
print("\n=== Memory Comparison ===")
demo.compare_memory_usage(1000)
print("\n=== Clustering Demonstration ===")
demo.demonstrate_clustering()
Topic 61: Load Factor and Rehashing
class DynamicHashMap:
"""
Hash map with detailed load factor tracking and rehashing.
"""
def __init__(self, initial_capacity=8, max_load_factor=0.75, min_load_factor=0.25):
"""
Args:
initial_capacity: Starting size
max_load_factor: Resize up when exceeded
min_load_factor: Resize down when below (optional)
"""
self._capacity = initial_capacity
self._buckets = [[] for _ in range(self._capacity)]
self._size = 0
self._max_load_factor = max_load_factor
self._min_load_factor = min_load_factor
self._resize_count = 0
self._resize_history = []
def _hash(self, key):
return hash(key) % self._capacity
def get_load_factor(self):
"""Current load factor α = n/m."""
return self._size / self._capacity if self._capacity > 0 else 0
def put(self, key, value):
"""Insert with automatic resize when load factor exceeded."""
# Check if resize needed BEFORE insertion
if self.get_load_factor() >= self._max_load_factor:
self._resize_up()
index = self._hash(key)
bucket = self._buckets[index]
# Update if exists
for i, (k, v) in enumerate(bucket):
if k == key:
bucket[i] = (key, value)
return
# Insert new
bucket.append((key, value))
self._size += 1
def remove(self, key):
"""Remove with automatic resize down when load factor too low."""
index = self._hash(key)
bucket = self._buckets[index]
for i, (k, v) in enumerate(bucket):
if k == key:
bucket.pop(i)
self._size -= 1
# Resize down if too sparse
if self.get_load_factor() < self._min_load_factor and self._capacity > 8:
self._resize_down()
return v
return None
def _resize_up(self):
"""Double capacity and rehash."""
old_capacity = self._capacity
self._resize(self._capacity * 2)
self._resize_history.append(('up', old_capacity, self._capacity, self._size))
print(f"Resized UP: {old_capacity} → {self._capacity} (size={self._size}, α={self.get_load_factor():.3f})")
def _resize_down(self):
"""Halve capacity and rehash."""
old_capacity = self._capacity
self._resize(self._capacity // 2)
self._resize_history.append(('down', old_capacity, self._capacity, self._size))
print(f"Resized DOWN: {old_capacity} → {self._capacity} (size={self._size}, α={self.get_load_factor():.3f})")
def _resize(self, new_capacity):
"""Rehash all elements to new capacity."""
self._resize_count += 1
old_buckets = self._buckets
self._capacity = new_capacity
self._buckets = [[] for _ in range(self._capacity)]
self._size = 0
# Rehash all elements
for bucket in old_buckets:
for key, value in bucket:
self.put(key, value)
def get_stats(self):
"""Get detailed statistics."""
chain_lengths = [len(bucket) for bucket in self._buckets]
non_empty = sum(1 for length in chain_lengths if length > 0)
return {
'size': self._size,
'capacity': self._capacity,
'load_factor': self.get_load_factor(),
'resize_count': self._resize_count,
'non_empty_buckets': non_empty,
'max_chain_length': max(chain_lengths) if chain_lengths else 0,
'avg_chain_length': sum(chain_lengths) / len(chain_lengths) if chain_lengths else 0,
'resize_history': self._resize_history
}
# Demonstration of load factor and rehashing
if __name__ == "__main__":
print("=== Load Factor and Rehashing Demo ===\n")
hm = DynamicHashMap(initial_capacity=4, max_load_factor=0.75)
print("Inserting elements to trigger resizes:")
for i in range(20):
hm.put(f"key{i}", i)
if i % 5 == 4:
print(f" After {i+1} inserts: size={hm._size}, capacity={hm._capacity}, α={hm.get_load_factor():.3f}")
print("\nRemoving elements to trigger shrink:")
for i in range(18):
hm.remove(f"key{i}")
if i % 5 == 4:
print(f" After {i+1} removes: size={hm._size}, capacity={hm._capacity}, α={hm.get_load_factor():.3f}")
print("\nFinal stats:")
stats = hm.get_stats()
for key, value in stats.items():
print(f" {key}: {value}")
The remaining topics (64-69) apply these building blocks to problem solving.
Topic 64: Two Sum Problem and Variations
class TwoSumProblems:
"""Solutions to Two Sum and its variations using hash tables."""
@staticmethod
def two_sum(nums, target):
"""
Find two numbers that sum to target.
Time: O(n)
Space: O(n)
Args:
nums: List of integers
target: Target sum
Returns:
Indices of two numbers, or None if not found
"""
seen = {} # value → index
for i, num in enumerate(nums):
complement = target - num
if complement in seen:
return [seen[complement], i]
seen[num] = i
return None
@staticmethod
def two_sum_all_pairs(nums, target):
"""
Find ALL pairs that sum to target.
Time: O(n)
Space: O(n)
"""
seen = set()
pairs = set()
for num in nums:
complement = target - num
if complement in seen:
# Add as sorted tuple to avoid duplicates
pair = tuple(sorted([num, complement]))
pairs.add(pair)
seen.add(num)
return list(pairs)
@staticmethod
def two_sum_count(nums, target):
"""
Count number of pairs that sum to target.
Time: O(n)
Space: O(n)
"""
from collections import Counter
count_map = Counter(nums)
pairs_count = 0
for num in count_map:
complement = target - num
if complement in count_map:
if num == complement:
# Same number, use combination formula
n = count_map[num]
pairs_count += n * (n - 1) // 2
elif num < complement:
# Count once to avoid duplicates
pairs_count += count_map[num] * count_map[complement]
return pairs_count
@staticmethod
def three_sum(nums, target=0):
"""
Find all unique triplets that sum to target.
Time: O(n²)
Space: O(n)
Combination of sorting + two-pointer + hash set for dedup.
"""
nums.sort()
result = []
seen_triplets = set()
for i in range(len(nums) - 2):
# Skip duplicates for first element
if i > 0 and nums[i] == nums[i-1]:
continue
# Two sum on remaining array
seen = {}
for j in range(i + 1, len(nums)):
complement = target - nums[i] - nums[j]
if complement in seen:
triplet = (nums[i], complement, nums[j])
if triplet not in seen_triplets:
result.append(list(triplet))
seen_triplets.add(triplet)
seen[nums[j]] = j
return result
@staticmethod
    @staticmethod
    def four_sum(nums, target):
        """
        Find all unique quadruplets that sum to target.
        Time: O(n³)
        Space: O(n) per pass, plus the stored quadruplets
        """
        nums.sort()
        result = []
        seen_quads = set()
        n = len(nums)
        for i in range(n - 3):
            if i > 0 and nums[i] == nums[i-1]:
                continue
            for j in range(i + 1, n - 2):
                if j > i + 1 and nums[j] == nums[j-1]:
                    continue
                # Two sum with a hash set on the suffix; dedup via seen_quads
                # (reassigning the loop variable k cannot skip iterations)
                seen = set()
                for k in range(j + 1, n):
                    complement = target - nums[i] - nums[j] - nums[k]
                    if complement in seen:
                        quad = (nums[i], nums[j], complement, nums[k])
                        if quad not in seen_quads:
                            seen_quads.add(quad)
                            result.append(list(quad))
                    seen.add(nums[k])
        return result
@staticmethod
def two_sum_closest(nums, target):
"""
Find pair with sum closest to target.
Time: O(n log n) due to sorting
Space: O(1)
Note: This uses two pointers after sorting, not hash table,
but included for completeness of "two sum" family.
"""
nums.sort()
left, right = 0, len(nums) - 1
closest_sum = float('inf')
result = None
while left < right:
current_sum = nums[left] + nums[right]
if abs(current_sum - target) < abs(closest_sum - target):
closest_sum = current_sum
result = [nums[left], nums[right]]
if current_sum < target:
left += 1
else:
right -= 1
return result, closest_sum
# Test Two Sum variations
if __name__ == "__main__":
solver = TwoSumProblems()
print("=== Two Sum Problem ===")
nums = [2, 7, 11, 15]
target = 9
print(f"Input: {nums}, target: {target}")
print(f"Indices: {solver.two_sum(nums, target)}")
print("\n=== Two Sum All Pairs ===")
nums = [1, 5, 3, 7, 2, 8, 4]
target = 9
print(f"Input: {nums}, target: {target}")
print(f"All pairs: {solver.two_sum_all_pairs(nums, target)}")
print("\n=== Two Sum Count ===")
nums = [1, 1, 1, 1]
target = 2
print(f"Input: {nums}, target: {target}")
print(f"Pair count: {solver.two_sum_count(nums, target)}")
print("\n=== Three Sum ===")
nums = [-1, 0, 1, 2, -1, -4]
print(f"Input: {nums}")
print(f"Triplets: {solver.three_sum(nums, target=0)}")
Topic 65: Group Anagrams Using Hash Map
class AnagramProblems:
"""Solutions using hash maps for anagram problems."""
@staticmethod
def group_anagrams_sorted(strs):
"""
Group anagrams by sorting each string as key.
Time: O(n * k log k) where n = number of strings, k = max length
Space: O(n * k)
Args:
strs: List of strings
Returns:
List of grouped anagrams
"""
from collections import defaultdict
groups = defaultdict(list)
for s in strs:
# Use sorted string as canonical form (key)
key = ''.join(sorted(s))
groups[key].append(s)
return list(groups.values())
@staticmethod
def group_anagrams_count(strs):
"""
Group anagrams by character count.
Time: O(n * k) where n = number of strings, k = max length
Space: O(n * k)
Faster than sorting approach.
"""
from collections import defaultdict
groups = defaultdict(list)
for s in strs:
# Count frequency of each character
count = [0] * 26 # Assuming lowercase a-z
for char in s:
count[ord(char) - ord('a')] += 1
# Use tuple of counts as key (lists aren't hashable)
key = tuple(count)
groups[key].append(s)
return list(groups.values())
@staticmethod
def are_anagrams(s1, s2):
"""
Check if two strings are anagrams.
Time: O(n)
Space: O(1) - fixed size alphabet
"""
if len(s1) != len(s2):
return False
from collections import Counter
return Counter(s1) == Counter(s2)
@staticmethod
def find_anagrams_in_string(s, p):
"""
Find all start indices of anagrams of p in s.
Time: O(n) where n = len(s)
Space: O(1) - fixed size alphabet
Uses sliding window + hash map.
"""
from collections import Counter
if len(p) > len(s):
return []
result = []
p_count = Counter(p)
window_count = Counter()
# Initialize window
for i in range(len(p)):
window_count[s[i]] += 1
if window_count == p_count:
result.append(0)
# Slide window
for i in range(len(p), len(s)):
# Add new character
window_count[s[i]] += 1
# Remove old character
old_char = s[i - len(p)]
window_count[old_char] -= 1
if window_count[old_char] == 0:
del window_count[old_char]
# Check if anagram
if window_count == p_count:
result.append(i - len(p) + 1)
return result
@staticmethod
def group_shifted_strings(strings):
"""
Group strings that are shifts of each other.
E.g., "abc" and "bcd" are shifts (each char shifted by 1).
Time: O(n * k) where n = number of strings, k = max length
Space: O(n * k)
"""
from collections import defaultdict
def get_shift_key(s):
"""
Get canonical form by computing differences between consecutive chars.
            E.g., "abc" → (1, 1), "xyz" → (1, 1), "za" → (1,)
"""
if len(s) <= 1:
return ()
shifts = []
for i in range(1, len(s)):
shift = (ord(s[i]) - ord(s[i-1])) % 26
shifts.append(shift)
return tuple(shifts)
groups = defaultdict(list)
for s in strings:
key = get_shift_key(s)
groups[key].append(s)
return list(groups.values())
# Test anagram problems
if __name__ == "__main__":
solver = AnagramProblems()
print("=== Group Anagrams (Sorting) ===")
strs = ["eat", "tea", "tan", "ate", "nat", "bat"]
print(f"Input: {strs}")
print(f"Groups: {solver.group_anagrams_sorted(strs)}")
print("\n=== Group Anagrams (Counting) ===")
print(f"Groups: {solver.group_anagrams_count(strs)}")
print("\n=== Find Anagrams in String ===")
s = "cbaebabacd"
p = "abc"
print(f"String: {s}, Pattern: {p}")
print(f"Anagram indices: {solver.find_anagrams_in_string(s, p)}")
print("\n=== Group Shifted Strings ===")
strings = ["abc", "bcd", "acef", "xyz", "az", "ba", "a", "z"]
print(f"Input: {strings}")
print(f"Shift groups: {solver.group_shifted_strings(strings)}")
Topic 66: Longest Consecutive Sequence
class LongestConsecutiveSequence:
"""
Find longest consecutive sequence in unsorted array.
Time: O(n)
Space: O(n)
Key insight: Use hash set for O(1) lookup.
Only start counting from numbers that are sequence starts.
"""
@staticmethod
def longest_consecutive(nums):
"""
Find length of longest consecutive elements sequence.
Example: [100, 4, 200, 1, 3, 2] → 4 (sequence: 1, 2, 3, 4)
"""
if not nums:
return 0
num_set = set(nums) # O(n) space, O(1) lookup
longest = 0
for num in num_set:
# Only start counting if this is a sequence start
# (i.e., num-1 is not in set)
if num - 1 not in num_set:
current_num = num
current_length = 1
# Count consecutive numbers
while current_num + 1 in num_set:
current_num += 1
current_length += 1
longest = max(longest, current_length)
return longest
@staticmethod
def longest_consecutive_with_sequence(nums):
"""Return both length and the actual sequence."""
if not nums:
return 0, []
num_set = set(nums)
longest = 0
best_sequence = []
for num in num_set:
if num - 1 not in num_set: # Sequence start
sequence = [num]
current_num = num
while current_num + 1 in num_set:
current_num += 1
sequence.append(current_num)
if len(sequence) > longest:
longest = len(sequence)
best_sequence = sequence
return longest, best_sequence
# Test
if __name__ == "__main__":
solver = LongestConsecutiveSequence()
nums = [100, 4, 200, 1, 3, 2]
print(f"Input: {nums}")
print(f"Longest consecutive length: {solver.longest_consecutive(nums)}")
length, seq = solver.longest_consecutive_with_sequence(nums)
print(f"Sequence: {seq} (length {length})")
Topic 67: Subarray Sum Equals K
class SubarraySumProblems:
"""Subarray sum problems using hash map with prefix sums."""
@staticmethod
def subarray_sum_equals_k(nums, k):
"""
Count subarrays with sum equal to k.
Time: O(n)
Space: O(n)
Key insight: Use prefix sum + hash map.
If prefix_sum[j] - prefix_sum[i] = k, then subarray[i+1:j+1] sums to k.
Equivalently: if prefix_sum - k exists, found valid subarray.
"""
prefix_sum = 0
sum_count = {0: 1} # prefix_sum → count (0:1 handles subarrays from start)
count = 0
for num in nums:
prefix_sum += num
# If (prefix_sum - k) exists, those positions form valid subarrays ending here
if prefix_sum - k in sum_count:
count += sum_count[prefix_sum - k]
# Record current prefix sum
sum_count[prefix_sum] = sum_count.get(prefix_sum, 0) + 1
return count
@staticmethod
def max_subarray_sum_k(nums, k):
"""
Find maximum length subarray with sum equal to k.
Time: O(n)
Space: O(n)
"""
prefix_sum = 0
sum_index = {0: -1} # prefix_sum → earliest index
max_length = 0
for i, num in enumerate(nums):
prefix_sum += num
if prefix_sum - k in sum_index:
length = i - sum_index[prefix_sum - k]
max_length = max(max_length, length)
# Only store first occurrence of each prefix sum
if prefix_sum not in sum_index:
sum_index[prefix_sum] = i
return max_length
@staticmethod
def subarray_sum_divisible_by_k(nums, k):
"""
Count subarrays with sum divisible by k.
Time: O(n)
Space: O(k)
Key: Use modulo arithmetic. If (prefix % k) repeats, difference is divisible by k.
"""
prefix_sum = 0
mod_count = {0: 1} # remainder → count
count = 0
for num in nums:
prefix_sum += num
remainder = prefix_sum % k
# Handle negative remainders
if remainder < 0:
remainder += k
if remainder in mod_count:
count += mod_count[remainder]
mod_count[remainder] = mod_count.get(remainder, 0) + 1
return count
# Test
if __name__ == "__main__":
solver = SubarraySumProblems()
print("=== Subarray Sum Equals K ===")
nums = [1, 1, 1]
k = 2
print(f"Input: {nums}, k={k}")
print(f"Count: {solver.subarray_sum_equals_k(nums, k)}") # 2: [1,1] at two positions
nums = [1, 2, 3]
k = 3
print(f"\nInput: {nums}, k={k}")
print(f"Count: {solver.subarray_sum_equals_k(nums, k)}") # 2: [3] and [1,2]
Topic 69: Rolling Hash (Rabin-Karp Algorithm)
class RollingHash:
"""
Rolling hash for pattern matching (Rabin-Karp algorithm).
Key idea: Efficiently compute hash of sliding window.
Hash of window [i:i+m] can be updated to [i+1:i+m+1] in O(1).
"""
def __init__(self, base=256, mod=10**9 + 7):
"""
Args:
base: Base for polynomial hash (typically 256 for ASCII)
mod: Large prime for modulo to avoid overflow
"""
self.base = base
self.mod = mod
def compute_hash(self, s):
"""Compute hash of string s."""
h = 0
for char in s:
h = (h * self.base + ord(char)) % self.mod
return h
def rabin_karp_search(self, text, pattern):
"""
Find all occurrences of pattern in text.
Time: O(n + m) average case, O(nm) worst case
Space: O(1)
Returns: List of starting indices
"""
n, m = len(text), len(pattern)
if m > n:
return []
# Precompute base^(m-1) mod mod
h_multiplier = pow(self.base, m - 1, self.mod)
# Compute hash of pattern
pattern_hash = self.compute_hash(pattern)
# Compute hash of first window
window_hash = self.compute_hash(text[:m])
result = []
for i in range(n - m + 1):
# Check if hashes match
if window_hash == pattern_hash:
# Verify actual string (to handle hash collisions)
if text[i:i+m] == pattern:
result.append(i)
# Roll the hash to next window
if i < n - m:
# Remove leftmost character
window_hash = (window_hash - ord(text[i]) * h_multiplier) % self.mod
# Add rightmost character
window_hash = (window_hash * self.base + ord(text[i + m])) % self.mod
# Handle negative values
if window_hash < 0:
window_hash += self.mod
return result
def longest_common_substring(self, s1, s2):
"""
Find longest common substring using rolling hash + binary search.
Time: O((n+m) * log(min(n,m)))
Space: O(n+m)
"""
def get_all_hashes(s, length):
"""Get all hashes of substrings of given length."""
if length > len(s):
return set()
h_multiplier = pow(self.base, length - 1, self.mod)
window_hash = self.compute_hash(s[:length])
hashes = {window_hash}
for i in range(len(s) - length):
window_hash = (window_hash - ord(s[i]) * h_multiplier) % self.mod
window_hash = (window_hash * self.base + ord(s[i + length])) % self.mod
hashes.add(window_hash)
return hashes
        # Binary search on length (length 0 is trivially common, so start at 1)
        left, right = 1, min(len(s1), len(s2))
        result_length = 0
while left <= right:
mid = (left + right) // 2
hashes1 = get_all_hashes(s1, mid)
hashes2 = get_all_hashes(s2, mid)
if hashes1 & hashes2: # Common hash found
result_length = mid
left = mid + 1
else:
right = mid - 1
return result_length
# Test
if __name__ == "__main__":
rh = RollingHash()
print("=== Rabin-Karp Pattern Search ===")
text = "ABABCABABA"
pattern = "ABA"
indices = rh.rabin_karp_search(text, pattern)
print(f"Text: {text}")
print(f"Pattern: {pattern}")
print(f"Found at indices: {indices}") # [0, 5, 7]
print("\n=== Longest Common Substring ===")
s1 = "abcdxyz"
s2 = "xyzabcd"
length = rh.longest_common_substring(s1, s2)
print(f"String 1: {s1}")
print(f"String 2: {s2}")
    print(f"Longest common substring length: {length}")  # 4 ("abcd")
9.3 Problem-Solving Practice
Learning Path
Progressive Problem Sequence:
Easy (Master Fundamentals):
- Two Sum
- Contains Duplicate
- Valid Anagram
- Intersection of Two Arrays
- First Unique Character in String
Medium (Build Proficiency):
- Group Anagrams
- Longest Substring Without Repeating Characters
- Subarray Sum Equals K
- Top K Frequent Elements
- Longest Consecutive Sequence
Hard (Demonstrate Mastery):
- Substring with Concatenation of All Words
- LRU Cache
- Design Twitter
- All O(1) Data Structure
- Longest Duplicate Substring
9.4 Integration and Ecosystem
Library Usage
Python Built-in Dictionaries:
# Python's dict is a highly optimized hash table
d = {}                      # Empty dict literal
d = dict()                  # Constructor
d = {"key": "value"}        # Literal with entries
# Common operations
d["k"] = 1                  # Put
value = d.get("k", 0)       # Get with default
del d["k"]                  # Remove
"k" in d                    # Contains
# Collections module
from collections import defaultdict, Counter, OrderedDict
# defaultdict: Auto-initialize missing keys
dd = defaultdict(list)
dd["key"].append("value") # No KeyError
# Counter: Specialized for counting
from collections import Counter
counts = Counter([1, 1, 2, 2, 2, 3])
# Counter({2: 3, 1: 2, 3: 1})
# OrderedDict: Maintains insertion order (Python 3.7+ dicts are ordered by default)
od = OrderedDict()
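One place `OrderedDict` still earns its keep is LRU eviction: `move_to_end` and `popitem(last=False)` are exactly the operations a least-recently-used cache needs. A minimal sketch (the class and method names are illustrative, mirroring the common LRU Cache interface):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache sketch built on OrderedDict."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, key):
        if key not in self.store:
            return -1
        self.store.move_to_end(key)  # mark as most recently used
        return self.store[key]

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put(1, "a")
cache.put(2, "b")
print(cache.get(1))  # "a"; key 1 becomes most recent
cache.put(3, "c")    # capacity exceeded: evicts key 2
print(cache.get(2))  # -1
```

The same structure appears again in the "LRU Cache" problem from the hard list above.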
10. Advanced Concepts
10.1 Advanced Variants and Extensions
Perfect Hashing
class PerfectHashTable:
"""
Perfect hashing for static datasets - no collisions!
Two-level hashing:
- Level 1: Hash to n buckets using h1
- Level 2: Each bucket has its own hash table sized O(m_i^2)
where m_i = number of keys hashing to bucket i
    Guarantees: O(1) worst-case lookup, O(n) expected total space.
"""
    def __init__(self, keys):
        """Build perfect hash table for given static keys."""
        keys = set(keys)  # deduplicate: each key is stored once
        self.n = max(1, len(keys))
        # Level 1: hash keys into n buckets
        buckets = [[] for _ in range(self.n)]
        for key in keys:
            buckets[hash(key) % self.n].append(key)
        # Level 2: per-bucket table of size m^2, using a salted hash;
        # retry salts until the bucket places with zero collisions
        self.secondary = []
        self.salts = []
        for bucket in buckets:
            if not bucket:
                self.secondary.append(None)
                self.salts.append(0)
                continue
            size = len(bucket) ** 2  # O(m^2) size makes a collision-free salt likely
            salt = 0
            while True:
                table = [None] * size
                if all(self._place(table, salt, key) for key in bucket):
                    break
                salt += 1  # collision: discard table and retry with a new salt
            self.secondary.append(table)
            self.salts.append(salt)

    @staticmethod
    def _place(table, salt, key):
        """Try to place key in table; return False on collision."""
        idx = hash((salt, key)) % len(table)
        if table[idx] is not None:
            return False
        table[idx] = key
        return True

    def lookup(self, key):
        """O(1) worst-case lookup."""
        bucket_idx = hash(key) % self.n
        table = self.secondary[bucket_idx]
        if table is None:
            return False
        idx = hash((self.salts[bucket_idx], key)) % len(table)
        return table[idx] == key
Cuckoo Hashing
class CuckooHashTable:
"""
Cuckoo hashing: Worst-case O(1) lookup!
Two hash tables, two hash functions.
If collision, evict existing element and relocate it.
"""
def __init__(self, capacity=16):
self.capacity = capacity
self.table1 = [None] * capacity
self.table2 = [None] * capacity
self.size = 0
self.max_iterations = 500 # Prevent infinite loops
def _hash1(self, key):
return hash(key) % self.capacity
    def _hash2(self, key):
        # Must be independent of _hash1: (hash(key) * 31) % capacity is fully
        # determined by hash(key) % capacity, so every table-1 collision would
        # also collide in table 2 and force endless evictions.
        return hash((key, 0x9E3779B9)) % self.capacity
def insert(self, key):
"""Insert with cuckoo eviction."""
if self.contains(key):
return
current_key = key
for _ in range(self.max_iterations):
# Try table 1
idx1 = self._hash1(current_key)
if self.table1[idx1] is None:
self.table1[idx1] = current_key
self.size += 1
return
# Evict from table 1, try table 2
self.table1[idx1], current_key = current_key, self.table1[idx1]
idx2 = self._hash2(current_key)
if self.table2[idx2] is None:
self.table2[idx2] = current_key
self.size += 1
return
# Evict from table 2, loop continues
self.table2[idx2], current_key = current_key, self.table2[idx2]
# Too many evictions → rebuild with larger tables
self._resize(self.capacity * 2)
self.insert(key)
def contains(self, key):
"""O(1) worst-case lookup!"""
idx1 = self._hash1(key)
idx2 = self._hash2(key)
return self.table1[idx1] == key or self.table2[idx2] == key
def _resize(self, new_capacity):
"""Rebuild tables."""
old_keys = []
for key in self.table1:
if key is not None:
old_keys.append(key)
for key in self.table2:
if key is not None:
old_keys.append(key)
self.capacity = new_capacity
self.table1 = [None] * new_capacity
self.table2 = [None] * new_capacity
self.size = 0
for key in old_keys:
self.insert(key)
10.2 Theoretical Depth
Lower Bounds:
For dictionary operations (insert, delete, search) with n elements:
- Comparison-based: Ω(log n) per operation (like BST)
- Hashing: O(1) expected time (but not comparison-based)
Hash tables escape the comparison-based Ω(log n) bound because hashing inspects the key's bits directly rather than comparing keys, achieving optimal O(1) expected time.
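A quick micro-benchmark makes the practical difference concrete: O(1) expected membership in a hash set versus an O(n) linear scan of a list (exact timings vary by machine; the sizes here are arbitrary):

```python
import timeit

n = 100_000
haystack_list = list(range(n))
haystack_set = set(haystack_list)

# Worst case for the list: the probe element sits at the very end
t_list = timeit.timeit(lambda: (n - 1) in haystack_list, number=50)
t_set = timeit.timeit(lambda: (n - 1) in haystack_set, number=50)
print(f"list scan: {t_list:.4f}s, set lookup: {t_set:.4f}s")
```

On typical hardware the set lookup is several orders of magnitude faster.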
Complexity Classes:
- Dictionary problem is in P (polynomial time solvable)
- Perfect hashing construction is polynomial
- Universal hashing provides probabilistic guarantees
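The last point can be made concrete with the classic Carter–Wegman family h(x) = ((a·x + b) mod p) mod m: for fixed x ≠ y and a random choice of (a, b), the collision probability is at most about 1/m. A minimal sketch (the prime and ranges here are illustrative):

```python
import random

def make_universal_hash(m, p=2_147_483_647):
    """Draw a random member of the family ((a*x + b) mod p) mod m.
    p must be a prime larger than the key universe."""
    a = random.randrange(1, p)  # a != 0
    b = random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % m

h = make_universal_hash(16)
print(sorted({h(x) for x in range(100)}))  # bucket indexes, all in [0, 16)
```

Randomizing the hash function at table-creation time is what defeats adversarially chosen key sets.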
10.3 Cutting-Edge and Future Directions
Recent Innovations:
-
Swiss Tables (Google, 2017):
- Open addressing with SIMD metadata
- Reported roughly 2x faster than std::unordered_map in Google's benchmarks
-
Learned Hash Functions (2020s):
- Use machine learning to learn optimal hash function for dataset
- Outperforms traditional hash functions on specific distributions
-
Quantum Hashing:
- Grover's algorithm: O(√n) quantum search
- Post-quantum cryptographic hash functions
11. Meta-Learning Questions
Practical Judgment:
Use hash tables when:
- Need O(1) average lookup
- Don't need sorted order
- Memory not extremely limited
Use alternatives when:
- Need sorted iteration → BST
- Need range queries → Segment tree
- Memory critical → Bloom filter (approximate)
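To illustrate the last alternative: a Bloom filter answers membership approximately in a fixed number of bits, with possible false positives but no false negatives. A minimal hedged sketch (the bit count, the SHA-256-derived indexing, and the method names are illustrative choices, not a standard API):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: approximate set membership in few bits."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _indexes(self, item):
        # Derive k bit positions from one SHA-256 digest (an illustrative scheme)
        digest = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, item):
        for idx in self._indexes(item):
            self.bits |= 1 << idx

    def might_contain(self, item):
        # False negatives are impossible; false positives occur at a small rate
        return all(self.bits & (1 << idx) for idx in self._indexes(item))

bf = BloomFilter()
for word in ["cat", "dog", "bird"]:
    bf.add(word)
print(bf.might_contain("cat"))    # True
print(bf.might_contain("zebra"))  # almost certainly False
```

The trade: a few bits per element and a tunable false-positive rate, instead of storing the keys themselves.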
Prerequisites:
- Arrays, linked lists
- Big-O notation
- Modular arithmetic
Builds Upon:
- Dynamic arrays → dynamic resizing
- Linked lists → chaining
- Probability → collision analysis
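The probability connection is the birthday bound: hashing n keys uniformly into m slots, P(no collision) = ∏_{i=0}^{n-1} (1 − i/m) ≈ e^{−n(n−1)/(2m)}. A quick check with the classic birthday numbers:

```python
def p_no_collision(n, m):
    """Exact probability that n uniformly hashed keys occupy n distinct slots of m."""
    p = 1.0
    for i in range(n):
        p *= (m - i) / m  # i-th key must avoid the i slots already taken
    return p

print(f"P(23 keys, 365 slots all distinct) = {p_no_collision(23, 365):.3f}")  # ~0.493
```

This is why collisions are expected long before a table is full, and why every hash table needs a collision-resolution strategy.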
Enables:
- Graph adjacency lists
- LRU/LFU caches
- Deduplication algorithms
- Indexing in databases
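The first item above, for instance, is just a hash table mapping each vertex to a hash set of its neighbors (the graph here is illustrative):

```python
from collections import defaultdict

# Adjacency list: vertex -> set of neighbors, both hash-based
graph = defaultdict(set)
for u, v in [("A", "B"), ("A", "C"), ("B", "C")]:
    graph[u].add(v)
    graph[v].add(u)  # undirected edge

print(sorted(graph["A"]))  # ['B', 'C']
```

Both "neighbors of v?" and "edge (u, v) exists?" become O(1) average-time hash lookups.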
Section 9: Practical Mastery
9.1 Complete Python Implementations for All 12 Topics
This section provides production-ready, thoroughly documented implementations for each of the 12 hash table topics. Each implementation includes error handling, edge case management, and comprehensive testing.
Topic 58: Two Sum - Complete Implementation
"""
Two Sum - Find indices of two numbers that add up to target
Time: O(n), Space: O(n)
"""
from typing import List, Optional, Tuple
class TwoSumSolver:
"""
Complete implementation of Two Sum problem with multiple variants.
Key Insight: Hash table allows O(1) lookup of complement (target - current)
"""
def two_sum_basic(self, nums: List[int], target: int) -> Optional[Tuple[int, int]]:
"""
Basic two sum: return indices of two numbers that sum to target.
Args:
nums: List of integers
target: Target sum
Returns:
Tuple of (index1, index2) or None if no solution exists
Examples:
>>> solver = TwoSumSolver()
>>> solver.two_sum_basic([2, 7, 11, 15], 9)
(0, 1)
>>> solver.two_sum_basic([3, 2, 4], 6)
(1, 2)
"""
if not nums or len(nums) < 2:
return None
seen = {} # value -> index mapping
for i, num in enumerate(nums):
complement = target - num
if complement in seen:
return (seen[complement], i)
seen[num] = i
return None
def two_sum_all_pairs(self, nums: List[int], target: int) -> List[Tuple[int, int]]:
"""
Find ALL pairs that sum to target (return indices).
Time: O(n), Space: O(n)
Examples:
>>> solver = TwoSumSolver()
>>> solver.two_sum_all_pairs([1, 1, 1, 1], 2)
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
"""
if not nums or len(nums) < 2:
return []
# Map value to list of indices
value_indices = {}
for i, num in enumerate(nums):
if num not in value_indices:
value_indices[num] = []
value_indices[num].append(i)
result = []
seen_pairs = set()
for i, num in enumerate(nums):
complement = target - num
if complement in value_indices:
for j in value_indices[complement]:
if i < j: # Ensure no duplicates and i < j
pair = (i, j)
if pair not in seen_pairs:
result.append(pair)
seen_pairs.add(pair)
return result
def two_sum_sorted(self, nums: List[int], target: int) -> Optional[Tuple[int, int]]:
"""
Two sum on sorted array using two pointers.
More space efficient: O(1) space vs O(n) for hash table.
Time: O(n), Space: O(1)
"""
if not nums or len(nums) < 2:
return None
left, right = 0, len(nums) - 1
while left < right:
current_sum = nums[left] + nums[right]
if current_sum == target:
return (left, right)
elif current_sum < target:
left += 1
else:
right -= 1
return None
def two_sum_with_values(self, nums: List[int], target: int) -> Optional[Tuple[int, int]]:
"""
Return the actual values instead of indices.
Examples:
>>> solver = TwoSumSolver()
>>> solver.two_sum_with_values([2, 7, 11, 15], 9)
(2, 7)
"""
result = self.two_sum_basic(nums, target)
if result:
i, j = result
return (nums[i], nums[j])
return None
def two_sum_count(self, nums: List[int], target: int) -> int:
"""
Count number of pairs that sum to target.
Time: O(n), Space: O(n)
Examples:
>>> solver = TwoSumSolver()
>>> solver.two_sum_count([1, 1, 1, 1], 2)
6 # C(4,2) = 6 pairs
"""
if not nums or len(nums) < 2:
return 0
from collections import Counter
count_map = Counter(nums)
pairs = 0
for num in count_map:
complement = target - num
if complement in count_map:
if num == complement:
# Same number: choose 2 from count
n = count_map[num]
pairs += n * (n - 1) // 2
elif num < complement:
# Different numbers: multiply counts
pairs += count_map[num] * count_map[complement]
return pairs
def two_sum_closest(self, nums: List[int], target: int) -> Tuple[int, int]:
"""
Find pair with sum closest to target.
Time: O(n log n) due to sorting, Space: O(1)
Examples:
>>> solver = TwoSumSolver()
>>> solver.two_sum_closest([1, 3, 5, 7, 9], 10)
            (0, 4)  # indices of values 1 and 9; sum = 10, exact match
>>> solver.two_sum_closest([1, 2, 3, 4], 10)
            (2, 3)  # indices of values 3 and 4; sum = 7, closest to 10
"""
if not nums or len(nums) < 2:
raise ValueError("Need at least 2 numbers")
nums_sorted = sorted(enumerate(nums), key=lambda x: x[1])
left, right = 0, len(nums_sorted) - 1
closest_diff = float('inf')
result = (0, 1)
while left < right:
current_sum = nums_sorted[left][1] + nums_sorted[right][1]
diff = abs(current_sum - target)
if diff < closest_diff:
closest_diff = diff
result = (nums_sorted[left][0], nums_sorted[right][0])
if current_sum < target:
left += 1
elif current_sum > target:
right -= 1
else:
break # Exact match found
return result
class TwoSumDataStructure:
"""
Design a data structure that supports:
- add(number): Add number to internal data structure
- find(value): Find if there exists any pair that sums to value
LeetCode 170: Two Sum III - Data structure design
"""
def __init__(self):
"""Initialize with empty counter."""
from collections import Counter
self.nums = Counter()
def add(self, number: int) -> None:
"""
Add a number to the data structure.
Time: O(1)
"""
self.nums[number] += 1
def find(self, value: int) -> bool:
"""
Find if there exists any pair of numbers which sum is equal to value.
Time: O(n) where n is number of unique values
Examples:
>>> ds = TwoSumDataStructure()
>>> ds.add(1)
>>> ds.add(3)
>>> ds.add(5)
>>> ds.find(4)
True # 1 + 3 = 4
>>> ds.find(7)
False
"""
for num in self.nums:
complement = value - num
if complement in self.nums:
# If same number, need at least 2 occurrences
if num == complement:
if self.nums[num] >= 2:
return True
else:
return True
return False
# Comprehensive Test Suite for Two Sum
def test_two_sum():
"""Complete test suite covering all edge cases."""
solver = TwoSumSolver()
# Test 1: Basic case
assert solver.two_sum_basic([2, 7, 11, 15], 9) == (0, 1)
# Test 2: Multiple valid pairs (returns first found)
result = solver.two_sum_basic([3, 3], 6)
assert result == (0, 1)
# Test 3: No solution
assert solver.two_sum_basic([1, 2, 3], 10) is None
# Test 4: Empty array
assert solver.two_sum_basic([], 5) is None
# Test 5: Single element
assert solver.two_sum_basic([5], 5) is None
# Test 6: Negative numbers
assert solver.two_sum_basic([-1, -2, -3, -4, -5], -8) == (2, 4)
# Test 7: All pairs
pairs = solver.two_sum_all_pairs([1, 1, 1, 1], 2)
assert len(pairs) == 6 # C(4,2) = 6
# Test 8: Count pairs
assert solver.two_sum_count([1, 1, 1, 1], 2) == 6
# Test 9: Closest sum
assert solver.two_sum_closest([1, 2, 3, 4], 10) in [(2, 3), (3, 2)]
# Test 10: Data structure
ds = TwoSumDataStructure()
ds.add(1)
ds.add(3)
ds.add(5)
assert ds.find(4) == True
assert ds.find(7) == False
ds.add(2)
assert ds.find(7) == True # 2 + 5 = 7
print("All Two Sum tests passed!")
if __name__ == "__main__":
test_two_sum()
Topic 59: Group Anagrams - Complete Implementation
"""
Group Anagrams - Group strings that are anagrams of each other
Time: O(n * k log k) where n = number of strings, k = max string length
Space: O(n * k)
"""
from typing import List, Dict
from collections import defaultdict
class AnagramGrouper:
"""
Complete implementation of anagram grouping with multiple approaches.
Key Insight: Anagrams have same character frequency or same sorted form
"""
def group_anagrams_sorted(self, strs: List[str]) -> List[List[str]]:
"""
Group anagrams using sorted string as key.
Time: O(n * k log k) where k = average string length
Space: O(n * k) for storing all strings
Args:
strs: List of strings
Returns:
List of grouped anagrams
Examples:
>>> grouper = AnagramGrouper()
>>> result = grouper.group_anagrams_sorted(["eat","tea","tan","ate","nat","bat"])
>>> sorted([sorted(group) for group in result])
[['ate', 'eat', 'tea'], ['bat'], ['nat', 'tan']]
"""
if not strs:
return []
anagram_groups = defaultdict(list)
for s in strs:
# Use sorted string as key
key = ''.join(sorted(s))
anagram_groups[key].append(s)
return list(anagram_groups.values())
def group_anagrams_frequency(self, strs: List[str]) -> List[List[str]]:
"""
Group anagrams using character frequency as key.
More efficient for long strings (O(n*k) vs O(n*k log k)).
Time: O(n * k) where k = average string length
Space: O(n * k)
Examples:
>>> grouper = AnagramGrouper()
>>> result = grouper.group_anagrams_frequency(["eat","tea","tan","ate"])
>>> len(result)
2
"""
if not strs:
return []
anagram_groups = defaultdict(list)
for s in strs:
# Create frequency tuple as key (only for lowercase a-z)
count = [0] * 26
for c in s:
count[ord(c) - ord('a')] += 1
# Tuple is hashable, list is not
key = tuple(count)
anagram_groups[key].append(s)
return list(anagram_groups.values())
def group_anagrams_prime(self, strs: List[str]) -> List[List[str]]:
"""
Group anagrams using prime number product as key.
Mathematical approach: each letter maps to prime number.
Product is unique for each anagram group (fundamental theorem of arithmetic).
Time: O(n * k), Space: O(n * k)
        Note: the product grows huge for long strings — Python's big ints never
        overflow, but the arithmetic slows down; fixed-width integers in other
        languages would overflow. Use for educational purposes.
"""
if not strs:
return []
# First 26 prime numbers for a-z
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
anagram_groups = defaultdict(list)
for s in strs:
# Calculate product of primes
key = 1
for c in s:
key *= primes[ord(c) - ord('a')]
anagram_groups[key].append(s)
return list(anagram_groups.values())
def find_anagrams_of_word(self, strs: List[str], word: str) -> List[str]:
"""
Find all anagrams of a specific word in the list.
Time: O(n * k), Space: O(k)
Examples:
>>> grouper = AnagramGrouper()
>>> grouper.find_anagrams_of_word(["eat","tea","tan","ate","bat"], "tea")
['eat', 'tea', 'ate']
"""
if not strs or not word:
return []
# Create frequency signature for target word
target_freq = [0] * 26
for c in word:
target_freq[ord(c) - ord('a')] += 1
target_key = tuple(target_freq)
anagrams = []
for s in strs:
# Create frequency signature for current string
freq = [0] * 26
for c in s:
freq[ord(c) - ord('a')] += 1
if tuple(freq) == target_key:
anagrams.append(s)
return anagrams
def count_anagram_groups(self, strs: List[str]) -> int:
"""
Count number of distinct anagram groups.
Time: O(n * k log k), Space: O(n)
Examples:
>>> grouper = AnagramGrouper()
>>> grouper.count_anagram_groups(["eat","tea","tan","ate","nat","bat"])
3 # [eat,tea,ate], [tan,nat], [bat]
"""
if not strs:
return 0
seen_signatures = set()
for s in strs:
key = ''.join(sorted(s))
seen_signatures.add(key)
return len(seen_signatures)
def largest_anagram_group(self, strs: List[str]) -> List[str]:
"""
Find the largest group of anagrams.
Time: O(n * k log k), Space: O(n * k)
Examples:
>>> grouper = AnagramGrouper()
>>> result = grouper.largest_anagram_group(["eat","tea","tan","ate","nat","bat"])
>>> sorted(result)
['ate', 'eat', 'tea']
"""
if not strs:
return []
groups = self.group_anagrams_sorted(strs)
return max(groups, key=len) if groups else []
class AnagramDictionary:
"""
Data structure for efficient anagram lookups.
Useful for word games like Scrabble.
"""
def __init__(self, word_list: List[str]):
"""
Build anagram dictionary from word list.
Time: O(n * k log k), Space: O(n * k)
"""
self.anagrams = defaultdict(list)
for word in word_list:
key = ''.join(sorted(word.lower()))
self.anagrams[key].append(word)
def find_anagrams(self, word: str) -> List[str]:
"""
Find all anagrams of given word.
Time: O(k log k), Space: O(k)
"""
key = ''.join(sorted(word.lower()))
return self.anagrams.get(key, [])
def is_anagram(self, word1: str, word2: str) -> bool:
"""
Check if two words are anagrams.
Time: O(k log k), Space: O(k)
"""
return sorted(word1.lower()) == sorted(word2.lower())
# Comprehensive Test Suite
def test_group_anagrams():
"""Complete test suite for anagram grouping."""
grouper = AnagramGrouper()
# Test 1: Basic grouping
result = grouper.group_anagrams_sorted(["eat","tea","tan","ate","nat","bat"])
assert len(result) == 3
# Test 2: Empty input
assert grouper.group_anagrams_sorted([]) == []
# Test 3: Single string
assert grouper.group_anagrams_sorted(["a"]) == [["a"]]
# Test 4: No anagrams
result = grouper.group_anagrams_sorted(["abc", "def", "ghi"])
assert len(result) == 3
# Test 5: All anagrams
result = grouper.group_anagrams_sorted(["abc", "bca", "cab", "acb"])
assert len(result) == 1
assert len(result[0]) == 4
# Test 6: Frequency method
result = grouper.group_anagrams_frequency(["eat","tea","tan","ate"])
assert len(result) == 2
# Test 7: Find specific anagrams
anagrams = grouper.find_anagrams_of_word(["eat","tea","tan","ate","bat"], "tea")
assert sorted(anagrams) == ['ate', 'eat', 'tea']
# Test 8: Count groups
assert grouper.count_anagram_groups(["eat","tea","tan","ate","nat","bat"]) == 3
# Test 9: Largest group
largest = grouper.largest_anagram_group(["eat","tea","tan","ate","nat","bat"])
assert len(largest) == 3
# Test 10: Anagram dictionary
word_list = ["listen", "silent", "enlist", "hello", "world"]
dictionary = AnagramDictionary(word_list)
assert sorted(dictionary.find_anagrams("listen")) == ['enlist', 'listen', 'silent']
assert dictionary.is_anagram("listen", "silent") == True
print("All Group Anagrams tests passed!")
if __name__ == "__main__":
test_group_anagrams()
Topic 60: Valid Anagram - Complete Implementation
"""
Valid Anagram - Check if two strings are anagrams
Time: O(n), Space: O(1) for lowercase English letters
"""
from typing import Dict
from collections import Counter
class AnagramValidator:
"""
Complete implementation of anagram validation with multiple approaches.
Key Insight: Anagrams have identical character frequencies
"""
def is_anagram_sorting(self, s: str, t: str) -> bool:
"""
Check if two strings are anagrams using sorting.
Time: O(n log n), Space: O(n)
Args:
s: First string
t: Second string
Returns:
True if s and t are anagrams
Examples:
>>> validator = AnagramValidator()
>>> validator.is_anagram_sorting("anagram", "nagaram")
True
>>> validator.is_anagram_sorting("rat", "car")
False
"""
if len(s) != len(t):
return False
return sorted(s) == sorted(t)
def is_anagram_hashtable(self, s: str, t: str) -> bool:
"""
Check if two strings are anagrams using hash table.
More efficient for long strings: O(n) vs O(n log n).
Time: O(n), Space: O(1) for fixed alphabet size
Examples:
>>> validator = AnagramValidator()
>>> validator.is_anagram_hashtable("anagram", "nagaram")
True
"""
if len(s) != len(t):
return False
return Counter(s) == Counter(t)
def is_anagram_array(self, s: str, t: str) -> bool:
"""
Check anagrams using fixed-size array (only lowercase a-z).
Most space-efficient: O(1) space.
Time: O(n), Space: O(1)
Examples:
>>> validator = AnagramValidator()
>>> validator.is_anagram_array("anagram", "nagaram")
True
"""
if len(s) != len(t):
return False
# Fixed array for 26 lowercase letters
char_count = [0] * 26
for i in range(len(s)):
char_count[ord(s[i]) - ord('a')] += 1
char_count[ord(t[i]) - ord('a')] -= 1
return all(count == 0 for count in char_count)
def is_anagram_unicode(self, s: str, t: str) -> bool:
"""
Check anagrams for Unicode strings (supports all characters).
Time: O(n), Space: O(k) where k = unique characters
Examples:
>>> validator = AnagramValidator()
>>> validator.is_anagram_unicode("你好", "好你")
True
"""
if len(s) != len(t):
return False
return Counter(s) == Counter(t)
def is_anagram_case_insensitive(self, s: str, t: str) -> bool:
"""
Check anagrams ignoring case.
Time: O(n), Space: O(1)
Examples:
>>> validator = AnagramValidator()
>>> validator.is_anagram_case_insensitive("Anagram", "Nagaram")
True
"""
return sorted(s.lower()) == sorted(t.lower())
def is_anagram_ignore_spaces(self, s: str, t: str) -> bool:
"""
Check anagrams ignoring spaces and case.
Time: O(n), Space: O(n)
Examples:
>>> validator = AnagramValidator()
>>> validator.is_anagram_ignore_spaces("conversation", "voices rant on")
True
"""
s_clean = s.replace(" ", "").lower()
t_clean = t.replace(" ", "").lower()
return sorted(s_clean) == sorted(t_clean)
def find_anagram_difference(self, s: str, t: str) -> Dict[str, int]:
"""
Find character frequency differences between two strings.
Useful for debugging or partial anagram matching.
Time: O(n), Space: O(k)
Examples:
>>> validator = AnagramValidator()
>>> validator.find_anagram_difference("aab", "aac")
{'b': 1, 'c': -1}
"""
count_s = Counter(s)
count_t = Counter(t)
diff = {}
all_chars = set(count_s.keys()) | set(count_t.keys())
for char in all_chars:
difference = count_s.get(char, 0) - count_t.get(char, 0)
if difference != 0:
diff[char] = difference
return diff
    def minimum_swaps_to_anagram(self, s: str, t: str) -> int:
        """
        Find minimum character swaps within s to make it an anagram of t.
        Returns -1 if impossible.
        Since every permutation of s is an anagram of s, the answer is 0
        whenever s and t share the same character multiset, and -1 otherwise.
        Time: O(n), Space: O(1) for a fixed alphabet
        Examples:
            >>> validator = AnagramValidator()
            >>> validator.minimum_swaps_to_anagram("abc", "bca")
            0  # Already anagrams
            >>> validator.minimum_swaps_to_anagram("abc", "def")
            -1  # Impossible
        """
        if len(s) != len(t):
            return -1
        # No rearrangement is ever needed: matching multisets means s is
        # already an anagram of t.
        return 0 if Counter(s) == Counter(t) else -1
class AnagramGame:
"""
Anagram-based word game utilities.
"""
def __init__(self, dictionary: list[str]):
"""Initialize with word dictionary."""
self.words = set(word.lower() for word in dictionary)
# Build anagram index
self.anagram_index = {}
for word in self.words:
key = ''.join(sorted(word))
if key not in self.anagram_index:
self.anagram_index[key] = []
self.anagram_index[key].append(word)
def get_valid_anagrams(self, letters: str) -> list[str]:
"""
Get all valid dictionary words that are anagrams of given letters.
Time: O(k log k + m) where k = len(letters), m = anagrams found
Examples:
>>> game = AnagramGame(["listen", "silent", "enlist", "hello"])
>>> sorted(game.get_valid_anagrams("silent"))
['enlist', 'listen', 'silent']
"""
key = ''.join(sorted(letters.lower()))
return self.anagram_index.get(key, [])
def score_anagram(self, word: str) -> int:
"""
Score word based on letter frequency (Scrabble-like).
Examples:
>>> game = AnagramGame([])
>>> game.score_anagram("quiz")
22 # High value letters
"""
# Scrabble letter values
values = {
'a': 1, 'e': 1, 'i': 1, 'o': 1, 'u': 1, 'l': 1, 'n': 1, 's': 1, 't': 1, 'r': 1,
'd': 2, 'g': 2,
'b': 3, 'c': 3, 'm': 3, 'p': 3,
'f': 4, 'h': 4, 'v': 4, 'w': 4, 'y': 4,
'k': 5,
'j': 8, 'x': 8,
'q': 10, 'z': 10
}
return sum(values.get(c.lower(), 0) for c in word)
# Comprehensive Test Suite
def test_valid_anagram():
"""Complete test suite for anagram validation."""
validator = AnagramValidator()
# Test 1: Basic anagrams
assert validator.is_anagram_sorting("anagram", "nagaram") == True
assert validator.is_anagram_sorting("rat", "car") == False
# Test 2: Empty strings
assert validator.is_anagram_sorting("", "") == True
# Test 3: Different lengths
assert validator.is_anagram_sorting("a", "ab") == False
# Test 4: Single character
assert validator.is_anagram_sorting("a", "a") == True
# Test 5: Hash table method
assert validator.is_anagram_hashtable("listen", "silent") == True
# Test 6: Array method
assert validator.is_anagram_array("anagram", "nagaram") == True
# Test 7: Case insensitive
assert validator.is_anagram_case_insensitive("Anagram", "Nagaram") == True
# Test 8: Ignore spaces
assert validator.is_anagram_ignore_spaces("conversation", "voices rant on") == True
# Test 9: Unicode
assert validator.is_anagram_unicode("你好", "好你") == True
# Test 10: Find difference
diff = validator.find_anagram_difference("aab", "aac")
assert diff == {'b': 1, 'c': -1}
# Test 11: Minimum swaps
assert validator.minimum_swaps_to_anagram("abc", "bca") == 0
assert validator.minimum_swaps_to_anagram("abc", "def") == -1
# Test 12: Anagram game
game = AnagramGame(["listen", "silent", "enlist", "hello", "world"])
anagrams = game.get_valid_anagrams("silent")
assert len(anagrams) == 3
assert all(word in anagrams for word in ["listen", "silent", "enlist"])
# Test 13: Scoring
score = game.score_anagram("quiz")
assert score == 22 # q=10, u=1, i=1, z=10
print("All Valid Anagram tests passed!")
if __name__ == "__main__":
test_valid_anagram()
Topic 61: Contains Duplicate - Complete Implementation
"""
Contains Duplicate - Check if array has duplicate elements
Time: O(n), Space: O(n)
"""
from typing import List, Set, Dict
from collections import Counter
class DuplicateDetector:
"""
Complete implementation of duplicate detection with various constraints.
Key Insight: Hash set provides O(1) duplicate checking
"""
def contains_duplicate_set(self, nums: List[int]) -> bool:
"""
Check if array contains any duplicates using set.
Time: O(n), Space: O(n)
Args:
nums: List of integers
Returns:
True if any value appears at least twice
Examples:
>>> detector = DuplicateDetector()
>>> detector.contains_duplicate_set([1,2,3,1])
True
>>> detector.contains_duplicate_set([1,2,3,4])
False
"""
if not nums:
return False
return len(nums) != len(set(nums))
def contains_duplicate_early_return(self, nums: List[int]) -> bool:
"""
Check duplicates with early return (more efficient).
Time: O(n), Space: O(n) worst case
Best case: O(1) if duplicate found early
Examples:
>>> detector = DuplicateDetector()
>>> detector.contains_duplicate_early_return([1,1,2,3])
True # Returns immediately on first duplicate
"""
if not nums:
return False
seen = set()
for num in nums:
if num in seen:
return True
seen.add(num)
return False
def contains_duplicate_ii(self, nums: List[int], k: int) -> bool:
"""
Check if duplicates exist within k indices of each other.
LeetCode 219: Contains Duplicate II
Time: O(n), Space: O(min(n, k))
Args:
nums: List of integers
k: Maximum index difference
Returns:
True if duplicates exist with index difference <= k
Examples:
>>> detector = DuplicateDetector()
>>> detector.contains_duplicate_ii([1,2,3,1], 3)
True # indices 0 and 3, diff = 3
>>> detector.contains_duplicate_ii([1,0,1,1], 1)
True # indices 2 and 3, diff = 1
>>> detector.contains_duplicate_ii([1,2,3,1,2,3], 2)
False
"""
if not nums or k <= 0:
return False
# Sliding window: maintain window of size k+1
window = set()
for i, num in enumerate(nums):
if num in window:
return True
window.add(num)
# Maintain window size
if len(window) > k:
window.remove(nums[i - k])
return False
def contains_duplicate_iii(self, nums: List[int], index_diff: int, value_diff: int) -> bool:
"""
Check if duplicates exist with constraints on both index and value difference.
LeetCode 220: Contains Duplicate III
Find if there exist indices i and j such that:
- abs(i - j) <= index_diff
- abs(nums[i] - nums[j]) <= value_diff
        Time: O(n) with the bucket approach implemented here (a sorted-container
        approach would be O(n log k)); Space: O(min(n, index_diff))
Examples:
>>> detector = DuplicateDetector()
>>> detector.contains_duplicate_iii([1,2,3,1], 3, 0)
True # Exact duplicate within range
>>> detector.contains_duplicate_iii([1,5,9,1,5,9], 2, 3)
False
"""
if not nums or index_diff < 0 or value_diff < 0:
return False
# Bucket approach: each bucket has width value_diff + 1
bucket_width = value_diff + 1
buckets = {}
for i, num in enumerate(nums):
bucket_id = num // bucket_width
# Check current bucket
if bucket_id in buckets:
return True
# Check adjacent buckets
if bucket_id - 1 in buckets and abs(num - buckets[bucket_id - 1]) <= value_diff:
return True
if bucket_id + 1 in buckets and abs(num - buckets[bucket_id + 1]) <= value_diff:
return True
# Add to bucket
buckets[bucket_id] = num
# Remove old bucket
if i >= index_diff:
del buckets[nums[i - index_diff] // bucket_width]
return False
def find_all_duplicates(self, nums: List[int]) -> List[int]:
"""
Find all numbers that appear more than once.
Time: O(n), Space: O(n)
Examples:
>>> detector = DuplicateDetector()
>>> sorted(detector.find_all_duplicates([4,3,2,7,8,2,3,1]))
[2, 3]
"""
if not nums:
return []
counts = Counter(nums)
return [num for num, count in counts.items() if count > 1]
def find_duplicate_single(self, nums: List[int]) -> int:
"""
Find the one duplicate in array where:
- nums contains n+1 integers in range [1, n]
- Only one number appears twice, all others appear once
Floyd's cycle detection: O(n) time, O(1) space
Examples:
>>> detector = DuplicateDetector()
>>> detector.find_duplicate_single([1,3,4,2,2])
2
>>> detector.find_duplicate_single([3,1,3,4,2])
3
"""
# Treat array as linked list: nums[i] points to nums[nums[i]]
# Duplicate creates a cycle
# Phase 1: Find intersection point in cycle
slow = fast = nums[0]
while True:
slow = nums[slow]
fast = nums[nums[fast]]
if slow == fast:
break
# Phase 2: Find entrance to cycle (duplicate)
slow = nums[0]
while slow != fast:
slow = nums[slow]
fast = nums[fast]
return slow
def count_duplicates(self, nums: List[int]) -> int:
"""
Count how many numbers appear more than once.
Time: O(n), Space: O(n)
Examples:
>>> detector = DuplicateDetector()
>>> detector.count_duplicates([1,2,3,1,2,4])
2 # 1 and 2 are duplicates
"""
if not nums:
return 0
counts = Counter(nums)
return sum(1 for count in counts.values() if count > 1)
def remove_duplicates_inplace(self, nums: List[int]) -> int:
"""
Remove duplicates from sorted array in-place.
Return new length.
Time: O(n), Space: O(1)
Examples:
>>> detector = DuplicateDetector()
>>> nums = [1,1,2,2,3,4,4]
>>> length = detector.remove_duplicates_inplace(nums)
>>> length
4
>>> nums[:length]
[1, 2, 3, 4]
"""
if not nums:
return 0
write_idx = 1
for i in range(1, len(nums)):
if nums[i] != nums[i - 1]:
nums[write_idx] = nums[i]
write_idx += 1
return write_idx
class DuplicateAnalyzer:
"""
Advanced duplicate analysis and statistics.
"""
def __init__(self, nums: List[int]):
"""Initialize with number array."""
self.nums = nums
self.frequency = Counter(nums)
def get_duplicate_stats(self) -> Dict:
"""
Get comprehensive duplicate statistics.
Returns:
Dictionary with various statistics
Examples:
>>> analyzer = DuplicateAnalyzer([1,2,2,3,3,3,4])
>>> stats = analyzer.get_duplicate_stats()
>>> stats['total_duplicates']
2
>>> stats['most_frequent']
3
"""
duplicates = {num: count for num, count in self.frequency.items() if count > 1}
return {
'total_elements': len(self.nums),
'unique_elements': len(self.frequency),
'total_duplicates': len(duplicates),
'duplicate_elements': list(duplicates.keys()),
'most_frequent': max(self.frequency, key=self.frequency.get) if self.frequency else None,
'max_frequency': max(self.frequency.values()) if self.frequency else 0,
'duplicate_count': sum(count - 1 for count in duplicates.values()),
}
def find_k_most_frequent(self, k: int) -> List[int]:
"""
Find k most frequent elements.
Time: O(n log k), Space: O(n)
Examples:
>>> analyzer = DuplicateAnalyzer([1,1,1,2,2,3])
>>> analyzer.find_k_most_frequent(2)
[1, 2]
"""
from heapq import nlargest
return nlargest(k, self.frequency.keys(), key=self.frequency.get)
# Comprehensive Test Suite
def test_contains_duplicate():
"""Complete test suite for duplicate detection."""
detector = DuplicateDetector()
# Test 1: Basic duplicate
assert detector.contains_duplicate_set([1,2,3,1]) == True
assert detector.contains_duplicate_set([1,2,3,4]) == False
# Test 2: Empty array
assert detector.contains_duplicate_set([]) == False
# Test 3: Single element
assert detector.contains_duplicate_set([1]) == False
# Test 4: All duplicates
assert detector.contains_duplicate_set([1,1,1,1]) == True
# Test 5: Early return optimization
assert detector.contains_duplicate_early_return([1,1,2,3]) == True
# Test 6: Contains Duplicate II
assert detector.contains_duplicate_ii([1,2,3,1], 3) == True
assert detector.contains_duplicate_ii([1,0,1,1], 1) == True
assert detector.contains_duplicate_ii([1,2,3,1,2,3], 2) == False
# Test 7: Contains Duplicate III
assert detector.contains_duplicate_iii([1,2,3,1], 3, 0) == True
assert detector.contains_duplicate_iii([1,5,9,1,5,9], 2, 3) == False
# Test 8: Find all duplicates
assert sorted(detector.find_all_duplicates([4,3,2,7,8,2,3,1])) == [2, 3]
# Test 9: Find single duplicate (Floyd's algorithm)
assert detector.find_duplicate_single([1,3,4,2,2]) == 2
assert detector.find_duplicate_single([3,1,3,4,2]) == 3
# Test 10: Count duplicates
assert detector.count_duplicates([1,2,3,1,2,4]) == 2
# Test 11: Remove duplicates in-place
nums = [1,1,2,2,3,4,4]
length = detector.remove_duplicates_inplace(nums)
assert length == 4
assert nums[:length] == [1, 2, 3, 4]
# Test 12: Duplicate analyzer
analyzer = DuplicateAnalyzer([1,2,2,3,3,3,4])
stats = analyzer.get_duplicate_stats()
assert stats['total_duplicates'] == 2
assert stats['most_frequent'] == 3
assert stats['max_frequency'] == 3
# Test 13: K most frequent
assert analyzer.find_k_most_frequent(2) == [3, 2]
print("All Contains Duplicate tests passed!")
if __name__ == "__main__":
test_contains_duplicate()
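To see why Floyd's algorithm applies at all, it helps to watch the array behave like a linked list. Below is a small standalone sketch (the helper name `pointer_chain` is illustrative, not part of the classes above) that follows `i -> nums[i]` from index 0 and records the visited values, making the cycle created by the duplicate visible:

```python
def pointer_chain(nums, steps=8):
    """Follow i -> nums[i] starting from index 0, recording visited values."""
    chain, cur = [], nums[0]
    for _ in range(steps):
        chain.append(cur)
        cur = nums[cur]
    return chain

# For [1,3,4,2,2] the walk from index 0 is 1 -> 3 -> 2 -> 4 -> 2 -> 4 -> ...
# The duplicate value 2 is exactly where the cycle is entered, which is
# what Phase 2 of find_duplicate_single locates.
print(pointer_chain([1, 3, 4, 2, 2]))
```

Printing the chain for a few inputs of your own is a quick way to build intuition before reading the two-phase pointer logic.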
Topic 62: Longest Consecutive Sequence - Complete Implementation
"""
Longest Consecutive Sequence - Find length of longest consecutive elements sequence
Time: O(n), Space: O(n)
"""
from typing import List, Set, Dict
class ConsecutiveSequenceFinder:
"""
Complete implementation of consecutive sequence problems.
Key Insight: Hash set allows O(1) lookup for sequence building
"""
def longest_consecutive_optimal(self, nums: List[int]) -> int:
"""
Find longest consecutive sequence using hash set.
Only start counting from sequence starts (num-1 not in set).
Time: O(n), Space: O(n)
Args:
nums: List of integers
Returns:
Length of longest consecutive sequence
Examples:
>>> finder = ConsecutiveSequenceFinder()
>>> finder.longest_consecutive_optimal([100,4,200,1,3,2])
4 # Sequence: [1,2,3,4]
>>> finder.longest_consecutive_optimal([0,3,7,2,5,8,4,6,0,1])
9 # Sequence: [0,1,2,3,4,5,6,7,8]
"""
if not nums:
return 0
num_set = set(nums)
max_length = 0
for num in num_set:
# Only start counting from sequence beginnings
if num - 1 not in num_set:
current_num = num
current_length = 1
# Count consecutive numbers
while current_num + 1 in num_set:
current_num += 1
current_length += 1
max_length = max(max_length, current_length)
return max_length
def longest_consecutive_with_sequence(self, nums: List[int]) -> List[int]:
"""
Return the actual longest consecutive sequence, not just length.
Time: O(n), Space: O(n)
Examples:
>>> finder = ConsecutiveSequenceFinder()
>>> finder.longest_consecutive_with_sequence([100,4,200,1,3,2])
[1, 2, 3, 4]
"""
if not nums:
return []
num_set = set(nums)
longest_sequence = []
for num in num_set:
if num - 1 not in num_set:
# Build sequence from this start
sequence = []
current = num
while current in num_set:
sequence.append(current)
current += 1
if len(sequence) > len(longest_sequence):
longest_sequence = sequence
return longest_sequence
def longest_consecutive_union_find(self, nums: List[int]) -> int:
"""
Alternative approach using Union-Find data structure.
Educational purpose - shows connection to graph connectivity.
Time: O(n * α(n)) ≈ O(n), Space: O(n)
where α is inverse Ackermann function (practically constant)
Examples:
>>> finder = ConsecutiveSequenceFinder()
>>> finder.longest_consecutive_union_find([100,4,200,1,3,2])
4
"""
if not nums:
return 0
class UnionFind:
def __init__(self):
self.parent = {}
self.size = {}
def add(self, x):
if x not in self.parent:
self.parent[x] = x
self.size[x] = 1
def find(self, x):
if self.parent[x] != x:
self.parent[x] = self.find(self.parent[x])
return self.parent[x]
def union(self, x, y):
root_x, root_y = self.find(x), self.find(y)
if root_x == root_y:
return
# Union by size
if self.size[root_x] < self.size[root_y]:
root_x, root_y = root_y, root_x
self.parent[root_y] = root_x
self.size[root_x] += self.size[root_y]
def get_max_size(self):
return max(self.size.values()) if self.size else 0
uf = UnionFind()
num_set = set(nums)
for num in num_set:
uf.add(num)
if num + 1 in num_set:
uf.union(num, num + 1)
return uf.get_max_size()
def count_consecutive_sequences(self, nums: List[int]) -> int:
"""
Count number of consecutive sequences.
Time: O(n), Space: O(n)
Examples:
>>> finder = ConsecutiveSequenceFinder()
>>> finder.count_consecutive_sequences([100,4,200,1,3,2])
3 # [100], [200], [1,2,3,4]
"""
if not nums:
return 0
num_set = set(nums)
count = 0
for num in num_set:
if num - 1 not in num_set:
count += 1
return count
def all_consecutive_sequences(self, nums: List[int]) -> List[List[int]]:
"""
Find all consecutive sequences.
Time: O(n), Space: O(n)
Examples:
>>> finder = ConsecutiveSequenceFinder()
>>> result = finder.all_consecutive_sequences([100,4,200,1,3,2])
>>> sorted([sorted(seq) for seq in result])
[[1, 2, 3, 4], [100], [200]]
"""
if not nums:
return []
num_set = set(nums)
sequences = []
for num in num_set:
if num - 1 not in num_set:
# Build sequence
sequence = []
current = num
while current in num_set:
sequence.append(current)
current += 1
sequences.append(sequence)
return sequences
def longest_consecutive_range(self, nums: List[int]) -> tuple:
"""
Return (start, end) of longest consecutive sequence.
Time: O(n), Space: O(n)
Examples:
>>> finder = ConsecutiveSequenceFinder()
>>> finder.longest_consecutive_range([100,4,200,1,3,2])
(1, 4)
"""
if not nums:
return (0, 0)
num_set = set(nums)
max_start, max_end = 0, 0
max_length = 0
for num in num_set:
if num - 1 not in num_set:
start = num
end = num
while end + 1 in num_set:
end += 1
if end - start + 1 > max_length:
max_length = end - start + 1
max_start, max_end = start, end
return (max_start, max_end)
def consecutive_with_duplicates(self, nums: List[int]) -> int:
"""
Handle duplicates explicitly in consecutive sequence.
Time: O(n), Space: O(n)
Examples:
>>> finder = ConsecutiveSequenceFinder()
>>> finder.consecutive_with_duplicates([1,2,2,3,3,3,4])
4 # [1,2,3,4]
"""
# Same as optimal since we use set
return self.longest_consecutive_optimal(nums)
class BinaryArrayConsecutive:
"""
Find longest consecutive sequence of 1s in binary array.
Related problem showing hash table application to different domain.
"""
def longest_consecutive_ones(self, nums: List[int]) -> int:
"""
Find longest sequence of consecutive 1s.
Time: O(n), Space: O(1)
Examples:
>>> solver = BinaryArrayConsecutive()
>>> solver.longest_consecutive_ones([1,1,0,1,1,1])
3
"""
max_length = 0
current_length = 0
for num in nums:
if num == 1:
current_length += 1
max_length = max(max_length, current_length)
else:
current_length = 0
return max_length
def longest_consecutive_ones_with_flip(self, nums: List[int], k: int) -> int:
"""
Find longest sequence of 1s after flipping at most k 0s.
Time: O(n), Space: O(1)
Examples:
>>> solver = BinaryArrayConsecutive()
>>> solver.longest_consecutive_ones_with_flip([1,1,1,0,0,0,1,1,1,1,0], 2)
        6 # Flip the 0s at indices 5 and 10 -> window of six 1s (indices 5..10)
"""
left = 0
zeros_count = 0
max_length = 0
for right in range(len(nums)):
if nums[right] == 0:
zeros_count += 1
while zeros_count > k:
if nums[left] == 0:
zeros_count -= 1
left += 1
max_length = max(max_length, right - left + 1)
return max_length
# Comprehensive Test Suite
def test_longest_consecutive():
"""Complete test suite for consecutive sequence."""
finder = ConsecutiveSequenceFinder()
# Test 1: Basic case
assert finder.longest_consecutive_optimal([100,4,200,1,3,2]) == 4
# Test 2: Long sequence
assert finder.longest_consecutive_optimal([0,3,7,2,5,8,4,6,0,1]) == 9
# Test 3: Empty array
assert finder.longest_consecutive_optimal([]) == 0
# Test 4: Single element
assert finder.longest_consecutive_optimal([1]) == 1
# Test 5: No consecutive
assert finder.longest_consecutive_optimal([1,3,5,7,9]) == 1
# Test 6: All consecutive
assert finder.longest_consecutive_optimal([1,2,3,4,5]) == 5
# Test 7: Duplicates
assert finder.longest_consecutive_optimal([1,2,0,1]) == 3
# Test 8: Negative numbers
assert finder.longest_consecutive_optimal([-1,0,1,2]) == 4
# Test 9: Get actual sequence
sequence = finder.longest_consecutive_with_sequence([100,4,200,1,3,2])
assert sequence == [1,2,3,4]
# Test 10: Union-Find approach
assert finder.longest_consecutive_union_find([100,4,200,1,3,2]) == 4
# Test 11: Count sequences
assert finder.count_consecutive_sequences([100,4,200,1,3,2]) == 3
# Test 12: All sequences
sequences = finder.all_consecutive_sequences([100,4,200,1,3,2])
assert len(sequences) == 3
# Test 13: Get range
assert finder.longest_consecutive_range([100,4,200,1,3,2]) == (1, 4)
# Test 14: Binary array consecutive 1s
binary_solver = BinaryArrayConsecutive()
assert binary_solver.longest_consecutive_ones([1,1,0,1,1,1]) == 3
# Test 15: With flips
assert binary_solver.longest_consecutive_ones_with_flip([1,1,1,0,0,0,1,1,1,1,0], 2) == 6
print("All Longest Consecutive Sequence tests passed!")
if __name__ == "__main__":
test_longest_consecutive()
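A common point of confusion is why the nested while-loop above is still O(n). One way to convince yourself: instrument the inner loop and count its total lookups. This sketch (the counter is illustrative, added for measurement only) mirrors longest_consecutive_optimal:

```python
def longest_consecutive_instrumented(nums):
    # Same algorithm as longest_consecutive_optimal, but also counts
    # how many inner while-loop steps are performed in total.
    num_set = set(nums)
    best = inner_steps = 0
    for num in num_set:
        if num - 1 not in num_set:        # only sequence starts
            length, cur = 1, num
            while cur + 1 in num_set:
                cur += 1
                length += 1
                inner_steps += 1
            best = max(best, length)
    return best, inner_steps

best, steps = longest_consecutive_instrumented([100, 4, 200, 1, 3, 2])
# Each unique element is visited by the inner loop at most once, so the
# total inner-loop work is bounded by the number of unique elements.
```

Because every element belongs to exactly one sequence and is walked over exactly once, the outer and inner loops together do O(n) work, not O(n²).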
Topic 63: Top K Frequent Elements - Complete Implementation
"""
Top K Frequent Elements - Find k most frequent elements
Time: O(n log k) using heap, O(n) using bucket sort
Space: O(n)
"""
from typing import List, Dict, Tuple
from collections import Counter
import heapq
class TopKFrequentSolver:
"""
Complete implementation of Top K Frequent Elements.
Key Insight: Use frequency counting + heap or bucket sort
"""
def top_k_frequent_heap(self, nums: List[int], k: int) -> List[int]:
"""
Find k most frequent elements using min heap.
Time: O(n log k), Space: O(n)
Args:
nums: List of integers
k: Number of most frequent elements to return
Returns:
List of k most frequent elements
Examples:
>>> solver = TopKFrequentSolver()
>>> sorted(solver.top_k_frequent_heap([1,1,1,2,2,3], 2))
[1, 2]
>>> sorted(solver.top_k_frequent_heap([1], 1))
[1]
"""
if not nums or k <= 0:
return []
# Count frequencies
freq_map = Counter(nums)
# Use min heap of size k
# Heap stores (frequency, number)
min_heap = []
for num, freq in freq_map.items():
heapq.heappush(min_heap, (freq, num))
if len(min_heap) > k:
heapq.heappop(min_heap)
# Extract numbers from heap
return [num for freq, num in min_heap]
def top_k_frequent_bucket(self, nums: List[int], k: int) -> List[int]:
"""
Find k most frequent using bucket sort.
More efficient: O(n) time.
Time: O(n), Space: O(n)
Examples:
>>> solver = TopKFrequentSolver()
>>> sorted(solver.top_k_frequent_bucket([1,1,1,2,2,3], 2))
[1, 2]
"""
if not nums or k <= 0:
return []
# Count frequencies
freq_map = Counter(nums)
# Bucket sort: index = frequency, value = list of numbers
# Maximum frequency is len(nums)
buckets = [[] for _ in range(len(nums) + 1)]
for num, freq in freq_map.items():
buckets[freq].append(num)
# Collect k most frequent from highest frequency buckets
result = []
for i in range(len(buckets) - 1, -1, -1):
if buckets[i]:
result.extend(buckets[i])
if len(result) >= k:
return result[:k]
return result
def top_k_frequent_quickselect(self, nums: List[int], k: int) -> List[int]:
"""
Find k most frequent using Quickselect algorithm.
Average O(n), worst case O(n²).
Time: O(n) average, Space: O(n)
Examples:
>>> solver = TopKFrequentSolver()
>>> result = solver.top_k_frequent_quickselect([1,1,1,2,2,3], 2)
>>> len(result)
2
"""
if not nums or k <= 0:
return []
freq_map = Counter(nums)
unique_nums = list(freq_map.keys())
def partition(left, right, pivot_idx):
pivot_freq = freq_map[unique_nums[pivot_idx]]
# Move pivot to end
unique_nums[pivot_idx], unique_nums[right] = unique_nums[right], unique_nums[pivot_idx]
store_idx = left
for i in range(left, right):
if freq_map[unique_nums[i]] < pivot_freq:
unique_nums[store_idx], unique_nums[i] = unique_nums[i], unique_nums[store_idx]
store_idx += 1
# Move pivot to final position
unique_nums[right], unique_nums[store_idx] = unique_nums[store_idx], unique_nums[right]
return store_idx
def quickselect(left, right, k_smallest):
if left == right:
return
# Select random pivot
import random
pivot_idx = random.randint(left, right)
pivot_idx = partition(left, right, pivot_idx)
if k_smallest == pivot_idx:
return
elif k_smallest < pivot_idx:
quickselect(left, pivot_idx - 1, k_smallest)
else:
quickselect(pivot_idx + 1, right, k_smallest)
n = len(unique_nums)
quickselect(0, n - 1, n - k)
return unique_nums[n - k:]
def top_k_frequent_with_counts(self, nums: List[int], k: int) -> List[Tuple[int, int]]:
"""
Return k most frequent elements with their frequencies.
Time: O(n log k), Space: O(n)
Examples:
>>> solver = TopKFrequentSolver()
>>> solver.top_k_frequent_with_counts([1,1,1,2,2,3], 2)
[(1, 3), (2, 2)]
"""
if not nums or k <= 0:
return []
freq_map = Counter(nums)
# Use max heap (negate frequencies for min heap)
heap = []
for num, freq in freq_map.items():
heapq.heappush(heap, (-freq, num))
result = []
for _ in range(min(k, len(heap))):
freq, num = heapq.heappop(heap)
result.append((num, -freq))
return result
def top_k_frequent_sorted(self, nums: List[int], k: int) -> List[int]:
"""
Simple approach using sorting.
Time: O(n log n), Space: O(n)
Not optimal but simple and clear.
"""
if not nums or k <= 0:
return []
freq_map = Counter(nums)
# Sort by frequency (descending)
sorted_items = sorted(freq_map.items(), key=lambda x: x[1], reverse=True)
return [num for num, freq in sorted_items[:k]]
def bottom_k_frequent(self, nums: List[int], k: int) -> List[int]:
"""
Find k least frequent elements.
Time: O(n log k), Space: O(n)
Examples:
>>> solver = TopKFrequentSolver()
>>> sorted(solver.bottom_k_frequent([1,1,1,2,2,3], 2))
[2, 3]
"""
if not nums or k <= 0:
return []
freq_map = Counter(nums)
# Use max heap to keep k smallest frequencies
max_heap = []
for num, freq in freq_map.items():
heapq.heappush(max_heap, (-freq, num))
if len(max_heap) > k:
heapq.heappop(max_heap)
return [num for freq, num in max_heap]
class FrequencyAnalyzer:
"""
Advanced frequency analysis tools.
"""
def __init__(self, nums: List[int]):
"""Initialize with number array."""
self.nums = nums
self.freq_map = Counter(nums)
def get_frequency_distribution(self) -> Dict[int, List[int]]:
"""
Get distribution: frequency -> list of numbers with that frequency.
Time: O(n), Space: O(n)
Examples:
>>> analyzer = FrequencyAnalyzer([1,1,1,2,2,3])
>>> dist = analyzer.get_frequency_distribution()
>>> dist[3]
[1]
>>> sorted(dist[2])
[2]
>>> dist[1]
[3]
"""
distribution = {}
for num, freq in self.freq_map.items():
if freq not in distribution:
distribution[freq] = []
distribution[freq].append(num)
return distribution
def frequency_stats(self) -> Dict:
"""
Get comprehensive frequency statistics.
Returns:
Dictionary with various statistics
"""
if not self.freq_map:
return {}
frequencies = list(self.freq_map.values())
return {
'total_elements': len(self.nums),
'unique_elements': len(self.freq_map),
'min_frequency': min(frequencies),
'max_frequency': max(frequencies),
'avg_frequency': sum(frequencies) / len(frequencies),
'mode': max(self.freq_map, key=self.freq_map.get),
'mode_frequency': max(frequencies),
}
def elements_with_frequency(self, target_freq: int) -> List[int]:
"""
Get all elements that appear exactly target_freq times.
Time: O(n), Space: O(1)
Examples:
>>> analyzer = FrequencyAnalyzer([1,1,1,2,2,3])
>>> analyzer.elements_with_frequency(2)
[2]
"""
return [num for num, freq in self.freq_map.items() if freq == target_freq]
def frequency_percentile(self, num: int) -> float:
"""
Calculate what percentile the frequency of num is at.
Examples:
>>> analyzer = FrequencyAnalyzer([1,1,1,2,2,3])
>>> analyzer.frequency_percentile(1)
100.0 # Highest frequency
"""
if num not in self.freq_map:
return 0.0
target_freq = self.freq_map[num]
        # Count frequencies at or below the target so the most frequent
        # element maps to the 100th percentile, matching the example above
        count_at_or_below = sum(1 for freq in self.freq_map.values() if freq <= target_freq)
        return (count_at_or_below / len(self.freq_map)) * 100
# Comprehensive Test Suite
def test_top_k_frequent():
"""Complete test suite for Top K Frequent."""
solver = TopKFrequentSolver()
# Test 1: Basic case with heap
result = solver.top_k_frequent_heap([1,1,1,2,2,3], 2)
assert sorted(result) == [1, 2]
# Test 2: Single element
assert solver.top_k_frequent_heap([1], 1) == [1]
# Test 3: K equals number of unique elements
result = solver.top_k_frequent_heap([1,2,3], 3)
assert len(result) == 3
# Test 4: Bucket sort approach
result = solver.top_k_frequent_bucket([1,1,1,2,2,3], 2)
assert sorted(result) == [1, 2]
# Test 5: Quickselect approach
result = solver.top_k_frequent_quickselect([1,1,1,2,2,3], 2)
assert len(result) == 2
assert 1 in result and 2 in result
# Test 6: With counts
result = solver.top_k_frequent_with_counts([1,1,1,2,2,3], 2)
assert len(result) == 2
assert (1, 3) in result
assert (2, 2) in result
# Test 7: Sorted approach
result = solver.top_k_frequent_sorted([1,1,1,2,2,3], 2)
assert result == [1, 2]
# Test 8: Bottom k frequent
result = solver.bottom_k_frequent([1,1,1,2,2,3], 2)
assert sorted(result) == [2, 3]
# Test 9: Frequency analyzer
analyzer = FrequencyAnalyzer([1,1,1,2,2,3])
dist = analyzer.get_frequency_distribution()
assert dist[3] == [1]
assert 2 in dist[2]
assert dist[1] == [3]
# Test 10: Frequency stats
stats = analyzer.frequency_stats()
assert stats['total_elements'] == 6
assert stats['unique_elements'] == 3
assert stats['max_frequency'] == 3
assert stats['mode'] == 1
# Test 11: Elements with frequency
assert analyzer.elements_with_frequency(2) == [2]
# Test 12: Frequency percentile
assert analyzer.frequency_percentile(1) == 100.0
# Test 13: Large array performance
import random
large_array = [random.randint(1, 100) for _ in range(10000)]
result = solver.top_k_frequent_bucket(large_array, 10)
assert len(result) <= 10
print("All Top K Frequent tests passed!")
if __name__ == "__main__":
test_top_k_frequent()
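The bucket-sort approach is easier to follow once you see what the bucket array actually holds. A minimal sketch of the bucketing step from top_k_frequent_bucket (the helper name `frequency_buckets` is illustrative):

```python
from collections import Counter

def frequency_buckets(nums):
    # buckets[f] lists the values that occur exactly f times.
    # An element can appear at most len(nums) times, so len(nums) + 1
    # buckets always suffice -- this is why no sorting is needed.
    freq = Counter(nums)
    buckets = [[] for _ in range(len(nums) + 1)]
    for num, f in freq.items():
        buckets[f].append(num)
    return buckets

print(frequency_buckets([1, 1, 1, 2, 2, 3]))
# [[], [3], [2], [1], [], [], []]
```

Walking the buckets from the highest index downward then yields elements in descending frequency order in O(n) total time.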
Topic 64: Subarray Sum Equals K - Complete Implementation
"""
Subarray Sum Equals K - Count subarrays with sum equal to k
Time: O(n), Space: O(n)
"""
from typing import List, Dict, Tuple
from collections import defaultdict
class SubarraySumSolver:
"""
Complete implementation of subarray sum problems.
Key Insight: Use prefix sum + hash map for O(n) solution
"""
def subarray_sum_count(self, nums: List[int], k: int) -> int:
"""
Count number of continuous subarrays with sum equal to k.
Uses prefix sum: if prefix[j] - prefix[i] = k, then subarray[i+1:j+1] sums to k
Store prefix sums in hash map for O(1) lookup.
Time: O(n), Space: O(n)
Args:
nums: List of integers
k: Target sum
Returns:
Count of subarrays summing to k
Examples:
>>> solver = SubarraySumSolver()
>>> solver.subarray_sum_count([1,1,1], 2)
2 # [1,1] appears twice
>>> solver.subarray_sum_count([1,2,3], 3)
2 # [3] and [1,2]
"""
if not nums:
return 0
# Map: prefix_sum -> count of occurrences
prefix_count = defaultdict(int)
prefix_count[0] = 1 # Empty prefix
current_sum = 0
count = 0
for num in nums:
current_sum += num
# Check if (current_sum - k) exists
# If yes, those positions form valid subarrays
if current_sum - k in prefix_count:
count += prefix_count[current_sum - k]
prefix_count[current_sum] += 1
return count
def subarray_sum_indices(self, nums: List[int], k: int) -> List[Tuple[int, int]]:
"""
Return all subarray indices (start, end) with sum equal to k.
Time: O(n²) worst case (all subarrays sum to k)
Space: O(n)
Examples:
>>> solver = SubarraySumSolver()
>>> solver.subarray_sum_indices([1,1,1], 2)
[(0, 1), (1, 2)]
"""
if not nums:
return []
# Map: prefix_sum -> list of indices
prefix_indices = defaultdict(list)
prefix_indices[0].append(-1) # Before array starts
current_sum = 0
result = []
for i, num in enumerate(nums):
current_sum += num
target = current_sum - k
if target in prefix_indices:
for start_idx in prefix_indices[target]:
result.append((start_idx + 1, i))
prefix_indices[current_sum].append(i)
return result
def subarray_sum_max_length(self, nums: List[int], k: int) -> int:
"""
Find maximum length of subarray with sum equal to k.
Time: O(n), Space: O(n)
Examples:
>>> solver = SubarraySumSolver()
>>> solver.subarray_sum_max_length([1,-1,5,-2,3], 3)
        4 # [1,-1,5,-2] sums to 3 and has length 4
"""
if not nums:
return 0
# Map: prefix_sum -> first index where it appears
prefix_first = {0: -1}
current_sum = 0
max_length = 0
for i, num in enumerate(nums):
current_sum += num
target = current_sum - k
if target in prefix_first:
max_length = max(max_length, i - prefix_first[target])
# Only store first occurrence of each prefix sum
if current_sum not in prefix_first:
prefix_first[current_sum] = i
return max_length
def subarray_sum_min_length(self, nums: List[int], k: int) -> int:
"""
Find minimum length of subarray with sum equal to k.
Returns -1 if no such subarray exists.
Time: O(n), Space: O(n)
Examples:
>>> solver = SubarraySumSolver()
>>> solver.subarray_sum_min_length([1,2,3,4,5], 9)
2 # [4,5]
"""
if not nums:
return -1
# Map: prefix_sum -> last index where it appears
prefix_last = {0: -1}
current_sum = 0
min_length = float('inf')
for i, num in enumerate(nums):
current_sum += num
target = current_sum - k
if target in prefix_last:
min_length = min(min_length, i - prefix_last[target])
# Always update to latest index
prefix_last[current_sum] = i
return min_length if min_length != float('inf') else -1
def subarray_sum_equals_zero(self, nums: List[int]) -> bool:
"""
Check if there exists a subarray with sum equal to 0.
Time: O(n), Space: O(n)
Examples:
>>> solver = SubarraySumSolver()
>>> solver.subarray_sum_equals_zero([1,-1,2,3])
True # [1,-1]
>>> solver.subarray_sum_equals_zero([1,2,3])
False
"""
if not nums:
return False
seen_sums = {0} # Empty prefix
current_sum = 0
for num in nums:
current_sum += num
if current_sum in seen_sums:
return True
seen_sums.add(current_sum)
return False
def longest_subarray_sum_zero(self, nums: List[int]) -> int:
"""
Find length of longest subarray with sum 0.
Time: O(n), Space: O(n)
Examples:
>>> solver = SubarraySumSolver()
>>> solver.longest_subarray_sum_zero([15,-2,2,-8,1,7,10,23])
5 # [-2,2,-8,1,7]
"""
return self.subarray_sum_max_length(nums, 0)
def count_subarrays_sum_divisible_by_k(self, nums: List[int], k: int) -> int:
"""
Count subarrays with sum divisible by k.
Key insight: (prefix[j] - prefix[i]) % k = 0
means prefix[j] % k = prefix[i] % k
Time: O(n), Space: O(k)
Examples:
>>> solver = SubarraySumSolver()
>>> solver.count_subarrays_sum_divisible_by_k([4,5,0,-2,-3,1], 5)
7
"""
if not nums:
return 0
# Map: remainder -> count
remainder_count = defaultdict(int)
remainder_count[0] = 1 # Empty prefix
current_sum = 0
count = 0
for num in nums:
current_sum += num
remainder = current_sum % k
            # In Python, % with positive k already yields a non-negative
            # result; this guard matters when porting to C/Java-style languages
            if remainder < 0:
                remainder += k
count += remainder_count[remainder]
remainder_count[remainder] += 1
return count
class RangeSumQuery:
"""
Data structure for efficient range sum queries.
Preprocess array to answer sum queries in O(1).
"""
def __init__(self, nums: List[int]):
"""
Preprocess array to build prefix sum array.
Time: O(n), Space: O(n)
"""
self.prefix = [0]
for num in nums:
self.prefix.append(self.prefix[-1] + num)
def sum_range(self, left: int, right: int) -> int:
"""
Return sum of elements from index left to right (inclusive).
Time: O(1), Space: O(1)
Examples:
>>> query = RangeSumQuery([1,2,3,4,5])
>>> query.sum_range(0, 2)
6 # 1+2+3
>>> query.sum_range(2, 4)
12 # 3+4+5
"""
return self.prefix[right + 1] - self.prefix[left]
def count_range_sum_k(self, k: int) -> int:
"""
Count all ranges with sum equal to k.
Time: O(n²), Space: O(1)
"""
count = 0
n = len(self.prefix) - 1
for i in range(n):
for j in range(i, n):
if self.sum_range(i, j) == k:
count += 1
return count
class SubarrayProductSolver:
"""
Related problem: Product instead of sum.
Shows how hash map pattern applies to different operations.
"""
def subarray_product_less_than_k(self, nums: List[int], k: int) -> int:
"""
Count subarrays where product of elements is less than k.
Note: All elements must be positive for this to work.
Time: O(n), Space: O(1)
Examples:
>>> solver = SubarrayProductSolver()
>>> solver.subarray_product_less_than_k([10,5,2,6], 100)
8
"""
if k <= 1:
return 0
count = 0
product = 1
left = 0
for right in range(len(nums)):
product *= nums[right]
while product >= k and left <= right:
product //= nums[left]
left += 1
# All subarrays ending at right and starting from left to right
count += right - left + 1
return count
# Comprehensive Test Suite
def test_subarray_sum():
"""Complete test suite for subarray sum."""
solver = SubarraySumSolver()
# Test 1: Basic count
assert solver.subarray_sum_count([1,1,1], 2) == 2
assert solver.subarray_sum_count([1,2,3], 3) == 2
# Test 2: Empty array
assert solver.subarray_sum_count([], 5) == 0
# Test 3: Single element
assert solver.subarray_sum_count([5], 5) == 1
assert solver.subarray_sum_count([5], 3) == 0
# Test 4: Negative numbers
assert solver.subarray_sum_count([1,-1,0], 0) == 3
# Test 5: Get indices
indices = solver.subarray_sum_indices([1,1,1], 2)
assert len(indices) == 2
assert (0, 1) in indices and (1, 2) in indices
# Test 6: Max length
assert solver.subarray_sum_max_length([1,-1,5,-2,3], 3) == 4
# Test 7: Min length
assert solver.subarray_sum_min_length([1,2,3,4,5], 9) == 2
assert solver.subarray_sum_min_length([1,2,3], 10) == -1
# Test 8: Sum equals zero
assert solver.subarray_sum_equals_zero([1,-1,2,3]) == True
assert solver.subarray_sum_equals_zero([1,2,3]) == False
# Test 9: Longest zero sum
assert solver.longest_subarray_sum_zero([15,-2,2,-8,1,7,10,23]) == 5
# Test 10: Divisible by k
assert solver.count_subarrays_sum_divisible_by_k([4,5,0,-2,-3,1], 5) == 7
# Test 11: Range sum query
query = RangeSumQuery([1,2,3,4,5])
assert query.sum_range(0, 2) == 6
assert query.sum_range(2, 4) == 12
assert query.sum_range(0, 4) == 15
# Test 12: Product less than k
product_solver = SubarrayProductSolver()
assert product_solver.subarray_product_less_than_k([10,5,2,6], 100) == 8
# Test 13: All elements sum to k
assert solver.subarray_sum_count([1,1,1,1], 4) == 1
# Test 14: No subarray sums to k
assert solver.subarray_sum_count([1,2,3], 10) == 0
print("All Subarray Sum tests passed!")
if __name__ == "__main__":
test_subarray_sum()
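The prefix-sum identity behind subarray_sum_count (prefix[j] - prefix[i] = k exactly when nums[i:j] sums to k) can be verified directly against a brute-force count. A small cross-check sketch (the helper names are illustrative):

```python
from collections import defaultdict

def count_fast(nums, k):
    # Hash-map prefix-sum count, same technique as subarray_sum_count.
    seen = defaultdict(int)
    seen[0] = 1                     # empty prefix
    total = count = 0
    for num in nums:
        total += num
        count += seen[total - k]    # prior prefixes at distance k
        seen[total] += 1
    return count

def count_brute(nums, k):
    # O(n^2) reference: test every (start, end) pair directly.
    n = len(nums)
    return sum(1 for i in range(n) for j in range(i, n)
               if sum(nums[i:j + 1]) == k)

sample = [3, 4, 7, 2, -3, 1, 4, 2]
assert count_fast(sample, 7) == count_brute(sample, 7)  # both find 4 subarrays
```

Running both on random arrays with negative values is a good way to confirm the hash-map version handles cases a sliding window cannot.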