Comprehensive Guide to Hash Tables and Hashing
A complete deep-dive into hash tables, hash functions, collision resolution strategies, and their applications in problem-solving.
1. Foundational Understanding
1.1 What, Why, and When Questions
Basic Definition and Scope
What is a Hash Table?
A hash table (also called hash map) is a data structure that implements an associative array abstract data type—a structure that maps keys to values. It uses a hash function to compute an index (hash code) into an array of buckets or slots, from which the desired value can be found.
Precise Definition:
- A hash table is a combination of:
- An array (the underlying storage)
- A hash function (maps keys to array indices)
- A collision resolution strategy (handles when multiple keys hash to same index)
Operations Supported:
- Insert/Put: Add a key-value pair to the table - O(1) average
- Delete/Remove: Remove a key-value pair from the table - O(1) average
- Search/Get: Retrieve the value associated with a key - O(1) average
- Update: Modify the value for an existing key - O(1) average
- Contains: Check if a key exists in the table - O(1) average
- Keys/Values/Entries: Iterate through all keys, values, or key-value pairs - O(n)
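As a concrete illustration, all of these operations map directly onto Python's built-in dict, which is implemented as a hash table:

```python
# Python's built-in dict is a hash table; each operation below
# runs in O(1) average time.
inventory = {}

inventory["apple"] = 5          # Insert/Put
inventory["apple"] = 7          # Update (key exists, value replaced)
print(inventory.get("apple"))   # Search/Get → 7
print("pear" in inventory)      # Contains → False

inventory["pear"] = 3
del inventory["apple"]          # Delete/Remove

for key, value in inventory.items():   # Iterate entries: O(n)
    print(key, value)
```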
What Problems Does It Solve That Simpler Structures Cannot?
- Fast lookups with arbitrary keys: Unlike arrays (which need integer indices), hash tables allow O(1) lookup with any hashable key type (strings, custom objects, etc.)
- Efficient membership testing: Checking if an element exists is O(1) average, compared to O(n) for unsorted arrays or O(log n) for sorted arrays
- Frequency counting and grouping: Perfect for counting occurrences or grouping related items
- Caching and memoization: Store computed results for fast retrieval
- Set operations: Efficient union, intersection, difference operations
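A minimal sketch of the frequency-counting and membership-testing patterns from the list above:

```python
# Frequency counting: O(n) total, O(1) average per element.
words = ["apple", "banana", "apple", "cherry", "banana", "apple"]
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1
print(counts["apple"])   # 3

# Membership testing with a hash set: O(1) average per query.
seen = set(words)
print("banana" in seen)  # True
print("durian" in seen)  # False
```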
Boundaries of the Concept:
Included in Hash Table Family:
- Hash maps (key-value pairs)
- Hash sets (keys only, no values)
- Multisets/bags (allow duplicate keys)
- Concurrent hash tables (thread-safe versions)
- Persistent hash tables (immutable versions)
- Distributed hash tables (DHT)
Excluded:
- Trees (balanced BSTs, tries) - different structure despite some overlap in use cases
- Direct-address tables - no hashing, just direct indexing
- Bloom filters - probabilistic, don't store actual data
Mathematical and Computational Foundations:
Hash tables build upon:
- Array indexing: O(1) random access
- Hash functions: Mapping from large key space to small index space
- Modular arithmetic: Computing indices within array bounds
- Probability theory: Load factor analysis, collision probability
- Amortized analysis: Understanding dynamic resizing costs
What Inspired Its Creation?
Hash tables emerged from the need to:
- Store and retrieve data faster than O(log n) binary search
- Use meaningful keys (strings, objects) instead of just integer indices
- Handle large symbol tables in compilers efficiently
- Implement associative memory in early AI systems
The concept evolved from:
- Scatter storage (1950s): Early ideas of computed indexing
- Symbol tables in compilers: Need for fast identifier lookup
- Database indexing: Efficient record retrieval
- Cache memory: Hardware-level associative lookup
Purpose and Context
Why Was the Hash Table Invented?
In the 1950s, as computers began processing larger datasets, researchers faced a fundamental challenge:
- Arrays required integer indices and had no semantic meaning
- Linked lists required O(n) search time
- Sorted arrays required O(log n) search but O(n) insertion
- Binary search trees weren't yet well-developed
Hans Peter Luhn (IBM, 1953) and others independently developed hashing to achieve near-constant-time operations using a mathematical transformation (hash function) to convert keys into array indices.
Why Is It Preferred Over Simpler Alternatives?
| Operation | Array (unsorted) | Array (sorted) | Linked List | Hash Table |
| --------- | ---------------- | -------------- | ----------- | ---------- |
| Search    | O(n)             | O(log n)       | O(n)        | O(1) avg   |
| Insert    | O(1)             | O(n)           | O(1)        | O(1) avg   |
| Delete    | O(n)             | O(n)           | O(n)        | O(1) avg   |
Hash tables win when:
- You need fast lookups with non-integer keys
- Insert/delete frequency is high
- You can afford extra memory (load factor < 1)
- You don't need ordered traversal
Real-World Motivating Problems:
- Compiler symbol tables (1950s-1960s):
  - Problem: During compilation, need to look up variable names thousands of times
  - Solution: Hash table maps identifier strings → memory locations/types
- Database indexing (1970s):
  - Problem: Retrieve customer records by ID/email instantly
  - Solution: Hash index for O(1) retrieval
- Spell checkers (1980s):
  - Problem: Check if a word exists in a 100,000+ word dictionary
  - Solution: Hash set for O(1) membership test
- Network routing (1990s):
  - Problem: Map IP addresses to MAC addresses (ARP tables)
  - Solution: Hash table for fast address resolution
Limitations Addressed:
- Linear search inefficiency: Hash tables provide O(1) average lookup vs O(n)
- Binary search insertion cost: Hash tables avoid the O(n) insertion time of maintaining sorted order
- Memory rigidity of arrays: Hash tables grow dynamically
- Lack of semantic keys: Arrays use indices; hash tables use meaningful keys
Historical Context
Evolution and Variants:
- 1950s - Birth of Hashing:
  - Hans Peter Luhn (IBM): Internal memorandum on hashing (1953)
  - Arnold Dumey: "Indexing for rapid random access memory" (1956)
  - W.W. Peterson: Analysis of open addressing (1957)
  - Early collision strategies: linear probing
- 1960s - Theoretical Foundation:
  - Bell Labs: Development of chaining
  - Donald Knuth: Systematic analysis in "The Art of Computer Programming"
- 1970s - Refinement:
  - Double hashing: Better probe sequences
  - Universal hashing theory (Carter & Wegman, 1979)
- 1980s-1990s - Practical Variants:
  - Robin Hood hashing (1986): Improved open addressing
  - Perfect hashing: No collisions for static datasets
  - Concurrent hash tables: Lock-free data structures
- 2000s-Present - Modern Innovations:
  - Cuckoo hashing (2001): Worst-case O(1) lookup
  - Swiss Tables (Google, 2017): SIMD-optimized hash tables
  - Consistent hashing: Distributed systems
  - Cryptographic hash tables: Secure against collision attacks
Current Variants:
| Variant         | Key Feature                 | Use Case                       |
| --------------- | --------------------------- | ------------------------------ |
| Chaining        | Linked lists at each bucket | General purpose, variable load |
| Open Addressing | Probing sequence            | Cache-friendly, bounded memory |
| Robin Hood      | Variance reduction          | Predictable performance        |
| Cuckoo          | Worst-case O(1) lookup      | Real-time systems              |
| Hopscotch       | Bounded search              | High performance               |
| Swiss Tables    | SIMD optimization           | Abseil's flat_hash_map (C++)   |
Key Milestones:
- 1953: First hash table implementation (Luhn)
- 1963: Knuth's analysis of linear probing (published analysis by Konheim & Weiss, 1966)
- 1979: Universal hashing proved (Carter & Wegman)
- 1995: SHA-1 standardized (FIPS 180-1)
- 2001: Cuckoo hashing introduced
- 2012: MurmurHash3 widely adopted
- 2017: Google's Swiss Tables (absl::flat_hash_map) set a new bar for hash table performance
Computational Problems Made Solvable:
- Real-time symbol lookup in compilers: Enabled faster compilation
- Large-scale database indexing: Made relational databases practical
- Distributed caching: Memcached, Redis rely on hash tables
- Deduplication at scale: Git uses SHA-1 hashing for content addressing
- Password verification: Cryptographic hashes (bcrypt, scrypt)
Historical Failures and Lessons
Famous Bugs and Failures:
- Perl Hash Collision DoS (2003):
  - Issue: Predictable hash functions allowed attackers to craft inputs that all hashed to the same bucket
  - Impact: POST requests with crafted keys caused O(n²) behavior, enabling DoS attacks
  - Lesson: Hash functions need randomization (salt) to prevent algorithmic complexity attacks
  - Fix: Randomized hash seeds; many languages later switched to SipHash
- Java HashMap Infinite Loop (pre-Java 8):
  - Issue: Concurrent modification during resize could create circular linked lists in buckets
  - Impact: get() operations would loop infinitely
  - Lesson: Need proper synchronization, or use ConcurrentHashMap
  - Fix: Java 8 switched to trees for large buckets and added better resize logic
- Redis Hash Collision Attack (2012):
  - Issue: Predictable hash function allowed attackers to degrade performance
  - Impact: Redis instances became unresponsive
  - Lesson: Even in-memory databases are vulnerable to hash attacks
  - Fix: Implemented hash function randomization
- Git Hash Collision (SHA-1, 2017):
  - Issue: Researchers demonstrated a SHA-1 collision (the SHAttered attack)
  - Impact: Theoretical ability to forge Git commits
  - Lesson: Cryptographic hash functions have a finite lifespan
  - Fix: Git is migrating to SHA-256
- Birthday Paradox Underestimation:
  - Mistake: Developers often underestimate collision probability
  - Example: A 32-bit hash has a 50% collision chance with only 77,163 elements
  - Lesson: The birthday paradox means collisions become likely after about √m insertions into a hash space of size m, not m
  - Fix: Use sufficient hash bit width (64-bit or 128-bit)
Common Historical Misconceptions:
- "Hash tables are always O(1)"
  - Reality: O(1) is the average case; the worst case is O(n) with all keys colliding
  - Modern understanding: Amortized analysis, probabilistic guarantees
- "Any hash function will do"
  - Reality: Poor hash functions cause clustering and performance degradation
  - Modern understanding: Need avalanche effect, good distribution
- "Larger tables always perform better"
  - Reality: Cache misses can make sparse tables slower than denser ones
  - Modern understanding: Balance load factor vs cache locality
- "Chaining is always better than open addressing"
  - Reality: Open addressing has better cache performance on modern CPUs
  - Modern understanding: Hardware matters; Swiss Tables use open addressing
"Obvious" Optimizations That Were Wrong:
- Using multiplication instead of modulo:
  - Thought: Modulo is slow; use (hash * 2654435761) >> shift instead
  - Reality: The multiply-shift result is only as good as the hash's high-order bits; with a weak hash it can distribute worse than modulo
  - Truth: It depends on CPU architecture and hash function
- Always using prime-sized tables:
  - Thought: Prime sizes avoid patterns in the modulo operation
  - Reality: Power-of-2 sizes enable a fast bitwise AND instead of modulo
  - Truth: Prime sizes help only with poor hash functions; good hash functions work with any size
- Pre-allocating huge tables to avoid resizing:
  - Thought: Eliminate resize cost by making the table oversized from the start
  - Reality: Memory waste and cache misses dominate; often slower than resizing
  - Truth: Dynamic resizing with a good load factor threshold (0.7-0.75) is optimal
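The power-of-2 point can be checked directly: when the table size m is a power of 2, `hash % m` is identical to `hash & (m - 1)`, a single AND instruction. A small sketch:

```python
m = 16          # table size, a power of 2
mask = m - 1    # 0b1111

# For any non-negative hash, modulo by a power of 2 equals a bitmask.
for h in [5, 21, 1_000_003, 0xDEADBEEF]:
    assert h % m == h & mask
    print(h, "→ bucket", h & mask)
```

This is exactly why good hash functions remove the need for prime table sizes: the mask keeps only the low-order bits, which is fine as long as those bits are well mixed.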
Real-World Incidents:
- Algorithmic Complexity Attacks (2011):
  - Multiple languages (PHP, Python, Ruby, Java) were vulnerable simultaneously
  - Crafted POST data caused hash collisions → O(n²) behavior
  - Web servers crashed under attack
  - Led to widespread adoption of randomized hashing
- Cloudflare Bug (2017):
  - A buffer overrun in edge cases caused memory disclosure
  - Leaked sensitive data (Cloudbleed)
  - Lesson: Careful bounds checking even in "safe" operations
Evolution of Best Practices:
| Era   | Practice                  | Rationale                               |
| ----- | ------------------------- | --------------------------------------- |
| 1960s | Simple division method    | Fast on limited hardware                |
| 1970s | Multiplication method     | Better distribution                     |
| 1980s | Prime table sizes         | Avoid patterns with poor hash functions |
| 1990s | Cryptographic hashes      | Security-conscious hashing              |
| 2000s | Randomized seeds          | Prevent DoS attacks                     |
| 2010s | SIMD-optimized structures | Leverage modern CPU features            |
| 2020s | Hardware-aware design     | Cache-oblivious algorithms              |
Philosophical and Conceptual
Fundamental Trade-offs:
- Time vs. Space:
  - Lower load factor → faster operations, more memory
  - Higher load factor → more collisions, less memory
  - Sweet spot: 0.7-0.75 load factor for most applications
- Determinism vs. Security:
  - Predictable hash functions → reproducible behavior, vulnerable to attacks
  - Randomized hash functions → secure, non-deterministic iteration order
- Simplicity vs. Performance:
  - Chaining → simple to implement, pointer overhead
  - Open addressing → complex deletion, better cache locality
- Worst-case vs. Average-case:
  - Standard hash table → O(1) average, O(n) worst
  - Cuckoo hashing → O(1) worst-case lookup, more complex insertion
Core Assumptions:
- Good hash function: Keys distribute uniformly across buckets
- Hashability: Keys can be hashed consistently (immutable, implement equality)
- Independence: Hash values are independent random variables (under simple uniform hashing assumption)
- Sufficient memory: Can maintain load factor below threshold
- Amortized cost acceptable: Occasional expensive resize is okay
Paradigm:
Hash tables follow the randomized algorithm paradigm:
- Use randomness (hash function) to achieve expected-case performance
- Trade worst-case guarantees for better average performance
- Rely on probabilistic analysis rather than deterministic bounds
Core Insight:
The fundamental insight of hashing is trading uniqueness for speed:
- We can't guarantee unique locations for keys (collisions happen)
- But we can make collisions rare enough that average-case is constant time
- The hash function compresses a large key space into a small index space
- Mathematically: We accept the pigeonhole principle (|Keys| > |Buckets|) and handle collisions efficiently
Philosophical Connections:
- Pigeonhole Principle: If you have more keys than buckets, collisions are inevitable
- Birthday Paradox: Collisions happen much sooner than intuition suggests
- Locality of Reference: Modern implementations optimize for cache performance
- Probabilistic Reasoning: Performance guarantees are statistical, not deterministic
1.2 Classification and Categorization
Type and Family
Data Structure Type:
- Primary Classification: Non-linear, random access data structure
- Abstract Data Type: Implements the Map/Dictionary ADT and Set ADT
- Storage Category: Array-based with computed indexing
Algorithm Family (for hashing):
- Searching Algorithms: Hash-based search
- String Algorithms: Rolling hash (Rabin-Karp)
- Cryptographic Algorithms: Secure hash functions (SHA-256, bcrypt)
Major Variations:
- By Collision Resolution:
  - Separate Chaining: Each bucket contains a linked list/array/tree
  - Open Addressing: All elements stored in the array itself
    - Linear probing
    - Quadratic probing
    - Double hashing
  - Robin Hood Hashing: Variant of open addressing with fairness
  - Cuckoo Hashing: Multiple hash functions, relocate on collision
  - Hopscotch Hashing: Bounded search neighborhood
- By Purpose:
  - Hash Map: Stores key-value pairs
  - Hash Set: Stores unique keys only
  - Multimap: Allows duplicate keys
  - Multiset/Bag: Allows duplicate elements
- By Implementation:
  - Fixed-size hash table: Static capacity
  - Dynamic hash table: Grows/shrinks with load factor
  - Perfect hash table: No collisions (for static datasets)
  - Minimal perfect hash table: No collisions, minimal space
- By Thread-Safety:
  - Single-threaded: Standard hash table
  - Concurrent hash table: Lock-free or fine-grained locking
  - Distributed hash table (DHT): Spread across multiple machines
- By Persistence:
  - Ephemeral: Standard mutable hash table
  - Persistent: Immutable, preserves old versions
  - Cache-oblivious: Optimized for unknown cache parameters
Most Common Variant:
Separate Chaining with Dynamic Resizing is most common because:
- Simple to implement correctly
- Handles variable load factors gracefully
- Deletion is straightforward
- Predictable performance
However, Open Addressing is gaining popularity in modern systems (e.g., Google's Swiss Tables) due to:
- Better cache locality
- No pointer overhead
- Smaller memory footprint
Distinguishing Factors:
| Aspect             | Chaining               | Open Addressing      | Cuckoo          |
| ------------------ | ---------------------- | -------------------- | --------------- |
| Collision Strategy | Linked structures      | Probe sequence       | Multiple tables |
| Cache Performance  | Poor (pointer chasing) | Excellent            | Good            |
| Deletion           | Easy                   | Complex (tombstones) | Easy            |
| Load Factor        | Can exceed 1.0         | Must stay < 1.0      | Typically < 0.5 |
| Worst-case Lookup  | O(n)                   | O(n)                 | O(1)            |
Domain and Application
Where Most Commonly Used:
- Compilers and Interpreters:
  - Symbol tables: Map identifiers to types, scopes, addresses
  - Constant pools: Deduplicate string/numeric literals
  - Example: Python's dict for variable namespaces
- Databases:
  - Hash indexes: Fast equality lookups (not range queries)
  - Join operations: Hash join algorithm
  - Example: PostgreSQL hash indexes
- Operating Systems:
  - Page tables: Virtual → physical address mapping
  - File systems: Directory structures (ext4, NTFS)
  - Process tables: PID → process structure
  - Buffer cache: Inode hashing
- Networking:
  - Routing tables: IP address lookup
  - ARP cache: IP → MAC address mapping
  - NAT tables: Connection tracking
  - DNS caching
- Caching Systems:
  - Web caches: URL → cached response
  - CDNs: Content distribution
  - Example: Redis, Memcached
- Security:
  - Password storage: Hash + salt
  - Digital signatures: Document integrity
  - Certificate validation: Key fingerprinting
Industries Relying Heavily:
| Industry           | Application       | Specific Use                      |
| ------------------ | ----------------- | --------------------------------- |
| Finance            | Trading systems   | Order matching, position tracking |
| E-commerce         | Product catalogs  | SKU → product info                |
| Social Media       | User data         | User ID → profile                 |
| Gaming             | Game state        | Entity ID → game object           |
| Telecommunications | Call routing      | Phone number → routing info       |
| Bioinformatics     | Sequence analysis | k-mer counting, genome assembly   |
Scale Suitability:
- Small Datasets (< 1,000 elements):
  - Hash tables work but may be overkill
  - Simple linear search might be comparable
  - Memory overhead not justified
- Medium Datasets (1,000 - 1,000,000):
  - Ideal range for hash tables
  - O(1) lookup clearly beats O(log n) trees
  - Memory footprint reasonable
- Large Datasets (1,000,000 - 1,000,000,000):
  - Hash tables excel if they fit in RAM
  - Cache performance becomes critical
  - Consider Swiss Tables or other cache-optimized variants
- Massive Datasets (> 1 billion):
  - May not fit in the memory of a single machine
  - Use distributed hash tables (DHT): Chord, Kademlia
  - Or disk-based hash structures: extendible hashing
- Streaming Data:
  - Use Count-Min Sketch or Bloom filters for approximate answers
  - Standard hash tables for exact tracking of recent items
Problem Type Excellence:
Hash tables excel at:
- Membership testing: "Does this element exist?"
- Counting/frequency: "How many times does X appear?"
- Grouping/partitioning: "Group elements by property"
- Deduplication: "Remove duplicates"
- Caching: "Have I computed this before?"
- Complement lookups: "Find a pair summing to a target" (Two Sum) via O(1) lookups
Hash tables are poor at:
- Ordered operations: Can't iterate in sorted order
- Range queries: Can't find "all keys between X and Y"
- Nearest neighbor: Can't find "closest key to X"
- Prefix searches: Can't find "all keys starting with X"
- (Use tries or prefix trees instead)
2. Theory and Principles
2.1 Core Concepts and Structure
Fundamental Ideas
Underpinning Concepts:
- Hashing Concept:
  - Transform an arbitrary key into a fixed-range integer
  - Key → Hash Function → Hash Code → Index
  - Hash code typically much larger than table size
  - Compression: index = hash_code % table_size
- Simple Uniform Hashing Assumption (SUHA):
  - Each key is equally likely to hash to any bucket
  - Hash values are uniformly distributed
  - Independence: Hash values of different keys are independent
  - This is an ideal; real hash functions approximate it
- Load Factor (α):
  - α = n / m (n = number of elements, m = table size)
  - Critical threshold: typically resize when α > 0.75
  - Determines average collision chain length
- Collision:
  - When two different keys hash to the same index
  - Inevitable by the pigeonhole principle when |keys| > |buckets|
  - Must be handled by a collision resolution strategy
Underlying Mathematical Model:
Hash tables are modeled as:
- Balls and Bins: Throwing n balls (keys) into m bins (buckets)
- Probability Theory: Analyze collision rates using birthday paradox
- Random Variables: Chain length is random variable; analyze expected value
Expected chain length under SUHA:
- E[chain length] = α (load factor)
- With chaining, performance degrades linearly as α grows
- With open addressing, must keep α < 1
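The balls-and-bins model is easy to check empirically. This sketch (assuming uniform random hashing, per SUHA) throws n keys into m buckets and compares the observed fraction of empty buckets with the prediction (1 - 1/m)^n ≈ e^(-α):

```python
import math
import random

def empty_fraction(n, m, seed=1):
    """Throw n balls into m bins; return the fraction of empty bins."""
    rng = random.Random(seed)
    buckets = [0] * m
    for _ in range(n):
        buckets[rng.randrange(m)] += 1
    return buckets.count(0) / m

n, m = 75_000, 100_000            # load factor α = 0.75
observed = empty_fraction(n, m)
expected = math.exp(-n / m)       # P(a given bucket stays empty) ≈ e^(-α)
print(observed, expected)         # the two values are very close
assert abs(observed - expected) < 0.01
```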
Invariants
Must Always Be True:
- Hash Consistency:
  - If key1.equals(key2), then hash(key1) == hash(key2)
  - Violation → the same key could be stored in multiple locations
  - Critical for correctness of get/put/delete
- Capacity Invariant:
  - For open addressing: num_elements < table_size (α < 1)
  - For chaining: no strict limit, but typically num_elements < threshold * table_size
- Probe Sequence Covers All Slots (Open Addressing):
  - Every position must be reachable from any starting position
  - Ensures we can always find an empty slot (when α < 1)
  - E.g., with linear probing: covers all slots trivially
  - E.g., with double hashing: the step size must be coprime to the table size
- Key Immutability:
  - Keys must not change after insertion (or their hash must not change)
  - If a key mutates → its hash changes → the element can't be found
  - Languages like Java enforce this for String, Integer, etc.
Boundary Conditions:
- Empty table: All operations should handle gracefully
- Full table (open addressing): Must trigger resize before truly full
- Load factor threshold: Resize at specific α to maintain performance
- Single element: Edge case for iteration, min/max operations
What Happens If Invariant Violated:
- Hash consistency broken:

  # Bad: a key whose hash depends on mutable state
  # (a raw list isn't even hashable in Python; a custom class shows the hazard)
  class Key:
      def __init__(self, v): self.v = v
      def __hash__(self): return hash(self.v)
      def __eq__(self, other): return self.v == other.v

  d = {}
  k = Key(1)
  d[k] = "value"   # Works initially
  k.v = 2          # Mutates key → hash changes
  print(k in d)    # False: the entry can no longer be found

- Capacity exceeded (open addressing):
  - Infinite loop in the probe sequence looking for an empty slot
  - Modern implementations prevent this by resizing proactively
Restoration Mechanisms:
- Resize operation: When load factor exceeds threshold, create new larger table and rehash all elements
- Rehashing: Recompute all indices for new table size
- Tombstones (open addressing): Mark deleted slots to maintain probe sequences
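The resize-and-rehash mechanism can be sketched in a few lines for a chaining table (bucket lists of (key, value) pairs; the function name is illustrative):

```python
def rehash(buckets):
    """Double the bucket array and reinsert every entry at its new index."""
    new_size = len(buckets) * 2
    new_buckets = [[] for _ in range(new_size)]
    for chain in buckets:
        for key, value in chain:
            # Indices change because the modulus changed.
            new_buckets[hash(key) % new_size].append((key, value))
    return new_buckets

buckets = [[("a", 1), ("b", 2)], [("c", 3)]]   # 2 buckets, α = 1.5
buckets = rehash(buckets)                       # 4 buckets, α = 0.75
# Every entry is still present, just possibly in a different bucket.
assert sorted(kv for chain in buckets for kv in chain) == [("a", 1), ("b", 2), ("c", 3)]
```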
Architecture and Design
Internal Organization:
Hash Table Structure:
┌─────────────────────────────────────┐
│ Hash Function (key → hash code) │
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│ Compression (hash code % size) │
└─────────────────────────────────────┘
↓
┌───────────────────────────────────────┐
│ Bucket Array │
│ ┌───┬───┬───┬───┬───┬───┬───┬───┐ │
│ │ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ │
│ └───┴───┴───┴───┴───┴───┴───┴───┘ │
└───────────────────────────────────────┘
↓
Collision Resolution
(Chaining or Open Addressing)
Components:
- Bucket Array:
  - Fixed-size array, replaced by a larger one when resizing
  - Each slot can hold one or more key-value pairs
  - Size typically a power of 2 (enables fast modulo via bitwise AND)
- Hash Function:
  - Maps a key to a hash code (typically a 32-bit or 64-bit integer)
  - Should distribute keys uniformly
  - Example: hash(key) = (a * key.hashCode() + b) % table_size
- Collision Resolution Mechanism:
  - Chaining: Each bucket points to a linked list/array/tree
  - Open Addressing: Probe sequence to find the next available slot
- Metadata:
  - size: Number of elements currently stored
  - capacity: Size of the underlying array
  - load_factor_threshold: When to resize (e.g., 0.75)
  - hash_seed (optional): Randomization to prevent attacks
Memory Layout:
Separate Chaining:
Bucket Array (contiguous memory):
┌─────┬─────┬─────┬─────┐
│ ptr │ ptr │ ptr │ ptr │ (pointers to chains)
└──┬──┴──┬──┴──┬──┴──┬──┘
│ │ │ │
v v v v
Node Node Node NULL
│ │ │
v v v
Node NULL Node
│ │
v v
NULL NULL
Each Node: [key | value | next_ptr]
Open Addressing (Linear Probing):
Single Array (contiguous memory):
┌──────────┬──────────┬──────────┬──────────┐
│ (k1, v1) │ (k2, v2) │ NULL │ (k3, v3) │
└──────────┴──────────┴──────────┴──────────┘
Probe sequence →
Logical vs. Physical Representation:
- Logical: Abstract map {key1: value1, key2: value2, ...}
- Physical: Array of buckets with collision chains or probe sequences
Pointers and Links:
- Chaining: Each bucket has pointer to first node in chain; nodes link to each other
- Open Addressing: No pointers; use arithmetic to find next probe location
- Cuckoo Hashing: Multiple arrays, each with own hash function
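Putting the components above together, here is a minimal separate-chaining map — an illustrative sketch, not production code (class and method names are my own):

```python
class ChainedHashMap:
    def __init__(self, capacity=8, load_factor=0.75):
        self.buckets = [[] for _ in range(capacity)]
        self.size = 0
        self.load_factor = load_factor

    def _index(self, key):
        # Compression: hash code → bucket index.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                 # key exists: update in place
                chain[i] = (key, value)
                return
        chain.append((key, value))       # collision handled by chaining
        self.size += 1
        if self.size > self.load_factor * len(self.buckets):
            self._resize()

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def _resize(self):
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]
        for chain in old:
            for k, v in chain:
                self.buckets[self._index(k)].append((k, v))

hm = ChainedHashMap()
hm.put("x", 1)
hm.put("x", 2)      # update, not a new entry
hm.put("y", 3)
assert hm.get("x") == 2 and hm.get("y") == 3 and hm.size == 2
```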
Mathematical Foundations
Key Formulas:
- Division Method: h(k) = k mod m
  - Simple and fast
  - m should be prime to avoid clustering
  - Poor if m is a power of 2: only the low-order bits of k are used
- Multiplication Method: h(k) = ⌊m * (k * A mod 1)⌋
  - A is a constant (the golden ratio conjugate ≈ 0.6180339887 works well)
  - k * A mod 1 extracts the fractional part
  - Works well for any m (a power of 2 is fine)
- Universal Hashing: h_ab(k) = ((a * k + b) mod p) mod m
  - Choose random a, b from [0, p-1] (with a ≠ 0), where p is a prime > |U| (the universe of keys)
  - Guarantee: for any two keys k1 ≠ k2, Pr[h(k1) = h(k2)] ≤ 1/m
  - Provides theoretical protection against adversarial inputs
- Load Factor: α = n / m
  - n = number of elements
  - m = table size
  - α determines average chain length (chaining) or probe length (open addressing)
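The three hashing methods above, sketched in code (the constants and prime p are illustrative choices):

```python
import random

def division_hash(k, m):
    """Division method: h(k) = k mod m."""
    return k % m

def multiplication_hash(k, m):
    """Multiplication method with A = golden ratio conjugate."""
    A = 0.6180339887
    return int(m * ((k * A) % 1))   # fractional part, scaled to [0, m)

def make_universal_hash(m, p=2_147_483_647):   # p = 2^31 - 1, a Mersenne prime
    """Universal family: pick random a, b once; return h_ab."""
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

m = 16
h = make_universal_hash(m)
for k in [7, 42, 123_456]:
    assert 0 <= division_hash(k, m) < m
    assert 0 <= multiplication_hash(k, m) < m
    assert 0 <= h(k) < m
```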
Collision Probability (Birthday Paradox):
For m buckets and n keys:
P(no collision) ≈ e^(-n² / 2m)
P(collision) ≈ 1 - e^(-n² / 2m)
For 50% collision chance: n ≈ 1.177√m
Example: 32-bit hash (m = 2³²) has 50% collision with only ~77,000 keys
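The birthday bound is easy to evaluate directly; this sketch solves 1 - e^(-n²/2m) = p for n:

```python
import math

def keys_for_collision_probability(m, p=0.5):
    """Approximate key count for collision probability p among m hash values."""
    return math.sqrt(2 * m * math.log(1 / (1 - p)))

n = keys_for_collision_probability(2**32)
print(round(n))   # → 77163, matching the figure quoted above
assert 77_000 < n < 77_500
```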
Expected Chain Length (Chaining):
Under SUHA, expected number of keys in a bucket:
E[chain length] = α = n/m
Expected Probes (Linear Probing):
Successful search:
E[probes] ≈ 1/2 * (1 + 1/(1-α))
Unsuccessful search:
E[probes] ≈ 1/2 * (1 + 1/(1-α)²)
As α → 1, probes → ∞ (table becomes full)
Theorems Supporting Correctness:
- Universal Hashing Theorem (Carter & Wegman, 1979):
  - For any two distinct keys, the probability of collision is at most 1/m
  - Guarantees good average-case performance even with adversarial input
- Chernoff Bounds:
  - The probability that a chain length exceeds its expected value by a large amount is exponentially small
  - Makes hash tables reliable in practice
Deriving Complexity:
Chaining - Successful Search:
- Expected chain length = α
- Expected comparisons = 1 + α/2
- With constant α, this is O(1)
Chaining - Unsuccessful Search:
- Must traverse entire chain
- Expected comparisons = α
- With constant α, this is O(1)
Open Addressing - Linear Probing:
- Clustering causes longer probe sequences
- Expected probes ≈ 1/(1-α)
- With α = 0.75, about 4 probes on average
Models and Representations
Visualization:
Separate Chaining:
Index Bucket (Chains)
0 → NULL
1 → [("apple", 5)] → NULL
2 → [("banana", 3)] → [("berry", 7)] → NULL
3 → NULL
4 → [("cherry", 9)] → NULL
5 → [("date", 2)] → [("durian", 4)] → NULL
6 → NULL
7 → [("fig", 6)] → NULL
Open Addressing (Linear Probing):
Index Key Value
0 "apple" 5
1 "banana" 3
2 "cherry" 9
3 <empty>
4 "date" 2
5 "fig" 6
6 <empty>
7 "grape" 8
If inserting "avocado" and hash("avocado") = 0:
- Bucket 0 occupied → probe bucket 1
- Bucket 1 occupied → probe bucket 2
- Bucket 2 occupied → probe bucket 3
- Bucket 3 empty → insert here
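The probing walkthrough above can be sketched directly (a smaller illustrative table; `None` marks an empty slot, and the hash function is forced to 0 to trigger the collision):

```python
def probe_insert(table, key, value, hash_fn):
    """Linear probing: scan from hash_fn(key) until an empty slot or the key."""
    m = len(table)
    start = hash_fn(key) % m
    for step in range(m):
        slot = (start + step) % m
        if table[slot] is None or table[slot][0] == key:
            table[slot] = (key, value)
            return slot
    raise RuntimeError("table full")

table = [("apple", 5), ("banana", 3), ("cherry", 9), None, ("date", 2)]
slot = probe_insert(table, "avocado", 1, hash_fn=lambda k: 0)  # collides at 0
assert slot == 3   # probes buckets 0, 1, 2 (occupied), lands in bucket 3
```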
Cuckoo Hashing:
Table 1 (hash1): Table 2 (hash2):
┌───┬─────────┐ ┌───┬─────────┐
│ 0 │ "apple" │ │ 0 │ "date" │
│ 1 │ <empty> │ │ 1 │ "fig" │
│ 2 │ "banana"│ │ 2 │ <empty> │
│ 3 │ "cherry"│ │ 3 │ "grape" │
└───┴─────────┘ └───┴─────────┘
Key exists if found in table1[hash1(key)] OR table2[hash2(key)]
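That two-slot lookup rule can be sketched as follows. The hash functions are toy choices for illustration, and the placement loop is simplified (real cuckoo insertion evicts and relocates on collision; here the keys happen to fit without eviction):

```python
def cuckoo_contains(t1, t2, key, h1, h2):
    """A key may live in exactly one of two fixed slots — check both."""
    return t1[h1(key)] == key or t2[h2(key)] == key

# Toy hash functions for the sketch.
h1 = lambda k: sum(map(ord, k)) % 4
h2 = lambda k: (sum(map(ord, k)) // 4) % 4

t1, t2 = [None] * 4, [None] * 4
# Simplified placement: table-1 slot if free, else table-2 slot.
for key in ["apple", "banana", "cherry", "fig"]:
    if t1[h1(key)] is None:
        t1[h1(key)] = key
    else:
        t2[h2(key)] = key

assert cuckoo_contains(t1, t2, "apple", h1, h2)
assert not cuckoo_contains(t1, t2, "grape", h1, h2)
```

Because lookup touches at most two slots, it is worst-case O(1) regardless of how insertions went.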
Alternative Mental Models:
-
Phone Book Analogy:
- Hash function = "Turn to the section for first letter of last name"
- Bucket = "Page of names starting with that letter"
- Collision = "Multiple people with same first letter"
-
Library Catalog:
- Hash function = "Dewey Decimal System category"
- Bucket = "Shelf for that category"
- Collision = "Multiple books in same category"
-
Post Office Boxes:
- Hash function = "PO Box number assignment"
- Bucket = "Individual box"
- Collision = "Shared box for family members"
Notation:
- T[i]: Bucket at index i
- h(k): Hash value of key k
- h'(k): Secondary hash function (double hashing)
- k ↦ v: Key k maps to value v
- α = n/m: Load factor
- ⟨k, v⟩: Key-value pair
2.2 Logical Framework
Invariants and Properties
Core Invariants:
- Hash Determinism:
  - The same key always produces the same hash value
  - Formally: ∀x, h(x) is a function (not a relation)
- Equality Consistency:
  - If keys are equal, hashes must be equal
  - Formally: k₁ = k₂ ⇒ h(k₁) = h(k₂)
  - Contrapositive: h(k₁) ≠ h(k₂) ⇒ k₁ ≠ k₂
- Uniqueness Invariant (for maps):
  - Each key appears at most once
  - Inserting a duplicate key updates the existing value
- Capacity Constraint (open addressing):
  - n < m always (number of elements < table size)
Boundary Conditions:
- α = 0: Empty table, all operations trivial
- α → 1 (open addressing): Performance degrades severely, must resize
- α ≫ 1 (chaining): Long chains, approaching O(n) search
Invariant Violation Consequences:
- Hash changes for same key:

  import random

  class BadKey:
      def __init__(self, val):
          self.val = val
      def __hash__(self):
          return random.randint(0, 100)  # WRONG: non-deterministic

  d = {}
  key = BadKey(5)
  d[key] = "value"
  print(d[key])  # May fail: hash changed between put and get

- Mutable key:

  d = {}
  key = [1, 2, 3]      # Lists are mutable, can't be hashed
  d[key] = "value"     # Raises TypeError in Python
Restoration After Violation:
- Resize operation: Rehash all elements to restore load factor invariant
- Deletion with tombstones: Maintain probe sequence integrity in open addressing
- Rebuild hash table: If corruption detected, reconstruct from scratch
Correctness and Proofs
Algorithm Correctness:
Theorem: Hash table correctly implements the Map ADT under SUHA.
Proof Sketch:
- Insertion: Key k hashed to index i, stored at T[i] (or chain at T[i])
- Retrieval: Key k hashed to same index i, found at T[i]
- Deletion: Key k hashed to index i, removed from T[i]
- Correctness: Since h(k) is deterministic, same key always maps to same location
Loop Invariant (for searching in chaining):
def search(table, key):
index = hash(key) % len(table)
node = table[index]
while node is not None:
if node.key == key:
return node.value
node = node.next
return None
Invariant: At the start of each iteration, if key exists in the chain, it's either at node or later in the chain.
Initialization: True before first iteration (key is either at first node or later). Maintenance: If key not at current node, move to next; invariant still holds. Termination: Either found key or reached end of chain (key doesn't exist).
Proof of Termination:
- Chains are finite (acyclic linked lists)
- Each iteration advances to next node
- Eventually reaches end (NULL)
- Therefore, loop terminates
Open Addressing Termination:
For linear probing with α < 1:
- There exists at least one empty slot
- Probe sequence covers all slots
- Therefore, will eventually find empty slot or key
Optimal Substructure:
Hash tables don't directly exhibit optimal substructure (not a DP problem), but collision resolution has recursive structure:
Chaining: Insert into chain is a linked list insertion subproblem Cuckoo hashing: Eviction creates subproblem of inserting evicted element
Assumptions and Requirements
Preconditions:
- Keys must be hashable:
  - Immutable (or hash based only on immutable properties)
  - Implement __hash__() and __eq__() (Python) or hashCode() and equals() (Java)
- Sufficient memory:
  - Enough memory to keep the load factor below the threshold
  - Can allocate a new array during resize
- Hash function available:
  - Efficient to compute
  - Good distribution properties
Postconditions:
After put(key, value):
- Key exists in the table
- get(key) returns value
- Size incremented by 1 (if the key was new)
After delete(key):
- Key no longer in the table
- get(key) returns NULL/None
- Size decremented by 1
After resize():
- Load factor below threshold
- All keys still retrievable
- Indices may have changed (rehashing occurred)
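These postconditions can be expressed as executable checks against Python's dict, which is a hash table (resizing happens implicitly as the dict grows):

```python
d = {}

# After put(key, value): key exists, get returns value, size grew by 1.
d["k"] = 42
assert "k" in d and d.get("k") == 42 and len(d) == 1

# After resize (triggered internally by growth): all keys still retrievable.
for i in range(1000):                 # forces several internal resizes
    d[i] = i * i
assert all(d[i] == i * i for i in range(1000))

# After delete(key): key gone, get returns None, size shrank by 1.
del d["k"]
assert "k" not in d and d.get("k") is None and len(d) == 1000
```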
Assumption Violations:
- Mutable keys:

  # Violation
  my_dict = {}
  key = [1, 2]            # Mutable
  my_dict[key] = "value"  # Python raises: TypeError: unhashable type: 'list'

- Inconsistent equals/hash:

  // Java example
  class BadKey {
      int val;
      @Override
      public boolean equals(Object o) { return ((BadKey) o).val == this.val; }
      // Missing hashCode() override
      // Violates contract: equal objects must have equal hash codes
  }

- Hash function returns different values:

  import random

  class BadHash:
      def __hash__(self):
          return random.randint(0, 1000)  # WRONG: non-deterministic
3. Implementation and Mechanics
3.1 How It Works
Internal Workings
Data Flow During Operations:
- Insertion Flow:

Key + Value
  ↓
Compute hash code: hash(key)
  ↓
Compress to index: hash(key) % table_size
  ↓
Check bucket at index:
  - Empty? → Insert directly
  - Occupied? → Handle collision

- Search Flow:

Key
  ↓
Compute hash code: hash(key)
  ↓
Compress to index: hash(key) % table_size
  ↓
Check bucket:
  - Chaining: Traverse linked list, compare keys
  - Open Addressing: Probe sequence, compare keys
  ↓
Found? → Return value
Not found? → Return NULL/None
Element Storage:
Chaining:
- Bucket array stores pointers to first node of chain
- Each node stores: (key, value, next_pointer)
- Nodes allocated dynamically on heap
Open Addressing:
- Bucket array stores (key, value) pairs directly
- Empty slots marked as NULL or special sentinel
- Deleted slots marked with tombstone (DELETED marker)
Modification:
- Update: Find key, replace value in place
- Delete: Remove entry, potentially leave tombstone
- Resize: Allocate new array, rehash all entries
Bit/Byte Level:
Hash Computation (for integers):
def hash_int(key, table_size):
    # Step 1: Mix bits using multiplication
    key = ((key >> 16) ^ key) * 0x45d9f3b
    key = ((key >> 16) ^ key) * 0x45d9f3b
    key = (key >> 16) ^ key
    # Step 2: Reduce to table range
    index = key % table_size
    return index
String Hashing (polynomial rolling hash):
def hash_string(s, table_size):
    hash_val = 0
    prime = 31  # Small prime
    for char in s:
        hash_val = (hash_val * prime + ord(char)) % (2**32)
    return hash_val % table_size
At byte level:
- Each character converted to ASCII/Unicode value
- Accumulated into hash using multiplication and addition
- Final modulo ensures result fits in table
Operation Mechanics
Insert Operation (Separate Chaining):
Step-by-step:
- Compute index = hash(key) % table_size
- Access bucket at table[index]
- Traverse chain checking for existing key
- If found: Update value
- If not found: Create new node, prepend to chain
- Increment size counter
- If size/capacity > load_threshold: resize and rehash
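The insertion steps above can be sketched as a minimal chaining `put` (the `Node` class and fixed bucket count are illustrative assumptions; the size counter and resize step are omitted for brevity):

```python
class Node:
    def __init__(self, key, value, nxt=None):
        self.key, self.value, self.next = key, value, nxt

def put(table, key, value):
    """Insert or update (key, value) in a chained bucket array."""
    index = hash(key) % len(table)          # Step 1: compute index
    node = table[index]                     # Step 2: access bucket
    while node is not None:                 # Step 3: scan for existing key
        if node.key == key:
            node.value = value              # Found: update in place
            return
        node = node.next
    table[index] = Node(key, value, table[index])  # Not found: prepend

table = [None] * 7
put(table, "apple", 5)
put(table, "apple", 6)   # Updates the existing node, no duplicate
```

Prepending to the chain keeps insertion O(1) once the duplicate check is done; appending would require walking to the tail.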
Delete Operation (Separate Chaining):
- Compute index = hash(key) % table_size
- Access bucket at table[index]
- Traverse chain with two pointers (prev, current)
- If key found:
- If at head: table[index] = current.next
- Else: prev.next = current.next
- Decrement size counter
- If not found: No-op or raise exception
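The two-pointer deletion above, as a minimal sketch (the `Node` class and the single-bucket table that forces collisions are illustrative assumptions):

```python
class Node:
    def __init__(self, key, value, nxt=None):
        self.key, self.value, self.next = key, value, nxt

def delete(table, key):
    """Unlink key from its chain using prev/current pointers."""
    index = hash(key) % len(table)
    prev, current = None, table[index]
    while current is not None:
        if current.key == key:
            if prev is None:                  # Key at head of chain
                table[index] = current.next
            else:                             # Key mid-chain
                prev.next = current.next
            return True
        prev, current = current, current.next
    return False                              # Not found: no-op

# One bucket, so every key collides and forms a chain: c -> b -> a
table = [None]
for k in ("a", "b", "c"):
    table[0] = Node(k, ord(k), table[0])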
Search Operation:
- Compute index = hash(key) % table_size
- Access bucket at table[index]
- Chaining: Traverse linked list, compare each key
- Open Addressing: Probe sequence until found or empty slot
- Return value if found, else None
Insert Operation (Open Addressing - Linear Probing):
1. index = hash(key) % table_size
2. i = 0
3. while True:
4. probe_index = (index + i) % table_size
5. if table[probe_index] is EMPTY or DELETED:
6. table[probe_index] = (key, value)
7. size++
8. break
9. if table[probe_index].key == key:
10. table[probe_index].value = value # Update
11. break
12. i++
13. if size/capacity > threshold: resize()
Delete Operation (Open Addressing - with Tombstones):
1. index = hash(key) % table_size
2. i = 0
3. while True:
4. probe_index = (index + i) % table_size
5. if table[probe_index] is EMPTY:
6. return # Not found
7. if table[probe_index].key == key:
8. table[probe_index] = DELETED # Tombstone
9. size--
10. break
11. i++
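The two probe loops above translate into a small runnable table (the sentinel objects and fixed capacity are illustrative assumptions; like the pseudocode, this `put` reuses the first tombstone it meets, so a production table would keep probing to rule out a duplicate key past the tombstone):

```python
EMPTY, DELETED = object(), object()   # Sentinels for slot states

class ProbingTable:
    def __init__(self, capacity=8):
        self.slots = [EMPTY] * capacity

    def put(self, key, value):
        index = hash(key) % len(self.slots)
        for i in range(len(self.slots)):
            probe = (index + i) % len(self.slots)
            slot = self.slots[probe]
            if slot is EMPTY or slot is DELETED:
                self.slots[probe] = (key, value)
                return
            if slot[0] == key:
                self.slots[probe] = (key, value)  # Update existing key
                return
        raise RuntimeError("table full")

    def get(self, key):
        index = hash(key) % len(self.slots)
        for i in range(len(self.slots)):
            probe = (index + i) % len(self.slots)
            slot = self.slots[probe]
            if slot is EMPTY:                 # Stop at EMPTY, not DELETED
                return None
            if slot is not DELETED and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key):
        index = hash(key) % len(self.slots)
        for i in range(len(self.slots)):
            probe = (index + i) % len(self.slots)
            slot = self.slots[probe]
            if slot is EMPTY:
                return
            if slot is not DELETED and slot[0] == key:
                self.slots[probe] = DELETED   # Tombstone keeps probe chains intact
                return
```

Note how `get` skips over DELETED slots but halts at EMPTY: that asymmetry is exactly why tombstones are needed.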
Resize Operation:
1. old_table = table
2. table = new array of size (2 * old_capacity) or next prime
3. size = 0
4. for each bucket in old_table:
5. if chaining:
6. for each node in chain:
7. insert(node.key, node.value) # Rehash into new table
8. else:
9. if bucket not EMPTY and not DELETED:
10. insert(bucket.key, bucket.value)
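The rehash loop above, sketched with Python lists of (key, value) pairs standing in for linked-node chains (an assumption made for brevity):

```python
def resize(table):
    """Double the bucket count and rehash every entry."""
    new_table = [[] for _ in range(2 * len(table))]
    for bucket in table:
        for key, value in bucket:
            # Indices change: different table size → different % result
            new_table[hash(key) % len(new_table)].append((key, value))
    return new_table

table = [[] for _ in range(4)]
for k in range(10):
    table[hash(k) % 4].append((k, k * k))
table = resize(table)
```

Every entry survives the move, but its bucket index is recomputed against the new capacity.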
Algorithm Execution
Typical Execution Flow (Chaining):
# Example: Inserting multiple elements
hash_table = HashTable(size=7)
hash_table.put("apple", 5)
# 1. hash("apple") = 47897 → 47897 % 7 = 3
# 2. table[3] is empty
# 3. Create node ("apple", 5), set table[3] = node
hash_table.put("banana", 3)
# 1. hash("banana") = 52383 → 52383 % 7 = 2
# 2. table[2] is empty
# 3. Create node ("banana", 3), set table[2] = node
hash_table.put("apricot", 7)
# 1. hash("apricot") = 47904 → 47904 % 7 = 3
# 2. table[3] is occupied (collision!)
# 3. Create node ("apricot", 7)
# 4. Prepend to chain: node.next = table[3], table[3] = node
Phases of Hash Table Lifecycle:
-
Initialization:
- Allocate array of specified size
- Initialize all buckets to NULL/empty
- Set size = 0, capacity = initial_size
-
Growth Phase:
- Elements inserted
- Load factor increases
- When α > threshold: resize (typically at α = 0.75)
-
Steady State:
- Mix of insertions, deletions, lookups
- Occasional resizes maintain load factor
- Performance remains O(1) average
-
Shrink Phase (optional):
- If many deletions, may resize down to save memory
- Less common in practice
Iteration Lifecycle:
For linear probing insertion:
Iteration 0: Check table[hash(key) % size]
Iteration 1: Check table[(hash(key) + 1) % size]
Iteration 2: Check table[(hash(key) + 2) % size]
...
Until: Empty slot found or key matched
State and Transitions
Possible States:
- Empty: No elements, all buckets NULL
- Partially Full: 0 < α < threshold
- Resize Triggered: α ≥ threshold, needs expansion
- Resizing: In process of allocating new array and rehashing
- Full (open addressing): α → 1, imminent performance collapse
State Transitions:
Empty --[insert]--> Partially Full
Partially Full --[insert (α > threshold)]--> Resize Triggered
Resize Triggered --[resize()]--> Resizing
Resizing --[rehash complete]--> Partially Full
Partially Full --[delete all]--> Empty
Triggers for State Changes:
- Insert: May trigger resize if α exceeds threshold
- Delete: May trigger shrink if α falls below lower threshold (rare)
- Resize: Triggered automatically by insert/delete
Internal Bookkeeping:
- size: Current number of elements
- capacity: Size of underlying array
- threshold: Size at which to resize (e.g., 0.75 * capacity)
- mod_count (optional): For fail-fast iterators, to detect concurrent modification
Open Addressing Specific:
- Track number of DELETED slots to decide when to rebuild (remove tombstones)
3.2 Dependencies and Interactions
Internal Dependencies
Component Interactions:
-
Hash Function ↔ Compression:
- Hash function produces hash code (any integer)
- Compression function (% table_size) maps to valid index
- Quality of hash function determines distribution
-
Collision Resolution ↔ Bucket Structure:
- Chaining requires linked list or dynamic array per bucket
- Open addressing requires probe sequence calculation
- Choice of resolution strategy determines memory layout
-
Load Factor ↔ Resize:
- Load factor monitored after each insert/delete
- Triggers resize when threshold exceeded
- Resize involves rehashing, which uses hash function again
Operation Dependencies:
- Resize depends on Insert/Delete: Size changes trigger load factor check
- Insert depends on Search: Must check if key exists before inserting
- Delete depends on Search: Must find key before deletion
Concurrent Modifications:
- Problem: Two threads inserting simultaneously can corrupt structure
- Chaining: May lose elements if both modify same chain
- Open Addressing: Probe sequences can be broken, elements lost
Example Failure (Java):
// Thread 1: Inserting during resize
HashMap<String, Integer> map = new HashMap<>();
// Thread 1 calls put() → triggers resize
// Thread 2 calls put() or get() during resize
// Result: lost entries, corrupted buckets, or (in pre-Java-8 HashMap)
// an infinite loop caused by a cycle in a rehashed chain
Solution: Use ConcurrentHashMap with fine-grained locking or lock-free algorithms
Causal Chains
Single Operation Cascading Effects:
Insert causing resize:
Insert "key" with α = 0.75 (at threshold)
→ size++ makes α = 0.76
→ Triggers resize()
→ Allocate new array (2x size)
→ Rehash ALL existing elements
→ Hash codes recomputed
→ Indices change (different table size → different % result)
→ All elements move to new locations
→ Old array garbage collected
Delete in open addressing:
Delete "key" at index i
→ Place DELETED tombstone at i
→ Future searches must continue past tombstone
→ Future inserts can reuse tombstone slot
→ If too many tombstones: degrade to O(n) search
→ Eventually triggers rebuild to remove tombstones
Hash collision propagation:
Poor hash function causes clustering
→ Multiple keys hash to same region
→ Collisions increase
→ Chains grow longer (chaining) or probes increase (open addressing)
→ Search time increases
→ More likely to trigger resize
→ Resize computes new hashes (hopefully better distribution)
Feedback Loops:
-
Negative Feedback (resize):
- High load factor → slow operations
- Slow operations → trigger resize
- Resize → lower load factor
- Lower load factor → faster operations
-
Positive Feedback (clustering in poor hash functions):
- Cluster forms → more collisions in that region
- Collisions → longer probe sequences
- Longer probes → more likely to add to cluster
- Cluster grows → even worse performance
Conditional Factors
External Factors Affecting Behavior:
-
Input Size:
- Small n: Hash table overhead may dominate, simple array faster
- Large n: Hash table O(1) advantage clear
-
Input Distribution:
- Uniformly random keys → excellent performance
- Clustered keys (e.g., sequential IDs) → depends on hash function quality
- Adversarial input (all hash to same bucket) → O(n) worst case
-
Key Types:
- Integers: Fast to hash (bitwise operations)
- Strings: Slower to hash (must process each character)
- Custom objects: Hash quality depends on implementation
Performance Variation:
| Input Pattern              | Chaining   | Open Addressing |
| -------------------------- | ---------- | --------------- |
| Random keys                | O(1)       | O(1)            |
| Sequential keys, good hash | O(1)       | O(1)            |
| Sequential keys, bad hash  | O(n) worst | O(n) worst      |
| Many duplicates            | O(1)       | O(1)            |
| All collide                | O(n)       | O(n)            |
Environmental Factors:
-
Cache Size:
- Small cache: Chaining suffers (pointer chasing)
- Large cache: Chaining benefits from prefetching
- Open addressing generally better cache performance
-
Processor:
- Branch predictor quality affects conditional-heavy code
- SIMD capabilities: Swiss Tables use SIMD for faster lookup
-
Memory Allocator:
- Chaining: Many small allocations (node per element)
- Open addressing: Few large allocations (array only)
- Allocator speed affects chaining more
-
Programming Language:
- Python: dict is built-in, optimized open addressing
- Java: HashMap uses chaining; since Java 8, buckets whose chains exceed 8 nodes are converted to red-black trees
- C++: std::unordered_map uses chaining (effectively required by the standard's bucket API); open-addressing alternatives such as Abseil's Swiss Tables (absl::flat_hash_map) are popular drop-in replacements
Emergent Properties
Complex Behaviors from Simple Rules:
-
Clustering (Linear Probing):
- Simple rule: "If occupied, try next slot"
- Emergence: Long contiguous runs of occupied slots form
- Consequence: "Rich get richer" - clusters attract more collisions
-
Load Factor Sweet Spot:
- Simple rule: "Resize when α > 0.75"
- Emergence: Trade-off between memory and speed naturally optimizes
- Consequence: Industry standard of 0.75 emerges across implementations
-
Hash Function Sensitivity:
- Simple rule: "Map key to integer"
- Emergence: Tiny changes in hash function drastically affect distribution
- Example: Changing prime in polynomial hash changes entire distribution
Local Operations → Global Properties:
-
Local: Insert single element into bucket
-
Global: Distribution of chain lengths follows Poisson distribution (under SUHA)
-
Local: Single key hashes to index
-
Global: Birthday paradox means collisions likely with √m elements
Patterns from Repeated Operations:
-
Pattern: Insert, insert, insert → resize → insert, insert, insert → resize
-
Emergence: Geometric growth pattern (table size: 16, 32, 64, 128, ...)
-
Pattern: Insert many elements with same hash prefix
-
Emergence: Primary clustering in linear probing (long runs of occupied slots)
5. Complexity Analysis
5.1 Time Complexity
Operation-Wise Analysis
Hash Table Operations - Separate Chaining:
| Operation | Best Case | Average Case | Worst Case | Amortized |
| --------- | --------- | ------------ | ---------- | --------- |
| Search    | O(1)      | O(1 + α)     | O(n)       | O(1)      |
| Insert    | O(1)      | O(1 + α)     | O(n)       | O(1)      |
| Delete    | O(1)      | O(1 + α)     | O(n)       | O(1)      |
| Resize    | -         | O(n)         | O(n)       | O(1)      |
Where α = n/m (load factor), n = number of elements, m = table size.
Hash Table Operations - Open Addressing (Linear Probing):
| Operation             | Best Case | Average Case | Worst Case |
| --------------------- | --------- | ------------ | ---------- |
| Search (successful)   | O(1)      | O(1/(1-α))   | O(n)       |
| Search (unsuccessful) | O(1)      | O(1/(1-α)²)  | O(n)       |
| Insert                | O(1)      | O(1/(1-α)²)  | O(n)       |
| Delete                | O(1)      | O(1/(1-α))   | O(n)       |
Complexity Analysis by Hash Table Topic:
Topic 58: Hash Table Fundamentals
- Create empty table: O(m) where m is initial capacity
- Hash function computation: O(k) where k is key size
- Index computation: O(1)
Topic 59: Hash Functions and Collision Handling
- Division method: O(1)
- Multiplication method: O(1)
- Universal hashing: O(1) for computation
- Polynomial rolling hash: O(k) for string of length k
- Collision resolution (chaining): O(chain length) = O(1 + α) average
- Collision resolution (probing): O(probes) = O(1/(1-α)) average
Topic 60: Chaining vs Open Addressing
Chaining:
- Insert: O(1) if inserting at head, O(n) if checking duplicates
- Search: O(1 + α) average
- Delete: O(1 + α) average
- Supports α > 1 (load factor can exceed 1)
Open Addressing (Linear Probing):
- Insert: O(1/(1-α)²) average
- Search: O(1/(1-α)) average
- Delete: O(1/(1-α)) with tombstones
- Must maintain α < 1
Open Addressing (Double Hashing):
- Better distribution than linear probing
- Insert: O(1/(1-α)) average
- Search: O(1/(1-α)) average
Topic 61: Load Factor and Rehashing
- Load factor computation: O(1)
- Rehashing operation: O(n + m) where n = elements, m = old capacity
- Amortized cost per insert with rehashing: O(1)
- Typical threshold: α = 0.75
Topic 62: HashMap Implementation
- get(key): O(1) average, O(n) worst
- put(key, value): O(1) average, O(n) worst
- remove(key): O(1) average, O(n) worst
- containsKey(key): O(1) average, O(n) worst
- keySet/values/entries iteration: O(n + m)
Topic 63: HashSet Implementation
- add(element): O(1) average, O(n) worst
- remove(element): O(1) average, O(n) worst
- contains(element): O(1) average, O(n) worst
- size(): O(1)
- isEmpty(): O(1)
Topic 64: Two Sum Problem and Variations
Two Sum:
Time: O(n) - single pass with hash map
Space: O(n) - store n elements in hash map
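The single-pass hash-map approach described above, sketched:

```python
def two_sum(nums, target):
    """Return indices of two numbers summing to target, else None."""
    seen = {}                          # value -> index of earlier element
    for i, x in enumerate(nums):
        if target - x in seen:         # Complement already visited: O(1) lookup
            return [seen[target - x], i]
        seen[x] = i
    return None
```

Storing each value's index as we go means every pair is examined exactly once, giving O(n) time for the n lookups and inserts.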
Two Sum II (sorted array):
Time: O(n) - two pointer approach (no hash table needed)
Space: O(1) - two pointers only
Three Sum:
Time: O(n²) - O(n) for outer loop, O(n) for two-sum per iteration
Space: O(n) - hash set for duplicates
Four Sum:
Time: O(n³) - three nested loops
Space: O(n) - hash set
Topic 65: Group Anagrams Using Hash Map
Time: O(n * k log k) where n = number of strings, k = max string length
- Sorting each string: O(k log k)
- n strings total: O(n * k log k)
Alternative with counting:
Time: O(n * k) where k is string length
- Count chars for each string: O(k)
- n strings: O(n * k)
Space: O(n * k) - storing all strings in groups
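The sorted-key grouping above can be sketched in a few lines:

```python
from collections import defaultdict

def group_anagrams(strs):
    """Group words sharing the same sorted-character key: O(n * k log k)."""
    groups = defaultdict(list)
    for s in strs:
        groups["".join(sorted(s))].append(s)   # Anagrams share one key
    return list(groups.values())
```

Swapping the sorted key for a 26-entry character-count tuple gives the O(n * k) counting variant.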
Topic 66: Longest Consecutive Sequence
Time: O(n) - add all to set O(n), check sequences O(n)
Space: O(n) - hash set storing n elements
Key insight: Each sequence only checked once from its start
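A minimal sketch of that key insight — only an element with no predecessor in the set starts a count:

```python
def longest_consecutive(nums):
    """O(n): each sequence is walked exactly once, from its smallest element."""
    num_set = set(nums)
    best = 0
    for x in num_set:
        if x - 1 not in num_set:          # x is the start of a sequence
            length = 1
            while x + length in num_set:  # Extend while successors exist
                length += 1
            best = max(best, length)
    return best
```

Without the `x - 1 not in num_set` guard the inner walk would restart inside every sequence, degrading to O(n²).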
Topic 67: Subarray Sum Equals K
Time: O(n) - single pass through array
Space: O(n) - hash map storing prefix sums
Optimization: Prefix sum + hash map approach
Without hash map: O(n²) nested loops
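The prefix-sum + hash-map optimization above, sketched:

```python
def subarray_sum(nums, k):
    """Count subarrays summing to k via prefix sums in a hash map: O(n)."""
    count = 0
    prefix = 0
    seen = {0: 1}                          # prefix sum -> times seen so far
    for x in nums:
        prefix += x
        count += seen.get(prefix - k, 0)   # Earlier prefix closing a k-sum window
        seen[prefix] = seen.get(prefix, 0) + 1
    return count
```

Each earlier prefix equal to `prefix - k` marks the left edge of a subarray summing to k, so one pass replaces the O(n²) nested loops.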
Topic 68: Designing Hash Map from Scratch
- Initial allocation: O(m) where m = capacity
- All operations: Same as standard hash map above
- Implementing custom hash function: Depends on key type
- Implementing collision resolution: Depends on strategy
Topic 69: Rolling Hash (Rabin-Karp Algorithm)
Pattern matching in text:
Time: O(n + m) average, O(nm) worst
where n = text length, m = pattern length
Rolling hash computation: O(1) per character
Initial hash: O(m) for pattern
Hash update: O(1) per slide
Space: O(1) - only store hash values
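A minimal Rabin-Karp sketch showing the O(m) initial hash and the O(1) roll per slide (the base and modulus are illustrative choices):

```python
def rabin_karp(text, pattern, base=256, mod=10**9 + 7):
    """Return all starting indices where pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)             # Weight of the outgoing character
    p_hash = t_hash = 0
    for i in range(m):                       # Initial hashes: O(m)
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        # Verify on hash match to rule out spurious collisions
        if t_hash == p_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:                        # Roll the window: O(1)
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return matches
```

The explicit string comparison on a hash match is what keeps the worst case O(nm) while the average stays O(n + m).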
Overall Algorithm Complexity
Deriving Hash Table Complexity from First Principles:
Successful Search Analysis (Chaining):
Under Simple Uniform Hashing Assumption (SUHA):
- Each of n keys equally likely to hash to any of m buckets
- Expected number of keys in a bucket = α = n/m
Expected search time:
E[search time] = 1 + E[chain length]
= 1 + α
= O(1 + α)
With constant load factor (α ≤ 0.75), this is O(1).
Insert Analysis (Open Addressing - Linear Probing):
Probability that the first i probed slots are all occupied: ≈ α^i (uniform-hashing approximation, ignoring clustering)
Expected number of probes:
E[probes] = Σ(i=0 to ∞) α^i
= 1/(1-α) (geometric series)
For α = 0.75: E[probes] ≈ 4
Rehashing Amortized Analysis:
Suppose we double table size when α > 0.75:
- Rehash at n = 1, 2, 4, 8, 16, ..., n
- Cost of i-th rehash: O(2^i)
- Total cost for n insertions:
Σ(i=0 to log n) 2^i = 2^(log n + 1) - 1 = 2n - 1 = O(n)
- Amortized cost per insert: O(n)/n = O(1)
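The aggregate argument can be checked numerically: counting every element copy made by a table that doubles when full shows the total rehash work stays linear in n.

```python
def insert_cost_with_doubling(n):
    """Total element copies for n inserts into a table that doubles when full."""
    capacity, size, copies = 1, 0, 0
    for _ in range(n):
        if size == capacity:
            copies += size        # Rehash: move every existing element
            capacity *= 2
        size += 1
    return copies

total = insert_cost_with_doubling(1_000_000)
```

For n = 8 the rehashes happen at sizes 1, 2, and 4, matching the 1 + 2 + 4 = 7 predicted by the geometric series.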
Birthday Paradox and Collision Probability:
For m buckets and n keys:
P(no collision) ≈ e^(-n(n-1)/(2m))
P(collision) ≈ 1 - e^(-n²/(2m))
For 50% collision probability:
n ≈ 1.177 * √m
Example: 32-bit hash (m = 2^32) has 50% collision with only ~77,163 keys.
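Plugging the numbers from the text into the approximation confirms the 50% figure (using the more precise constant √(2 ln 2) ≈ 1.1774):

```python
import math

def collision_probability(n, m):
    """P(at least one collision) among n uniform keys in m buckets."""
    return 1 - math.exp(-n * (n - 1) / (2 * m))

m = 2 ** 32
n = int(1.1774 * math.sqrt(m))   # ≈ 77,162 keys
p = collision_probability(n, m)  # ≈ 0.5
```

This is why 32-bit hashes are inadequate whenever tens of thousands of distinct keys are expected.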
Conditional Complexity
Input Characteristics Affecting Performance:
-
Key Distribution:
- Uniform random: O(1) average case holds
- Clustered keys: Depends on hash function quality
- Sequential integers with poor hash: O(n) worst case (all collide)
- Adversarial input: O(n) if hash function predictable
-
Load Factor:
α = 0.25: Very fast, memory wasteful
α = 0.50: Fast, balanced
α = 0.75: Standard threshold, good balance
α = 0.90: Slower, memory efficient (chaining only)
α → 1.0: Very slow for open addressing
α > 1.0: Only possible with chaining; lookups degrade toward linear search
-
Hash Function Quality:
- Good hash (uniform distribution): O(1) average
- Poor hash (clustering): O(n) worst case
- Cryptographic hash: O(k) where k = key size, very uniform
-
Table Size:
- Prime size: Better with poor hash functions
- Power of 2: Faster modulo (bitwise AND), requires good hash
- Small table: More collisions, slower
- Large table: Fewer collisions, cache misses
Best Case Performance:
- Empty table or perfect hash function (no collisions)
- All operations: O(1)
- Occurs when: Hash function perfectly distributes keys
Worst Case Performance:
- All keys hash to same bucket
- All operations: O(n)
- Occurs when:
- Adversarial input with predictable hash function
- Poor hash function (e.g., always returns 0)
- Birthday paradox with insufficient table size
Average Case Performance:
- Assumes Simple Uniform Hashing Assumption (SUHA)
- Keys distribute uniformly across buckets
- Load factor α is constant (typically ≤ 0.75)
- All operations: O(1)
Amortized Analysis for Dynamic Resizing:
Using aggregate method:
Operation sequence: n insertions
Regular inserts: n * O(1) = O(n)
Resizes: O(1) + O(2) + O(4) + ... + O(n) = O(2n) = O(n)
Total: O(n) + O(n) = O(n)
Amortized per operation: O(1)
Using accounting method:
- Charge 3 units per insert:
- 1 unit for immediate insert
- 1 unit credit for itself during future resize
- 1 unit credit for one old element during resize
- Always have enough credit for resize
- Amortized: O(1)
Complexity Trade-offs
Time vs Space:
- Lower α (more space): Faster operations, less collisions
- Higher α (less space): Slower operations, more collisions
- Optimal α ≈ 0.75: Empirically best balance
Chaining vs Open Addressing:
| Aspect              | Chaining                | Open Addressing        |
| ------------------- | ----------------------- | ---------------------- |
| Best/avg complexity | O(1)                    | O(1/(1-α))             |
| Space overhead      | Pointers (8 bytes each) | None, but need α < 1   |
| Cache performance   | Poor (pointer chasing)  | Excellent (contiguous) |
| Supports α > 1      | Yes                     | No                     |
Hash Function Complexity:
- Fast hash (multiplication): O(1), possibly poor distribution
- Moderate hash (polynomial): O(k), good distribution
- Cryptographic hash (SHA-256): O(k), excellent distribution, slower
Collision Resolution:
- Linear probing: Fast, clustering issues
- Quadratic probing: Moderate, reduces clustering
- Double hashing: Slower per probe, best distribution
5.2 Space Complexity
Memory Usage
Basic Space Requirements:
Separate Chaining:
Total space = Array + Nodes + Overhead
Array: O(m) pointers (m = table size)
- 8 bytes per pointer on 64-bit system
- Total: 8m bytes
Nodes: O(n) nodes (n = number of elements)
- Each node: key + value + next pointer
- Assuming 8-byte key, 8-byte value, 8-byte pointer
- Total: 24n bytes
Metadata: O(1)
- size, capacity, threshold, hash seed
- ~32 bytes
Total: O(m + n)
In terms of n (with α = n/m):
- m = n/α
- Space = O(n/α + n) = O(n) with constant α
Open Addressing:
Total space = Array + Overhead
Array: O(m) entries
- Each entry: key + value + status (empty/occupied/deleted)
- Assuming 8-byte key, 8-byte value, 1-byte status
- Total: 17m bytes
Must maintain α < 1, typically α ≤ 0.75
So m ≥ n/0.75 = 1.33n
Metadata: O(1)
- size, capacity, threshold
- ~24 bytes
Total: O(m) = O(n) with constant α
Space Comparison:
For n = 1,000,000 elements, α = 0.75:
Chaining:
- Array: 8 * 1,333,333 ≈ 10.7 MB
- Nodes: 24 * 1,000,000 = 24 MB
- Total: ~34.7 MB
Open Addressing:
- Array: 17 * 1,333,333 ≈ 22.7 MB
- Total: ~22.7 MB
Open addressing uses ~35% less space (no pointer overhead).
Memory Layout
Separate Chaining Memory Layout:
Hash Table Object:
├─ table: pointer to array (8 bytes)
├─ size: integer (4 bytes)
├─ capacity: integer (4 bytes)
├─ threshold: float (4 bytes)
├─ hash_seed: integer (4 bytes)
└─ padding (4 bytes) → Total: 28 bytes
Array (heap):
┌────────────────────────────────────┐
│ ptr₀ │ ptr₁ │ ptr₂ │ ... │ ptrₘ₋₁│ (m * 8 bytes)
└──┬───┴──┬───┴──┬───┴─────┴──┬─────┘
│ │ │ │
↓ ↓ ↓ ↓
Node Node Node Node (each 24 bytes)
↓ ↓
Node Node
↓
Node
Memory Fragmentation:
- Chaining: Many small allocations (nodes) → fragmentation
- Open addressing: Few large allocations → less fragmentation
Cache Line Analysis:
Modern CPUs have 64-byte cache lines.
Chaining:
- Each node access: potential cache miss
- Following linked list: poor spatial locality
- Average cache misses per search: α/2 (half the chain)
Open Addressing (Linear Probing):
- Contiguous array: excellent spatial locality
- Probe sequence stays in same cache line (if cluster small)
- Average cache misses per search: ≈1 (if probe sequence < 8 entries)
Memory Overhead Comparison:
| Structure       | Per-Element Overhead    | Total Overhead |
| --------------- | ----------------------- | -------------- |
| Chaining        | 8 bytes (next pointer)  | O(n)           |
| Open Addressing | 1 byte (status) + slack | O(m)           |
| Cuckoo Hashing  | 0 bytes, but 2 tables   | O(2m)          |
Space-Time Trade-offs
Load Factor Trade-off:
# Lower α: More space, faster operations
α = 0.5:
Space: 2n entries
Search time: O(1 + 0.5) = O(1.5)
Cache efficiency: Excellent
# Standard α: Balanced
α = 0.75:
Space: 1.33n entries
Search time: O(1 + 0.75) = O(1.75)
Cache efficiency: Good
# Higher α: Less space, slower operations
α = 1.0:
Space: n entries (chaining only)
Search time: O(1 + 1.0) = O(2)
Cache efficiency: Moderate
Memoization with Hash Tables:
Classic time-space trade-off in dynamic programming:
# Without memoization: Exponential time, O(1) space
def fib(n):
    if n <= 1:
        return n
    return fib(n-1) + fib(n-2)
# Time: O(2^n), Space: O(n) recursion stack
# With hash table memoization: Linear time, linear space
def fib_memo(n, memo={}):
    if n <= 1:
        return n
    if n not in memo:
        memo[n] = fib_memo(n-1, memo) + fib_memo(n-2, memo)
    return memo[n]
# Time: O(n), Space: O(n) memo table + O(n) stack
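Python ships this pattern in the standard library: functools.lru_cache wraps a function with a hash-table-backed memo keyed by the call arguments.

```python
from functools import lru_cache

@lru_cache(maxsize=None)      # Unbounded hash-table cache keyed by arguments
def fib_cached(n):
    if n <= 1:
        return n
    return fib_cached(n - 1) + fib_cached(n - 2)
```

The decorator also avoids the mutable-default-argument idiom and exposes hit/miss statistics via `fib_cached.cache_info()`.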
Auxiliary Space for Operations:
Search:
- Iterative: O(1) auxiliary space
- Recursive (rare): O(log n) stack space
Insert with Resize:
- O(m) temporary space for new array
- Old array freed after rehashing
Iteration:
- O(1) auxiliary space for iterator state
Sorting hash table values:
- Extract to array: O(n) auxiliary space
- Sort: O(log n) stack space (quicksort)
Recursion Depth (where applicable)
Hash tables rarely use recursion, but some scenarios involve it:
Recursive Operations:
-
Cuckoo Hashing Insert:
def insert(key, value, depth=0):
    if depth > max_depth:
        rehash()  # Prevent infinite eviction loop
        return insert(key, value, 0)
    old_key, old_value = table[hash(key)]
    table[hash(key)] = (key, value)
    if old_key is not None:
        insert(old_key, old_value, depth + 1)  # Recursive eviction
- Worst-case depth: O(n) (but extremely rare)
- Average depth: O(1)
- Stack space: O(depth)
-
Java 8 HashMap (Tree Nodes): When chain length > 8, converts to red-black tree:
- Tree operations: O(log n) recursion depth
- Stack space: O(log n)
Stack Space Analysis:
- Iterative implementations (standard): O(1) stack space
- Recursive cuckoo hashing: O(log n) expected, O(n) worst
- Hybrid tree-based buckets: O(log n) stack space
5.3 Other Complexity Measures
Cache Efficiency
Cache Hierarchy:
L1 Cache: 32-64 KB, ~4 cycles latency
L2 Cache: 256-512 KB, ~10 cycles latency
L3 Cache: 8-16 MB, ~40 cycles latency
RAM: GBs, ~200 cycles latency
Chaining Cache Performance:
# Poor cache locality
def search_chaining(table, key):
    index = hash(key) % len(table)
    node = table[index]       # Cache miss 1: Array access
    while node:
        if node.key == key:   # Cache miss 2: Node access
            return node.value
        node = node.next      # Cache miss 3: Next node (pointer chasing)
    return None
Cache misses:
- Array access: 1 miss (if not in cache)
- Each node: 1 miss (nodes scattered in memory)
- Average misses: 1 + α/2 (1 array + half the chain)
Open Addressing Cache Performance:
# Excellent cache locality
def search_linear_probing(table, key):
    index = hash(key) % len(table)
    i = 0
    while i < len(table):
        probe = (index + i) % len(table)  # Wraps around; accesses stay sequential
        if table[probe] is EMPTY:
            return None
        if table[probe].key == key:
            return table[probe].value
        i += 1
    return None
Cache misses:
- Initial access: 1 miss (if not in cache)
- Probe sequence: Same cache line if cluster small
- Average misses: ~1 (entire probe sequence often in 1-2 cache lines)
Cache Line Utilization:
64-byte cache line holds:
- 8 pointers (chaining)
- ~3-4 key-value pairs (open addressing, assuming 16 bytes each)
Linear probing benefits:
- Probing within same cache line: 0 additional misses
- Sequential prefetching by CPU
Swiss Tables (Google's Abseil):
Optimized for cache efficiency:
Group of 16 slots + 16-byte control byte array
Control bytes: Empty, Deleted, or H2 hash (top 7 bits)
Search:
1. Compute H2 (7 bits of hash)
2. SIMD compare: 16 control bytes at once
3. Only check matching slots
Cache efficiency:
- Control bytes: 16 bytes = fits in cache line
- SIMD: 16 comparisons in parallel
- Fewer memory accesses
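A toy, pure-Python rendition of the control-byte idea can make the H2-tag trick concrete. All names here are illustrative, and the loop over 16 control bytes stands in for what real implementations do with a single SIMD comparison; overflow probing to other groups is omitted:

```python
GROUP = 16
EMPTY_CTRL = 0x80            # Sentinel control byte: no tag is >= 0x80

def h1_h2(key, num_groups):
    """Split the hash into a group index (H1) and a 7-bit tag (H2)."""
    h = hash(key) & 0xFFFFFFFFFFFFFFFF
    return (h >> 7) % num_groups, h & 0x7F

def find(groups, ctrls, key):
    """Probe one group: tag-match on control bytes, then verify keys."""
    g, tag = h1_h2(key, len(groups))
    for i in range(GROUP):               # SIMD would test all 16 bytes at once
        if ctrls[g][i] == tag and groups[g][i] is not None:
            k, v = groups[g][i]
            if k == key:                 # Tag match is only a filter; confirm key
                return v
    return None

# Build a single group and insert one key into its slot
groups = [[None] * GROUP]
ctrls = [[EMPTY_CTRL] * GROUP]
g, tag = h1_h2("hello", 1)
groups[g][0] = ("hello", 42)
ctrls[g][0] = tag
```

The payoff is that most non-matching slots are rejected by a one-byte compare without ever touching the key-value array.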
Practical Performance
Constant Factors:
Big-O hides constant factors that matter in practice:
Hash Function Overhead:
# Simple hash: Fast but poor distribution
def simple_hash(key):
    return key  # O(1), constant = 1

# Polynomial hash: Moderate speed, good distribution
def poly_hash(s):
    h = 0
    for c in s:
        h = (h * 31 + ord(c)) % (2**32)
    return h  # O(k), constant = ~5 operations per char

# Cryptographic hash: Slow, excellent distribution
import hashlib

def crypto_hash(key):
    return hashlib.sha256(key.encode()).digest()  # O(k), constant = ~50 ops/char
Collision Resolution Overhead:
Linear probing:
- Each probe: 1 comparison + 1 array access
- Constant: ~2 operations per probe
Chaining:
- Each node: 1 comparison + 1 pointer dereference
- Constant: ~3 operations per node (worse due to cache misses)
Small Input Performance:
For n < 100:
- Hash table overhead: Hash computation + collision handling
- Linear search: Simple, cache-friendly
Breakeven point (empirical):
- n ≈ 10-20: Hash table starts to win
- n < 10: Linear search often faster
Real-World Benchmarks:
Python dict (open addressing):
n = 1,000: Insert 0.05 µs, Search 0.03 µs
n = 1,000,000: Insert 0.06 µs, Search 0.04 µs
n = 100,000,000: Insert 0.08 µs, Search 0.06 µs
Nearly constant time, slight degradation due to cache misses
Java HashMap (chaining → trees):
n = 1,000: Insert 0.08 µs, Search 0.05 µs
n = 1,000,000: Insert 0.10 µs, Search 0.07 µs
Slightly slower than Python due to pointer overhead
C++ (chaining vs Swiss Tables):
std::unordered_map (chaining):
n = 1,000,000: Insert 0.12 µs, Search 0.09 µs
absl::flat_hash_map (Swiss Tables, open addressing):
n = 1,000,000: Insert 0.06 µs, Search 0.04 µs
~2x speedup from cache-friendly design
Practical Bottlenecks:
-
Hash Function:
- Slow hash → dominates for small keys
- Optimize: Use fast hash for performance-critical code
-
Memory Allocation (Chaining):
- malloc/free overhead for each node
- Optimize: Use memory pool, allocate nodes in batches
-
Cache Misses (Chaining):
- Pointer chasing → random memory access
- Optimize: Use open addressing or array-based buckets
-
Resize Operation:
- Occasional O(n) spike
- Optimize: Incremental resizing, pre-allocate if size known
-
Load Factor:
- Too high → many collisions
- Too low → memory waste, cache misses
- Optimize: Tune based on profiling (typically 0.75)
When Theory Meets Practice:
Theory predicts: O(1) average case Practice shows:
- Constant factor varies 10x based on implementation
- Cache effects dominate for small-medium datasets
- Hash function quality matters more than collision resolution
- Open addressing outperforms chaining in modern systems
Practical Complexity Guidelines:
Tiny (n < 20): Linear search may be faster
Small (n < 1,000): Hash table good, simple implementation fine
Medium (n < 1,000,000): Hash table excellent, optimize hash function
Large (n < 1,000,000,000): Hash table great, optimize for cache
Massive (n > 1 billion): Distributed hash table or approximate structures
6. Optimization and Trade-offs
6.1 Performance Optimization
Optimization Techniques
Hash Function Optimization:
- Fast Integer Hashing:
# Slow: Modulo operation
def hash_slow(key, size):
    return key % size

# Fast: Bitwise AND (requires size = power of 2)
def hash_fast(key, size):
    return key & (size - 1)  # Equivalent to key % size when size = 2^k

# Example: size = 16     (binary: 10000)
#          size - 1 = 15 (binary: 01111)
#          key = 42      (binary: 101010)
#          42 & 15 = 10  (binary: 001010)
# Speed: ~5x faster on modern CPUs
- Multiplication Method (Faster than Division):
# Knuth's multiplicative hashing
def hash_multiplicative(key, size):
    # Golden ratio: (√5 - 1) / 2 ≈ 0.6180339887
    A = 0.6180339887498948
    # Or use the integer version: multiply by 2^32 * A
    A_int = 2654435769  # ≈ 2^32 * golden ratio
    return ((key * A_int) >> (32 - size.bit_length())) & (size - 1)
# Fast: One multiplication, one shift, one AND
- String Hashing Optimization:
# Standard polynomial hash (slow for long strings)
def hash_poly_slow(s):
    h = 0
    for c in s:
        h = (h * 31 + ord(c)) % (2**32)
    return h

# Optimized: Process multiple characters at once
def hash_poly_fast(s):
    h = 0
    # Process 4 characters at a time (32-bit chunks)
    for i in range(0, len(s), 4):
        chunk = 0
        for j in range(4):
            if i + j < len(s):
                chunk = (chunk << 8) | ord(s[i + j])
        h = h * 31 + chunk
    return h & 0xFFFFFFFF
# Speed: ~4x faster for long strings
- Cache Hash Values:
# For immutable keys, cache hash value
class CachedHashKey:
    def __init__(self, key):
        self.key = key
        self._hash = None  # Lazily computed

    def __hash__(self):
        if self._hash is None:
            self._hash = hash(self.key)
        return self._hash

    def __eq__(self, other):
        return self.key == other.key
# Python strings do this automatically
# Java String.hashCode() is cached after first call
Collision Resolution Optimization:
- Robin Hood Hashing (Variance Reduction):
def robin_hood_insert(table, key, value):
    """
    Insert with Robin Hood hashing: "rich" entries (close to their ideal
    slot) give way to "poor" ones (far from theirs).
    Distance from ideal position = probe distance.
    """
    index = hash(key) % len(table)
    distance = 0
    while table[index] is not EMPTY:
        existing_key, existing_value, existing_dist = table[index]
        if existing_key == key:
            table[index] = (key, value, distance)  # Update
            return
        # Robin Hood: if the resident entry is richer (smaller distance),
        # park the incoming entry here and carry the evicted entry forward
        if distance > existing_dist:
            table[index] = (key, value, distance)
            key, value, distance = existing_key, existing_value, existing_dist
        index = (index + 1) % len(table)
        distance += 1
    table[index] = (key, value, distance)
# Benefit: Reduces variance in probe lengths
# Average probe length: Same as linear probing
# Worst-case probe length: Much better (more predictable)
- Quadratic Probing (Reduce Clustering):
def quadratic_probe(table, key, value):
"""
Probe sequence: h(k), h(k)+1^2, h(k)+2^2, h(k)+3^2, ...
Reduces primary clustering of linear probing.
"""
index = hash(key) % len(table)
i = 0
while table[index] is not EMPTY:
if table[index].key == key:
table[index].value = value # Update
return
i += 1
index = (hash(key) + i * i) % len(table)
table[index] = (key, value)
# Benefit: Reduces primary clustering compared to linear probing
# Caveat: May miss empty slots; full coverage is guaranteed only with a
# prime table size and load factor < 0.5 (or triangular-number probing
# with a power-of-2 size)
- Double Hashing (Best Probe Distribution):
def double_hash_probe(table, key, value):
"""
Use second hash function for step size.
Probe sequence: h1(k), h1(k)+h2(k), h1(k)+2*h2(k), ...
"""
h1 = hash(key) % len(table)
h2 = 1 + (hash(key) // len(table)) % (len(table) - 1) # Step size
# h2 must be coprime to table size (prime table size ensures this)
index = h1
i = 0
while table[index] is not EMPTY:
if table[index].key == key:
table[index].value = value
return
i += 1
index = (h1 + i * h2) % len(table)
table[index] = (key, value)
# Benefit: Best distribution, minimal clustering
# Cost: Two hash computations instead of one
Resize Optimization:
- Incremental Resizing:
class IncrementalHashTable:
"""
Spread resize cost across multiple operations.
Avoid O(n) spike.
"""
def __init__(self):
    self.old_table = [EMPTY] * 8  # Active table in steady state
    self.new_table = None  # Allocated when a resize starts
    self.resize_index = 0
    self.resize_in_progress = False
def insert(self, key, value):
# Do a bit of incremental resizing
if self.resize_in_progress:
self._rehash_some_entries()
# Normal insert
if self.should_resize():
self._start_resize()
# Insert into new table if resizing
table = self.new_table if self.resize_in_progress else self.old_table
self._insert_into(table, key, value)
def _rehash_some_entries(self):
"""Rehash a small batch of entries from old to new table."""
batch_size = 10 # Amortize cost
for _ in range(batch_size):
if self.resize_index < len(self.old_table):
entry = self.old_table[self.resize_index]
if entry is not EMPTY:
self._insert_into(self.new_table, entry.key, entry.value)
self.resize_index += 1
else:
# Resize complete
self.old_table = self.new_table
self.resize_in_progress = False
break
# Benefit: No single slow operation
# Cost: More complex implementation, dual-table lookup during resize
- Pre-allocate if Size Known:
# Slow: Start small, resize multiple times
hash_map = {}
for i in range(1000000):
hash_map[i] = i
# Grows and rehashes several times along the way (exact thresholds are implementation-specific)
# Fast: Pre-allocate to avoid resizes
hash_map = dict.fromkeys(range(1000000)) # Python
# C++ equivalent: std::unordered_map plus reserve(1000000); Python dicts have no reserve()
for i in range(1000000):
hash_map[i] = i
# No resizes: ~2-3x faster
- Lazy Deletion (Avoid Rehashing):
class LazyDeleteHashTable:
"""
Mark deleted slots, rebuild only when too many tombstones.
"""
def __init__(self):
self.table = []
self.size = 0
self.deleted_count = 0
def delete(self, key):
index = self._find(key)
if index is not None and self.table[index] is not DELETED:
self.table[index] = DELETED
self.size -= 1
self.deleted_count += 1
# Rebuild if too many tombstones
if self.deleted_count > len(self.table) // 4:
self._rebuild()
def _rebuild(self):
"""Remove tombstones without increasing table size."""
old_table = self.table
self.table = [EMPTY] * len(old_table)
self.deleted_count = 0
for entry in old_table:
if entry is not EMPTY and entry is not DELETED:
self._insert(entry.key, entry.value)
# Benefit: Fast deletion, infrequent rebuilds
Lazy Evaluation and Caching:
- Lazy Bucket Initialization:
class LazyHashTable:
"""
Don't allocate bucket arrays until needed.
Saves memory for sparse hash tables.
"""
def __init__(self, size):
self.buckets = [None] * size # Array of None, not empty lists
def insert(self, key, value):
index = hash(key) % len(self.buckets)
# Lazy init: Create bucket list only when first item inserted
if self.buckets[index] is None:
self.buckets[index] = []
self.buckets[index].append((key, value))
# Benefit: Save memory for sparse tables
# Use case: Large table with few entries
- Memoization Pattern:
from functools import lru_cache
# Automatic memoization with hash table
@lru_cache(maxsize=None) # Unbounded cache
def expensive_function(x):
# Complex computation
return sum(i**2 for i in range(x))
# Or manual:
def memoize(func):
cache = {}
def wrapper(*args):
if args not in cache:
cache[args] = func(*args)
return cache[args]
return wrapper
@memoize
def fibonacci(n):
if n <= 1:
return n
return fibonacci(n-1) + fibonacci(n-2)
# Time: O(2^n) → O(n) with memoization
Bottleneck Analysis
Identifying Bottlenecks:
- Hash Function Profiling:
import hashlib
import time
def profile_hash_function(hash_func, keys):
start = time.perf_counter()
for key in keys:
_ = hash_func(key)
end = time.perf_counter()
return end - start
# Test different hash functions
keys = [str(i) for i in range(1000000)]
print("Simple sum:", profile_hash_function(lambda s: sum(ord(c) for c in s), keys))
print("Polynomial:", profile_hash_function(lambda s: hash(s), keys))
print("SHA-256:", profile_hash_function(lambda s: hashlib.sha256(s.encode()).hexdigest(), keys))
# Results (example):
# Simple sum: 0.15s
# Polynomial: 0.08s (built-in, optimized)
# SHA-256: 2.5s (cryptographic, slow)
- Collision Rate Analysis:
def analyze_collisions(hash_table):
"""
Measure collision rate and chain length distribution.
"""
chain_lengths = []
for bucket in hash_table.buckets:
if bucket is not None:
chain_lengths.append(len(bucket))
else:
chain_lengths.append(0)
total_collisions = sum(max(0, length - 1) for length in chain_lengths)
avg_chain = sum(chain_lengths) / len(chain_lengths)
max_chain = max(chain_lengths)
print(f"Average chain length: {avg_chain:.2f}")
print(f"Max chain length: {max_chain}")
print(f"Total collisions: {total_collisions}")
print(f"Load factor: {avg_chain:.2f} (equals average chain length under chaining)")
# Histogram
from collections import Counter
histogram = Counter(chain_lengths)
print("Chain length distribution:")
for length, count in sorted(histogram.items()):
print(f" Length {length}: {count} buckets")
# Ideal: Most buckets have 0-1 elements
# Problem: Many long chains indicate poor hash function or high load factor
- Memory Access Patterns:
def measure_cache_misses(operation, iterations=1000000):
"""
Estimate cache efficiency by timing.
More cache misses = slower per operation.
"""
# Warm up
for _ in range(1000):
operation()
# Measure
start = time.perf_counter()
for _ in range(iterations):
operation()
end = time.perf_counter()
return (end - start) / iterations * 1e9 # nanoseconds per op
# Compare chaining vs open addressing
chaining_time = measure_cache_misses(lambda: chaining_table[random_key()])
open_addr_time = measure_cache_misses(lambda: open_table[random_key()])
print(f"Chaining: {chaining_time:.1f} ns/op")
print(f"Open addressing: {open_addr_time:.1f} ns/op")
# Expected: Open addressing ~2-3x faster due to better cache locality
Common Bottlenecks:
-
Slow Hash Function:
- Symptom: Insertion/search time dominated by hash computation
- Detection: Profile shows >50% time in hash function
- Solution: Use faster hash (multiplication method, cached hash)
-
High Collision Rate:
- Symptom: Long chains or probe sequences
- Detection: Average chain length > 2, max chain length > 10
- Solution: Improve hash function, reduce load factor, resize table
-
Frequent Resizing:
- Symptom: Periodic slow insertions
- Detection: Profiler shows time in resize/rehash
- Solution: Pre-allocate if size known, incremental resizing
-
Cache Misses (Chaining):
- Symptom: Slower than expected for low load factor
- Detection: Performance doesn't match O(1) prediction
- Solution: Switch to open addressing, use memory pool for nodes
-
Poor Load Factor:
- Too high: Many collisions
- Too low: Memory waste, cache misses
- Solution: Tune threshold (typically 0.75)
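The detection thresholds above can be bundled into one quick diagnostic pass over a chaining table (a sketch; the bucket layout — a list of buckets, each a list of key-value pairs — and the `diagnose` name are illustrative, not from any library):

```python
def diagnose(buckets):
    """Apply the rule-of-thumb thresholds above to a chaining table.

    `buckets` is a list of buckets, each a list of (key, value) pairs.
    Returns a list of warnings; an empty list means the table looks healthy.
    """
    lengths = [len(b) for b in buckets]
    n = sum(lengths)
    load_factor = n / len(buckets)
    nonempty = sum(1 for length in lengths if length > 0)
    avg_chain = n / nonempty if nonempty else 0.0
    warnings = []
    if max(lengths, default=0) > 10:
        warnings.append("max chain > 10: suspect the hash function or adversarial keys")
    if avg_chain > 2:
        warnings.append("avg chain > 2: improve the hash function or grow the table")
    if load_factor > 0.75:
        warnings.append(f"load factor {load_factor:.2f} > 0.75: resize the table")
    return warnings
```

Run it periodically in tests or behind a debug flag; an empty result means none of the bottleneck signatures above are present.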
Ideal Conditions
Optimal Configuration:
-
Load Factor:
- Chaining: α = 0.75 (standard)
- Open addressing: α = 0.5-0.7 (lower for better performance)
- Hash-flooding (DoS) attacks a concern: α = 0.5 (limits chain growth under attack)
-
Initial Capacity:
- Known size: Set to expected_size / load_factor
- Unknown size: Start with 16-32 (small overhead, quick to grow)
- Large datasets: Start with 1024+ (avoid early resizes)
-
Hash Function:
- Integers: Multiplication method or identity (if well-distributed)
- Strings: Polynomial hash (built-in) or MurmurHash
- Security-critical: SipHash or Blake2
-
Collision Resolution:
- General purpose: Chaining (simple, robust)
- Performance-critical: Open addressing with linear probing (cache-friendly)
- Predictability-critical: Robin Hood hashing or cuckoo hashing
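The load-factor and initial-capacity guidelines above combine into a small sizing helper: round the needed capacity up to a power of two so the bitwise-AND modulo trick applies (a sketch; `initial_capacity` is an illustrative name, not a library function):

```python
import math

def initial_capacity(expected_size, load_factor=0.75):
    """Smallest power-of-2 capacity that keeps the table at or below
    `load_factor` after `expected_size` insertions."""
    needed = math.ceil(expected_size / load_factor)
    if needed <= 1:
        return 1
    return 1 << (needed - 1).bit_length()

initial_capacity(12)         # 16
initial_capacity(1000)       # 2048
initial_capacity(1000, 0.5)  # 2048
```

Passing the result as the table's initial capacity avoids every intermediate resize during the initial bulk load.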
Best Practices:
class OptimizedHashTable:
"""
Production-ready hash table with best practices.
"""
def __init__(self, initial_capacity=16, load_factor=0.75):
# Use power of 2 for fast modulo (bitwise AND)
self.capacity = self._next_power_of_2(initial_capacity)
self.size = 0
self.load_factor = load_factor
self.threshold = int(self.capacity * load_factor)
# Randomized hash seed to prevent DoS attacks
import random
self.hash_seed = random.getrandbits(32)
# Initialize table
self.table = [None] * self.capacity
def _next_power_of_2(self, n):
"""Round up to next power of 2."""
return 1 << (n - 1).bit_length()
def _hash(self, key):
"""
High-quality hash function.
"""
# Use built-in hash + randomized seed
h = hash(key) ^ self.hash_seed
# Additional mixing (optional, for extra quality)
h = ((h >> 16) ^ h) * 0x45d9f3b
h = ((h >> 16) ^ h) * 0x45d9f3b
h = (h >> 16) ^ h
return h & (self.capacity - 1) # Fast modulo
def put(self, key, value):
"""Insert with automatic resizing."""
if self.size >= self.threshold:
self._resize()
# ... insert logic ...
def _resize(self):
"""Double capacity and rehash."""
old_table = self.table
self.capacity *= 2
self.threshold = int(self.capacity * self.load_factor)
self.table = [None] * self.capacity
self.size = 0
# Rehash all entries
for entry in old_table:
if entry is not None:
self.put(entry.key, entry.value)
# Usage (assuming the expected number of entries is known up front):
hash_table = OptimizedHashTable(
    initial_capacity=int(expected_size / 0.75),  # Avoid resizes
    load_factor=0.75  # Standard trade-off
)
Performance Tuning Guide:
| Scenario | Recommended Settings |
| -------------------- | -------------------------------------------------- |
| General purpose | α=0.75, chaining, polynomial hash |
| Performance-critical | α=0.5, open addressing (linear probing), fast hash |
| Memory-constrained | α=1.0, chaining, lazy deletion |
| Predictable latency | α=0.5, Robin Hood or cuckoo hashing |
| Security-critical | α=0.5, SipHash, randomized seed |
| Known size | Pre-allocate, α=0.75 |
| Unknown size | Start small (16), α=0.75 |
6.2 Design Trade-offs
Fundamental Trade-offs
Time vs Space:
# Space-efficient but slower: High load factor
class DenseHashTable:
def __init__(self):
self.load_factor = 0.95 # Pack entries densely
# Pros: Minimal memory overhead
# Cons: More collisions, slower operations
# Use when: Memory is limited
# Fast but space-intensive: Low load factor
class SparseHashTable:
def __init__(self):
self.load_factor = 0.5 # Use more space
# Pros: Fewer collisions, faster operations
# Cons: ~2x memory usage
# Use when: Speed is critical, memory available
Simplicity vs Performance:
# Simple: Chaining with list
class SimpleHashTable:
def __init__(self):
self.buckets = [[] for _ in range(capacity)]
def put(self, key, value):
bucket = self.buckets[hash(key) % len(self.buckets)]
for i, (k, v) in enumerate(bucket):
if k == key:
bucket[i] = (key, value)
return
bucket.append((key, value))
# Pros: Easy to implement, easy to debug
# Cons: Pointer overhead, cache misses
# Complex but fast: Open addressing with Robin Hood hashing
class ComplexHashTable:
def __init__(self):
self.table = [EMPTY] * capacity
self.distances = [0] * capacity
def put(self, key, value):
# Robin Hood logic: swap if current entry is "richer"
# ... complex insertion logic with distance tracking ...
pass
# Pros: Better performance, cache-friendly
# Cons: Complex implementation, harder to debug
Determinism vs Security:
# Deterministic: Predictable, vulnerable to attacks
class DeterministicHashTable:
def __init__(self):
self.hash_seed = 0 # Fixed seed
def hash(self, key):
return hash(key) # Same output every run
# Pros: Reproducible behavior, easier debugging
# Cons: Vulnerable to hash collision DoS attacks
# Randomized: Secure, non-deterministic
class RandomizedHashTable:
def __init__(self):
import random
self.hash_seed = random.getrandbits(64) # Random per instance
def hash(self, key):
return hash((key, self.hash_seed)) # Different output each run
# Pros: Resistant to DoS attacks
# Cons: Non-deterministic iteration order, harder to debug
Worst-case vs Average-case:
# Average-case optimized: Standard hash table
class StandardHashTable:
# O(1) average case, O(n) worst case
# Most operations fast, rare slow operations acceptable
pass
# Worst-case optimized: Cuckoo hashing
class CuckooHashTable:
# O(1) worst-case lookup, O(1) amortized insert
# Guaranteed fast lookups, insertions occasionally expensive
def __init__(self):
self.table1 = [None] * capacity
self.table2 = [None] * capacity
def get(self, key):
# Check both tables: guaranteed O(1)
h1 = hash1(key) % len(self.table1)
if self.table1[h1] and self.table1[h1][0] == key:
return self.table1[h1][1]
h2 = hash2(key) % len(self.table2)
if self.table2[h2] and self.table2[h2][0] == key:
return self.table2[h2][1]
return None # Only 2 lookups, guaranteed
# Use when: Predictable latency critical (real-time systems)
Alternative Approaches
Comparison Table:
| Approach | Insert | Search | Delete | Space | Cache | Use Case |
| ---------------------------- | ------------ | ------------ | ------------ | ------- | --------- | -------------------------- |
| Hash Table (Chaining) | O(1) avg | O(1) avg | O(1) avg | O(n) | Poor | General purpose |
| Hash Table (Open Addr) | O(1) avg | O(1) avg | O(1) avg | O(n) | Good | Performance-critical |
| Binary Search Tree | O(log n) | O(log n) | O(log n) | O(n) | Poor | Ordered iteration |
| Balanced BST (Red-Black) | O(log n) | O(log n) | O(log n) | O(n) | Poor | Ordered + balanced |
| Sorted Array | O(n) | O(log n) | O(n) | O(n) | Excellent | Static data, range queries |
| Unsorted Array | O(1) | O(n) | O(n) | O(n) | Excellent | Small datasets |
| Trie | O(k) | O(k) | O(k) | O(n*k) | Good | Prefix searches |
| Skip List | O(log n) avg | O(log n) avg | O(log n) avg | O(n) | Good | Concurrent access |
| Bloom Filter | O(k) | O(k) | No | O(m) | Good | Approximate membership |
Choosing the Right Structure:
def choose_data_structure(requirements):
"""
Decision tree for selecting data structure.
"""
if requirements.need_fast_lookup:
if requirements.need_ordered:
return "Balanced BST (TreeMap)"
elif requirements.need_range_queries:
return "Balanced BST or Segment Tree"
elif requirements.need_prefix_search:
return "Trie"
else:
if requirements.memory_limited:
return "Bloom Filter (approximate)"
else:
return "Hash Table" # ← Most common choice
elif requirements.need_fast_insertion:
if requirements.data_static:
return "Sorted Array + Binary Search"
else:
return "Hash Table"
elif requirements.need_ordered_iteration:
return "Balanced BST (Red-Black, AVL)"
else:
if requirements.dataset_small:
return "Unsorted Array"
else:
return "Hash Table (default choice)"
When to Prefer Alternatives:
-
Use Binary Search Tree when:
- Need ordered iteration
- Need range queries (find all keys between X and Y)
- Need min/max operations
- Example: Event scheduling, interval queries
-
Use Sorted Array when:
- Data is static or rarely changes
- Need fast lookups (binary search)
- Memory is tight (no overhead)
- Example: Dictionary lookup, spell checker
-
Use Trie when:
- Keys are strings
- Need prefix searches
- Need autocomplete functionality
- Example: Autocomplete, IP routing, dictionary
-
Use Bloom Filter when:
- Need membership testing
- Can tolerate false positives
- Memory is very limited
- Example: Web cache, spell checker, malware detection
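For the static-data case above, the sorted-array alternative needs only the standard library (a sketch; the sample words are made up):

```python
import bisect

def contains(sorted_keys, key):
    """O(log n) membership test on a sorted list: no per-entry pointer
    overhead, at the cost of O(n) insertion for dynamic data."""
    i = bisect.bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key

words = sorted(["banana", "apple", "cherry"])
contains(words, "apple")   # True
contains(words, "durian")  # False
```

Because the list stores only the keys themselves, memory overhead is minimal, which is exactly the trade-off that favors sorted arrays for rarely changing data.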
Balancing Competing Factors
Fast Reads vs Fast Writes:
# Optimize for reads: Lower load factor
class ReadOptimizedHashTable:
def __init__(self):
self.load_factor = 0.5 # Fewer collisions
# Reads: Very fast (avg 1.5 probes)
# Writes: Slower (more frequent resizes)
# Memory: 2x overhead
# Use when: Read-heavy workload (90% reads, 10% writes)
# Optimize for writes: Higher load factor
class WriteOptimizedHashTable:
def __init__(self):
self.load_factor = 0.9 # Rare resizes
# Reads: Slower (avg 10 probes for open addressing)
# Writes: Faster (infrequent resizes)
# Memory: 1.1x overhead
# Use when: Write-heavy workload or memory-constrained
Consistency vs Performance:
# Strong consistency: Lock entire table
class ConsistentHashTable:
def __init__(self):
self.lock = threading.Lock()
def put(self, key, value):
with self.lock: # Exclusive access
# ... insert logic ...
pass
# Pros: Simple, correct
# Cons: No concurrent writes, single point of contention
# Weak consistency: Fine-grained locking
class ConcurrentHashTable:
def __init__(self):
self.segment_locks = [threading.Lock() for _ in range(16)]
def put(self, key, value):
segment = hash(key) % len(self.segment_locks)
with self.segment_locks[segment]: # Lock only this segment
# ... insert logic ...
pass
# Pros: Concurrent writes to different segments
# Cons: More complex, potential for stale reads
Heuristics for Tuning:
class AdaptiveHashTable:
"""
Hash table that tunes itself based on workload.
"""
def __init__(self):
self.load_factor = 0.75
self.resize_count = 0
self.collision_count = 0
self.operation_count = 0
def put(self, key, value):
self.operation_count += 1
# Track metrics
if self._collision_occurred():
self.collision_count += 1
# Adapt load factor based on collision rate
if self.operation_count % 10000 == 0:
self._adjust_load_factor()
# ... normal insert ...
def _adjust_load_factor(self):
collision_rate = self.collision_count / self.operation_count
if collision_rate > 0.3: # Too many collisions
self.load_factor = max(0.5, self.load_factor - 0.1)
print(f"Reducing load factor to {self.load_factor}")
elif collision_rate < 0.1: # Too sparse
self.load_factor = min(0.9, self.load_factor + 0.1)
print(f"Increasing load factor to {self.load_factor}")
# Use when: Workload characteristics unknown or changing
6.3 Advanced Optimizations
Amortization
Amortized Analysis for Resizing:
class AmortizedHashTable:
"""
Demonstrate amortized O(1) insertion with resizing.
"""
def __init__(self):
self.capacity = 8
self.size = 0
self.table = [None] * self.capacity
self.resize_costs = [] # Track resize costs
def put(self, key, value):
# Resize when 75% full (at size = 6, 12, 24, 48, ...)
if self.size >= self.capacity * 0.75:
    old_capacity = self.capacity
    self._resize()
    self.resize_costs.append(old_capacity)  # Track resize cost
# Normal insert: O(1) (collision handling omitted to keep the demo short)
index = hash(key) % self.capacity
self.table[index] = (key, value)
self.size += 1
def _resize(self):
"""Resize and rehash all elements."""
old_table = self.table
self.capacity *= 2
self.table = [None] * self.capacity
# Rehash: O(n) for this operation
for entry in old_table:
if entry is not None:
index = hash(entry[0]) % self.capacity
self.table[index] = entry
def analyze_amortized_cost(self):
"""
Analyze amortized cost.
For n insertions:
- Regular inserts: n * O(1) = O(n)
- Resizes at sizes: 8, 16, 32, ..., n
- Resize costs: 8 + 16 + 32 + ... + n < 2n = O(n)
- Total: O(n) + O(n) = O(2n) = O(n)
- Amortized per insert: O(n) / n = O(1)
"""
total_resize_cost = sum(self.resize_costs)
total_cost = self.size + total_resize_cost
amortized = total_cost / self.size if self.size > 0 else 0
print(f"Total insertions: {self.size}")
print(f"Total resize cost: {total_resize_cost}")
print(f"Total cost: {total_cost}")
print(f"Amortized cost per insert: {amortized:.2f}")
# Output: Amortized cost ≈ 2 (constant!)
# Demonstration
ht = AmortizedHashTable()
for i in range(1000):
ht.put(i, i)
ht.analyze_amortized_cost()
# Shows amortized O(1) despite occasional O(n) resize
Accounting Method:
class AmortizedHashTableWithCredits:
"""
Using accounting method: charge extra "credits" per operation.
"""
def __init__(self):
self.capacity = 8
self.size = 0
self.table = [None] * self.capacity
self.credits = 0
def put(self, key, value):
# Charge 3 units per insert (amortized cost)
cost = 3
# 1 unit: Actual insert work
self._insert(key, value)
actual_cost = 1
# 2 units: Credits for future resize
# - 1 credit for moving this element during resize
# - 1 credit for moving one old element
self.credits += (cost - actual_cost)
# When resizing, use accumulated credits
if self.size >= self.capacity * 0.75:
resize_cost = self.size # Cost to rehash all elements
assert self.credits >= resize_cost, "Not enough credits!"
self.credits -= resize_cost
self._resize()
return cost # Return amortized cost (always 3)
def _insert(self, key, value):
index = hash(key) % self.capacity
self.table[index] = (key, value)
self.size += 1
def _resize(self):
old_table = self.table
self.capacity *= 2
self.table = [None] * self.capacity
for entry in old_table:
if entry is not None:
index = hash(entry[0]) % self.capacity
self.table[index] = entry
# Key insight: By charging 3 units per insert instead of 1,
# we always have enough credits saved to pay for resize operations.
# Amortized cost: O(1) per insert.
Potential Method:
def potential_function(hash_table):
"""
Potential Φ(table) = 2 * size - capacity
Intuition: Potential increases as table fills up,
releasing "stored energy" during resize.
"""
return 2 * hash_table.size - hash_table.capacity
def amortized_cost_insert(hash_table):
"""
Amortized cost = Actual cost + ΔΦ
"""
old_potential = potential_function(hash_table)
# Actual insert cost
if hash_table.size < hash_table.capacity * 0.75:
actual_cost = 1 # Just insert
hash_table.size += 1
else:
# Resize + insert
actual_cost = hash_table.size + 1 # Rehash all + insert
hash_table.capacity *= 2
hash_table.size += 1
new_potential = potential_function(hash_table)
delta_potential = new_potential - old_potential
amortized_cost = actual_cost + delta_potential
return amortized_cost
# Analysis:
# Regular insert:
# Actual = 1
# ΔΦ = (2(n+1) - m) - (2n - m) = 2
# Amortized = 1 + 2 = 3
#
# Resize insert (when n = m*0.75):
# Actual = n + 1
# ΔΦ = (2(n+1) - 2m) - (2n - m) = 2 - m = 2 - (n/0.75) ≈ -n/0.75
# Amortized = n + 1 - n/0.75 ≈ n - 1.33n = -0.33n (negative!)
# But potential released pays for expensive operation
#
# Result: Amortized O(1) per insert
Parallelization
Concurrent Hash Table Design:
import threading
from concurrent.futures import ThreadPoolExecutor
class ConcurrentHashTable:
"""
Thread-safe hash table with fine-grained locking.
"""
def __init__(self, num_segments=16):
self.num_segments = num_segments
self.segments = [[] for _ in range(num_segments)]
self.locks = [threading.Lock() for _ in range(num_segments)]
def _get_segment(self, key):
"""Map key to segment."""
return hash(key) % self.num_segments
def put(self, key, value):
"""Thread-safe insert."""
segment = self._get_segment(key)
with self.locks[segment]:
# Only lock this segment, others can be accessed concurrently
for i, (k, v) in enumerate(self.segments[segment]):
if k == key:
self.segments[segment][i] = (key, value)
return
self.segments[segment].append((key, value))
def get(self, key):
"""Thread-safe read."""
segment = self._get_segment(key)
with self.locks[segment]:
for k, v in self.segments[segment]:
if k == key:
return v
return None
def parallel_build(self, items):
"""Build hash table in parallel."""
with ThreadPoolExecutor(max_workers=self.num_segments) as executor:
# Partition items by segment
segment_items = [[] for _ in range(self.num_segments)]
for key, value in items:
segment = self._get_segment(key)
segment_items[segment].append((key, value))
# Insert each segment in parallel (no lock contention)
def insert_segment(segment_id):
for key, value in segment_items[segment_id]:
self.put(key, value)
executor.map(insert_segment, range(self.num_segments))
# Scalability: with S segments and T threads (T ≤ S):
# - Threads touching different segments never contend for a lock
# - Best case: T = S, no lock contention, ~T times faster
Lock-Free Hash Table (Advanced):
import threading
class LockFreeHashTable:
"""
Lock-free hash table using Compare-And-Swap (CAS).
Note: Python doesn't have true CAS; this is conceptual.
Real implementation would use C/C++ with atomic operations.
"""
def __init__(self, capacity):
# Each entry: (key, value, tombstone_flag)
self.table = [(None, None, False) for _ in range(capacity)]
self.capacity = capacity
def put(self, key, value):
"""
Lock-free insert using CAS.
"""
index = hash(key) % self.capacity
while True:
# Read current value
current = self.table[index]
current_key, current_value, is_deleted = current
# Case 1: Empty slot or deleted
if current_key is None or is_deleted:
# Try to CAS: if still empty, insert
new_entry = (key, value, False)
if self._cas(index, current, new_entry):
return # Success
# CAS failed, retry
# Case 2: Key already exists
elif current_key == key:
# Try to CAS: update value
new_entry = (key, value, False)
if self._cas(index, current, new_entry):
return # Success
# CAS failed, retry
# Case 3: Collision
else:
index = (index + 1) % self.capacity # Linear probe
def _cas(self, index, expected, new_value):
    """
    Compare-And-Swap operation (atomic).
    if table[index] == expected:
        table[index] = new_value
        return True
    else:
        return False
    """
    # In a real implementation this would be a single atomic CPU instruction.
    # In Python we simulate it with one lock shared by the whole table: a
    # fresh threading.Lock() created per call would guard nothing, because
    # every caller would lock a different object.
    if not hasattr(self, "_cas_lock"):
        self._cas_lock = threading.Lock()  # Real code would create this in __init__
    with self._cas_lock:
        if self.table[index] == expected:
            self.table[index] = new_value
            return True
        return False
# Benefits of lock-free:
# - No deadlock possible
# - Always makes progress (some thread succeeds)
# - Better worst-case latency (no lock waiting)
#
# Drawbacks:
# - Complex to implement correctly
# - May have higher contention under high load
# - Requires hardware support (CAS instruction)
# Real-world examples:
# - Java's ConcurrentHashMap (CAS-based bins since Java 8; the class itself dates to Java 5)
# - C++ folly::AtomicHashMap
# - Rust's dashmap
Embarrassingly Parallel Operations:
def parallel_hash_operations(keys, values):
"""
Some hash table operations are embarrassingly parallel.
"""
# 1. Parallel building (no dependencies)
def parallel_build(items):
hash_tables = [HashTable() for _ in range(num_threads)]
with ThreadPoolExecutor(max_workers=num_threads) as executor:
# Each thread builds its own partition
def build_partition(partition_id):
for key, value in items[partition_id::num_threads]:
hash_tables[partition_id].put(key, value)
executor.map(build_partition, range(num_threads))
# Merge tables (sequential, but fast)
final_table = HashTable()
for ht in hash_tables:
for key, value in ht.items():
final_table.put(key, value)
return final_table
# 2. Parallel lookups (read-only, no locks needed)
def parallel_lookups(hash_table, keys):
with ThreadPoolExecutor(max_workers=num_threads) as executor:
results = list(executor.map(hash_table.get, keys))
return results
# 3. Parallel frequency counting (merge phase)
def parallel_word_count(documents):
with ThreadPoolExecutor(max_workers=num_threads) as executor:
# Each thread counts its partition
partial_counts = executor.map(count_words, documents)
# Merge counts
total_count = {}
for partial in partial_counts:
for word, count in partial.items():
total_count[word] = total_count.get(word, 0) + count
return total_count
def count_words(doc):
counts = {}
for word in doc.split():
counts[word] = counts.get(word, 0) + 1
return counts
# Speedup: Near-linear with number of cores for read-heavy workloads
Approximation
Approximate Data Structures:
- Bloom Filter (Approximate Membership):
import mmh3 # MurmurHash3 (third-party: pip install mmh3)
class BloomFilter:
"""
Space-efficient probabilistic data structure for membership testing.
Trade-off: Small false positive rate for large space savings.
"""
def __init__(self, size, num_hash_functions):
self.size = size
self.num_hashes = num_hash_functions
self.bit_array = [False] * size
def add(self, item):
"""Add item to set."""
for seed in range(self.num_hashes):
index = mmh3.hash(item, seed) % self.size
self.bit_array[index] = True
def contains(self, item):
"""Check if item might be in set."""
for seed in range(self.num_hashes):
index = mmh3.hash(item, seed) % self.size
if not self.bit_array[index]:
return False # Definitely NOT in set
return True # Probably in set (might be false positive)
def false_positive_rate(self, n):
"""
Calculate false positive probability.
FPR ≈ (1 - e^(-k*n/m))^k
where k = num_hashes, n = items inserted, m = size
"""
import math
k, m = self.num_hashes, self.size
return (1 - math.exp(-k * n / m)) ** k
# Example: 1 million items
# Hash table: ~24 MB (with pointers)
# Bloom filter: ~1.2 MB (1% false positive rate)
# Space savings: 20x smaller!
# Use cases:
# - Web crawlers: Check if URL already visited
# - Databases: Avoid disk lookups for non-existent keys
# - CDNs: Check if content in cache before fetching
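The memory figures above fall out of the standard Bloom filter sizing formulas, which pick the bit-array size and hash count for a target false-positive rate (a sketch; `bloom_parameters` is an illustrative name):

```python
import math

def bloom_parameters(n, fpr):
    """Optimal Bloom filter parameters for n expected items and a target
    false positive rate fpr:
      m = -n * ln(fpr) / (ln 2)^2   bits
      k = (m / n) * ln 2            hash functions
    """
    m = math.ceil(-n * math.log(fpr) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

bloom_parameters(1_000_000, 0.01)
# ≈ 9.6 million bits (~1.2 MB) and 7 hash functions
```

This reproduces the 1-million-item example: at a 1% false positive rate the filter needs about 9.6 bits per element, i.e. roughly 1.2 MB.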
- Count-Min Sketch (Approximate Frequency):
class CountMinSketch:
"""
Space-efficient probabilistic data structure for frequency counting.
Trade-off: Approximate counts (overestimation) for space savings.
"""
def __init__(self, width, depth):
self.width = width # Number of buckets per row
self.depth = depth # Number of hash functions
self.table = [[0] * width for _ in range(depth)]
def add(self, item, count=1):
"""Increment count for item."""
for i in range(self.depth):
index = hash((item, i)) % self.width
self.table[i][index] += count
def estimate(self, item):
"""Estimate frequency of item."""
return min(
self.table[i][hash((item, i)) % self.width]
for i in range(self.depth)
)
def error_bound(self, total_count):
"""
Error bound: ε * total_count with probability 1 - δ
where ε = e / width, δ = (1/2)^depth
"""
import math
epsilon = math.e / self.width
delta = (1/2) ** self.depth
return epsilon * total_count, delta
# Example: Count 1 billion elements
# Hash table: ~16 GB (assuming 16 bytes per entry)
# Count-Min Sketch: ~1 MB (1% error, 99% confidence)
# Space savings: 16,000x smaller!
# Use cases:
# - Network traffic analysis: Count packets per IP
# - Stream processing: Top-K frequent items
# - Database query optimization: Estimate join sizes
- HyperLogLog (Approximate Cardinality):
import math
class HyperLogLog:
"""
Space-efficient cardinality estimation.
Trade-off: ~2% error for ~1.5 KB space (for billions of elements).
"""
def __init__(self, precision=14):
self.precision = precision
self.m = 1 << precision # 2^precision buckets
self.buckets = [0] * self.m
# Bias correction constant
if self.m == 16:
self.alpha = 0.673
elif self.m == 32:
self.alpha = 0.697
elif self.m == 64:
self.alpha = 0.709
else:
self.alpha = 0.7213 / (1 + 1.079 / self.m)
def add(self, item):
"""Add item to set."""
# Hash item to 64 bits
h = hash(item) & ((1 << 64) - 1)
# First 'precision' bits determine bucket
bucket = h >> (64 - self.precision)
# Count leading zeros in remaining bits (+ 1)
remaining = h & ((1 << (64 - self.precision)) - 1)
leading_zeros = self._leading_zeros(remaining) + 1
# Update bucket with maximum leading zeros seen
self.buckets[bucket] = max(self.buckets[bucket], leading_zeros)
def cardinality(self):
"""Estimate number of unique items."""
# Harmonic mean of 2^bucket values
raw_estimate = self.alpha * self.m * self.m / sum(
2 ** (-bucket) for bucket in self.buckets
)
# Bias correction for small/large cardinalities
if raw_estimate <= 2.5 * self.m:
# Small range correction
zeros = self.buckets.count(0)
if zeros != 0:
return self.m * math.log(self.m / zeros)
if raw_estimate <= (1/30) * (1 << 32):
return raw_estimate
else:
# Large range correction
return -1 * (1 << 32) * math.log(1 - raw_estimate / (1 << 32))
def _leading_zeros(self, num):
"""Count leading zeros in binary representation."""
if num == 0:
return 64 - self.precision
return (64 - self.precision) - num.bit_length()
# Example: Count unique users from 1 billion events
# Hash set: ~16 GB
# HyperLogLog: ~12 KB (for 14-bit precision, ~0.8% standard error)
# Space savings: >1,000,000x smaller!
# Use cases:
# - Analytics: Count unique visitors
# - Databases: Estimate distinct values for query optimization
# - Ad tech: Count unique ad impressions
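The ~0.8% standard error quoted for precision 14 follows from the well-known HyperLogLog error formula, roughly 1.04/√m; a quick sketch of the space/accuracy trade-off:

```python
import math

# HyperLogLog standard error ≈ 1.04 / sqrt(m), where m = 2^precision buckets
for precision in (10, 12, 14, 16):
    m = 1 << precision
    print(f"precision={precision}: m={m}, stderr ≈ {1.04 / math.sqrt(m):.2%}")
# precision=14 gives m=16384 and stderr ≈ 0.81%
```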
When to Use Approximation:
| Exact Structure        | Approximate Alternative | Space Savings | Use When                                         |
| ---------------------- | ----------------------- | ------------- | ------------------------------------------------ |
| Hash Set               | Bloom Filter            | 10-100x       | Membership testing, can tolerate false positives |
| Hash Map (counts)      | Count-Min Sketch        | 1000-10000x   | Frequency estimation, overestimate acceptable    |
| Hash Set (cardinality) | HyperLogLog             | 1000000x      | Counting unique items, ~2% error acceptable      |
def choose_approximate_structure(use_case):
"""Decision guide for approximate structures."""
if use_case == "membership_testing":
if can_tolerate_false_positives():
return BloomFilter(size=optimal_size(), num_hashes=optimal_k())
else:
return HashSet() # Need exact answers
elif use_case == "frequency_counting":
if need_exact_counts():
return HashMap() # Exact counts
elif can_tolerate_overestimation():
return CountMinSketch(width=1000, depth=5) # Approximate
else:
return HashMap() # Need exact
elif use_case == "cardinality_estimation":
if need_exact_cardinality():
return HashSet() # Exact unique count
elif dataset_massive():
return HyperLogLog(precision=14) # Approximate, tiny space
else:
return HashSet() # Small enough for exact
# Rule of thumb:
# - Small datasets (< 1M elements): Use exact structures
# - Large datasets (> 100M elements): Consider approximate
# - Massive datasets (> 1B elements): Approximate almost required
4. Problem-Solving and Applications
4.1 Use Cases and Applications
Where and When to Use
Most Effective Scenarios:
-
Fast Lookup with Non-Integer Keys:
- Need O(1) retrieval using strings, objects, or complex keys
- Example: User ID → User profile mapping
-
Frequency Counting:
- Count occurrences of elements in stream or array
- Example: Word frequency in document
-
Deduplication:
- Remove or detect duplicates efficiently
- Example: Find unique elements in array
-
Caching/Memoization:
- Store computed results for reuse
- Example: Dynamic programming with complex state
-
Grouping/Partitioning:
- Group elements by property
- Example: Group anagrams, group by category
When to Choose Hash Table Over Alternatives:
| Requirement             | Choose             | Over                          |
| ----------------------- | ------------------ | ----------------------------- |
| Fast lookup by key      | Hash Table         | Binary Search Tree            |
| Don't need sorted order | Hash Table         | Balanced BST                  |
| Membership testing      | Hash Set           | Array or List                 |
| Counting frequencies    | Hash Map (Counter) | Sorted Map                    |
| Caching results         | Hash Map           | Dictionary with linear search |
When NOT to Use:
-
Need Ordered Iteration:
- Hash tables don't maintain insertion or sorted order
- Use: TreeMap, OrderedDict, or sorted array instead
-
Range Queries:
- Can't find "all keys between X and Y"
- Use: Balanced BST (TreeMap) or sorted array with binary search
-
Prefix Searches:
- Can't find "all keys starting with X"
- Use: Trie or prefix tree
-
Small Datasets (< 10-20 elements):
- Hash table overhead not justified
- Simple array with linear search may be faster
-
Memory Constrained:
- Hash tables need extra space (load factor < 1)
- Use: In-place algorithms or compact data structures
Problem Statement Signals:
Keywords/phrases that suggest hash tables:
- "Find if pair/triplet sums to target" → Hash set for O(1) lookup
- "Count frequency/occurrences" → Hash map for counting
- "Group by property" → Hash map with list values
- "Check if exists" → Hash set for membership
- "First non-repeating character" → Hash map with counts
- "Detect duplicates" → Hash set
- "Longest substring without repeating" → Hash set for seen characters
- "Two arrays intersection" → Hash set for O(n) intersection
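For instance, the last signal above — two-array intersection — becomes O(n + m) once one array lives in a hash set (intersection is a hypothetical helper name):

```python
def intersection(a, b):
    """Unique common elements of two arrays via hash sets."""
    set_b = set(b)  # O(m) to build, O(1) per membership test
    return sorted(x for x in set(a) if x in set_b)  # sorted only for stable output

print(intersection([4, 9, 5], [9, 4, 9, 8, 4]))  # [4, 9]
```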
Problem Categories
Classic Problem Types:
-
Two-Sum Pattern:
- Find pair of numbers summing to target
- Use hash map to store complements
- Example: Two Sum, Three Sum, Four Sum
-
Frequency/Counting:
- Count element occurrences
- Example: Most frequent element, k most frequent
-
Grouping/Partitioning:
- Group elements by property
- Example: Group anagrams, group shifted strings
-
Subarray/Substring Problems:
- Find subarray with property (sum, distinct elements)
- Example: Subarray sum equals K, longest substring without repeating
-
Deduplication:
- Remove or identify duplicates
- Example: Contains duplicate, longest consecutive sequence
-
Graph Problems:
- Represent graph as adjacency list (hash map)
- Example: Clone graph, course schedule
Canonical/Classic Problems:
- Two Sum (Topic 64)
- Group Anagrams (Topic 65)
- Longest Consecutive Sequence (Topic 66)
- Subarray Sum Equals K (Topic 67)
- Design HashMap from Scratch (Topic 68)
- Rolling Hash - Rabin-Karp (Topic 69)
Also classic:
- First non-repeating character
- Isomorphic strings
- Valid Sudoku
- LRU Cache (hash map + doubly linked list)
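As a sketch of the first of these, first non-repeating character takes two passes with a frequency map (first_unique_char is a hypothetical name):

```python
from collections import Counter

def first_unique_char(s):
    counts = Counter(s)         # Pass 1: frequency of each character
    for i, ch in enumerate(s):  # Pass 2: first character with count 1
        if counts[ch] == 1:
            return i
    return -1

print(first_unique_char("leetcode"))      # 0  ('l' appears once)
print(first_unique_char("loveleetcode"))  # 2  ('v' is the first unique)
```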
Real-World Applications
Production Systems:
-
Database Indexing:
- Hash indexes for equality searches
- PostgreSQL, MySQL support hash indexes
- Example:
SELECT * FROM users WHERE email = 'user@example.com'
-
Caching Layers:
- Redis: In-memory key-value store using hash tables
- Memcached: Distributed hash table for caching
- CDN caching: Hash URLs to cached responses
-
Compiler/Interpreter Symbol Tables:
- Map variable names to memory locations/types
- Python: Global/local variable namespaces are dicts
- JavaScript: Object properties are hash maps
-
Web Frameworks:
- Routing: Map URL patterns to handlers
- Session management: Session ID → session data
- Cookies: Key-value pairs
-
Blockchain:
- Bitcoin: Map transaction ID → transaction details
- Ethereum: Account address → account state
- Merkle trees use cryptographic hashes
Famous Systems Using Hash Tables:
| System        | Hash Table Usage                       |
| ------------- | -------------------------------------- |
| Google Search | Index: keyword → document IDs          |
| Facebook      | Social graph: User ID → friend list    |
| Amazon        | Product catalog: SKU → product details |
| DNS           | Domain name → IP address               |
| Git           | Blob hash → file content               |
| Linux Kernel  | Process table: PID → process structure |
| Chrome V8     | Object property maps (hidden classes)  |
Domain-Specific Applications:
Finance:
- Order books: Order ID → Order details
- Position tracking: Ticker → positions
- Risk calculations: Portfolio ID → risk metrics
E-commerce:
- Inventory: SKU → stock count
- Shopping carts: User ID → cart items
- Recommendation engines: User ID → recommended products
Social Media:
- Friend graphs: User ID → friends
- News feeds: User ID → feed posts
- Hashtag tracking: Tag → posts
Gaming:
- Entity management: Entity ID → game object
- Player data: Player ID → stats
- Leaderboards: Player → score (sorted by score)
Bioinformatics:
- K-mer counting: k-mer → count
- Sequence alignment: Hash subsequences for fast matching
- Gene databases: Gene ID → gene info
Scale and Performance Requirements
Dataset Size Suitability:
Small (< 1,000):
- Hash table works but may be overkill
- Linear search can be competitive
- Memory overhead noticeable
Medium (1,000 - 1,000,000):
- Sweet spot for hash tables
- Clear advantage over O(log n) structures
- Fits comfortably in RAM
Large (1,000,000 - 1,000,000,000):
- Hash tables excellent if fit in memory
- Watch for cache misses
- Consider cache-optimized variants (Swiss Tables)
Massive (> 1 billion):
- May not fit in single machine memory
- Use distributed hash tables (Chord, Kademlia)
- Or approximate data structures (Bloom filter, Count-Min Sketch)
Performance Guarantees:
| Operation | Average | Worst | Space |
| --------- | ------- | ----- | ----- |
| Search    | O(1)    | O(n)  | O(n)  |
| Insert    | O(1)    | O(n)  | O(n)  |
| Delete    | O(1)    | O(n)  | O(n)  |
Amortized insert/delete: O(1), since the occasional O(n) resize is spread across many operations
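A quick way to see the amortized bound: with capacity doubling, the total resize work over n inserts stays below 2n element moves (total_copies is a hypothetical helper; real tables resize at a load-factor threshold rather than when completely full):

```python
def total_copies(n, initial_capacity=8):
    """Total elements moved by resizes (double when full) over n inserts."""
    copies = 0
    cap = initial_capacity
    size = 0
    for _ in range(n):
        if size == cap:
            copies += size  # Rehash every existing element into the new array
            cap *= 2
        size += 1
    return copies

n = 1_000_000
print(total_copies(n) / n)  # ≈ 1.05: fewer than 2 moves per insert on average
```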
When Hash Table is Overkill:
- Problem solvable in one pass without lookups
- Data already sorted (can use binary search)
- Only need to iterate once (no repeated lookups)
When Hash Table is Insufficient:
- Need sorted order iteration
- Need min/max in O(1) (use heap instead)
- Need range queries (use segment tree or BST)
- Memory extremely limited (use bit arrays or compact structures)
4.2 Problem-Solving Approaches
Recognition Patterns
How to Recognize Hash Table Problems:
-
Explicit Counting:
- "Count the number of..." → Hash map
- "Find the most/least frequent..." → Hash map with counts
-
Existence Checks:
- "Check if X exists" → Hash set
- "Find if there is a pair..." → Hash set
-
Complement/Pair Finding:
- "Find two elements that sum to target"
- Pattern: For each element x, check if (target - x) exists
-
Grouping by Property:
- "Group items that share X"
- Pattern: Hash map where key = property, value = list of items
-
Tracking Seen Elements:
- "Find first duplicate"
- "Longest substring without repeating characters"
- Pattern: Hash set to track what we've seen
-
Index Mapping:
- "Find indices of two numbers that sum to target"
- Pattern: Hash map storing value → index
Input/Output Patterns:
-
Input: Array/list + target/condition
-
Output: Pair/subset satisfying condition
-
Approach: Hash set/map for O(1) complement lookup
-
Input: Array of strings
-
Output: Groups of related strings
-
Approach: Hash map with canonical form as key
Constraint Indicators:
- "In O(n) time" or "Linear time" → Likely hash table
- "Find in one pass" → Hash table for seen elements
- "Can you do better than O(n log n)?" → Hash table to achieve O(n)
Application Strategy
General Problem-Solving Templates:
# Template 1: Existence/Complement Check
def two_sum_pattern(arr, target):
seen = {} # or set() if only checking existence
for i, num in enumerate(arr):
complement = target - num
if complement in seen:
return [seen[complement], i] # Found pair
seen[num] = i
return None
# Template 2: Frequency Counting
def frequency_pattern(arr):
freq = {}
for item in arr:
freq[item] = freq.get(item, 0) + 1
# Process frequencies
return max(freq, key=freq.get) # Example: most frequent
# Template 3: Grouping by Property
def grouping_pattern(items):
groups = {}
for item in items:
key = compute_key(item) # Canonical form
if key not in groups:
groups[key] = []
groups[key].append(item)
return list(groups.values())
# Template 4: Sliding Window with Hash Map
def sliding_window_pattern(arr, k):
    """Example payload: number of distinct elements in each window of size k."""
    window = {}
    result = []
    # Initialize first window
    for i in range(k):
        window[arr[i]] = window.get(arr[i], 0) + 1
    result.append(len(window))
    # Slide window
    for i in range(k, len(arr)):
        # Add new element
        window[arr[i]] = window.get(arr[i], 0) + 1
        # Remove old element
        window[arr[i-k]] -= 1
        if window[arr[i-k]] == 0:
            del window[arr[i-k]]
        result.append(len(window))
    return result
Adapting to Problem-Specific Constraints:
-
Limited Memory:
- Use Count-Min Sketch instead of hash map for approximate counts
- Use Bloom filter for approximate membership
-
Need Sorted Output:
- Use hash map during computation
- Sort results afterward
-
Multiple Lookups:
- Use hash map with tuple keys
- Or nested hash maps
-
Custom Key Types:
- Implement __hash__ and __eq__ methods
- Or convert to canonical string/tuple form
Common Variations:
- Two Sum → Three Sum: Add nested loop, check if -(arr[i] + arr[j]) exists
- Find pair → Find all pairs: Don't return on first find, collect all
- Exact count → Top K frequent: Use heap with hash map
- Single array → Two arrays: Two hash sets, find intersection/difference
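As an example of the find pair → find all pairs variation (all_pairs_with_sum is a hypothetical name), keep collecting matches instead of returning on the first:

```python
def all_pairs_with_sum(arr, target):
    """Every index pair (i, j), i < j, with arr[i] + arr[j] == target."""
    seen = {}   # value -> indices where it has appeared so far
    pairs = []
    for j, num in enumerate(arr):
        for i in seen.get(target - num, []):
            pairs.append((i, j))
        seen.setdefault(num, []).append(j)
    return pairs

print(all_pairs_with_sum([1, 1, 2, 2], 3))  # [(0, 2), (1, 2), (0, 3), (1, 3)]
```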
Combination with Other Techniques
Hash Table + Two Pointers:
Problem: Three Sum - find all triplets summing to zero
def three_sum(nums):
nums.sort() # Sort for two-pointer technique
result = []
seen_triplets = set() # Hash set to avoid duplicates
for i in range(len(nums) - 2):
left, right = i + 1, len(nums) - 1
while left < right:
total = nums[i] + nums[left] + nums[right]
if total == 0:
triplet = (nums[i], nums[left], nums[right])
if triplet not in seen_triplets:
result.append(list(triplet))
seen_triplets.add(triplet)
left += 1
right -= 1
elif total < 0:
left += 1
else:
right -= 1
return result
Hash Table + Sliding Window:
Problem: Longest Substring Without Repeating Characters
def length_of_longest_substring(s):
char_index = {} # char → last seen index
max_length = 0
start = 0
for end in range(len(s)):
if s[end] in char_index and char_index[s[end]] >= start:
start = char_index[s[end]] + 1 # Move start past duplicate
char_index[s[end]] = end
max_length = max(max_length, end - start + 1)
return max_length
Hash Table + BFS/DFS:
Problem: Clone Graph
def clone_graph(node):
if not node:
return None
cloned = {} # Original node → cloned node mapping
def dfs(node):
if node in cloned:
return cloned[node]
clone = Node(node.val)
cloned[node] = clone # Store before recursing (handles cycles)
for neighbor in node.neighbors:
clone.neighbors.append(dfs(neighbor))
return clone
return dfs(node)
Hash Table + Prefix Sum:
Problem: Subarray Sum Equals K
def subarray_sum(nums, k):
prefix_sum = 0
sum_count = {0: 1} # prefix_sum → count of occurrences
count = 0
for num in nums:
prefix_sum += num
# If (prefix_sum - k) exists, found subarray
if prefix_sum - k in sum_count:
count += sum_count[prefix_sum - k]
sum_count[prefix_sum] = sum_count.get(prefix_sum, 0) + 1
return count
Hash Table + Heap:
Problem: Top K Frequent Elements
import heapq
from collections import Counter
def top_k_frequent(nums, k):
# Hash map for frequency counting
freq = Counter(nums)
    # heapq.nlargest maintains a min-heap of size k internally: O(n log k)
    return heapq.nlargest(k, freq.keys(), key=freq.get)
Hash Table + Trie:
Problem: Word Search II (find all words from dictionary in 2D board)
class TrieNode:
def __init__(self):
self.children = {} # Hash map for children
self.word = None
def find_words(board, words):
# Build trie from words
root = TrieNode()
for word in words:
node = root
for char in word:
if char not in node.children:
node.children[char] = TrieNode()
node = node.children[char]
node.word = word
    # DFS on board using trie (defined here since the walk carries a trie node)
    def dfs(board, i, j, node, result):
        char = board[i][j]
        if char not in node.children:
            return
        node = node.children[char]
        if node.word:
            result.append(node.word)
            node.word = None  # Report each word once
        board[i][j] = '#'  # Mark cell as visited ('#' matches no trie child)
        for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
            if 0 <= ni < len(board) and 0 <= nj < len(board[0]):
                dfs(board, ni, nj, node, result)
        board[i][j] = char  # Restore cell on backtrack
    result = []
    for i in range(len(board)):
        for j in range(len(board[0])):
            dfs(board, i, j, root, result)
    return result
Preprocessing Patterns:
-
Sort then Hash:
- Group anagrams: Sort each string, use sorted string as key
-
Normalize then Hash:
- Case-insensitive comparison: Lowercase before hashing
- Canonical form: Convert to standard representation
-
Accumulate then Hash:
- Prefix sums: Store cumulative sum at each index as key
Postprocessing Patterns:
-
Hash to Collect, Sort to Output:
freq = Counter(arr)
return sorted(freq.items(), key=lambda x: x[1], reverse=True)
-
Hash to Deduplicate, List to Output:
seen = set()
result = []
for item in arr:
    if item not in seen:
        result.append(item)
        seen.add(item)
7. Comparative Analysis
7.1 Comparison with Alternatives
Feature-by-Feature Comparison
Hash Table vs Binary Search Tree (BST):
| Feature               | Hash Table                         | Binary Search Tree                        |
| --------------------- | ---------------------------------- | ----------------------------------------- |
| Search                | O(1) average, O(n) worst           | O(log n) average, O(n) worst (unbalanced) |
| Insert                | O(1) average, O(n) worst           | O(log n) average, O(n) worst (unbalanced) |
| Delete                | O(1) average, O(n) worst           | O(log n) average, O(n) worst (unbalanced) |
| Min/Max               | O(n)                               | O(log n) or O(1) with pointer             |
| Ordered Iteration     | No                                 | Yes (in-order traversal)                  |
| Range Query           | No                                 | O(log n + k) where k = results            |
| Predecessor/Successor | No                                 | O(log n)                                  |
| Memory Overhead       | Load factor dependent              | Pointer overhead                          |
| Cache Performance     | Good (open addr) / Poor (chaining) | Poor (pointer chasing)                    |
When to choose:
- Hash Table: Need O(1) lookups, don't need ordering
- BST: Need ordering, range queries, or min/max operations
Hash Table vs Balanced BST (Red-Black, AVL):
| Feature                   | Hash Table         | Balanced BST                         |
| ------------------------- | ------------------ | ------------------------------------ |
| Search                    | O(1) average       | O(log n) guaranteed                  |
| Insert                    | O(1) average       | O(log n) guaranteed                  |
| Delete                    | O(1) average       | O(log n) guaranteed                  |
| Worst-case guarantee      | No (O(n) possible) | Yes (O(log n))                       |
| Ordered operations        | No                 | Yes                                  |
| Implementation complexity | Simple (chaining)  | Complex (rotations)                  |
| Memory overhead           | Moderate           | High (color/balance bits + pointers) |
When to choose:
- Hash Table: Performance-critical, no ordering needed
- Balanced BST: Need guaranteed O(log n), ordered operations
Hash Table vs Sorted Array:
| Feature           | Hash Table         | Sorted Array                  |
| ----------------- | ------------------ | ----------------------------- |
| Search            | O(1) average       | O(log n) binary search        |
| Insert            | O(1) average       | O(n) maintain sorted order    |
| Delete            | O(1) average       | O(n) shift elements           |
| Space             | O(n / load_factor) | O(n)                          |
| Range Query       | No                 | O(log n + k)                  |
| Cache Performance | Varies             | Excellent (contiguous)        |
| Dynamic resizing  | Efficient          | Expensive                     |
| Best for          | Dynamic data       | Static or rarely-changed data |
When to choose:
- Hash Table: Frequent insertions/deletions
- Sorted Array: Static data with many queries
Hash Table vs Trie:
| Feature               | Hash Table   | Trie                                |
| --------------------- | ------------ | ----------------------------------- |
| Search                | O(1) average | O(k) where k = key length           |
| Insert                | O(1) average | O(k)                                |
| Delete                | O(1) average | O(k)                                |
| Space                 | O(n)         | O(ALPHABET_SIZE · k · n) worst case |
| Prefix search         | No           | O(p) where p = prefix length        |
| Autocomplete          | No           | Excellent                           |
| Longest common prefix | No           | Natural                             |
| Best for strings      | Yes (hash)   | Yes (prefix operations)             |
When to choose:
- Hash Table: Exact match lookups, any key type
- Trie: Prefix searches, autocomplete, string-specific operations
Hash Table (Chaining) vs Hash Table (Open Addressing):
| Feature           | Chaining                   | Open Addressing                  |
| ----------------- | -------------------------- | -------------------------------- |
| Load factor       | Can exceed 1.0             | Must stay < 1.0                  |
| Cache performance | Poor                       | Excellent                        |
| Deletion          | Easy                       | Complex (tombstones)             |
| Memory overhead   | Pointer per node           | Slack space in array             |
| Implementation    | Simple                     | Moderate                         |
| Worst-case search | O(n)                       | O(n)                             |
| Best for          | Variable load, simple code | Performance-critical, fixed load |
When to choose:
- Chaining: Simple implementation, variable load factor
- Open Addressing: Performance-critical, cache-friendly
Quantitative Performance Comparison:
import time
import random
def benchmark_structures(n=100000):
"""
Benchmark different data structures for common operations.
"""
keys = list(range(n))
random.shuffle(keys)
# Hash Table (dict)
start = time.perf_counter()
hash_table = {}
for key in keys:
hash_table[key] = key
insert_time_hash = time.perf_counter() - start
start = time.perf_counter()
for key in keys:
_ = hash_table[key]
search_time_hash = time.perf_counter() - start
# Sorted List (bisect)
import bisect
start = time.perf_counter()
sorted_list = []
for key in keys:
bisect.insort(sorted_list, key)
insert_time_sorted = time.perf_counter() - start
start = time.perf_counter()
for key in keys:
bisect.bisect_left(sorted_list, key)
search_time_sorted = time.perf_counter() - start
# Unsorted List
start = time.perf_counter()
unsorted_list = []
for key in keys:
unsorted_list.append(key)
insert_time_unsorted = time.perf_counter() - start
start = time.perf_counter()
for key in keys:
key in unsorted_list
search_time_unsorted = time.perf_counter() - start
print(f"Results for n={n}:")
print(f"Hash Table - Insert: {insert_time_hash:.3f}s, Search: {search_time_hash:.3f}s")
print(f"Sorted List - Insert: {insert_time_sorted:.3f}s, Search: {search_time_sorted:.3f}s")
print(f"Unsorted List - Insert: {insert_time_unsorted:.3f}s, Search: {search_time_unsorted:.3f}s")
# Typical output for n=100,000:
# Hash Table - Insert: 0.008s, Search: 0.005s
# Sorted List - Insert: 8.523s, Search: 0.012s
# Unsorted List - Insert: 0.003s, Search: 12.456s
Contextual Comparison
Use Case Decision Matrix:
| Use Case               | Best Choice  | Runner-up              | Avoid         |
| ---------------------- | ------------ | ---------------------- | ------------- |
| Exact match lookup     | Hash Table   | -                      | Trie          |
| Prefix search          | Trie         | -                      | Hash Table    |
| Range query            | Balanced BST | Sorted Array           | Hash Table    |
| Frequent insert/delete | Hash Table   | -                      | Sorted Array  |
| Static data            | Sorted Array | Hash Table             | Linked List   |
| Ordered iteration      | Balanced BST | Sorted Array           | Hash Table    |
| Memory-constrained     | Sorted Array | Hash Table (high load) | Trie          |
| Guaranteed worst-case  | Balanced BST | Cuckoo Hash            | Standard Hash |
| String prefix ops      | Trie         | -                      | Hash Table    |
| Approximate membership | Bloom Filter | -                      | Hash Set      |
Context-Specific Recommendations:
-
Compiler Symbol Table:
- Best: Hash Table
- Reason: Fast O(1) lookups, identifiers are unique, no need for ordering
- Alternative: Balanced BST if scoping requires ordered traversal
-
Database Index:
- Best: B-tree (Balanced BST variant)
- Reason: Range queries, ordered access, disk-friendly
- Alternative: Hash index for exact-match-only columns
-
Phone Directory:
- Best: Trie (for name prefix search)
- Reason: Autocomplete, prefix matching
- Alternative: Hash Table for exact name lookup
-
Spell Checker:
- Best: Hash Set
- Reason: Membership test, exact word match
- Alternative: Trie for suggestions, Bloom filter if memory tight
-
URL Cache:
- Best: Hash Table (LRU Cache)
- Reason: Fast lookup, combine with doubly-linked list for LRU
- Alternative: None suitable
-
Event Scheduler:
- Best: Balanced BST or Heap
- Reason: Need ordered by time, min/max operations
- Alternative: Hash Table if only lookup by event ID
Decision Criteria:
def choose_data_structure(requirements):
"""
Comprehensive decision tree.
"""
# Priority 1: Ordering requirements
if requirements.need_ordered_iteration:
return "Balanced BST"
if requirements.need_range_queries:
return "Balanced BST or Segment Tree"
# Priority 2: Operation type
if requirements.primary_operation == "prefix_search":
return "Trie"
if requirements.primary_operation == "exact_lookup":
if requirements.performance_critical:
return "Hash Table (Open Addressing)"
else:
return "Hash Table (Chaining)"
# Priority 3: Data characteristics
if requirements.data_static:
return "Sorted Array"
if requirements.data_very_dynamic:
return "Hash Table"
# Priority 4: Guarantees
if requirements.need_worst_case_guarantee:
return "Balanced BST or Cuckoo Hash Table"
if requirements.can_tolerate_approximation:
return "Bloom Filter or Count-Min Sketch"
# Default
return "Hash Table"
7.2 Evolution and Variants
Historical Progression
Timeline of Hash Table Evolution:
-
1950s - Birth:
- 1953: Hans Peter Luhn (IBM) - First hash table proposal
- 1956: Arnold Dumey - Formalized scatter storage
- Technique: Simple division method, linear probing
- Limitation: Poor collision handling, no load factor management
-
1960s - Theoretical Foundation:
- 1963: Gene Amdahl - Analyzed chaining
- 1963: Konheim & Weiss - Linear probing mathematical analysis
- Innovation: Separate chaining introduced
- Impact: Theoretical complexity bounds established
-
1970s - Refinement:
- 1973: Knuth - Comprehensive analysis in TAOCP Vol. 3
- 1979: Carter & Wegman - Universal hashing theory
- Innovation: Load factor management, better hash functions
- Impact: Provable average-case guarantees
-
1980s - Practical Optimizations:
- 1982: Dynamic hashing (Larson)
- 1985: Linear hashing
- 1986: Robin Hood hashing (Celis)
- Innovation: Incremental resizing, better memory management
- Impact: Production systems could scale gracefully
-
1990s - Security and Variants:
- 1993: Multiplicative hashing gains popularity
- 1996: SHA-1 becomes standard secure hash
- 1997: Perfect hashing for static data
- Innovation: Cryptographic hashes, security awareness
- Impact: Protection against collision attacks
-
2000s - Advanced Variants:
- 2001: Cuckoo hashing (Pagh & Rodler)
- 2004: Hopscotch hashing
- 2007: MurmurHash (fast, good distribution)
- Innovation: Worst-case O(1) lookup, cache-friendly designs
- Impact: Real-time systems could use hash tables
-
2010s - Modern Optimization:
- 2012: SipHash (defense against hash-flooding DoS)
- 2016: Rust's HashMap uses Robin Hood hashing
- 2017: Google Swiss Tables (SIMD optimization)
- Innovation: Hardware-aware design, SIMD parallelization
- Impact: 2-3x speedup on modern CPUs
-
2020s - Current State:
- 2020: xxHash3 (extremely fast non-cryptographic hash)
- 2021: wyhash optimization
- Present: Cache-oblivious algorithms, GPU hashing
- Innovation: AI/ML integration, quantum-resistant hashing research
- Impact: Handles big data and real-time analytics
What Each Era Replaced:
| Era   | Old Approach                | New Approach             | Improvement             |
| ----- | --------------------------- | ------------------------ | ----------------------- |
| 1960s | Linear search               | Simple hash table        | O(n) → O(1) average     |
| 1970s | Poor hash functions         | Universal hashing        | Provable guarantees     |
| 1980s | Static size                 | Dynamic resizing         | No manual tuning        |
| 1990s | Predictable hashes          | Randomized hashes        | DoS protection          |
| 2000s | Chaining (pointer overhead) | Open addressing variants | Better cache use        |
| 2010s | Scalar operations           | SIMD operations          | 2-4x speedup            |
| 2020s | General-purpose             | Hardware-specialized     | 10x+ for specific tasks |
Variant Analysis
Major Hash Table Variants:
-
Separate Chaining:
class ChainingHashTable:
    def __init__(self, capacity=16):
        self.buckets = [[] for _ in range(capacity)]
    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
- Pros: Simple, load factor can exceed 1, easy deletion
- Cons: Pointer overhead, cache misses
- Best for: General purpose, variable load
-
Open Addressing (Linear Probing):
class LinearProbingHashTable:
    def __init__(self, capacity=16):
        self.table = [None] * capacity
    def put(self, key, value):
        i = hash(key) % len(self.table)
        while self.table[i] is not None and self.table[i][0] != key:
            i = (i + 1) % len(self.table)
        self.table[i] = (key, value)
- Pros: Cache-friendly, no pointers
- Cons: Clustering, complex deletion
- Best for: Performance-critical, low load factor
-
Robin Hood Hashing:
class RobinHoodHashTable:
    def __init__(self, capacity=16):
        self.table = [None] * capacity
        self.distances = [0] * capacity
    def put(self, key, value):
        i = hash(key) % len(self.table)
        distance = 0
        while self.table[i] is not None:
            if self.distances[i] < distance:
                # Swap: current entry is "richer" (closer to home) than new entry
                key, value, distance, self.table[i], self.distances[i] = \
                    self.table[i][0], self.table[i][1], self.distances[i], \
                    (key, value), distance
            i = (i + 1) % len(self.table)
            distance += 1
        self.table[i] = (key, value)
        self.distances[i] = distance
- Pros: Reduces variance, better worst-case
- Cons: More complex than linear probing
- Best for: Predictable latency
-
Cuckoo Hashing:
class CuckooHashTable:
    def __init__(self, capacity=16):
        self.table1 = [None] * capacity
        self.table2 = [None] * capacity
        self.max_evictions = 10
    def _hash1(self, key):
        return hash(key)
    def _hash2(self, key):
        return hash((key, 0x9E3779B9))  # Salted second hash, independent-ish
    def _rehash(self):
        # Grow both tables and reinsert every stored entry
        entries = [e for e in self.table1 + self.table2 if e is not None]
        self.table1 = [None] * (2 * len(self.table1))
        self.table2 = [None] * (2 * len(self.table2))
        for k, v in entries:
            self.put(k, v)
    def put(self, key, value):
        return self._insert(key, value, 0)
    def _insert(self, key, value, evictions):
        if evictions > self.max_evictions:
            self._rehash()
            return self.put(key, value)
        # Try table 1
        i1 = self._hash1(key) % len(self.table1)
        if self.table1[i1] is None:
            self.table1[i1] = (key, value)
            return
        # Evict from table 1, insert in table 2
        old_key, old_value = self.table1[i1]
        self.table1[i1] = (key, value)
        i2 = self._hash2(old_key) % len(self.table2)
        if self.table2[i2] is None:
            self.table2[i2] = (old_key, old_value)
        else:
            evicted = self.table2[i2]
            self.table2[i2] = (old_key, old_value)
            self._insert(evicted[0], evicted[1], evictions + 1)
    def get(self, key):
        # Check both tables: guaranteed O(1)
        i1 = self._hash1(key) % len(self.table1)
        if self.table1[i1] and self.table1[i1][0] == key:
            return self.table1[i1][1]
        i2 = self._hash2(key) % len(self.table2)
        if self.table2[i2] and self.table2[i2][0] == key:
            return self.table2[i2][1]
        return None
- Pros: O(1) worst-case lookup
- Cons: Complex insertion, requires two tables
- Best for: Real-time systems needing guaranteed latency
-
Hopscotch Hashing:
class HopscotchHashTable:
    def __init__(self, capacity=16, hop_range=32):
        self.table = [None] * capacity
        self.hop_range = hop_range
        self.hop_info = [0] * capacity  # Bitmap of occupied slots per home bucket
    def put(self, key, value):
        i = hash(key) % len(self.table)
        # Find free slot within hop range
        free_slot = i
        while free_slot < i + self.hop_range and self.table[free_slot] is not None:
            free_slot += 1
        if free_slot >= i + self.hop_range:
            # Move entries to make room (complex logic)
            free_slot = self._make_room(i)
        self.table[free_slot] = (key, value)
        self.hop_info[i] |= (1 << (free_slot - i))
- Pros: Bounded search, cache-friendly
- Cons: Complex implementation
- Best for: High-performance concurrent hash tables
-
Swiss Tables (Google Abseil):
class SwissTable:
    def __init__(self, capacity=16):
        self.groups = capacity // 16
        self.ctrl = bytearray(capacity)  # Control bytes
        self.slots = [None] * capacity
    def put(self, key, value):
        h1 = hash(key)
        h2 = h1 >> 57  # Top 7 bits for H2 hash
        group = (h1 % self.groups) * 16
        # SIMD: Compare H2 against 16 control bytes at once
        matches = self._simd_match(self.ctrl[group:group+16], h2)
        for offset in matches:
            if self.slots[group + offset] and self.slots[group + offset][0] == key:
                self.slots[group + offset] = (key, value)
                return
        # Insert in first empty slot
        for offset in range(16):
            if self.ctrl[group + offset] == 0:  # Empty
                self.ctrl[group + offset] = h2
                self.slots[group + offset] = (key, value)
                return
- Pros: SIMD optimization, excellent cache performance
- Cons: Requires modern CPU, complex implementation
- Best for: Modern C++ applications (C++17+)
Variant Comparison Table:
| Variant        | Lookup     | Insert      | Delete     | Space  | Cache     | Complexity | Use Case        |
| -------------- | ---------- | ----------- | ---------- | ------ | --------- | ---------- | --------------- |
| Chaining       | O(1) avg   | O(1) avg    | O(1) avg   | High   | Poor      | Low        | General purpose |
| Linear Probing | O(1/(1-α)) | O(1/(1-α)²) | O(1/(1-α)) | Medium | Excellent | Low        | Performance     |
| Robin Hood     | O(1/(1-α)) | O(1/(1-α)²) | O(1/(1-α)) | Medium | Excellent | Medium     | Predictable     |
| Cuckoo         | O(1) worst | O(1) amort  | O(1)       | High   | Good      | High       | Real-time       |
| Hopscotch      | O(1)       | O(1)        | O(1)       | Medium | Excellent | High       | Concurrent      |
| Swiss Tables   | O(1)       | O(1)        | O(1)       | Medium | Excellent | Very High  | Modern C++      |
Choosing the Right Variant:
def choose_hash_table_variant(requirements):
"""
Choose hash table variant based on requirements.
"""
if requirements.need_worst_case_guarantee:
if requirements.can_afford_space:
return "Cuckoo Hashing"
else:
return "Hopscotch Hashing"
if requirements.need_concurrent:
return "Hopscotch or Lock-Free"
if requirements.performance_critical:
if requirements.modern_cpu:
return "Swiss Tables"
else:
return "Robin Hood or Linear Probing"
if requirements.simple_implementation:
return "Separate Chaining"
if requirements.low_memory:
return "Open Addressing (Linear Probing)"
# Default
return "Separate Chaining" # Most common, simplest
7.3 Related Concepts
Closely Related Structures/Algorithms
Hash Table Family Tree:
Hash-Based Data Structures
│
├── Exact Structures
│ ├── Hash Map (key-value)
│ │ ├── HashMap (Java)
│ │ ├── unordered_map (C++)
│ │ ├── dict (Python)
│ │ └── Map (JavaScript)
│ │
│ ├── Hash Set (keys only)
│ │ ├── HashSet (Java)
│ │ ├── unordered_set (C++)
│ │ ├── set (Python)
│ │ └── Set (JavaScript)
│ │
│ ├── Multi-Map/Multi-Set
│ │ └── Allows duplicate keys
│ │
│ └── Bidirectional Map
│ └── Key ↔ Value both indexed
│
├── Approximate Structures
│ ├── Bloom Filter (membership)
│ ├── Count-Min Sketch (frequency)
│ ├── HyperLogLog (cardinality)
│ └── Quotient Filter (membership)
│
├── Distributed Structures
│ ├── Consistent Hashing
│ ├── Distributed Hash Table (DHT)
│ │ ├── Chord
│ │ ├── Kademlia
│ │ └── Pastry
│ │
│ └── Rendezvous Hashing
│
├── Concurrent Structures
│ ├── Lock-Free Hash Table
│ ├── ConcurrentHashMap (Java)
│ ├── TBB concurrent_hash_map (C++)
│ └── Striped Hash Table
│
└── Specialized Structures
├── Perfect Hash Table
├── Minimal Perfect Hash Table
├── Locality-Sensitive Hashing (LSH)
└── Geometric Hashing
Relationships:
-
Hash Table ↔ Hash Function:
- Hash table depends on hash function
- Quality of hash function determines collision rate
- Example hash functions:
- Division method
- Multiplication method
- Universal hashing
- Cryptographic hashes (MD5, SHA-256)
-
Hash Table ↔ Array:
- Hash table built on top of array
- Hash function maps keys to array indices
- Array provides O(1) random access
-
Hash Table ↔ Linked List (Chaining):
- Chaining uses linked lists for collision resolution
- Each bucket is a linked list
- Poor cache locality but simple
-
Hash Table ↔ BST (Java 8 HashMap):
- When chain length > 8, convert to tree
- Guarantees O(log n) even with poor hash
- Best of both worlds
-
Hash Set ↔ Hash Map:
- Hash set is hash map with dummy values
- Implementation: HashSet uses HashMap internally
- Only stores keys, values are placeholder
-
Hash Table ↔ Bloom Filter:
- Bloom filter is probabilistic hash set
- Multiple hash functions
- False positives possible, no false negatives
- Much smaller than hash set
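To make the Bloom-filter relationship concrete, here is a minimal sketch. The class name, bit-array size, hash count, and the double-hashing derivation of the k indexes are illustrative choices, not tuned parameters:

```python
import hashlib

class BloomFilter:
    """Probabilistic set: false positives possible, no false negatives."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # A Python int works as an arbitrary-length bit array

    def _indexes(self, item):
        # Derive k indexes from two hash values (Kirsch-Mitzenmacher trick)
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], 'big')
        h2 = int.from_bytes(digest[8:16], 'big')
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for idx in self._indexes(item):
            self.bits |= (1 << idx)

    def might_contain(self, item):
        # True may be a false positive; False is always correct
        return all(self.bits & (1 << idx) for idx in self._indexes(item))

bf = BloomFilter()
bf.add("apple")
print(bf.might_contain("apple"))   # True
print(bf.might_contain("banana"))  # False (with overwhelming probability)
```

The entire filter fits in 1024 bits regardless of how many items are added; the price is a growing false-positive rate as more bits get set.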
Hybrid Solutions:
-
LRU Cache (Hash Map + Doubly Linked List):
class Node:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.prev = None
        self.next = None

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = {}  # Hash map: key -> node
        self.head = Node(0, 0)  # Dummy head
        self.tail = Node(0, 0)  # Dummy tail
        self.head.next = self.tail
        self.tail.prev = self.head

    def get(self, key):
        if key in self.cache:
            node = self.cache[key]
            self._move_to_front(node)
            return node.value
        return -1

    def put(self, key, value):
        if key in self.cache:
            node = self.cache[key]
            node.value = value
            self._move_to_front(node)
        else:
            if len(self.cache) >= self.capacity:
                # Evict LRU
                lru = self.tail.prev
                self._remove(lru)
                del self.cache[lru.key]
            node = Node(key, value)
            self.cache[key] = node
            self._add_to_front(node)

# Hash map: O(1) lookup
# Doubly linked list: O(1) move to front, O(1) remove tail
# Combined: O(1) get and put
-
TreeMap with Hashing (for buckets):
class HashTreeMap:
    """
    Combines hash table (for speed) with BST (for ordering within buckets).
    """
    def __init__(self, capacity=16):
        self.buckets = [None] * capacity

    def put(self, key, value):
        bucket_idx = hash(key) % len(self.buckets)
        if self.buckets[bucket_idx] is None:
            self.buckets[bucket_idx] = BST()
        self.buckets[bucket_idx].insert(key, value)

    def range_query(self, low, high):
        """
        Get all keys in range [low, high].
        Hash to relevant buckets, then BST range query.
        """
        results = []
        for bucket in self.buckets:
            if bucket:
                results.extend(bucket.range_query(low, high))
        return results

# Benefits: Fast exact lookup (hash) + range queries (BST)
-
Trie with Hash Maps (for children):
class TrieNode:
    def __init__(self):
        self.children = {}  # Hash map instead of array
        self.is_end = False

# Array-based trie: Fixed alphabet size (e.g., 26 for a-z)
# Hash map-based trie: Any characters, dynamic
# Trade-off: Speed vs flexibility
Cross-Domain Connections
Hash Tables in Different Fields:
-
Databases:
- Hash Indexes: Fast equality lookups
- Hash Joins: Join two tables by hashing smaller table
- Query Optimization: Cardinality estimation with HyperLogLog
- Caching: Query result cache using hash map
-
Cryptography:
- Password Storage: Hash passwords with salt
- Digital Signatures: Hash document, sign hash
- Blockchain: Hash pointers in Merkle trees
- Integrity: File hashing (checksums)
-
Networking:
- Load Balancing: Consistent hashing for server selection
- ARP Tables: IP address → MAC address
- DNS Caching: Domain name → IP address
- NAT: Map internal IP:port to external IP:port
-
Compilers:
- Symbol Tables: Identifier → type/address
- String Interning: Deduplicate string literals
- Constant Folding: Cache computed constants
- Macro Expansion: Macro name → definition
-
Operating Systems:
- Page Tables: Virtual address → physical address
- File Systems: Inode hashing, directory structures
- Process Tables: PID → process control block
- Buffer Cache: Block number → cached block
-
Graphics/Computer Vision:
- Geometric Hashing: Object recognition
- Locality-Sensitive Hashing: Similar image search
- Texture Caching: Hash texture coordinates
- Collision Detection: Spatial hashing
-
Machine Learning:
- Feature Hashing: High-dimensional sparse features
- Hashing Trick: Reduce dimensionality
- MinHash: Similar document detection
- SimHash: Near-duplicate detection
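The feature-hashing (hashing trick) idea above fits in a few lines. This sketch maps a variable-length token set into a fixed-size count vector; the function name and bucket count are arbitrary choices for illustration:

```python
def hash_features(tokens, num_buckets=16):
    """Map arbitrary string features into a fixed-size vector via the
    'hashing trick'; collisions trade a little accuracy for bounded memory."""
    vec = [0] * num_buckets
    for tok in tokens:
        # Built-in hash() is fine for a demo; production systems use a
        # seeded hash like MurmurHash so buckets are stable across runs
        vec[hash(tok) % num_buckets] += 1
    return vec

vec = hash_features(["the", "cat", "sat", "the"])
print(len(vec))   # 16: dimensionality is fixed regardless of vocabulary
print(sum(vec))   # 4: every token lands in some bucket
```

Note that no vocabulary dictionary is stored at all; the hash function itself is the feature index, which is why this scales to unbounded feature spaces.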
Mathematical Connections:
-
Probability Theory:
- Birthday paradox → Collision probability
- Poisson distribution → Chain length distribution
- Load factor → Expected performance
-
Number Theory:
- Prime numbers → Table size selection
- Modular arithmetic → Index computation
- Universal hashing → Prime field arithmetic
-
Information Theory:
- Entropy → Hash distribution quality
- Collision resistance → Information loss
- Perfect hashing → Bijective function
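The birthday-paradox connection can be quantified with the standard approximation P(collision) ≈ 1 − e^(−n(n−1)/2m) for n keys hashed uniformly into m slots; a quick sketch:

```python
import math

def collision_probability(n_keys, table_size):
    """Birthday-paradox estimate of P(at least one collision) when
    hashing n_keys uniformly into table_size slots:
    approximately 1 - exp(-n(n-1) / (2m))."""
    return 1 - math.exp(-n_keys * (n_keys - 1) / (2 * table_size))

# The classic result: 23 'people', 365 'days' gives ~50%
print(f"{collision_probability(23, 365):.3f}")       # ≈ 0.500
# Even a sparse table collides: 1,000 keys in 1,000,000 slots
print(f"{collision_probability(1000, 10**6):.3f}")   # ≈ 0.393
```

This is why collision resolution is mandatory, not optional: collisions appear long before the table is anywhere near full.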
7.4 Practical Considerations
Library Implementations
Standard Library Implementations Across Languages:
-
Python dict:
# Internal: Open addressing with perturbation-based (pseudo-random) probing
d = {}  # or dict()
d[key] = value               # O(1) average
value = d[key]               # Raises KeyError if not found
value = d.get(key, default)  # Returns default if not found

# Advanced features:
d.setdefault(key, default)   # Get or insert with default
d.pop(key, default)          # Remove and return, or default
d.keys()    # dict_keys view (dynamic)
d.values()  # dict_values view
d.items()   # dict_items view (key-value pairs)

# Python 3.7+: Maintains insertion order
# Implementation: Combined hash/index table
-
Java HashMap:
// Internal: Chaining, converts a chain to a red-black tree when length > 8
HashMap<K, V> map = new HashMap<>();
map.put(key, value);      // O(1) average
V value = map.get(key);   // null if not found
map.remove(key);

// Advanced features:
map.putIfAbsent(key, value);
map.computeIfAbsent(key, k -> computeValue(k));
map.getOrDefault(key, defaultValue);
map.forEach((k, v) -> process(k, v));

// Java 8+: Streams support
map.entrySet().stream()
   .filter(e -> e.getValue() > 10)
   .collect(Collectors.toMap(...));
-
C++ std::unordered_map:
// Internal: Chaining -- the standard's bucket API and pointer-stability
// guarantees effectively mandate it. Swiss Tables live in third-party
// containers such as absl::flat_hash_map, not std::unordered_map.
std::unordered_map<K, V> map;
map[key] = value;           // O(1) average
auto it = map.find(key);    // Returns iterator
if (it != map.end()) {
    V value = it->second;
}

// Advanced features:
map.emplace(key, value);           // Construct in-place
map.insert_or_assign(key, value);  // C++17
map.try_emplace(key, args...);     // C++17

// Custom hash function:
struct CustomHash {
    size_t operator()(const K& k) const {
        return std::hash<K>{}(k);
    }
};
std::unordered_map<K, V, CustomHash> custom_map;
-
JavaScript Map:
// Internal: Varies by engine (V8 backs Map with a deterministic
// ordered hash table)
const map = new Map();
map.set(key, value);         // O(1) average
const value = map.get(key);  // undefined if not found
map.has(key);                // boolean
map.delete(key);

// Maintains insertion order
for (const [key, value] of map) {
    console.log(key, value);
}

// Advanced:
map.forEach((value, key) => process(key, value));
const keys = Array.from(map.keys());

// Note: Objects can also be used as maps, but Map is better
// Map vs Object:
// - Map: Any key type, maintains order, .size property
// - Object: String/Symbol keys only, may have prototype pollution
-
Go map:
// Internal: Bucketized chaining (8-slot buckets with overflow buckets);
// Go 1.24 moved to a Swiss Table design
m := make(map[string]int)
m[key] = value           // O(1) average
value, exists := m[key]  // Check existence
delete(m, key)

// Iteration (order deliberately randomized):
for key, value := range m {
    fmt.Println(key, value)
}

// Pre-allocation:
m := make(map[string]int, capacity)

// Note: Maps are not safe for concurrent use
// Use sync.Map for concurrent access
Hidden Optimizations in Libraries:
-
Python dict (3.6+):
- Compact dict: Separate hash/index table from key/value table
- Memory saving: 20-25% less memory than old implementation
- Insertion order: Free with new design
- Fast iteration: Keys/values stored contiguously
-
Java HashMap (8+):
- Treeification: Chains → Red-Black trees when length > 8
- Untreeification: Trees → chains when shrinking below 6
- Prevents: O(n) worst-case from hash collision attacks
- Trade-off: More complex, but guaranteed O(log n)
-
C++ Swiss Tables (absl::flat_hash_map; std::unordered_map itself is locked into chaining by the standard's bucket API and pointer-stability guarantees):
- Swiss Tables: Group-based open addressing with SIMD
- Control bytes: Metadata for each slot (empty/deleted/H2 hash)
- SIMD matching: 16 parallel comparisons per group
- Performance: Often 2-3x faster than chaining-based std::unordered_map
-
JavaScript V8:
- Swiss Tables: Used for dictionary-mode object properties (SwissNameDictionary)
- Inline caching: JIT optimizations for repeated access patterns
- Hidden classes: Optimize property access on objects used as maps
When to Use Library vs Implement:
| Scenario                  | Use Library   | Implement Custom            |
| ------------------------- | ------------- | --------------------------- |
| General purpose           | ✓             |                             |
| Production code           | ✓             |                             |
| Learning/education        |               | ✓                           |
| Special requirements      |               | ✓ (e.g., lock-free)         |
| Language built-in         | ✓             |                             |
| Specialized hash function | Partial       | ✓ (custom hash)             |
| Memory-constrained        | Maybe         | ✓ (control layout)          |
| Performance-critical      | ✓ (try first) | ✓ (if library insufficient) |
API Differences Across Languages:
# Key operations across languages
# Python
value = d.get(key, default) # Safe get with default
# Java
V value = map.getOrDefault(key, defaultValue);
# C++
auto it = map.find(key);
V value = (it != map.end()) ? it->second : defaultValue;
# JavaScript
const value = map.get(key) ?? defaultValue;
# Go
value, exists := m[key]
if !exists {
value = defaultValue
}
# Common pattern: Safe get with default
# Python/Java/JS: Built-in method
# C++/Go: Manual check required
Production vs Academic
Real-World vs Interview Differences:
| Aspect             | Interview                   | Production                               |
| ------------------ | --------------------------- | ---------------------------------------- |
| Implementation     | From scratch                | Use library                              |
| Hash function      | Simple (e.g., % table_size) | Robust (MurmurHash, SipHash)             |
| Collision handling | Chaining or linear probing  | Sophisticated (Robin Hood, Swiss Tables) |
| Error handling     | Minimal                     | Comprehensive                            |
| Thread safety      | Single-threaded             | Concurrent, lock-free                    |
| Testing            | Few test cases              | Extensive testing, fuzzing               |
| Edge cases         | Basic                       | Comprehensive (overflow, attacks, etc.)  |
| Documentation      | None                        | Required                                 |
| Performance        | Asymptotic analysis         | Profiling, benchmarking                  |
| Security           | Ignored                     | Critical (DoS prevention)                |
Production Concerns Not in Interviews:
-
Thread Safety:
# Interview: Simple hash table (not thread-safe)
class HashTable:
    def put(self, key, value):
        ...  # simple insertion

# Production: Thread-safe with locks
import threading

class ThreadSafeHashTable:
    def __init__(self):
        self.lock = threading.RLock()
        self.table = {}

    def put(self, key, value):
        with self.lock:
            self.table[key] = value

# Or use built-in shared structures:
from multiprocessing import Manager
manager = Manager()
shared_dict = manager.dict()  # Safe to share across processes
-
Memory Management:
# Interview: Ignore memory details
# Production: Monitor memory usage
import tracemalloc

tracemalloc.start()
hash_table = {}
for i in range(1000000):
    hash_table[i] = i
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")
print(f"Peak memory usage: {peak / 10**6}MB")
tracemalloc.stop()
-
Security (Hash Collision Attacks):
# Interview: Use simple hash
def simple_hash(key):
    return sum(ord(c) for c in str(key))

# Production: Randomized hash to prevent DoS
import hashlib
import secrets

class SecureHashTable:
    def __init__(self):
        self.salt = secrets.token_bytes(16)  # Random per instance

    def _hash(self, key):
        return hashlib.sha256(
            str(key).encode() + self.salt
        ).digest()
-
Monitoring and Metrics:
# Production: Track performance metrics
class MonitoredHashTable:
    def __init__(self, capacity=64):
        self.table = {}
        self.capacity = capacity  # Nominal capacity for the load-factor metric
        self.hit_count = 0
        self.miss_count = 0
        self.collision_count = 0

    def get(self, key):
        if key in self.table:
            self.hit_count += 1
            return self.table[key]
        else:
            self.miss_count += 1
            return None

    def metrics(self):
        total = self.hit_count + self.miss_count
        hit_rate = self.hit_count / total if total > 0 else 0
        return {
            'hit_rate': hit_rate,
            'collision_rate': self.collision_count / total if total > 0 else 0,
            'load_factor': len(self.table) / self.capacity
        }
Constant Factors in Practice:
Despite identical Big-O bounds, implementations can vary 10x in speed:
import time
def benchmark_hash_implementations():
n = 1000000
keys = list(range(n))
# Simple division method
start = time.perf_counter()
hash1 = [hash(k) % n for k in keys]
time1 = time.perf_counter() - start
# Multiplication method
start = time.perf_counter()
A = 0.6180339887 # Golden ratio
hash2 = [int(n * (k * A % 1)) for k in keys]
time2 = time.perf_counter() - start
# Cryptographic hash (overkill for hash table)
import hashlib
start = time.perf_counter()
hash3 = [int(hashlib.md5(str(k).encode()).hexdigest(), 16) % n for k in keys]
time3 = time.perf_counter() - start
print(f"Division method: {time1:.3f}s")
print(f"Multiplication method: {time2:.3f}s")
print(f"Cryptographic hash: {time3:.3f}s")
# Typical output:
# Division method: 0.15s
# Multiplication method: 0.18s (slightly slower, better distribution)
# Cryptographic hash: 8.50s (much slower, unnecessary for hash table)
When Asymptotic Guarantees Don't Match Reality:
-
Small datasets (n < 100):
- O(1) hash table vs O(n) linear search
- Linear search often faster due to:
- No hash computation overhead
- Better cache locality
- No resizing
-
Cache effects dominate:
- Chaining: O(1 + α) average
- Linear probing: O(1 / (1-α)) average
- But linear probing 2-3x faster due to cache locality
-
Dynamic resizing:
- Amortized O(1) insertion
- But occasional O(n) spike
- Can cause tail latency issues in real-time systems
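A quick harness for checking the small-dataset claim on your own hardware; the names here are arbitrary, and the crossover point varies by machine and interpreter, so measure rather than assume:

```python
import timeit

# For tiny collections, a linear scan over a list can compete with a
# dict lookup: no hash call, contiguous memory, no resize machinery.
pairs = [("alpha", 1), ("beta", 2), ("gamma", 3)]
as_dict = dict(pairs)

def scan(key):
    """O(n) linear search over the list of tuples."""
    for k, v in pairs:
        if k == key:
            return v
    return None

t_list = timeit.timeit(lambda: scan("gamma"), number=100_000)
t_dict = timeit.timeit(lambda: as_dict.get("gamma"), number=100_000)
print(f"list scan: {t_list:.4f}s, dict get: {t_dict:.4f}s")
```

On CPython the dict often still wins even at n = 3 because its lookup path is heavily optimized; the linear scan tends to pull ahead mainly in languages where the hash call itself is expensive, which is exactly why profiling beats intuition here.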
Best Practices from Production:
-
Pre-size if possible:
# Bad: Multiple resizes
d = {}
for i in range(1000000):
    d[i] = i

# Good: Pre-allocate
d = dict.fromkeys(range(1000000))
for i in range(1000000):
    d[i] = i
-
Use appropriate data structure:
# Counting: Use Counter
from collections import Counter
counts = Counter(items)  # Optimized for frequency counting

# Ordered: Use OrderedDict or dict (Python 3.7+)
from collections import OrderedDict
ordered = OrderedDict()  # Explicit ordering guarantee

# Default values: Use defaultdict
from collections import defaultdict
groups = defaultdict(list)  # Auto-initialize lists
for item in items:
    groups[get_key(item)].append(item)
-
Profile before optimizing:
import cProfile
import pstats

def hash_intensive_function():
    d = {}
    for i in range(100000):
        d[i] = i
        _ = d.get(i)

profiler = cProfile.Profile()
profiler.enable()
hash_intensive_function()
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)  # Top 10 functions
8. Edge Cases and Limitations
8.1 Known Limitations
Structural Limitations
Fundamental Limitations of Hash Tables:
-
No Ordering:
# Hash tables don't maintain order (except Python 3.7+ dict insertion order)
hash_table = {3: "three", 1: "one", 2: "two"}

# Can't efficiently:
# - Iterate in sorted order
# - Find minimum/maximum key
# - Get range of keys [a, b]
# - Find predecessor/successor

# Workaround: Extract and sort
sorted_items = sorted(hash_table.items())  # O(n log n)

# Better alternative: Use a sorted structure if ordering is the
# primary requirement
from sortedcontainers import SortedDict
sorted_dict = SortedDict({3: "three", 1: "one", 2: "two"})
print(list(sorted_dict.keys()))  # Always sorted: [1, 2, 3]
-
No Range Queries:
# Can't efficiently find all keys in range [low, high]
def range_query(hash_table, low, high):
    # Must check every key: O(n)
    return {k: v for k, v in hash_table.items() if low <= k <= high}

# BST alternative: O(log n + k) where k = number of results
# Segment tree: O(log n + k)
-
Rehashing Cost:
# Occasional O(n) spike during resize
import time

hash_table = {}
times = []
for i in range(100000):
    start = time.perf_counter()
    hash_table[i] = i
    elapsed = time.perf_counter() - start
    if elapsed > 0.001:  # More than 1ms
        times.append((i, elapsed))
        print(f"Slow insertion at i={i}: {elapsed*1000:.2f}ms")

# Output shows spikes at resize thresholds (capacity × load factor,
# doubling each time), e.g.:
# Slow insertion at i=12: 1.2ms
# Slow insertion at i=24: 2.3ms
# Slow insertion at i=48: 4.1ms
# Slow insertion at i=96: 7.8ms
-
Hash Function Dependency:
# Poor hash function → O(n) worst case
class BadHashKey:
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 42  # All objects hash to the same value!

hash_table = {}
for i in range(1000):
    key = BadHashKey(i)
    hash_table[key] = i

# All keys collide → effectively a linked list → O(n) operations
-
Memory Overhead:
# Hash table keeps load factor < 1 (usually <= 0.75),
# so 25-33% of slots are empty by design
import sys

# Sorted array: No slot overhead
sorted_array = list(range(1000000))
print(f"Sorted array: {sys.getsizeof(sorted_array) / 10**6:.2f}MB")

# Hash table: Load factor overhead plus per-entry metadata
hash_table = {i: i for i in range(1000000)}
print(f"Hash table: {sys.getsizeof(hash_table) / 10**6:.2f}MB")

# Representative output (container overhead only; sys.getsizeof does
# not count the element objects themselves):
# Sorted array: 8.00MB
# Hash table: 36.00MB (4.5x more!)
Problems Hash Tables Cannot Solve Efficiently:
| Problem           | Hash Table      | Better Alternative     | Reason                 |
| ----------------- | --------------- | ---------------------- | ---------------------- |
| Find kth smallest | O(n log n) sort | Quickselect O(n) avg   | Need ordering          |
| Range sum query   | O(n) scan       | Segment tree O(log n)  | No range support       |
| Prefix matching   | O(n) scan       | Trie O(k)              | Need prefix structure  |
| Nearest neighbor  | O(n) scan       | K-D tree O(log n)      | Need spatial structure |
| Interval queries  | O(n) scan       | Interval tree O(log n) | Need range support     |
| Sorted iteration  | O(n log n) sort | BST O(n)               | No ordering            |
| Median finding    | O(n log n) sort | Two heaps O(log n)     | Need ordering          |
Scalability Limits:
# Maximum practical size depends on available memory
import sys
def estimate_max_size(available_memory_gb=8):
"""
Estimate maximum hash table size.
"""
# Assume 64-byte per entry (key + value + overhead)
bytes_per_entry = 64
load_factor = 0.75
available_bytes = available_memory_gb * 10**9
max_entries = int((available_bytes / bytes_per_entry) * load_factor)
print(f"Available memory: {available_memory_gb}GB")
print(f"Estimated max entries: {max_entries:,}")
print(f"At 1M ops/sec, would take {max_entries / 10**6:.1f} seconds to build")
return max_entries
estimate_max_size(8)
# Output:
# Available memory: 8GB
# Estimated max entries: 93,750,000
# At 1M ops/sec, would take 93.8 seconds to build
# For larger datasets: Use distributed hash tables or approximate structures
When NOT to Use
Anti-Patterns and Failure Modes:
-
Small datasets (< 20 elements):
# Bad: Hash table overhead not worth it
user_types = {"admin": 1, "user": 2, "guest": 3}

# Better: Simple list of tuples
user_types = [("admin", 1), ("user", 2), ("guest", 3)]

# Lookup: O(n) but often faster in practice for small n
def get_type(name):
    for n, t in user_types:
        if n == name:
            return t
    return None
-
Need sorted/ordered access:
# Bad: Hash table + sorting
events = {}
for event in event_stream:
    events[event.timestamp] = event
sorted_events = sorted(events.items())  # O(n log n) every time!

# Better: Use a sorted structure from the start
from sortedcontainers import SortedDict
events = SortedDict()
for event in event_stream:
    events[event.timestamp] = event
# Already sorted: O(n) to iterate in order, no repeated sorting
-
Mutable keys:
# Bad: Using a mutable object as key
key = [1, 2, 3]
hash_table = {}
hash_table[key] = "value"  # TypeError: unhashable type: 'list'

# Also bad: Hashable object whose hash changes after insertion
class MutableKey:
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        return self.value == other.value

key = MutableKey(5)
hash_table = {}
hash_table[key] = "value"
print(hash_table[key])  # Works
key.value = 10          # Mutate key
print(hash_table[key])  # May not find it! Hash changed.

# Correct: Use immutable keys
key = (1, 2, 3)  # Tuple
hash_table[key] = "value"  # Works correctly
-
Real-time systems with strict latency:
# Bad: Resizing causes unpredictable latency spikes
hash_table = {}
for i in range(1000000):
    hash_table[i] = i  # Occasional O(n) spike during resize

# Better for real-time: Pre-allocate or use Cuckoo hashing
hash_table = dict.fromkeys(range(1000000))  # Pre-allocate
# Or use Cuckoo hashing: O(1) worst-case lookup
-
Prefix/fuzzy matching:
# Bad: Hash table for prefix search
dictionary = {"apple", "application", "apply", "banana"}
hash_set = set(dictionary)

def find_prefix(prefix):
    # O(n) scan
    return [word for word in hash_set if word.startswith(prefix)]

# Better: Use a Trie
class Trie:
    # O(k) search where k = prefix length
    pass
Extreme Conditions
Handling Extreme Inputs:
-
Maximum integer keys:
# Large integers import sys hash_table = {} max_int = sys.maxsize # 2^63 - 1 on 64-bit hash_table[max_int] = "max" hash_table[-max_int] = "min" # Hash function should handle full range print(hash(max_int)) # Works in Python # Potential issue: Hash collision with large integers # Python's hash is identity for small integers, but hashes large ones -
Memory pressure:
# What happens when running out of memory? import sys hash_table = {} try: for i in range(10**9): # Try to insert 1 billion elements hash_table[i] = [0] * 1000 # Large values if i % 1000000 == 0: print(f"Inserted {i:,} elements") except MemoryError: print(f"Out of memory at {i:,} elements") print(f"Table size: {sys.getsizeof(hash_table) / 10**9:.2f}GB") # Graceful degradation: # 1. Use approximate structures (Bloom filter, Count-Min Sketch) # 2. Distributed hash tables # 3. Disk-based hash tables -
Adversarial input (Hash collision attacks):
# Attacker crafts keys that all hash to same value # → All collide → O(n) operations → DoS # Example: Predictable hash function def bad_hash(s): return sum(ord(c) for c in s) % 1000 # Attacker generates many strings with same hash: # "abc", "bac", "cab" all have same character sum # Defense 1: Randomized hashing import random import hashlib class SecureHashTable: def __init__(self): self.salt = random.randbytes(16) def _hash(self, key): return hashlib.sha256(str(key).encode() + self.salt).digest() # Defense 2: Switch to tree when chain too long (Java 8 HashMap) class TreeifyingHashTable: TREEIFY_THRESHOLD = 8 def _handle_collision(self, bucket): if len(bucket) > self.TREEIFY_THRESHOLD: # Convert linked list to red-black tree bucket = RedBlackTree(bucket) # Now O(log n) even with all collisions -
Numerical overflow in hash function:
# Multiplication can overflow in languages without arbitrary precision
def hash_with_overflow(key):
    prime = 31
    hash_val = 0
    for char in str(key):
        hash_val = hash_val * prime + ord(char)
        # In C/Java: this can overflow!
        # Python: no problem, arbitrary precision
    return hash_val

# C/Java-style solution: Use modulo to keep the value in bounds
def safe_hash(key, modulo=2**32):
    prime = 31
    hash_val = 0
    for char in str(key):
        hash_val = (hash_val * prime + ord(char)) % modulo
    return hash_val
8.2 Edge Cases
Input Edge Cases
Comprehensive Edge Case Testing:
class HashTableTester:
"""
Test hash table with all edge cases.
"""
def test_empty_table(self, hash_table):
"""Empty table operations."""
assert hash_table.size() == 0
assert hash_table.get("nonexistent") is None
assert hash_table.delete("nonexistent") is False
assert list(hash_table.keys()) == []
def test_single_element(self, hash_table):
"""Single element edge case."""
hash_table.put("only", "value")
assert hash_table.size() == 1
assert hash_table.get("only") == "value"
hash_table.delete("only")
assert hash_table.size() == 0
def test_duplicate_keys(self, hash_table):
"""Duplicate insertions should update."""
hash_table.put("key", "value1")
hash_table.put("key", "value2")
assert hash_table.size() == 1
assert hash_table.get("key") == "value2"
def test_null_none_keys(self, hash_table):
"""None/null as key."""
# Python allows None as key
hash_table.put(None, "null_value")
assert hash_table.get(None) == "null_value"
# Empty string as key
hash_table.put("", "empty")
assert hash_table.get("") == "empty"
def test_special_values(self, hash_table):
"""Special numeric values."""
hash_table.put(0, "zero")
hash_table.put(-0, "negative_zero") # Same as 0 in Python
assert hash_table.size() == 1 # -0 == 0
# Infinity
hash_table.put(float('inf'), "infinity")
assert hash_table.get(float('inf')) == "infinity"
# NaN (tricky: NaN != NaN)
nan = float('nan')
hash_table.put(nan, "not_a_number")
# May not be retrievable: hash(nan) works but nan != nan
def test_maximum_values(self, hash_table):
"""Maximum integer values."""
import sys
max_int = sys.maxsize
hash_table.put(max_int, "max")
hash_table.put(-max_int, "min")
assert hash_table.get(max_int) == "max"
def test_unicode_keys(self, hash_table):
"""Unicode and special characters."""
hash_table.put("hello", "english")
hash_table.put("こんにちは", "japanese")
hash_table.put("🔥", "emoji")
assert hash_table.size() == 3
def test_collision_scenario(self, hash_table):
"""Force collisions (if using simple hash)."""
# Keys that may collide with simple hash
keys = ["abc", "bac", "cab"] # Same character sum
for i, key in enumerate(keys):
hash_table.put(key, i)
# All should be retrievable
assert hash_table.get("abc") == 0
assert hash_table.get("bac") == 1
assert hash_table.get("cab") == 2
def test_resize_boundary(self, hash_table):
"""Test around resize threshold."""
# Assuming load factor = 0.75, capacity = 16
# Resize should occur at 12th element
for i in range(15):
hash_table.put(i, i)
print(f"Size: {hash_table.size()}, Capacity: {hash_table.capacity()}")
def test_delete_nonexistent(self, hash_table):
"""Delete key that doesn't exist."""
result = hash_table.delete("nonexistent")
assert result is False or result is None
def test_update_during_iteration(self, hash_table):
"""Concurrent modification detection."""
hash_table.put(1, "one")
hash_table.put(2, "two")
try:
for key in hash_table.keys():
hash_table.put(3, "three") # Modify during iteration
# Some implementations throw ConcurrentModificationException
except RuntimeError:
pass # Expected in some implementations
Edge Case Categories:
-
Size edge cases:
- Empty table (size = 0)
- Single element (size = 1)
- At resize threshold (size = capacity * load_factor)
- Maximum size (limited by memory)
-
Key edge cases:
- None/null keys
- Empty string keys
- Very long keys (10KB+ strings)
- Duplicate keys (should update)
- Non-existent keys (lookups)
-
Value edge cases:
- None/null values
- Large values (stress memory)
- Mutable values (should be allowed)
-
Numeric edge cases:
- Zero (0, -0, 0.0)
- Negative numbers
- Maximum/minimum integers
- Infinity (float('inf'))
- NaN (float('nan'))
-
String edge cases:
- Empty string ("")
- Unicode strings
- Very long strings
- Strings with special characters
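Among these, NaN deserves a short demonstration: hash(NaN) is well-defined, but NaN != NaN. CPython's dict compares identity before equality, so only the very same NaN object finds the entry:

```python
# NaN as a dict key: hash() works, but nan != nan.
# CPython checks `is` before `==` during lookup, so retrieval succeeds
# only with the identical NaN object.
nan = float('nan')
d = {nan: "stored"}

print(d[nan])             # "stored" -- identity match with the stored key
print(float('nan') in d)  # False -- a different NaN object never matches
```

This is a good reason to normalize or reject NaN keys at the API boundary rather than let them silently become unreachable entries.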
8.3 Error Handling and Robustness
Failure Modes
Common Failure Scenarios:
class RobustHashTable:
"""
Hash table with comprehensive error handling.
"""
def __init__(self, capacity=16, load_factor=0.75):
if capacity <= 0:
raise ValueError(f"Capacity must be positive, got {capacity}")
if not (0 < load_factor < 1):
raise ValueError(f"Load factor must be in (0, 1), got {load_factor}")
self.capacity = capacity
self.load_factor = load_factor
self.size = 0
self.table = [[] for _ in range(capacity)]
def put(self, key, value):
"""
Insert with error handling.
"""
# Validate key
if key is None:
raise KeyError("None is not a valid key")
# Check if key is hashable
try:
hash_val = hash(key)
except TypeError as e:
raise TypeError(f"Key {key} is not hashable: {e}")
# Check memory availability before resize
if self._should_resize():
try:
self._resize()
except MemoryError:
raise MemoryError("Cannot resize: out of memory")
# Insert
index = hash_val % self.capacity
bucket = self.table[index]
for i, (k, v) in enumerate(bucket):
if k == key:
bucket[i] = (key, value) # Update
return
bucket.append((key, value))
self.size += 1
def get(self, key, default=None):
"""
Safe get with default value.
"""
try:
hash_val = hash(key)
except TypeError:
return default # Unhashable key
index = hash_val % self.capacity
bucket = self.table[index]
for k, v in bucket:
if k == key:
return v
return default
def delete(self, key):
"""
Delete with error handling.
"""
try:
hash_val = hash(key)
except TypeError as e:
raise TypeError(f"Cannot delete unhashable key: {e}")
index = hash_val % self.capacity
bucket = self.table[index]
for i, (k, v) in enumerate(bucket):
if k == key:
del bucket[i]
self.size -= 1
return True
# Key not found
return False
def _should_resize(self):
"""Check if resize needed."""
return self.size >= self.capacity * self.load_factor
def _resize(self):
"""
Resize with error handling.
"""
old_table = self.table
self.capacity *= 2
try:
self.table = [[] for _ in range(self.capacity)]
except MemoryError:
# Restore old state
self.capacity //= 2
raise
self.size = 0
for bucket in old_table:
for key, value in bucket:
self.put(key, value)
Error Categories:
-
Invalid Input:
- Non-hashable keys
- None as key (if not allowed)
- Invalid capacity/load factor
-
Resource Exhaustion:
- Out of memory during resize
- Too many elements (platform limits)
-
Concurrent Modification:
- Modifying table during iteration
- Thread safety violations
-
Corruption:
- Hash function changes for same key
- Mutable key modified after insertion
Recovery Mechanisms:
class ResilientHashTable:
"""
Hash table with recovery mechanisms.
"""
def __init__(self):
self.table = {}
self.backup = None
self.operation_log = []
def put(self, key, value):
"""
Transactional put with rollback.
"""
# Create backup before modification
old_value = self.table.get(key)
self.operation_log.append(('put', key, old_value))
try:
self.table[key] = value
except Exception as e:
# Rollback on error
self.rollback()
raise
def batch_put(self, items):
"""
Batch insert with all-or-nothing semantics.
"""
self.create_checkpoint()
try:
for key, value in items:
self.table[key] = value
except Exception as e:
# Rollback all changes
self.restore_checkpoint()
raise Exception(f"Batch put failed, rolled back: {e}")
finally:
self.clear_checkpoint()
def create_checkpoint(self):
"""Create backup for rollback."""
import copy
self.backup = copy.deepcopy(self.table)
def restore_checkpoint(self):
"""Restore from backup."""
if self.backup is not None:
self.table = self.backup
self.backup = None
def clear_checkpoint(self):
"""Clear backup."""
self.backup = None
def rollback(self):
"""Rollback last operation."""
if self.operation_log:
op, key, old_value = self.operation_log.pop()
if op == 'put':
if old_value is None:
del self.table[key]
else:
self.table[key] = old_value
Validation
Data Integrity Checks:
class ValidatedHashTable:
"""
Hash table with validation.
"""
def __init__(self):
self.table = {}
self.checksum = 0
def put(self, key, value):
"""Insert with validation."""
# Validate inputs
self._validate_key(key)
self._validate_value(value)
# Update checksum (xor out the old pair first on update,
# otherwise verify_integrity fails after overwrites)
old_checksum = self.checksum
if key in self.table:
self.checksum ^= hash((key, self.table[key]))
self.checksum ^= hash((key, value))
try:
self.table[key] = value
except Exception:
# Restore checksum on failure
self.checksum = old_checksum
raise
def verify_integrity(self):
"""
Verify hash table integrity.
"""
# Recompute checksum from scratch
computed_checksum = 0
for key, value in self.table.items():
computed_checksum ^= hash((key, value))
if computed_checksum != self.checksum:
raise ValueError("Hash table corrupted: checksum mismatch")
return True
def _validate_key(self, key):
"""Validate key."""
if key is None:
raise ValueError("Key cannot be None")
try:
hash(key)
except TypeError:
raise TypeError(f"Key must be hashable, got {type(key)}")
def _validate_value(self, value):
"""Validate value (optional)."""
# Can add custom validation logic
pass
9. Practical Mastery
9.1 Implementation
Coding from Scratch
Complete Hash Table Implementation (All 12 Topics Integrated)
This section provides production-ready Python implementations for all 12 hash table topics.
Topic 58 & 62: Hash Table Fundamentals & HashMap Implementation
class HashMap:
"""
Complete HashMap implementation with separate chaining.
Demonstrates hash table fundamentals with dynamic resizing.
"""
class _Node:
"""Internal node for chaining."""
__slots__ = ['key', 'value', 'next']
def __init__(self, key, value, next_node=None):
self.key = key
self.value = value
self.next = next_node
def __init__(self, initial_capacity=16, load_factor_threshold=0.75):
"""
Initialize hash map.
Args:
initial_capacity: Initial size of bucket array
load_factor_threshold: Resize when n/m exceeds this
"""
if initial_capacity < 1:
raise ValueError("Capacity must be positive")
if not (0 < load_factor_threshold < 1):
raise ValueError("Load factor must be between 0 and 1")
self._capacity = self._next_power_of_2(initial_capacity)
self._buckets = [None] * self._capacity
self._size = 0
self._load_factor_threshold = load_factor_threshold
self._modification_count = 0 # For fail-fast iterators
def _next_power_of_2(self, n):
"""Find next power of 2 >= n."""
power = 1
while power < n:
power <<= 1
return power
def _hash(self, key):
"""
Compute hash for key.
Uses Python's built-in hash with perturbation for better distribution.
"""
h = hash(key)
# Additional mixing for better distribution
h ^= (h >> 16)
return h
def _get_index(self, key):
"""Get bucket index for key."""
return self._hash(key) & (self._capacity - 1) # Fast modulo for power of 2
def put(self, key, value):
"""
Insert or update key-value pair.
Time: O(1) average, O(n) worst case
Space: O(1)
"""
if key is None:
raise ValueError("Key cannot be None")
index = self._get_index(key)
node = self._buckets[index]
# Search for existing key
while node:
            if node.key == key:
# Update existing
node.value = value
return
node = node.next
# Insert new node at head of chain
new_node = self._Node(key, value, self._buckets[index])
self._buckets[index] = new_node
self._size += 1
self._modification_count += 1
# Check if resize needed
if self._size / self._capacity > self._load_factor_threshold:
self._resize(self._capacity * 2)
def get(self, key, default=None):
"""
Get value for key.
Time: O(1) average, O(n) worst case
Space: O(1)
"""
if key is None:
return default
index = self._get_index(key)
node = self._buckets[index]
while node:
if node.key == key:
return node.value
node = node.next
return default
def remove(self, key):
"""
Remove key-value pair.
Time: O(1) average, O(n) worst case
Space: O(1)
Returns:
Value that was removed, or None if key not found
"""
if key is None:
return None
index = self._get_index(key)
node = self._buckets[index]
prev = None
while node:
if node.key == key:
# Found it
if prev:
prev.next = node.next
else:
self._buckets[index] = node.next
self._size -= 1
self._modification_count += 1
return node.value
prev = node
node = node.next
return None
    def contains_key(self, key):
        """Check if key exists."""
        sentinel = object()
        return self.get(key, sentinel) is not sentinel
def _resize(self, new_capacity):
"""
Resize and rehash all elements.
Time: O(n) where n = number of elements
Space: O(n) for new array
"""
old_buckets = self._buckets
self._capacity = new_capacity
self._buckets = [None] * self._capacity
self._size = 0
# Rehash all elements
for bucket in old_buckets:
node = bucket
while node:
self.put(node.key, node.value)
node = node.next
def keys(self):
"""Return list of all keys."""
result = []
for bucket in self._buckets:
node = bucket
while node:
result.append(node.key)
node = node.next
return result
def values(self):
"""Return list of all values."""
result = []
for bucket in self._buckets:
node = bucket
while node:
result.append(node.value)
node = node.next
return result
def items(self):
"""Return list of (key, value) tuples."""
result = []
for bucket in self._buckets:
node = bucket
while node:
result.append((node.key, node.value))
node = node.next
return result
def __len__(self):
return self._size
def __contains__(self, key):
return self.contains_key(key)
    def __getitem__(self, key):
        sentinel = object()
        value = self.get(key, sentinel)
        if value is sentinel:
            raise KeyError(key)
        return value
def __setitem__(self, key, value):
self.put(key, value)
def __delitem__(self, key):
if self.remove(key) is None:
raise KeyError(key)
def __iter__(self):
"""Iterate over keys."""
return iter(self.keys())
def __repr__(self):
items = ', '.join(f'{k!r}: {v!r}' for k, v in self.items())
return f'{{{items}}}'
def get_load_factor(self):
"""Get current load factor."""
return self._size / self._capacity if self._capacity > 0 else 0
def get_stats(self):
"""Get statistics about hash table."""
chain_lengths = []
for bucket in self._buckets:
length = 0
node = bucket
while node:
length += 1
node = node.next
chain_lengths.append(length)
non_empty = sum(1 for l in chain_lengths if l > 0)
max_chain = max(chain_lengths) if chain_lengths else 0
avg_chain = sum(chain_lengths) / len(chain_lengths) if chain_lengths else 0
return {
'size': self._size,
'capacity': self._capacity,
'load_factor': self.get_load_factor(),
'non_empty_buckets': non_empty,
'max_chain_length': max_chain,
'avg_chain_length': avg_chain,
}
# Example usage and testing
if __name__ == "__main__":
# Create hash map
hm = HashMap()
# Insert elements
hm.put("apple", 5)
hm.put("banana", 3)
hm.put("cherry", 7)
hm.put("date", 2)
# Retrieve elements
print(f"apple: {hm.get('apple')}") # 5
print(f"banana: {hm['banana']}") # 3
# Update
hm["apple"] = 10
print(f"apple updated: {hm['apple']}") # 10
# Check existence
print(f"Contains 'cherry': {hm.contains_key('cherry')}") # True
print(f"Contains 'grape': {'grape' in hm}") # False
# Remove
hm.remove("banana")
print(f"After removing banana: {'banana' in hm}") # False
# Iterate
print("Keys:", hm.keys())
print("Values:", hm.values())
print("Items:", hm.items())
# Statistics
print("Stats:", hm.get_stats())
Topic 63: HashSet Implementation
class HashSet:
"""
HashSet implementation - stores unique elements only.
Built on top of HashMap, values are ignored.
"""
_PRESENT = object() # Dummy value
def __init__(self, initial_capacity=16):
"""Initialize hash set using internal hash map."""
self._map = HashMap(initial_capacity)
def add(self, element):
"""
Add element to set.
Time: O(1) average
Space: O(1)
"""
self._map.put(element, self._PRESENT)
def remove(self, element):
"""
Remove element from set.
Time: O(1) average
Returns: True if element was present, False otherwise
"""
return self._map.remove(element) is not None
def contains(self, element):
"""Check if element exists in set."""
return self._map.contains_key(element)
def size(self):
"""Return number of elements."""
return len(self._map)
def is_empty(self):
"""Check if set is empty."""
return len(self._map) == 0
def clear(self):
"""Remove all elements."""
self._map = HashMap()
def to_list(self):
"""Convert to list."""
return self._map.keys()
def union(self, other):
"""Return union of this set and other set."""
result = HashSet()
for elem in self.to_list():
result.add(elem)
for elem in other.to_list():
result.add(elem)
return result
def intersection(self, other):
"""Return intersection of this set and other set."""
result = HashSet()
for elem in self.to_list():
if other.contains(elem):
result.add(elem)
return result
def difference(self, other):
"""Return elements in this set but not in other."""
result = HashSet()
for elem in self.to_list():
if not other.contains(elem):
result.add(elem)
return result
def is_subset(self, other):
"""Check if this is subset of other."""
for elem in self.to_list():
if not other.contains(elem):
return False
return True
def __len__(self):
return self.size()
def __contains__(self, element):
return self.contains(element)
def __iter__(self):
return iter(self.to_list())
def __repr__(self):
return '{' + ', '.join(repr(e) for e in self.to_list()) + '}'
# Example usage
if __name__ == "__main__":
# Create set
s1 = HashSet()
s1.add(1)
s1.add(2)
s1.add(3)
s1.add(2) # Duplicate ignored
print(f"Set: {s1}") # {1, 2, 3}
print(f"Size: {len(s1)}") # 3
print(f"Contains 2: {2 in s1}") # True
# Set operations
s2 = HashSet()
s2.add(2)
s2.add(3)
s2.add(4)
print(f"Union: {s1.union(s2)}") # {1, 2, 3, 4}
print(f"Intersection: {s1.intersection(s2)}") # {2, 3}
print(f"Difference: {s1.difference(s2)}") # {1}
Topic 59: Hash Functions and Collision Handling
class HashFunctions:
"""Collection of hash function implementations."""
@staticmethod
def division_method(key, table_size):
"""
Division method: h(k) = k mod m
Works well when m is prime and not close to power of 2.
"""
return key % table_size
@staticmethod
def multiplication_method(key, table_size):
"""
Multiplication method: h(k) = floor(m * (k * A mod 1))
        A is a constant in (0, 1), typically the golden ratio conjugate (√5 - 1)/2 ≈ 0.618.
Works well for any table size.
"""
A = 0.6180339887 # Golden ratio
return int(table_size * ((key * A) % 1))
@staticmethod
def string_hash_polynomial(s, table_size, base=31):
"""
Polynomial rolling hash for strings.
h(s) = (s[0] * base^(n-1) + s[1] * base^(n-2) + ... + s[n-1]) mod table_size
Common bases: 31, 33, 37 (small primes)
"""
hash_value = 0
for char in s:
hash_value = (hash_value * base + ord(char))
return hash_value % table_size
@staticmethod
def string_hash_djb2(s):
"""
DJB2 hash function by Dan Bernstein.
Simple and effective for strings.
hash(i) = hash(i-1) * 33 + str[i]
"""
hash_value = 5381
for char in s:
hash_value = ((hash_value << 5) + hash_value) + ord(char)
return hash_value & 0xFFFFFFFF # Keep it 32-bit
    @staticmethod
    def integer_hash_knuth(key):
        """
        Knuth's multiplicative hash.
        Multiply by 2654435761 (a prime close to 2^32/phi) and keep 32 bits.
        """
        return (key * 2654435761) & 0xFFFFFFFF
@staticmethod
def universal_hash(key, a, b, prime, table_size):
"""
Universal hashing: h_ab(k) = ((a*k + b) mod p) mod m
Choose random a, b from [1, p-1] where p is prime > universe size.
Guarantees collision probability <= 1/m for any pair of keys.
"""
return ((a * key + b) % prime) % table_size
@staticmethod
def mix_bits(key):
"""
Bit mixing for better distribution.
Used in Java's HashMap.
"""
key ^= (key >> 20) ^ (key >> 12)
return key ^ (key >> 7) ^ (key >> 4)
class CollisionResolutionStrategies:
"""Implementations of different collision resolution strategies."""
@staticmethod
def linear_probing_sequence(index, i, table_size):
"""
Linear probing: h(k, i) = (h(k) + i) mod m
Probe sequence: index, index+1, index+2, ...
Simple but causes primary clustering.
"""
return (index + i) % table_size
@staticmethod
def quadratic_probing_sequence(index, i, table_size, c1=1, c2=1):
"""
Quadratic probing: h(k, i) = (h(k) + c1*i + c2*i^2) mod m
Probe sequence: index, index+1, index+4, index+9, ...
Reduces primary clustering but can cause secondary clustering.
"""
return (index + c1 * i + c2 * i * i) % table_size
@staticmethod
def double_hashing_sequence(key, i, table_size, hash1_val):
"""
Double hashing: h(k, i) = (h1(k) + i * h2(k)) mod m
Uses two hash functions.
h2(k) must be relatively prime to m (use h2(k) = 1 + (k mod (m-1))).
Best open addressing method - minimal clustering.
"""
hash2_val = 1 + (key % (table_size - 1))
return (hash1_val + i * hash2_val) % table_size
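The three probe sequences above can be compared side by side. A small sketch using the same formulas, with an arbitrarily chosen table size and key:

```python
# First five probe indices each strategy visits for one colliding key
table_size = 11                      # prime table size
key = 47
h1 = key % table_size                # initial slot: 3
h2 = 1 + (key % (table_size - 1))   # second hash for double hashing: 8

linear    = [(h1 + i) % table_size for i in range(5)]
quadratic = [(h1 + i + i * i) % table_size for i in range(5)]  # c1 = c2 = 1
double    = [(h1 + i * h2) % table_size for i in range(5)]

print("linear:   ", linear)     # [3, 4, 5, 6, 7] -- adjacent slots (clustering)
print("quadratic:", quadratic)  # [3, 5, 9, 4, 1]
print("double:   ", double)     # [3, 0, 8, 5, 2]
```

Linear probing walks consecutive slots, which is exactly the primary clustering the docstrings warn about; quadratic and double hashing spread the probes across the table.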
# Example: Hash table with different collision resolution
class OpenAddressingHashMap:
"""Hash map using open addressing with linear probing."""
class _Entry:
__slots__ = ['key', 'value', 'is_deleted']
def __init__(self, key, value):
self.key = key
self.value = value
self.is_deleted = False
def __init__(self, capacity=16):
self._capacity = capacity
self._table = [None] * capacity
self._size = 0
def _hash(self, key):
return hash(key) % self._capacity
    def put(self, key, value):
        """Insert with linear probing.
        Remembers the first tombstone seen so a re-inserted key can reuse it
        without ever ending up in the table twice."""
        if self._size >= self._capacity * 0.7:
            self._resize(self._capacity * 2)
        index = self._hash(key)
        first_deleted = None
        for i in range(self._capacity):
            probe_index = (index + i) % self._capacity
            entry = self._table[probe_index]
            if entry is None:
                # Key is absent; insert into the earliest tombstone if any
                target = probe_index if first_deleted is None else first_deleted
                self._table[target] = self._Entry(key, value)
                self._size += 1
                return
            elif entry.is_deleted:
                if first_deleted is None:
                    first_deleted = probe_index
            elif entry.key == key:
                # Update existing
                entry.value = value
                return
        if first_deleted is not None:
            self._table[first_deleted] = self._Entry(key, value)
            self._size += 1
            return
        raise Exception("Hash table full")
def get(self, key):
"""Search with linear probing."""
index = self._hash(key)
i = 0
while i < self._capacity:
probe_index = (index + i) % self._capacity
entry = self._table[probe_index]
if entry is None:
return None # Not found
elif not entry.is_deleted and entry.key == key:
return entry.value
i += 1
return None
def remove(self, key):
"""Delete with tombstone marking."""
index = self._hash(key)
i = 0
while i < self._capacity:
probe_index = (index + i) % self._capacity
entry = self._table[probe_index]
if entry is None:
return None
elif not entry.is_deleted and entry.key == key:
entry.is_deleted = True
self._size -= 1
return entry.value
i += 1
return None
def _resize(self, new_capacity):
"""Resize and rehash."""
old_table = self._table
self._capacity = new_capacity
self._table = [None] * new_capacity
self._size = 0
for entry in old_table:
if entry and not entry.is_deleted:
self.put(entry.key, entry.value)
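The tombstone marking in `remove` matters because setting a slot back to `None` would terminate later searches too early. A minimal standalone illustration, independent of the class above:

```python
# Slots hold None (never used), TOMBSTONE (deleted), or a (key, value) pair
TOMBSTONE = object()
table = [None] * 8

def probe_insert(key, value):
    i = key % 8
    while table[i] is not None and table[i] is not TOMBSTONE:
        i = (i + 1) % 8
    table[i] = (key, value)

def probe_get(key):
    i = key % 8
    while table[i] is not None:      # only a never-used slot ends the search
        if table[i] is not TOMBSTONE and table[i][0] == key:
            return table[i][1]
        i = (i + 1) % 8
    return None

probe_insert(1, 'a')     # lands in slot 1
probe_insert(9, 'b')     # 9 % 8 == 1 collides, probes to slot 2
table[1] = TOMBSTONE     # delete key 1 with a tombstone, not None
print(probe_get(9))      # 'b' -- the search walks past the tombstone;
                         # writing None into slot 1 would have lost key 9
```

This is the trade-off tombstones buy: deletions stay O(1), at the cost of probe chains that only shrink again on the next resize/rehash.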
Topic 60: Chaining vs Open Addressing - Detailed Comparison
class ChainingVsOpenAddressingDemo:
"""
Demonstration comparing chaining and open addressing.
"""
@staticmethod
def benchmark_insertion(n=10000):
"""Compare insertion performance."""
import time
# Chaining
chaining_map = HashMap()
start = time.time()
for i in range(n):
chaining_map.put(i, i * 2)
chaining_time = time.time() - start
# Open addressing
open_map = OpenAddressingHashMap(capacity=n * 2)
start = time.time()
for i in range(n):
open_map.put(i, i * 2)
open_time = time.time() - start
print(f"Insertion of {n} elements:")
print(f" Chaining: {chaining_time:.4f}s")
print(f" Open Addressing: {open_time:.4f}s")
return chaining_time, open_time
@staticmethod
def compare_memory_usage(n=1000):
"""
Compare memory overhead.
Chaining: Each node has key, value, next pointer
Open addressing: Only key, value in array slots
"""
import sys
# Chaining
chaining_map = HashMap()
for i in range(n):
chaining_map.put(i, i)
# Open addressing
open_map = OpenAddressingHashMap(capacity=n * 2)
for i in range(n):
open_map.put(i, i)
# Estimate memory (rough approximation)
chaining_size = sys.getsizeof(chaining_map._buckets)
open_size = sys.getsizeof(open_map._table)
print(f"Memory for {n} elements:")
print(f" Chaining: ~{chaining_size} bytes")
print(f" Open Addressing: ~{open_size} bytes")
return chaining_size, open_size
@staticmethod
def demonstrate_clustering():
"""Show clustering effect in linear probing."""
# Create small table with linear probing
table_size = 10
table = [None] * table_size
# Insert keys that cause clustering
keys_to_insert = [0, 10, 20, 1, 11] # All hash to 0 or 1
print("Inserting keys with linear probing:")
for key in keys_to_insert:
index = key % table_size
i = 0
while i < table_size:
probe_index = (index + i) % table_size
if table[probe_index] is None:
table[probe_index] = key
print(f" Key {key} (hash={index}) → slot {probe_index} (probes={i+1})")
break
i += 1
print(f"\nFinal table: {table}")
print("Notice clustering at indices 0-4!")
# Run demonstrations
if __name__ == "__main__":
demo = ChainingVsOpenAddressingDemo()
print("=== Performance Comparison ===")
demo.benchmark_insertion(10000)
print("\n=== Memory Comparison ===")
demo.compare_memory_usage(1000)
print("\n=== Clustering Demonstration ===")
demo.demonstrate_clustering()
Topic 61: Load Factor and Rehashing
class DynamicHashMap:
"""
Hash map with detailed load factor tracking and rehashing.
"""
def __init__(self, initial_capacity=8, max_load_factor=0.75, min_load_factor=0.25):
"""
Args:
initial_capacity: Starting size
max_load_factor: Resize up when exceeded
min_load_factor: Resize down when below (optional)
"""
self._capacity = initial_capacity
self._buckets = [[] for _ in range(self._capacity)]
self._size = 0
self._max_load_factor = max_load_factor
self._min_load_factor = min_load_factor
self._resize_count = 0
self._resize_history = []
def _hash(self, key):
return hash(key) % self._capacity
def get_load_factor(self):
"""Current load factor α = n/m."""
return self._size / self._capacity if self._capacity > 0 else 0
def put(self, key, value):
"""Insert with automatic resize when load factor exceeded."""
# Check if resize needed BEFORE insertion
if self.get_load_factor() >= self._max_load_factor:
self._resize_up()
index = self._hash(key)
bucket = self._buckets[index]
# Update if exists
for i, (k, v) in enumerate(bucket):
if k == key:
bucket[i] = (key, value)
return
# Insert new
bucket.append((key, value))
self._size += 1
def remove(self, key):
"""Remove with automatic resize down when load factor too low."""
index = self._hash(key)
bucket = self._buckets[index]
for i, (k, v) in enumerate(bucket):
if k == key:
bucket.pop(i)
self._size -= 1
# Resize down if too sparse
if self.get_load_factor() < self._min_load_factor and self._capacity > 8:
self._resize_down()
return v
return None
def _resize_up(self):
"""Double capacity and rehash."""
old_capacity = self._capacity
self._resize(self._capacity * 2)
self._resize_history.append(('up', old_capacity, self._capacity, self._size))
print(f"Resized UP: {old_capacity} → {self._capacity} (size={self._size}, α={self.get_load_factor():.3f})")
def _resize_down(self):
"""Halve capacity and rehash."""
old_capacity = self._capacity
self._resize(self._capacity // 2)
self._resize_history.append(('down', old_capacity, self._capacity, self._size))
print(f"Resized DOWN: {old_capacity} → {self._capacity} (size={self._size}, α={self.get_load_factor():.3f})")
def _resize(self, new_capacity):
"""Rehash all elements to new capacity."""
self._resize_count += 1
old_buckets = self._buckets
self._capacity = new_capacity
self._buckets = [[] for _ in range(self._capacity)]
self._size = 0
# Rehash all elements
for bucket in old_buckets:
for key, value in bucket:
self.put(key, value)
def get_stats(self):
"""Get detailed statistics."""
chain_lengths = [len(bucket) for bucket in self._buckets]
non_empty = sum(1 for length in chain_lengths if length > 0)
return {
'size': self._size,
'capacity': self._capacity,
'load_factor': self.get_load_factor(),
'resize_count': self._resize_count,
'non_empty_buckets': non_empty,
'max_chain_length': max(chain_lengths) if chain_lengths else 0,
'avg_chain_length': sum(chain_lengths) / len(chain_lengths) if chain_lengths else 0,
'resize_history': self._resize_history
}
# Demonstration of load factor and rehashing
if __name__ == "__main__":
print("=== Load Factor and Rehashing Demo ===\n")
hm = DynamicHashMap(initial_capacity=4, max_load_factor=0.75)
print("Inserting elements to trigger resizes:")
for i in range(20):
hm.put(f"key{i}", i)
if i % 5 == 4:
print(f" After {i+1} inserts: size={hm._size}, capacity={hm._capacity}, α={hm.get_load_factor():.3f}")
print("\nRemoving elements to trigger shrink:")
for i in range(18):
hm.remove(f"key{i}")
if i % 5 == 4:
print(f" After {i+1} removes: size={hm._size}, capacity={hm._capacity}, α={hm.get_load_factor():.3f}")
print("\nFinal stats:")
stats = hm.get_stats()
for key, value in stats.items():
print(f" {key}: {value}")
The remaining topics (64-69) apply these building blocks to problem solving.
Topic 64: Two Sum Problem and Variations
class TwoSumProblems:
"""Solutions to Two Sum and its variations using hash tables."""
@staticmethod
def two_sum(nums, target):
"""
Find two numbers that sum to target.
Time: O(n)
Space: O(n)
Args:
nums: List of integers
target: Target sum
Returns:
Indices of two numbers, or None if not found
"""
seen = {} # value → index
for i, num in enumerate(nums):
complement = target - num
if complement in seen:
return [seen[complement], i]
seen[num] = i
return None
@staticmethod
def two_sum_all_pairs(nums, target):
"""
Find ALL pairs that sum to target.
Time: O(n)
Space: O(n)
"""
seen = set()
pairs = set()
for num in nums:
complement = target - num
if complement in seen:
# Add as sorted tuple to avoid duplicates
pair = tuple(sorted([num, complement]))
pairs.add(pair)
seen.add(num)
return list(pairs)
@staticmethod
def two_sum_count(nums, target):
"""
Count number of pairs that sum to target.
Time: O(n)
Space: O(n)
"""
from collections import Counter
count_map = Counter(nums)
pairs_count = 0
for num in count_map:
complement = target - num
if complement in count_map:
if num == complement:
# Same number, use combination formula
n = count_map[num]
pairs_count += n * (n - 1) // 2
elif num < complement:
# Count once to avoid duplicates
pairs_count += count_map[num] * count_map[complement]
return pairs_count
@staticmethod
def three_sum(nums, target=0):
"""
Find all unique triplets that sum to target.
Time: O(n²)
Space: O(n)
Combination of sorting + two-pointer + hash set for dedup.
"""
nums.sort()
result = []
seen_triplets = set()
for i in range(len(nums) - 2):
# Skip duplicates for first element
if i > 0 and nums[i] == nums[i-1]:
continue
# Two sum on remaining array
seen = {}
for j in range(i + 1, len(nums)):
complement = target - nums[i] - nums[j]
if complement in seen:
triplet = (nums[i], complement, nums[j])
if triplet not in seen_triplets:
result.append(list(triplet))
seen_triplets.add(triplet)
seen[nums[j]] = j
return result
@staticmethod
    @staticmethod
    def four_sum(nums, target):
        """
        Find all unique quadruplets that sum to target.
        Time: O(n³)
        Space: O(n) per pass, plus the stored quadruplets
        """
        nums.sort()
        result = []
        seen_quads = set()
        n = len(nums)
        for i in range(n - 3):
            if i > 0 and nums[i] == nums[i-1]:
                continue
            for j in range(i + 1, n - 2):
                if j > i + 1 and nums[j] == nums[j-1]:
                    continue
                # Two sum with a hash set on the suffix; dedup via seen_quads
                # (reassigning the loop variable k cannot skip iterations)
                seen = set()
                for k in range(j + 1, n):
                    complement = target - nums[i] - nums[j] - nums[k]
                    if complement in seen:
                        quad = (nums[i], nums[j], complement, nums[k])
                        if quad not in seen_quads:
                            seen_quads.add(quad)
                            result.append(list(quad))
                    seen.add(nums[k])
        return result
@staticmethod
def two_sum_closest(nums, target):
"""
Find pair with sum closest to target.
Time: O(n log n) due to sorting
Space: O(1)
Note: This uses two pointers after sorting, not hash table,
but included for completeness of "two sum" family.
"""
nums.sort()
left, right = 0, len(nums) - 1
closest_sum = float('inf')
result = None
while left < right:
current_sum = nums[left] + nums[right]
if abs(current_sum - target) < abs(closest_sum - target):
closest_sum = current_sum
result = [nums[left], nums[right]]
if current_sum < target:
left += 1
else:
right -= 1
return result, closest_sum
# Test Two Sum variations
if __name__ == "__main__":
solver = TwoSumProblems()
print("=== Two Sum Problem ===")
nums = [2, 7, 11, 15]
target = 9
print(f"Input: {nums}, target: {target}")
print(f"Indices: {solver.two_sum(nums, target)}")
print("\n=== Two Sum All Pairs ===")
nums = [1, 5, 3, 7, 2, 8, 4]
target = 9
print(f"Input: {nums}, target: {target}")
print(f"All pairs: {solver.two_sum_all_pairs(nums, target)}")
print("\n=== Two Sum Count ===")
nums = [1, 1, 1, 1]
target = 2
print(f"Input: {nums}, target: {target}")
print(f"Pair count: {solver.two_sum_count(nums, target)}")
print("\n=== Three Sum ===")
nums = [-1, 0, 1, 2, -1, -4]
print(f"Input: {nums}")
print(f"Triplets: {solver.three_sum(nums, target=0)}")
Topic 65: Group Anagrams Using Hash Map
class AnagramProblems:
"""Solutions using hash maps for anagram problems."""
@staticmethod
def group_anagrams_sorted(strs):
"""
Group anagrams by sorting each string as key.
Time: O(n * k log k) where n = number of strings, k = max length
Space: O(n * k)
Args:
strs: List of strings
Returns:
List of grouped anagrams
"""
from collections import defaultdict
groups = defaultdict(list)
for s in strs:
# Use sorted string as canonical form (key)
key = ''.join(sorted(s))
groups[key].append(s)
return list(groups.values())
@staticmethod
def group_anagrams_count(strs):
"""
Group anagrams by character count.
Time: O(n * k) where n = number of strings, k = max length
Space: O(n * k)
Faster than sorting approach.
"""
from collections import defaultdict
groups = defaultdict(list)
for s in strs:
# Count frequency of each character
count = [0] * 26 # Assuming lowercase a-z
for char in s:
count[ord(char) - ord('a')] += 1
# Use tuple of counts as key (lists aren't hashable)
key = tuple(count)
groups[key].append(s)
return list(groups.values())
@staticmethod
def are_anagrams(s1, s2):
"""
Check if two strings are anagrams.
Time: O(n)
Space: O(1) - fixed size alphabet
"""
if len(s1) != len(s2):
return False
from collections import Counter
return Counter(s1) == Counter(s2)
@staticmethod
def find_anagrams_in_string(s, p):
"""
Find all start indices of anagrams of p in s.
Time: O(n) where n = len(s)
Space: O(1) - fixed size alphabet
Uses sliding window + hash map.
"""
from collections import Counter
if len(p) > len(s):
return []
result = []
p_count = Counter(p)
window_count = Counter()
# Initialize window
for i in range(len(p)):
window_count[s[i]] += 1
if window_count == p_count:
result.append(0)
# Slide window
for i in range(len(p), len(s)):
# Add new character
window_count[s[i]] += 1
# Remove old character
old_char = s[i - len(p)]
window_count[old_char] -= 1
if window_count[old_char] == 0:
del window_count[old_char]
# Check if anagram
if window_count == p_count:
result.append(i - len(p) + 1)
return result
@staticmethod
def group_shifted_strings(strings):
"""
Group strings that are shifts of each other.
E.g., "abc" and "bcd" are shifts (each char shifted by 1).
Time: O(n * k) where n = number of strings, k = max length
Space: O(n * k)
"""
from collections import defaultdict
def get_shift_key(s):
"""
Get canonical form by computing differences between consecutive chars.
            E.g., "abc" → (1, 1), "xyz" → (1, 1), "za" → (1,)
"""
if len(s) <= 1:
return ()
shifts = []
for i in range(1, len(s)):
shift = (ord(s[i]) - ord(s[i-1])) % 26
shifts.append(shift)
return tuple(shifts)
groups = defaultdict(list)
for s in strings:
key = get_shift_key(s)
groups[key].append(s)
return list(groups.values())
# Test anagram problems
if __name__ == "__main__":
solver = AnagramProblems()
print("=== Group Anagrams (Sorting) ===")
strs = ["eat", "tea", "tan", "ate", "nat", "bat"]
print(f"Input: {strs}")
print(f"Groups: {solver.group_anagrams_sorted(strs)}")
print("\n=== Group Anagrams (Counting) ===")
print(f"Groups: {solver.group_anagrams_count(strs)}")
print("\n=== Find Anagrams in String ===")
s = "cbaebabacd"
p = "abc"
print(f"String: {s}, Pattern: {p}")
print(f"Anagram indices: {solver.find_anagrams_in_string(s, p)}")
print("\n=== Group Shifted Strings ===")
strings = ["abc", "bcd", "acef", "xyz", "az", "ba", "a", "z"]
print(f"Input: {strings}")
print(f"Shift groups: {solver.group_shifted_strings(strings)}")
Topic 66: Longest Consecutive Sequence
class LongestConsecutiveSequence:
"""
Find longest consecutive sequence in unsorted array.
Time: O(n)
Space: O(n)
Key insight: Use hash set for O(1) lookup.
Only start counting from numbers that are sequence starts.
"""
@staticmethod
def longest_consecutive(nums):
"""
Find length of longest consecutive elements sequence.
Example: [100, 4, 200, 1, 3, 2] → 4 (sequence: 1, 2, 3, 4)
"""
if not nums:
return 0
num_set = set(nums) # O(n) space, O(1) lookup
longest = 0
for num in num_set:
# Only start counting if this is a sequence start
# (i.e., num-1 is not in set)
if num - 1 not in num_set:
current_num = num
current_length = 1
# Count consecutive numbers
while current_num + 1 in num_set:
current_num += 1
current_length += 1
longest = max(longest, current_length)
return longest
@staticmethod
def longest_consecutive_with_sequence(nums):
"""Return both length and the actual sequence."""
if not nums:
return 0, []
num_set = set(nums)
longest = 0
best_sequence = []
for num in num_set:
if num - 1 not in num_set: # Sequence start
sequence = [num]
current_num = num
while current_num + 1 in num_set:
current_num += 1
sequence.append(current_num)
if len(sequence) > longest:
longest = len(sequence)
best_sequence = sequence
return longest, best_sequence
# Test
if __name__ == "__main__":
solver = LongestConsecutiveSequence()
nums = [100, 4, 200, 1, 3, 2]
print(f"Input: {nums}")
print(f"Longest consecutive length: {solver.longest_consecutive(nums)}")
length, seq = solver.longest_consecutive_with_sequence(nums)
print(f"Sequence: {seq} (length {length})")
Topic 67: Subarray Sum Equals K
class SubarraySumProblems:
"""Subarray sum problems using hash map with prefix sums."""
@staticmethod
def subarray_sum_equals_k(nums, k):
"""
Count subarrays with sum equal to k.
Time: O(n)
Space: O(n)
Key insight: Use prefix sum + hash map.
If prefix_sum[j] - prefix_sum[i] = k, then subarray[i+1:j+1] sums to k.
Equivalently: if prefix_sum - k exists, found valid subarray.
"""
prefix_sum = 0
sum_count = {0: 1} # prefix_sum → count (0:1 handles subarrays from start)
count = 0
for num in nums:
prefix_sum += num
# If (prefix_sum - k) exists, those positions form valid subarrays ending here
if prefix_sum - k in sum_count:
count += sum_count[prefix_sum - k]
# Record current prefix sum
sum_count[prefix_sum] = sum_count.get(prefix_sum, 0) + 1
return count
@staticmethod
def max_subarray_sum_k(nums, k):
"""
Find maximum length subarray with sum equal to k.
Time: O(n)
Space: O(n)
"""
prefix_sum = 0
sum_index = {0: -1} # prefix_sum → earliest index
max_length = 0
for i, num in enumerate(nums):
prefix_sum += num
if prefix_sum - k in sum_index:
length = i - sum_index[prefix_sum - k]
max_length = max(max_length, length)
# Only store first occurrence of each prefix sum
if prefix_sum not in sum_index:
sum_index[prefix_sum] = i
return max_length
@staticmethod
def subarray_sum_divisible_by_k(nums, k):
"""
Count subarrays with sum divisible by k.
Time: O(n)
Space: O(k)
Key: Use modulo arithmetic. If (prefix % k) repeats, difference is divisible by k.
"""
prefix_sum = 0
mod_count = {0: 1} # remainder → count
count = 0
for num in nums:
prefix_sum += num
remainder = prefix_sum % k
# Handle negative remainders
if remainder < 0:
remainder += k
if remainder in mod_count:
count += mod_count[remainder]
mod_count[remainder] = mod_count.get(remainder, 0) + 1
return count
# Test
if __name__ == "__main__":
solver = SubarraySumProblems()
print("=== Subarray Sum Equals K ===")
nums = [1, 1, 1]
k = 2
print(f"Input: {nums}, k={k}")
print(f"Count: {solver.subarray_sum_equals_k(nums, k)}") # 2: [1,1] at two positions
nums = [1, 2, 3]
k = 3
print(f"\nInput: {nums}, k={k}")
print(f"Count: {solver.subarray_sum_equals_k(nums, k)}") # 2: [3] and [1,2]
Topic 69: Rolling Hash (Rabin-Karp Algorithm)
class RollingHash:
"""
Rolling hash for pattern matching (Rabin-Karp algorithm).
Key idea: Efficiently compute hash of sliding window.
Hash of window [i:i+m] can be updated to [i+1:i+m+1] in O(1).
"""
def __init__(self, base=256, mod=10**9 + 7):
"""
Args:
base: Base for polynomial hash (typically 256 for ASCII)
mod: Large prime for modulo to avoid overflow
"""
self.base = base
self.mod = mod
def compute_hash(self, s):
"""Compute hash of string s."""
h = 0
for char in s:
h = (h * self.base + ord(char)) % self.mod
return h
def rabin_karp_search(self, text, pattern):
"""
Find all occurrences of pattern in text.
Time: O(n + m) average case, O(nm) worst case
Space: O(1)
Returns: List of starting indices
"""
n, m = len(text), len(pattern)
if m > n:
return []
# Precompute base^(m-1) mod mod
h_multiplier = pow(self.base, m - 1, self.mod)
# Compute hash of pattern
pattern_hash = self.compute_hash(pattern)
# Compute hash of first window
window_hash = self.compute_hash(text[:m])
result = []
for i in range(n - m + 1):
# Check if hashes match
if window_hash == pattern_hash:
# Verify actual string (to handle hash collisions)
if text[i:i+m] == pattern:
result.append(i)
# Roll the hash to next window
if i < n - m:
# Remove leftmost character
window_hash = (window_hash - ord(text[i]) * h_multiplier) % self.mod
# Add rightmost character
window_hash = (window_hash * self.base + ord(text[i + m])) % self.mod
# Handle negative values
if window_hash < 0:
window_hash += self.mod
return result
def longest_common_substring(self, s1, s2):
"""
Find longest common substring using rolling hash + binary search.
Time: O((n+m) * log(min(n,m)))
Space: O(n+m)
"""
def get_all_hashes(s, length):
"""Get all hashes of substrings of given length."""
if length > len(s):
return set()
h_multiplier = pow(self.base, length - 1, self.mod)
window_hash = self.compute_hash(s[:length])
hashes = {window_hash}
for i in range(len(s) - length):
window_hash = (window_hash - ord(s[i]) * h_multiplier) % self.mod
window_hash = (window_hash * self.base + ord(s[i + length])) % self.mod
hashes.add(window_hash)
return hashes
        # Binary search on length (length 0 is trivially common, so start at 1)
        left, right = 1, min(len(s1), len(s2))
        result_length = 0
while left <= right:
mid = (left + right) // 2
hashes1 = get_all_hashes(s1, mid)
hashes2 = get_all_hashes(s2, mid)
if hashes1 & hashes2: # Common hash found
result_length = mid
left = mid + 1
else:
right = mid - 1
return result_length
# Test
if __name__ == "__main__":
rh = RollingHash()
print("=== Rabin-Karp Pattern Search ===")
text = "ABABCABABA"
pattern = "ABA"
indices = rh.rabin_karp_search(text, pattern)
print(f"Text: {text}")
print(f"Pattern: {pattern}")
print(f"Found at indices: {indices}") # [0, 5, 7]
print("\n=== Longest Common Substring ===")
s1 = "abcdxyz"
s2 = "xyzabcd"
length = rh.longest_common_substring(s1, s2)
print(f"String 1: {s1}")
print(f"String 2: {s2}")
    print(f"Longest common substring length: {length}")  # 4 ("abcd")
9.3 Problem-Solving Practice
Learning Path
Progressive Problem Sequence:
Easy (Master Fundamentals):
- Two Sum
- Contains Duplicate
- Valid Anagram
- Intersection of Two Arrays
- First Unique Character in String
Medium (Build Proficiency):
- Group Anagrams
- Longest Substring Without Repeating Characters
- Subarray Sum Equals K
- Top K Frequent Elements
- Longest Consecutive Sequence
Hard (Demonstrate Mastery):
- Substring with Concatenation of All Words
- LRU Cache
- Design Twitter
- All O(1) Data Structure
- Longest Duplicate Substring
9.4 Integration and Ecosystem
Library Usage
Python Built-in Dictionaries:
# Python's dict is a highly optimized hash table
d = {}                      # Empty dict literal
d = dict()                  # Constructor
d = {"key": "value"}        # Literal with entries
# Common operations
d["k"] = 1                  # Put
value = d.get("k", 0)       # Get with default
del d["k"]                  # Remove
"k" in d                    # Contains
# Collections module
from collections import defaultdict, Counter, OrderedDict
# defaultdict: Auto-initialize missing keys
dd = defaultdict(list)
dd["key"].append("value") # No KeyError
# Counter: Specialized for counting
from collections import Counter
counts = Counter([1, 1, 2, 2, 2, 3])
# Counter({2: 3, 1: 2, 3: 1})
# OrderedDict: Maintains insertion order (Python 3.7+ dicts are ordered by default)
od = OrderedDict()
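One place `OrderedDict` still earns its keep is LRU eviction: `move_to_end` and `popitem(last=False)` are exactly the operations a least-recently-used cache needs. A minimal sketch (the class and method names are illustrative, mirroring the common LRU Cache interface):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache sketch built on OrderedDict."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, key):
        if key not in self.store:
            return -1
        self.store.move_to_end(key)  # mark as most recently used
        return self.store[key]

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put(1, "a")
cache.put(2, "b")
print(cache.get(1))  # "a"; key 1 becomes most recent
cache.put(3, "c")    # capacity exceeded: evicts key 2
print(cache.get(2))  # -1
```

The same structure appears again in the "LRU Cache" problem from the hard list above.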
10. Advanced Concepts
10.1 Advanced Variants and Extensions
Perfect Hashing
class PerfectHashTable:
"""
Perfect hashing for static datasets - no collisions!
Two-level hashing:
- Level 1: Hash to n buckets using h1
- Level 2: Each bucket has its own hash table sized O(m_i^2)
where m_i = number of keys hashing to bucket i
    Guarantees: O(1) worst-case lookup, O(n) expected total space.
"""
    def __init__(self, keys):
        """Build perfect hash table for given static keys."""
        keys = set(keys)  # deduplicate: each key is stored once
        self.n = max(1, len(keys))
        # Level 1: hash keys into n buckets
        buckets = [[] for _ in range(self.n)]
        for key in keys:
            buckets[hash(key) % self.n].append(key)
        # Level 2: per-bucket table of size m^2, using a salted hash;
        # retry salts until the bucket places with zero collisions
        self.secondary = []
        self.salts = []
        for bucket in buckets:
            if not bucket:
                self.secondary.append(None)
                self.salts.append(0)
                continue
            size = len(bucket) ** 2  # O(m^2) size makes a collision-free salt likely
            salt = 0
            while True:
                table = [None] * size
                if all(self._place(table, salt, key) for key in bucket):
                    break
                salt += 1  # collision: discard table and retry with a new salt
            self.secondary.append(table)
            self.salts.append(salt)

    @staticmethod
    def _place(table, salt, key):
        """Try to place key in table; return False on collision."""
        idx = hash((salt, key)) % len(table)
        if table[idx] is not None:
            return False
        table[idx] = key
        return True

    def lookup(self, key):
        """O(1) worst-case lookup."""
        bucket_idx = hash(key) % self.n
        table = self.secondary[bucket_idx]
        if table is None:
            return False
        idx = hash((self.salts[bucket_idx], key)) % len(table)
        return table[idx] == key
Cuckoo Hashing
class CuckooHashTable:
"""
Cuckoo hashing: Worst-case O(1) lookup!
Two hash tables, two hash functions.
If collision, evict existing element and relocate it.
"""
def __init__(self, capacity=16):
self.capacity = capacity
self.table1 = [None] * capacity
self.table2 = [None] * capacity
self.size = 0
self.max_iterations = 500 # Prevent infinite loops
def _hash1(self, key):
return hash(key) % self.capacity
    def _hash2(self, key):
        # Must be independent of _hash1: (hash(key) * 31) % capacity is fully
        # determined by hash(key) % capacity, so every table-1 collision would
        # also collide in table 2 and force endless evictions.
        return hash((key, 0x9E3779B9)) % self.capacity
def insert(self, key):
"""Insert with cuckoo eviction."""
if self.contains(key):
return
current_key = key
for _ in range(self.max_iterations):
# Try table 1
idx1 = self._hash1(current_key)
if self.table1[idx1] is None:
self.table1[idx1] = current_key
self.size += 1
return
# Evict from table 1, try table 2
self.table1[idx1], current_key = current_key, self.table1[idx1]
idx2 = self._hash2(current_key)
if self.table2[idx2] is None:
self.table2[idx2] = current_key
self.size += 1
return
# Evict from table 2, loop continues
self.table2[idx2], current_key = current_key, self.table2[idx2]
# Too many evictions → rebuild with larger tables
self._resize(self.capacity * 2)
self.insert(key)
def contains(self, key):
"""O(1) worst-case lookup!"""
idx1 = self._hash1(key)
idx2 = self._hash2(key)
return self.table1[idx1] == key or self.table2[idx2] == key
def _resize(self, new_capacity):
"""Rebuild tables."""
old_keys = []
for key in self.table1:
if key is not None:
old_keys.append(key)
for key in self.table2:
if key is not None:
old_keys.append(key)
self.capacity = new_capacity
self.table1 = [None] * new_capacity
self.table2 = [None] * new_capacity
self.size = 0
for key in old_keys:
self.insert(key)
10.2 Theoretical Depth
Lower Bounds:
For dictionary operations (insert, delete, search) with n elements:
- Comparison-based: Ω(log n) per operation (like BST)
- Hashing: O(1) expected time (but not comparison-based)
Hash tables escape the comparison-based Ω(log n) bound because hashing inspects the key's bits directly rather than comparing keys, achieving optimal O(1) expected time.
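A quick micro-benchmark makes the practical difference concrete: O(1) expected membership in a hash set versus an O(n) linear scan of a list (exact timings vary by machine; the sizes here are arbitrary):

```python
import timeit

n = 100_000
haystack_list = list(range(n))
haystack_set = set(haystack_list)

# Worst case for the list: the probe element sits at the very end
t_list = timeit.timeit(lambda: (n - 1) in haystack_list, number=50)
t_set = timeit.timeit(lambda: (n - 1) in haystack_set, number=50)
print(f"list scan: {t_list:.4f}s, set lookup: {t_set:.4f}s")
```

On typical hardware the set lookup is several orders of magnitude faster.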
Complexity Classes:
- Dictionary problem is in P (polynomial time solvable)
- Perfect hashing construction is polynomial
- Universal hashing provides probabilistic guarantees
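The last point can be made concrete with the classic Carter–Wegman family h(x) = ((a·x + b) mod p) mod m: for fixed x ≠ y and a random choice of (a, b), the collision probability is at most about 1/m. A minimal sketch (the prime and ranges here are illustrative):

```python
import random

def make_universal_hash(m, p=2_147_483_647):
    """Draw a random member of the family ((a*x + b) mod p) mod m.
    p must be a prime larger than the key universe."""
    a = random.randrange(1, p)  # a != 0
    b = random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % m

h = make_universal_hash(16)
print(sorted({h(x) for x in range(100)}))  # bucket indexes, all in [0, 16)
```

Randomizing the hash function at table-creation time is what defeats adversarially chosen key sets.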
10.3 Cutting-Edge and Future Directions
Recent Innovations:
-
Swiss Tables (Google, 2017):
- Open addressing with SIMD metadata
- Reported roughly 2x faster than std::unordered_map in Google's benchmarks
-
Learned Hash Functions (2020s):
- Use machine learning to learn optimal hash function for dataset
- Outperforms traditional hash functions on specific distributions
-
Quantum Hashing:
- Grover's algorithm: O(√n) quantum search
- Post-quantum cryptographic hash functions
11. Meta-Learning Questions
Practical Judgment:
Use hash tables when:
- Need O(1) average lookup
- Don't need sorted order
- Memory not extremely limited
Use alternatives when:
- Need sorted iteration → BST
- Need range queries → Segment tree
- Memory critical → Bloom filter (approximate)
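To illustrate the last alternative: a Bloom filter answers membership approximately in a fixed number of bits, with possible false positives but no false negatives. A minimal hedged sketch (the bit count, the SHA-256-derived indexing, and the method names are illustrative choices, not a standard API):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: approximate set membership in few bits."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _indexes(self, item):
        # Derive k bit positions from one SHA-256 digest (an illustrative scheme)
        digest = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, item):
        for idx in self._indexes(item):
            self.bits |= 1 << idx

    def might_contain(self, item):
        # False negatives are impossible; false positives occur at a small rate
        return all(self.bits & (1 << idx) for idx in self._indexes(item))

bf = BloomFilter()
for word in ["cat", "dog", "bird"]:
    bf.add(word)
print(bf.might_contain("cat"))    # True
print(bf.might_contain("zebra"))  # almost certainly False
```

The trade: a few bits per element and a tunable false-positive rate, instead of storing the keys themselves.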
Prerequisites:
- Arrays, linked lists
- Big-O notation
- Modular arithmetic
Builds Upon:
- Dynamic arrays → dynamic resizing
- Linked lists → chaining
- Probability → collision analysis
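The probability connection is the birthday bound: hashing n keys uniformly into m slots, P(no collision) = ∏_{i=0}^{n-1} (1 − i/m) ≈ e^{−n(n−1)/(2m)}. A quick check with the classic birthday numbers:

```python
def p_no_collision(n, m):
    """Exact probability that n uniformly hashed keys occupy n distinct slots of m."""
    p = 1.0
    for i in range(n):
        p *= (m - i) / m  # i-th key must avoid the i slots already taken
    return p

print(f"P(23 keys, 365 slots all distinct) = {p_no_collision(23, 365):.3f}")  # ~0.493
```

This is why collisions are expected long before a table is full, and why every hash table needs a collision-resolution strategy.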
Enables:
- Graph adjacency lists
- LRU/LFU caches
- Deduplication algorithms
- Indexing in databases
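The first item above, for instance, is just a hash table mapping each vertex to a hash set of its neighbors (the graph here is illustrative):

```python
from collections import defaultdict

# Adjacency list: vertex -> set of neighbors, both hash-based
graph = defaultdict(set)
for u, v in [("A", "B"), ("A", "C"), ("B", "C")]:
    graph[u].add(v)
    graph[v].add(u)  # undirected edge

print(sorted(graph["A"]))  # ['B', 'C']
```

Both "neighbors of v?" and "edge (u, v) exists?" become O(1) average-time hash lookups.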
Section 9: Practical Mastery
9.1 Complete Python Implementations for All 12 Topics
This section provides production-ready, thoroughly documented implementations for each of the 12 hash table topics. Each implementation includes error handling, edge case management, and comprehensive testing.
Topic 58: Two Sum - Complete Implementation
"""
Two Sum - Find indices of two numbers that add up to target
Time: O(n), Space: O(n)
"""
from typing import List, Optional, Tuple
class TwoSumSolver:
"""
Complete implementation of Two Sum problem with multiple variants.
Key Insight: Hash table allows O(1) lookup of complement (target - current)
"""
def two_sum_basic(self, nums: List[int], target: int) -> Optional[Tuple[int, int]]:
"""
Basic two sum: return indices of two numbers that sum to target.
Args:
nums: List of integers
target: Target sum
Returns:
Tuple of (index1, index2) or None if no solution exists
Examples:
>>> solver = TwoSumSolver()
>>> solver.two_sum_basic([2, 7, 11, 15], 9)
(0, 1)
>>> solver.two_sum_basic([3, 2, 4], 6)
(1, 2)
"""
if not nums or len(nums) < 2:
return None
seen = {} # value -> index mapping
for i, num in enumerate(nums):
complement = target - num
if complement in seen:
return (seen[complement], i)
seen[num] = i
return None
def two_sum_all_pairs(self, nums: List[int], target: int) -> List[Tuple[int, int]]:
"""
Find ALL pairs that sum to target (return indices).
Time: O(n), Space: O(n)
Examples:
>>> solver = TwoSumSolver()
>>> solver.two_sum_all_pairs([1, 1, 1, 1], 2)
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
"""
if not nums or len(nums) < 2:
return []
# Map value to list of indices
value_indices = {}
for i, num in enumerate(nums):
if num not in value_indices:
value_indices[num] = []
value_indices[num].append(i)
result = []
seen_pairs = set()
for i, num in enumerate(nums):
complement = target - num
if complement in value_indices:
for j in value_indices[complement]:
if i < j: # Ensure no duplicates and i < j
pair = (i, j)
if pair not in seen_pairs:
result.append(pair)
seen_pairs.add(pair)
return result
def two_sum_sorted(self, nums: List[int], target: int) -> Optional[Tuple[int, int]]:
"""
Two sum on sorted array using two pointers.
More space efficient: O(1) space vs O(n) for hash table.
Time: O(n), Space: O(1)
"""
if not nums or len(nums) < 2:
return None
left, right = 0, len(nums) - 1
while left < right:
current_sum = nums[left] + nums[right]
if current_sum == target:
return (left, right)
elif current_sum < target:
left += 1
else:
right -= 1
return None
def two_sum_with_values(self, nums: List[int], target: int) -> Optional[Tuple[int, int]]:
"""
Return the actual values instead of indices.
Examples:
>>> solver = TwoSumSolver()
>>> solver.two_sum_with_values([2, 7, 11, 15], 9)
(2, 7)
"""
result = self.two_sum_basic(nums, target)
if result:
i, j = result
return (nums[i], nums[j])
return None
def two_sum_count(self, nums: List[int], target: int) -> int:
"""
Count number of pairs that sum to target.
Time: O(n), Space: O(n)
Examples:
>>> solver = TwoSumSolver()
>>> solver.two_sum_count([1, 1, 1, 1], 2)
6 # C(4,2) = 6 pairs
"""
if not nums or len(nums) < 2:
return 0
from collections import Counter
count_map = Counter(nums)
pairs = 0
for num in count_map:
complement = target - num
if complement in count_map:
if num == complement:
# Same number: choose 2 from count
n = count_map[num]
pairs += n * (n - 1) // 2
elif num < complement:
# Different numbers: multiply counts
pairs += count_map[num] * count_map[complement]
return pairs
def two_sum_closest(self, nums: List[int], target: int) -> Tuple[int, int]:
"""
Find pair with sum closest to target.
Time: O(n log n) due to sorting, Space: O(1)
Examples:
>>> solver = TwoSumSolver()
>>> solver.two_sum_closest([1, 3, 5, 7, 9], 10)
            (0, 4)  # indices of values 1 and 9; sum = 10, exact match
>>> solver.two_sum_closest([1, 2, 3, 4], 10)
            (2, 3)  # indices of values 3 and 4; sum = 7, closest to 10
"""
if not nums or len(nums) < 2:
raise ValueError("Need at least 2 numbers")
nums_sorted = sorted(enumerate(nums), key=lambda x: x[1])
left, right = 0, len(nums_sorted) - 1
closest_diff = float('inf')
result = (0, 1)
while left < right:
current_sum = nums_sorted[left][1] + nums_sorted[right][1]
diff = abs(current_sum - target)
if diff < closest_diff:
closest_diff = diff
result = (nums_sorted[left][0], nums_sorted[right][0])
if current_sum < target:
left += 1
elif current_sum > target:
right -= 1
else:
break # Exact match found
return result
class TwoSumDataStructure:
"""
Design a data structure that supports:
- add(number): Add number to internal data structure
- find(value): Find if there exists any pair that sums to value
LeetCode 170: Two Sum III - Data structure design
"""
def __init__(self):
"""Initialize with empty counter."""
from collections import Counter
self.nums = Counter()
def add(self, number: int) -> None:
"""
Add a number to the data structure.
Time: O(1)
"""
self.nums[number] += 1
def find(self, value: int) -> bool:
"""
Find if there exists any pair of numbers which sum is equal to value.
Time: O(n) where n is number of unique values
Examples:
>>> ds = TwoSumDataStructure()
>>> ds.add(1)
>>> ds.add(3)
>>> ds.add(5)
>>> ds.find(4)
True # 1 + 3 = 4
>>> ds.find(7)
False
"""
for num in self.nums:
complement = value - num
if complement in self.nums:
# If same number, need at least 2 occurrences
if num == complement:
if self.nums[num] >= 2:
return True
else:
return True
return False
# Comprehensive Test Suite for Two Sum
def test_two_sum():
"""Complete test suite covering all edge cases."""
solver = TwoSumSolver()
# Test 1: Basic case
assert solver.two_sum_basic([2, 7, 11, 15], 9) == (0, 1)
# Test 2: Multiple valid pairs (returns first found)
result = solver.two_sum_basic([3, 3], 6)
assert result == (0, 1)
# Test 3: No solution
assert solver.two_sum_basic([1, 2, 3], 10) is None
# Test 4: Empty array
assert solver.two_sum_basic([], 5) is None
# Test 5: Single element
assert solver.two_sum_basic([5], 5) is None
# Test 6: Negative numbers
assert solver.two_sum_basic([-1, -2, -3, -4, -5], -8) == (2, 4)
# Test 7: All pairs
pairs = solver.two_sum_all_pairs([1, 1, 1, 1], 2)
assert len(pairs) == 6 # C(4,2) = 6
# Test 8: Count pairs
assert solver.two_sum_count([1, 1, 1, 1], 2) == 6
# Test 9: Closest sum
assert solver.two_sum_closest([1, 2, 3, 4], 10) in [(2, 3), (3, 2)]
# Test 10: Data structure
ds = TwoSumDataStructure()
ds.add(1)
ds.add(3)
ds.add(5)
assert ds.find(4) == True
assert ds.find(7) == False
ds.add(2)
assert ds.find(7) == True # 2 + 5 = 7
print("All Two Sum tests passed!")
if __name__ == "__main__":
test_two_sum()
Topic 59: Group Anagrams - Complete Implementation
"""
Group Anagrams - Group strings that are anagrams of each other
Time: O(n * k log k) where n = number of strings, k = max string length
Space: O(n * k)
"""
from typing import List, Dict
from collections import defaultdict
class AnagramGrouper:
"""
Complete implementation of anagram grouping with multiple approaches.
Key Insight: Anagrams have same character frequency or same sorted form
"""
def group_anagrams_sorted(self, strs: List[str]) -> List[List[str]]:
"""
Group anagrams using sorted string as key.
Time: O(n * k log k) where k = average string length
Space: O(n * k) for storing all strings
Args:
strs: List of strings
Returns:
List of grouped anagrams
Examples:
>>> grouper = AnagramGrouper()
>>> result = grouper.group_anagrams_sorted(["eat","tea","tan","ate","nat","bat"])
>>> sorted([sorted(group) for group in result])
[['ate', 'eat', 'tea'], ['bat'], ['nat', 'tan']]
"""
if not strs:
return []
anagram_groups = defaultdict(list)
for s in strs:
# Use sorted string as key
key = ''.join(sorted(s))
anagram_groups[key].append(s)
return list(anagram_groups.values())
def group_anagrams_frequency(self, strs: List[str]) -> List[List[str]]:
"""
Group anagrams using character frequency as key.
More efficient for long strings (O(n*k) vs O(n*k log k)).
Time: O(n * k) where k = average string length
Space: O(n * k)
Examples:
>>> grouper = AnagramGrouper()
>>> result = grouper.group_anagrams_frequency(["eat","tea","tan","ate"])
>>> len(result)
2
"""
if not strs:
return []
anagram_groups = defaultdict(list)
for s in strs:
# Create frequency tuple as key (only for lowercase a-z)
count = [0] * 26
for c in s:
count[ord(c) - ord('a')] += 1
# Tuple is hashable, list is not
key = tuple(count)
anagram_groups[key].append(s)
return list(anagram_groups.values())
def group_anagrams_prime(self, strs: List[str]) -> List[List[str]]:
"""
Group anagrams using prime number product as key.
Mathematical approach: each letter maps to prime number.
Product is unique for each anagram group (fundamental theorem of arithmetic).
Time: O(n * k), Space: O(n * k)
        Note: the product grows huge for long strings — Python's big ints never
        overflow, but the arithmetic slows down; fixed-width integers in other
        languages would overflow. Use for educational purposes.
"""
if not strs:
return []
# First 26 prime numbers for a-z
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
anagram_groups = defaultdict(list)
for s in strs:
# Calculate product of primes
key = 1
for c in s:
key *= primes[ord(c) - ord('a')]
anagram_groups[key].append(s)
return list(anagram_groups.values())
def find_anagrams_of_word(self, strs: List[str], word: str) -> List[str]:
"""
Find all anagrams of a specific word in the list.
Time: O(n * k), Space: O(k)
Examples:
>>> grouper = AnagramGrouper()
>>> grouper.find_anagrams_of_word(["eat","tea","tan","ate","bat"], "tea")
['eat', 'tea', 'ate']
"""
if not strs or not word:
return []
# Create frequency signature for target word
target_freq = [0] * 26
for c in word:
target_freq[ord(c) - ord('a')] += 1
target_key = tuple(target_freq)
anagrams = []
for s in strs:
# Create frequency signature for current string
freq = [0] * 26
for c in s:
freq[ord(c) - ord('a')] += 1
if tuple(freq) == target_key:
anagrams.append(s)
return anagrams
def count_anagram_groups(self, strs: List[str]) -> int:
"""
Count number of distinct anagram groups.
Time: O(n * k log k), Space: O(n)
Examples:
>>> grouper = AnagramGrouper()
>>> grouper.count_anagram_groups(["eat","tea","tan","ate","nat","bat"])
3 # [eat,tea,ate], [tan,nat], [bat]
"""
if not strs:
return 0
seen_signatures = set()
for s in strs:
key = ''.join(sorted(s))
seen_signatures.add(key)
return len(seen_signatures)
def largest_anagram_group(self, strs: List[str]) -> List[str]:
"""
Find the largest group of anagrams.
Time: O(n * k log k), Space: O(n * k)
Examples:
>>> grouper = AnagramGrouper()
>>> result = grouper.largest_anagram_group(["eat","tea","tan","ate","nat","bat"])
>>> sorted(result)
['ate', 'eat', 'tea']
"""
if not strs:
return []
groups = self.group_anagrams_sorted(strs)
return max(groups, key=len) if groups else []
class AnagramDictionary:
"""
Data structure for efficient anagram lookups.
Useful for word games like Scrabble.
"""
def __init__(self, word_list: List[str]):
"""
Build anagram dictionary from word list.
Time: O(n * k log k), Space: O(n * k)
"""
self.anagrams = defaultdict(list)
for word in word_list:
key = ''.join(sorted(word.lower()))
self.anagrams[key].append(word)
def find_anagrams(self, word: str) -> List[str]:
"""
Find all anagrams of given word.
Time: O(k log k), Space: O(k)
"""
key = ''.join(sorted(word.lower()))
return self.anagrams.get(key, [])
def is_anagram(self, word1: str, word2: str) -> bool:
"""
Check if two words are anagrams.
Time: O(k log k), Space: O(k)
"""
return sorted(word1.lower()) == sorted(word2.lower())
# Comprehensive Test Suite
def test_group_anagrams():
"""Complete test suite for anagram grouping."""
grouper = AnagramGrouper()
# Test 1: Basic grouping
result = grouper.group_anagrams_sorted(["eat","tea","tan","ate","nat","bat"])
assert len(result) == 3
# Test 2: Empty input
assert grouper.group_anagrams_sorted([]) == []
# Test 3: Single string
assert grouper.group_anagrams_sorted(["a"]) == [["a"]]
# Test 4: No anagrams
result = grouper.group_anagrams_sorted(["abc", "def", "ghi"])
assert len(result) == 3
# Test 5: All anagrams
result = grouper.group_anagrams_sorted(["abc", "bca", "cab", "acb"])
assert len(result) == 1
assert len(result[0]) == 4
# Test 6: Frequency method
result = grouper.group_anagrams_frequency(["eat","tea","tan","ate"])
assert len(result) == 2
# Test 7: Find specific anagrams
anagrams = grouper.find_anagrams_of_word(["eat","tea","tan","ate","bat"], "tea")
assert sorted(anagrams) == ['ate', 'eat', 'tea']
# Test 8: Count groups
assert grouper.count_anagram_groups(["eat","tea","tan","ate","nat","bat"]) == 3
# Test 9: Largest group
largest = grouper.largest_anagram_group(["eat","tea","tan","ate","nat","bat"])
assert len(largest) == 3
# Test 10: Anagram dictionary
word_list = ["listen", "silent", "enlist", "hello", "world"]
dictionary = AnagramDictionary(word_list)
assert sorted(dictionary.find_anagrams("listen")) == ['enlist', 'listen', 'silent']
assert dictionary.is_anagram("listen", "silent") == True
print("All Group Anagrams tests passed!")
if __name__ == "__main__":
test_group_anagrams()
Topic 60: Valid Anagram - Complete Implementation
"""
Valid Anagram - Check if two strings are anagrams
Time: O(n), Space: O(1) for lowercase English letters
"""
from typing import Dict
from collections import Counter
class AnagramValidator:
"""
Complete implementation of anagram validation with multiple approaches.
Key Insight: Anagrams have identical character frequencies
"""
def is_anagram_sorting(self, s: str, t: str) -> bool:
"""
Check if two strings are anagrams using sorting.
Time: O(n log n), Space: O(n)
Args:
s: First string
t: Second string
Returns:
True if s and t are anagrams
Examples:
>>> validator = AnagramValidator()
>>> validator.is_anagram_sorting("anagram", "nagaram")
True
>>> validator.is_anagram_sorting("rat", "car")
False
"""
if len(s) != len(t):
return False
return sorted(s) == sorted(t)
def is_anagram_hashtable(self, s: str, t: str) -> bool:
"""
Check if two strings are anagrams using hash table.
More efficient for long strings: O(n) vs O(n log n).
Time: O(n), Space: O(1) for fixed alphabet size
Examples:
>>> validator = AnagramValidator()
>>> validator.is_anagram_hashtable("anagram", "nagaram")
True
"""
if len(s) != len(t):
return False
return Counter(s) == Counter(t)
def is_anagram_array(self, s: str, t: str) -> bool:
"""
Check anagrams using fixed-size array (only lowercase a-z).
Most space-efficient: O(1) space.
Time: O(n), Space: O(1)
Examples:
>>> validator = AnagramValidator()
>>> validator.is_anagram_array("anagram", "nagaram")
True
"""
if len(s) != len(t):
return False
# Fixed array for 26 lowercase letters
char_count = [0] * 26
for i in range(len(s)):
char_count[ord(s[i]) - ord('a')] += 1
char_count[ord(t[i]) - ord('a')] -= 1
return all(count == 0 for count in char_count)
def is_anagram_unicode(self, s: str, t: str) -> bool:
"""
Check anagrams for Unicode strings (supports all characters).
Time: O(n), Space: O(k) where k = unique characters
Examples:
>>> validator = AnagramValidator()
>>> validator.is_anagram_unicode("你好", "好你")
True
"""
if len(s) != len(t):
return False
return Counter(s) == Counter(t)
def is_anagram_case_insensitive(self, s: str, t: str) -> bool:
"""
Check anagrams ignoring case.
Time: O(n), Space: O(1)
Examples:
>>> validator = AnagramValidator()
>>> validator.is_anagram_case_insensitive("Anagram", "Nagaram")
True
"""
return sorted(s.lower()) == sorted(t.lower())
def is_anagram_ignore_spaces(self, s: str, t: str) -> bool:
"""
Check anagrams ignoring spaces and case.
Time: O(n), Space: O(n)
Examples:
>>> validator = AnagramValidator()
>>> validator.is_anagram_ignore_spaces("conversation", "voices rant on")
True
"""
s_clean = s.replace(" ", "").lower()
t_clean = t.replace(" ", "").lower()
return sorted(s_clean) == sorted(t_clean)
def find_anagram_difference(self, s: str, t: str) -> Dict[str, int]:
"""
Find character frequency differences between two strings.
Useful for debugging or partial anagram matching.
Time: O(n), Space: O(k)
Examples:
>>> validator = AnagramValidator()
>>> validator.find_anagram_difference("aab", "aac")
{'b': 1, 'c': -1}
"""
count_s = Counter(s)
count_t = Counter(t)
diff = {}
all_chars = set(count_s.keys()) | set(count_t.keys())
for char in all_chars:
difference = count_s.get(char, 0) - count_t.get(char, 0)
if difference != 0:
diff[char] = difference
return diff
    def minimum_swaps_to_anagram(self, s: str, t: str) -> int:
        """
        Find minimum character swaps within s to make it an anagram of t.
        Returns -1 if impossible.
        Since every permutation of s is an anagram of s, the answer is 0
        whenever s and t share the same character multiset, and -1 otherwise.
        Time: O(n), Space: O(1) for a fixed alphabet
        Examples:
            >>> validator = AnagramValidator()
            >>> validator.minimum_swaps_to_anagram("abc", "bca")
            0  # Already anagrams
            >>> validator.minimum_swaps_to_anagram("abc", "def")
            -1  # Impossible
        """
        if len(s) != len(t):
            return -1
        # No rearrangement is ever needed: matching multisets means s is
        # already an anagram of t.
        return 0 if Counter(s) == Counter(t) else -1
class AnagramGame:
"""
Anagram-based word game utilities.
"""
def __init__(self, dictionary: list[str]):
"""Initialize with word dictionary."""
self.words = set(word.lower() for word in dictionary)
# Build anagram index
self.anagram_index = {}
for word in self.words:
key = ''.join(sorted(word))
if key not in self.anagram_index:
self.anagram_index[key] = []
self.anagram_index[key].append(word)
def get_valid_anagrams(self, letters: str) -> list[str]:
"""
Get all valid dictionary words that are anagrams of given letters.
Time: O(k log k + m) where k = len(letters), m = anagrams found
Examples:
>>> game = AnagramGame(["listen", "silent", "enlist", "hello"])
>>> sorted(game.get_valid_anagrams("silent"))
['enlist', 'listen', 'silent']
"""
key = ''.join(sorted(letters.lower()))
return self.anagram_index.get(key, [])
def score_anagram(self, word: str) -> int:
"""
Score word based on letter frequency (Scrabble-like).
Examples:
>>> game = AnagramGame([])
>>> game.score_anagram("quiz")
22 # High value letters
"""
# Scrabble letter values
values = {
'a': 1, 'e': 1, 'i': 1, 'o': 1, 'u': 1, 'l': 1, 'n': 1, 's': 1, 't': 1, 'r': 1,
'd': 2, 'g': 2,
'b': 3, 'c': 3, 'm': 3, 'p': 3,
'f': 4, 'h': 4, 'v': 4, 'w': 4, 'y': 4,
'k': 5,
'j': 8, 'x': 8,
'q': 10, 'z': 10
}
return sum(values.get(c.lower(), 0) for c in word)
# Comprehensive Test Suite
def test_valid_anagram():
"""Complete test suite for anagram validation."""
validator = AnagramValidator()
# Test 1: Basic anagrams
assert validator.is_anagram_sorting("anagram", "nagaram") == True
assert validator.is_anagram_sorting("rat", "car") == False
# Test 2: Empty strings
assert validator.is_anagram_sorting("", "") == True
# Test 3: Different lengths
assert validator.is_anagram_sorting("a", "ab") == False
# Test 4: Single character
assert validator.is_anagram_sorting("a", "a") == True
# Test 5: Hash table method
assert validator.is_anagram_hashtable("listen", "silent") == True
# Test 6: Array method
assert validator.is_anagram_array("anagram", "nagaram") == True
# Test 7: Case insensitive
assert validator.is_anagram_case_insensitive("Anagram", "Nagaram") == True
# Test 8: Ignore spaces
assert validator.is_anagram_ignore_spaces("conversation", "voices rant on") == True
# Test 9: Unicode
assert validator.is_anagram_unicode("你好", "好你") == True
# Test 10: Find difference
diff = validator.find_anagram_difference("aab", "aac")
assert diff == {'b': 1, 'c': -1}
# Test 11: Minimum swaps
assert validator.minimum_swaps_to_anagram("abc", "bca") == 0
assert validator.minimum_swaps_to_anagram("abc", "def") == -1
# Test 12: Anagram game
game = AnagramGame(["listen", "silent", "enlist", "hello", "world"])
anagrams = game.get_valid_anagrams("silent")
assert len(anagrams) == 3
assert all(word in anagrams for word in ["listen", "silent", "enlist"])
# Test 13: Scoring
score = game.score_anagram("quiz")
assert score == 22 # q=10, u=1, i=1, z=10
print("All Valid Anagram tests passed!")
if __name__ == "__main__":
test_valid_anagram()
Topic 61: Contains Duplicate - Complete Implementation
"""
Contains Duplicate - Check if array has duplicate elements
Time: O(n), Space: O(n)
"""
from typing import List, Set, Dict
from collections import Counter
class DuplicateDetector:
"""
Complete implementation of duplicate detection with various constraints.
Key Insight: Hash set provides O(1) duplicate checking
"""
def contains_duplicate_set(self, nums: List[int]) -> bool:
"""
Check if array contains any duplicates using set.
Time: O(n), Space: O(n)
Args:
nums: List of integers
Returns:
True if any value appears at least twice
Examples:
>>> detector = DuplicateDetector()
>>> detector.contains_duplicate_set([1,2,3,1])
True
>>> detector.contains_duplicate_set([1,2,3,4])
False
"""
if not nums:
return False
return len(nums) != len(set(nums))
def contains_duplicate_early_return(self, nums: List[int]) -> bool:
"""
Check duplicates with early return (more efficient).
Time: O(n), Space: O(n) worst case
Best case: O(1) if duplicate found early
Examples:
>>> detector = DuplicateDetector()
>>> detector.contains_duplicate_early_return([1,1,2,3])
True # Returns immediately on first duplicate
"""
if not nums:
return False
seen = set()
for num in nums:
if num in seen:
return True
seen.add(num)
return False
def contains_duplicate_ii(self, nums: List[int], k: int) -> bool:
"""
Check if duplicates exist within k indices of each other.
LeetCode 219: Contains Duplicate II
Time: O(n), Space: O(min(n, k))
Args:
nums: List of integers
k: Maximum index difference
Returns:
True if duplicates exist with index difference <= k
Examples:
>>> detector = DuplicateDetector()
>>> detector.contains_duplicate_ii([1,2,3,1], 3)
True # indices 0 and 3, diff = 3
>>> detector.contains_duplicate_ii([1,0,1,1], 1)
True # indices 2 and 3, diff = 1
>>> detector.contains_duplicate_ii([1,2,3,1,2,3], 2)
False
"""
if not nums or k <= 0:
return False
# Sliding window: maintain window of size k+1
window = set()
for i, num in enumerate(nums):
if num in window:
return True
window.add(num)
# Maintain window size
if len(window) > k:
window.remove(nums[i - k])
return False
def contains_duplicate_iii(self, nums: List[int], index_diff: int, value_diff: int) -> bool:
"""
Check if duplicates exist with constraints on both index and value difference.
LeetCode 220: Contains Duplicate III
Find if there exist indices i and j such that:
- abs(i - j) <= index_diff
- abs(nums[i] - nums[j]) <= value_diff
        Time: O(n) with the bucket approach implemented here (a sorted-container
        approach would be O(n log k)); Space: O(min(n, index_diff))
Examples:
>>> detector = DuplicateDetector()
>>> detector.contains_duplicate_iii([1,2,3,1], 3, 0)
True # Exact duplicate within range
>>> detector.contains_duplicate_iii([1,5,9,1,5,9], 2, 3)
False
"""
if not nums or index_diff < 0 or value_diff < 0:
return False
# Bucket approach: each bucket has width value_diff + 1
bucket_width = value_diff + 1
buckets = {}
for i, num in enumerate(nums):
bucket_id = num // bucket_width
# Check current bucket
if bucket_id in buckets:
return True
# Check adjacent buckets
if bucket_id - 1 in buckets and abs(num - buckets[bucket_id - 1]) <= value_diff:
return True
if bucket_id + 1 in buckets and abs(num - buckets[bucket_id + 1]) <= value_diff:
return True
# Add to bucket
buckets[bucket_id] = num
# Remove old bucket
if i >= index_diff:
del buckets[nums[i - index_diff] // bucket_width]
return False
def find_all_duplicates(self, nums: List[int]) -> List[int]:
"""
Find all numbers that appear more than once.
Time: O(n), Space: O(n)
Examples:
>>> detector = DuplicateDetector()
>>> sorted(detector.find_all_duplicates([4,3,2,7,8,2,3,1]))
[2, 3]
"""
if not nums:
return []
counts = Counter(nums)
return [num for num, count in counts.items() if count > 1]
def find_duplicate_single(self, nums: List[int]) -> int:
"""
Find the one duplicate in array where:
- nums contains n+1 integers in range [1, n]
- Only one number appears twice, all others appear once
Floyd's cycle detection: O(n) time, O(1) space
Examples:
>>> detector = DuplicateDetector()
>>> detector.find_duplicate_single([1,3,4,2,2])
2
>>> detector.find_duplicate_single([3,1,3,4,2])
3
"""
# Treat array as linked list: nums[i] points to nums[nums[i]]
# Duplicate creates a cycle
# Phase 1: Find intersection point in cycle
slow = fast = nums[0]
while True:
slow = nums[slow]
fast = nums[nums[fast]]
if slow == fast:
break
# Phase 2: Find entrance to cycle (duplicate)
slow = nums[0]
while slow != fast:
slow = nums[slow]
fast = nums[fast]
return slow
def count_duplicates(self, nums: List[int]) -> int:
"""
Count how many numbers appear more than once.
Time: O(n), Space: O(n)
Examples:
>>> detector = DuplicateDetector()
>>> detector.count_duplicates([1,2,3,1,2,4])
2 # 1 and 2 are duplicates
"""
if not nums:
return 0
counts = Counter(nums)
return sum(1 for count in counts.values() if count > 1)
def remove_duplicates_inplace(self, nums: List[int]) -> int:
"""
Remove duplicates from sorted array in-place.
Return new length.
Time: O(n), Space: O(1)
Examples:
>>> detector = DuplicateDetector()
>>> nums = [1,1,2,2,3,4,4]
>>> length = detector.remove_duplicates_inplace(nums)
>>> length
4
>>> nums[:length]
[1, 2, 3, 4]
"""
if not nums:
return 0
write_idx = 1
for i in range(1, len(nums)):
if nums[i] != nums[i - 1]:
nums[write_idx] = nums[i]
write_idx += 1
return write_idx
class DuplicateAnalyzer:
"""
Advanced duplicate analysis and statistics.
"""
def __init__(self, nums: List[int]):
"""Initialize with number array."""
self.nums = nums
self.frequency = Counter(nums)
def get_duplicate_stats(self) -> Dict:
"""
Get comprehensive duplicate statistics.
Returns:
Dictionary with various statistics
Examples:
>>> analyzer = DuplicateAnalyzer([1,2,2,3,3,3,4])
>>> stats = analyzer.get_duplicate_stats()
>>> stats['total_duplicates']
2
>>> stats['most_frequent']
3
"""
duplicates = {num: count for num, count in self.frequency.items() if count > 1}
return {
'total_elements': len(self.nums),
'unique_elements': len(self.frequency),
'total_duplicates': len(duplicates),
'duplicate_elements': list(duplicates.keys()),
'most_frequent': max(self.frequency, key=self.frequency.get) if self.frequency else None,
'max_frequency': max(self.frequency.values()) if self.frequency else 0,
'duplicate_count': sum(count - 1 for count in duplicates.values()),
}
def find_k_most_frequent(self, k: int) -> List[int]:
"""
Find k most frequent elements.
Time: O(n log k), Space: O(n)
Examples:
>>> analyzer = DuplicateAnalyzer([1,1,1,2,2,3])
>>> analyzer.find_k_most_frequent(2)
[1, 2]
"""
from heapq import nlargest
return nlargest(k, self.frequency.keys(), key=self.frequency.get)
# Comprehensive Test Suite
def test_contains_duplicate():
"""Complete test suite for duplicate detection."""
detector = DuplicateDetector()
# Test 1: Basic duplicate
assert detector.contains_duplicate_set([1,2,3,1]) == True
assert detector.contains_duplicate_set([1,2,3,4]) == False
# Test 2: Empty array
assert detector.contains_duplicate_set([]) == False
# Test 3: Single element
assert detector.contains_duplicate_set([1]) == False
# Test 4: All duplicates
assert detector.contains_duplicate_set([1,1,1,1]) == True
# Test 5: Early return optimization
assert detector.contains_duplicate_early_return([1,1,2,3]) == True
# Test 6: Contains Duplicate II
assert detector.contains_duplicate_ii([1,2,3,1], 3) == True
assert detector.contains_duplicate_ii([1,0,1,1], 1) == True
assert detector.contains_duplicate_ii([1,2,3,1,2,3], 2) == False
# Test 7: Contains Duplicate III
assert detector.contains_duplicate_iii([1,2,3,1], 3, 0) == True
assert detector.contains_duplicate_iii([1,5,9,1,5,9], 2, 3) == False
# Test 8: Find all duplicates
assert sorted(detector.find_all_duplicates([4,3,2,7,8,2,3,1])) == [2, 3]
# Test 9: Find single duplicate (Floyd's algorithm)
assert detector.find_duplicate_single([1,3,4,2,2]) == 2
assert detector.find_duplicate_single([3,1,3,4,2]) == 3
# Test 10: Count duplicates
assert detector.count_duplicates([1,2,3,1,2,4]) == 2
# Test 11: Remove duplicates in-place
nums = [1,1,2,2,3,4,4]
length = detector.remove_duplicates_inplace(nums)
assert length == 4
assert nums[:length] == [1, 2, 3, 4]
# Test 12: Duplicate analyzer
analyzer = DuplicateAnalyzer([1,2,2,3,3,3,4])
stats = analyzer.get_duplicate_stats()
assert stats['total_duplicates'] == 2
assert stats['most_frequent'] == 3
assert stats['max_frequency'] == 3
# Test 13: K most frequent
assert analyzer.find_k_most_frequent(2) == [3, 2]
print("All Contains Duplicate tests passed!")
if __name__ == "__main__":
test_contains_duplicate()
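To see why Floyd's algorithm applies at all, it helps to watch the array behave like a linked list. Below is a small standalone sketch (the helper name `pointer_chain` is illustrative, not part of the classes above) that follows `i -> nums[i]` from index 0 and records the visited values, making the cycle created by the duplicate visible:

```python
def pointer_chain(nums, steps=8):
    """Follow i -> nums[i] starting from index 0, recording visited values."""
    chain, cur = [], nums[0]
    for _ in range(steps):
        chain.append(cur)
        cur = nums[cur]
    return chain

# For [1,3,4,2,2] the walk from index 0 is 1 -> 3 -> 2 -> 4 -> 2 -> 4 -> ...
# The duplicate value 2 is exactly where the cycle is entered, which is
# what Phase 2 of find_duplicate_single locates.
print(pointer_chain([1, 3, 4, 2, 2]))
```

Printing the chain for a few inputs of your own is a quick way to build intuition before reading the two-phase pointer logic.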
Topic 62: Longest Consecutive Sequence - Complete Implementation
"""
Longest Consecutive Sequence - Find length of longest consecutive elements sequence
Time: O(n), Space: O(n)
"""
from typing import List, Set, Dict
class ConsecutiveSequenceFinder:
"""
Complete implementation of consecutive sequence problems.
Key Insight: Hash set allows O(1) lookup for sequence building
"""
def longest_consecutive_optimal(self, nums: List[int]) -> int:
"""
Find longest consecutive sequence using hash set.
Only start counting from sequence starts (num-1 not in set).
Time: O(n), Space: O(n)
Args:
nums: List of integers
Returns:
Length of longest consecutive sequence
Examples:
>>> finder = ConsecutiveSequenceFinder()
>>> finder.longest_consecutive_optimal([100,4,200,1,3,2])
4 # Sequence: [1,2,3,4]
>>> finder.longest_consecutive_optimal([0,3,7,2,5,8,4,6,0,1])
9 # Sequence: [0,1,2,3,4,5,6,7,8]
"""
if not nums:
return 0
num_set = set(nums)
max_length = 0
for num in num_set:
# Only start counting from sequence beginnings
if num - 1 not in num_set:
current_num = num
current_length = 1
# Count consecutive numbers
while current_num + 1 in num_set:
current_num += 1
current_length += 1
max_length = max(max_length, current_length)
return max_length
def longest_consecutive_with_sequence(self, nums: List[int]) -> List[int]:
"""
Return the actual longest consecutive sequence, not just length.
Time: O(n), Space: O(n)
Examples:
>>> finder = ConsecutiveSequenceFinder()
>>> finder.longest_consecutive_with_sequence([100,4,200,1,3,2])
[1, 2, 3, 4]
"""
if not nums:
return []
num_set = set(nums)
longest_sequence = []
for num in num_set:
if num - 1 not in num_set:
# Build sequence from this start
sequence = []
current = num
while current in num_set:
sequence.append(current)
current += 1
if len(sequence) > len(longest_sequence):
longest_sequence = sequence
return longest_sequence
def longest_consecutive_union_find(self, nums: List[int]) -> int:
"""
Alternative approach using Union-Find data structure.
Educational purpose - shows connection to graph connectivity.
Time: O(n * α(n)) ≈ O(n), Space: O(n)
where α is inverse Ackermann function (practically constant)
Examples:
>>> finder = ConsecutiveSequenceFinder()
>>> finder.longest_consecutive_union_find([100,4,200,1,3,2])
4
"""
if not nums:
return 0
class UnionFind:
def __init__(self):
self.parent = {}
self.size = {}
def add(self, x):
if x not in self.parent:
self.parent[x] = x
self.size[x] = 1
def find(self, x):
if self.parent[x] != x:
self.parent[x] = self.find(self.parent[x])
return self.parent[x]
def union(self, x, y):
root_x, root_y = self.find(x), self.find(y)
if root_x == root_y:
return
# Union by size
if self.size[root_x] < self.size[root_y]:
root_x, root_y = root_y, root_x
self.parent[root_y] = root_x
self.size[root_x] += self.size[root_y]
def get_max_size(self):
return max(self.size.values()) if self.size else 0
uf = UnionFind()
num_set = set(nums)
for num in num_set:
uf.add(num)
if num + 1 in num_set:
uf.union(num, num + 1)
return uf.get_max_size()
def count_consecutive_sequences(self, nums: List[int]) -> int:
"""
Count number of consecutive sequences.
Time: O(n), Space: O(n)
Examples:
>>> finder = ConsecutiveSequenceFinder()
>>> finder.count_consecutive_sequences([100,4,200,1,3,2])
3 # [100], [200], [1,2,3,4]
"""
if not nums:
return 0
num_set = set(nums)
count = 0
for num in num_set:
if num - 1 not in num_set:
count += 1
return count
def all_consecutive_sequences(self, nums: List[int]) -> List[List[int]]:
"""
Find all consecutive sequences.
Time: O(n), Space: O(n)
Examples:
>>> finder = ConsecutiveSequenceFinder()
>>> result = finder.all_consecutive_sequences([100,4,200,1,3,2])
>>> sorted([sorted(seq) for seq in result])
[[1, 2, 3, 4], [100], [200]]
"""
if not nums:
return []
num_set = set(nums)
sequences = []
for num in num_set:
if num - 1 not in num_set:
# Build sequence
sequence = []
current = num
while current in num_set:
sequence.append(current)
current += 1
sequences.append(sequence)
return sequences
def longest_consecutive_range(self, nums: List[int]) -> tuple:
"""
Return (start, end) of longest consecutive sequence.
Time: O(n), Space: O(n)
Examples:
>>> finder = ConsecutiveSequenceFinder()
>>> finder.longest_consecutive_range([100,4,200,1,3,2])
(1, 4)
"""
if not nums:
return (0, 0)
num_set = set(nums)
max_start, max_end = 0, 0
max_length = 0
for num in num_set:
if num - 1 not in num_set:
start = num
end = num
while end + 1 in num_set:
end += 1
if end - start + 1 > max_length:
max_length = end - start + 1
max_start, max_end = start, end
return (max_start, max_end)
def consecutive_with_duplicates(self, nums: List[int]) -> int:
"""
Handle duplicates explicitly in consecutive sequence.
Time: O(n), Space: O(n)
Examples:
>>> finder = ConsecutiveSequenceFinder()
>>> finder.consecutive_with_duplicates([1,2,2,3,3,3,4])
4 # [1,2,3,4]
"""
# Same as optimal since we use set
return self.longest_consecutive_optimal(nums)
class BinaryArrayConsecutive:
"""
Find longest consecutive sequence of 1s in binary array.
Related problem showing hash table application to different domain.
"""
def longest_consecutive_ones(self, nums: List[int]) -> int:
"""
Find longest sequence of consecutive 1s.
Time: O(n), Space: O(1)
Examples:
>>> solver = BinaryArrayConsecutive()
>>> solver.longest_consecutive_ones([1,1,0,1,1,1])
3
"""
max_length = 0
current_length = 0
for num in nums:
if num == 1:
current_length += 1
max_length = max(max_length, current_length)
else:
current_length = 0
return max_length
def longest_consecutive_ones_with_flip(self, nums: List[int], k: int) -> int:
"""
Find longest sequence of 1s after flipping at most k 0s.
Time: O(n), Space: O(1)
Examples:
>>> solver = BinaryArrayConsecutive()
>>> solver.longest_consecutive_ones_with_flip([1,1,1,0,0,0,1,1,1,1,0], 2)
        6 # Flip the 0s at indices 5 and 10 -> window of six 1s (indices 5..10)
"""
left = 0
zeros_count = 0
max_length = 0
for right in range(len(nums)):
if nums[right] == 0:
zeros_count += 1
while zeros_count > k:
if nums[left] == 0:
zeros_count -= 1
left += 1
max_length = max(max_length, right - left + 1)
return max_length
# Comprehensive Test Suite
def test_longest_consecutive():
"""Complete test suite for consecutive sequence."""
finder = ConsecutiveSequenceFinder()
# Test 1: Basic case
assert finder.longest_consecutive_optimal([100,4,200,1,3,2]) == 4
# Test 2: Long sequence
assert finder.longest_consecutive_optimal([0,3,7,2,5,8,4,6,0,1]) == 9
# Test 3: Empty array
assert finder.longest_consecutive_optimal([]) == 0
# Test 4: Single element
assert finder.longest_consecutive_optimal([1]) == 1
# Test 5: No consecutive
assert finder.longest_consecutive_optimal([1,3,5,7,9]) == 1
# Test 6: All consecutive
assert finder.longest_consecutive_optimal([1,2,3,4,5]) == 5
# Test 7: Duplicates
assert finder.longest_consecutive_optimal([1,2,0,1]) == 3
# Test 8: Negative numbers
assert finder.longest_consecutive_optimal([-1,0,1,2]) == 4
# Test 9: Get actual sequence
sequence = finder.longest_consecutive_with_sequence([100,4,200,1,3,2])
assert sequence == [1,2,3,4]
# Test 10: Union-Find approach
assert finder.longest_consecutive_union_find([100,4,200,1,3,2]) == 4
# Test 11: Count sequences
assert finder.count_consecutive_sequences([100,4,200,1,3,2]) == 3
# Test 12: All sequences
sequences = finder.all_consecutive_sequences([100,4,200,1,3,2])
assert len(sequences) == 3
# Test 13: Get range
assert finder.longest_consecutive_range([100,4,200,1,3,2]) == (1, 4)
# Test 14: Binary array consecutive 1s
binary_solver = BinaryArrayConsecutive()
assert binary_solver.longest_consecutive_ones([1,1,0,1,1,1]) == 3
# Test 15: With flips
assert binary_solver.longest_consecutive_ones_with_flip([1,1,1,0,0,0,1,1,1,1,0], 2) == 6
print("All Longest Consecutive Sequence tests passed!")
if __name__ == "__main__":
test_longest_consecutive()
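A common point of confusion is why the nested while-loop above is still O(n). One way to convince yourself: instrument the inner loop and count its total lookups. This sketch (the counter is illustrative, added for measurement only) mirrors longest_consecutive_optimal:

```python
def longest_consecutive_instrumented(nums):
    # Same algorithm as longest_consecutive_optimal, but also counts
    # how many inner while-loop steps are performed in total.
    num_set = set(nums)
    best = inner_steps = 0
    for num in num_set:
        if num - 1 not in num_set:        # only sequence starts
            length, cur = 1, num
            while cur + 1 in num_set:
                cur += 1
                length += 1
                inner_steps += 1
            best = max(best, length)
    return best, inner_steps

best, steps = longest_consecutive_instrumented([100, 4, 200, 1, 3, 2])
# Each unique element is visited by the inner loop at most once, so the
# total inner-loop work is bounded by the number of unique elements.
```

Because every element belongs to exactly one sequence and is walked over exactly once, the outer and inner loops together do O(n) work, not O(n²).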
Topic 63: Top K Frequent Elements - Complete Implementation
"""
Top K Frequent Elements - Find k most frequent elements
Time: O(n log k) using heap, O(n) using bucket sort
Space: O(n)
"""
from typing import List, Dict, Tuple
from collections import Counter
import heapq
class TopKFrequentSolver:
"""
Complete implementation of Top K Frequent Elements.
Key Insight: Use frequency counting + heap or bucket sort
"""
def top_k_frequent_heap(self, nums: List[int], k: int) -> List[int]:
"""
Find k most frequent elements using min heap.
Time: O(n log k), Space: O(n)
Args:
nums: List of integers
k: Number of most frequent elements to return
Returns:
List of k most frequent elements
Examples:
>>> solver = TopKFrequentSolver()
>>> sorted(solver.top_k_frequent_heap([1,1,1,2,2,3], 2))
[1, 2]
>>> sorted(solver.top_k_frequent_heap([1], 1))
[1]
"""
if not nums or k <= 0:
return []
# Count frequencies
freq_map = Counter(nums)
# Use min heap of size k
# Heap stores (frequency, number)
min_heap = []
for num, freq in freq_map.items():
heapq.heappush(min_heap, (freq, num))
if len(min_heap) > k:
heapq.heappop(min_heap)
# Extract numbers from heap
return [num for freq, num in min_heap]
def top_k_frequent_bucket(self, nums: List[int], k: int) -> List[int]:
"""
Find k most frequent using bucket sort.
More efficient: O(n) time.
Time: O(n), Space: O(n)
Examples:
>>> solver = TopKFrequentSolver()
>>> sorted(solver.top_k_frequent_bucket([1,1,1,2,2,3], 2))
[1, 2]
"""
if not nums or k <= 0:
return []
# Count frequencies
freq_map = Counter(nums)
# Bucket sort: index = frequency, value = list of numbers
# Maximum frequency is len(nums)
buckets = [[] for _ in range(len(nums) + 1)]
for num, freq in freq_map.items():
buckets[freq].append(num)
# Collect k most frequent from highest frequency buckets
result = []
for i in range(len(buckets) - 1, -1, -1):
if buckets[i]:
result.extend(buckets[i])
if len(result) >= k:
return result[:k]
return result
def top_k_frequent_quickselect(self, nums: List[int], k: int) -> List[int]:
"""
Find k most frequent using Quickselect algorithm.
Average O(n), worst case O(n²).
Time: O(n) average, Space: O(n)
Examples:
>>> solver = TopKFrequentSolver()
>>> result = solver.top_k_frequent_quickselect([1,1,1,2,2,3], 2)
>>> len(result)
2
"""
if not nums or k <= 0:
return []
freq_map = Counter(nums)
unique_nums = list(freq_map.keys())
def partition(left, right, pivot_idx):
pivot_freq = freq_map[unique_nums[pivot_idx]]
# Move pivot to end
unique_nums[pivot_idx], unique_nums[right] = unique_nums[right], unique_nums[pivot_idx]
store_idx = left
for i in range(left, right):
if freq_map[unique_nums[i]] < pivot_freq:
unique_nums[store_idx], unique_nums[i] = unique_nums[i], unique_nums[store_idx]
store_idx += 1
# Move pivot to final position
unique_nums[right], unique_nums[store_idx] = unique_nums[store_idx], unique_nums[right]
return store_idx
def quickselect(left, right, k_smallest):
if left == right:
return
# Select random pivot
import random
pivot_idx = random.randint(left, right)
pivot_idx = partition(left, right, pivot_idx)
if k_smallest == pivot_idx:
return
elif k_smallest < pivot_idx:
quickselect(left, pivot_idx - 1, k_smallest)
else:
quickselect(pivot_idx + 1, right, k_smallest)
n = len(unique_nums)
quickselect(0, n - 1, n - k)
return unique_nums[n - k:]
def top_k_frequent_with_counts(self, nums: List[int], k: int) -> List[Tuple[int, int]]:
"""
Return k most frequent elements with their frequencies.
Time: O(n log k), Space: O(n)
Examples:
>>> solver = TopKFrequentSolver()
>>> solver.top_k_frequent_with_counts([1,1,1,2,2,3], 2)
[(1, 3), (2, 2)]
"""
if not nums or k <= 0:
return []
freq_map = Counter(nums)
# Use max heap (negate frequencies for min heap)
heap = []
for num, freq in freq_map.items():
heapq.heappush(heap, (-freq, num))
result = []
for _ in range(min(k, len(heap))):
freq, num = heapq.heappop(heap)
result.append((num, -freq))
return result
def top_k_frequent_sorted(self, nums: List[int], k: int) -> List[int]:
"""
Simple approach using sorting.
Time: O(n log n), Space: O(n)
Not optimal but simple and clear.
"""
if not nums or k <= 0:
return []
freq_map = Counter(nums)
# Sort by frequency (descending)
sorted_items = sorted(freq_map.items(), key=lambda x: x[1], reverse=True)
return [num for num, freq in sorted_items[:k]]
def bottom_k_frequent(self, nums: List[int], k: int) -> List[int]:
"""
Find k least frequent elements.
Time: O(n log k), Space: O(n)
Examples:
>>> solver = TopKFrequentSolver()
>>> sorted(solver.bottom_k_frequent([1,1,1,2,2,3], 2))
[2, 3]
"""
if not nums or k <= 0:
return []
freq_map = Counter(nums)
# Use max heap to keep k smallest frequencies
max_heap = []
for num, freq in freq_map.items():
heapq.heappush(max_heap, (-freq, num))
if len(max_heap) > k:
heapq.heappop(max_heap)
return [num for freq, num in max_heap]
class FrequencyAnalyzer:
"""
Advanced frequency analysis tools.
"""
def __init__(self, nums: List[int]):
"""Initialize with number array."""
self.nums = nums
self.freq_map = Counter(nums)
def get_frequency_distribution(self) -> Dict[int, List[int]]:
"""
Get distribution: frequency -> list of numbers with that frequency.
Time: O(n), Space: O(n)
Examples:
>>> analyzer = FrequencyAnalyzer([1,1,1,2,2,3])
>>> dist = analyzer.get_frequency_distribution()
>>> dist[3]
[1]
>>> sorted(dist[2])
[2]
>>> dist[1]
[3]
"""
distribution = {}
for num, freq in self.freq_map.items():
if freq not in distribution:
distribution[freq] = []
distribution[freq].append(num)
return distribution
def frequency_stats(self) -> Dict:
"""
Get comprehensive frequency statistics.
Returns:
Dictionary with various statistics
"""
if not self.freq_map:
return {}
frequencies = list(self.freq_map.values())
return {
'total_elements': len(self.nums),
'unique_elements': len(self.freq_map),
'min_frequency': min(frequencies),
'max_frequency': max(frequencies),
'avg_frequency': sum(frequencies) / len(frequencies),
'mode': max(self.freq_map, key=self.freq_map.get),
'mode_frequency': max(frequencies),
}
def elements_with_frequency(self, target_freq: int) -> List[int]:
"""
Get all elements that appear exactly target_freq times.
Time: O(n), Space: O(1)
Examples:
>>> analyzer = FrequencyAnalyzer([1,1,1,2,2,3])
>>> analyzer.elements_with_frequency(2)
[2]
"""
return [num for num, freq in self.freq_map.items() if freq == target_freq]
def frequency_percentile(self, num: int) -> float:
"""
Calculate what percentile the frequency of num is at.
Examples:
>>> analyzer = FrequencyAnalyzer([1,1,1,2,2,3])
>>> analyzer.frequency_percentile(1)
100.0 # Highest frequency
"""
if num not in self.freq_map:
return 0.0
target_freq = self.freq_map[num]
        # Count frequencies at or below the target so the most frequent
        # element maps to the 100th percentile, matching the example above
        count_at_or_below = sum(1 for freq in self.freq_map.values() if freq <= target_freq)
        return (count_at_or_below / len(self.freq_map)) * 100
# Comprehensive Test Suite
def test_top_k_frequent():
"""Complete test suite for Top K Frequent."""
solver = TopKFrequentSolver()
# Test 1: Basic case with heap
result = solver.top_k_frequent_heap([1,1,1,2,2,3], 2)
assert sorted(result) == [1, 2]
# Test 2: Single element
assert solver.top_k_frequent_heap([1], 1) == [1]
# Test 3: K equals number of unique elements
result = solver.top_k_frequent_heap([1,2,3], 3)
assert len(result) == 3
# Test 4: Bucket sort approach
result = solver.top_k_frequent_bucket([1,1,1,2,2,3], 2)
assert sorted(result) == [1, 2]
# Test 5: Quickselect approach
result = solver.top_k_frequent_quickselect([1,1,1,2,2,3], 2)
assert len(result) == 2
assert 1 in result and 2 in result
# Test 6: With counts
result = solver.top_k_frequent_with_counts([1,1,1,2,2,3], 2)
assert len(result) == 2
assert (1, 3) in result
assert (2, 2) in result
# Test 7: Sorted approach
result = solver.top_k_frequent_sorted([1,1,1,2,2,3], 2)
assert result == [1, 2]
# Test 8: Bottom k frequent
result = solver.bottom_k_frequent([1,1,1,2,2,3], 2)
assert sorted(result) == [2, 3]
# Test 9: Frequency analyzer
analyzer = FrequencyAnalyzer([1,1,1,2,2,3])
dist = analyzer.get_frequency_distribution()
assert dist[3] == [1]
assert 2 in dist[2]
assert dist[1] == [3]
# Test 10: Frequency stats
stats = analyzer.frequency_stats()
assert stats['total_elements'] == 6
assert stats['unique_elements'] == 3
assert stats['max_frequency'] == 3
assert stats['mode'] == 1
# Test 11: Elements with frequency
assert analyzer.elements_with_frequency(2) == [2]
# Test 12: Frequency percentile
assert analyzer.frequency_percentile(1) == 100.0
# Test 13: Large array performance
import random
large_array = [random.randint(1, 100) for _ in range(10000)]
result = solver.top_k_frequent_bucket(large_array, 10)
assert len(result) <= 10
print("All Top K Frequent tests passed!")
if __name__ == "__main__":
test_top_k_frequent()
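The bucket-sort approach is easier to follow once you see what the bucket array actually holds. A minimal sketch of the bucketing step from top_k_frequent_bucket (the helper name `frequency_buckets` is illustrative):

```python
from collections import Counter

def frequency_buckets(nums):
    # buckets[f] lists the values that occur exactly f times.
    # An element can appear at most len(nums) times, so len(nums) + 1
    # buckets always suffice -- this is why no sorting is needed.
    freq = Counter(nums)
    buckets = [[] for _ in range(len(nums) + 1)]
    for num, f in freq.items():
        buckets[f].append(num)
    return buckets

print(frequency_buckets([1, 1, 1, 2, 2, 3]))
# [[], [3], [2], [1], [], [], []]
```

Walking the buckets from the highest index downward then yields elements in descending frequency order in O(n) total time.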
Topic 64: Subarray Sum Equals K - Complete Implementation
"""
Subarray Sum Equals K - Count subarrays with sum equal to k
Time: O(n), Space: O(n)
"""
from typing import List, Dict, Tuple
from collections import defaultdict
class SubarraySumSolver:
"""
Complete implementation of subarray sum problems.
Key Insight: Use prefix sum + hash map for O(n) solution
"""
def subarray_sum_count(self, nums: List[int], k: int) -> int:
"""
Count number of continuous subarrays with sum equal to k.
Uses prefix sum: if prefix[j] - prefix[i] = k, then subarray[i+1:j+1] sums to k
Store prefix sums in hash map for O(1) lookup.
Time: O(n), Space: O(n)
Args:
nums: List of integers
k: Target sum
Returns:
Count of subarrays summing to k
Examples:
>>> solver = SubarraySumSolver()
>>> solver.subarray_sum_count([1,1,1], 2)
2 # [1,1] appears twice
>>> solver.subarray_sum_count([1,2,3], 3)
2 # [3] and [1,2]
"""
if not nums:
return 0
# Map: prefix_sum -> count of occurrences
prefix_count = defaultdict(int)
prefix_count[0] = 1 # Empty prefix
current_sum = 0
count = 0
for num in nums:
current_sum += num
# Check if (current_sum - k) exists
# If yes, those positions form valid subarrays
if current_sum - k in prefix_count:
count += prefix_count[current_sum - k]
prefix_count[current_sum] += 1
return count
def subarray_sum_indices(self, nums: List[int], k: int) -> List[Tuple[int, int]]:
"""
Return all subarray indices (start, end) with sum equal to k.
Time: O(n²) worst case (all subarrays sum to k)
Space: O(n)
Examples:
>>> solver = SubarraySumSolver()
>>> solver.subarray_sum_indices([1,1,1], 2)
[(0, 1), (1, 2)]
"""
if not nums:
return []
# Map: prefix_sum -> list of indices
prefix_indices = defaultdict(list)
prefix_indices[0].append(-1) # Before array starts
current_sum = 0
result = []
for i, num in enumerate(nums):
current_sum += num
target = current_sum - k
if target in prefix_indices:
for start_idx in prefix_indices[target]:
result.append((start_idx + 1, i))
prefix_indices[current_sum].append(i)
return result
def subarray_sum_max_length(self, nums: List[int], k: int) -> int:
"""
Find maximum length of subarray with sum equal to k.
Time: O(n), Space: O(n)
Examples:
>>> solver = SubarraySumSolver()
>>> solver.subarray_sum_max_length([1,-1,5,-2,3], 3)
        4 # [1,-1,5,-2] sums to 3 and has length 4
"""
if not nums:
return 0
# Map: prefix_sum -> first index where it appears
prefix_first = {0: -1}
current_sum = 0
max_length = 0
for i, num in enumerate(nums):
current_sum += num
target = current_sum - k
if target in prefix_first:
max_length = max(max_length, i - prefix_first[target])
# Only store first occurrence of each prefix sum
if current_sum not in prefix_first:
prefix_first[current_sum] = i
return max_length
def subarray_sum_min_length(self, nums: List[int], k: int) -> int:
"""
Find minimum length of subarray with sum equal to k.
Returns -1 if no such subarray exists.
Time: O(n), Space: O(n)
Examples:
>>> solver = SubarraySumSolver()
>>> solver.subarray_sum_min_length([1,2,3,4,5], 9)
2 # [4,5]
"""
if not nums:
return -1
# Map: prefix_sum -> last index where it appears
prefix_last = {0: -1}
current_sum = 0
min_length = float('inf')
for i, num in enumerate(nums):
current_sum += num
target = current_sum - k
if target in prefix_last:
min_length = min(min_length, i - prefix_last[target])
# Always update to latest index
prefix_last[current_sum] = i
return min_length if min_length != float('inf') else -1
def subarray_sum_equals_zero(self, nums: List[int]) -> bool:
"""
Check if there exists a subarray with sum equal to 0.
Time: O(n), Space: O(n)
Examples:
>>> solver = SubarraySumSolver()
>>> solver.subarray_sum_equals_zero([1,-1,2,3])
True # [1,-1]
>>> solver.subarray_sum_equals_zero([1,2,3])
False
"""
if not nums:
return False
seen_sums = {0} # Empty prefix
current_sum = 0
for num in nums:
current_sum += num
if current_sum in seen_sums:
return True
seen_sums.add(current_sum)
return False
def longest_subarray_sum_zero(self, nums: List[int]) -> int:
"""
Find length of longest subarray with sum 0.
Time: O(n), Space: O(n)
Examples:
>>> solver = SubarraySumSolver()
>>> solver.longest_subarray_sum_zero([15,-2,2,-8,1,7,10,23])
5 # [-2,2,-8,1,7]
"""
return self.subarray_sum_max_length(nums, 0)
def count_subarrays_sum_divisible_by_k(self, nums: List[int], k: int) -> int:
"""
Count subarrays with sum divisible by k.
Key insight: (prefix[j] - prefix[i]) % k = 0
means prefix[j] % k = prefix[i] % k
Time: O(n), Space: O(k)
Examples:
>>> solver = SubarraySumSolver()
>>> solver.count_subarrays_sum_divisible_by_k([4,5,0,-2,-3,1], 5)
7
"""
if not nums:
return 0
# Map: remainder -> count
remainder_count = defaultdict(int)
remainder_count[0] = 1 # Empty prefix
current_sum = 0
count = 0
for num in nums:
current_sum += num
remainder = current_sum % k
            # In Python, % with positive k already yields a non-negative
            # result; this guard matters when porting to C/Java-style languages
            if remainder < 0:
                remainder += k
count += remainder_count[remainder]
remainder_count[remainder] += 1
return count
class RangeSumQuery:
"""
Data structure for efficient range sum queries.
Preprocess array to answer sum queries in O(1).
"""
def __init__(self, nums: List[int]):
"""
Preprocess array to build prefix sum array.
Time: O(n), Space: O(n)
"""
self.prefix = [0]
for num in nums:
self.prefix.append(self.prefix[-1] + num)
def sum_range(self, left: int, right: int) -> int:
"""
Return sum of elements from index left to right (inclusive).
Time: O(1), Space: O(1)
Examples:
>>> query = RangeSumQuery([1,2,3,4,5])
>>> query.sum_range(0, 2)
6 # 1+2+3
>>> query.sum_range(2, 4)
12 # 3+4+5
"""
return self.prefix[right + 1] - self.prefix[left]
def count_range_sum_k(self, k: int) -> int:
"""
Count all ranges with sum equal to k.
Time: O(n²), Space: O(1)
"""
count = 0
n = len(self.prefix) - 1
for i in range(n):
for j in range(i, n):
if self.sum_range(i, j) == k:
count += 1
return count
class SubarrayProductSolver:
"""
Related problem: Product instead of sum.
Shows how hash map pattern applies to different operations.
"""
def subarray_product_less_than_k(self, nums: List[int], k: int) -> int:
"""
Count subarrays where product of elements is less than k.
Note: All elements must be positive for this to work.
Time: O(n), Space: O(1)
Examples:
>>> solver = SubarrayProductSolver()
>>> solver.subarray_product_less_than_k([10,5,2,6], 100)
8
"""
if k <= 1:
return 0
count = 0
product = 1
left = 0
for right in range(len(nums)):
product *= nums[right]
while product >= k and left <= right:
product //= nums[left]
left += 1
# All subarrays ending at right and starting from left to right
count += right - left + 1
return count
# Comprehensive Test Suite
def test_subarray_sum():
"""Complete test suite for subarray sum."""
solver = SubarraySumSolver()
# Test 1: Basic count
assert solver.subarray_sum_count([1,1,1], 2) == 2
assert solver.subarray_sum_count([1,2,3], 3) == 2
# Test 2: Empty array
assert solver.subarray_sum_count([], 5) == 0
# Test 3: Single element
assert solver.subarray_sum_count([5], 5) == 1
assert solver.subarray_sum_count([5], 3) == 0
# Test 4: Negative numbers
assert solver.subarray_sum_count([1,-1,0], 0) == 3
# Test 5: Get indices
indices = solver.subarray_sum_indices([1,1,1], 2)
assert len(indices) == 2
assert (0, 1) in indices and (1, 2) in indices
# Test 6: Max length
assert solver.subarray_sum_max_length([1,-1,5,-2,3], 3) == 4
# Test 7: Min length
assert solver.subarray_sum_min_length([1,2,3,4,5], 9) == 2
assert solver.subarray_sum_min_length([1,2,3], 10) == -1
# Test 8: Sum equals zero
assert solver.subarray_sum_equals_zero([1,-1,2,3]) == True
assert solver.subarray_sum_equals_zero([1,2,3]) == False
# Test 9: Longest zero sum
assert solver.longest_subarray_sum_zero([15,-2,2,-8,1,7,10,23]) == 5
# Test 10: Divisible by k
assert solver.count_subarrays_sum_divisible_by_k([4,5,0,-2,-3,1], 5) == 7
# Test 11: Range sum query
query = RangeSumQuery([1,2,3,4,5])
assert query.sum_range(0, 2) == 6
assert query.sum_range(2, 4) == 12
assert query.sum_range(0, 4) == 15
# Test 12: Product less than k
product_solver = SubarrayProductSolver()
assert product_solver.subarray_product_less_than_k([10,5,2,6], 100) == 8
# Test 13: All elements sum to k
assert solver.subarray_sum_count([1,1,1,1], 4) == 1
# Test 14: No subarray sums to k
assert solver.subarray_sum_count([1,2,3], 10) == 0
print("All Subarray Sum tests passed!")
if __name__ == "__main__":
test_subarray_sum()
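The prefix-sum identity behind subarray_sum_count (prefix[j] - prefix[i] = k exactly when nums[i:j] sums to k) can be verified directly against a brute-force count. A small cross-check sketch (the helper names are illustrative):

```python
from collections import defaultdict

def count_fast(nums, k):
    # Hash-map prefix-sum count, same technique as subarray_sum_count.
    seen = defaultdict(int)
    seen[0] = 1                     # empty prefix
    total = count = 0
    for num in nums:
        total += num
        count += seen[total - k]    # prior prefixes at distance k
        seen[total] += 1
    return count

def count_brute(nums, k):
    # O(n^2) reference: test every (start, end) pair directly.
    n = len(nums)
    return sum(1 for i in range(n) for j in range(i, n)
               if sum(nums[i:j + 1]) == k)

sample = [3, 4, 7, 2, -3, 1, 4, 2]
assert count_fast(sample, 7) == count_brute(sample, 7)  # both find 4 subarrays
```

Running both on random arrays with negative values is a good way to confirm the hash-map version handles cases a sliding window cannot.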