Bucket Sort
Core Concepts
Bucket — a container (list) that collects elements whose value or key falls within a designated range or equals a designated index. Buckets are indexed so that iterating them in order produces sorted output.
Bucket size — the width of the value range each bucket covers, computed as (max_val - min_val) / n for n buckets over a known range. Choosing size too large collapses all elements into one bucket; too small wastes memory without gain.
Frequency bucket — a bucket indexed by frequency rather than value. Element with frequency f goes into buckets[f]. Enables O(n) retrieval of the k most or least frequent elements because frequency is bounded by n.
Pigeonhole principle — if n items are distributed into fewer than n containers, at least one container holds more than one item. In maximum gap: if n numbers span range R, and we use n−1 buckets of size ceil(R/(n−1)), the maximum gap cannot be within a single bucket — it must cross a bucket boundary (or be between consecutive non-empty buckets).
Range partitioning — mapping a value v to a bucket index via (v - min_val) // bucket_size. This is the fundamental operation; its correctness requires that bucket_size > 0 (at least two distinct values).
Uniform distribution assumption — standard bucket sort achieves O(n) average time only when inputs are uniformly distributed across the range. Adversarial or skewed inputs degrade inner-bucket sorts to O(n²) worst case.
Counting sort — a degenerate bucket sort where bucket size = 1, each bucket holds all occurrences of one value, and no inner sort is needed. Requires integer keys in a small known range.
See counting-sort.md if that file exists; otherwise treat counting sort as a special case here.
Structural Invariants
| Invariant | What it guarantees | What breaks if violated |
|---|---|---|
| Bucket index order matches value order | Concatenating buckets 0, 1, … k in order produces a sorted sequence | If bucket assignment is non-monotone, the output is unsorted even if each bucket is sorted internally |
min_val placed in bucket 0, max_val placed in bucket n−1 | Range boundaries are inclusive; no value falls outside [0, n−1] | index = (max_val - min_val) // bucket_size may equal n; must clamp with min(index, n-1) |
| Bucket size > 0 | Division is defined; at least two distinct values exist | Zero bucket size causes ZeroDivisionError; must special-case when all values are equal |
| Frequency bucket index ≤ len(array) | Frequency of any element is at most n | An index out of range n+1 corrupts the bucket array; allocate buckets of size n+1 |
Types / Variants
| Variant | What it is | When to use | Key property |
|---|---|---|---|
| Standard bucket sort | Partition floats/ints into equal-width buckets; sort each; concatenate | Input is roughly uniform over a known range; want O(n) average | O(n) average, O(n²) worst; requires known range |
| Frequency bucket | Index buckets by element frequency; collect top-k by iterating from high to low index | "Top k frequent elements", "sort by frequency" problems | O(n) always; frequency ≤ n bounds the index |
| Pigeonhole / maximum gap | Use n−1 buckets to prove the max gap straddles bucket boundaries; track only min/max per bucket | "Maximum gap" variant; need the largest difference between consecutive sorted elements | O(n) time and space; avoids full sort |
| Counting sort (degenerate) | One bucket per distinct integer value; no inner sort needed | Small integer range (e.g., 0–100), values are keys directly | O(n + k) where k = range size; unstable or stable depending on scatter pass |
Key Algorithms
Standard Bucket Sort
Distribute n elements into n equal-width buckets over the known range, sort each bucket with a comparison sort, then concatenate.
Key insight: if elements are uniformly distributed, each bucket holds O(1) elements on average, making the inner sort O(1) per bucket and O(n) overall.
Time O(n) average, O(n²) worst; Space O(n).
bucket_size = (max_val - min_val) / n
idx = min(int((v - min_val) / bucket_size), n - 1)
The min(..., n-1) clamp handles max_val exactly hitting a bucket boundary.
Frequency Bucket (Top-K Frequent Elements)
Count element frequencies, then place each element into buckets[freq]. Iterate buckets from index n down to 1, collecting elements until k are gathered.
Key insight: frequency is bounded by n, so the bucket array has fixed size n+1; no comparison sort needed at all.
Time O(n); Space O(n).
freq = Counter(nums)
buckets = [[] for _ in range(len(nums) + 1)]
for val, f in freq.items():
buckets[f].append(val)
Collect by iterating for b in reversed(buckets): result.extend(b) until len(result) == k.
Maximum Gap (Pigeonhole Bucket)
Given n unsorted numbers, find the maximum difference between consecutive elements in their sorted order — in O(n) time without a full sort.
Key insight: with n−1 buckets of width ceil((max_val - min_val) / (n−1)), the maximum gap must be between elements in different buckets (gaps within a bucket are at most bucket_size). Store only (bucket_min, bucket_max) per bucket; scan adjacent non-empty buckets for the largest bucket_min[i] − bucket_max[i−1].
Time O(n); Space O(n).
size = max(1, (hi - lo) // (n - 1))
idx = (v - lo) // size # maps v into bucket index
Edge case: n < 2 → return 0. All equal values → return 0 (hi == lo, handled by max(1, ...)).
Sort Characters by Frequency
Count character frequencies, place characters into buckets[freq], then build the result string by iterating from highest-frequency bucket downward.
Key insight: identical to frequency bucket but the "values" are characters; bucket index is the character's count. Output is char * freq repeated for each bucket entry.
Time O(n); Space O(n) where n = string length.
H-Index via Counting Sort
Given citation counts, find the largest h such that the researcher has at least h papers with ≥ h citations.
Key insight: treat it as a counting sort where bucket[c] = number of papers with exactly c citations; clamp citations > n into bucket n. A suffix-sum scan from high to low finds the first index where cumulative count ≥ index.
Time O(n); Space O(n).
buckets = [0] * (n + 1)
for c in citations:
buckets[min(c, n)] += 1
Problem Taxonomy
By Problem Category
| Category | What it asks | Key technique |
|---|---|---|
| Top-k by frequency | Return k elements (or characters) with highest occurrence | Frequency bucket; iterate from index n downward |
| Maximum consecutive gap | Largest gap between sorted-order neighbors | Pigeonhole bucket; store only min/max per bucket |
| Sort by derived key | Sort by frequency, then lexicographically, etc. | Frequency bucket with tie-breaking on concatenation |
| H-index / threshold count | Largest h satisfying a count-threshold condition | Counting bucket with suffix sum |
| Rank / percentile in stream | Given a value, find its rank among all seen values | Counting sort bucket; prefix sum for rank query |
Problem Signal → Technique
| Signal in problem statement | Likely technique |
|---|---|
| "top k frequent" / "k most frequent" | Frequency bucket (O(n), avoids heap) |
| "maximum gap" / "largest difference between consecutive sorted" | Pigeonhole bucket |
| "sort by frequency" | Frequency bucket; iterate high to low |
| "h-index" / "at least h papers with ≥ h citations" | Counting bucket + suffix sum |
| Values in range [0, n] or [0, k] with k small | Counting sort (degenerate bucket) |
| "O(n) sort" / "linear time sort" with bounded range | Bucket or counting sort |
| "sort colors" / three-way partition | Dutch national flag (not bucket sort; see two-pointers) |
Common Patterns
| Pattern | When it applies | Mechanism |
|---|---|---|
| Frequency-then-bucket | Need elements ordered by how often they appear | Build Counter, allocate buckets[0..n], scatter by freq, collect in reverse |
| Clamp-to-last-bucket | max_val maps to index n but only n−1 or n buckets exist | idx = min(computed_idx, num_buckets - 1) after division |
| Min-max-only bucket | Only the extremes within each bucket matter (max gap) | Store (lo, hi) tuple per bucket instead of a list; O(1) per insertion |
| Suffix-sum over count array | Count occurrences first, then accumulate from the right | total += counts[i]; if total >= i: return i (H-index pattern) |
| Two-pass count-then-place | Stable counting sort output order | First pass fills counts; second pass computes start positions; third pass places elements |
Edge Cases & Pitfalls
-
All values identical:
max_val == min_val→bucket_size = 0→ ZeroDivisionError. Fix: check before computing bucket size and return the array as-is (already sorted). -
Single element: trivially sorted; bucket size computation may still divide by zero if not guarded. Always check
n < 2early. -
max_val lands exactly on a computed boundary:
(max_val - min_val) / bucket_sizeequals an integer, which may be exactlynum_buckets, causing an out-of-bounds write. Fix: clamp withmin(index, num_buckets - 1). -
Frequency bucket size off-by-one: if the input has n elements, a single element can appear n times, so
bucketsmust be of sizen + 1(indices 0 through n). Allocating size n causes an IndexError on the highest-frequency element. -
Negative values in range-based bucket:
(v - min_val) // bucket_sizeis always non-negative as long asmin_valis the true minimum. Ifmin_valis not updated correctly (e.g., empty input), the formula produces negative indices. Always recomputemin_val/max_valfrom the actual input. -
Maximum gap with n < 2: there are no consecutive pairs; return 0 immediately. Attempting to compute bucket size on a single element divides by zero.
-
Floating-point bucket assignment drift: using
/(true division) instead of//(floor division) for bucket index may place a value in the wrong bucket due to float rounding. Prefer integer arithmetic when input values are integers. -
Assuming uniform distribution for correctness: bucket sort's O(n) average time requires roughly uniform distribution. For correctness, all elements must still be sorted within each bucket; skipping the inner sort breaks the algorithm, not just the complexity.
Implementation Notes
- Python's
Counter(iterable)fromcollectionsreturns a dict of{element: frequency};.most_common(k)returns the top-k pairs already sorted — use it when a heap-based O(n log k) solution is acceptable, but prefer frequency bucket for strict O(n). buckets = [[] for _ in range(n + 1)]allocates n+1 separate lists; do not use[[]] * (n+1)— that creates n+1 references to the same list, so appending to one appends to all.- Integer division
//in Python floors toward negative infinity, not toward zero. For non-negative inputs this matches C-style truncation, but for mixed-sign ranges always ensurev - min_val ≥ 0before dividing. - Python has no built-in bucket sort; the pattern is always manually constructed.
sorted()uses Timsort (O(n log n)); use it for inner-bucket sorts. - Recursion depth is not a concern for bucket sort (it is iterative), but very large counting arrays (range k >> n) waste memory — prefer a hash-based approach for sparse integer ranges.
Cross-Topic Relations
| Related topic | Specific connection |
|---|---|
| Counting Sort | Degenerate case of bucket sort with bucket size = 1; same O(n + k) complexity; counting sort is a prerequisite concept |
| Radix Sort | Applies bucket sort digit-by-digit from least to most significant; bucket sort is the inner pass of radix sort |
| Hash Table | Frequency buckets use a Counter (hash map) to build the frequency index before scattering into the bucket array |
| Heap / Priority Queue | Alternative for top-k frequent problems: heap gives O(n log k) vs. bucket's O(n); bucket sort is preferred when k is close to n |
| Two Pointers / Dutch National Flag | Sort Colors problem looks like 3-bucket sort but is solved optimally with three-way partition (in-place); bucket sort would require extra space |
| Prefix Sum | H-index and rank queries use suffix/prefix sums over the count array produced by bucket sort |
| Sorting (general) | Bucket sort is a non-comparison sort; its O(n) average breaks the Ω(n log n) comparison-sort lower bound by exploiting value distribution |
Interview Reference
Must-Know Problems
Frequency Bucket (Top-K)
- Top K Frequent Elements (LeetCode 347) — canonical frequency bucket problem
- Sort Characters By Frequency (LeetCode 451) — same pattern with character output
- Top K Frequent Words (LeetCode 692) — frequency bucket with lexicographic tie-breaking
Maximum Gap
- Maximum Gap (LeetCode 164) — canonical pigeonhole bucket problem; O(n) required
H-Index / Threshold Count
- H-Index (LeetCode 274) — counting bucket with suffix sum
- H-Index II (LeetCode 275) — sorted input variant (binary search preferred, but bucket solution instructive)
Counting Sort Variants
- Sort Colors (LeetCode 75) — three-value sort; Dutch national flag is optimal but counting sort works
- Relative Sort Array (LeetCode 1122) — counting sort with custom order for elements outside the order list
Clarification Questions to Ask
- "Are values integers, or can they be floats?" (float range partitioning requires care with division)
- "What is the value range — is it bounded or arbitrarily large?" (unbounded range makes counting sort infeasible; use comparison-based instead)
- "Are there negative values?" (bucket index formula assumes
v − min_val ≥ 0; confirm min) - "Is stability required?" (standard bucket sort is stable if inner sort is stable and elements are placed left-to-right; confirm if output order among equal-frequency elements matters)
- "What defines 'frequent'?" (for top-k: is k guaranteed < n? can there be ties at the k-th frequency?)
Common Interview Mistakes
- Allocating
[[]] * ninstead of[[] for _ in range(n)]— all buckets share the same list object; every append corrupts all buckets simultaneously. - Forgetting to handle
max_val == min_valbefore computing bucket size — causes ZeroDivisionError on uniform inputs. - Off-by-one on frequency bucket size: allocating
nslots instead ofn + 1, missing the case where one element appears n times. - Using
sorted()on the full array instead of building frequency buckets — gets the right answer but is O(n log n), failing an explicit O(n) requirement. - In maximum gap: storing all elements per bucket (O(n) per bucket in worst case) instead of just
(min, max)— conceptually correct but misses the O(n) space insight. - Applying bucket sort when the range is much larger than n (e.g., values up to 10^9 with n = 100) — the bucket array itself would require 10^9 slots; use a comparison sort instead.
Typical Follow-Up Escalations
- "Now solve it in O(1) extra space" → Dutch national flag / counting in-place for sort colors; for top-k this is generally not possible without relaxing constraints.
- "What if k can equal n?" → frequency bucket handles this naturally; heap-based solution degrades to O(n log n).
- "What if there are ties at the k-th frequency?" → clarify whether any valid subset is acceptable or all tied elements must be included; adjust stopping condition in bucket iteration.
- "Can you do maximum gap without extra space?" → not in O(n) time while preserving input; the pigeonhole argument requires storing per-bucket extremes.
- "What if the input is a stream and you must answer online?" → bucket sort is offline (requires knowing min/max first); switch to a sorted structure or segment tree for online range queries.
Read Next
Start reading to get personalized recommendations
Explore Unread
Great job! You've read all available articles