Reservoir Sampling
Core Concepts
Reservoir Sampling — a family of randomized algorithms for selecting a uniform random sample of k items from a stream of n items when n is unknown or too large to store. The sample is chosen such that each item has exactly k/n probability of being selected at the end of the stream.
Reservoir — the in-memory array of k selected items maintained as the stream is processed. It holds the current candidate sample and is updated as new items arrive.
Online Algorithm — processes each item exactly once as it arrives, making an irrevocable accept/reject decision with no revisiting. Reservoir sampling is the canonical online sampling technique; it requires O(k) space regardless of stream length.
Uniform Sampling Guarantee — after processing the i-th item (1-indexed), every item seen so far has exactly k/i probability of being in the reservoir. This invariant holds at each step, so when the stream ends it holds globally.
Algorithm R — the classic reservoir sampling algorithm by Vitter (1985). Fills the reservoir with the first k items, then for each subsequent item at position i, includes it with probability k/i by replacing a uniformly random reservoir slot.
Weighted Reservoir Sampling — generalises Algorithm R to non-uniform weights. Each item is assigned a key u^(1/w) where u is uniform random in (0,1] and w is the item's weight. The k items with the largest keys form the weighted sample.
Rejection Sampling — a related but distinct technique: repeatedly draw candidates until one satisfies a condition (used in random point in shapes). Not reservoir sampling; it has variable runtime. Do not conflate the two.
Types / Variants
| Variant | What it is | When to use | Key property |
|---|---|---|---|
| Size-1 reservoir | Select one item uniformly at random from a stream | Unknown-length linked list or stream; want one random item | Item at position i kept with probability 1/i |
| Size-k reservoir | Select k items uniformly at random without replacement | Sample k rows from large file; k known upfront | Each item kept with probability k/i at step i |
| Weighted reservoir | Select k items proportional to assigned weights | Weighted random sampling from stream | Key = random() ** (1/w); keep k largest keys |
| Distributed / parallel | Merge two independent reservoir samples | MapReduce / multi-shard streams | Merge by treating combined stream as single pass |
Key Algorithms
Algorithm R (Size-1 Special Case)
Select one uniform random item from an unknown-length stream. At position i (1-indexed), replace the current choice with the new item with probability 1/i.
Key insight: by induction, after i steps every item has seen probability 1/i of being selected. Adding item i+1 with probability 1/(i+1) and displacing the current item with that same probability preserves the invariant.
chosen = None
for i, item in enumerate(stream, 1):
if random.randint(1, i) == 1:
chosen = item
Time O(n), Space O(1).
Algorithm R (Size-k General Case)
Fill the reservoir with the first k items. For each subsequent item at 1-indexed position i > k, include it with probability k/i by replacing a uniformly random position in the reservoir.
Key insight: at step i, the probability that any specific earlier item is still in the reservoir is k/i. Including the new item with probability k/i and displacing a uniform random slot maintains this invariant.
reservoir = stream[:k]
for i, item in enumerate(stream[k:], k + 1):
j = random.randint(0, i - 1)
if j < k:
reservoir[j] = item
Time O(n), Space O(k).
Weighted Reservoir Sampling (Algorithm A-Res)
Each item with weight w generates key random() ** (1.0 / w). Maintain a min-heap of size k on these keys. An item enters the reservoir if its key exceeds the current minimum; it displaces the minimum-key item.
Key insight: the key transformation converts a weight-proportional selection problem into a uniform-key competition. Items with higher weight produce stochastically larger keys.
import heapq, random
heap = [] # min-heap of (key, item)
for item, w in stream:
key = random.random() ** (1.0 / w)
if len(heap) < k or key > heap[0][0]:
heapq.heapreplace(heap, (key, item)) if len(heap) == k else heapq.heappush(heap, (key, item))
Time O(n log k), Space O(k).
Random Node from Linked List
Apply size-1 Algorithm R to a singly linked list with unknown length. Traverse once; at node i, select it with probability 1/i.
This is the direct application of Algorithm R where the "stream" is the linked list traversal. No separate pass to count length is needed or desired — that would make it offline.
Time O(n) per query if list changes; O(n) one-time pass, O(1) space.
Random Pick Index (Multiple Indices with Same Value)
Given an array with duplicates, return a uniform random index of a target value. Traverse the array treating only positions where nums[i] == target as the stream; apply size-1 Algorithm R over that sub-stream.
Key insight: counting occurrences then calling random.randint requires a full pass anyway; reservoir sampling is equally correct and also works if the array is a stream.
count, chosen = 0, -1
for i, v in enumerate(nums):
if v == target:
count += 1
if random.randint(1, count) == 1:
chosen = i
Time O(n) per query, Space O(1).
Random Point in Non-Overlapping Rectangles
Select a point uniformly at random from the union of axis-aligned rectangles. This is weighted reservoir sampling where each rectangle's weight equals its area (number of integer points). Use prefix sums of areas and a binary search to select the rectangle, then pick a uniform point inside it.
Key insight: this is offline weighted selection — weights are known upfront, so prefix-sum + binary search is more efficient than streaming A-Res.
areas = [(r[2]-r[0]+1)*(r[3]-r[1]+1) for r in rects]
prefix = list(itertools.accumulate(areas))
idx = bisect.bisect_left(prefix, random.randint(1, prefix[-1]))
Time O(n) build, O(log n) per query, Space O(n).
Problem Taxonomy
By Problem Category
| Problem Type | What it asks | Key Technique |
|---|---|---|
| Random node / element | Return a uniform random element from an unknown-length or non-indexable structure | Size-1 Algorithm R |
| Random index with duplicates | Return a uniform random index matching a target value | Algorithm R applied to filtered stream |
| Random point in geometric region | Return a uniform random point within a shape or union of shapes | Weighted selection by area + uniform point within chosen shape |
| Streaming k-sample | Maintain a uniform random sample of k items from an ongoing stream | Size-k Algorithm R |
Problem Signal → Technique
| Signal in problem statement | Likely technique |
|---|---|
| "linked list" + "random" + no length given | Size-1 Algorithm R (single traversal) |
| "unknown stream length" + "uniform random" | Reservoir sampling (Algorithm R) |
| "random index" + array has duplicates + "equal probability" | Algorithm R over filtered positions |
| "random point" + multiple rectangles / regions | Weighted selection by area (prefix sum + binary search) |
| "without replacement" + "stream" | Size-k Algorithm R |
| Weights given + "proportional probability" + stream | Weighted reservoir (A-Res with key = u^(1/w)) |
Common Patterns
| Pattern | When it applies | Mechanism |
|---|---|---|
| Single-pass counter | Selecting one uniform random item with O(1) extra space | Maintain count and chosen; at step i replace with probability 1/count |
| Heap-maintained reservoir | Selecting k items, especially weighted | Min-heap of size k on (key, item); evict minimum when a larger key arrives |
| Prefix-sum + binary search | Offline weighted selection where all weights known upfront | Compute cumulative weights; binary search a uniform random draw into the prefix array |
| Uniform point inside rectangle | After selecting a rectangle, pick x and y independently uniform | x = random.randint(r[0], r[2]), y = random.randint(r[1], r[3]) |
| Lazy population counting | Counting how many valid items exist without a pre-pass | Let the reservoir algorithm's internal counter serve as the running count |
Edge Cases & Pitfalls
-
Off-by-one in probability: using
random.randint(0, i-1) == 0vs.random.randint(1, i) == 1— both are correct for size-1, but mixing 0-indexed and 1-indexed logic in the same implementation causes subtle bias. Pick one convention and stick to it throughout. -
Empty stream / empty list: if the stream has zero elements,
chosenremainsNone(or uninitialized). Always check for this before returning; the reservoir is undefined for an empty input. -
k greater than stream length: if the stream ends before the reservoir is full, return however many items arrived. Do not pad with
Noneunless the problem guarantees n ≥ k. -
Weighted sampling with zero-weight items:
random() ** (1/0)raises ZeroDivisionError. Skip zero-weight items entirely — they should never be selected. -
Integer vs. float keys in weighted sampling:
random.random()returns [0, 1). A key of exactly 0 raised to any power is 0 and will always be evicted. This is mathematically negligible but means items are never selected with probability exactly zero — acceptable in practice. -
Reservoir index range for size-k: the replacement index must be uniform in
[0, i-1](full history), not[0, k-1](just the reservoir). Using[0, k-1]biases toward recent items. -
Area calculation for integer grid points: the number of integer points in rectangle
[x1,y1,x2,y2]is(x2-x1+1)*(y2-y1+1), not(x2-x1)*(y2-y1). Forgetting the +1 leads to systematic underweighting of boundary points.
Implementation Notes
random.randint(a, b)in Python is inclusive on both ends:randint(1, i)returns values in [1, i]. This differs from most languages where the upper bound is exclusive.random.random()returns a float in [0.0, 1.0). For weighted sampling keys, userandom.random()notrandom.uniform(0,1)— they are equivalent butrandom()is marginally faster.- Python's
randommodule is not cryptographically secure. For security-sensitive sampling usesecretsmodule, but it lacksrandint-equivalent convenience; computesecrets.randbelow(i) + 1instead. heapqin Python is a min-heap. For weighted reservoir sampling, store(key, item)tuples directly — Python compares tuples lexicographically, so the key drives ordering correctly as long as keys are unique enough (float collisions are astronomically rare).- For linked list problems, the head reference is typically passed at construction time. Store it; do not re-traverse from outside the class.
Cross-Topic Relations
| Related Topic | Connection |
|---|---|
| Randomized Algorithms | Reservoir sampling is the canonical example of an online randomized algorithm; correctness is proved by probability theory, not by deterministic invariants |
| Binary Search | Used in offline weighted selection (prefix-sum array lookup) to find the target rectangle/bucket in O(log n) |
| Heap / Priority Queue | Size-k reservoir and weighted reservoir use a min-heap to maintain the current best-k items efficiently |
| Prefix Sum | Offline weighted sampling over known collections reduces to a prefix-sum construction followed by a single random draw |
| Math / Probability | The correctness proof is inductive; understanding why 1/i probability at each step yields uniform distribution requires basic probability |
See prefix-sum.md for full coverage of prefix sums used in offline weighted selection.
Interview Reference
Must-Know Problems
Core Reservoir Sampling
- LC 382 — Linked List Random Node (size-1 Algorithm R on linked list)
- LC 398 — Random Pick Index (Algorithm R over filtered positions)
Weighted / Geometric Selection
- LC 497 — Random Point in Non-Overlapping Rectangles (prefix-sum weighted selection + uniform point)
- LC 528 — Random Pick with Weight (offline weighted selection, prefix-sum + binary search)
Clarification Questions to Ask
- Is the input stream finite and known upfront, or truly online/unknown-length? (Determines if offline prefix-sum approach is viable.)
- Should sampling be with or without replacement? (Without replacement = reservoir; with replacement = independent draws.)
- Are weights integers or floats? Are they guaranteed positive? (Zero weights break the key formula.)
- For geometric problems: are coordinates integers or reals? (Affects how "uniform random point" is computed.)
- Will
getRandom()be called once or many times? (If many times and the data is static, precompute; if data changes, reservoir approach per-query is appropriate.)
Common Interview Mistakes
- Forgetting to handle the empty input case before running the algorithm, causing an uninitialized return value.
- Using the wrong index range for the replacement step in size-k sampling (
[0, k-1]instead of[0, i-1]), which biases the sample toward recent items. - Pre-computing the length of the linked list in a separate pass, then drawing a random index — this is correct but misses the point of the problem and fails if the list is truly a stream.
- Conflating rejection sampling (variable-time, used for continuous distributions within shapes) with reservoir sampling (fixed single-pass). Using rejection sampling for the "pick random node" problem is incorrect.
- In weighted sampling, computing key as
random() * winstead ofrandom() ** (1/w)— the former does not produce a correct proportional sample.
Typical Follow-Up Escalations
- "What if the list changes between calls (nodes added/deleted)?" — Re-run the single-pass Algorithm R per call for correctness; or maintain a count and random-access structure if random access is available.
- "What if you need k random nodes instead of 1?" — Switch to size-k Algorithm R: fill reservoir with first k, then for item i > k replace slot
randint(0, i-1)if that index falls in[0, k-1]. - "What if nodes have weights and you want weighted random selection?" — Use weighted reservoir sampling (A-Res): assign key
random() ** (1/w)per node, maintain a min-heap of size k on keys. - "Can you do this in O(log n) per update instead of O(n) per query?" — Build a balanced BST or segment tree over the values to support O(log n) random selection by rank, at the cost of O(n) space and O(log n) insert/delete.
Read Next
Start reading to get personalized recommendations
Explore Unread
Great job! You've read all available articles