Reservoir Sampling

Core Concepts

Reservoir Sampling — a family of randomized algorithms for selecting a uniform random sample of k items from a stream of n items when n is unknown or too large to store. The sample is chosen such that each item has exactly k/n probability of being selected at the end of the stream.

Reservoir — the in-memory array of k selected items maintained as the stream is processed. It holds the current candidate sample and is updated as new items arrive.

Online Algorithm — processes each item exactly once as it arrives, making an irrevocable accept/reject decision with no revisiting. Reservoir sampling is the canonical online sampling technique; it requires O(k) space regardless of stream length.

Uniform Sampling Guarantee — after processing the i-th item (1-indexed), every item seen so far has exactly k/i probability of being in the reservoir. This invariant holds at each step, so when the stream ends it holds globally.

Algorithm R — the classic reservoir sampling algorithm by Vitter (1985). Fills the reservoir with the first k items, then for each subsequent item at position i, includes it with probability k/i by replacing a uniformly random reservoir slot.

Weighted Reservoir Sampling — generalises Algorithm R to non-uniform weights. Each item is assigned a key u^(1/w) where u is uniform random in (0,1] and w is the item's weight. The k items with the largest keys form the weighted sample.

Rejection Sampling — a related but distinct technique: repeatedly draw candidates until one satisfies a condition (used in random point in shapes). Not reservoir sampling; it has variable runtime. Do not conflate the two.

Types / Variants

Variant	What it is	When to use	Key property
Size-1 reservoir	Select one item uniformly at random from a stream	Unknown-length linked list or stream; want one random item	Item at position i kept with probability 1/i
Size-k reservoir	Select k items uniformly at random without replacement	Sample k rows from large file; k known upfront	Each item kept with probability k/i at step i
Weighted reservoir	Select k items proportional to assigned weights	Weighted random sampling from stream	Key = `random() ** (1/w)`; keep k largest keys
Distributed / parallel	Merge two independent reservoir samples	MapReduce / multi-shard streams	Merge by treating combined stream as single pass

Key Algorithms

Algorithm R (Size-1 Special Case)

Select one uniform random item from an unknown-length stream. At position i (1-indexed), replace the current choice with the new item with probability 1/i.

Key insight: by induction, after i steps every item has seen probability 1/i of being selected. Adding item i+1 with probability 1/(i+1) and displacing the current item with that same probability preserves the invariant.

chosen = None
for i, item in enumerate(stream, 1):
    if random.randint(1, i) == 1:
        chosen = item

Time O(n), Space O(1).

Algorithm R (Size-k General Case)

Fill the reservoir with the first k items. For each subsequent item at 1-indexed position i > k, include it with probability k/i by replacing a uniformly random position in the reservoir.

Key insight: at step i, the probability that any specific earlier item is still in the reservoir is k/i. Including the new item with probability k/i and displacing a uniform random slot maintains this invariant.

reservoir = stream[:k]
for i, item in enumerate(stream[k:], k + 1):
    j = random.randint(0, i - 1)
    if j < k:
        reservoir[j] = item

Time O(n), Space O(k).

Weighted Reservoir Sampling (Algorithm A-Res)

Each item with weight w generates key random() ** (1.0 / w). Maintain a min-heap of size k on these keys. An item enters the reservoir if its key exceeds the current minimum; it displaces the minimum-key item.

Key insight: the key transformation converts a weight-proportional selection problem into a uniform-key competition. Items with higher weight produce stochastically larger keys.

import heapq, random
heap = []  # min-heap of (key, item)
for item, w in stream:
    key = random.random() ** (1.0 / w)
    if len(heap) < k or key > heap[0][0]:
        heapq.heapreplace(heap, (key, item)) if len(heap) == k else heapq.heappush(heap, (key, item))

Time O(n log k), Space O(k).

Random Node from Linked List

Apply size-1 Algorithm R to a singly linked list with unknown length. Traverse once; at node i, select it with probability 1/i.

This is the direct application of Algorithm R where the "stream" is the linked list traversal. No separate pass to count length is needed or desired — that would make it offline.

Time O(n) per query if list changes; O(n) one-time pass, O(1) space.

Random Pick Index (Multiple Indices with Same Value)

Given an array with duplicates, return a uniform random index of a target value. Traverse the array treating only positions where nums[i] == target as the stream; apply size-1 Algorithm R over that sub-stream.

Key insight: counting occurrences then calling random.randint requires a full pass anyway; reservoir sampling is equally correct and also works if the array is a stream.

count, chosen = 0, -1
for i, v in enumerate(nums):
    if v == target:
        count += 1
        if random.randint(1, count) == 1:
            chosen = i

Time O(n) per query, Space O(1).

Random Point in Non-Overlapping Rectangles

Select a point uniformly at random from the union of axis-aligned rectangles. This is weighted reservoir sampling where each rectangle's weight equals its area (number of integer points). Use prefix sums of areas and a binary search to select the rectangle, then pick a uniform point inside it.

Key insight: this is offline weighted selection — weights are known upfront, so prefix-sum + binary search is more efficient than streaming A-Res.

areas = [(r[2]-r[0]+1)*(r[3]-r[1]+1) for r in rects]
prefix = list(itertools.accumulate(areas))
idx = bisect.bisect_left(prefix, random.randint(1, prefix[-1]))

Time O(n) build, O(log n) per query, Space O(n).

Problem Taxonomy

By Problem Category

Problem Type	What it asks	Key Technique
Random node / element	Return a uniform random element from an unknown-length or non-indexable structure	Size-1 Algorithm R
Random index with duplicates	Return a uniform random index matching a target value	Algorithm R applied to filtered stream
Random point in geometric region	Return a uniform random point within a shape or union of shapes	Weighted selection by area + uniform point within chosen shape
Streaming k-sample	Maintain a uniform random sample of k items from an ongoing stream	Size-k Algorithm R

Problem Signal → Technique

Signal in problem statement	Likely technique
"linked list" + "random" + no length given	Size-1 Algorithm R (single traversal)
"unknown stream length" + "uniform random"	Reservoir sampling (Algorithm R)
"random index" + array has duplicates + "equal probability"	Algorithm R over filtered positions
"random point" + multiple rectangles / regions	Weighted selection by area (prefix sum + binary search)
"without replacement" + "stream"	Size-k Algorithm R
Weights given + "proportional probability" + stream	Weighted reservoir (A-Res with key = `u^(1/w)`)

Common Patterns

Pattern	When it applies	Mechanism
Single-pass counter	Selecting one uniform random item with O(1) extra space	Maintain `count` and `chosen`; at step i replace with probability 1/count
Heap-maintained reservoir	Selecting k items, especially weighted	Min-heap of size k on (key, item); evict minimum when a larger key arrives
Prefix-sum + binary search	Offline weighted selection where all weights known upfront	Compute cumulative weights; binary search a uniform random draw into the prefix array
Uniform point inside rectangle	After selecting a rectangle, pick x and y independently uniform	`x = random.randint(r[0], r[2])`, `y = random.randint(r[1], r[3])`
Lazy population counting	Counting how many valid items exist without a pre-pass	Let the reservoir algorithm's internal counter serve as the running count

Edge Cases & Pitfalls

Off-by-one in probability: using random.randint(0, i-1) == 0 vs. random.randint(1, i) == 1 — both are correct for size-1, but mixing 0-indexed and 1-indexed logic in the same implementation causes subtle bias. Pick one convention and stick to it throughout.
Empty stream / empty list: if the stream has zero elements, chosen remains None (or uninitialized). Always check for this before returning; the reservoir is undefined for an empty input.
k greater than stream length: if the stream ends before the reservoir is full, return however many items arrived. Do not pad with None unless the problem guarantees n ≥ k.
Weighted sampling with zero-weight items: random() ** (1/0) raises ZeroDivisionError. Skip zero-weight items entirely — they should never be selected.
Integer vs. float keys in weighted sampling: random.random() returns [0, 1). A key of exactly 0 raised to any power is 0 and will always be evicted. This is mathematically negligible but means items are never selected with probability exactly zero — acceptable in practice.
Reservoir index range for size-k: the replacement index must be uniform in [0, i-1] (full history), not [0, k-1] (just the reservoir). Using [0, k-1] biases toward recent items.
Area calculation for integer grid points: the number of integer points in rectangle [x1,y1,x2,y2] is (x2-x1+1)*(y2-y1+1), not (x2-x1)*(y2-y1). Forgetting the +1 leads to systematic underweighting of boundary points.

Implementation Notes

random.randint(a, b) in Python is inclusive on both ends: randint(1, i) returns values in [1, i]. This differs from most languages where the upper bound is exclusive.
random.random() returns a float in [0.0, 1.0). For weighted sampling keys, use random.random() not random.uniform(0,1) — they are equivalent but random() is marginally faster.
Python's random module is not cryptographically secure. For security-sensitive sampling use secrets module, but it lacks randint-equivalent convenience; compute secrets.randbelow(i) + 1 instead.
heapq in Python is a min-heap. For weighted reservoir sampling, store (key, item) tuples directly — Python compares tuples lexicographically, so the key drives ordering correctly as long as keys are unique enough (float collisions are astronomically rare).
For linked list problems, the head reference is typically passed at construction time. Store it; do not re-traverse from outside the class.

Cross-Topic Relations

Related Topic	Connection
Randomized Algorithms	Reservoir sampling is the canonical example of an online randomized algorithm; correctness is proved by probability theory, not by deterministic invariants
Binary Search	Used in offline weighted selection (prefix-sum array lookup) to find the target rectangle/bucket in O(log n)
Heap / Priority Queue	Size-k reservoir and weighted reservoir use a min-heap to maintain the current best-k items efficiently
Prefix Sum	Offline weighted sampling over known collections reduces to a prefix-sum construction followed by a single random draw
Math / Probability	The correctness proof is inductive; understanding why 1/i probability at each step yields uniform distribution requires basic probability

See prefix-sum.md for full coverage of prefix sums used in offline weighted selection.

Interview Reference

Must-Know Problems

Core Reservoir Sampling

LC 382 — Linked List Random Node (size-1 Algorithm R on linked list)
LC 398 — Random Pick Index (Algorithm R over filtered positions)

Weighted / Geometric Selection

LC 497 — Random Point in Non-Overlapping Rectangles (prefix-sum weighted selection + uniform point)
LC 528 — Random Pick with Weight (offline weighted selection, prefix-sum + binary search)

Clarification Questions to Ask

Is the input stream finite and known upfront, or truly online/unknown-length? (Determines if offline prefix-sum approach is viable.)
Should sampling be with or without replacement? (Without replacement = reservoir; with replacement = independent draws.)
Are weights integers or floats? Are they guaranteed positive? (Zero weights break the key formula.)
For geometric problems: are coordinates integers or reals? (Affects how "uniform random point" is computed.)
Will getRandom() be called once or many times? (If many times and the data is static, precompute; if data changes, reservoir approach per-query is appropriate.)

Common Interview Mistakes

Forgetting to handle the empty input case before running the algorithm, causing an uninitialized return value.
Using the wrong index range for the replacement step in size-k sampling ([0, k-1] instead of [0, i-1]), which biases the sample toward recent items.
Pre-computing the length of the linked list in a separate pass, then drawing a random index — this is correct but misses the point of the problem and fails if the list is truly a stream.
Conflating rejection sampling (variable-time, used for continuous distributions within shapes) with reservoir sampling (fixed single-pass). Using rejection sampling for the "pick random node" problem is incorrect.
In weighted sampling, computing key as random() * w instead of random() ** (1/w) — the former does not produce a correct proportional sample.

Typical Follow-Up Escalations

"What if the list changes between calls (nodes added/deleted)?" — Re-run the single-pass Algorithm R per call for correctness; or maintain a count and random-access structure if random access is available.
"What if you need k random nodes instead of 1?" — Switch to size-k Algorithm R: fill reservoir with first k, then for item i > k replace slot randint(0, i-1) if that index falls in [0, k-1].
"What if nodes have weights and you want weighted random selection?" — Use weighted reservoir sampling (A-Res): assign key random() ** (1/w) per node, maintain a min-heap of size k on keys.
"Can you do this in O(log n) per update instead of O(n) per query?" — Build a balanced BST or segment tree over the values to support O(log n) random selection by rank, at the cost of O(n) space and O(log n) insert/delete.

Explore Unread

Great job! You've read all available articles

Reservoir Sampling

Core Concepts

Reservoir — the in-memory array of k selected items maintained as the stream is processed. It holds the current candidate sample and is updated as new items arrive.

Types / Variants

Variant	What it is	When to use	Key property
Size-1 reservoir	Select one item uniformly at random from a stream	Unknown-length linked list or stream; want one random item	Item at position i kept with probability 1/i
Size-k reservoir	Select k items uniformly at random without replacement	Sample k rows from large file; k known upfront	Each item kept with probability k/i at step i
Weighted reservoir	Select k items proportional to assigned weights	Weighted random sampling from stream	Key = `random() ** (1/w)`; keep k largest keys
Distributed / parallel	Merge two independent reservoir samples	MapReduce / multi-shard streams	Merge by treating combined stream as single pass

Key Algorithms

Algorithm R (Size-1 Special Case)

Select one uniform random item from an unknown-length stream. At position i (1-indexed), replace the current choice with the new item with probability 1/i.

chosen = None
for i, item in enumerate(stream, 1):
    if random.randint(1, i) == 1:
        chosen = item

Time O(n), Space O(1).

Algorithm R (Size-k General Case)

Fill the reservoir with the first k items. For each subsequent item at 1-indexed position i > k, include it with probability k/i by replacing a uniformly random position in the reservoir.

reservoir = stream[:k]
for i, item in enumerate(stream[k:], k + 1):
    j = random.randint(0, i - 1)
    if j < k:
        reservoir[j] = item

Time O(n), Space O(k).

Weighted Reservoir Sampling (Algorithm A-Res)

Key insight: the key transformation converts a weight-proportional selection problem into a uniform-key competition. Items with higher weight produce stochastically larger keys.

import heapq, random
heap = []  # min-heap of (key, item)
for item, w in stream:
    key = random.random() ** (1.0 / w)
    if len(heap) < k or key > heap[0][0]:
        heapq.heapreplace(heap, (key, item)) if len(heap) == k else heapq.heappush(heap, (key, item))

Time O(n log k), Space O(k).

Random Node from Linked List

Apply size-1 Algorithm R to a singly linked list with unknown length. Traverse once; at node i, select it with probability 1/i.

This is the direct application of Algorithm R where the "stream" is the linked list traversal. No separate pass to count length is needed or desired — that would make it offline.

Time O(n) per query if list changes; O(n) one-time pass, O(1) space.

Random Pick Index (Multiple Indices with Same Value)

Key insight: counting occurrences then calling random.randint requires a full pass anyway; reservoir sampling is equally correct and also works if the array is a stream.

count, chosen = 0, -1
for i, v in enumerate(nums):
    if v == target:
        count += 1
        if random.randint(1, count) == 1:
            chosen = i

Time O(n) per query, Space O(1).

Random Point in Non-Overlapping Rectangles

Key insight: this is offline weighted selection — weights are known upfront, so prefix-sum + binary search is more efficient than streaming A-Res.

areas = [(r[2]-r[0]+1)*(r[3]-r[1]+1) for r in rects]
prefix = list(itertools.accumulate(areas))
idx = bisect.bisect_left(prefix, random.randint(1, prefix[-1]))

Time O(n) build, O(log n) per query, Space O(n).

Problem Taxonomy

By Problem Category

Problem Type	What it asks	Key Technique
Random node / element	Return a uniform random element from an unknown-length or non-indexable structure	Size-1 Algorithm R
Random index with duplicates	Return a uniform random index matching a target value	Algorithm R applied to filtered stream
Random point in geometric region	Return a uniform random point within a shape or union of shapes	Weighted selection by area + uniform point within chosen shape
Streaming k-sample	Maintain a uniform random sample of k items from an ongoing stream	Size-k Algorithm R

Problem Signal → Technique

Signal in problem statement	Likely technique
"linked list" + "random" + no length given	Size-1 Algorithm R (single traversal)
"unknown stream length" + "uniform random"	Reservoir sampling (Algorithm R)
"random index" + array has duplicates + "equal probability"	Algorithm R over filtered positions
"random point" + multiple rectangles / regions	Weighted selection by area (prefix sum + binary search)
"without replacement" + "stream"	Size-k Algorithm R
Weights given + "proportional probability" + stream	Weighted reservoir (A-Res with key = `u^(1/w)`)

Common Patterns

Pattern	When it applies	Mechanism
Single-pass counter	Selecting one uniform random item with O(1) extra space	Maintain `count` and `chosen`; at step i replace with probability 1/count
Heap-maintained reservoir	Selecting k items, especially weighted	Min-heap of size k on (key, item); evict minimum when a larger key arrives
Prefix-sum + binary search	Offline weighted selection where all weights known upfront	Compute cumulative weights; binary search a uniform random draw into the prefix array
Uniform point inside rectangle	After selecting a rectangle, pick x and y independently uniform	`x = random.randint(r[0], r[2])`, `y = random.randint(r[1], r[3])`
Lazy population counting	Counting how many valid items exist without a pre-pass	Let the reservoir algorithm's internal counter serve as the running count

Edge Cases & Pitfalls

Off-by-one in probability: using random.randint(0, i-1) == 0 vs. random.randint(1, i) == 1 — both are correct for size-1, but mixing 0-indexed and 1-indexed logic in the same implementation causes subtle bias. Pick one convention and stick to it throughout.
Empty stream / empty list: if the stream has zero elements, chosen remains None (or uninitialized). Always check for this before returning; the reservoir is undefined for an empty input.
k greater than stream length: if the stream ends before the reservoir is full, return however many items arrived. Do not pad with None unless the problem guarantees n ≥ k.
Weighted sampling with zero-weight items: random() ** (1/0) raises ZeroDivisionError. Skip zero-weight items entirely — they should never be selected.
Integer vs. float keys in weighted sampling: random.random() returns [0, 1). A key of exactly 0 raised to any power is 0 and will always be evicted. This is mathematically negligible but means items are never selected with probability exactly zero — acceptable in practice.
Reservoir index range for size-k: the replacement index must be uniform in [0, i-1] (full history), not [0, k-1] (just the reservoir). Using [0, k-1] biases toward recent items.
Area calculation for integer grid points: the number of integer points in rectangle [x1,y1,x2,y2] is (x2-x1+1)*(y2-y1+1), not (x2-x1)*(y2-y1). Forgetting the +1 leads to systematic underweighting of boundary points.

Implementation Notes

random.randint(a, b) in Python is inclusive on both ends: randint(1, i) returns values in [1, i]. This differs from most languages where the upper bound is exclusive.
random.random() returns a float in [0.0, 1.0). For weighted sampling keys, use random.random() not random.uniform(0,1) — they are equivalent but random() is marginally faster.
Python's random module is not cryptographically secure. For security-sensitive sampling use secrets module, but it lacks randint-equivalent convenience; compute secrets.randbelow(i) + 1 instead.
heapq in Python is a min-heap. For weighted reservoir sampling, store (key, item) tuples directly — Python compares tuples lexicographically, so the key drives ordering correctly as long as keys are unique enough (float collisions are astronomically rare).
For linked list problems, the head reference is typically passed at construction time. Store it; do not re-traverse from outside the class.

Cross-Topic Relations

Related Topic	Connection
Randomized Algorithms	Reservoir sampling is the canonical example of an online randomized algorithm; correctness is proved by probability theory, not by deterministic invariants
Binary Search	Used in offline weighted selection (prefix-sum array lookup) to find the target rectangle/bucket in O(log n)
Heap / Priority Queue	Size-k reservoir and weighted reservoir use a min-heap to maintain the current best-k items efficiently
Prefix Sum	Offline weighted sampling over known collections reduces to a prefix-sum construction followed by a single random draw
Math / Probability	The correctness proof is inductive; understanding why 1/i probability at each step yields uniform distribution requires basic probability

See prefix-sum.md for full coverage of prefix sums used in offline weighted selection.

Interview Reference

Must-Know Problems

Core Reservoir Sampling

LC 382 — Linked List Random Node (size-1 Algorithm R on linked list)
LC 398 — Random Pick Index (Algorithm R over filtered positions)

Weighted / Geometric Selection

LC 497 — Random Point in Non-Overlapping Rectangles (prefix-sum weighted selection + uniform point)
LC 528 — Random Pick with Weight (offline weighted selection, prefix-sum + binary search)

Clarification Questions to Ask

Is the input stream finite and known upfront, or truly online/unknown-length? (Determines if offline prefix-sum approach is viable.)
Should sampling be with or without replacement? (Without replacement = reservoir; with replacement = independent draws.)
Are weights integers or floats? Are they guaranteed positive? (Zero weights break the key formula.)
For geometric problems: are coordinates integers or reals? (Affects how "uniform random point" is computed.)
Will getRandom() be called once or many times? (If many times and the data is static, precompute; if data changes, reservoir approach per-query is appropriate.)

Common Interview Mistakes

Forgetting to handle the empty input case before running the algorithm, causing an uninitialized return value.
Using the wrong index range for the replacement step in size-k sampling ([0, k-1] instead of [0, i-1]), which biases the sample toward recent items.
Pre-computing the length of the linked list in a separate pass, then drawing a random index — this is correct but misses the point of the problem and fails if the list is truly a stream.
Conflating rejection sampling (variable-time, used for continuous distributions within shapes) with reservoir sampling (fixed single-pass). Using rejection sampling for the "pick random node" problem is incorrect.
In weighted sampling, computing key as random() * w instead of random() ** (1/w) — the former does not produce a correct proportional sample.

Typical Follow-Up Escalations

"What if the list changes between calls (nodes added/deleted)?" — Re-run the single-pass Algorithm R per call for correctness; or maintain a count and random-access structure if random access is available.
"What if you need k random nodes instead of 1?" — Switch to size-k Algorithm R: fill reservoir with first k, then for item i > k replace slot randint(0, i-1) if that index falls in [0, k-1].
"What if nodes have weights and you want weighted random selection?" — Use weighted reservoir sampling (A-Res): assign key random() ** (1/w) per node, maintain a min-heap of size k on keys.
"Can you do this in O(log n) per update instead of O(n) per query?" — Build a balanced BST or segment tree over the values to support O(log n) random selection by rank, at the cost of O(n) space and O(log n) insert/delete.

Explore Unread

Great job! You've read all available articles

Reservoir Sampling

Core Concepts

Types / Variants

Key Algorithms

Algorithm R (Size-1 Special Case)

Algorithm R (Size-k General Case)

Weighted Reservoir Sampling (Algorithm A-Res)

Random Node from Linked List

Random Pick Index (Multiple Indices with Same Value)

Random Point in Non-Overlapping Rectangles

Problem Taxonomy

By Problem Category

Problem Signal → Technique

Common Patterns

Edge Cases & Pitfalls

Implementation Notes

Cross-Topic Relations

Interview Reference

Must-Know Problems

Clarification Questions to Ask

Common Interview Mistakes

Typical Follow-Up Escalations

Read Next

Explore Unread

Reservoir Sampling

Core Concepts

Types / Variants

Key Algorithms

Algorithm R (Size-1 Special Case)

Algorithm R (Size-k General Case)

Weighted Reservoir Sampling (Algorithm A-Res)

Random Node from Linked List

Random Pick Index (Multiple Indices with Same Value)

Random Point in Non-Overlapping Rectangles

Problem Taxonomy

By Problem Category

Problem Signal → Technique

Common Patterns

Edge Cases & Pitfalls

Implementation Notes

Cross-Topic Relations

Interview Reference

Must-Know Problems

Clarification Questions to Ask

Common Interview Mistakes

Typical Follow-Up Escalations

Read Next

Explore Unread