Data Stream
Topic type: Technique-Based (online algorithm design)
Data Stream problems require processing elements one at a time as they arrive, maintaining a data structure so each query after each insertion is answered correctly and efficiently. The defining constraint is that future elements are unknown — you cannot batch-sort or two-pass.
Core Concepts
Online algorithm — processes each element upon arrival; cannot defer computation until all input is seen. Opposite of offline algorithms that sort or index the full input first.
Streaming invariant — after every add(val) call, the data structure's internal state must already satisfy whatever query follows (median, kth largest, window max). Lazy rebuilding is not allowed.
Amortized O(log n) per operation — the standard bar. A solution that occasionally does O(n) work to rebalance violates the streaming contract unless bounded tightly.
Bounded stream (sliding window) — only the last k elements are relevant; older elements are evicted. The window size k is given upfront or implied by a time constraint.
Unbounded stream — all elements seen so far matter; the structure grows indefinitely.
Order statistic — a query asking for the r-th smallest/largest in the current stream. The median is the (n/2)-th order statistic.
Frequency statistic — a query asking for the most/least frequent element, or elements with frequency ≥ threshold.
Structural Invariants
The two-heap approach (the canonical hard technique) maintains two invariants simultaneously:
| Invariant | What it requires | Consequence of violation |
|---|---|---|
| Size balance | len(max_heap) == len(min_heap) or len(max_heap) == len(min_heap) + 1 | Median formula produces wrong result |
| Heap ordering | Every element in max_heap ≤ every element in min_heap | Median is drawn from the wrong partition |
These two invariants are independent; an insert can violate either or both and must restore both before returning.
For a min-heap-of-size-k (kth largest):
| Invariant | What it requires |
|---|---|
| Heap is never larger than k | Pop when size exceeds k |
| Heap root is the kth largest seen so far | Maintained automatically by the eviction rule |
Types / Variants
| Variant | Query type | Window | Core structure |
|---|---|---|---|
| Running median | Exact median after each insert | Unbounded | Two heaps (max + min) |
| Kth largest | Exact kth largest after each insert | Unbounded | Min-heap of size k |
| Moving average | Mean of last k elements | Bounded (size k) | Fixed-size queue |
| Window maximum | Max of last k elements | Bounded (size k) | Monotonic deque |
| Running span | How many consecutive prior elements ≤ current | Unbounded | Monotonic stack |
| Random sample | Uniform random sample of size k from stream | Unbounded | Reservoir array |
| Stream membership / matching | Does any suffix match a pattern set | Unbounded | Trie (Aho-Corasick) |
Key Algorithms
Two-Heap Running Median
Maintains the lower half of elements in a max-heap and the upper half in a min-heap. After each insert, the median is either the max of the lower heap (odd total) or the average of both heap tops (even total). O(log n) insert, O(1) query.
Key insight: an element may enter the wrong heap; cross-heap rebalancing restores the ordering invariant.
# insertion skeleton — non-obvious part
heappush(lo, -val) # lo is max-heap (negate to simulate)
heappush(hi, -heappop(lo)) # move lo's max to hi
if len(hi) > len(lo):
heappush(lo, -heappop(hi))
Time: O(log n) per add, O(1) per findMedian. Space: O(n).
Min-Heap of Size K (Kth Largest)
Keep exactly k elements in a min-heap; the root is always the kth largest seen. On each insert, push then pop if size exceeds k. O(log k) per insert, O(1) query.
Time: O(log k) per add. Space: O(k).
Fixed-Size Queue (Moving Average / Last K Sum)
Maintain a deque of the last k values and a running sum. On insert, append right; if length exceeds k, pop left and subtract from sum. O(1) per operation.
Time: O(1) per add. Space: O(k).
Monotonic Deque (Sliding Window Maximum)
Maintain a deque of indices in decreasing value order. On each insert at index i, pop from the back while arr[back] ≤ arr[i]; evict the front if it is outside the window. The front is always the window maximum. O(1) amortized per insert.
See sliding-window.md and monotonic-queue.md for full coverage.
Time: O(n) total, O(1) amortized per element. Space: O(k).
Monotonic Stack (Stock Span)
For each new price, pop elements from a stack while they are ≤ current price. The span equals current_index − index_of_new_stack_top. O(1) amortized per insert because each element is pushed and popped at most once.
See monotonic-stack.md for full coverage.
Time: O(n) total. Space: O(n) worst case.
Reservoir Sampling (Random K from Stream)
Keep the first k elements as the reservoir. For the i-th element (i > k), with probability k/i, replace a uniformly random reservoir slot. The reservoir always contains a uniform random sample of size k. O(1) per element, O(k) space.
Key insight: the accept probability k/i combined with the prior state's uniform distribution maintains the invariant by induction.
See reservoir-sampling.md for full coverage.
Ordered Set / SortedList (Rank Queries)
When queries ask for arbitrary percentile ranks or the median with deletions, use Python's sortedcontainers.SortedList. Supports O(log n) insert, O(log n) delete, O(log n) rank query. The two-heap approach cannot handle deletions without additional bookkeeping; SortedList handles them cleanly.
Time: O(log n) per insert/delete/rank. Space: O(n).
Advanced Techniques
Lazy Deletion Heap
What it solves: Problems where elements must be removed from a heap mid-stream (e.g., sliding window median where the left edge must be evicted from either heap).
Mechanism: Maintain a to_delete counter-map. Instead of removing from the heap immediately (O(n)), mark it deleted. When popping from the heap, skip elements whose count in to_delete is nonzero.
When to use over simpler alternatives: Only when deletion from a heap is required and a bounded window makes a SortedList overkill. For arbitrary deletions from a dynamic sorted set, SortedList is cleaner.
Time: O(log n) amortized per operation. Space: O(n) for the deletion map.
Binary Indexed Tree on Compressed Coordinates (BIT Stream Rank)
What it solves: Queries asking how many elements in the stream so far are ≤ x, or the exact kth smallest, when the value range is bounded or compressible.
Mechanism: Compress values to ranks [1..m] offline (or online with a pre-defined range). Maintain a BIT over those ranks. Each insert is a point update; rank query is a prefix sum query; kth smallest is a binary search on the BIT.
When to use over simpler alternatives: Only when the value range is known in advance (all values given at construction) or small enough to not compress. For unknown/unbounded ranges, SortedList is preferable. For median only, two heaps are simpler.
Time: O(log m) per insert and rank query. Space: O(m).
Trie-Based Stream Matching (Aho-Corasick)
What it solves: Given a stream of characters, after each character determine if any word from a dictionary is a suffix of the stream seen so far.
Mechanism: Build an Aho-Corasick automaton from the dictionary. Maintain a current state pointer; advance it on each new character following failure links. Any terminal state signals a match.
When to use: Only when multiple patterns must be matched simultaneously. For a single pattern, a rolling hash or KMP suffix is simpler.
See trie.md for automaton construction details.
Time: O(sum of pattern lengths) build, O(1) per stream character. Space: O(sum of pattern lengths × alphabet size).
Problem Taxonomy
By Problem Category
| Category | What it asks | Key technique |
|---|---|---|
| Running median | Exact median after each insert | Two heaps |
| Running kth order statistic | Exact kth largest/smallest in all elements so far | Min-heap size k (kth largest) or two heaps (arbitrary k) |
| Sliding window aggregate | Mean/max/min of exactly last k elements | Fixed queue (mean), monotonic deque (max/min) |
| Sliding window median | Median of last k elements | Two heaps + lazy deletion, or SortedList |
| Span query | Consecutive prior elements ≤/≥ current | Monotonic stack |
| Recent activity count | How many events in last t seconds | Queue with timestamp eviction |
| Frequency tracking | Most frequent element after each insert | Max-heap + freq map, or frequency-indexed bucket |
| Stream random sample | Uniform random k-sample from stream | Reservoir sampling |
| Stream pattern match | Does any dictionary word end at current position | Trie / Aho-Corasick |
Problem Signal → Technique
| Signal in problem statement | Likely technique |
|---|---|
| "find median", "median after each" | Two heaps |
| "kth largest in a stream" | Min-heap of size k |
| "moving average", "average of last k" | Fixed-size deque + running sum |
| "sliding window maximum" | Monotonic deque |
| "span", "how many consecutive days" | Monotonic stack |
| "last t seconds" / "last k requests" | Queue with eviction |
| "random pick", "random sample" | Reservoir sampling |
| "search in stream of characters" | Trie + Aho-Corasick |
| "add and remove" from ordered structure | SortedList or BIT on compressed coords |
Common Patterns
| Pattern | When it applies | Mechanism |
|---|---|---|
| Dual-heap partition | Maintain lower and upper halves for median | Push to lo, always move lo's max to hi, rebalance sizes |
| Eviction queue | Sliding window with fixed size | Append right, pop left when len > k |
| Timestamp eviction | Sliding window defined by time range | Pop front while front_time < now − window |
| Lazy deletion | Need to evict from middle of a heap | Counter-map of to-delete; skip on pop |
| Running aggregate | Sum/count/mean over unbounded stream | Single variable, update in O(1) per insert |
| Frequency bucket | O(1) max-frequency tracking | Dict of freq → set of elements; track current max freq |
Edge Cases & Pitfalls
-
Empty stream query. Calling
findMedian()before anyaddNum()is undefined; guard with a check or document that at least one call precedes queries. -
Even vs. odd total count (median). When stream length is odd, median = top of lo heap; when even, median = average of both tops. Mixing these returns wrong values. Make the size relationship explicit in a comment.
-
Max-heap negation in Python. Python's
heapqis min-heap only. Negate values before pushing into lo; negate again when reading. Forgetting to negate on one side produces incorrect comparisons silently. -
Kth largest with duplicates. A min-heap of size k handles duplicates correctly since the same value may occupy multiple heap slots. No special casing needed.
-
Reservoir sampling — first k elements. The first k elements are added unconditionally. The probability formula k/i applies only for i > k. Applying it to the first k elements (e.g., always replacing) breaks uniformity.
-
Monotonic stack span at stream start. If the stack is empty when computing span, the span is
current_index + 1(all prior elements are smaller). Missing this case produces span = 0 or an index error. -
Timestamp window boundary. "Last t milliseconds" — decide if t is inclusive or exclusive. Off-by-one in the eviction condition (
< now − tvs.<= now − t) inverts the boundary. -
SortedList index for median. After n insertions,
sl[n//2]is the upper-median for even n. Some problems define median as lower-median or average. Read the problem's definition before indexing.
Implementation Notes
- Python's
heapqhas no decrease-key or arbitrary delete; use lazy deletion or switch tosortedcontainers.SortedListwhen deletions are needed. sortedcontainers.SortedListis not in the standard library; it is available on LeetCode but must be imported:from sortedcontainers import SortedList.SortedList.addis O(log n);SortedList[i]index access is O(log n), not O(1) — do not call it in a loop to build a sorted copy.- Reservoir sampling requires a random integer in
[0, i); userandom.randint(0, i)in Python (inclusive on both ends) means the range must berandom.randint(0, i-1)orrandom.randrange(i). - For the two-heap median, lo stores negated values; printing or comparing lo[0] must negate again. A helper
lo_max = -lo[0]avoids confusion. collections.dequewithmaxlen=kauto-evicts the left element on append; this is a clean alternative to manualpopleft()for moving average but does not maintain a running sum (you still need one).
Cross-Topic Relations
| Related topic | Specific connection |
|---|---|
| Heap / Priority Queue | Two-heap and kth-largest patterns are direct applications; all stream heap solutions depend on sift-up/sift-down. See heap-priority-queue.md |
| Monotonic Stack | Stock span and "next greater element in stream" are monotonic stack problems framed as streams. See monotonic-stack.md |
| Sliding Window | Bounded-stream problems (moving average, window max) are sliding window problems with an online constraint. See sliding-window.md |
| Binary Indexed Tree | BIT on compressed coordinates is the hard alternative to SortedList for rank queries. See binary-indexed-tree.md |
| Trie | Aho-Corasick stream matching is built on a Trie with failure links. See trie.md |
| Reservoir Sampling | One-pass stream sampling; has its own LeetCode tag. See reservoir-sampling.md |
| Design | Many stream problems appear under the Design tag; the interface (class with add/query) is a design pattern, not just an algorithm choice |
Interview Reference
Must-Know Problems
Two-Heap (Running Median)
- Find Median from Data Stream (Hard) — canonical; implement and recite the two invariants
- Sliding Window Median (Hard) — same heaps + lazy deletion for eviction
Kth Order Statistic
- Kth Largest Element in a Stream (Easy) — min-heap size k; build and explain initialization
- IPO (Hard) — two-heap variant where you greedily pick max-profit available project
Bounded Window
- Moving Average from Data Stream (Easy) — fixed-size deque; trivial but confirms window thinking
- Online Stock Span (Medium) — monotonic stack framed as stream; classic
Event / Timestamp Window
- Number of Recent Calls (Easy) — queue with timestamp eviction
- Design Hit Counter (Medium) — same pattern, slightly harder eviction logic
Frequency Tracking
- Maximum Frequency Stack (Hard) — frequency-indexed bucket + stack; requires freq map and bucket map
String Stream
- Stream of Characters (Hard) — Aho-Corasick; hardest in tag
Clarification Questions to Ask
- Is the stream bounded (fixed k) or unbounded? Changes structure entirely.
- Are deletions ever needed from the stream, or only insertions?
- What is the value range? Determines whether BIT compression is viable.
- What is the exact definition of median for even-length streams — lower, upper, or average?
- Is the kth largest by rank (with duplicates counting separately) or by distinct value?
- For time-window problems: is the boundary inclusive or exclusive?
Common Interview Mistakes
- Forgetting to negate when using Python
heapqas a max-heap; values read back are wrong but no exception is raised. - Implementing the two-heap but only checking one invariant (size or ordering); failing to restore both after each insert.
- Using a sorted list for kth largest when a size-k heap is O(log k) vs. O(log n) — stating the wrong complexity.
- For sliding window median, attempting to remove from a heap with
heap.remove(val)which is O(n); not knowing lazy deletion. - Not handling the case where the stream has fewer than k elements when asked for the kth largest; returning
heap[0]from an empty heap raises an exception. - Initializing reservoir sampling with
random.randint(0, i)(inclusive) which gives a slightly wrong distribution — should berandom.randrange(i).
Typical Follow-Up Escalations
- "Now support deletion from the stream" → lazy deletion heap or switch to SortedList.
- "Now find the median of the last k elements only (sliding window median)" → lazy deletion + eviction tracking on both heaps.
- "Now the stream has up to 10^9 distinct values — how does your BIT scale?" → offline coordinate compression or switch to SortedList.
- "Can you do kth largest in O(1) query instead of O(log k)?" → maintain the heap; the root already is O(1) — confirm you're not confusing query with insert.
- "What if k is not fixed and the user can change it at runtime?" → requires re-building the heap; or maintain a SortedList and index into it dynamically.
Read Next
Start reading to get personalized recommendations
Explore Unread
Great job! You've read all available articles