Streaming & Bounded-Memory Algorithms

The 'now optimize it for a stream with limited memory' follow-up: one-pass algorithms and sketches — top-K with a heap, reservoir sampling, Boyer-Moore majority, Count-Min, Bloom filters, HyperLogLog, running median, and external sort for data bigger than RAM.

The follow-up that trips people

You solve the array problem, then the interviewer adds: "now the data arrives as a stream and you can't store it all — do it in one pass with bounded memory." This is its own toolkit. The streaming model: elements arrive one at a time, you get one pass (or few), and you must answer using memory sublinear in the input — often O(k), O(log n), or O(1). When exact answers need too much memory, you trade a little accuracy for a lot of space with a sketch.

The unlock: ask "what do I actually need to keep?" For many questions it's far less than the whole stream.

💡 Visualizing the streaming concepts:

Reservoir Sampling (Bridge Spotting): Imagine standing on a bridge watching cars pass. You want to pick exactly 3 random cars from all that pass today, but you have no idea how many cars will cross. You start by recording the first 3 cars. When the 4th car passes, you roll a 4-sided die: if it lands on 1, 2, or 3, the new car replaces the car in that slot; if it lands on 4, you let it go. So the 4th car is kept with probability 3/4. When the 5th passes, you roll a 5-sided die and again replace only on 1 to 3 (probability 3/5), and so on. At the end, every car has had the same 3/n chance of being in your sample.

Boyer-Moore Voting (Political Cancellation): Imagine a room full of people shouting their political party. Repeatedly pair any person with someone from a different party and send both out of the room. If one party holds a strict majority (>50%), it has more members than all other parties combined, so it can never be fully cancelled: whoever remains at the end belongs to that party.

Bloom Filter (VIP List Bouncer): Think of a bouncer checking a nightclub VIP list from memory. When the bouncer says "you're not on the list," that is always right (no false negatives). When the bouncer says "you're on the list," that is usually right, but occasionally a name that merely resembles names on the list gets waved through (a false positive), so a "yes" needs a secondary check against the real list.

The toolkit

Question on a stream	Technique	Memory	Exact?
K largest / Kth largest so far	Min-heap of size K	O(k)	exact
Running median	Two heaps (max-heap + min-heap)	O(n)*	exact
Uniform random sample of k	Reservoir sampling	O(k)	exact (uniform)
Majority (> n/2) element	Boyer–Moore voting	O(1)	exact
Top frequent items (heavy hitters)	Misra–Gries / Count-Min Sketch	O(k)	approximate
"Have I seen x?"	Bloom filter	O(bits)	approx (no false negatives)
Count of distinct items	HyperLogLog	O(log log n)	approximate
Window stat over the stream	Ring buffer / monotonic deque	O(window)	exact
Sort data bigger than RAM	External merge sort (k-way)	O(buffer)	exact

*Two-heaps keeps everything if you need the true median of all seen; cap it (a sliding window or a sketch) when memory is bounded.

Top-K with a heap — the workhorse

To keep the K largest of a stream, hold a min-heap of size K. For each new element: if the heap has fewer than K items, push it; otherwise, if it beats the heap's minimum, pop the minimum and push the new element. The root is always the Kth largest seen so far, and you never store more than K items. The same structure answers "Kth largest in a stream" in O(log k) per element.

The heap itself is stored in an array: the parent of index i is ⌊(i−1)/2⌋ and its children are 2i+1 and 2i+2, with parent ≤ children at all times. Each push or extract-min costs O(log n) time for a heap of n items and O(1) auxiliary space.
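The push-or-replace rule for the size-K min-heap can be sketched with Python's standard `heapq` module. The function name `k_largest` is illustrative, not something the text above defines, and the input is assumed to be any iterable of mutually comparable values.

```python
import heapq

def k_largest(stream, k):
    """Return the k largest items of an iterable, largest first, using O(k) memory."""
    if k <= 0:
        return []
    heap = []  # min-heap; heap[0] is the smallest of the items currently kept
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)        # still filling up to k items
        elif x > heap[0]:
            heapq.heapreplace(heap, x)     # pop the minimum and push x in one step
    return sorted(heap, reverse=True)

print(k_largest(iter([5, 1, 9, 3, 7, 8]), 3))
```

Once the heap is full, `heap[0]` is the Kth largest value seen so far, which is all a "Kth largest in a stream" query needs.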

Reservoir sampling — a fair sample without the size

You want k uniformly random elements but don't know the stream length and can't buffer it. Keep the first k; for the i-th element (i > k), keep it with probability k/i, evicting a uniformly random current member. Every element ends up in the sample with probability k/n (provably uniform), in O(k) memory and one pass. This is how you sample logs or events at scale.
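The k/n claim follows from a short telescoping product. As a sketch, with n the final stream length and items numbered from 1: item i is admitted with probability k/i, and at each later step j the newcomer is admitted with probability k/j and then hits any one particular slot with probability 1/k.

```latex
\Pr[\text{item } i \text{ in final sample}]
  = \underbrace{\frac{k}{i}}_{\text{admitted at step } i}
    \prod_{j=i+1}^{n}
    \underbrace{\left(1 - \frac{k}{j}\cdot\frac{1}{k}\right)}_{\text{not evicted at step } j}
  = \frac{k}{i}\prod_{j=i+1}^{n}\frac{j-1}{j}
  = \frac{k}{i}\cdot\frac{i}{n}
  = \frac{k}{n}, \qquad i > k.
```

For one of the first k items, the admission factor is 1 and the product starts at j = k+1, which again telescopes to k/n.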

Boyer–Moore — a majority in one variable

If an element appears more than n/2 times, you can find it with one counter and one candidate. Sweep the stream: if the counter is 0, adopt the current element as the candidate; then increment the counter if the element equals the candidate, otherwise decrement it. The survivor is the only possible majority (verify with a second pass if "majority" isn't guaranteed). O(1) memory: a beautiful example of "keep almost nothing."
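When a majority is not guaranteed, the surviving candidate is only a suspect, so a second counting pass is needed. A minimal sketch of that two-pass variant follows; the name `majority_or_none` is illustrative, and the input is assumed to be a list (or other re-iterable sequence), since a true one-shot stream cannot be read twice.

```python
def majority_or_none(nums):
    """Return the element occurring more than len(nums)/2 times, or None."""
    # Pass 1: Boyer-Moore vote leaves the only possible majority as candidate.
    candidate, count = None, 0
    for x in nums:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    # Pass 2: confirm the candidate really exceeds half of the input.
    occurrences = sum(1 for x in nums if x == candidate)
    if nums and occurrences > len(nums) // 2:
        return candidate
    return None

print(majority_or_none([2, 2, 1, 1, 1, 2, 2]))
print(majority_or_none([1, 2, 3]))
```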

Sketches — pay accuracy, save memory

When even O(k) exact state is too much at true scale, sketches answer approximately in tiny space:

Count-Min Sketch — a few hash functions into counter arrays; estimates an item's frequency with a bounded over-count. Heavy-hitters/top-K at firehose scale.
Bloom filter — a bit array + hashes for set membership. "Definitely not seen" or "probably seen" — no false negatives, tunable false-positive rate. Front a cache/DB to kill lookups for keys that don't exist (cache penetration).
HyperLogLog — counts distinct items (unique visitors, distinct IPs) in ~1.5 KB with a couple-percent error, versus a hash set that grows without bound.
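To make the first two sketches concrete, here is a minimal Python version of each. Everything named here is an illustrative assumption: the `_hash` helper (seeded SHA-256, chosen for determinism rather than speed), the class and method names, and the default sizes. Production implementations typically use faster non-cryptographic hashes and size the arrays from a target error rate.

```python
import hashlib

def _hash(item, seed):
    """Deterministic 64-bit hash of item for a given seed (illustrative choice)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width = width
        # One counter row per hash function.
        self.rows = [[0] * width for _ in range(depth)]

    def add(self, item, count=1):
        for seed, row in enumerate(self.rows):
            row[_hash(item, seed) % self.width] += count

    def estimate(self, item):
        # Collisions only ever add, so the minimum over rows is the tightest
        # upper bound: the estimate never undercounts.
        return min(row[_hash(item, seed) % self.width]
                   for seed, row in enumerate(self.rows))

class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        return [_hash(item, seed) % self.num_bits for seed in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely not seen"; True means "probably seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

cms = CountMinSketch()
for word in ["a", "b", "a", "a"]:
    cms.add(word)
print(cms.estimate("a"))           # at least 3; collisions can only inflate it

bloom = BloomFilter()
bloom.add("alice")
print(bloom.might_contain("alice"))
```

Both structures use fixed memory no matter how many items are added; what degrades as they fill up is accuracy (larger over-counts, more false positives), never the one-sided guarantee.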

Name the trade-off out loud

The senior move isn't reciting a sketch — it's saying "exact needs O(n) memory I don't have, so I'll trade ~1% error for O(log log n) with HyperLogLog," and knowing when approximate is unacceptable (billing, correctness) versus fine (dashboards, monitoring).
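To see what that trade looks like in code, here is a simplified HyperLogLog sketch. It is an illustration under stated assumptions, not a production implementation: the class name, the SHA-256-based 64-bit hash, and the default precision `p=12` (4096 registers) are all choices made for this example, and it includes only the standard small-range correction. The typical relative error is about 1.04/√m for m registers.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=12):
        self.p = p                     # number of index bits
        self.m = 1 << p                # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")           # 64-bit hash
        idx = h >> (64 - self.p)                        # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)           # remaining 64 - p bits
        # Rank = position of the first 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)           # bias constant, valid for m >= 128
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate

hll = HyperLogLog()
for i in range(50000):
    hll.add(f"user-{i}")
print(round(hll.count()))   # close to 50000, typically within a few percent
```

Adding the same item again never changes a register, which is why duplicates cost nothing and memory stays fixed at m small counters.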

Data bigger than RAM → external sort

Can't sort 100 GB in 8 GB of RAM? External merge sort: read chunks that do fit, sort each in memory, write sorted runs to disk, then k-way merge the runs with a min-heap (one element per run in the heap). The same k-way merge powers log merging and merging sorted shards — bounded memory, sequential I/O.
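The chunk, sort, spill, and merge steps can be sketched in a few lines of Python. This is a toy under stated assumptions: the name `external_sort` and the `chunk_size` parameter are illustrative, the records are strings without embedded newlines, and runs go to temporary files. `heapq.merge` performs the k-way merge and keeps only one pending element per run in its internal heap.

```python
import heapq
import itertools
import os
import tempfile

def external_sort(lines, chunk_size):
    """Yield strings from `lines` in sorted order, holding ~chunk_size in memory."""
    run_paths = []
    it = iter(lines)
    try:
        # Phase 1: read chunks that fit, sort each, write a sorted run to disk.
        while True:
            chunk = list(itertools.islice(it, chunk_size))
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as f:
                f.writelines(s + "\n" for s in chunk)
            run_paths.append(path)
        # Phase 2: k-way merge of the runs, reading each file sequentially.
        files = [open(p) for p in run_paths]
        try:
            runs = [(line.rstrip("\n") for line in f) for f in files]
            yield from heapq.merge(*runs)
        finally:
            for f in files:
                f.close()
    finally:
        for p in run_paths:
            os.remove(p)

words = ["pear", "apple", "fig", "kiwi", "date"]
print(list(external_sort(words, chunk_size=2)))
```

With 100 GB of input and 8 GB of RAM, a real implementation would size chunks close to available memory and, if there are too many runs to open at once, merge in several passes.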

Back to the interview's R1

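The literal follow-up, "longest substring with at most K distinct, but streaming," is gentler than it sounds: the sliding window's bookkeeping is a map with at most K+1 entries, not the whole stream. One caveat: the window itself can grow as long as the stream, so to keep memory truly bounded, store each character's last-seen index in that map instead of buffering the window's contents; the left edge then jumps just past the oldest last-seen index whenever a (K+1)th distinct character arrives. The genuinely hard streaming versions are the aggregate ones, such as counting distinct elements over an unbounded stream (→ HyperLogLog) or finding the top-K frequent items (→ Count-Min + a heap), where exact state is unbounded and a sketch is the only way to fit memory.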
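One way to stream this problem without buffering any of the window is to track only the most recent index of each character currently in the window. The sketch below assumes that approach; the function name is illustrative, and the input can be any iterable of hashable items.

```python
def longest_at_most_k_distinct(stream, k):
    """Length of the longest run with at most k distinct items, in O(k) memory."""
    last_seen = {}   # item -> index of its most recent occurrence (size <= k + 1)
    left = 0         # index where the current window starts
    best = 0
    for i, ch in enumerate(stream):
        last_seen[ch] = i
        if len(last_seen) > k:
            # Drop the item whose latest occurrence is oldest; the window
            # must start just after that occurrence.
            oldest = min(last_seen, key=last_seen.get)
            left = last_seen.pop(oldest) + 1
        best = max(best, i - left + 1)
    return best

print(longest_at_most_k_distinct(iter("eceba"), 2))   # 3, from "ece"
```

Finding the oldest entry costs O(k) per eviction here; an ordered map would bring that down, at the price of a little more code.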

Think it through

The chapter's whole mindset is "what do I actually need to keep?" — and no algorithm embodies it like Boyer–Moore, which finds a majority element in a single counter. Reason it out before revealing.

Think it through: Majority Element (Boyer–Moore vote), Easy, LeetCode 169

Problem: One value appears more than n/2 times. Find it in O(1) extra memory and a single pass. Example: [2,2,1,1,1,2,2] → 2.

1. Restate & the easy solutions: “What works if memory is free, and why is it disallowed here?”
2. The cancellation insight: “What happens if you pair up each majority element with a different non-majority one and cancel both?”
3. The algorithm: “Turn 'cancel pairs' into a one-counter sweep.”
4. Code it: “One candidate, one counter, one pass.”
5. Why it works & cost: “Cost, and the one-line proof?”

Python

# Python — Boyer-Moore Majority Vote
def majority_element(nums):
    candidate, count = None, 0
    for x in nums:
        if count == 0:
            candidate = x
        count += 1 if x == candidate else -1
    return candidate

Java

// Java — Boyer-Moore Majority Vote
public int majorityElement(int[] nums) {
    int candidate = 0;
    int count = 0;
    for (int num : nums) {
        if (count == 0) {
            candidate = num;
        }
        count += (num == candidate) ? 1 : -1;
    }
    return candidate;
}

C++

// C++ — Boyer-Moore Majority Vote
#include <vector>

int majorityElement(const std::vector<int>& nums) {
    int candidate = 0;
    int count = 0;
    for (int num : nums) {
        if (count == 0) {
            candidate = num;
        }
        count += (num == candidate) ? 1 : -1;
    }
    return candidate;
}

Python

# Python — Reservoir Sampling
import random

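def reservoir_sample(stream, k):
    """Return k uniformly random items from an iterable of unknown length."""
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)          # keep the first k items
        else:
            j = random.randint(0, i)     # inclusive on both ends: 0..i
            if j < k:                    # happens with probability k/(i+1)
                reservoir[j] = x
    return reservoir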

Java

// Java — Reservoir Sampling
import java.util.Random;

public int[] reservoirSample(int[] stream, int k) {
    // Keep the first k elements; if the stream has fewer than k,
    // the sample is simply the whole stream.
    int[] reservoir = java.util.Arrays.copyOf(stream, Math.min(k, stream.length));

    Random rand = new Random();
    // For the remaining elements, keep with decreasing probability
    for (int i = k; i < stream.length; i++) {
        int j = rand.nextInt(i + 1); // random index from 0 to i
        if (j < k) {
            reservoir[j] = stream[i];
        }
    }
    return reservoir;
}

C++

// C++ — Reservoir Sampling
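#include <algorithm>
#include <random>
#include <vector>

std::vector<int> reservoirSample(const std::vector<int>& stream, int k) {
    const int n = static_cast<int>(stream.size());
    // Keep the first k elements; if the stream has fewer than k,
    // the sample is simply the whole stream.
    std::vector<int> reservoir(stream.begin(), stream.begin() + std::min(k, n));

    std::random_device rd;
    std::mt19937 gen(rd());

    for (int i = k; i < n; i++) {
        // j is uniform on [0, i], so it lands in [0, k) with probability k/(i+1).
        std::uniform_int_distribution<> dis(0, i);
        int j = dis(gen);
        if (j < k) {
            reservoir[j] = stream[i];
        }
    }
    return reservoir;
}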

Python

# Python — Find Median from Data Stream (Running Median)
import heapq

class MedianFinder:
    def __init__(self):
        self.small = []  # max-heap (stored as negative numbers)
        self.large = []  # min-heap

    def addNum(self, num):
        heapq.heappush(self.small, -num)
        heapq.heappush(self.large, -heapq.heappop(self.small))
        if len(self.small) < len(self.large):
            heapq.heappush(self.small, -heapq.heappop(self.large))

    def findMedian(self):
        if len(self.small) > len(self.large):
            return -self.small[0]
        return (-self.small[0] + self.large[0]) / 2.0

Java

// Java — Find Median from Data Stream (Running Median)
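import java.util.Collections;
import java.util.PriorityQueue;

class MedianFinder {
    // Collections.reverseOrder() avoids the overflow that (a, b) -> b - a hits on extreme values.
    private final PriorityQueue<Integer> small = new PriorityQueue<>(Collections.reverseOrder()); // max-heap, lower half
    private final PriorityQueue<Integer> large = new PriorityQueue<>(); // min-heap, upper half

    public void addNum(int num) {
        small.add(num);
        large.add(small.poll());          // move the largest of the lower half up
        if (small.size() < large.size()) {
            small.add(large.poll());      // rebalance so small holds the extra element
        }
    }

    public double findMedian() {
        if (small.size() > large.size()) {
            return small.peek();
        }
        // Widen to long before adding so two large ints cannot overflow.
        return ((long) small.peek() + large.peek()) / 2.0;
    }
}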

C++

// C++ — Find Median from Data Stream (Running Median)
#include <functional>
#include <queue>
#include <vector>

class MedianFinder {
private:
    std::priority_queue<int> small; // max-heap, lower half
    std::priority_queue<int, std::vector<int>, std::greater<int>> large; // min-heap, upper half

public:
    void addNum(int num) {
        small.push(num);
        large.push(small.top());       // move the largest of the lower half up
        small.pop();
        if (small.size() < large.size()) {
            small.push(large.top());   // rebalance so small holds the extra element
            large.pop();
        }
    }

    double findMedian() {
        if (small.size() > large.size()) {
            return small.top();
        }
        // Convert to double before adding so two large ints cannot overflow.
        return (static_cast<double>(small.top()) + large.top()) / 2.0;
    }
};

Check yourself

1. In Reservoir Sampling, how is uniform randomness guaranteed when we select k items from a stream of unknown size N?

2. What is the key trade-off when using a Bloom Filter?

3. For finding the running median of a stream of numbers, how do two heaps work together?

Practice — level up

Every problem here is "answer a question about a stream without storing it" — the exact muscle the follow-up tests.

Practice ladder: Streaming & bounded memory

Climb in order; every rung assumes the one above it. Solve each on LeetCode.

Warm-up — keep almost nothing

Moving Average from Data Stream (Easy)
A ring buffer of the last N: a bounded window over an unbounded stream.
Kth Largest Element in a Stream (Easy)
A size-K min-heap, the streaming top-K workhorse.

Core — heaps & windows on a stream

Find Median from Data Stream (Hard)
Two heaps balanced around the middle: running order statistics.
Top K Frequent Elements (Medium)
Frequency map + heap, the exact version of heavy-hitters.

Stretch — windowed counting at scale

Number of Recent Calls (Easy)
Count events in a sliding time window: bounded state over time.
Design Hit Counter (Medium)
Aggregate the last N minutes in fixed memory: the sketch mindset, in exact form.