Union-Find · PrepDeck

Disjoint sets in ~10 lines — union by rank, path compression, and the 'are these connected?' questions it wins instantly.

Why We Need Union-Find & How to Think About It

Imagine you are representing relationships in a social network or tracking connected computer networks. You want to ask: "Are person A and person B connected in the same group?" and "Can we merge these two groups?" Using standard BFS/DFS, answering "are they connected?" requires traversing the graph, which costs $O(V+E)$ time.

The Union-Find (Disjoint Set Union, DSU) data structure solves this by grouping nodes under a hierarchical tree structure. Each group has a single "representative leader". Answering "are they connected?" becomes checking if find(A) == find(B), which runs in near $O(1)$ time!

When to Use Union-Find

Connectivity Checks: Finding connected components, merging groups dynamically.
- LeetCode Link: LeetCode 547: Number of Provinces
Undirected Cycle Detection: Checking if adding an edge connects two nodes that are already in the same group.
- LeetCode Link: LeetCode 684: Redundant Connection
Kruskal's MST Algorithm: Sorting edges and checking if adding them would close a cycle.

The two-operation structure

Union-Find (also Disjoint Set Union, DSU) tracks a collection of items partitioned into non-overlapping groups, and supports exactly two operations — both, after optimizations, in effectively O(1):

find(x) — which group is x in?
union(x, y) — merge x's and y's groups.

That's the entire API. Its question is "are these two things in the same group?" — find(x) == find(y) — and an astonishing number of problems are secretly that question:

Are these two computers connected, given these network links?
Do these two people share a friend circle?
Does adding this road create a cycle (i.e., were its endpoints already connected)?
How many islands / clusters / connected components exist?

BFS/DFS can answer connectivity too — but they re-explore the world per query. Union-Find shines when connections arrive incrementally and you ask connectivity between additions: it maintains the answer as it goes.

How it works: forests of pointers

Each group is a tree (not the BST kind — no order, just "who's my parent"), and the root represents the group. Every item holds a parent pointer; roots point to themselves:

groups {0,1,2} and {3,4}:        parent array: [0, 0, 0, 3, 3]

      0          3               find(2): 2 → 0. root = 0
     / \         |               find(4): 4 → 3. root = 3
    1   2        4               same group? 0 ≠ 3 → no
                                 union(2,4): root 3 now points to root 0

Python

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))   # everyone is their own root
        self.rank = [0] * n            # tree-height estimate, for balance

    def find(self, x):
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])  # ① path compression
        return self.parent[x]

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False               # already connected — useful signal!
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx            # ② union by rank: short tree under tall
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True

Java

// Java — same structure, iterative find
int find(int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];   // path halving — same effect
        x = parent[x];
    }
    return x;
}

C++

// C++
int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }

The two optimizations (and the magic complexity)

Naively, trees can degenerate into chains — find becomes O(n), the unbalanced-BST disease again. Two one-line cures:

Path compression — while finding the root, repoint every visited node directly at it. The tree flattens itself with use.
Union by rank/size — always hang the shorter tree under the taller one, so height grows only when equals merge.

Together: amortized O(α(n)) per operation, where α is the inverse Ackermann function — ≤ 4 for any n that fits in this universe. Say "effectively constant time" and you're correct at any interview level; naming α(n) is the flourish.

Watch it run

Every item carries one arrow to its "boss"; follow arrows up and you find the group's leader. Watch unions merge trees with a single re-pointed arrow — and watch path compression flatten a chain the first time it's walked, so it never needs walking again.

Union-Find — groups, leaders & path compressiontime ≈O(1) per opspace O(n)

parent

1/258 items, 8 groups: everyone starts as their own leader (each arrow points to itself).

operations (union a b / find a)

The signature move: cycle detection

union returning False means "these were already connected" — so adding that edge would close a cycle. That single bit powers the famous applications:

Python

def has_cycle(n, edges):
    uf = UnionFind(n)
    for a, b in edges:
        if not uf.union(a, b):     # already connected → this edge closes a loop
            return True
    return False

Kruskal's minimum spanning tree (Level 3 greedy): sort edges by weight, take each edge unless it forms a cycle — Union-Find is that check. Cheapest network connecting all cities, in 15 lines total.
Counting components: start with n groups; every successful union decrements the count. "Number of islands II" (land appears cell by cell) is exactly this, and is brutal without DSU.
Accounts merge / friend circles / equation satisfiability — anything where equivalences trickle in and must be consolidated.

Production perspective

Network/infra tooling uses DSU for reachability under churn — clusters merging as links come up.
Image processing: connected-component labeling (which pixels form one blob) — DSU over the pixel grid.
Compilers & type checkers: unification of type variables — the "equations arrive incrementally" pattern precisely.
The deeper lesson it teaches: representative elements. Many systems pick one canonical member to stand for an equivalence class (leader election, canonical URLs, dedup keys) — DSU is that idea as a data structure.

Common mistakes

Skipping both optimizations — correct but O(n) per op; on 10⁵ unions, the difference between instant and timeout.
Comparing items instead of roots — parent[x] == parent[y] is wrong (parents may be mid-tree); always compare find(x) == find(y).
Union-ing the items, not the roots — parent[y] = x instead of parent[ry] = rx corrupts the forest invariant.
Forgetting it can't un-merge. DSU has no split — deletions/ disconnections break it (offline tricks exist; out of scope — but know the limitation before proposing DSU in design discussions).
Strings/objects as items — map them to indices first (a dict: name → id), or use a parent-dict directly.

Think it through

You've seen the cycle-detection one-liner. Now derive the other DSU superpower — counting groups — and meet the honest case where plain DFS is just as good. Think before revealing.

Think it through: Number of ProvincesMedium — LeetCode 5470/5 stages

PROBLEMisConnected is an n×n matrix; isConnected[i][j] = 1 means city i and city j are directly connected. A province is a set of cities connected directly or indirectly. Count the provinces. [[1,1,0],[1,1,0],[0,0,1]] → 2.

1
Restate & edges
“Strip the matrix wording — what graph question is this really?”
2
Consider both tools
“What's the non-DSU solution, and is it actually worse here?”
unlocks after the stage above
3
The DSU pattern
“Start with n separate groups. What happens to the count on each merge?”
unlocks after the stage above
4
Code the template
“Why does uf.union returning a boolean make the count update trivial?”
unlocks after the stage above
5
Cost & edge check
“Cost, and when would DSU actually beat the DFS version?”
unlocks after the stage above

Check yourself

Check yourself0/4 answered

1. To check whether two items are in the same group, you compare:

2. union(x, y) returns False. What does that tell you?

3. Path compression plus union by rank give what per-operation complexity?

4. You have a FIXED graph given upfront and must return the actual PATH between two nodes. Union-Find or BFS/DFS?

Interview perspective

Practice

Build: implement UnionFind from memory with both optimizations — it's ~15 lines. Test: 10 items, union pairs, verify groups; check that union of an already-joined pair returns False.
Components: n computers and a list of cables — how many isolated clusters remain? (Count = n − successful unions.) Then: what's the minimum number of new cables to connect everything?
Cycle: given a tree-plus-one-extra-edge (LeetCode "Redundant Connection"), find the edge to remove — pure DSU.
Capstone: Kruskal's MST on a small weighted graph by hand — sort edges, take-unless-cycle — then code it. You've now combined sorting, greedy (Level 3), and DSU into a real algorithm.

That completes the Level 2 data structures. Heaps and graphs round out the structures; then the Level 3 patterns start composing them into solutions.

Practice — climb the ladder

The trigger: dynamic connectivity — "are these in the same group?" while edges keep arriving. If edges are fixed, plain DFS also works; say that.

Practice ladder: Union-Find0/6 solved

Climb in order — every rung assumes the one above it. Solve on LeetCode, then tick it here; progress is saved on this device.

Warm-up

the API doing its one job

Find if Path Exists in GraphEasy
union(u,v) every edge, then one find(a)==find(b) — connectivity in five lines.

Core

components, cycles, and merging by key

Number of ProvincesMedium
Count components — start with n, every successful union decrements.
Redundant ConnectionMedium
Cycle detection — the first edge whose endpoints are ALREADY connected.
Accounts MergeMedium
Union by shared email — non-obvious modeling, the real interview skill.

Stretch

union-find in disguise

Satisfiability of Equality EquationsMedium
Union the == constraints, then check every != — constraints as connectivity.
Most Stones Removed with Same Row or ColumnMedium
Rows and columns as nodes — answer = stones − components; modeling at full stretch.