Union-Find

Disjoint sets in ~10 lines — union by rank, path compression, and the 'are these connected?' questions it wins instantly.

union-finddsugraphsfundamentals

The two-operation structure

Union-Find (also Disjoint Set Union, DSU) tracks a collection of items partitioned into non-overlapping groups, and supports exactly two operations — both, after optimizations, in effectively O(1):

  • find(x) — which group is x in?
  • union(x, y) — merge x's and y's groups.

That's the entire API. Its question is "are these two things in the same group?"find(x) == find(y) — and an astonishing number of problems are secretly that question:

  • Are these two computers connected, given these network links?
  • Do these two people share a friend circle?
  • Does adding this road create a cycle (i.e., were its endpoints already connected)?
  • How many islands / clusters / connected components exist?

BFS/DFS can answer connectivity too — but they re-explore the world per query. Union-Find shines when connections arrive incrementally and you ask connectivity between additions: it maintains the answer as it goes.

How it works: forests of pointers

Each group is a tree (not the BST kind — no order, just "who's my parent"), and the root represents the group. Every item holds a parent pointer; roots point to themselves:

groups {0,1,2} and {3,4}:        parent array: [0, 0, 0, 3, 3]

      0          3               find(2): 2 → 0. root = 0
     / \         |               find(4): 4 → 3. root = 3
    1   2        4               same group? 0 ≠ 3 → no
                                 union(2,4): root 3 now points to root 0
Python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))   # everyone is their own root
        self.rank = [0] * n            # tree-height estimate, for balance

    def find(self, x):
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])  # ① path compression
        return self.parent[x]

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False               # already connected — useful signal!
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx            # ② union by rank: short tree under tall
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True
Java
// Java — same structure, iterative find
int find(int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];   // path halving — same effect
        x = parent[x];
    }
    return x;
}
C++
// C++
int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }

The two optimizations (and the magic complexity)

Naively, trees can degenerate into chains — find becomes O(n), the unbalanced-BST disease again. Two one-line cures:

  1. Path compression — while finding the root, repoint every visited node directly at it. The tree flattens itself with use.
  2. Union by rank/size — always hang the shorter tree under the taller one, so height grows only when equals merge.

Together: amortized O(α(n)) per operation, where α is the inverse Ackermann function — ≤ 4 for any n that fits in this universe. Say "effectively constant time" and you're correct at any interview level; naming α(n) is the flourish.

Watch it run

Every item carries one arrow to its "boss"; follow arrows up and you find the group's leader. Watch unions merge trees with a single re-pointed arrow — and watch path compression flatten a chain the first time it's walked, so it never needs walking again.

Union-Find — groups, leaders & path compressiontime ≈O(1) per opspace O(n)
01234567
parent
0
0
1
1
2
2
3
3
4
4
5
5
6
6
7
7

1/258 items, 8 groups: everyone starts as their own leader (each arrow points to itself).

The signature move: cycle detection

union returning False means "these were already connected" — so adding that edge would close a cycle. That single bit powers the famous applications:

Python
def has_cycle(n, edges):
    uf = UnionFind(n)
    for a, b in edges:
        if not uf.union(a, b):     # already connected → this edge closes a loop
            return True
    return False
  • Kruskal's minimum spanning tree (Level 3 greedy): sort edges by weight, take each edge unless it forms a cycle — Union-Find is that check. Cheapest network connecting all cities, in 15 lines total.
  • Counting components: start with n groups; every successful union decrements the count. "Number of islands II" (land appears cell by cell) is exactly this, and is brutal without DSU.
  • Accounts merge / friend circles / equation satisfiability — anything where equivalences trickle in and must be consolidated.

Production perspective

  • Network/infra tooling uses DSU for reachability under churn — clusters merging as links come up.
  • Image processing: connected-component labeling (which pixels form one blob) — DSU over the pixel grid.
  • Compilers & type checkers: unification of type variables — the "equations arrive incrementally" pattern precisely.
  • The deeper lesson it teaches: representative elements. Many systems pick one canonical member to stand for an equivalence class (leader election, canonical URLs, dedup keys) — DSU is that idea as a data structure.

Common mistakes

  • Skipping both optimizations — correct but O(n) per op; on 10⁵ unions, the difference between instant and timeout.
  • Comparing items instead of rootsparent[x] == parent[y] is wrong (parents may be mid-tree); always compare find(x) == find(y).
  • Union-ing the items, not the rootsparent[y] = x instead of parent[ry] = rx corrupts the forest invariant.
  • Forgetting it can't un-merge. DSU has no split — deletions/ disconnections break it (offline tricks exist; out of scope — but know the limitation before proposing DSU in design discussions).
  • Strings/objects as items — map them to indices first (a dict: name → id), or use a parent-dict directly.

Interview perspective

Practice

  1. Build: implement UnionFind from memory with both optimizations — it's ~15 lines. Test: 10 items, union pairs, verify groups; check that union of an already-joined pair returns False.
  2. Components: n computers and a list of cables — how many isolated clusters remain? (Count = n − successful unions.) Then: what's the minimum number of new cables to connect everything?
  3. Cycle: given a tree-plus-one-extra-edge (LeetCode "Redundant Connection"), find the edge to remove — pure DSU.
  4. Capstone: Kruskal's MST on a small weighted graph by hand — sort edges, take-unless-cycle — then code it. You've now combined sorting, greedy (Level 3), and DSU into a real algorithm.

That completes the Level 2 data structures. Heaps and graphs round out the structures; then the Level 3 patterns start composing them into solutions.

Practice — climb the ladder

The trigger: dynamic connectivity — "are these in the same group?" while edges keep arriving. If edges are fixed, plain DFS also works; say that.

Practice ladder: Union-Find0/6 solved

Climb in order — every rung assumes the one above it. Solve on LeetCode, then tick it here; progress is saved on this device.

Warm-up

the API doing its one job
  1. union(u,v) every edge, then one find(a)==find(b) — connectivity in five lines.

Core

components, cycles, and merging by key
  1. Count components — start with n, every successful union decrements.

  2. Cycle detection — the first edge whose endpoints are ALREADY connected.

  3. Union by shared email — non-obvious modeling, the real interview skill.

Stretch

union-find in disguise
  1. Union the == constraints, then check every != — constraints as connectivity.

  2. Rows and columns as nodes — answer = stones − components; modeling at full stretch.