Union-Find (Disjoint Set Union)
Type: Data Structure LeetCode tag: Union Find (~80 problems)
Core Concepts
Disjoint Set — a partition of a universe of elements into non-overlapping groups where each element belongs to exactly one group.
Component / Set — one group in the partition; all elements that are mutually connected or equivalent.
Representative / Root — the canonical element chosen to identify its component; any element in the set can be the representative, but all members agree on who it is.
Find — the operation that returns the representative of the component containing a given element; path compression makes repeated finds near O(1) amortized.
Union — the operation that merges two components into one by re-rooting one representative under the other; rank/size heuristics keep the tree flat.
Inverse Ackermann function α(n) — the amortized cost per operation with both path compression and union by rank; effectively constant (≤ 5) for all practical n.
Rank — an upper bound on the height of the tree rooted at a node; never decreases and only increases when two trees of equal rank are merged.
Size — an alternative to rank that tracks the exact number of nodes in each component; union by size is slightly simpler and equally effective in practice.
Weighted Union-Find — augments each node with a weight relative to its parent, enabling queries like "what is the ratio between two elements?" without enumerating the full path.
Structural Invariants
| Invariant | What it requires | What breaks if violated |
|---|---|---|
| Path compression | After find(x), every node on the path from x to root points directly to the root | Find degenerates to O(tree height); subsequent operations slower |
| Union by rank | The root of the smaller-rank tree is attached under the root of the larger-rank tree | Tree can become a chain; O(n) find in the worst case |
| Rank monotonicity | rank[root] only increases, never decreases | Rank no longer bounds tree height; union heuristic becomes incorrect |
| Unique root | Exactly one node in each component satisfies parent[x] == x | find loops infinitely or returns wrong representative |
| Relative weight consistency | weight[x] encodes the ratio/difference from x to parent[x]; composed along path equals x-to-root | Weighted queries return incorrect values |
Internal Representation
Array-based (standard): Two arrays parent[n] and rank[n] (or size[n]). parent[i] = i initially (each node is its own root). All operations are index arithmetic — no heap allocation per node. Space: O(n).
Dictionary-based: Replace arrays with parent = {} and rank = {} for problems where nodes are strings, tuples, or non-contiguous integers. find must call parent.setdefault(x, x) to lazily initialise. Slightly slower constant but handles arbitrary key types.
With component count: Maintain an integer count initialised to n; decrement by 1 in union only when the two elements were in different components. Enables O(1) query of the total number of components.
With extra payload: Add arrays for size[root], weight[node], or custom aggregates. Only the root's slot in size/weight is meaningful after path compression — reading a non-root's slot gives stale data.
Types / Variants
| Variant | What it adds | When to use |
|---|---|---|
| Basic (rank/size) | find + union + connected | Component counting, cycle detection, connectivity queries |
| Weighted / Potential | Per-node weight relative to parent; find accumulates product or sum | Ratio queries ("a/b = 2, b/c = 3, what is a/c?"), difference constraints |
| Virtual-node | Introduce a synthetic node connected to a group | Efficiently union an entire row/col, connect nodes matching a property without O(n²) unions |
| Bipartite-aware | Each node has an "opposite" node; union x with opposite(y) | Checking 2-colorability, odd-cycle detection (e.g., Equations Satisfiability) |
| Offline with rollback (Link-Cut / DSU with undo) | Stores union operations on a stack; supports undo | Dynamic connectivity offline, offline queries where edges are added and removed |
Topic-Specific Operations
find(x) — with path compression
Returns the root of x's component. Path compression flattens the tree so every visited node points directly to the root after the call. Time: O(α(n)) amortized.
def find(x):
if parent[x] != x: parent[x] = find(parent[x])
return parent[x]
Iterative two-pass alternative: walk to root, then walk again setting all parent[i] = root. Avoids recursion depth issues for large n.
union(x, y) — by rank
Merges the components of x and y. Attaches the root with smaller rank under the root with larger rank; increments rank only when ranks are equal. Returns True if x and y were in different components (useful for cycle detection and counting). Time: O(α(n)) amortized.
def union(x, y):
rx, ry = find(x), find(y)
if rx == ry: return False
if rank[rx] < rank[ry]: rx, ry = ry, rx
parent[ry] = rx
if rank[rx] == rank[ry]: rank[rx] += 1
return True
connected(x, y)
Returns find(x) == find(y). No structural change. O(α(n)).
component_count
Not a function call — maintain as a field: initialise to n, decrement in union when it returns True.
Weighted find
Accumulates the product (or sum) of weights along the path to the root while compressing.
def find(x):
if parent[x] != x:
parent[x], weight[x] = find(parent[x]), weight[x] * weight[parent[x]]
return parent[x]
Key Algorithms
Kruskal's Minimum Spanning Tree
Build an MST by greedily adding edges sorted by weight, skipping any edge that would form a cycle (i.e., both endpoints already in the same component). Union-Find provides the cycle check and merge in O(α(n)) per edge.
Time: O(E log E) dominated by sorting. Space: O(V).
The key insight: an edge creates a cycle iff its endpoints share a root before the union.
Connected Components Count
Initialise UF with n nodes. For each edge (u, v), call union(u, v). After all edges, component_count is the answer.
Time: O(E·α(n)). Space: O(V).
Cycle Detection in Undirected Graph
For each edge (u, v), if connected(u, v) is already true, a cycle exists; otherwise union(u, v). For directed graphs, use DFS with coloring instead — UF is not appropriate there.
Accounts Merge / String-keyed grouping
Map strings to integer indices (or use dict-based UF). Union any two accounts sharing an email. After all unions, group by find(index). The pattern: build a mapping layer on top of UF when raw keys are strings or compound objects.
Satisfiability of Equality Equations
Two-pass: first union all a == b pairs; then check all a != b pairs — if any pair shares a root, return False. Generalises to bipartite: union x with opposite(y) for a != b constraints, then check find(x) == find(opposite(x)).
Redundant Connection
Process edges one by one; the first edge where both endpoints are already connected is the redundant edge. Return it immediately.
Advanced Techniques
Virtual / Sentinel Nodes
Solves: Problems requiring "union all nodes in a row", "connect all nodes satisfying property P", or "connect boundary cells to a super-node."
Mechanism: Allocate extra node indices (e.g., index n and n+1 for two super-nodes). Union every qualifying node to the super-node in O(1) instead of unioning pairs, which would be O(k²).
When to prefer over alternatives: Use when multiple elements must be grouped to a single representative for efficient "is anything connected to boundary?" queries (e.g., Surrounded Regions). Do not use if BFS/DFS is simpler — virtual nodes add complexity only worth it when iterative grouping is needed.
Complexity: O(n·α(n)) vs. O(n²) pairwise unions.
Offline Processing with Sorted Edge Order
Solves: "Minimum cost to connect all nodes", "maximum bandwidth path", queries of the form "what is the earliest time all nodes are connected?"
Mechanism: Sort edges by weight, then run UF treating union events as checkpoints. Because edges are processed in order, the component state at any point reflects the minimum-weight subgraph that achieves that connectivity.
When to prefer: Use when the problem has a monotone cost structure and asks for a threshold (e.g., minimum bottleneck). Do not reach for this when the graph is dense and Prim's would be more cache-friendly or when edges arrive online.
Complexity: O(E log E + E·α(V)).
DSU on Tree (Small-to-Large Merging)
Solves: "For each node, how many distinct values in its subtree?", merging sets from children up to the root.
Mechanism: Each node maintains a data structure (set, counter). When merging children, always iterate over the smaller child's structure and insert into the larger. Each element is moved O(log n) times total.
When to prefer over alternatives: Use when a DFS-based approach would require re-processing subtree elements. Do not use when a single DFS pass with a global array suffices (e.g., simple subtree sums).
Complexity: O(n log² n) or O(n log n) depending on inner structure.
This technique is distinct from path-compressed UF. It uses the "union by size" intuition applied to arbitrary data structures on trees.
Problem Taxonomy
By Problem Category
| Category | What it asks | Key technique |
|---|---|---|
| Component counting | How many connected groups exist? | UF with component counter |
| Cycle detection | Does this graph contain a cycle? | Union returns False when endpoints already connected |
| MST / minimum cost connectivity | Minimum total edge weight to connect all nodes | Kruskal's: sort edges + UF cycle check |
| Dynamic connectivity (offline) | After adding edges, are x and y connected at query time? | Offline UF: process in sorted order |
| Grouping / merging | Merge entities sharing a common attribute | UF with string/dict keys or index mapping |
| Redundant edge removal | Which edge is extra and can be removed? | First edge where connected is already true |
| Bipartite / 2-coloring | Can nodes be split into two groups with no intra-group edges? | Bipartite UF with virtual opposite nodes |
| Ratio / difference constraints | Given pairwise ratios, find the ratio between two nodes | Weighted UF accumulating products |
Problem Signal → Technique
| Signal in problem statement | Likely technique |
|---|---|
| "number of connected components" | UF with component counter |
| "find if path exists between x and y" | UF connected query |
| "merge accounts / emails / groups" | UF with dict keys + grouping by root |
| "minimum cost to connect all nodes" / "minimum spanning tree" | Kruskal's with UF |
| "redundant connection" / "extra edge" | Union until cycle detected |
| "equations like a == b and a != b" | Two-pass equality UF |
| "a/b = k, find a/c" | Weighted UF with product accumulation |
| "surrounded regions" / "capture cells" / "connected to border" | Virtual boundary super-node |
| "online queries: union then find" | Standard UF; no sorting needed |
| "offline queries sorted by time/weight" | Sort + UF checkpoints |
Common Patterns
| Pattern | When it applies | Mechanism |
|---|---|---|
| Component counter field | Any problem asking "how many groups" at any point | Initialise to n; decrement on successful union |
| Index remapping | Keys are strings, coordinates, or non-contiguous integers | Map keys to [0, n) with a dict before initialising UF |
| Virtual super-node | All nodes matching a property must be in one component | Allocate extra index; union all matching nodes to it |
| Two-pass (union then query) | Equality constraints followed by inequality checks | First pass: all == unions; second pass: check != pairs |
| Build UF from grid | Grid cells as nodes; edges between adjacent cells | Flatten 2D index: node = r * cols + c |
| Kruskal's template | Minimum cost to connect | Sort edges by weight; union if not already connected |
| Rollback stack | Offline dynamic connectivity with edge deletions | Push parent/rank state before union; pop to undo |
Edge Cases & Pitfalls
-
Self-union: Calling
union(x, x)on a single node should be a no-op. Iffind(x) == find(y)is checked first, this is handled automatically. Forgetting this causes component count to decrement incorrectly. -
Index out of bounds: Problems that add virtual nodes require allocating
parentandrankarrays of sizen + k(k virtual nodes), not n. Off-by-one here causes silent corruption. -
Reading stale size/weight on non-root: After path compression,
size[x]andweight[x]are only valid whenxis a root. Readingsize[non-root]gives the pre-compression value. Always read aggregate data viasize[find(x)]. -
Directed graph cycle detection: UF detects cycles in undirected graphs. For directed graphs, using UF will produce false positives (a back edge in DFS is not always a true cycle in the UF sense). Use DFS with white/gray/black coloring instead.
-
Forgetting path compression in iterative find: The two-pointer iterative find (walk to root) does not compress unless a second pass is added. Without compression, repeated finds on deep trees are O(n).
-
Union by rank broken by size shortcuts: Mixing rank and size heuristics in the same implementation breaks the rank invariant. Choose one and use it consistently throughout.
-
Weighted UF: wrong accumulation order: When composing weights during path compression, the order matters.
weight[x] * weight[parent[x]]is not the same asweight[parent[x]] * weight[x]for non-commutative operations (e.g., matrix products). For ratios (commutative multiplication), order does not matter, but document the convention. -
Component count off by one: Initialising
count = nthen decrementing on every call tounion(including failed ones) undercounts. Only decrement whenfind(x) != find(y)before the union.
Implementation Notes
- Python's default recursion limit is 1000; recursive
findwith path compression can exceed this on star-shaped initial graphs with n > 1000. Use the iterative two-pass find or callsys.setrecursionlimitexplicitly. parentandrankare module-level or closure variables in competitive-style code; in class-based code, initialise them in__init__to avoid shared state across test cases.- Python
dict.setdefault(x, x)in dict-basedfindhandles lazy initialisation but adds overhead; pre-initialise all keys when the full set of nodes is known upfront. - When flattening a 2D grid, verify that
r * cols + cusescols(number of columns), notrows. Mixing the two produces valid indices that map to wrong cells silently. - For Kruskal's, sort edges by weight ascending for MST; sort descending for maximum spanning tree. The union logic is identical.
Cross-Topic Relations
| Related topic | Connection |
|---|---|
| Graphs | UF is a graph algorithm; every UF problem has an equivalent BFS/DFS solution — UF is preferred when only connectivity (not path structure) is needed |
| Minimum Spanning Tree | Kruskal's algorithm is the canonical MST algorithm built on UF; Prim's is the BFS/heap alternative |
| Depth-First Search | Cycle detection, connected components, bipartite checking — all solvable with DFS; UF is preferred for online union-query workloads |
| Topological Sort | For DAG cycle detection and ordering, use topological sort (Kahn's or DFS); UF handles undirected cycles only |
| Segment Tree / BIT | Both are advanced data structures over arrays; UF and Segment Tree are sometimes combined in offline dynamic connectivity |
| Hash Table | Dict-based UF uses a hash table as the parent store; string-keyed UF problems require index remapping via a hash table |
Interview Reference
Must-Know Problems
Component Counting / Basic Connectivity
- Number of Connected Components in an Undirected Graph (LC 323) — core UF pattern
- Graph Valid Tree (LC 261) — connected + acyclic check
- Find if Path Exists in Graph (LC 1971) — simplest UF use case
Cycle Detection / Redundant Edges
- Redundant Connection (LC 684) — first edge that closes a cycle
- Redundant Connection II (LC 685) — directed graph, requires case analysis
Grouping / Merging
- Accounts Merge (LC 721) — string-keyed UF + grouping by root
- Sentence Similarity II (LC 737) — transitive similarity via UF
Grid-Based Connectivity
- Number of Islands (LC 200) — grid UF or DFS; UF version counts components
- Surrounded Regions (LC 130) — virtual boundary super-node
- Number of Islands II (LC 305) — online island additions; classic incremental UF
Constraint Satisfaction
- Satisfiability of Equality Equations (LC 990) — two-pass UF
- Evaluate Division (LC 399) — weighted UF with product accumulation
MST / Weighted
- Min Cost to Connect All Points (LC 1584) — Kruskal's on complete graph
- Find Critical and Pseudo-Critical Edges in MST (LC 1489) — hard; requires Kruskal's + edge exclusion/inclusion
Additional LeetCode Coverage
- Longest Consecutive Sequence (LC 128)
- Making A Large Island (LC 827)
- Similar String Groups (LC 839)
- Most Stones Removed with Same Row or Column (LC 947)
- Number of Enclaves (LC 1020)
- Lexicographically Smallest Equivalent String (LC 1061)
- Smallest String With Swaps (LC 1202)
- Number of Closed Islands (LC 1254)
- Number of Operations to Make Network Connected (LC 1319)
- Reachable Nodes With Restrictions (LC 2368)
Clarification Questions to Ask
- Are nodes 0-indexed or 1-indexed? (Determines array size and initialisation)
- Can there be self-loops or parallel edges? (Self-loops are instant cycle detections)
- Is the graph directed or undirected? (UF only handles undirected correctly)
- Are queries online (interleaved with unions) or offline (all unions then all queries)?
- For weighted problems: is the weight a ratio, difference, or something else? (Determines accumulation operation)
- Can n be 0 or 1? (Edge cases for component count initialisation)
Common Interview Mistakes
- Using UF for directed cycle detection — it produces false positives; use DFS with coloring.
- Forgetting to initialise
parent[i] = ifor all nodes including virtual ones — uninitialized slots return 0 silently, causing incorrect merges. - Decrementing component count unconditionally in
unioninstead of only when the two nodes were previously disconnected. - Reading
size[x]for a non-root node after path compression — the value is stale; always querysize[find(x)]. - Solving a weighted UF problem without tracking weight during path compression — the relative weights become inconsistent after the first compression.
- Allocating
parentof size n when virtual nodes require size n+k.
Typical Follow-Up Escalations
- "Now nodes can also be disconnected (edge deletions)" → UF does not support deletion; use offline processing (reverse the timeline: process deletions as additions in reverse) or Link-Cut Trees for online deletions.
- "Now return the actual path between x and y, not just connectivity" → UF cannot reconstruct paths; switch to BFS/DFS and store parent pointers.
- "Now also track the size of the component containing x" → Add
size[root]array; update inunionwithsize[new_root] += size[old_root]; query viasize[find(x)]. - "Now answer minimum bottleneck between x and y" → Sort edges by weight and union incrementally; the answer is the weight of the edge that first connects x and y (offline Kruskal's).
- "n is up to 10^9 but edges are sparse (≤ 10^5)" → Switch to dict-based UF with lazy initialisation; array-based UF would allocate 10^9 entries.
Read Next
Start reading to get personalized recommendations
Explore Unread
Great job! You've read all available articles