Union-Find (Disjoint Set Union)

Type: Data Structure LeetCode tag: Union Find (~80 problems)

Core Concepts

Disjoint Set — a partition of a universe of elements into non-overlapping groups where each element belongs to exactly one group.

Component / Set — one group in the partition; all elements that are mutually connected or equivalent.

Representative / Root — the canonical element chosen to identify its component; any element in the set can be the representative, but all members agree on who it is.

Find — the operation that returns the representative of the component containing a given element; path compression makes repeated finds near O(1) amortized.

Union — the operation that merges two components into one by re-rooting one representative under the other; rank/size heuristics keep the tree flat.

Inverse Ackermann function α(n) — the amortized cost per operation with both path compression and union by rank; effectively constant (≤ 5) for all practical n.

Rank — an upper bound on the height of the tree rooted at a node; never decreases and only increases when two trees of equal rank are merged.

Size — an alternative to rank that tracks the exact number of nodes in each component; union by size is slightly simpler and equally effective in practice.

Weighted Union-Find — augments each node with a weight relative to its parent, enabling queries like "what is the ratio between two elements?" without enumerating the full path.

Structural Invariants

Invariant	What it requires	What breaks if violated
Path compression	After `find(x)`, every node on the path from x to root points directly to the root	Find degenerates to O(tree height); subsequent operations slower
Union by rank	The root of the smaller-rank tree is attached under the root of the larger-rank tree	Tree can become a chain; O(n) find in the worst case
Rank monotonicity	`rank[root]` only increases, never decreases	Rank no longer bounds tree height; union heuristic becomes incorrect
Unique root	Exactly one node in each component satisfies `parent[x] == x`	`find` loops infinitely or returns wrong representative
Relative weight consistency	`weight[x]` encodes the ratio/difference from x to `parent[x]`; composed along path equals x-to-root	Weighted queries return incorrect values

Internal Representation

Array-based (standard): Two arrays parent[n] and rank[n] (or size[n]). parent[i] = i initially (each node is its own root). All operations are index arithmetic — no heap allocation per node. Space: O(n).

Dictionary-based: Replace arrays with parent = {} and rank = {} for problems where nodes are strings, tuples, or non-contiguous integers. find must call parent.setdefault(x, x) to lazily initialise. Slightly slower constant but handles arbitrary key types.

With component count: Maintain an integer count initialised to n; decrement by 1 in union only when the two elements were in different components. Enables O(1) query of the total number of components.

With extra payload: Add arrays for size[root], weight[node], or custom aggregates. Only the root's slot in size/weight is meaningful after path compression — reading a non-root's slot gives stale data.

Types / Variants

Variant	What it adds	When to use
Basic (rank/size)	`find` + `union` + `connected`	Component counting, cycle detection, connectivity queries
Weighted / Potential	Per-node weight relative to parent; `find` accumulates product or sum	Ratio queries ("a/b = 2, b/c = 3, what is a/c?"), difference constraints
Virtual-node	Introduce a synthetic node connected to a group	Efficiently union an entire row/col, connect nodes matching a property without O(n²) unions
Bipartite-aware	Each node has an "opposite" node; union `x` with `opposite(y)`	Checking 2-colorability, odd-cycle detection (e.g., Equations Satisfiability)
Offline with rollback (Link-Cut / DSU with undo)	Stores union operations on a stack; supports undo	Dynamic connectivity offline, offline queries where edges are added and removed

Topic-Specific Operations

find(x) — with path compression

Returns the root of x's component. Path compression flattens the tree so every visited node points directly to the root after the call. Time: O(α(n)) amortized.

def find(x):
    if parent[x] != x: parent[x] = find(parent[x])
    return parent[x]

Iterative two-pass alternative: walk to root, then walk again setting all parent[i] = root. Avoids recursion depth issues for large n.

union(x, y) — by rank

Merges the components of x and y. Attaches the root with smaller rank under the root with larger rank; increments rank only when ranks are equal. Returns True if x and y were in different components (useful for cycle detection and counting). Time: O(α(n)) amortized.

def union(x, y):
    rx, ry = find(x), find(y)
    if rx == ry: return False
    if rank[rx] < rank[ry]: rx, ry = ry, rx
    parent[ry] = rx
    if rank[rx] == rank[ry]: rank[rx] += 1
    return True

connected(x, y)

Returns find(x) == find(y). No structural change. O(α(n)).

component_count

Not a function call — maintain as a field: initialise to n, decrement in union when it returns True.

Weighted find

Accumulates the product (or sum) of weights along the path to the root while compressing.

def find(x):
    if parent[x] != x:
        parent[x], weight[x] = find(parent[x]), weight[x] * weight[parent[x]]
    return parent[x]

Key Algorithms

Kruskal's Minimum Spanning Tree

Build an MST by greedily adding edges sorted by weight, skipping any edge that would form a cycle (i.e., both endpoints already in the same component). Union-Find provides the cycle check and merge in O(α(n)) per edge.

Time: O(E log E) dominated by sorting. Space: O(V).

The key insight: an edge creates a cycle iff its endpoints share a root before the union.

Connected Components Count

Initialise UF with n nodes. For each edge (u, v), call union(u, v). After all edges, component_count is the answer.

Time: O(E·α(n)). Space: O(V).

Cycle Detection in Undirected Graph

For each edge (u, v), if connected(u, v) is already true, a cycle exists; otherwise union(u, v). For directed graphs, use DFS with coloring instead — UF is not appropriate there.

Accounts Merge / String-keyed grouping

Map strings to integer indices (or use dict-based UF). Union any two accounts sharing an email. After all unions, group by find(index). The pattern: build a mapping layer on top of UF when raw keys are strings or compound objects.

Satisfiability of Equality Equations

Two-pass: first union all a == b pairs; then check all a != b pairs — if any pair shares a root, return False. Generalises to bipartite: union x with opposite(y) for a != b constraints, then check find(x) == find(opposite(x)).

Redundant Connection

Process edges one by one; the first edge where both endpoints are already connected is the redundant edge. Return it immediately.

Advanced Techniques

Virtual / Sentinel Nodes

Solves: Problems requiring "union all nodes in a row", "connect all nodes satisfying property P", or "connect boundary cells to a super-node."

Mechanism: Allocate extra node indices (e.g., index n and n+1 for two super-nodes). Union every qualifying node to the super-node in O(1) instead of unioning pairs, which would be O(k²).

When to prefer over alternatives: Use when multiple elements must be grouped to a single representative for efficient "is anything connected to boundary?" queries (e.g., Surrounded Regions). Do not use if BFS/DFS is simpler — virtual nodes add complexity only worth it when iterative grouping is needed.

Complexity: O(n·α(n)) vs. O(n²) pairwise unions.

Offline Processing with Sorted Edge Order

Solves: "Minimum cost to connect all nodes", "maximum bandwidth path", queries of the form "what is the earliest time all nodes are connected?"

Mechanism: Sort edges by weight, then run UF treating union events as checkpoints. Because edges are processed in order, the component state at any point reflects the minimum-weight subgraph that achieves that connectivity.

When to prefer: Use when the problem has a monotone cost structure and asks for a threshold (e.g., minimum bottleneck). Do not reach for this when the graph is dense and Prim's would be more cache-friendly or when edges arrive online.

Complexity: O(E log E + E·α(V)).

DSU on Tree (Small-to-Large Merging)

Solves: "For each node, how many distinct values in its subtree?", merging sets from children up to the root.

Mechanism: Each node maintains a data structure (set, counter). When merging children, always iterate over the smaller child's structure and insert into the larger. Each element is moved O(log n) times total.

When to prefer over alternatives: Use when a DFS-based approach would require re-processing subtree elements. Do not use when a single DFS pass with a global array suffices (e.g., simple subtree sums).

Complexity: O(n log² n) or O(n log n) depending on inner structure.

This technique is distinct from path-compressed UF. It uses the "union by size" intuition applied to arbitrary data structures on trees.

Problem Taxonomy

By Problem Category

Category	What it asks	Key technique
Component counting	How many connected groups exist?	UF with component counter
Cycle detection	Does this graph contain a cycle?	Union returns False when endpoints already connected
MST / minimum cost connectivity	Minimum total edge weight to connect all nodes	Kruskal's: sort edges + UF cycle check
Dynamic connectivity (offline)	After adding edges, are x and y connected at query time?	Offline UF: process in sorted order
Grouping / merging	Merge entities sharing a common attribute	UF with string/dict keys or index mapping
Redundant edge removal	Which edge is extra and can be removed?	First edge where `connected` is already true
Bipartite / 2-coloring	Can nodes be split into two groups with no intra-group edges?	Bipartite UF with virtual opposite nodes
Ratio / difference constraints	Given pairwise ratios, find the ratio between two nodes	Weighted UF accumulating products

Problem Signal → Technique

Signal in problem statement	Likely technique
"number of connected components"	UF with component counter
"find if path exists between x and y"	UF `connected` query
"merge accounts / emails / groups"	UF with dict keys + grouping by root
"minimum cost to connect all nodes" / "minimum spanning tree"	Kruskal's with UF
"redundant connection" / "extra edge"	Union until cycle detected
"equations like a == b and a != b"	Two-pass equality UF
"a/b = k, find a/c"	Weighted UF with product accumulation
"surrounded regions" / "capture cells" / "connected to border"	Virtual boundary super-node
"online queries: union then find"	Standard UF; no sorting needed
"offline queries sorted by time/weight"	Sort + UF checkpoints

Common Patterns

Pattern	When it applies	Mechanism
Component counter field	Any problem asking "how many groups" at any point	Initialise to n; decrement on successful union
Index remapping	Keys are strings, coordinates, or non-contiguous integers	Map keys to `[0, n)` with a dict before initialising UF
Virtual super-node	All nodes matching a property must be in one component	Allocate extra index; union all matching nodes to it
Two-pass (union then query)	Equality constraints followed by inequality checks	First pass: all `==` unions; second pass: check `!=` pairs
Build UF from grid	Grid cells as nodes; edges between adjacent cells	Flatten 2D index: `node = r * cols + c`
Kruskal's template	Minimum cost to connect	Sort edges by weight; union if not already connected
Rollback stack	Offline dynamic connectivity with edge deletions	Push parent/rank state before union; pop to undo

Edge Cases & Pitfalls

Self-union: Calling union(x, x) on a single node should be a no-op. If find(x) == find(y) is checked first, this is handled automatically. Forgetting this causes component count to decrement incorrectly.
Index out of bounds: Problems that add virtual nodes require allocating parent and rank arrays of size n + k (k virtual nodes), not n. Off-by-one here causes silent corruption.
Reading stale size/weight on non-root: After path compression, size[x] and weight[x] are only valid when x is a root. Reading size[non-root] gives the pre-compression value. Always read aggregate data via size[find(x)].
Directed graph cycle detection: UF detects cycles in undirected graphs. For directed graphs, using UF will produce false positives (a back edge in DFS is not always a true cycle in the UF sense). Use DFS with white/gray/black coloring instead.
Forgetting path compression in iterative find: The two-pointer iterative find (walk to root) does not compress unless a second pass is added. Without compression, repeated finds on deep trees are O(n).
Union by rank broken by size shortcuts: Mixing rank and size heuristics in the same implementation breaks the rank invariant. Choose one and use it consistently throughout.
Weighted UF: wrong accumulation order: When composing weights during path compression, the order matters. weight[x] * weight[parent[x]] is not the same as weight[parent[x]] * weight[x] for non-commutative operations (e.g., matrix products). For ratios (commutative multiplication), order does not matter, but document the convention.
Component count off by one: Initialising count = n then decrementing on every call to union (including failed ones) undercounts. Only decrement when find(x) != find(y) before the union.

Implementation Notes

Python's default recursion limit is 1000; recursive find with path compression can exceed this on star-shaped initial graphs with n > 1000. Use the iterative two-pass find or call sys.setrecursionlimit explicitly.
parent and rank are module-level or closure variables in competitive-style code; in class-based code, initialise them in __init__ to avoid shared state across test cases.
Python dict.setdefault(x, x) in dict-based find handles lazy initialisation but adds overhead; pre-initialise all keys when the full set of nodes is known upfront.
When flattening a 2D grid, verify that r * cols + c uses cols (number of columns), not rows. Mixing the two produces valid indices that map to wrong cells silently.
For Kruskal's, sort edges by weight ascending for MST; sort descending for maximum spanning tree. The union logic is identical.

Cross-Topic Relations

Related topic	Connection
Graphs	UF is a graph algorithm; every UF problem has an equivalent BFS/DFS solution — UF is preferred when only connectivity (not path structure) is needed
Minimum Spanning Tree	Kruskal's algorithm is the canonical MST algorithm built on UF; Prim's is the BFS/heap alternative
Depth-First Search	Cycle detection, connected components, bipartite checking — all solvable with DFS; UF is preferred for online union-query workloads
Topological Sort	For DAG cycle detection and ordering, use topological sort (Kahn's or DFS); UF handles undirected cycles only
Segment Tree / BIT	Both are advanced data structures over arrays; UF and Segment Tree are sometimes combined in offline dynamic connectivity
Hash Table	Dict-based UF uses a hash table as the parent store; string-keyed UF problems require index remapping via a hash table

Interview Reference

Must-Know Problems

Component Counting / Basic Connectivity

Number of Connected Components in an Undirected Graph (LC 323) — core UF pattern
Graph Valid Tree (LC 261) — connected + acyclic check
Find if Path Exists in Graph (LC 1971) — simplest UF use case

Cycle Detection / Redundant Edges

Redundant Connection (LC 684) — first edge that closes a cycle
Redundant Connection II (LC 685) — directed graph, requires case analysis

Grouping / Merging

Accounts Merge (LC 721) — string-keyed UF + grouping by root
Sentence Similarity II (LC 737) — transitive similarity via UF

Grid-Based Connectivity

Number of Islands (LC 200) — grid UF or DFS; UF version counts components
Surrounded Regions (LC 130) — virtual boundary super-node
Number of Islands II (LC 305) — online island additions; classic incremental UF

Constraint Satisfaction

Satisfiability of Equality Equations (LC 990) — two-pass UF
Evaluate Division (LC 399) — weighted UF with product accumulation

MST / Weighted

Min Cost to Connect All Points (LC 1584) — Kruskal's on complete graph
Find Critical and Pseudo-Critical Edges in MST (LC 1489) — hard; requires Kruskal's + edge exclusion/inclusion

Additional LeetCode Coverage

Longest Consecutive Sequence (LC 128)
Making A Large Island (LC 827)
Similar String Groups (LC 839)
Most Stones Removed with Same Row or Column (LC 947)
Number of Enclaves (LC 1020)
Lexicographically Smallest Equivalent String (LC 1061)
Smallest String With Swaps (LC 1202)
Number of Closed Islands (LC 1254)
Number of Operations to Make Network Connected (LC 1319)
Reachable Nodes With Restrictions (LC 2368)

Clarification Questions to Ask

Are nodes 0-indexed or 1-indexed? (Determines array size and initialisation)
Can there be self-loops or parallel edges? (Self-loops are instant cycle detections)
Is the graph directed or undirected? (UF only handles undirected correctly)
Are queries online (interleaved with unions) or offline (all unions then all queries)?
For weighted problems: is the weight a ratio, difference, or something else? (Determines accumulation operation)
Can n be 0 or 1? (Edge cases for component count initialisation)

Common Interview Mistakes

Using UF for directed cycle detection — it produces false positives; use DFS with coloring.
Forgetting to initialise parent[i] = i for all nodes including virtual ones — uninitialized slots return 0 silently, causing incorrect merges.
Decrementing component count unconditionally in union instead of only when the two nodes were previously disconnected.
Reading size[x] for a non-root node after path compression — the value is stale; always query size[find(x)].
Solving a weighted UF problem without tracking weight during path compression — the relative weights become inconsistent after the first compression.
Allocating parent of size n when virtual nodes require size n+k.

Typical Follow-Up Escalations

"Now nodes can also be disconnected (edge deletions)" → UF does not support deletion; use offline processing (reverse the timeline: process deletions as additions in reverse) or Link-Cut Trees for online deletions.
"Now return the actual path between x and y, not just connectivity" → UF cannot reconstruct paths; switch to BFS/DFS and store parent pointers.
"Now also track the size of the component containing x" → Add size[root] array; update in union with size[new_root] += size[old_root]; query via size[find(x)].
"Now answer minimum bottleneck between x and y" → Sort edges by weight and union incrementally; the answer is the weight of the edge that first connects x and y (offline Kruskal's).
"n is up to 10^9 but edges are sparse (≤ 10^5)" → Switch to dict-based UF with lazy initialisation; array-based UF would allocate 10^9 entries.

Explore Unread

Great job! You've read all available articles