Hierarchy

Hierarchical clustering algorithms.

The attribute dendrogram_ gives the dendrogram.

A dendrogram is an array of size \((n-1) \times 4\) representing the successive merges of nodes. Each row gives the two merged nodes, their distance and the size of the resulting cluster. Any new node resulting from a merge takes the first available index (e.g., the first merge corresponds to node \(n\)).

Paris

class sknetwork.hierarchy.Paris(weights: str = 'degree', reorder: bool = True)

Agglomerative clustering algorithm that performs greedy merge of nodes based on their similarity.

The similarity between nodes \(i,j\) is \(\dfrac{A_{ij}}{w_i w_j}\) where

\(A_{ij}\) is the weight of edge \(i,j\),
\(w_i, w_j\) are the weights of nodes \(i,j\)

If the input matrix \(B\) is a biadjacency matrix (i.e., rectangular), the algorithm is applied to the corresponding adjacency matrix \(A = \begin{bmatrix} 0 & B \\ B^T & 0 \end{bmatrix}\)

Parameters:

weights (str) – Weights of nodes. 'degree' (default) or 'uniform'.
reorder (bool) – If True (default), reorder the dendrogram in non-decreasing order of height.

Variables:

dendrogram (np.ndarray) – Dendrogram of the graph.
dendrogram_row (np.ndarray) – Dendrogram for the rows, for bipartite graphs.
dendrogram_col (np.ndarray) – Dendrogram for the columns, for bipartite graphs.
dendrogram_full (np.ndarray) – Dendrogram for both rows and columns, indexed in this order, for bipartite graphs.

Examples

>>> from sknetwork.hierarchy import Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> adjacency = house()
>>> dendrogram = paris.fit_predict(adjacency)
>>> np.round(dendrogram, 2)
array([[3.        , 2.        , 0.17      , 2.        ],
       [1.        , 0.        , 0.25      , 2.        ],
       [6.        , 4.        , 0.31      , 3.        ],
       [7.        , 5.        , 0.67      , 5.        ]])

Notes

Each row of the dendrogram = \(i, j\), distance, size of cluster \(i + j\).

Louvain

class sknetwork.hierarchy.LouvainHierarchy(resolution: float = 1, tol_optimization: float = 0.001, tol_aggregation: float = 0.001, shuffle_nodes: bool = False, random_state: RandomState | int | None = None, verbose: bool = False)[source]

Hierarchical clustering by Louvain (bottom-up).

Each level corresponds to an aggregation step of the Louvain algorithm.

Parameters:

resolution (float) – Resolution parameter.
tol_optimization (float) – Minimum increase in the objective function to enter a new optimization pass.
tol_aggregation (float) – Minimum increase in the objective function to enter a new aggregation pass.
shuffle_nodes (bool) – If True, shuffle nodes before optimization.
random_state (int) – Random number generator or random seed. If None, numpy.random is used.
verbose (bool) – Verbose mode.

Variables:

dendrogram (np.ndarray) – Dendrogram of the graph.
dendrogram_row (np.ndarray) – Dendrogram for the rows, for bipartite graphs.
dendrogram_col (np.ndarray) – Dendrogram for the columns, for bipartite graphs.
dendrogram_full (np.ndarray) – Dendrogram for both rows and columns, indexed in this order, for bipartite graphs.

Example

>>> from sknetwork.hierarchy import LouvainHierarchy
>>> from sknetwork.data import house
>>> louvain = LouvainHierarchy()
>>> adjacency = house()
>>> louvain.fit_predict(adjacency)
array([[3., 2., 1., 2.],
       [4., 1., 1., 2.],
       [6., 0., 1., 3.],
       [5., 7., 2., 5.]])

Notes

Each row of the dendrogram = merge nodes, distance, size of cluster.

See also

scipy.cluster.hierarchy.dendrogram, sknetwork.clustering.Louvain

fit(input_matrix: csr_matrix | ndarray) → LouvainHierarchy[source]

Fit algorithm to data.

Parameters:: input_matrix (sparse.csr_matrix, np.ndarray) – Adjacency matrix or biadjacency matrix of the graph.
Returns:: self
Return type:: LouvainHierarchy

fit_predict(*args, **kwargs) → ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns:: dendrogram – Dendrogram.
Return type:: np.ndarray

fit_transform(*args, **kwargs) → ndarray

Fit algorithm to data and return the dendrogram. Alias for fit_predict. Same parameters as the fit method.

Returns:: dendrogram – Dendrogram.
Return type:: np.ndarray

get_params()

Get parameters as dictionary.

Returns:: params – Parameters of the algorithm.
Return type:: dict

predict(columns: bool = False) → ndarray

Return the dendrogram predicted by the algorithm.

Parameters:: columns (bool) – If True, return the prediction for columns.
Returns:: dendrogram – Dendrogram.
Return type:: np.ndarray

set_params(params: dict) → Algorithm

Set parameters of the algorithm.

Parameters:: params (dict) – Parameters of the algorithm.
Returns:: self
Return type:: Algorithm

transform() → ndarray

Return the dendrogram predicted by the algorithm.

Returns:: dendrogram – Dendrogram.
Return type:: np.ndarray

class sknetwork.hierarchy.LouvainIteration(depth: int = 3, resolution: float = 1, tol_optimization: float = 0.001, tol_aggregation: float = 0.001, n_aggregations: int = -1, shuffle_nodes: bool = False, random_state: RandomState | int | None = None, verbose: bool = False)[source]

Hierarchical clustering by successive instances of Louvain (top-down).

Parameters:

depth (int) – Depth of the tree. A negative value is interpreted as no limit (return a tree of maximum depth).
resolution (float) – Resolution parameter.
tol_optimization (float) – Minimum increase in the objective function to enter a new optimization pass.
tol_aggregation (float) – Minimum increase in the objective function to enter a new aggregation pass.
n_aggregations (int) – Maximum number of aggregations. A negative value is interpreted as no limit.
shuffle_nodes (bool) – If True, shuffle nodes before optimization.
random_state (int) – Random number generator or random seed. If None, numpy.random is used.
verbose (bool) – Verbose mode.

Variables:

dendrogram (np.ndarray) – Dendrogram of the graph.
dendrogram_row (np.ndarray) – Dendrogram for the rows, for bipartite graphs.
dendrogram_col (np.ndarray) – Dendrogram for the columns, for bipartite graphs.
dendrogram_full (np.ndarray) – Dendrogram for both rows and columns, indexed in this order, for bipartite graphs.

Example

>>> from sknetwork.hierarchy import LouvainIteration
>>> from sknetwork.data import house
>>> louvain = LouvainIteration()
>>> adjacency = house()
>>> louvain.fit_predict(adjacency)
array([[3., 2., 1., 2.],
       [4., 1., 1., 2.],
       [6., 0., 1., 3.],
       [5., 7., 2., 5.]])

Notes

Each row of the dendrogram = merge nodes, distance, size of cluster.

Metrics

sknetwork.hierarchy.dasgupta_cost(adjacency: csr_matrix, dendrogram: ndarray, weights: str = 'uniform', normalized: bool = False) → float[source]

Dasgupta’s cost of a hierarchy.

Expected size (weights = 'uniform') or expected volume (weights = 'degree') of the cluster induced by random edge sampling (closest ancestor of the two nodes in the hierarchy).

Parameters:

adjacency – Adjacency matrix of the graph.
dendrogram – Dendrogram.
weights – Weights of nodes. 'degree' or 'uniform' (default).
normalized – If True, normalized cost (between 0 and 1).

Returns:

cost – Cost.

Return type:

float

Example

>>> from sknetwork.hierarchy import dasgupta_score, Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> adjacency = house()
>>> dendrogram = paris.fit_transform(adjacency)
>>> cost = dasgupta_cost(adjacency, dendrogram)
>>> np.round(cost, 2)
3.33

References

Dasgupta, S. (2016). A cost function for similarity-based hierarchical clustering. Proceedings of ACM symposium on Theory of Computing.

sknetwork.hierarchy.dasgupta_score(adjacency: csr_matrix, dendrogram: ndarray, weights: str = 'uniform') → float[source]

Dasgupta’s score of a hierarchy (quality metric, between 0 and 1).

Defined as 1 - normalized Dasgupta’s cost.

Parameters:

adjacency – Adjacency matrix of the graph.
dendrogram – Dendrogram.
weights – Weights of nodes. 'degree' or 'uniform' (default).

Returns:

score – Score.

Return type:

float

Example

>>> from sknetwork.hierarchy import dasgupta_score, Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> adjacency = house()
>>> dendrogram = paris.fit_transform(adjacency)
>>> score = dasgupta_score(adjacency, dendrogram)
>>> np.round(score, 2)
0.33

References

Dasgupta, S. (2016). A cost function for similarity-based hierarchical clustering. Proceedings of ACM symposium on Theory of Computing.

sknetwork.hierarchy.tree_sampling_divergence(adjacency: csr_matrix, dendrogram: ndarray, weights: str = 'degree', normalized: bool = True) → float[source]

Tree sampling divergence of a hierarchy (quality metric).

Parameters:

adjacency – Adjacency matrix of the graph.
dendrogram – Dendrogram.
weights – Weights of nodes. 'degree' (default) or 'uniform'.
normalized – If True, normalized score (between 0 and 1).

Returns:

score – Score.

Return type:

float

Example

>>> from sknetwork.hierarchy import tree_sampling_divergence, Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> adjacency = house()
>>> dendrogram = paris.fit_transform(adjacency)
>>> score = tree_sampling_divergence(adjacency, dendrogram)
>>> np.round(score, 2)
0.05

References

Charpentier, B. & Bonald, T. (2019). Tree Sampling Divergence: An Information-Theoretic Metric for Hierarchical Graph Clustering. Proceedings of IJCAI.

Cuts

sknetwork.hierarchy.cut_straight(dendrogram: ndarray, n_clusters: int | None = None, threshold: None | float = None, sort_clusters: bool = True, return_dendrogram: bool = False) → ndarray | Tuple[ndarray, ndarray][source]

Cut a dendrogram and return the corresponding clustering.

Parameters:

dendrogram (np.ndarray) – Dendrogram.
n_clusters (int) – Number of clusters (optional). The number of clusters can be larger than n_clusters in case of equal heights in the dendrogram.
threshold (float) – Threshold on height (optional). If both n_clusters and threshold are None, n_clusters is set to 2.
sort_clusters (bool) – If True, sorts clusters in decreasing order of size.
return_dendrogram (bool) – If True, returns the dendrogram formed by the clusters up to the root.

Returns:

labels (np.ndarray) – Cluster of each node.
dendrogram_aggregate (np.ndarray) – Dendrogram starting from clusters (leaves = clusters).

Example

>>> from sknetwork.hierarchy import cut_straight
>>> dendrogram = np.array([[0, 1, 0, 2], [2, 3, 1, 3]])
>>> cut_straight(dendrogram)
array([0, 0, 1])

sknetwork.hierarchy.cut_balanced(dendrogram: ndarray, max_cluster_size: int = 20, sort_clusters: bool = True, return_dendrogram: bool = False) → ndarray | Tuple[ndarray, ndarray][source]

Cuts a dendrogram with a constraint on the cluster size and returns the corresponding clustering.

Parameters:

dendrogram (np.ndarray) – Dendrogram
max_cluster_size (int) – Maximum size of each cluster.
sort_clusters (bool) – If True, sort labels in decreasing order of cluster size.
return_dendrogram (bool) – If True, returns the dendrogram formed by the clusters up to the root.

Returns:

labels (np.ndarray) – Label of each node.
dendrogram_aggregate (np.ndarray) – Dendrogram starting from clusters (leaves = clusters).

Example

>>> from sknetwork.hierarchy import cut_balanced
>>> dendrogram = np.array([[0, 1, 0, 2], [2, 3, 1, 3]])
>>> cut_balanced(dendrogram, 2)
array([0, 0, 1])

Dendrograms

sknetwork.hierarchy.aggregate_dendrogram(dendrogram: ndarray, n_clusters: int = 2, return_counts: bool = False) → ndarray | Tuple[ndarray, ndarray][source]

Aggregate a dendrogram in order to get a certain number of leaves. The leaves in the output dendrogram correspond to subtrees in the input one.

Parameters:

dendrogram (np.ndarray) – The input to aggregate.
n_clusters (int) – Number of clusters (or leaves) to keep.
return_counts (bool) – If True, returns an array of counts corresponding to the sizes of the merged subtrees. The sum of the counts is equal to the number of samples in the input dendrogram.

Returns:

new_dendrogram (np.ndarray) – Aggregated dendrogram. The nodes are reindexed from 0.
counts (np.ndarray) – Size of the subtrees corresponding to each leaf in new_dendrogram.

sknetwork.hierarchy.reorder_dendrogram(dendrogram: ndarray) → ndarray[source]: Reorder the dendrogram in non-decreasing order of height.