# Hierarchy

Hierarchical clustering algorithms.

The attribute dendrogram_ gives the dendrogram.

A dendrogram is an array of size $$(n-1) \times 4$$ representing the successive merges of nodes. Each row gives the two merged nodes, their distance and the size of the resulting cluster. Any new node resulting from a merge takes the first available index (e.g., the first merge corresponds to node $$n$$).

## Paris

class sknetwork.hierarchy.Paris(weights: str = 'degree', reorder: bool = True)

Agglomerative clustering algorithm that performs greedy merge of nodes based on their similarity.

The similarity between nodes $$i,j$$ is $$\dfrac{A_{ij}}{w_i w_j}$$ where

• $$A_{ij}$$ is the weight of edge $$i,j$$,

• $$w_i, w_j$$ are the weights of nodes $$i,j$$

If the input matrix $$B$$ is a biadjacency matrix (i.e., rectangular), the algorithm is applied to the corresponding adjacency matrix $$A = \begin{bmatrix} 0 & B \\ B^T & 0 \end{bmatrix}$$

Parameters:
• weights (str) – Weights of nodes. 'degree' (default) or 'uniform'.

• reorder (bool) – If True (default), reorder the dendrogram in non-decreasing order of height.

Variables:
• dendrogram (np.ndarray) – Dendrogram of the graph.

• dendrogram_row (np.ndarray) – Dendrogram for the rows, for bipartite graphs.

• dendrogram_col (np.ndarray) – Dendrogram for the columns, for bipartite graphs.

• dendrogram_full (np.ndarray) – Dendrogram for both rows and columns, indexed in this order, for bipartite graphs.

Examples

>>> from sknetwork.hierarchy import Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> np.round(dendrogram, 2)
array([[3.        , 2.        , 0.17      , 2.        ],
[1.        , 0.        , 0.25      , 2.        ],
[6.        , 4.        , 0.31      , 3.        ],
[7.        , 5.        , 0.67      , 5.        ]])


Notes

Each row of the dendrogram = $$i, j$$, distance, size of cluster $$i + j$$.

scipy.cluster.hierarchy.linkage

References

T. Bonald, B. Charpentier, A. Galland, A. Hollocou (2018). Hierarchical Graph Clustering using Node Pair Sampling. Workshop on Mining and Learning with Graphs.

fit(input_matrix: csr_matrix | ndarray)

Agglomerative clustering using the nearest neighbor chain.

Parameters:

input_matrix (sparse.csr_matrix, np.ndarray) – Adjacency matrix or biadjacency matrix of the graph.

Returns:

self

Return type:

Paris

fit_predict(*args, **kwargs) ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns:

dendrogram – Dendrogram.

Return type:

np.ndarray

fit_transform(*args, **kwargs) ndarray

Fit algorithm to data and return the dendrogram. Alias for fit_predict. Same parameters as the fit method.

Returns:

dendrogram – Dendrogram.

Return type:

np.ndarray

get_params()

Get parameters as dictionary.

Returns:

params – Parameters of the algorithm.

Return type:

dict

predict(columns: bool = False) ndarray

Return the dendrogram predicted by the algorithm.

Parameters:

columns (bool) – If True, return the prediction for columns.

Returns:

dendrogram – Dendrogram.

Return type:

np.ndarray

set_params(params: dict) Algorithm

Set parameters of the algorithm.

Parameters:

params (dict) – Parameters of the algorithm.

Returns:

self

Return type:

Algorithm

transform() ndarray

Return the dendrogram predicted by the algorithm.

Returns:

dendrogram – Dendrogram.

Return type:

np.ndarray

## Louvain

class sknetwork.hierarchy.LouvainHierarchy(resolution: float = 1, tol_optimization: float = 0.001, tol_aggregation: float = 0.001, shuffle_nodes: bool = False, random_state: RandomState | int | None = None, verbose: bool = False)[source]

Hierarchical clustering by Louvain (bottom-up).

Each level corresponds to an aggregation step of the Louvain algorithm.

Parameters:
• resolution (float) – Resolution parameter.

• tol_optimization (float) – Minimum increase in the objective function to enter a new optimization pass.

• tol_aggregation (float) – Minimum increase in the objective function to enter a new aggregation pass.

• shuffle_nodes (bool) – If True, shuffle nodes before optimization.

• random_state (int) – Random number generator or random seed. If None, numpy.random is used.

• verbose (bool) – Verbose mode.

Variables:
• dendrogram (np.ndarray) – Dendrogram of the graph.

• dendrogram_row (np.ndarray) – Dendrogram for the rows, for bipartite graphs.

• dendrogram_col (np.ndarray) – Dendrogram for the columns, for bipartite graphs.

• dendrogram_full (np.ndarray) – Dendrogram for both rows and columns, indexed in this order, for bipartite graphs.

Example

>>> from sknetwork.hierarchy import LouvainHierarchy
>>> from sknetwork.data import house
>>> louvain = LouvainHierarchy()
array([[3., 2., 1., 2.],
[4., 1., 1., 2.],
[6., 0., 1., 3.],
[5., 7., 2., 5.]])


Notes

Each row of the dendrogram = merge nodes, distance, size of cluster.

scipy.cluster.hierarchy.dendrogram, sknetwork.clustering.Louvain

fit(input_matrix: csr_matrix | ndarray) [source]

Fit algorithm to data.

Parameters:

input_matrix (sparse.csr_matrix, np.ndarray) – Adjacency matrix or biadjacency matrix of the graph.

Returns:

self

Return type:

LouvainHierarchy

fit_predict(*args, **kwargs) ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns:

dendrogram – Dendrogram.

Return type:

np.ndarray

fit_transform(*args, **kwargs) ndarray

Fit algorithm to data and return the dendrogram. Alias for fit_predict. Same parameters as the fit method.

Returns:

dendrogram – Dendrogram.

Return type:

np.ndarray

get_params()

Get parameters as dictionary.

Returns:

params – Parameters of the algorithm.

Return type:

dict

predict(columns: bool = False) ndarray

Return the dendrogram predicted by the algorithm.

Parameters:

columns (bool) – If True, return the prediction for columns.

Returns:

dendrogram – Dendrogram.

Return type:

np.ndarray

set_params(params: dict) Algorithm

Set parameters of the algorithm.

Parameters:

params (dict) – Parameters of the algorithm.

Returns:

self

Return type:

Algorithm

transform() ndarray

Return the dendrogram predicted by the algorithm.

Returns:

dendrogram – Dendrogram.

Return type:

np.ndarray

class sknetwork.hierarchy.LouvainIteration(depth: int = 3, resolution: float = 1, tol_optimization: float = 0.001, tol_aggregation: float = 0.001, n_aggregations: int = -1, shuffle_nodes: bool = False, random_state: RandomState | int | None = None, verbose: bool = False)[source]

Hierarchical clustering by successive instances of Louvain (top-down).

Parameters:
• depth (int) – Depth of the tree. A negative value is interpreted as no limit (return a tree of maximum depth).

• resolution (float) – Resolution parameter.

• tol_optimization (float) – Minimum increase in the objective function to enter a new optimization pass.

• tol_aggregation (float) – Minimum increase in the objective function to enter a new aggregation pass.

• n_aggregations (int) – Maximum number of aggregations. A negative value is interpreted as no limit.

• shuffle_nodes (bool) – If True, shuffle nodes before optimization.

• random_state (int) – Random number generator or random seed. If None, numpy.random is used.

• verbose (bool) – Verbose mode.

Variables:
• dendrogram (np.ndarray) – Dendrogram of the graph.

• dendrogram_row (np.ndarray) – Dendrogram for the rows, for bipartite graphs.

• dendrogram_col (np.ndarray) – Dendrogram for the columns, for bipartite graphs.

• dendrogram_full (np.ndarray) – Dendrogram for both rows and columns, indexed in this order, for bipartite graphs.

Example

>>> from sknetwork.hierarchy import LouvainIteration
>>> from sknetwork.data import house
>>> louvain = LouvainIteration()
array([[3., 2., 1., 2.],
[4., 1., 1., 2.],
[6., 0., 1., 3.],
[5., 7., 2., 5.]])


Notes

Each row of the dendrogram = merge nodes, distance, size of cluster.

scipy.cluster.hierarchy.dendrogram, sknetwork.clustering.Louvain

fit(input_matrix: csr_matrix | ndarray) [source]

Fit algorithm to data.

Parameters:

input_matrix (sparse.csr_matrix, np.ndarray) – Adjacency matrix or biadjacency matrix of the graph.

Returns:

self

Return type:

LouvainIteration

fit_predict(*args, **kwargs) ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns:

dendrogram – Dendrogram.

Return type:

np.ndarray

fit_transform(*args, **kwargs) ndarray

Fit algorithm to data and return the dendrogram. Alias for fit_predict. Same parameters as the fit method.

Returns:

dendrogram – Dendrogram.

Return type:

np.ndarray

get_params()

Get parameters as dictionary.

Returns:

params – Parameters of the algorithm.

Return type:

dict

predict(columns: bool = False) ndarray

Return the dendrogram predicted by the algorithm.

Parameters:

columns (bool) – If True, return the prediction for columns.

Returns:

dendrogram – Dendrogram.

Return type:

np.ndarray

set_params(params: dict) Algorithm

Set parameters of the algorithm.

Parameters:

params (dict) – Parameters of the algorithm.

Returns:

self

Return type:

Algorithm

transform() ndarray

Return the dendrogram predicted by the algorithm.

Returns:

dendrogram – Dendrogram.

Return type:

np.ndarray

## Metrics

sknetwork.hierarchy.dasgupta_cost(adjacency: csr_matrix, dendrogram: ndarray, weights: str = 'uniform', normalized: bool = False) float[source]

Dasgupta’s cost of a hierarchy.

Expected size (weights = 'uniform') or expected volume (weights = 'degree') of the cluster induced by random edge sampling (closest ancestor of the two nodes in the hierarchy).

Parameters:

• dendrogram – Dendrogram.

• weights – Weights of nodes. 'degree' or 'uniform' (default).

• normalized – If True, normalized cost (between 0 and 1).

Returns:

cost – Cost.

Return type:

float

Example

>>> from sknetwork.hierarchy import dasgupta_score, Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> np.round(cost, 2)
3.33


References

Dasgupta, S. (2016). A cost function for similarity-based hierarchical clustering. Proceedings of ACM symposium on Theory of Computing.

sknetwork.hierarchy.dasgupta_score(adjacency: csr_matrix, dendrogram: ndarray, weights: str = 'uniform') float[source]

Dasgupta’s score of a hierarchy (quality metric, between 0 and 1).

Defined as 1 - normalized Dasgupta’s cost.

Parameters:

• dendrogram – Dendrogram.

• weights – Weights of nodes. 'degree' or 'uniform' (default).

Returns:

score – Score.

Return type:

float

Example

>>> from sknetwork.hierarchy import dasgupta_score, Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> np.round(score, 2)
0.33


References

Dasgupta, S. (2016). A cost function for similarity-based hierarchical clustering. Proceedings of ACM symposium on Theory of Computing.

sknetwork.hierarchy.tree_sampling_divergence(adjacency: csr_matrix, dendrogram: ndarray, weights: str = 'degree', normalized: bool = True) float[source]

Tree sampling divergence of a hierarchy (quality metric).

Parameters:

• dendrogram – Dendrogram.

• weights – Weights of nodes. 'degree' (default) or 'uniform'.

• normalized – If True, normalized score (between 0 and 1).

Returns:

score – Score.

Return type:

float

Example

>>> from sknetwork.hierarchy import tree_sampling_divergence, Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> np.round(score, 2)
0.05


References

Charpentier, B. & Bonald, T. (2019). Tree Sampling Divergence: An Information-Theoretic Metric for Hierarchical Graph Clustering. Proceedings of IJCAI.

## Cuts

sknetwork.hierarchy.cut_straight(dendrogram: ndarray, n_clusters: int | None = None, threshold: None | float = None, sort_clusters: bool = True, return_dendrogram: bool = False) ndarray | Tuple[ndarray, ndarray][source]

Cut a dendrogram and return the corresponding clustering.

Parameters:
• dendrogram (np.ndarray) – Dendrogram.

• n_clusters (int) – Number of clusters (optional). The number of clusters can be larger than n_clusters in case of equal heights in the dendrogram.

• threshold (float) – Threshold on height (optional). If both n_clusters and threshold are None, n_clusters is set to 2.

• sort_clusters (bool) – If True, sorts clusters in decreasing order of size.

• return_dendrogram (bool) – If True, returns the dendrogram formed by the clusters up to the root.

Returns:

• labels (np.ndarray) – Cluster of each node.

• dendrogram_aggregate (np.ndarray) – Dendrogram starting from clusters (leaves = clusters).

Example

>>> from sknetwork.hierarchy import cut_straight
>>> dendrogram = np.array([[0, 1, 0, 2], [2, 3, 1, 3]])
>>> cut_straight(dendrogram)
array([0, 0, 1])

sknetwork.hierarchy.cut_balanced(dendrogram: ndarray, max_cluster_size: int = 20, sort_clusters: bool = True, return_dendrogram: bool = False) ndarray | Tuple[ndarray, ndarray][source]

Cuts a dendrogram with a constraint on the cluster size and returns the corresponding clustering.

Parameters:
• dendrogram (np.ndarray) – Dendrogram

• max_cluster_size (int) – Maximum size of each cluster.

• sort_clusters (bool) – If True, sort labels in decreasing order of cluster size.

• return_dendrogram (bool) – If True, returns the dendrogram formed by the clusters up to the root.

Returns:

• labels (np.ndarray) – Label of each node.

• dendrogram_aggregate (np.ndarray) – Dendrogram starting from clusters (leaves = clusters).

Example

>>> from sknetwork.hierarchy import cut_balanced
>>> dendrogram = np.array([[0, 1, 0, 2], [2, 3, 1, 3]])
>>> cut_balanced(dendrogram, 2)
array([0, 0, 1])


## Dendrograms

sknetwork.hierarchy.aggregate_dendrogram(dendrogram: ndarray, n_clusters: int = 2, return_counts: bool = False) ndarray | Tuple[ndarray, ndarray][source]

Aggregate a dendrogram in order to get a certain number of leaves. The leaves in the output dendrogram correspond to subtrees in the input one.

Parameters:
• dendrogram (np.ndarray) – The input to aggregate.

• n_clusters (int) – Number of clusters (or leaves) to keep.

• return_counts (bool) – If True, returns an array of counts corresponding to the sizes of the merged subtrees. The sum of the counts is equal to the number of samples in the input dendrogram.

Returns:

• new_dendrogram (np.ndarray) – Aggregated dendrogram. The nodes are reindexed from 0.

• counts (np.ndarray) – Size of the subtrees corresponding to each leaf in new_dendrogram.

sknetwork.hierarchy.reorder_dendrogram(dendrogram: ndarray) ndarray[source]

Reorder the dendrogram in non-decreasing order of height.