# Hierarchy

Hierarchical clustering algorithms.

The attribute dendrogram_ gives the dendrogram.

A dendrogram is an array of size $$(n-1) \times 4$$ representing the successive merges of nodes. Each row gives the two merged nodes, their distance and the size of the resulting cluster. Any new node resulting from a merge takes the first available index (e.g., the first merge corresponds to node $$n$$).

## Paris

class sknetwork.hierarchy.Paris(weights: unicode = 'degree', reorder: bool = True)

Agglomerative clustering algorithm that performs greedy merge of nodes based on their similarity.

The similarity between nodes $$i,j$$ is $$\dfrac{A_{ij}}{w_i w_j}$$ where

• $$A_{ij}$$ is the weight of edge $$i,j$$,

• $$w_i, w_j$$ are the weights of nodes $$i,j$$

If the input matrix $$B$$ is a biadjacency matrix (i.e., rectangular), the algorithm is applied to the corresponding adjacency matrix $$A = \begin{bmatrix} 0 & B \\ B^T & 0 \end{bmatrix}$$

Parameters
• weights – Weights of nodes. 'degree' (default) or 'uniform'.

• reorder – If True (default), reorder the dendrogram in non-decreasing order of height.

Variables
• dendrogram_ – Dendrogram of the graph.

• dendrogram_row_ – Dendrogram for the rows, for bipartite graphs.

• dendrogram_col_ – Dendrogram for the columns, for bipartite graphs.

• dendrogram_full_ – Dendrogram for both rows and columns, indexed in this order, for bipartite graphs.

Examples

>>> from sknetwork.hierarchy import Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> np.round(dendrogram, 2)
array([[3.        , 2.        , 0.17      , 2.        ],
[1.        , 0.        , 0.25      , 2.        ],
[6.        , 4.        , 0.31      , 3.        ],
[7.        , 5.        , 0.67      , 5.        ]])


Notes

Each row of the dendrogram = $$i, j$$, distance, size of cluster $$i + j$$.

scipy.cluster.hierarchy.linkage

References

T. Bonald, B. Charpentier, A. Galland, A. Hollocou (2018). Hierarchical Graph Clustering using Node Pair Sampling. Workshop on Mining and Learning with Graphs.

fit(input_matrix: Union[scipy.sparse._csr.csr_matrix, numpy.ndarray])

Agglomerative clustering using the nearest neighbor chain.

Parameters

Returns

self

Return type

Paris

fit_predict(*args, **kwargs) numpy.ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

fit_transform(*args, **kwargs) numpy.ndarray

Fit algorithm to data and return the dendrogram. Alias for fit_predict. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

get_params()

Get parameters as dictionary.

Returns

params – Parameters of the algorithm.

Return type

dict

set_params(params: dict) sknetwork.base.Algorithm

Set parameters of the algorithm.

Parameters

params (dict) – Parameters of the algorithm.

Returns

self

Return type

Algorithm

## Louvain

class sknetwork.hierarchy.LouvainHierarchy(resolution: float = 1, tol_optimization: float = 0.001, tol_aggregation: float = 0.001, shuffle_nodes: bool = False, random_state: Optional[Union[numpy.random.mtrand.RandomState, int]] = None, verbose: bool = False)[source]

Hierarchical clustering by Louvain (bottom-up).

Parameters
• resolution – Resolution parameter.

• tol_optimization – Minimum increase in the objective function to enter a new optimization pass.

• tol_aggregation – Minimum increase in the objective function to enter a new aggregation pass.

• shuffle_nodes – Enables node shuffling before optimization.

• random_state – Random number generator or random seed. If None, numpy.random is used.

• verbose – Verbose mode.

Variables
• dendrogram_ – Dendrogram of the graph.

• dendrogram_row_ – Dendrogram for the rows, for bipartite graphs.

• dendrogram_col_ – Dendrogram for the columns, for bipartite graphs.

• dendrogram_full_ – Dendrogram for both rows and columns, indexed in this order, for bipartite graphs.

Example

>>> from sknetwork.hierarchy import LouvainHierarchy
>>> from sknetwork.data import house
>>> louvain = LouvainHierarchy()
array([[3., 2., 0., 2.],
[4., 1., 0., 2.],
[6., 0., 0., 3.],
[5., 7., 1., 5.]])


Notes

Each row of the dendrogram = merge nodes, distance, size of cluster.

scipy.cluster.hierarchy.dendrogram

fit(input_matrix: Union[scipy.sparse._csr.csr_matrix, numpy.ndarray]) [source]

Fit algorithm to data.

Parameters

Returns

self

Return type

LouvainIteration

fit_predict(*args, **kwargs) numpy.ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

fit_transform(*args, **kwargs) numpy.ndarray

Fit algorithm to data and return the dendrogram. Alias for fit_predict. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

get_params()

Get parameters as dictionary.

Returns

params – Parameters of the algorithm.

Return type

dict

set_params(params: dict) sknetwork.base.Algorithm

Set parameters of the algorithm.

Parameters

params (dict) – Parameters of the algorithm.

Returns

self

Return type

Algorithm

class sknetwork.hierarchy.LouvainIteration(depth: int = 3, resolution: float = 1, tol_optimization: float = 0.001, tol_aggregation: float = 0.001, n_aggregations: int = - 1, shuffle_nodes: bool = False, random_state: Optional[Union[numpy.random.mtrand.RandomState, int]] = None, verbose: bool = False)[source]

Hierarchical clustering by successive instances of Louvain (top-down).

Parameters
• depth – Depth of the tree. A negative value is interpreted as no limit (return a tree of maximum depth).

• resolution – Resolution parameter.

• tol_optimization – Minimum increase in the objective function to enter a new optimization pass.

• tol_aggregation – Minimum increase in the objective function to enter a new aggregation pass.

• n_aggregations – Maximum number of aggregations. A negative value is interpreted as no limit.

• shuffle_nodes – Enables node shuffling before optimization.

• random_state – Random number generator or random seed. If None, numpy.random is used.

• verbose – Verbose mode.

Variables
• dendrogram_ – Dendrogram of the graph.

• dendrogram_row_ – Dendrogram for the rows, for bipartite graphs.

• dendrogram_col_ – Dendrogram for the columns, for bipartite graphs.

• dendrogram_full_ – Dendrogram for both rows and columns, indexed in this order, for bipartite graphs.

Example

>>> from sknetwork.hierarchy import LouvainIteration
>>> from sknetwork.data import house
>>> louvain = LouvainIteration()
array([[3., 2., 0., 2.],
[4., 1., 0., 2.],
[6., 0., 0., 3.],
[5., 7., 1., 5.]])


Notes

Each row of the dendrogram = merge nodes, distance, size of cluster.

scipy.cluster.hierarchy.dendrogram

fit(input_matrix: Union[scipy.sparse._csr.csr_matrix, numpy.ndarray]) [source]

Fit algorithm to data.

Parameters

Returns

self

Return type

LouvainIteration

fit_predict(*args, **kwargs) numpy.ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

fit_transform(*args, **kwargs) numpy.ndarray

Fit algorithm to data and return the dendrogram. Alias for fit_predict. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

get_params()

Get parameters as dictionary.

Returns

params – Parameters of the algorithm.

Return type

dict

set_params(params: dict) sknetwork.base.Algorithm

Set parameters of the algorithm.

Parameters

params (dict) – Parameters of the algorithm.

Returns

self

Return type

Algorithm

## Metrics

sknetwork.hierarchy.dasgupta_cost(adjacency: scipy.sparse._csr.csr_matrix, dendrogram: numpy.ndarray, weights: str = 'uniform', normalized: bool = False) float[source]

Dasgupta’s cost of a hierarchy.

Expected size (weights = 'uniform') or expected volume (weights = 'degree') of the cluster induced by random edge sampling (closest ancestor of the two nodes in the hierarchy).

Parameters

• dendrogram – Dendrogram.

• weights – Weights of nodes. 'degree' or 'uniform' (default).

• normalized – If True, normalized cost (between 0 and 1).

Returns

cost – Cost.

Return type

float

Example

>>> from sknetwork.hierarchy import dasgupta_score, Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> np.round(cost, 2)
3.33


References

Dasgupta, S. (2016). A cost function for similarity-based hierarchical clustering. Proceedings of ACM symposium on Theory of Computing.

sknetwork.hierarchy.dasgupta_score(adjacency: scipy.sparse._csr.csr_matrix, dendrogram: numpy.ndarray, weights: str = 'uniform') float[source]

Dasgupta’s score of a hierarchy (quality metric, between 0 and 1).

Defined as 1 - normalized Dasgupta’s cost.

Parameters

• dendrogram – Dendrogram.

• weights – Weights of nodes. 'degree' or 'uniform' (default).

Returns

score – Score.

Return type

float

Example

>>> from sknetwork.hierarchy import dasgupta_score, Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> np.round(score, 2)
0.33


References

Dasgupta, S. (2016). A cost function for similarity-based hierarchical clustering. Proceedings of ACM symposium on Theory of Computing.

sknetwork.hierarchy.tree_sampling_divergence(adjacency: scipy.sparse._csr.csr_matrix, dendrogram: numpy.ndarray, weights: str = 'degree', normalized: bool = True) float[source]

Tree sampling divergence of a hierarchy (quality metric).

Parameters

• dendrogram – Dendrogram.

• weights – Weights of nodes. 'degree' (default) or 'uniform'.

• normalized – If True, normalized score (between 0 and 1).

Returns

score – Score.

Return type

float

Example

>>> from sknetwork.hierarchy import tree_sampling_divergence, Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> np.round(score, 2)
0.05


References

Charpentier, B. & Bonald, T. (2019). Tree Sampling Divergence: An Information-Theoretic Metric for Hierarchical Graph Clustering. Proceedings of IJCAI.

## Cuts

sknetwork.hierarchy.cut_straight(dendrogram: numpy.ndarray, n_clusters: Optional[int] = None, threshold: Union[None, float] = None, sort_clusters: bool = True, return_dendrogram: bool = False) Union[numpy.ndarray, Tuple[numpy.ndarray, numpy.ndarray]][source]

Cut a dendrogram and return the corresponding clustering.

Parameters
• dendrogram – Dendrogram.

• n_clusters – Number of clusters (optional). The number of clusters can be larger than n_clusters in case of equal heights in the dendrogram.

• threshold – Threshold on height (optional). If both n_clusters and threshold are None, n_clusters is set to 2.

• sort_clusters – If True, sorts clusters in decreasing order of size.

• return_dendrogram – If True, returns the dendrogram formed by the clusters up to the root.

Returns

• labels (np.ndarray) – Cluster of each node.

• dendrogram_aggregate (np.ndarray) – Dendrogram starting from clusters (leaves = clusters).

Example

>>> from sknetwork.hierarchy import cut_straight
>>> dendrogram = np.array([[0, 1, 0, 2], [2, 3, 1, 3]])
>>> cut_straight(dendrogram)
array([0, 0, 1])

sknetwork.hierarchy.cut_balanced(dendrogram: numpy.ndarray, max_cluster_size: int = 20, sort_clusters: bool = True, return_dendrogram: bool = False) Union[numpy.ndarray, Tuple[numpy.ndarray, numpy.ndarray]][source]

Cuts a dendrogram with a constraint on the cluster size and returns the corresponding clustering.

Parameters
• dendrogram – Dendrogram

• max_cluster_size – Maximum size of each cluster.

• sort_clusters – If True, sort labels in decreasing order of cluster size.

• return_dendrogram – If True, returns the dendrogram formed by the clusters up to the root.

Returns

• labels (np.ndarray) – Label of each node.

• dendrogram_aggregate (np.ndarray) – Dendrogram starting from clusters (leaves = clusters).

Example

>>> from sknetwork.hierarchy import cut_balanced
>>> dendrogram = np.array([[0, 1, 0, 2], [2, 3, 1, 3]])
>>> cut_balanced(dendrogram, 2)
array([0, 0, 1])


## Dendrograms

sknetwork.hierarchy.aggregate_dendrogram(dendrogram: numpy.ndarray, n_clusters: int = 2, return_counts: bool = False) Union[numpy.ndarray, Tuple[numpy.ndarray, numpy.ndarray]][source]

Aggregate a dendrogram in order to get a certain number of leaves. The leaves in the output dendrogram correspond to subtrees in the input one.

Parameters
• dendrogram – The input to aggregate.

• n_clusters – Number of clusters (or leaves) to keep.

• return_counts – If True, returns an array of counts corresponding to the sizes of the merged subtrees. The sum of the counts is equal to the number of samples in the input dendrogram.

Returns

• new_dendrogram – Aggregated dendrogram. The nodes are reindexed from 0.

• counts – Size of the subtrees corresponding to each leaf in new_dendrogram.

sknetwork.hierarchy.reorder_dendrogram(dendrogram: numpy.ndarray) numpy.ndarray[source]

Reorder the dendrogram in non-decreasing order of height.