# Hierarchy¶

Hierarchical clustering algorithms.

The attribute dendrogram_ gives the dendrogram.

A dendrogram is an array of size $$(n-1) \times 4$$ representing the successive merges of nodes. Each row gives the two merged nodes, their distance and the size of the resulting cluster. Any new node resulting from a merge takes the first available index (e.g., the first merge corresponds to node $$n$$).

## Paris¶

class sknetwork.hierarchy.Paris(weights: unicode = 'degree', reorder: bool = True)

Agglomerative clustering algorithm that performs greedy merge of nodes based on their similarity.

• Graphs

• Digraphs

The similarity between nodes $$i,j$$ is $$\dfrac{A_{ij}}{w_i w_j}$$ where

• $$A_{ij}$$ is the weight of edge $$i,j$$,

• $$w_i, w_j$$ are the weights of nodes $$i,j$$

Parameters
• weights – Weights of nodes. 'degree' (default) or 'uniform'.

• reorder – If True, reorder the dendrogram in non-decreasing order of height.

Variables

dendrogram_ (numpy array of shape (total number of nodes - 1, 4)) – Dendrogram.

Examples

>>> from sknetwork.hierarchy import Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> np.round(dendrogram, 2)
array([[3.        , 2.        , 0.17      , 2.        ],
[1.        , 0.        , 0.25      , 2.        ],
[6.        , 4.        , 0.31      , 3.        ],
[7.        , 5.        , 0.67      , 5.        ]])


Notes

Each row of the dendrogram = $$i, j$$, distance, size of cluster $$i + j$$.

scipy.cluster.hierarchy.linkage

References

T. Bonald, B. Charpentier, A. Galland, A. Hollocou (2018). Hierarchical Graph Clustering using Node Pair Sampling. Workshop on Mining and Learning with Graphs.

fit(adjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray]) → sknetwork.hierarchy.paris.Paris

Agglomerative clustering using the nearest neighbor chain.

Parameters

Returns

self

Return type

Paris

fit_transform(*args, **kwargs) → numpy.ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

class sknetwork.hierarchy.BiParis(weights: unicode = 'degree', reorder: bool = True)

Hierarchical clustering of bipartite graphs by the Paris method.

• Bigraphs

Parameters
• weights – Weights of nodes. 'degree' (default) or 'uniform'.

• reorder – If True, reorder the dendrogram in non-decreasing order of height.

Variables
• dendrogram_ – Dendrogram for the rows.

• dendrogram_row_ – Dendrogram for the rows (copy of dendrogram_).

• dendrogram_col_ – Dendrogram for the columns.

• dendrogram_full_ – Dendrogram for both rows and columns, indexed in this order.

Examples

>>> from sknetwork.hierarchy import BiParis
>>> from sknetwork.data import star_wars
>>> biparis = BiParis()
>>> np.round(dendrogram, 2)
array([[1.        , 2.        , 0.37      , 2.        ],
[4.        , 0.        , 0.55      , 3.        ],
[5.        , 3.        , 0.75      , 4.        ]])


Notes

Each row of the dendrogram = $$i, j$$, height, size of cluster.

scipy.cluster.hierarchy.linkage

References

T. Bonald, B. Charpentier, A. Galland, A. Hollocou (2018). Hierarchical Graph Clustering using Node Pair Sampling. Workshop on Mining and Learning with Graphs.

fit(biadjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray]) → sknetwork.hierarchy.paris.BiParis

Apply the Paris algorithm to

$$A = \begin{bmatrix} 0 & B \\ B^T & 0 \end{bmatrix}$$

where $$B$$ is the biadjacency matrix of the graph.

Parameters

Returns

self

Return type

BiParis

fit_transform(*args, **kwargs) → numpy.ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

## Louvain¶

class sknetwork.hierarchy.LouvainHierarchy(depth: int = 3, resolution: float = 1, tol_optimization: float = 0.001, tol_aggregation: float = 0.001, n_aggregations: int = - 1, shuffle_nodes: bool = False, random_state: Optional[Union[numpy.random.mtrand.RandomState, int]] = None, verbose: bool = False)[source]

Hierarchical clustering by successive instances of Louvain (top-down).

• Graphs

• Digraphs

Parameters
• depth – Depth of the tree. A negative value is interpreted as no limit (return a tree of maximum depth).

• resolution – Resolution parameter.

• tol_optimization – Minimum increase in the objective function to enter a new optimization pass.

• tol_aggregation – Minimum increase in the objective function to enter a new aggregation pass.

• n_aggregations – Maximum number of aggregations. A negative value is interpreted as no limit.

• shuffle_nodes – Enables node shuffling before optimization.

• random_state – Random number generator or random seed. If None, numpy.random is used.

• verbose – Verbose mode.

Variables

dendrogram_ (np.ndarray) – Dendrogram.

Example

>>> from sknetwork.hierarchy import LouvainHierarchy
>>> from sknetwork.data import house
>>> louvain = LouvainHierarchy()
array([[3., 2., 0., 2.],
[4., 1., 0., 2.],
[6., 0., 0., 3.],
[5., 7., 1., 5.]])


Notes

Each row of the dendrogram = merge nodes, distance, size of cluster.

scipy.cluster.hierarchy.dendrogram

fit(adjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray]) → sknetwork.hierarchy.louvain_hierarchy.LouvainHierarchy[source]

Fit algorithm to data.

Parameters

Returns

self

Return type

LouvainHierarchy

fit_transform(*args, **kwargs) → numpy.ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

class sknetwork.hierarchy.BiLouvainHierarchy(**kwargs)[source]

Hierarchical clustering of bipartite graphs by successive instances of Louvain (top-down).

• Bigraphs

Parameters
• depth – Depth of the tree. A negative value is interpreted as no limit (return a tree of maximum depth).

• resolution – Resolution parameter.

• tol_optimization – Minimum increase in the objective function to enter a new optimization pass.

• tol_aggregation – Minimum increase in the objective function to enter a new aggregation pass.

• n_aggregations – Maximum number of aggregations. A negative value is interpreted as no limit.

• shuffle_nodes – Enables node shuffling before optimization.

• random_state – Random number generator or random seed. If None, numpy.random is used.

• verbose – Verbose mode.

Variables
• dendrogram_ – Dendrogram for the rows.

• dendrogram_row_ – Dendrogram for the rows (copy of dendrogram_).

• dendrogram_col_ – Dendrogram for the columns.

• dendrogram_full_ – Dendrogram for both rows and columns, indexed in this order.

Examples

>>> from sknetwork.hierarchy import BiLouvainHierarchy
>>> from sknetwork.data import star_wars
>>> bilouvain = BiLouvainHierarchy()
array([[3., 2., 1., 2.],
[1., 0., 1., 2.],
[5., 4., 2., 4.]])


Notes

Each row of the dendrogram = $$i, j$$, height, size of cluster.

scipy.cluster.hierarchy.linkage

fit(biadjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray]) → sknetwork.hierarchy.louvain_hierarchy.BiLouvainHierarchy[source]

Applies Louvain hierarchical clustering to

$$A = \begin{bmatrix} 0 & B \\ B^T & 0 \end{bmatrix}$$

where $$B$$ is the biadjacency matrix of the graphs.

Parameters

Returns

self

Return type

BiLouvainHierarchy

fit_transform(*args, **kwargs) → numpy.ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

## Ward¶

class sknetwork.hierarchy.Ward(embedding_method: sknetwork.embedding.base.BaseEmbedding = GSVD(n_components=10, regularization=None, relative_regularization=True, factor_row=0.5, factor_col=0.5, factor_singular=0.0, normalized=True, solver='auto'))[source]

Hierarchical clustering by the Ward method.

• Graphs

• Digraphs

Parameters

embedding_method – Embedding method (default = GSVD in dimension 10, projected on the unit sphere).

Examples

>>> from sknetwork.hierarchy import Ward
>>> from sknetwork.data import karate_club
>>> ward = Ward()
>>> dendrogram.shape
(33, 4)


References

• Ward, J. H., Jr. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association.

• Murtagh, F., & Contreras, P. (2012). Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.

fit(adjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray]) → sknetwork.hierarchy.ward.Ward[source]

Applies embedding method followed by the Ward algorithm.

Parameters

Returns

self

Return type

Ward

fit_transform(*args, **kwargs) → numpy.ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

class sknetwork.hierarchy.BiWard(embedding_method: sknetwork.embedding.base.BaseBiEmbedding = GSVD(n_components=10, regularization=None, relative_regularization=True, factor_row=0.5, factor_col=0.5, factor_singular=0.0, normalized=True, solver='auto'), cluster_col: bool = False, cluster_both: bool = False)[source]

Hierarchical clustering of bipartite graphs by the Ward method.

• Bigraphs

Parameters
• embedding_method – Embedding method (default = GSVD in dimension 10, projected on the unit sphere).

• cluster_col – If True, return a dendrogram for the columns (default = False).

• cluster_both – If True, return a dendrogram for all nodes (co-clustering rows + columns, default = False).

Variables
• dendrogram_ – Dendrogram for the rows.

• dendrogram_row_ – Dendrogram for the rows (copy of dendrogram_).

• dendrogram_col_ – Dendrogram for the columns.

• dendrogram_full_ – Dendrogram for both rows and columns, indexed in this order.

Examples

>>> from sknetwork.hierarchy import BiWard
>>> from sknetwork.data import movie_actor
>>> biward = BiWard()
(14, 4)


References

• Ward, J. H., Jr. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 236–244.

• Murtagh, F., & Contreras, P. (2012). Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1), 86-97.

fit(biadjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray]) → sknetwork.hierarchy.ward.BiWard[source]

Applies the embedding method followed by the Ward algorithm.

Parameters

Returns

self

Return type

BiWard

fit_transform(*args, **kwargs) → numpy.ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

## Metrics¶

sknetwork.hierarchy.dasgupta_cost(adjacency: scipy.sparse.csr.csr_matrix, dendrogram: numpy.ndarray, weights: str = 'uniform', normalized: bool = False) → float[source]

Dasgupta’s cost of a hierarchy.

• Graphs

• Digraphs

Expected size (weights = 'uniform') or expected weight (weights = 'degree') of the cluster induced by random edge sampling (closest ancestor of the two nodes in the hierarchy).

Parameters

• dendrogram – Dendrogram.

• weights – Weights of nodes. 'degree' or 'uniform' (default).

• normalized – If True, normalized cost (between 0 and 1).

Returns

cost – Cost.

Return type

float

Example

>>> from sknetwork.hierarchy import dasgupta_score, Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> np.round(cost, 2)
3.33


References

Dasgupta, S. (2016). A cost function for similarity-based hierarchical clustering. Proceedings of ACM symposium on Theory of Computing.

sknetwork.hierarchy.dasgupta_score(adjacency: scipy.sparse.csr.csr_matrix, dendrogram: numpy.ndarray, weights: str = 'uniform') → float[source]

Dasgupta’s score of a hierarchy (quality metric, between 0 and 1).

• Graphs

• Digraphs

Defined as 1 - normalized Dasgupta’s cost.

Parameters

• dendrogram – Dendrogram.

• weights – Weights of nodes. 'degree' or 'uniform' (default).

Returns

score – Score.

Return type

float

Example

>>> from sknetwork.hierarchy import dasgupta_score, Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> np.round(score, 2)
0.33


References

Dasgupta, S. (2016). A cost function for similarity-based hierarchical clustering. Proceedings of ACM symposium on Theory of Computing.

sknetwork.hierarchy.tree_sampling_divergence(adjacency: scipy.sparse.csr.csr_matrix, dendrogram: numpy.ndarray, weights: str = 'degree', normalized: bool = True) → float[source]

Tree sampling divergence of a hierarchy (quality metric).

• Graphs

• Digraphs

Parameters

• dendrogram – Dendrogram.

• weights – Weights of nodes. 'degree' (default) or 'uniform'.

• normalized – If True, normalized score (between 0 and 1).

Returns

score – Score.

Return type

float

Example

>>> from sknetwork.hierarchy import tree_sampling_divergence, Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> np.round(score, 2)
0.52


References

Charpentier, B. & Bonald, T. (2019). Tree Sampling Divergence: An Information-Theoretic Metric for Hierarchical Graph Clustering. Proceedings of IJCAI.

## Cuts¶

sknetwork.hierarchy.cut_straight(dendrogram: numpy.ndarray, n_clusters: Optional[int] = None, threshold: Optional[float] = None, sort_clusters: bool = True, return_dendrogram: bool = False) → Union[numpy.ndarray, Tuple[numpy.ndarray, numpy.ndarray]][source]

Cut a dendrogram and return the corresponding clustering.

Parameters
• dendrogram – Dendrogram.

• n_clusters – Number of clusters (optional).

• threshold – Threshold on height (optional). If both n_clusters and threshold are None, n_clusters is set to 2.

• sort_clusters – If True, sorts clusters in decreasing order of size.

• return_dendrogram – If True, returns the dendrogram formed by the clusters up to the root.

Returns

• labels (np.ndarray) – Cluster of each node.

• dendrogram_aggregate (np.ndarray) – Dendrogram starting from clusters (leaves = clusters).

Example

>>> from sknetwork.hierarchy import cut_straight
>>> dendrogram = np.array([[0, 1, 0, 2], [2, 3, 1, 3]])
>>> cut_straight(dendrogram)
array([0, 0, 1])

sknetwork.hierarchy.cut_balanced(dendrogram: numpy.ndarray, max_cluster_size: int = 20, sort_clusters: bool = True, return_dendrogram: bool = False) → Union[numpy.ndarray, Tuple[numpy.ndarray, numpy.ndarray]][source]

Cuts a dendrogram with a constraint on the cluster size and returns the corresponding clustering.

Parameters
• dendrogram – Dendrogram

• max_cluster_size – Maximum size of each cluster.

• sort_clusters – If True, sort labels in decreasing order of cluster size.

• return_dendrogram – If True, returns the dendrogram formed by the clusters up to the root.

Returns

• labels (np.ndarray) – Label of each node.

• dendrogram_aggregate (np.ndarray) – Dendrogram starting from clusters (leaves = clusters).

Example

>>> from sknetwork.hierarchy import cut_balanced
>>> dendrogram = np.array([[0, 1, 0, 2], [2, 3, 1, 3]])
>>> cut_balanced(dendrogram, 2)
array([0, 0, 1])