Hierarchy

Hierarchical clustering algorithms.

The attribute dendrogram_ gives the dendrogram.

A dendrogram is an array of size \((n-1) \times 4\) representing the successive merges of nodes. Each row gives the two merged nodes, their distance and the size of the resulting cluster. Any new node resulting from a merge takes the first available index (e.g., the first merge corresponds to node \(n\)).

Paris

class sknetwork.hierarchy.Paris(weights: unicode = 'degree', reorder: bool = True)

Agglomerative clustering algorithm that performs greedy merge of nodes based on their similarity.

The similarity between nodes \(i,j\) is \(\dfrac{A_{ij}}{w_i w_j}\) where

  • \(A_{ij}\) is the weight of edge \(i,j\),

  • \(w_i, w_j\) are the weights of nodes \(i,j\)

If the input matrix \(B\) is a biadjacency matrix (i.e., rectangular), the algorithm is applied to the corresponding adjacency matrix \(A = \begin{bmatrix} 0 & B \\ B^T & 0 \end{bmatrix}\)

Parameters
  • weights – Weights of nodes. 'degree' (default) or 'uniform'.

  • reorder – If True (default), reorder the dendrogram in non-decreasing order of height.

Variables
  • dendrogram_ – Dendrogram of the graph.

  • dendrogram_row_ – Dendrogram for the rows, for bipartite graphs.

  • dendrogram_col_ – Dendrogram for the columns, for bipartite graphs.

  • dendrogram_full_ – Dendrogram for both rows and columns, indexed in this order, for bipartite graphs.

Examples

>>> from sknetwork.hierarchy import Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> adjacency = house()
>>> dendrogram = paris.fit_predict(adjacency)
>>> np.round(dendrogram, 2)
array([[3.        , 2.        , 0.17      , 2.        ],
       [1.        , 0.        , 0.25      , 2.        ],
       [6.        , 4.        , 0.31      , 3.        ],
       [7.        , 5.        , 0.67      , 5.        ]])

Notes

Each row of the dendrogram = \(i, j\), distance, size of cluster \(i + j\).

See also

scipy.cluster.hierarchy.linkage

References

T. Bonald, B. Charpentier, A. Galland, A. Hollocou (2018). Hierarchical Graph Clustering using Node Pair Sampling. Workshop on Mining and Learning with Graphs.

fit(input_matrix: Union[scipy.sparse._csr.csr_matrix, numpy.ndarray]) sknetwork.hierarchy.paris.Paris

Agglomerative clustering using the nearest neighbor chain.

Parameters

input_matrix – Adjacency matrix or biadjacency matrix of the graph.

Returns

self

Return type

Paris

fit_predict(*args, **kwargs) numpy.ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

fit_transform(*args, **kwargs) numpy.ndarray

Fit algorithm to data and return the dendrogram. Alias for fit_predict. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

Louvain

class sknetwork.hierarchy.LouvainHierarchy(resolution: float = 1, tol_optimization: float = 0.001, tol_aggregation: float = 0.001, shuffle_nodes: bool = False, random_state: Optional[Union[numpy.random.mtrand.RandomState, int]] = None, verbose: bool = False)[source]

Hierarchical clustering by Louvain (bottom-up).

Parameters
  • resolution – Resolution parameter.

  • tol_optimization – Minimum increase in the objective function to enter a new optimization pass.

  • tol_aggregation – Minimum increase in the objective function to enter a new aggregation pass.

  • shuffle_nodes – Enables node shuffling before optimization.

  • random_state – Random number generator or random seed. If None, numpy.random is used.

  • verbose – Verbose mode.

Variables
  • dendrogram_ – Dendrogram of the graph.

  • dendrogram_row_ – Dendrogram for the rows, for bipartite graphs.

  • dendrogram_col_ – Dendrogram for the columns, for bipartite graphs.

  • dendrogram_full_ – Dendrogram for both rows and columns, indexed in this order, for bipartite graphs.

Example

>>> from sknetwork.hierarchy import LouvainHierarchy
>>> from sknetwork.data import house
>>> louvain = LouvainHierarchy()
>>> adjacency = house()
>>> louvain.fit_predict(adjacency)
array([[3., 2., 0., 2.],
       [4., 1., 0., 2.],
       [6., 0., 0., 3.],
       [5., 7., 1., 5.]])

Notes

Each row of the dendrogram = merge nodes, distance, size of cluster.

See also

scipy.cluster.hierarchy.dendrogram

fit(input_matrix: Union[scipy.sparse._csr.csr_matrix, numpy.ndarray]) sknetwork.hierarchy.louvain_hierarchy.LouvainHierarchy[source]

Fit algorithm to data.

Parameters

input_matrix – Adjacency matrix or biadjacency matrix of the graph.

Returns

self

Return type

LouvainIteration

fit_predict(*args, **kwargs) numpy.ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

fit_transform(*args, **kwargs) numpy.ndarray

Fit algorithm to data and return the dendrogram. Alias for fit_predict. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

class sknetwork.hierarchy.LouvainIteration(depth: int = 3, resolution: float = 1, tol_optimization: float = 0.001, tol_aggregation: float = 0.001, n_aggregations: int = - 1, shuffle_nodes: bool = False, random_state: Optional[Union[numpy.random.mtrand.RandomState, int]] = None, verbose: bool = False)[source]

Hierarchical clustering by successive instances of Louvain (top-down).

Parameters
  • depth – Depth of the tree. A negative value is interpreted as no limit (return a tree of maximum depth).

  • resolution – Resolution parameter.

  • tol_optimization – Minimum increase in the objective function to enter a new optimization pass.

  • tol_aggregation – Minimum increase in the objective function to enter a new aggregation pass.

  • n_aggregations – Maximum number of aggregations. A negative value is interpreted as no limit.

  • shuffle_nodes – Enables node shuffling before optimization.

  • random_state – Random number generator or random seed. If None, numpy.random is used.

  • verbose – Verbose mode.

Variables
  • dendrogram_ – Dendrogram of the graph.

  • dendrogram_row_ – Dendrogram for the rows, for bipartite graphs.

  • dendrogram_col_ – Dendrogram for the columns, for bipartite graphs.

  • dendrogram_full_ – Dendrogram for both rows and columns, indexed in this order, for bipartite graphs.

Example

>>> from sknetwork.hierarchy import LouvainIteration
>>> from sknetwork.data import house
>>> louvain = LouvainIteration()
>>> adjacency = house()
>>> louvain.fit_predict(adjacency)
array([[3., 2., 0., 2.],
       [4., 1., 0., 2.],
       [6., 0., 0., 3.],
       [5., 7., 1., 5.]])

Notes

Each row of the dendrogram = merge nodes, distance, size of cluster.

See also

scipy.cluster.hierarchy.dendrogram

fit(input_matrix: Union[scipy.sparse._csr.csr_matrix, numpy.ndarray]) sknetwork.hierarchy.louvain_hierarchy.LouvainIteration[source]

Fit algorithm to data.

Parameters

input_matrix – Adjacency matrix or biadjacency matrix of the graph.

Returns

self

Return type

LouvainIteration

fit_predict(*args, **kwargs) numpy.ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

fit_transform(*args, **kwargs) numpy.ndarray

Fit algorithm to data and return the dendrogram. Alias for fit_predict. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

Ward

class sknetwork.hierarchy.Ward(embedding_method: sknetwork.embedding.base.BaseEmbedding = Spectral(n_components=10, decomposition='rw', regularization=- 1, normalized=True), co_cluster: bool = False)[source]

Hierarchical clustering by the Ward method.

Parameters
  • embedding_method – Embedding method (default = Spectral embedding in dimension 10).

  • co_cluster – If True, co-cluster rows and columns, considered as different nodes (default = False).

Variables
  • dendrogram_ – Dendrogram of the graph.

  • dendrogram_row_ – Dendrogram for the rows, for bipartite graphs.

  • dendrogram_col_ – Dendrogram for the columns, for bipartite graphs.

  • dendrogram_full_ – Dendrogram for both rows and columns, indexed in this order, for bipartite graphs.

Examples

>>> from sknetwork.hierarchy import Ward
>>> from sknetwork.data import karate_club
>>> ward = Ward()
>>> adjacency = karate_club()
>>> dendrogram = ward.fit_transform(adjacency)
>>> dendrogram.shape
(33, 4)

References

  • Ward, J. H., Jr. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association.

  • Murtagh, F., & Contreras, P. (2012). Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.

fit(input_matrix: Union[scipy.sparse._csr.csr_matrix, numpy.ndarray]) sknetwork.hierarchy.ward.Ward[source]

Applies embedding method followed by the Ward algorithm.

Parameters

input_matrix – Adjacency matrix or biadjacency matrix of the graph.

Returns

self

Return type

Ward

fit_predict(*args, **kwargs) numpy.ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

fit_transform(*args, **kwargs) numpy.ndarray

Fit algorithm to data and return the dendrogram. Alias for fit_predict. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

Metrics

sknetwork.hierarchy.dasgupta_cost(adjacency: scipy.sparse._csr.csr_matrix, dendrogram: numpy.ndarray, weights: str = 'uniform', normalized: bool = False) float[source]

Dasgupta’s cost of a hierarchy.

Expected size (weights = 'uniform') or expected volume (weights = 'degree') of the cluster induced by random edge sampling (closest ancestor of the two nodes in the hierarchy).

Parameters
  • adjacency – Adjacency matrix of the graph.

  • dendrogram – Dendrogram.

  • weights – Weights of nodes. 'degree' or 'uniform' (default).

  • normalized – If True, normalized cost (between 0 and 1).

Returns

cost – Cost.

Return type

float

Example

>>> from sknetwork.hierarchy import dasgupta_score, Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> adjacency = house()
>>> dendrogram = paris.fit_transform(adjacency)
>>> cost = dasgupta_cost(adjacency, dendrogram)
>>> np.round(cost, 2)
3.33

References

Dasgupta, S. (2016). A cost function for similarity-based hierarchical clustering. Proceedings of ACM symposium on Theory of Computing.

sknetwork.hierarchy.dasgupta_score(adjacency: scipy.sparse._csr.csr_matrix, dendrogram: numpy.ndarray, weights: str = 'uniform') float[source]

Dasgupta’s score of a hierarchy (quality metric, between 0 and 1).

Defined as 1 - normalized Dasgupta’s cost.

Parameters
  • adjacency – Adjacency matrix of the graph.

  • dendrogram – Dendrogram.

  • weights – Weights of nodes. 'degree' or 'uniform' (default).

Returns

score – Score.

Return type

float

Example

>>> from sknetwork.hierarchy import dasgupta_score, Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> adjacency = house()
>>> dendrogram = paris.fit_transform(adjacency)
>>> score = dasgupta_score(adjacency, dendrogram)
>>> np.round(score, 2)
0.33

References

Dasgupta, S. (2016). A cost function for similarity-based hierarchical clustering. Proceedings of ACM symposium on Theory of Computing.

sknetwork.hierarchy.tree_sampling_divergence(adjacency: scipy.sparse._csr.csr_matrix, dendrogram: numpy.ndarray, weights: str = 'degree', normalized: bool = True) float[source]

Tree sampling divergence of a hierarchy (quality metric).

Parameters
  • adjacency – Adjacency matrix of the graph.

  • dendrogram – Dendrogram.

  • weights – Weights of nodes. 'degree' (default) or 'uniform'.

  • normalized – If True, normalized score (between 0 and 1).

Returns

score – Score.

Return type

float

Example

>>> from sknetwork.hierarchy import tree_sampling_divergence, Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> adjacency = house()
>>> dendrogram = paris.fit_transform(adjacency)
>>> score = tree_sampling_divergence(adjacency, dendrogram)
>>> np.round(score, 2)
0.05

References

Charpentier, B. & Bonald, T. (2019). Tree Sampling Divergence: An Information-Theoretic Metric for Hierarchical Graph Clustering. Proceedings of IJCAI.

Cuts

sknetwork.hierarchy.cut_straight(dendrogram: numpy.ndarray, n_clusters: Optional[int] = None, threshold: Union[None, float] = None, sort_clusters: bool = True, return_dendrogram: bool = False) Union[numpy.ndarray, Tuple[numpy.ndarray, numpy.ndarray]][source]

Cut a dendrogram and return the corresponding clustering.

Parameters
  • dendrogram – Dendrogram.

  • n_clusters – Number of clusters (optional). The number of clusters can be larger than n_clusters in case of equal heights in the dendrogram.

  • threshold – Threshold on height (optional). If both n_clusters and threshold are None, n_clusters is set to 2.

  • sort_clusters – If True, sorts clusters in decreasing order of size.

  • return_dendrogram – If True, returns the dendrogram formed by the clusters up to the root.

Returns

  • labels (np.ndarray) – Cluster of each node.

  • dendrogram_aggregate (np.ndarray) – Dendrogram starting from clusters (leaves = clusters).

Example

>>> from sknetwork.hierarchy import cut_straight
>>> dendrogram = np.array([[0, 1, 0, 2], [2, 3, 1, 3]])
>>> cut_straight(dendrogram)
array([0, 0, 1])
sknetwork.hierarchy.cut_balanced(dendrogram: numpy.ndarray, max_cluster_size: int = 20, sort_clusters: bool = True, return_dendrogram: bool = False) Union[numpy.ndarray, Tuple[numpy.ndarray, numpy.ndarray]][source]

Cuts a dendrogram with a constraint on the cluster size and returns the corresponding clustering.

Parameters
  • dendrogram – Dendrogram

  • max_cluster_size – Maximum size of each cluster.

  • sort_clusters – If True, sort labels in decreasing order of cluster size.

  • return_dendrogram – If True, returns the dendrogram formed by the clusters up to the root.

Returns

  • labels (np.ndarray) – Label of each node.

  • dendrogram_aggregate (np.ndarray) – Dendrogram starting from clusters (leaves = clusters).

Example

>>> from sknetwork.hierarchy import cut_balanced
>>> dendrogram = np.array([[0, 1, 0, 2], [2, 3, 1, 3]])
>>> cut_balanced(dendrogram, 2)
array([0, 0, 1])

Dendrograms

sknetwork.hierarchy.aggregate_dendrogram(dendrogram: numpy.ndarray, n_clusters: int = 2, return_counts: bool = False) Union[numpy.ndarray, Tuple[numpy.ndarray, numpy.ndarray]][source]

Aggregate a dendrogram in order to get a certain number of leaves. The leaves in the output dendrogram correspond to subtrees in the input one.

Parameters
  • dendrogram – The input to aggregate.

  • n_clusters – Number of clusters (or leaves) to keep.

  • return_counts – If True, returns an array of counts corresponding to the sizes of the merged subtrees. The sum of the counts is equal to the number of samples in the input dendrogram.

Returns

  • new_dendrogram – Aggregated dendrogram. The nodes are reindexed from 0.

  • counts – Size of the subtrees corresponding to each leaf in new_dendrogram.

sknetwork.hierarchy.reorder_dendrogram(dendrogram: numpy.ndarray) numpy.ndarray[source]

Reorder the dendrogram in non-decreasing order of height.