Hierarchy

Hierarchical clustering algorithms.

The attribute dendrogram_ gives the dendrogram.

A dendrogram is an array of size \((n-1) \times 4\) representing the successive merges of nodes. Each row gives the two merged nodes, their distance and the size of the resulting cluster. Any new node resulting from a merge takes the first available index (e.g., the first merge corresponds to node \(n\)).

Paris

class sknetwork.hierarchy.Paris(weights: unicode = 'degree', reorder: bool = True)

Agglomerative clustering algorithm that performs greedy merge of nodes based on their similarity.

  • Graphs

  • Digraphs

The similarity between nodes \(i,j\) is \(\dfrac{A_{ij}}{w_i w_j}\) where

  • \(A_{ij}\) is the weight of edge \(i,j\),

  • \(w_i, w_j\) are the weights of nodes \(i,j\)

Parameters
  • weights – Weights of nodes. 'degree' (default) or 'uniform'.

  • reorder – If True, reorder the dendrogram in non-decreasing order of height.

Variables

dendrogram_ (numpy array of shape (total number of nodes - 1, 4)) – Dendrogram.

Examples

>>> from sknetwork.hierarchy import Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> adjacency = house()
>>> dendrogram = paris.fit_transform(adjacency)
>>> np.round(dendrogram, 2)
array([[3.        , 2.        , 0.17      , 2.        ],
       [1.        , 0.        , 0.25      , 2.        ],
       [6.        , 4.        , 0.31      , 3.        ],
       [7.        , 5.        , 0.67      , 5.        ]])

Notes

Each row of the dendrogram = \(i, j\), distance, size of cluster \(i + j\).

See also

scipy.cluster.hierarchy.linkage

References

T. Bonald, B. Charpentier, A. Galland, A. Hollocou (2018). Hierarchical Graph Clustering using Node Pair Sampling. Workshop on Mining and Learning with Graphs.

fit(adjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray]) → sknetwork.hierarchy.paris.Paris

Agglomerative clustering using the nearest neighbor chain.

Parameters

adjacency – Adjacency matrix of the graph.

Returns

self

Return type

Paris

fit_transform(*args, **kwargs) → numpy.ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

class sknetwork.hierarchy.BiParis(weights: unicode = 'degree', reorder: bool = True)

Hierarchical clustering of bipartite graphs by the Paris method.

  • Bigraphs

Parameters
  • weights – Weights of nodes. 'degree' (default) or 'uniform'.

  • reorder – If True, reorder the dendrogram in non-decreasing order of height.

Variables
  • dendrogram_ – Dendrogram for the rows.

  • dendrogram_row_ – Dendrogram for the rows (copy of dendrogram_).

  • dendrogram_col_ – Dendrogram for the columns.

  • dendrogram_full_ – Dendrogram for both rows and columns, indexed in this order.

Examples

>>> from sknetwork.hierarchy import BiParis
>>> from sknetwork.data import star_wars
>>> biparis = BiParis()
>>> biadjacency = star_wars()
>>> dendrogram = biparis.fit_transform(biadjacency)
>>> np.round(dendrogram, 2)
array([[1.        , 2.        , 0.37      , 2.        ],
       [4.        , 0.        , 0.55      , 3.        ],
       [5.        , 3.        , 0.75      , 4.        ]])

Notes

Each row of the dendrogram = \(i, j\), height, size of cluster.

See also

scipy.cluster.hierarchy.linkage

References

T. Bonald, B. Charpentier, A. Galland, A. Hollocou (2018). Hierarchical Graph Clustering using Node Pair Sampling. Workshop on Mining and Learning with Graphs.

fit(biadjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray]) → sknetwork.hierarchy.paris.BiParis

Apply the Paris algorithm to

\(A = \begin{bmatrix} 0 & B \\ B^T & 0 \end{bmatrix}\)

where \(B\) is the biadjacency matrix of the graph.

Parameters

biadjacency – Biadjacency matrix of the graph.

Returns

self

Return type

BiParis

fit_transform(*args, **kwargs) → numpy.ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

Louvain

class sknetwork.hierarchy.LouvainHierarchy(depth: int = 3, resolution: float = 1, tol_optimization: float = 0.001, tol_aggregation: float = 0.001, n_aggregations: int = - 1, shuffle_nodes: bool = False, random_state: Optional[Union[numpy.random.mtrand.RandomState, int]] = None, verbose: bool = False)[source]

Hierarchical clustering by successive instances of Louvain (top-down).

  • Graphs

  • Digraphs

Parameters
  • depth – Depth of the tree. A negative value is interpreted as no limit (return a tree of maximum depth).

  • resolution – Resolution parameter.

  • tol_optimization – Minimum increase in the objective function to enter a new optimization pass.

  • tol_aggregation – Minimum increase in the objective function to enter a new aggregation pass.

  • n_aggregations – Maximum number of aggregations. A negative value is interpreted as no limit.

  • shuffle_nodes – Enables node shuffling before optimization.

  • random_state – Random number generator or random seed. If None, numpy.random is used.

  • verbose – Verbose mode.

Variables

dendrogram_ (np.ndarray) – Dendrogram.

Example

>>> from sknetwork.hierarchy import LouvainHierarchy
>>> from sknetwork.data import house
>>> louvain = LouvainHierarchy()
>>> adjacency = house()
>>> louvain.fit_transform(adjacency)
array([[3., 2., 0., 2.],
       [4., 1., 0., 2.],
       [6., 0., 0., 3.],
       [5., 7., 1., 5.]])

Notes

Each row of the dendrogram = merge nodes, distance, size of cluster.

See also

scipy.cluster.hierarchy.dendrogram

fit(adjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray]) → sknetwork.hierarchy.louvain_hierarchy.LouvainHierarchy[source]

Fit algorithm to data.

Parameters

adjacency – Adjacency matrix of the graph.

Returns

self

Return type

LouvainHierarchy

fit_transform(*args, **kwargs) → numpy.ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

class sknetwork.hierarchy.BiLouvainHierarchy(**kwargs)[source]

Hierarchical clustering of bipartite graphs by successive instances of Louvain (top-down).

  • Bigraphs

Parameters
  • depth – Depth of the tree. A negative value is interpreted as no limit (return a tree of maximum depth).

  • resolution – Resolution parameter.

  • tol_optimization – Minimum increase in the objective function to enter a new optimization pass.

  • tol_aggregation – Minimum increase in the objective function to enter a new aggregation pass.

  • n_aggregations – Maximum number of aggregations. A negative value is interpreted as no limit.

  • shuffle_nodes – Enables node shuffling before optimization.

  • random_state – Random number generator or random seed. If None, numpy.random is used.

  • verbose – Verbose mode.

Variables
  • dendrogram_ – Dendrogram for the rows.

  • dendrogram_row_ – Dendrogram for the rows (copy of dendrogram_).

  • dendrogram_col_ – Dendrogram for the columns.

  • dendrogram_full_ – Dendrogram for both rows and columns, indexed in this order.

Examples

>>> from sknetwork.hierarchy import BiLouvainHierarchy
>>> from sknetwork.data import star_wars
>>> bilouvain = BiLouvainHierarchy()
>>> biadjacency = star_wars()
>>> bilouvain.fit_transform(biadjacency)
array([[3., 2., 1., 2.],
       [1., 0., 1., 2.],
       [5., 4., 2., 4.]])

Notes

Each row of the dendrogram = \(i, j\), height, size of cluster.

See also

scipy.cluster.hierarchy.linkage

fit(biadjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray]) → sknetwork.hierarchy.louvain_hierarchy.BiLouvainHierarchy[source]

Applies Louvain hierarchical clustering to

\(A = \begin{bmatrix} 0 & B \\ B^T & 0 \end{bmatrix}\)

where \(B\) is the biadjacency matrix of the graphs.

Parameters

biadjacency – Biadjacency matrix of the graph.

Returns

self

Return type

BiLouvainHierarchy

fit_transform(*args, **kwargs) → numpy.ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

Ward

class sknetwork.hierarchy.Ward(embedding_method: sknetwork.embedding.base.BaseEmbedding = GSVD(n_components=10, regularization=None, relative_regularization=True, factor_row=0.5, factor_col=0.5, factor_singular=0.0, normalized=True, solver='auto'))[source]

Hierarchical clustering by the Ward method.

  • Graphs

  • Digraphs

Parameters

embedding_method – Embedding method (default = GSVD in dimension 10, projected on the unit sphere).

Examples

>>> from sknetwork.hierarchy import Ward
>>> from sknetwork.data import karate_club
>>> ward = Ward()
>>> adjacency = karate_club()
>>> dendrogram = ward.fit_transform(adjacency)
>>> dendrogram.shape
(33, 4)

References

  • Ward, J. H., Jr. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association.

  • Murtagh, F., & Contreras, P. (2012). Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.

fit(adjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray]) → sknetwork.hierarchy.ward.Ward[source]

Applies embedding method followed by the Ward algorithm.

Parameters

adjacency – Adjacency matrix of the graph.

Returns

self

Return type

Ward

fit_transform(*args, **kwargs) → numpy.ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

class sknetwork.hierarchy.BiWard(embedding_method: sknetwork.embedding.base.BaseBiEmbedding = GSVD(n_components=10, regularization=None, relative_regularization=True, factor_row=0.5, factor_col=0.5, factor_singular=0.0, normalized=True, solver='auto'), cluster_col: bool = False, cluster_both: bool = False)[source]

Hierarchical clustering of bipartite graphs by the Ward method.

  • Bigraphs

Parameters
  • embedding_method – Embedding method (default = GSVD in dimension 10, projected on the unit sphere).

  • cluster_col – If True, return a dendrogram for the columns (default = False).

  • cluster_both – If True, return a dendrogram for all nodes (co-clustering rows + columns, default = False).

Variables
  • dendrogram_ – Dendrogram for the rows.

  • dendrogram_row_ – Dendrogram for the rows (copy of dendrogram_).

  • dendrogram_col_ – Dendrogram for the columns.

  • dendrogram_full_ – Dendrogram for both rows and columns, indexed in this order.

Examples

>>> from sknetwork.hierarchy import BiWard
>>> from sknetwork.data import movie_actor
>>> biward = BiWard()
>>> biadjacency = movie_actor()
>>> biward.fit_transform(biadjacency).shape
(14, 4)

References

  • Ward, J. H., Jr. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 236–244.

  • Murtagh, F., & Contreras, P. (2012). Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1), 86-97.

fit(biadjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray]) → sknetwork.hierarchy.ward.BiWard[source]

Applies the embedding method followed by the Ward algorithm.

Parameters

biadjacency – Biadjacency matrix of the graph.

Returns

self

Return type

BiWard

fit_transform(*args, **kwargs) → numpy.ndarray

Fit algorithm to data and return the dendrogram. Same parameters as the fit method.

Returns

dendrogram – Dendrogram.

Return type

np.ndarray

Metrics

sknetwork.hierarchy.dasgupta_cost(adjacency: scipy.sparse.csr.csr_matrix, dendrogram: numpy.ndarray, weights: str = 'uniform', normalized: bool = False) → float[source]

Dasgupta’s cost of a hierarchy.

  • Graphs

  • Digraphs

Expected size (weights = 'uniform') or expected weight (weights = 'degree') of the cluster induced by random edge sampling (closest ancestor of the two nodes in the hierarchy).

Parameters
  • adjacency – Adjacency matrix of the graph.

  • dendrogram – Dendrogram.

  • weights – Weights of nodes. 'degree' or 'uniform' (default).

  • normalized – If True, normalized cost (between 0 and 1).

Returns

cost – Cost.

Return type

float

Example

>>> from sknetwork.hierarchy import dasgupta_score, Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> adjacency = house()
>>> dendrogram = paris.fit_transform(adjacency)
>>> cost = dasgupta_cost(adjacency, dendrogram)
>>> np.round(cost, 2)
3.33

References

Dasgupta, S. (2016). A cost function for similarity-based hierarchical clustering. Proceedings of ACM symposium on Theory of Computing.

sknetwork.hierarchy.dasgupta_score(adjacency: scipy.sparse.csr.csr_matrix, dendrogram: numpy.ndarray, weights: str = 'uniform') → float[source]

Dasgupta’s score of a hierarchy (quality metric, between 0 and 1).

  • Graphs

  • Digraphs

Defined as 1 - normalized Dasgupta’s cost.

Parameters
  • adjacency – Adjacency matrix of the graph.

  • dendrogram – Dendrogram.

  • weights – Weights of nodes. 'degree' or 'uniform' (default).

Returns

score – Score.

Return type

float

Example

>>> from sknetwork.hierarchy import dasgupta_score, Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> adjacency = house()
>>> dendrogram = paris.fit_transform(adjacency)
>>> score = dasgupta_score(adjacency, dendrogram)
>>> np.round(score, 2)
0.33

References

Dasgupta, S. (2016). A cost function for similarity-based hierarchical clustering. Proceedings of ACM symposium on Theory of Computing.

sknetwork.hierarchy.tree_sampling_divergence(adjacency: scipy.sparse.csr.csr_matrix, dendrogram: numpy.ndarray, weights: str = 'degree', normalized: bool = True) → float[source]

Tree sampling divergence of a hierarchy (quality metric).

  • Graphs

  • Digraphs

Parameters
  • adjacency – Adjacency matrix of the graph.

  • dendrogram – Dendrogram.

  • weights – Weights of nodes. 'degree' (default) or 'uniform'.

  • normalized – If True, normalized score (between 0 and 1).

Returns

score – Score.

Return type

float

Example

>>> from sknetwork.hierarchy import tree_sampling_divergence, Paris
>>> from sknetwork.data import house
>>> paris = Paris()
>>> adjacency = house()
>>> dendrogram = paris.fit_transform(adjacency)
>>> score = tree_sampling_divergence(adjacency, dendrogram)
>>> np.round(score, 2)
0.52

References

Charpentier, B. & Bonald, T. (2019). Tree Sampling Divergence: An Information-Theoretic Metric for Hierarchical Graph Clustering. Proceedings of IJCAI.

Cuts

sknetwork.hierarchy.cut_straight(dendrogram: numpy.ndarray, n_clusters: Optional[int] = None, threshold: Optional[float] = None, sort_clusters: bool = True, return_dendrogram: bool = False) → Union[numpy.ndarray, Tuple[numpy.ndarray, numpy.ndarray]][source]

Cut a dendrogram and return the corresponding clustering.

Parameters
  • dendrogram – Dendrogram.

  • n_clusters – Number of clusters (optional).

  • threshold – Threshold on height (optional). If both n_clusters and threshold are None, n_clusters is set to 2.

  • sort_clusters – If True, sorts clusters in decreasing order of size.

  • return_dendrogram – If True, returns the dendrogram formed by the clusters up to the root.

Returns

  • labels (np.ndarray) – Cluster of each node.

  • dendrogram_aggregate (np.ndarray) – Dendrogram starting from clusters (leaves = clusters).

Example

>>> from sknetwork.hierarchy import cut_straight
>>> dendrogram = np.array([[0, 1, 0, 2], [2, 3, 1, 3]])
>>> cut_straight(dendrogram)
array([0, 0, 1])
sknetwork.hierarchy.cut_balanced(dendrogram: numpy.ndarray, max_cluster_size: int = 20, sort_clusters: bool = True, return_dendrogram: bool = False) → Union[numpy.ndarray, Tuple[numpy.ndarray, numpy.ndarray]][source]

Cuts a dendrogram with a constraint on the cluster size and returns the corresponding clustering.

Parameters
  • dendrogram – Dendrogram

  • max_cluster_size – Maximum size of each cluster.

  • sort_clusters – If True, sort labels in decreasing order of cluster size.

  • return_dendrogram – If True, returns the dendrogram formed by the clusters up to the root.

Returns

  • labels (np.ndarray) – Label of each node.

  • dendrogram_aggregate (np.ndarray) – Dendrogram starting from clusters (leaves = clusters).

Example

>>> from sknetwork.hierarchy import cut_balanced
>>> dendrogram = np.array([[0, 1, 0, 2], [2, 3, 1, 3]])
>>> cut_balanced(dendrogram, 2)
array([0, 0, 1])