Hierarchy
Hierarchical clustering algorithms.
The attribute dendrogram_
gives the dendrogram.
A dendrogram is an array of size \((n-1) \times 4\) representing the successive merges of nodes. Each row gives the two merged nodes, their distance and the size of the resulting cluster. Any new node resulting from a merge takes the first available index (e.g., the first merge corresponds to node \(n\)).
Paris
- class sknetwork.hierarchy.Paris(weights: str = 'degree', reorder: bool = True)
Agglomerative clustering algorithm that performs greedy merge of nodes based on their similarity.
The similarity between nodes \(i,j\) is \(\dfrac{A_{ij}}{w_i w_j}\) where
\(A_{ij}\) is the weight of edge \(i,j\),
\(w_i, w_j\) are the weights of nodes \(i,j\)
If the input matrix \(B\) is a biadjacency matrix (i.e., rectangular), the algorithm is applied to the corresponding adjacency matrix \(A = \begin{bmatrix} 0 & B \\ B^T & 0 \end{bmatrix}\)
- Parameters:
weights (str) – Weights of nodes.
'degree'
(default) or'uniform'
.reorder (bool) – If
True
(default), reorder the dendrogram in non-decreasing order of height.
- Variables:
dendrogram (np.ndarray) – Dendrogram of the graph.
dendrogram_row (np.ndarray) – Dendrogram for the rows, for bipartite graphs.
dendrogram_col (np.ndarray) – Dendrogram for the columns, for bipartite graphs.
dendrogram_full (np.ndarray) – Dendrogram for both rows and columns, indexed in this order, for bipartite graphs.
Examples
>>> from sknetwork.hierarchy import Paris >>> from sknetwork.data import house >>> paris = Paris() >>> adjacency = house() >>> dendrogram = paris.fit_predict(adjacency) >>> np.round(dendrogram, 2) array([[3. , 2. , 0.17 , 2. ], [1. , 0. , 0.25 , 2. ], [6. , 4. , 0.31 , 3. ], [7. , 5. , 0.67 , 5. ]])
Notes
Each row of the dendrogram = \(i, j\), distance, size of cluster \(i + j\).
See also
scipy.cluster.hierarchy.linkage
References
T. Bonald, B. Charpentier, A. Galland, A. Hollocou (2018). Hierarchical Graph Clustering using Node Pair Sampling. Workshop on Mining and Learning with Graphs.
- fit(input_matrix: csr_matrix | ndarray, force_bipartite: bool = False) Paris
Agglomerative clustering using the nearest neighbor chain.
- Parameters:
input_matrix (sparse.csr_matrix, np.ndarray) – Adjacency matrix or biadjacency matrix of the graph.
force_bipartite – If
True
, force the input matrix to be considered as a biadjacency matrix.
- Returns:
self
- Return type:
- fit_predict(*args, **kwargs) ndarray
Fit algorithm to data and return the dendrogram. Same parameters as the
fit
method.- Returns:
dendrogram – Dendrogram.
- Return type:
np.ndarray
- fit_transform(*args, **kwargs) ndarray
Fit algorithm to data and return the dendrogram. Alias for
fit_predict
. Same parameters as thefit
method.- Returns:
dendrogram – Dendrogram.
- Return type:
np.ndarray
- get_params()
Get parameters as dictionary.
- Returns:
params – Parameters of the algorithm.
- Return type:
dict
- predict(columns: bool = False) ndarray
Return the dendrogram predicted by the algorithm.
- Parameters:
columns (bool) – If
True
, return the prediction for columns.- Returns:
dendrogram – Dendrogram.
- Return type:
np.ndarray
- set_params(params: dict) Algorithm
Set parameters of the algorithm.
- Parameters:
params (dict) – Parameters of the algorithm.
- Returns:
self
- Return type:
Algorithm
- transform() ndarray
Return the dendrogram predicted by the algorithm.
- Returns:
dendrogram – Dendrogram.
- Return type:
np.ndarray
Louvain
- class sknetwork.hierarchy.LouvainHierarchy(resolution: float = 1, tol_optimization: float = 0.001, tol_aggregation: float = 0.001, shuffle_nodes: bool = False, random_state: RandomState | int | None = None, verbose: bool = False)[source]
Hierarchical clustering by Louvain (bottom-up).
Each level corresponds to an aggregation step of the Louvain algorithm.
- Parameters:
resolution (float) – Resolution parameter.
tol_optimization (float) – Minimum increase in the objective function to enter a new optimization pass.
tol_aggregation (float) – Minimum increase in the objective function to enter a new aggregation pass.
shuffle_nodes (bool) – If
True
, shuffle nodes before optimization.random_state (int) – Random number generator or random seed. If
None
, numpy.random is used.verbose (bool) – Verbose mode.
- Variables:
dendrogram (np.ndarray) – Dendrogram of the graph.
dendrogram_row (np.ndarray) – Dendrogram for the rows, for bipartite graphs.
dendrogram_col (np.ndarray) – Dendrogram for the columns, for bipartite graphs.
dendrogram_full (np.ndarray) – Dendrogram for both rows and columns, indexed in this order, for bipartite graphs.
Example
>>> from sknetwork.hierarchy import LouvainHierarchy >>> from sknetwork.data import house >>> louvain = LouvainHierarchy() >>> adjacency = house() >>> louvain.fit_predict(adjacency) array([[3., 2., 1., 2.], [4., 1., 1., 2.], [6., 0., 1., 3.], [5., 7., 2., 5.]])
Notes
Each row of the dendrogram = merge nodes, distance, size of cluster.
See also
scipy.cluster.hierarchy.dendrogram
,sknetwork.clustering.Louvain
- fit(input_matrix: csr_matrix | ndarray, force_bipartite: bool = False) LouvainHierarchy [source]
Fit algorithm to data.
- Parameters:
input_matrix (sparse.csr_matrix, np.ndarray) – Adjacency matrix or biadjacency matrix of the graph.
force_bipartite – If
True
, force the input matrix to be considered as a biadjacency matrix.
- Returns:
self
- Return type:
- fit_predict(*args, **kwargs) ndarray
Fit algorithm to data and return the dendrogram. Same parameters as the
fit
method.- Returns:
dendrogram – Dendrogram.
- Return type:
np.ndarray
- fit_transform(*args, **kwargs) ndarray
Fit algorithm to data and return the dendrogram. Alias for
fit_predict
. Same parameters as thefit
method.- Returns:
dendrogram – Dendrogram.
- Return type:
np.ndarray
- get_params()
Get parameters as dictionary.
- Returns:
params – Parameters of the algorithm.
- Return type:
dict
- predict(columns: bool = False) ndarray
Return the dendrogram predicted by the algorithm.
- Parameters:
columns (bool) – If
True
, return the prediction for columns.- Returns:
dendrogram – Dendrogram.
- Return type:
np.ndarray
- set_params(params: dict) Algorithm
Set parameters of the algorithm.
- Parameters:
params (dict) – Parameters of the algorithm.
- Returns:
self
- Return type:
Algorithm
- transform() ndarray
Return the dendrogram predicted by the algorithm.
- Returns:
dendrogram – Dendrogram.
- Return type:
np.ndarray
- class sknetwork.hierarchy.LouvainIteration(depth: int = 3, resolution: float = 1, tol_optimization: float = 0.001, tol_aggregation: float = 0.001, n_aggregations: int = -1, shuffle_nodes: bool = False, random_state: RandomState | int | None = None, verbose: bool = False)[source]
Hierarchical clustering by successive instances of Louvain (top-down).
- Parameters:
depth (int) – Depth of the tree. A negative value is interpreted as no limit (return a tree of maximum depth).
resolution (float) – Resolution parameter.
tol_optimization (float) – Minimum increase in the objective function to enter a new optimization pass.
tol_aggregation (float) – Minimum increase in the objective function to enter a new aggregation pass.
n_aggregations (int) – Maximum number of aggregations. A negative value is interpreted as no limit.
shuffle_nodes (bool) – If
True
, shuffle nodes before optimization.random_state (int) – Random number generator or random seed. If
None
, numpy.random is used.verbose (bool) – Verbose mode.
- Variables:
dendrogram (np.ndarray) – Dendrogram of the graph.
dendrogram_row (np.ndarray) – Dendrogram for the rows, for bipartite graphs.
dendrogram_col (np.ndarray) – Dendrogram for the columns, for bipartite graphs.
dendrogram_full (np.ndarray) – Dendrogram for both rows and columns, indexed in this order, for bipartite graphs.
Example
>>> from sknetwork.hierarchy import LouvainIteration >>> from sknetwork.data import house >>> louvain = LouvainIteration() >>> adjacency = house() >>> louvain.fit_predict(adjacency) array([[3., 2., 1., 2.], [4., 1., 1., 2.], [6., 0., 1., 3.], [5., 7., 2., 5.]])
Notes
Each row of the dendrogram = merge nodes, distance, size of cluster.
See also
scipy.cluster.hierarchy.dendrogram
,sknetwork.clustering.Louvain
- fit(input_matrix: csr_matrix | ndarray, force_bipartite: bool = False) LouvainIteration [source]
Fit algorithm to data.
- Parameters:
input_matrix (sparse.csr_matrix, np.ndarray) – Adjacency matrix or biadjacency matrix of the graph.
force_bipartite – If
True
, force the input matrix to be considered as a biadjacency matrix.
- Returns:
self
- Return type:
- fit_predict(*args, **kwargs) ndarray
Fit algorithm to data and return the dendrogram. Same parameters as the
fit
method.- Returns:
dendrogram – Dendrogram.
- Return type:
np.ndarray
- fit_transform(*args, **kwargs) ndarray
Fit algorithm to data and return the dendrogram. Alias for
fit_predict
. Same parameters as thefit
method.- Returns:
dendrogram – Dendrogram.
- Return type:
np.ndarray
- get_params()
Get parameters as dictionary.
- Returns:
params – Parameters of the algorithm.
- Return type:
dict
- predict(columns: bool = False) ndarray
Return the dendrogram predicted by the algorithm.
- Parameters:
columns (bool) – If
True
, return the prediction for columns.- Returns:
dendrogram – Dendrogram.
- Return type:
np.ndarray
- set_params(params: dict) Algorithm
Set parameters of the algorithm.
- Parameters:
params (dict) – Parameters of the algorithm.
- Returns:
self
- Return type:
Algorithm
- transform() ndarray
Return the dendrogram predicted by the algorithm.
- Returns:
dendrogram – Dendrogram.
- Return type:
np.ndarray
Metrics
- sknetwork.hierarchy.dasgupta_cost(adjacency: csr_matrix, dendrogram: ndarray, weights: str = 'uniform', normalized: bool = False) float [source]
Dasgupta’s cost of a hierarchy.
Expected size (weights =
'uniform'
) or expected volume (weights ='degree'
) of the cluster induced by random edge sampling (closest ancestor of the two nodes in the hierarchy).- Parameters:
adjacency – Adjacency matrix of the graph.
dendrogram – Dendrogram.
weights – Weights of nodes.
'degree'
or'uniform'
(default).normalized – If
True
, normalized cost (between 0 and 1).
- Returns:
cost – Cost.
- Return type:
float
Example
>>> from sknetwork.hierarchy import dasgupta_score, Paris >>> from sknetwork.data import house >>> paris = Paris() >>> adjacency = house() >>> dendrogram = paris.fit_transform(adjacency) >>> cost = dasgupta_cost(adjacency, dendrogram) >>> float(np.round(cost, 2)) 3.33
References
Dasgupta, S. (2016). A cost function for similarity-based hierarchical clustering. Proceedings of ACM symposium on Theory of Computing.
- sknetwork.hierarchy.dasgupta_score(adjacency: csr_matrix, dendrogram: ndarray, weights: str = 'uniform') float [source]
Dasgupta’s score of a hierarchy (quality metric, between 0 and 1).
Defined as 1 - normalized Dasgupta’s cost.
- Parameters:
adjacency – Adjacency matrix of the graph.
dendrogram – Dendrogram.
weights – Weights of nodes.
'degree'
or'uniform'
(default).
- Returns:
score – Score.
- Return type:
float
Example
>>> from sknetwork.hierarchy import dasgupta_score, Paris >>> from sknetwork.data import house >>> paris = Paris() >>> adjacency = house() >>> dendrogram = paris.fit_transform(adjacency) >>> score = dasgupta_score(adjacency, dendrogram) >>> float(np.round(score, 2)) 0.33
References
Dasgupta, S. (2016). A cost function for similarity-based hierarchical clustering. Proceedings of ACM symposium on Theory of Computing.
- sknetwork.hierarchy.tree_sampling_divergence(adjacency: csr_matrix, dendrogram: ndarray, weights: str = 'degree', normalized: bool = True) float [source]
Tree sampling divergence of a hierarchy (quality metric).
- Parameters:
adjacency – Adjacency matrix of the graph.
dendrogram – Dendrogram.
weights – Weights of nodes.
'degree'
(default) or'uniform'
.normalized – If
True
, normalized score (between 0 and 1).
- Returns:
score – Score.
- Return type:
float
Example
>>> from sknetwork.hierarchy import tree_sampling_divergence, Paris >>> from sknetwork.data import house >>> paris = Paris() >>> adjacency = house() >>> dendrogram = paris.fit_transform(adjacency) >>> score = tree_sampling_divergence(adjacency, dendrogram) >>> float(np.round(score, 2)) 0.05
References
Charpentier, B. & Bonald, T. (2019). Tree Sampling Divergence: An Information-Theoretic Metric for Hierarchical Graph Clustering. Proceedings of IJCAI.
Cuts
- sknetwork.hierarchy.cut_straight(dendrogram: ndarray, n_clusters: int | None = None, threshold: None | float = None, sort_clusters: bool = True, return_dendrogram: bool = False) ndarray | Tuple[ndarray, ndarray] [source]
Cut a dendrogram and return the corresponding clustering.
- Parameters:
dendrogram (np.ndarray) – Dendrogram.
n_clusters (int) – Number of clusters (optional). The number of clusters can be larger than n_clusters in case of equal heights in the dendrogram.
threshold (float) – Threshold on height (optional). If both n_clusters and threshold are
None
, n_clusters is set to 2.sort_clusters (bool) – If
True
, sorts clusters in decreasing order of size.return_dendrogram (bool) – If
True
, returns the dendrogram formed by the clusters up to the root.
- Returns:
labels (np.ndarray) – Cluster of each node.
dendrogram_aggregate (np.ndarray) – Dendrogram starting from clusters (leaves = clusters).
Example
>>> from sknetwork.hierarchy import cut_straight >>> dendrogram = np.array([[0, 1, 0, 2], [2, 3, 1, 3]]) >>> cut_straight(dendrogram) array([0, 0, 1])
- sknetwork.hierarchy.cut_balanced(dendrogram: ndarray, max_cluster_size: int = 20, sort_clusters: bool = True, return_dendrogram: bool = False) ndarray | Tuple[ndarray, ndarray] [source]
Cuts a dendrogram with a constraint on the cluster size and returns the corresponding clustering.
- Parameters:
dendrogram (np.ndarray) – Dendrogram
max_cluster_size (int) – Maximum size of each cluster.
sort_clusters (bool) – If
True
, sort labels in decreasing order of cluster size.return_dendrogram (bool) – If
True
, returns the dendrogram formed by the clusters up to the root.
- Returns:
labels (np.ndarray) – Label of each node.
dendrogram_aggregate (np.ndarray) – Dendrogram starting from clusters (leaves = clusters).
Example
>>> from sknetwork.hierarchy import cut_balanced >>> dendrogram = np.array([[0, 1, 0, 2], [2, 3, 1, 3]]) >>> cut_balanced(dendrogram, 2) array([0, 0, 1])
Dendrograms
- sknetwork.hierarchy.aggregate_dendrogram(dendrogram: ndarray, n_clusters: int = 2, return_counts: bool = False) ndarray | Tuple[ndarray, ndarray] [source]
Aggregate a dendrogram in order to get a certain number of leaves. The leaves in the output dendrogram correspond to subtrees in the input one.
- Parameters:
dendrogram (np.ndarray) – The input to aggregate.
n_clusters (int) – Number of clusters (or leaves) to keep.
return_counts (bool) – If
True
, returns an array of counts corresponding to the sizes of the merged subtrees. The sum of the counts is equal to the number of samples in the input dendrogram.
- Returns:
new_dendrogram (np.ndarray) – Aggregated dendrogram. The nodes are reindexed from 0.
counts (np.ndarray) – Size of the subtrees corresponding to each leaf in new_dendrogram.