# Clustering¶

Clustering algorithms.

The attribute labels_ assigns a label (cluster index) to each node of the graph.

## Louvain¶

Here are the available modularities for the Louvain algorithm:

Modularity

Formula

Remarks

Newman ('newman')

$$Q = \dfrac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \dfrac{d_id_j}{w}\right)\delta_{c_i,c_j}$$

Ignores edge directions

Dugué ('dugue')

$$Q = \dfrac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \dfrac{d^+_id^-_j}{w}\right)\delta_{c_i,c_j}$$

Constant Potts ('potts')

$$Q = \sum_{i,j}\left(\dfrac{A_{ij}}{w} - \gamma \dfrac{1}{n^2}\right)\delta_{c_i,c_j}$$

Ignores edge directions

class sknetwork.clustering.Louvain(resolution: float = 1, modularity: str = 'dugue', tol_optimization: float = 0.001, tol_aggregation: float = 0.001, n_aggregations: int = - 1, shuffle_nodes: bool = False, sort_clusters: bool = True, return_membership: bool = True, return_aggregate: bool = True, random_state: Optional[Union[numpy.random.mtrand.RandomState, int]] = None, verbose: bool = False)[source]

Louvain algorithm for clustering graphs by maximization of modularity.

• Graphs

• Digraphs

Parameters
• resolution – Resolution parameter.

• modularity (str) – Which objective function to maximize. Can be 'dugue', 'newman' or 'potts'.

• tol_optimization – Minimum increase in the objective function to enter a new optimization pass.

• tol_aggregation – Minimum increase in the objective function to enter a new aggregation pass.

• n_aggregations – Maximum number of aggregations. A negative value is interpreted as no limit.

• shuffle_nodes – Enables node shuffling before optimization.

• sort_clusters – If True, sort labels in decreasing order of cluster size.

• return_membership – If True, return the membership matrix of nodes to each cluster (soft clustering).

• return_aggregate – If True, return the adjacency matrix of the graph between clusters.

• random_state – Random number generator or random seed. If None, numpy.random is used.

• verbose – Verbose mode.

Variables
• labels_ (np.ndarray) – Label of each node.

• membership_ (sparse.csr_matrix) – Membership matrix.

• adjacency_ (sparse.csr_matrix) – Adjacency matrix between clusters.

Example

>>> from sknetwork.clustering import Louvain
>>> from sknetwork.data import karate_club
>>> louvain = Louvain()
>>> adjacency = karate_club()
>>> labels = louvain.fit_transform(adjacency)
>>> len(set(labels))
4


References

fit(adjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray])sknetwork.clustering.louvain.Louvain[source]

Fit algorithm to the data.

Parameters

adjacency – Adjacency matrix of the graph.

Returns

self

Return type

Louvain

fit_transform(*args, **kwargs)numpy.ndarray

Fit algorithm to the data and return the labels. Same parameters as the fit method.

Returns

labels – Labels.

Return type

np.ndarray

class sknetwork.clustering.BiLouvain(resolution: float = 1, modularity: str = 'dugue', tol_optimization: float = 0.001, tol_aggregation: float = 0.001, n_aggregations: int = - 1, shuffle_nodes: bool = False, sort_clusters: bool = True, return_membership: bool = True, return_aggregate: bool = True, random_state: Optional[Union[numpy.random.mtrand.RandomState, int]] = None, verbose: bool = False)[source]

BiLouvain algorithm for the clustering of bipartite graphs.

• Bigraphs

Parameters
• resolution – Resolution parameter.

• modularity (str) – Which objective function to maximize. Can be 'dugue', 'newman' or 'potts'.

• tol_optimization – Minimum increase in the objective function to enter a new optimization pass.

• tol_aggregation – Minimum increase in the objective function to enter a new aggregation pass.

• n_aggregations – Maximum number of aggregations. A negative value is interpreted as no limit.

• shuffle_nodes – Enables node shuffling before optimization.

• sort_clusters – If True, sort labels in decreasing order of cluster size.

• return_membership – If True, return the membership matrix of nodes to each cluster (soft clustering).

• return_aggregate – If True, return the biadjacency matrix of the graph between clusters.

• random_state – Random number generator or random seed. If None, numpy.random is used.

• verbose – Verbose mode.

Variables
• labels_ (np.ndarray) – Labels of the rows.

• labels_row_ (np.ndarray) – Labels of the rows (copy of labels_).

• labels_col_ (np.ndarray) – Labels of the columns.

• membership_row_ (sparse.csr_matrix) – Membership matrix of the rows (copy of membership_).

• membership_col_ (sparse.csr_matrix) – Membership matrix of the columns. Only valid if cluster_both = True.

• biadjacency_ (sparse.csr_matrix) – Biadjacency matrix of the aggregate graph between clusters.

Example

>>> from sknetwork.clustering import BiLouvain
>>> from sknetwork.data import movie_actor
>>> bilouvain = BiLouvain()
>>> biadjacency = movie_actor()
>>> labels = bilouvain.fit_transform(biadjacency)
>>> len(labels)
15

fit(biadjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray])sknetwork.clustering.louvain.BiLouvain[source]

Apply the Louvain algorithm to the corresponding directed graph, with adjacency matrix:

$$A = \begin{bmatrix} 0 & B \\ 0 & 0 \end{bmatrix}$$

where $$B$$ is the input (biadjacency matrix).

Parameters

biadjacency – Biadjacency matrix of the graph.

Returns

self

Return type

BiLouvain

fit_transform(*args, **kwargs)numpy.ndarray

Fit algorithm to the data and return the labels. Same parameters as the fit method.

Returns

labels – Labels.

Return type

np.ndarray

## Propagation¶

class sknetwork.clustering.PropagationClustering(n_iter: int = 5, node_order: str = 'decreasing', weighted: bool = True, sort_clusters: bool = True, return_membership: bool = True, return_aggregate: bool = True)[source]

Clustering by label propagation.

• Graphs

• Digraphs

Parameters
• n_iter (int) – Maximum number of iterations (-1 for infinity).

• node_order (str) –

• ‘random’: node labels are updated in random order.

• ’increasing’: node labels are updated by increasing order of weight.

• ’decreasing’: node labels are updated by decreasing order of weight.

• Otherwise, node labels are updated by index order.

• weighted (bool) – If True, the vote of each neighbor is proportional to the edge weight. Otherwise, all votes have weight 1.

• sort_clusters – If True, sort labels in decreasing order of cluster size.

• return_membership – If True, return the membership matrix of nodes to each cluster (soft clustering).

• return_aggregate – If True, return the adjacency matrix of the graph between clusters.

Variables
• labels_ (np.ndarray) – Label of each node.

• membership_ (sparse.csr_matrix) – Membership matrix (columns = labels).

Example

>>> from sknetwork.clustering import PropagationClustering
>>> from sknetwork.data import karate_club
>>> propagation = PropagationClustering()
>>> graph = karate_club(metadata=True)
>>> adjacency = graph.adjacency
>>> labels = propagation.fit_transform(adjacency)
>>> len(set(labels))
2


References

Raghavan, U. N., Albert, R., & Kumara, S. (2007). Near linear time algorithm to detect community structures in large-scale networks. Physical review E, 76(3), 036106.

fit(adjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray])sknetwork.clustering.propagation_clustering.PropagationClustering[source]

Clustering by label propagation.

Parameters

adjacency – Adjacency matrix of the graph.

Returns

self

Return type

PropagationClustering

fit_transform(*args, **kwargs)numpy.ndarray

Fit algorithm to the data and return the labels. Same parameters as the fit method.

Returns

labels – Labels.

Return type

np.ndarray

score(label: int)

Classification scores for a given label.

Parameters

label (int) – The label index of the class.

Returns

scores – Classification scores of shape (number of nodes,).

Return type

np.ndarray

class sknetwork.clustering.BiPropagationClustering(n_iter: int = 5, node_order: str = 'decreasing', weighted: bool = True, sort_clusters: bool = True, return_membership: bool = True, return_aggregate: bool = True)[source]

Clustering by label propagation in bipartite graphs.

• Bigraphs

Parameters
• n_iter – Maximum number of iteration (-1 for infinity).

• sort_clusters – If True, sort labels in decreasing order of cluster size.

• return_membership – If True, return the membership matrix of nodes to each cluster (soft clustering).

• return_aggregate – If True, return the biadjacency matrix of the graph between clusters.

Variables
• labels_ (np.ndarray) – Label of each row.

• labels_row_ (np.ndarray) – Label of each row (copy of labels_).

• labels_col_ (np.ndarray) – Label of each column.

• membership_ (sparse.csr_matrix) – Membership matrix of rows.

• membership_row_ (sparse.csr_matrix) – Membership matrix of rows (copy of membership_).

• membership_col_ (sparse.csr_matrix) – Membership matrix of columns.

Example

>>> from sknetwork.clustering import BiPropagationClustering
>>> from sknetwork.data import movie_actor
>>> bipropagation = BiPropagationClustering()
>>> graph = movie_actor(metadata=True)
>>> biadjacency = graph.biadjacency
>>> len(bipropagation.fit_transform(biadjacency))
15
>>> len(bipropagation.labels_col_)
16

fit(biadjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray])sknetwork.clustering.propagation_clustering.BiPropagationClustering[source]

Clustering.

Parameters

biadjacency – Biadjacency matrix of the graph.

Returns

self

Return type

BiPropagationClustering

fit_transform(*args, **kwargs)numpy.ndarray

Fit algorithm to the data and return the labels. Same parameters as the fit method.

Returns

labels – Labels.

Return type

np.ndarray

score(label: int)

Classification scores for a given label.

Parameters

label (int) – The label index of the class.

Returns

scores – Classification scores of shape (number of nodes,).

Return type

np.ndarray

## K-Means¶

class sknetwork.clustering.KMeans(n_clusters: int = 8, embedding_method: sknetwork.embedding.base.BaseEmbedding = GSVD(n_components=10, regularization=None, relative_regularization=True, factor_row=0.5, factor_col=0.5, factor_singular=0.0, normalized=True, solver='auto'), sort_clusters: bool = True, return_membership: bool = True, return_aggregate: bool = True)[source]

K-means applied in the embedding space.

• Graphs

• Digraphs

Parameters
• n_clusters – Number of desired clusters.

• embedding_method – Embedding method (default = GSVD in dimension 10, projected on the unit sphere).

• sort_clusters – If True, sort labels in decreasing order of cluster size.

• return_membership – If True, return the membership matrix of nodes to each cluster (soft clustering).

• return_aggregate – If True, return the adjacency matrix of the graph between clusters.

Variables
• labels_ (np.ndarray) – Label of each node.

• membership_ (sparse.csr_matrix) – Membership matrix.

• adjacency_ (sparse.csr_matrix) – Adjacency matrix between clusters.

Example

>>> from sknetwork.clustering import KMeans
>>> from sknetwork.data import karate_club
>>> kmeans = KMeans(n_clusters=3)
>>> adjacency = karate_club()
>>> labels = kmeans.fit_transform(adjacency)
>>> len(set(labels))
3

fit(adjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray])sknetwork.clustering.kmeans.KMeans[source]

Apply embedding method followed by K-means.

Parameters

adjacency – Adjacency matrix of the graph.

Returns

self

Return type

KMeans

fit_transform(*args, **kwargs)numpy.ndarray

Fit algorithm to the data and return the labels. Same parameters as the fit method.

Returns

labels – Labels.

Return type

np.ndarray

class sknetwork.clustering.BiKMeans(n_clusters: int = 2, embedding_method: sknetwork.embedding.base.BaseBiEmbedding = GSVD(n_components=10, regularization=None, relative_regularization=True, factor_row=0.5, factor_col=0.5, factor_singular=0.0, normalized=True, solver='auto'), co_cluster: bool = False, sort_clusters: bool = True, return_membership: bool = True, return_aggregate: bool = True)[source]

KMeans clustering of bipartite graphs applied in the embedding space.

• Bigraphs

Parameters
• n_clusters – Number of clusters.

• embedding_method – Embedding method (default = GSVD in dimension 10, projected on the unit sphere).

• co_cluster – If True, co-cluster rows and columns (default = False).

• sort_clusters – If True, sort labels in decreasing order of cluster size.

• return_membership – If True, return the membership matrix of nodes to each cluster (soft clustering).

• return_aggregate – If True, return the biadjacency matrix of the graph between clusters.

Variables
• labels_ (np.ndarray) – Labels of the rows.

• labels_row_ (np.ndarray) – Labels of the rows (copy of labels_).

• labels_col_ (np.ndarray) – Labels of the columns. Only valid if co_cluster = True.

• membership_ (sparse.csr_matrix) – Membership matrix of the rows.

• membership_row_ (sparse.csr_matrix) – Membership matrix of the rows (copy of membership_).

• membership_col_ (sparse.csr_matrix) – Membership matrix of the columns. Only valid if co_cluster = True.

• biadjacency_ (sparse.csr_matrix) – Biadjacency matrix of the graph between clusters.

Example

>>> from sknetwork.clustering import BiKMeans
>>> from sknetwork.data import movie_actor
>>> bikmeans = BiKMeans()
>>> biadjacency = movie_actor()
>>> labels = bikmeans.fit_transform(biadjacency)
>>> len(labels)
15

fit(biadjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray])sknetwork.clustering.kmeans.BiKMeans[source]

Apply embedding method followed by clustering to the graph.

Parameters

biadjacency – Biadjacency matrix of the graph.

Returns

self

Return type

BiKMeans

fit_transform(*args, **kwargs)numpy.ndarray

Fit algorithm to the data and return the labels. Same parameters as the fit method.

Returns

labels – Labels.

Return type

np.ndarray

## Metrics¶

sknetwork.clustering.modularity(adjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray], labels: numpy.ndarray, weights: Union[str, numpy.ndarray] = 'degree', weights_in: Union[str, numpy.ndarray] = 'degree', resolution: float = 1, return_all: bool = False)Union[float, Tuple[float, float, float]][source]

Modularity of a clustering (node partition).

• Graphs

• Digraphs

The modularity of a clustering is

$$Q = \dfrac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \dfrac{d_id_j}{w}\right)\delta_{c_i,c_j}$$ for graphs,

$$Q = \dfrac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \dfrac{d^+_id^-_j}{w}\right)\delta_{c_i,c_j}$$ for digraphs,

where

• $$c_i$$ is the cluster of node $$i$$,

• $$d_i$$ is the weight of node $$i$$,

• $$d^+_i, d^-_i$$ are the out-weight, in-weight of node $$i$$ (for digraphs),

• $$w = 1^TA1$$ is the total weight,

• $$\delta$$ is the Kronecker symbol,

• $$\gamma \ge 0$$ is the resolution parameter.

Parameters
• adjacency – Adjacency matrix of the graph.

• labels – Labels of nodes, vector of size $$n$$ .

• weights – Weights of nodes. 'degree' (default), 'uniform' or custom weights.

• weights_in – In-weights of nodes. None (default), 'degree', 'uniform' or custom weights. If None, taken equal to weights.

• resolution – Resolution parameter (default = 1).

• return_all – If True, return modularity, fit, diversity.

Returns

• modularity (float)

• fit (float, optional)

• diversity (float, optional)

Example

>>> from sknetwork.clustering import modularity
>>> from sknetwork.data import house
>>> adjacency = house()
>>> labels = np.array([0, 0, 1, 1, 0])
>>> np.round(modularity(adjacency, labels), 2)
0.11

sknetwork.clustering.bimodularity(biadjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray], labels: numpy.ndarray, labels_col: numpy.ndarray, weights: Union[str, numpy.ndarray] = 'degree', weights_col: Union[str, numpy.ndarray] = 'degree', resolution: float = 1, return_all: bool = False)Union[float, Tuple[float, float, float]][source]

Bimodularity of a clustering (node partition).

• Bigraphs

The bimodularity of a clustering is

$$Q = \sum_{i}\sum_{j}\left(\dfrac{B_{ij}}{w} - \gamma \dfrac{d_{1,i}d_{2,j}}{w^2}\right) \delta_{c_{1,i},c_{2,j}}$$

where

• $$c_{1,i}, c_{2,j}$$ are the clusters of nodes $$i$$ (row) and $$j$$ (column),

• $$d_{1,i}, d_{2,j}$$ are the weights of nodes $$i$$ (row) and $$j$$ (column),

• $$w = 1^TB1$$ is the total weight,

• $$\delta$$ is the Kronecker symbol,

• $$\gamma \ge 0$$ is the resolution parameter.

Parameters
• biadjacency – Biadjacency matrix of the graph (shape $$n_1 \times n_2$$).

• labels – Labels of rows, vector of size $$n_1$$.

• labels_col – Labels of columns, vector of size $$n_2$$.

• weights – Weights of nodes. 'degree' (default), 'uniform' or custom weights.

• weights_col – Weights of columns. 'degree' (default), 'uniform' or custom weights.

• resolution – Resolution parameter (default = 1).

• return_all – If True, return modularity, fit, diversity.

Returns

• modularity (float)

• fit (float, optional)

• diversity (float, optional)

Example

>>> from sknetwork.clustering import bimodularity
>>> from sknetwork.data import star_wars
>>> biadjacency = star_wars()
>>> labels = np.array([1, 1, 0, 0])
>>> labels_col = np.array([1, 0, 0])
>>> np.round(bimodularity(biadjacency, labels, labels_col), 2)
0.22

sknetwork.clustering.comodularity(adjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray], labels: numpy.ndarray, resolution: float = 1, return_all: bool = False)Union[float, Tuple[float, float, float]][source]

Modularity of a clustering in the normalized co-neighborhood graph.

• Graphs

• Digraphs

• Bigraphs

Quality metric of a clustering given by:

$$Q = \dfrac{1}{w}\sum_{i,j}\left((AD_2^{-1}A^T)_{ij} - \gamma \dfrac{d_id_j}{w}\right) \delta_{c_i,c_j}$$

where

• $$c_i$$ is the cluster of node i,

• $$\delta$$ is the Kronecker symbol,

• $$\gamma \ge 0$$ is the resolution parameter.

Parameters
• adjacency – Adjacency matrix of the graph.

• labels – Labels of the nodes.

• resolution – Resolution parameter (default = 1).

• return_all – If True, return modularity, fit, diversity.

Returns

• modularity (float)

• fit (float, optional)

• diversity (float, optional)

Example

>>> from sknetwork.clustering import comodularity
>>> from sknetwork.data import house
>>> adjacency = house()
>>> labels = np.array([0, 0, 1, 1, 0])
>>> np.round(comodularity(adjacency, labels), 2)
0.06


Notes

Does not require the computation of the adjacency matrix of the normalized co-neighborhood graph.

sknetwork.clustering.normalized_std(labels: numpy.ndarray)float[source]

Normalized standard deviation of cluster sizes.

A score of 1 means perfectly balanced clustering.

Parameters

labels – Labels of nodes.

Returns

Return type

float

Example

>>> from sknetwork.clustering import normalized_std
>>> labels = np.array([0, 0, 1, 1])
>>> normalized_std(labels)
1.0


## Post-processing¶

sknetwork.clustering.postprocess.reindex_labels(labels: numpy.ndarray, consecutive: bool = True)numpy.ndarray[source]

Reindex clusters in decreasing order of size.

Parameters
• labels – label of each node.

• consecutive – If True, the set of labels must be from 0 to $$k - 1$$, where $$k$$ is the number of labels. Lead to faster computation.

Returns

new_labels – New label of each node.

Return type

np.ndarray

Example

>>> from sknetwork.clustering import reindex_labels
>>> labels = np.array([0, 1, 1])
>>> reindex_labels(labels)
array([1, 0, 0])