# Clustering

Clustering algorithms.

The attribute labels_ assigns a label (cluster index) to each node of the graph.

## Louvain

The Louvain algorithm aims at maximizing the modularity. Several variants of modularity are available:

Modularity

Formula

Newman ('newman')

$$Q = \frac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \frac{d_id_j}{w}\right)\delta_{c_i,c_j}$$

Dugué ('dugue')

$$Q = \frac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \frac{d^+_id^-_j}{w}\right)\delta_{c_i,c_j}$$

Potts ('potts')

$$Q = \sum_{i,j}\left(\frac{A_{ij}}{w} - \gamma \frac{1}{n^2}\right)\delta_{c_i,c_j}$$

where
• $$A$$ is the adjacency matrix,

• $$c_i$$ is the cluster of node $$i$$,

• $$d_i$$ is the degree of node $$i$$,

• $$d^+_i, d^-_i$$ are the out-degree, in-degree of node $$i$$ (for directed graphs),

• $$w = 1^TA1$$ is the sum of degrees,

• $$\delta$$ is the Kronecker symbol,

• $$\gamma \ge 0$$ is the resolution parameter.

Observe that for undirected graphs, the Newman and Dugué variants are equivalent.

For bipartite graphs, the considered adjacency matrix $$A$$ depends on the modularity. For the Newman and Potts variants, the graph is considered as undirected so that:

$$A = \begin{pmatrix} 0 & B\\B^T & 0\end{pmatrix}$$

For the Dugué variant, the graph is considered as directed so that:

$$A = \begin{pmatrix} 0 & B\\0 & 0\end{pmatrix}$$

This is the default option and corresponds to Barber’s modularity (see reference below).

When the graph is weighted, the degree of a node is replaced by its weight (sum of edge weights).

class sknetwork.clustering.Louvain(resolution: float = 1, modularity: str = 'dugue', tol_optimization: float = 0.001, tol_aggregation: float = 0.001, n_aggregations: int = -1, shuffle_nodes: bool = False, sort_clusters: bool = True, return_probs: bool = True, return_aggregate: bool = True, random_state: RandomState | int | None = None, verbose: bool = False)[source]

Louvain algorithm for clustering graphs by maximization of modularity.

For bipartite graphs, the algorithm maximizes Barber’s modularity by default.

Parameters:
• resolution – Resolution parameter.

• modularity (str) – Type of modularity to maximize. Can be 'Dugue', 'Newman' or 'Potts' (default = 'dugue').

• tol_optimization – Minimum increase in modularity to enter a new optimization pass in the local search.

• tol_aggregation – Minimum increase in modularity to enter a new aggregation pass.

• n_aggregations – Maximum number of aggregations. A negative value is interpreted as no limit.

• shuffle_nodes – Enables node shuffling before optimization.

• sort_clusters – If True, sort labels in decreasing order of cluster size.

• return_probs – If True, return the probability distribution over clusters (soft clustering).

• return_aggregate – If True, return the adjacency matrix of the graph between clusters.

• random_state – Random number generator or random seed. If None, numpy.random is used.

• verbose – Verbose mode.

Variables:
• labels (np.ndarray, shape (n_labels,)) – Label of each node.

• probs (sparse.csr_matrix, shape (n_row, n_labels)) – Probability distribution over labels.

• labels_col (labels_row,) – Labels of rows and columns, for bipartite graphs.

• probs_col (probs_row,) – Probability distributions over labels for rows and columns (for bipartite graphs).

• aggregate (sparse.csr_matrix) – Aggregate adjacency matrix or biadjacency matrix between clusters.

Example

>>> from sknetwork.clustering import Louvain
>>> from sknetwork.data import karate_club
>>> louvain = Louvain()
>>> len(set(labels))
4


References

fit(input_matrix: csr_matrix | ndarray, force_bipartite: bool = False) [source]

Fit algorithm to data.

Parameters:

• force_bipartite – If True, force the input matrix to be considered as a biadjacency matrix even if square.

Returns:

self

Return type:

Louvain

fit_predict(*args, **kwargs) ndarray

Fit algorithm to the data and return the labels. Same parameters as the fit method.

Returns:

labels – Labels.

Return type:

np.ndarray

fit_predict_proba(*args, **kwargs) ndarray

Fit algorithm to the data and return the probability distribution over labels. Same parameters as the fit method.

Returns:

probs – Probability of each label.

Return type:

np.ndarray

fit_transform(*args, **kwargs) ndarray

Fit algorithm to the data and return the membership matrix. Same parameters as the fit method.

Returns:

membership – Membership matrix (distribution over clusters).

Return type:

np.ndarray

get_params()

Get parameters as dictionary.

Returns:

params – Parameters of the algorithm.

Return type:

dict

predict(columns=False) ndarray

Return the labels predicted by the algorithm.

Parameters:

columns (bool) – If True, return the prediction for columns.

Returns:

labels – Labels.

Return type:

np.ndarray

predict_proba(columns=False) ndarray

Return the probability distribution over labels as predicted by the algorithm.

Parameters:

columns (bool) – If True, return the prediction for columns.

Returns:

probs – Probability distribution over labels.

Return type:

np.ndarray

print_log(*args)

Fill log with text.

set_params(params: dict) Algorithm

Set parameters of the algorithm.

Parameters:

params (dict) – Parameters of the algorithm.

Returns:

self

Return type:

Algorithm

transform(columns=False) csr_matrix

Return the probability distribution over labels in sparse format.

Parameters:

columns (bool) – If True, return the prediction for columns.

Returns:

probs – Probability distribution over labels.

Return type:

sparse.csr_matrix

## Leiden

class sknetwork.clustering.Leiden(resolution: float = 1, modularity: str = 'dugue', tol_optimization: float = 0.001, tol_aggregation: float = 0.001, n_aggregations: int = -1, shuffle_nodes: bool = False, sort_clusters: bool = True, return_probs: bool = True, return_aggregate: bool = True, random_state: RandomState | int | None = None, verbose: bool = False)[source]

Leiden algorithm for clustering graphs by maximization of modularity. Compared to the Louvain algorithm, the partition is refined before each aggregation.

For bipartite graphs, the algorithm maximizes Barber’s modularity by default.

Parameters:
• resolution – Resolution parameter.

• modularity (str) – Type of modularity to maximize. Can be 'Dugue', 'Newman' or 'Potts' (default = 'dugue').

• tol_optimization – Minimum increase in modularity to enter a new optimization pass in the local search.

• tol_aggregation – Minimum increase in modularity to enter a new aggregation pass.

• n_aggregations – Maximum number of aggregations. A negative value is interpreted as no limit.

• shuffle_nodes – Enables node shuffling before optimization.

• sort_clusters – If True, sort labels in decreasing order of cluster size.

• return_probs – If True, return the probability distribution over clusters (soft clustering).

• return_aggregate – If True, return the adjacency matrix of the graph between clusters.

• random_state – Random number generator or random seed. If None, numpy.random is used.

• verbose – Verbose mode.

Variables:
• labels (np.ndarray, shape (n_labels,)) – Label of each node.

• probs (sparse.csr_matrix, shape (n_row, n_labels)) – Probability distribution over labels.

• labels_col (labels_row,) – Labels of rows and columns, for bipartite graphs.

• probs_col (probs_row,) – Probability distributions over labels for rows and columns (for bipartite graphs).

• aggregate (sparse.csr_matrix) – Aggregate adjacency matrix or biadjacency matrix between clusters.

Example

>>> from sknetwork.clustering import Leiden
>>> from sknetwork.data import karate_club
>>> leiden = Leiden()
>>> len(set(labels))
4


References

• Traag, V. A., Waltman, L., & Van Eck, N. J. (2019).

From Louvain to Leiden: guaranteeing well-connected communities, Scientific reports.

fit(input_matrix: csr_matrix | ndarray, force_bipartite: bool = False) [source]

Fit algorithm to data.

Parameters:

• force_bipartite – If True, force the input matrix to be considered as a biadjacency matrix even if square.

Returns:

self

Return type:

Leiden

fit_predict(*args, **kwargs) ndarray

Fit algorithm to the data and return the labels. Same parameters as the fit method.

Returns:

labels – Labels.

Return type:

np.ndarray

fit_predict_proba(*args, **kwargs) ndarray

Fit algorithm to the data and return the probability distribution over labels. Same parameters as the fit method.

Returns:

probs – Probability of each label.

Return type:

np.ndarray

fit_transform(*args, **kwargs) ndarray

Fit algorithm to the data and return the membership matrix. Same parameters as the fit method.

Returns:

membership – Membership matrix (distribution over clusters).

Return type:

np.ndarray

get_params()

Get parameters as dictionary.

Returns:

params – Parameters of the algorithm.

Return type:

dict

predict(columns=False) ndarray

Return the labels predicted by the algorithm.

Parameters:

columns (bool) – If True, return the prediction for columns.

Returns:

labels – Labels.

Return type:

np.ndarray

predict_proba(columns=False) ndarray

Return the probability distribution over labels as predicted by the algorithm.

Parameters:

columns (bool) – If True, return the prediction for columns.

Returns:

probs – Probability distribution over labels.

Return type:

np.ndarray

print_log(*args)

Fill log with text.

set_params(params: dict) Algorithm

Set parameters of the algorithm.

Parameters:

params (dict) – Parameters of the algorithm.

Returns:

self

Return type:

Algorithm

transform(columns=False) csr_matrix

Return the probability distribution over labels in sparse format.

Parameters:

columns (bool) – If True, return the prediction for columns.

Returns:

probs – Probability distribution over labels.

Return type:

sparse.csr_matrix

## K-centers

class sknetwork.clustering.KCenters(n_clusters: int, directed: bool = False, center_position: str = 'row', n_init: int = 5, max_iter: int = 20)[source]

K-center clustering algorithm. The center of each cluster is obtained by the PageRank algorithm.

Parameters:
• n_clusters (int) – Number of clusters.

• directed (bool, default False) – If True, the graph is considered directed.

• center_position (str, default "row") – Force centers to correspond to the nodes on the rows or columns of the biadjacency matrix. Can be row, col or both. Only considered for bipartite graphs.

• n_init (int, default 5) – Number of reruns of the k-centers algorithm with different centers. The run that produce the best modularity is chosen as the final result.

• max_iter (int, default 20) – Maximum number of iterations of the k-centers algorithm for a single run.

Variables:
• labels (np.ndarray, shape (n_nodes,)) – Label of each node.

• labels_col (labels_row,) – Labels of rows and columns, for bipartite graphs.

• centers (np.ndarray, shape (n_nodes,)) – Cluster centers.

• centers_col (centers_row,) – Cluster centers of rows and columns, for bipartite graphs.

Example

>>> from sknetwork.clustering import KCenters
>>> from sknetwork.data import karate_club
>>> kcenters = KCenters(n_clusters=2)
>>> len(set(labels))
2

fit(input_matrix: csr_matrix | ndarray, force_bipartite: bool = False) [source]

Compute the clustering of the graph by k-centers.

Parameters:

• force_bipartite – If True, force the input matrix to be considered as a biadjacency matrix even if square.

Returns:

self

Return type:

KCenters

fit_predict(*args, **kwargs) ndarray

Fit algorithm to the data and return the labels. Same parameters as the fit method.

Returns:

labels – Labels.

Return type:

np.ndarray

fit_predict_proba(*args, **kwargs) ndarray

Fit algorithm to the data and return the probability distribution over labels. Same parameters as the fit method.

Returns:

probs – Probability of each label.

Return type:

np.ndarray

fit_transform(*args, **kwargs) ndarray

Fit algorithm to the data and return the membership matrix. Same parameters as the fit method.

Returns:

membership – Membership matrix (distribution over clusters).

Return type:

np.ndarray

get_params()

Get parameters as dictionary.

Returns:

params – Parameters of the algorithm.

Return type:

dict

predict(columns=False) ndarray

Return the labels predicted by the algorithm.

Parameters:

columns (bool) – If True, return the prediction for columns.

Returns:

labels – Labels.

Return type:

np.ndarray

predict_proba(columns=False) ndarray

Return the probability distribution over labels as predicted by the algorithm.

Parameters:

columns (bool) – If True, return the prediction for columns.

Returns:

probs – Probability distribution over labels.

Return type:

np.ndarray

set_params(params: dict) Algorithm

Set parameters of the algorithm.

Parameters:

params (dict) – Parameters of the algorithm.

Returns:

self

Return type:

Algorithm

transform(columns=False) csr_matrix

Return the probability distribution over labels in sparse format.

Parameters:

columns (bool) – If True, return the prediction for columns.

Returns:

probs – Probability distribution over labels.

Return type:

sparse.csr_matrix

## Propagation

class sknetwork.clustering.PropagationClustering(n_iter: int = 5, node_order: str = 'decreasing', weighted: bool = True, sort_clusters: bool = True, return_probs: bool = True, return_aggregate: bool = True)[source]

Clustering by label propagation.

Parameters:
• n_iter (int) – Maximum number of iterations (-1 for infinity).

• node_order (str) –

• ‘random’: node labels are updated in random order.

• ’increasing’: node labels are updated by increasing order of weight.

• ’decreasing’: node labels are updated by decreasing order of weight.

• Otherwise, node labels are updated by index order.

• weighted (bool) – If True, the vote of each neighbor is proportional to the edge weight. Otherwise, all votes have weight 1.

• sort_clusters (bool) – If True, sort labels in decreasing order of cluster size.

• return_probs (bool) – If True, return the probability distribution over clusters (soft clustering).

• return_aggregate (bool) – If True, return the aggregate adjacency matrix or biadjacency matrix between clusters.

Variables:
• labels (np.ndarray, shape (n_labels,)) – Label of each node.

• probs (sparse.csr_matrix, shape (n_row, n_labels)) – Probability distribution over labels.

• labels_col (labels_row,) – Labels of rows and columns, for bipartite graphs.

• probs_col (probs_row,) – Probability distributions over labels for rows and columns (for bipartite graphs).

• aggregate (sparse.csr_matrix) – Aggregate adjacency matrix or biadjacency matrix between clusters.

Example

>>> from sknetwork.clustering import PropagationClustering
>>> from sknetwork.data import karate_club
>>> propagation = PropagationClustering()
>>> len(set(labels))
2


References

Raghavan, U. N., Albert, R., & Kumara, S. (2007). Near linear time algorithm to detect community structures in large-scale networks. Physical review E, 76(3), 036106.

fit(input_matrix: csr_matrix | ndarray) [source]

Clustering by label propagation.

Parameters:

input_matrix (sparse.csr_matrix, np.ndarray) – Adjacency matrix or biadjacency matrix of the graph.

Returns:

self

Return type:

PropagationClustering

fit_predict(*args, **kwargs) ndarray

Fit algorithm to the data and return the labels. Same parameters as the fit method.

Returns:

labels – Labels.

Return type:

np.ndarray

fit_predict_proba(*args, **kwargs) ndarray

Fit algorithm to the data and return the probability distribution over labels. Same parameters as the fit method.

Returns:

probs – Probability of each label.

Return type:

np.ndarray

fit_transform(*args, **kwargs) ndarray

Fit algorithm to the data and return the membership matrix. Same parameters as the fit method.

Returns:

membership – Membership matrix (distribution over clusters).

Return type:

np.ndarray

get_params()

Get parameters as dictionary.

Returns:

params – Parameters of the algorithm.

Return type:

dict

predict(columns=False) ndarray

Return the labels predicted by the algorithm.

Parameters:

columns (bool) – If True, return the prediction for columns.

Returns:

labels – Labels.

Return type:

np.ndarray

predict_proba(columns=False) ndarray

Return the probability distribution over labels as predicted by the algorithm.

Parameters:

columns (bool) – If True, return the prediction for columns.

Returns:

probs – Probability distribution over labels.

Return type:

np.ndarray

set_params(params: dict) Algorithm

Set parameters of the algorithm.

Parameters:

params (dict) – Parameters of the algorithm.

Returns:

self

Return type:

Algorithm

transform(columns=False) csr_matrix

Return the probability distribution over labels in sparse format.

Parameters:

columns (bool) – If True, return the prediction for columns.

Returns:

probs – Probability distribution over labels.

Return type:

sparse.csr_matrix

## Post-processing

sknetwork.clustering.reindex_labels(labels: ndarray) ndarray[source]

Reindex clusters in decreasing order of size.

Parameters:

labels – Label of each node.

Returns:

new_labels – New label of each node.

Return type:

np.ndarray

Example

>>> from sknetwork.clustering import reindex_labels
>>> labels = np.array([0, 1, 1])
>>> reindex_labels(labels)
array([1, 0, 0])

sknetwork.clustering.aggregate_graph(input_matrix: csr_matrix, labels: ndarray | None = None, labels_row: ndarray | None = None, labels_col: ndarray | None = None) csr_matrix[source]

Aggregate graph per label. All nodes with the same label become a single node. Negative labels are ignored (corresponding nodes are discarded).

Parameters:
• input_matrix (sparse matrix) – Adjacency or biadjacency matrix of the graph.

• labels (np.ndarray) – Labels of nodes.

• labels_row (np.ndarray) – Labels of rows (for bipartite graphs). Alias for labels.

• labels_col (np.ndarray) – Labels of columns (for bipartite graphs).

## Metrics

sknetwork.clustering.get_modularity(input_matrix: csr_matrix | ndarray, labels: ndarray, labels_col: ndarray | None = None, weights: str = 'degree', resolution: float = 1, return_all: bool = False) float | Tuple[float, float, float][source]

Modularity of a clustering.

The modularity of a clustering is

$$Q = \dfrac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \dfrac{w_iw_j}{w}\right)\delta_{c_i,c_j}$$ for graphs,

$$Q = \dfrac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \dfrac{d^+_id^-_j}{w}\right)\delta_{c_i,c_j}$$ for directed graphs,

where

• $$c_i$$ is the cluster of node $$i$$,

• $$w_i$$ is the weight of node $$i$$,

• $$w^+_i, w^-_i$$ are the out-weight, in-weight of node $$i$$ (for directed graphs),

• $$w = 1^TA1$$ is the total weight,

• $$\delta$$ is the Kronecker symbol,

• $$\gamma \ge 0$$ is the resolution parameter.

Parameters:

• labels – Labels of nodes.

• labels_col – Labels of column nodes (for bipartite graphs).

• weights – Weighting of nodes ('degree' (default) or 'uniform').

• resolution – Resolution parameter (default = 1).

• return_all – If True, return modularity, fit, diversity.

Returns:

• modularity (float)

• fit (float, optional)

• diversity (float, optional)

Example

>>> from sknetwork.clustering import get_modularity
>>> from sknetwork.data import house