# Clustering

Clustering algorithms.

The attribute labels_ assigns a label (cluster index) to each node of the graph.

## Louvain

The Louvain algorithm aims at maximizing the modularity. Several variants of modularity are available:

Modularity

Formula

Newman ('newman')

$$Q = \frac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \frac{d_id_j}{w}\right)\delta_{c_i,c_j}$$

Dugué ('dugue')

$$Q = \frac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \frac{d^+_id^-_j}{w}\right)\delta_{c_i,c_j}$$

Potts ('potts')

$$Q = \sum_{i,j}\left(\frac{A_{ij}}{w} - \gamma \frac{1}{n^2}\right)\delta_{c_i,c_j}$$

where
• $$A$$ is the adjacency matrix,

• $$c_i$$ is the cluster of node $$i$$,

• $$d_i$$ is the degree of node $$i$$,

• $$d^+_i, d^-_i$$ are the out-degree, in-degree of node $$i$$ (for directed graphs),

• $$w = 1^TA1$$ is the sum of degrees,

• $$\delta$$ is the Kronecker symbol,

• $$\gamma \ge 0$$ is the resolution parameter.

Observe that for undirected graphs, the Newman and Dugué variants are equivalent.

For bipartite graphs, the considered adjacency matrix $$A$$ depends on the modularity. For the Newman and Potts variants, the graph is considered as undirected so that:

$$A = \begin{pmatrix} 0 & B\\B^T & 0\end{pmatrix}$$

For the Dugué variant, the graph is considered as directed so that:

$$A = \begin{pmatrix} 0 & B\\0 & 0\end{pmatrix}$$

This is the default option and corresponds to Barber’s modularity (see reference below).

When the graph is weighted, the degree of a node is replaced by its weight (sum of edge weights).

class sknetwork.clustering.Louvain(resolution: float = 1, modularity: str = 'dugue', tol_optimization: float = 0.001, tol_aggregation: float = 0.001, n_aggregations: int = - 1, shuffle_nodes: bool = False, sort_clusters: bool = True, return_probs: bool = True, return_aggregate: bool = True, random_state: Optional[Union[numpy.random.mtrand.RandomState, int]] = None, verbose: bool = False)[source]

Louvain algorithm for clustering graphs by maximization of modularity.

For bipartite graphs, the algorithm maximizes Barber’s modularity by default.

Parameters
• resolution – Resolution parameter.

• modularity (str) – Which objective function to maximize. Can be 'Dugue', 'Newman' or 'Potts' (default = 'dugue').

• tol_optimization – Minimum increase in the objective function to enter a new optimization pass.

• tol_aggregation – Minimum increase in the objective function to enter a new aggregation pass.

• n_aggregations – Maximum number of aggregations. A negative value is interpreted as no limit.

• shuffle_nodes – Enables node shuffling before optimization.

• sort_clusters – If True, sort labels in decreasing order of cluster size.

• return_probs – If True, return the probability distribution over clusters (soft clustering).

• return_aggregate – If True, return the adjacency matrix of the graph between clusters.

• random_state – Random number generator or random seed. If None, numpy.random is used.

• verbose – Verbose mode.

Variables
• labels_ (np.ndarray, shape (n_labels,)) – Label of each node.

• probs_ (sparse.csr_matrix, shape (n_row, n_labels)) – Probability distribution over labels.

• labels_col (labels_row_,) – Labels of rows and columns, for bipartite graphs.

• probs_col (probs_row_,) – Probability distributions over labels for rows and columns (for bipartite graphs).

• aggregate_ (sparse.csr_matrix) – Aggregate adjacency matrix or biadjacency matrix between clusters.

Example

>>> from sknetwork.clustering import Louvain
>>> from sknetwork.data import karate_club
>>> louvain = Louvain()
>>> len(set(labels))
4


References

fit(input_matrix: Union[scipy.sparse._csr.csr_matrix, numpy.ndarray], force_bipartite: bool = False) [source]

Fit algorithm to data.

Parameters

• force_bipartite – If True, force the input matrix to be considered as a biadjacency matrix even if square.

Returns

self

Return type

Louvain

fit_predict(*args, **kwargs) numpy.ndarray

Fit algorithm to the data and return the labels. Same parameters as the fit method.

Returns

labels – Labels.

Return type

np.ndarray

fit_predict_proba(*args, **kwargs) numpy.ndarray

Fit algorithm to the data and return the probability distribution over labels. Same parameters as the fit method.

Returns

probs – Probability of each label.

Return type

np.ndarray

fit_transform(*args, **kwargs) numpy.ndarray

Fit algorithm to the data and return the membership matrix. Same parameters as the fit method.

Returns

membership – Membership matrix (distribution over clusters).

Return type

np.ndarray

get_params()

Get parameters as dictionary.

Returns

params – Parameters of the algorithm.

Return type

dict

predict(columns=False) numpy.ndarray

Return the labels predicted by the algorithm.

Parameters

columns (bool) – If True, return the prediction for columns.

Returns

labels – Labels.

Return type

np.ndarray

predict_proba(columns=False) numpy.ndarray

Return the probability distribution over labels as predicted by the algorithm.

Parameters

columns (bool) – If True, return the prediction for columns.

Returns

probs – Probability distribution over labels.

Return type

np.ndarray

print_log(*args)

Fill log with text.

set_params(params: dict) sknetwork.base.Algorithm

Set parameters of the algorithm.

Parameters

params (dict) – Parameters of the algorithm.

Returns

self

Return type

Algorithm

transform(columns=False) scipy.sparse._csr.csr_matrix

Return the probability distribution over labels in sparse format.

Parameters

columns (bool) – If True, return the prediction for columns.

Returns

probs – Probability distribution over labels.

Return type

sparse.csr_matrix

## Propagation

class sknetwork.clustering.PropagationClustering(n_iter: int = 5, node_order: str = 'decreasing', weighted: bool = True, sort_clusters: bool = True, return_probs: bool = True, return_aggregate: bool = True)[source]

Clustering by label propagation.

Parameters
• n_iter (int) – Maximum number of iterations (-1 for infinity).

• node_order (str) –

• ‘random’: node labels are updated in random order.

• ’increasing’: node labels are updated by increasing order of weight.

• ’decreasing’: node labels are updated by decreasing order of weight.

• Otherwise, node labels are updated by index order.

• weighted (bool) – If True, the vote of each neighbor is proportional to the edge weight. Otherwise, all votes have weight 1.

• sort_clusters – If True, sort labels in decreasing order of cluster size.

• return_probs – If True, return the probability distribution over clusters (soft clustering).

• return_aggregate – If True, return the aggregate adjacency matrix or biadjacency matrix between clusters.

Variables
• labels_ (np.ndarray, shape (n_labels,)) – Label of each node.

• probs_ (sparse.csr_matrix, shape (n_row, n_labels)) – Probability distribution over labels.

• labels_col (labels_row_,) – Labels of rows and columns, for bipartite graphs.

• probs_col (probs_row_,) – Probability distributions over labels for rows and columns (for bipartite graphs).

• aggregate_ (sparse.csr_matrix) – Aggregate adjacency matrix or biadjacency matrix between clusters.

Example

>>> from sknetwork.clustering import PropagationClustering
>>> from sknetwork.data import karate_club
>>> propagation = PropagationClustering()
>>> len(set(labels))
2


References

Raghavan, U. N., Albert, R., & Kumara, S. (2007). Near linear time algorithm to detect community structures in large-scale networks. Physical review E, 76(3), 036106.

fit(input_matrix: Union[scipy.sparse._csr.csr_matrix, numpy.ndarray]) [source]

Clustering by label propagation.

Parameters

Returns

self

Return type

PropagationClustering

fit_predict(*args, **kwargs) numpy.ndarray

Fit algorithm to the data and return the labels. Same parameters as the fit method.

Returns

labels – Labels.

Return type

np.ndarray

fit_predict_proba(*args, **kwargs) numpy.ndarray

Fit algorithm to the data and return the probability distribution over labels. Same parameters as the fit method.

Returns

probs – Probability of each label.

Return type

np.ndarray

fit_transform(*args, **kwargs) numpy.ndarray

Fit algorithm to the data and return the membership matrix. Same parameters as the fit method.

Returns

membership – Membership matrix (distribution over clusters).

Return type

np.ndarray

get_params()

Get parameters as dictionary.

Returns

params – Parameters of the algorithm.

Return type

dict

predict(columns=False) numpy.ndarray

Return the labels predicted by the algorithm.

Parameters

columns (bool) – If True, return the prediction for columns.

Returns

labels – Labels.

Return type

np.ndarray

predict_proba(columns=False) numpy.ndarray

Return the probability distribution over labels as predicted by the algorithm.

Parameters

columns (bool) – If True, return the prediction for columns.

Returns

probs – Probability distribution over labels.

Return type

np.ndarray

set_params(params: dict) sknetwork.base.Algorithm

Set parameters of the algorithm.

Parameters

params (dict) – Parameters of the algorithm.

Returns

self

Return type

Algorithm

transform(columns=False) scipy.sparse._csr.csr_matrix

Return the probability distribution over labels in sparse format.

Parameters

columns (bool) – If True, return the prediction for columns.

Returns

probs – Probability distribution over labels.

Return type

sparse.csr_matrix

## Post-processing

sknetwork.clustering.reindex_labels(labels: numpy.ndarray) numpy.ndarray[source]

Reindex clusters in decreasing order of size.

Parameters

labels – Label of each node.

Returns

new_labels – New label of each node.

Return type

np.ndarray

Example

>>> from sknetwork.clustering import reindex_labels
>>> labels = np.array([0, 1, 1])
>>> reindex_labels(labels)
array([1, 0, 0])

sknetwork.clustering.aggregate_graph(input_matrix: scipy.sparse._csr.csr_matrix, labels: Optional[numpy.ndarray] = None, labels_row: Optional[numpy.ndarray] = None, labels_col: Optional[numpy.ndarray] = None) scipy.sparse._csr.csr_matrix[source]

Aggregate graph per label. All nodes with the same label become a single node. Negative labels are ignored (corresponding nodes are not discarded).

Parameters
• input_matrix (sparse matrix) – Adjacency or biadjacency matrix of the graph.

• labels (np.ndarray) – Labels of nodes.

• labels_row (np.ndarray) – Labels of rows (for bipartite graphs). Alias for labels.

• labels_col (np.ndarray) – Labels of columns (for bipartite graphs).

## Metrics

sknetwork.clustering.get_modularity(input_matrix: Union[scipy.sparse._csr.csr_matrix, numpy.ndarray], labels: numpy.ndarray, labels_col: Optional[numpy.ndarray] = None, weights: str = 'degree', resolution: float = 1, return_all: bool = False) Union[float, Tuple[float, float, float]][source]

Modularity of a clustering.

The modularity of a clustering is

$$Q = \dfrac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \dfrac{w_iw_j}{w}\right)\delta_{c_i,c_j}$$ for graphs,

$$Q = \dfrac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \dfrac{d^+_id^-_j}{w}\right)\delta_{c_i,c_j}$$ for directed graphs,

where

• $$c_i$$ is the cluster of node $$i$$,

• $$w_i$$ is the weight of node $$i$$,

• $$w^+_i, w^-_i$$ are the out-weight, in-weight of node $$i$$ (for directed graphs),

• $$w = 1^TA1$$ is the total weight,

• $$\delta$$ is the Kronecker symbol,

• $$\gamma \ge 0$$ is the resolution parameter.

Parameters

• labels – Labels of nodes.

• labels_col – Labels of column nodes (for bipartite graphs).

• weights – Weighting of nodes ('degree' (default) or 'uniform').

• resolution – Resolution parameter (default = 1).

• return_all – If True, return modularity, fit, diversity.

Returns

• modularity (float)

• fit (float, optional)

• diversity (float, optional)

Example

>>> from sknetwork.clustering import get_modularity
>>> from sknetwork.data import house