Clustering

Clustering algorithms.

The attribute labels_ assigns a label (cluster index) to each node of the graph.

Louvain

Here are the available notions of modularity for the Louvain algorithm:

Modularity	Formula
Newman (`'newman'`)	\(Q = \frac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \frac{d_id_j}{w}\right)\delta_{c_i,c_j}\)
Dugué (`'dugue'`)	\(Q = \frac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \frac{d^+_id^-_j}{w}\right)\delta_{c_i,c_j}\)
Potts (`'potts'`)	\(Q = \sum_{i,j}\left(\frac{A_{ij}}{w} - \gamma \frac{1}{n^2}\right)\delta_{c_i,c_j}\)

where

\(A\) is the adjacency matrix,
\(c_i\) is the cluster of node \(i\),
\(d_i\) is the degree of node \(i\),
\(d^+_i, d^-_i\) are the out-degree, in-degree of node \(i\) (for directed graphs),
\(w = 1^TA1\) is the sum of degrees,
\(\delta\) is the Kronecker symbol,
\(\gamma \ge 0\) is the resolution parameter.

For bipartite graphs, the considered adjacency matrix is

\(A = \begin{pmatrix} 0 & B\\B^T & 0\end{pmatrix}\)

for Newman modularity and Potts modularity (i.e., the graph is considered as undirected), and

\(A = \begin{pmatrix} 0 & B\\0 & 0\end{pmatrix}\)

for Dugué modularity (i.e., the graph is considered as directed). The latter is the default option and corresponds to Barber’s modularity:

\(Q = \frac{1}{w} \sum_{i,j}\left(B_{ij} - \gamma \frac{d_if_j}{w}\right)\delta_{c_i,c_j}\)

where \(i\) in the row index, \(j\) in the column index, \(d_i\) is the degree of row \(i\), \(f_j\) is the degree of column \(j\) and \(w = 1^TB1\) is the sum of degrees (either rows or columns).

When the graph is weighted, the degree of a node is replaced by its weight (sum of edge weights).

class sknetwork.clustering.Louvain(resolution: float = 1, modularity: str = 'dugue', tol_optimization: float = 0.001, tol_aggregation: float = 0.001, n_aggregations: int = - 1, shuffle_nodes: bool = False, sort_clusters: bool = True, return_membership: bool = True, return_aggregate: bool = True, random_state: Optional[Union[numpy.random.mtrand.RandomState, int]] = None, verbose: bool = False)[source]

Louvain algorithm for clustering graphs by maximization of modularity.

For bipartite graphs, the algorithm maximizes Barber’s modularity by default.

Parameters

resolution – Resolution parameter.
modularity (str) – Which objective function to maximize. Can be 'dugue', 'newman' or 'potts' (default = 'dugue').
tol_optimization – Minimum increase in the objective function to enter a new optimization pass.
tol_aggregation – Minimum increase in the objective function to enter a new aggregation pass.
n_aggregations – Maximum number of aggregations. A negative value is interpreted as no limit.
shuffle_nodes – Enables node shuffling before optimization.
sort_clusters – If True, sort labels in decreasing order of cluster size.
return_membership – If True, return the membership matrix of nodes to each cluster (soft clustering).
return_aggregate – If True, return the adjacency matrix of the graph between clusters.
random_state – Random number generator or random seed. If None, numpy.random is used.
verbose – Verbose mode.

Variables

labels_ (np.ndarray) – Labels of the nodes.
labels_row_ (np.ndarray) – Labels of the rows (for bipartite graphs).
labels_col_ (np.ndarray) – Labels of the columns (for bipartite graphs).
membership_ (sparse.csr_matrix) – Membership matrix of the nodes, shape (n_nodes, n_clusters).
membership_row_ (sparse.csr_matrix) – Membership matrix of the rows (for bipartite graphs).
membership_col_ (sparse.csr_matrix) – Membership matrix of the columns (for bipartite graphs).
aggregate_ (sparse.csr_matrix) – Aggregate adjacency matrix or biadjacency matrix between clusters.

Example

>>> from sknetwork.clustering import Louvain
>>> from sknetwork.data import karate_club
>>> louvain = Louvain()
>>> adjacency = karate_club()
>>> labels = louvain.fit_transform(adjacency)
>>> len(set(labels))
4

References

Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008.
Dugué, N., & Perez, A. (2015). Directed Louvain: maximizing modularity in directed networks (Doctoral dissertation, Université d’Orléans).
Barber, M. J. (2007). Modularity and community detection in bipartite networks Physical Review E, 76(6).

fit(input_matrix: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray], force_bipartite: bool = False) → sknetwork.clustering.louvain.Louvain[source]

Fit algorithm to data.

Parameters

input_matrix – Adjacency matrix or biadjacency matrix of the graph.
force_bipartite – If True, force the input matrix to be considered as a biadjacency matrix even if square.

Returns

self

Return type

Louvain

fit_transform(*args, **kwargs) → numpy.ndarray

Fit algorithm to the data and return the labels. Same parameters as the fit method.

Returns: labels – Labels.
Return type: np.ndarray

Propagation

class sknetwork.clustering.PropagationClustering(n_iter: int = 5, node_order: str = 'decreasing', weighted: bool = True, sort_clusters: bool = True, return_membership: bool = True, return_aggregate: bool = True)[source]

Clustering by label propagation.

Parameters

n_iter (int) – Maximum number of iterations (-1 for infinity).
node_order (str) –
- ‘random’: node labels are updated in random order.
- ’increasing’: node labels are updated by increasing order of weight.
- ’decreasing’: node labels are updated by decreasing order of weight.
- Otherwise, node labels are updated by index order.
weighted (bool) – If True, the vote of each neighbor is proportional to the edge weight. Otherwise, all votes have weight 1.
sort_clusters – If True, sort labels in decreasing order of cluster size.
return_membership – If True, return the membership matrix of nodes to each cluster (soft clustering).
return_aggregate – If True, return the aggregate adjacency matrix or biadjacency matrix between clusters.

Variables

labels_ (np.ndarray) – Labels of the nodes.
labels_row_ (np.ndarray) – Labels of the rows (for bipartite graphs).
labels_col_ (np.ndarray) – Labels of the columns (for bipartite graphs).
membership_ (sparse.csr_matrix) – Membership matrix of the nodes, shape (n_nodes, n_clusters).
membership_row_ (sparse.csr_matrix) – Membership matrix of the rows (for bipartite graphs).
membership_col_ (sparse.csr_matrix) – Membership matrix of the columns (for bipartite graphs).
aggregate_ (sparse.csr_matrix) – Aggregate adjacency matrix or biadjacency matrix between clusters.

Example

>>> from sknetwork.clustering import PropagationClustering
>>> from sknetwork.data import karate_club
>>> propagation = PropagationClustering()
>>> graph = karate_club(metadata=True)
>>> adjacency = graph.adjacency
>>> labels = propagation.fit_transform(adjacency)
>>> len(set(labels))
2

References

Raghavan, U. N., Albert, R., & Kumara, S. (2007). Near linear time algorithm to detect community structures in large-scale networks. Physical review E, 76(3), 036106.

fit(input_matrix: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray]) → sknetwork.clustering.propagation_clustering.PropagationClustering[source]

Clustering by label propagation.

Parameters: input_matrix – Adjacency matrix or biadjacency matrix of the graph.
Returns: self
Return type: PropagationClustering

fit_transform(*args, **kwargs) → numpy.ndarray

Fit algorithm to the data and return the labels. Same parameters as the fit method.

Returns: labels – Labels.
Return type: np.ndarray

score(label: int)

Classification scores for a given label.

Parameters: label (int) – The label index of the class.
Returns: scores – Classification scores of shape (number of nodes,).
Return type: np.ndarray

K-Means

class sknetwork.clustering.KMeans(n_clusters: int = 2, embedding_method: sknetwork.embedding.base.BaseEmbedding = Spectral(n_components=10, decomposition='rw', regularization=- 1, normalized=True), co_cluster: bool = False, sort_clusters: bool = True, return_membership: bool = True, return_aggregate: bool = True)[source]

K-means clustering applied in the embedding space.

Parameters

n_clusters – Number of desired clusters (default = 2).
embedding_method – Embedding method (default = Spectral embedding in dimension 10).
co_cluster – If True, co-cluster rows and columns, considered as different nodes (default = False).
sort_clusters – If True, sort labels in decreasing order of cluster size.
return_membership – If True, return the membership matrix of nodes to each cluster (soft clustering).
return_aggregate – If True, return the adjacency matrix of the graph between clusters.

Variables

labels_ (np.ndarray) – Labels of the nodes.
labels_row_ (np.ndarray) – Labels of the rows (for bipartite graphs).
labels_col_ (np.ndarray) – Labels of the columns (for bipartite graphs).
membership_ (sparse.csr_matrix) – Membership matrix of the nodes, shape (n_nodes, n_clusters).
membership_row_ (sparse.csr_matrix) – Membership matrix of the rows (for bipartite graphs).
membership_col_ (sparse.csr_matrix) – Membership matrix of the columns (for bipartite graphs).
aggregate_ (sparse.csr_matrix) – Aggregate adjacency matrix or biadjacency matrix between clusters.

Example

>>> from sknetwork.clustering import KMeans
>>> from sknetwork.data import karate_club
>>> kmeans = KMeans(n_clusters=3)
>>> adjacency = karate_club()
>>> labels = kmeans.fit_transform(adjacency)
>>> len(set(labels))
3

fit(input_matrix: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray]) → sknetwork.clustering.kmeans.KMeans[source]

Apply embedding method followed by K-means.

Parameters: input_matrix – Adjacency matrix or biadjacency matrix of the graph.
Returns: self
Return type: KMeans

fit_transform(*args, **kwargs) → numpy.ndarray

Fit algorithm to the data and return the labels. Same parameters as the fit method.

Returns: labels – Labels.
Return type: np.ndarray

Metrics

sknetwork.clustering.modularity(adjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray], labels: numpy.ndarray, weights: Union[str, numpy.ndarray] = 'degree', weights_in: Union[str, numpy.ndarray] = 'degree', resolution: float = 1, return_all: bool = False) → Union[float, Tuple[float, float, float]][source]

Modularity of a clustering.

The modularity of a clustering is

\(Q = \dfrac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \dfrac{d_id_j}{w}\right)\delta_{c_i,c_j}\) for graphs,

\(Q = \dfrac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \dfrac{d^+_id^-_j}{w}\right)\delta_{c_i,c_j}\) for directed graphs,

where

\(c_i\) is the cluster of node \(i\),
\(d_i\) is the weight of node \(i\),
\(d^+_i, d^-_i\) are the out-weight, in-weight of node \(i\) (for digraphs),
\(w = 1^TA1\) is the total weight,
\(\delta\) is the Kronecker symbol,
\(\gamma \ge 0\) is the resolution parameter.

Parameters

adjacency – Adjacency matrix of the graph.
labels – Labels of nodes, vector of size \(n\) .
weights – Weights of nodes. 'degree' (default), 'uniform' or custom weights.
weights_in – In-weights of nodes. None (default), 'degree', 'uniform' or custom weights. If None, taken equal to weights.
resolution – Resolution parameter (default = 1).
return_all – If True, return modularity, fit, diversity.

Returns

modularity (float)
fit (float, optional)
diversity (float, optional)

Example

>>> from sknetwork.clustering import modularity
>>> from sknetwork.data import house
>>> adjacency = house()
>>> labels = np.array([0, 0, 1, 1, 0])
>>> np.round(modularity(adjacency, labels), 2)
0.11

sknetwork.clustering.bimodularity(biadjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray], labels: numpy.ndarray, labels_col: numpy.ndarray, weights: Union[str, numpy.ndarray] = 'degree', weights_col: Union[str, numpy.ndarray] = 'degree', resolution: float = 1, return_all: bool = False) → Union[float, Tuple[float, float, float]][source]

Bimodularity of the clustering (for bipartite graphs).

The bimodularity of a clustering is

\(Q = \sum_{i}\sum_{j}\left(\dfrac{B_{ij}}{w} - \gamma \dfrac{d_{1,i}d_{2,j}}{w^2}\right) \delta_{c_{1,i},c_{2,j}}\)

where

\(c_{1,i}, c_{2,j}\) are the clusters of nodes \(i\) (row) and \(j\) (column),
\(d_{1,i}, d_{2,j}\) are the weights of nodes \(i\) (row) and \(j\) (column),
\(w = 1^TB1\) is the total weight,
\(\delta\) is the Kronecker symbol,
\(\gamma \ge 0\) is the resolution parameter.

Parameters

biadjacency – Biadjacency matrix of the graph (shape \(n_1 \times n_2\)).
labels – Labels of rows, vector of size \(n_1\).
labels_col – Labels of columns, vector of size \(n_2\).
weights – Weights of nodes. 'degree' (default), 'uniform' or custom weights.
weights_col – Weights of columns. 'degree' (default), 'uniform' or custom weights.
resolution – Resolution parameter (default = 1).
return_all – If True, return modularity, fit, diversity.

Returns

modularity (float)
fit (float, optional)
diversity (float, optional)

Example

>>> from sknetwork.clustering import bimodularity
>>> from sknetwork.data import star_wars
>>> biadjacency = star_wars()
>>> labels = np.array([1, 1, 0, 0])
>>> labels_col = np.array([1, 0, 0])
>>> np.round(bimodularity(biadjacency, labels, labels_col), 2)
0.22

sknetwork.clustering.comodularity(adjacency: Union[scipy.sparse.csr.csr_matrix, numpy.ndarray], labels: numpy.ndarray, resolution: float = 1, return_all: bool = False) → Union[float, Tuple[float, float, float]][source]

Modularity of a clustering in the normalized co-neighborhood graph.

Quality metric of a clustering given by:

\(Q = \dfrac{1}{w}\sum_{i,j}\left((AD_2^{-1}A^T)_{ij} - \gamma \dfrac{d_id_j}{w}\right) \delta_{c_i,c_j}\)

where

\(c_i\) is the cluster of node i,
\(D_2\) is the diagonal matrix of the weights of columns,
\(\delta\) is the Kronecker symbol,
\(\gamma \ge 0\) is the resolution parameter.

Parameters

adjacency – Adjacency matrix or biadjacency matrix of the graph.
labels – Labels of the nodes.
resolution – Resolution parameter (default = 1).
return_all – If True, return modularity, fit, diversity.

Returns

modularity (float)
fit (float, optional)
diversity (float, optional)

Example

>>> from sknetwork.clustering import comodularity
>>> from sknetwork.data import house
>>> adjacency = house()
>>> labels = np.array([0, 0, 1, 1, 0])
>>> np.round(comodularity(adjacency, labels), 2)
0.06

Notes

Does not require the computation of the adjacency matrix of the normalized co-neighborhood graph.

sknetwork.clustering.normalized_std(labels: numpy.ndarray) → float[source]

Normalized standard deviation of cluster sizes.

A score of 1 means perfectly balanced clustering.

Parameters: labels – Labels of nodes.
Returns
Return type: float

Example

>>> from sknetwork.clustering import normalized_std
>>> labels = np.array([0, 0, 1, 1])
>>> normalized_std(labels)
1.0

Post-processing

sknetwork.clustering.postprocess.reindex_labels(labels: numpy.ndarray, consecutive: bool = True) → numpy.ndarray[source]

Reindex clusters in decreasing order of size.

Parameters

labels – label of each node.
consecutive – If True, the set of labels must be from 0 to \(k - 1\), where \(k\) is the number of labels. Lead to faster computation.

Returns

new_labels – New label of each node.

Return type

np.ndarray

Example

>>> from sknetwork.clustering import reindex_labels
>>> labels = np.array([0, 1, 1])
>>> reindex_labels(labels)
array([1, 0, 0])