Clustering
Clustering algorithms.
The attribute labels_
assigns a label (cluster index) to each node of the graph.
Louvain
Here are the available notions of modularity for the Louvain algorithm:
Modularity |
Formula |
---|---|
Newman ( |
\(Q = \frac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \frac{d_id_j}{w}\right)\delta_{c_i,c_j}\) |
Dugué ( |
\(Q = \frac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \frac{d^+_id^-_j}{w}\right)\delta_{c_i,c_j}\) |
Potts ( |
\(Q = \sum_{i,j}\left(\frac{A_{ij}}{w} - \gamma \frac{1}{n^2}\right)\delta_{c_i,c_j}\) |
- where
\(A\) is the adjacency matrix,
\(c_i\) is the cluster of node \(i\),
\(d_i\) is the degree of node \(i\),
\(d^+_i, d^-_i\) are the out-degree, in-degree of node \(i\) (for directed graphs),
\(w = 1^TA1\) is the sum of degrees,
\(\delta\) is the Kronecker symbol,
\(\gamma \ge 0\) is the resolution parameter.
For bipartite graphs, the considered adjacency matrix is
\(A = \begin{pmatrix} 0 & B\\B^T & 0\end{pmatrix}\)
for Newman modularity and Potts modularity (i.e., the graph is considered as undirected), and
\(A = \begin{pmatrix} 0 & B\\0 & 0\end{pmatrix}\)
for Dugué modularity (i.e., the graph is considered as directed). The latter is the default option and corresponds to Barber’s modularity:
\(Q = \frac{1}{w} \sum_{i,j}\left(B_{ij} - \gamma \frac{d_if_j}{w}\right)\delta_{c_i,c_j}\)
where \(i\) in the row index, \(j\) in the column index, \(d_i\) is the degree of row \(i\), \(f_j\) is the degree of column \(j\) and \(w = 1^TB1\) is the sum of degrees (either rows or columns).
When the graph is weighted, the degree of a node is replaced by its weight (sum of edge weights).
- class sknetwork.clustering.Louvain(resolution: float = 1, modularity: str = 'dugue', tol_optimization: float = 0.001, tol_aggregation: float = 0.001, n_aggregations: int = - 1, shuffle_nodes: bool = False, sort_clusters: bool = True, return_membership: bool = True, return_aggregate: bool = True, random_state: Optional[Union[numpy.random.mtrand.RandomState, int]] = None, verbose: bool = False)[source]
Louvain algorithm for clustering graphs by maximization of modularity.
For bipartite graphs, the algorithm maximizes Barber’s modularity by default.
- Parameters
resolution – Resolution parameter.
modularity (str) – Which objective function to maximize. Can be
'Dugue'
,'Newman'
or'Potts'
(default ='dugue'
).tol_optimization – Minimum increase in the objective function to enter a new optimization pass.
tol_aggregation – Minimum increase in the objective function to enter a new aggregation pass.
n_aggregations – Maximum number of aggregations. A negative value is interpreted as no limit.
shuffle_nodes – Enables node shuffling before optimization.
sort_clusters – If
True
, sort labels in decreasing order of cluster size.return_membership – If
True
, return the membership matrix of nodes to each cluster (soft clustering).return_aggregate – If
True
, return the adjacency matrix of the graph between clusters.random_state – Random number generator or random seed. If None, numpy.random is used.
verbose – Verbose mode.
- Variables
labels_ (np.ndarray) – Labels of the nodes.
labels_row_ (np.ndarray) – Labels of the rows (for bipartite graphs).
labels_col_ (np.ndarray) – Labels of the columns (for bipartite graphs).
membership_ (sparse.csr_matrix) – Membership matrix of the nodes, shape (n_nodes, n_clusters).
membership_row_ (sparse.csr_matrix) – Membership matrix of the rows (for bipartite graphs).
membership_col_ (sparse.csr_matrix) – Membership matrix of the columns (for bipartite graphs).
aggregate_ (sparse.csr_matrix) – Aggregate adjacency matrix or biadjacency matrix between clusters.
Example
>>> from sknetwork.clustering import Louvain >>> from sknetwork.data import karate_club >>> louvain = Louvain() >>> adjacency = karate_club() >>> labels = louvain.fit_predict(adjacency) >>> len(set(labels)) 4
References
Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008.
Dugué, N., & Perez, A. (2015). Directed Louvain: maximizing modularity in directed networks (Doctoral dissertation, Université d’Orléans).
Barber, M. J. (2007). Modularity and community detection in bipartite networks Physical Review E, 76(6).
- fit(input_matrix: Union[scipy.sparse._csr.csr_matrix, numpy.ndarray], force_bipartite: bool = False) sknetwork.clustering.louvain.Louvain [source]
Fit algorithm to data.
- Parameters
input_matrix – Adjacency matrix or biadjacency matrix of the graph.
force_bipartite – If
True
, force the input matrix to be considered as a biadjacency matrix even if square.
- Returns
self
- Return type
- fit_predict(*args, **kwargs) numpy.ndarray
Fit algorithm to the data and return the labels. Same parameters as the
fit
method.- Returns
labels – Labels.
- Return type
np.ndarray
- fit_transform(*args, **kwargs) numpy.ndarray
Fit algorithm to the data and return the membership matrix. Same parameters as the
fit
method. :returns: membership – Membership matrix (distribution over clusters). :rtype: np.ndarray
Propagation
- class sknetwork.clustering.PropagationClustering(n_iter: int = 5, node_order: str = 'decreasing', weighted: bool = True, sort_clusters: bool = True, return_membership: bool = True, return_aggregate: bool = True)[source]
Clustering by label propagation.
- Parameters
n_iter (int) – Maximum number of iterations (-1 for infinity).
node_order (str) –
‘random’: node labels are updated in random order.
’increasing’: node labels are updated by increasing order of weight.
’decreasing’: node labels are updated by decreasing order of weight.
Otherwise, node labels are updated by index order.
weighted (bool) – If
True
, the vote of each neighbor is proportional to the edge weight. Otherwise, all votes have weight 1.sort_clusters – If
True
, sort labels in decreasing order of cluster size.return_membership – If
True
, return the membership matrix of nodes to each cluster (soft clustering).return_aggregate – If
True
, return the aggregate adjacency matrix or biadjacency matrix between clusters.
- Variables
labels_ (np.ndarray) – Labels of the nodes.
labels_row_ (np.ndarray) – Labels of the rows (for bipartite graphs).
labels_col_ (np.ndarray) – Labels of the columns (for bipartite graphs).
membership_ (sparse.csr_matrix) – Membership matrix of the nodes, shape (n_nodes, n_clusters).
membership_row_ (sparse.csr_matrix) – Membership matrix of the rows (for bipartite graphs).
membership_col_ (sparse.csr_matrix) – Membership matrix of the columns (for bipartite graphs).
aggregate_ (sparse.csr_matrix) – Aggregate adjacency matrix or biadjacency matrix between clusters.
Example
>>> from sknetwork.clustering import PropagationClustering >>> from sknetwork.data import karate_club >>> propagation = PropagationClustering() >>> graph = karate_club(metadata=True) >>> adjacency = graph.adjacency >>> labels = propagation.fit_predict(adjacency) >>> len(set(labels)) 2
References
Raghavan, U. N., Albert, R., & Kumara, S. (2007). Near linear time algorithm to detect community structures in large-scale networks. Physical review E, 76(3), 036106.
- fit(input_matrix: Union[scipy.sparse._csr.csr_matrix, numpy.ndarray]) sknetwork.clustering.propagation_clustering.PropagationClustering [source]
Clustering by label propagation.
- Parameters
input_matrix – Adjacency matrix or biadjacency matrix of the graph.
- Returns
self
- Return type
- fit_predict(*args, **kwargs) numpy.ndarray
Fit algorithm to the data and return the labels. Same parameters as the
fit
method.- Returns
labels – Labels.
- Return type
np.ndarray
- fit_transform(*args, **kwargs) numpy.ndarray
Fit algorithm to the data and return the membership matrix. Same parameters as the
fit
method. :returns: membership – Membership matrix (distribution over clusters). :rtype: np.ndarray
Post-processing
- sknetwork.clustering.reindex_labels(labels: numpy.ndarray) numpy.ndarray [source]
Reindex clusters in decreasing order of size.
- Parameters
labels – Label of each node.
- Returns
new_labels – New label of each node.
- Return type
np.ndarray
Example
>>> from sknetwork.clustering import reindex_labels >>> labels = np.array([0, 1, 1]) >>> reindex_labels(labels) array([1, 0, 0])
- sknetwork.clustering.aggregate_graph(input_matrix: scipy.sparse._csr.csr_matrix, labels: Optional[numpy.ndarray] = None, labels_row: Optional[numpy.ndarray] = None, labels_col: Optional[numpy.ndarray] = None) scipy.sparse._csr.csr_matrix [source]
Aggregate graph per label. All nodes with the same label become a single node. Negative labels are ignored (corresponding nodes are not discarded).
- Parameters
input_matrix (sparse matrix) – Adjacency or biadjacency matrix of the graph.
labels (np.ndarray) – Labels of nodes.
labels_row (np.ndarray) – Labels of rows (for bipartite graphs). Alias for labels.
labels_col (np.ndarray) – Labels of columns (for bipartite graphs).
Metrics
- sknetwork.clustering.get_modularity(input_matrix: Union[scipy.sparse._csr.csr_matrix, numpy.ndarray], labels: numpy.ndarray, labels_col: Optional[numpy.ndarray] = None, weights: str = 'degree', resolution: float = 1, return_all: bool = False) Union[float, Tuple[float, float, float]] [source]
Modularity of a clustering.
The modularity of a clustering is
\(Q = \dfrac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \dfrac{w_iw_j}{w}\right)\delta_{c_i,c_j}\) for graphs,
\(Q = \dfrac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \dfrac{d^+_id^-_j}{w}\right)\delta_{c_i,c_j}\) for directed graphs,
where
\(c_i\) is the cluster of node \(i\),
\(w_i\) is the weight of node \(i\),
\(w^+_i, w^-_i\) are the out-weight, in-weight of node \(i\) (for directed graphs),
\(w = 1^TA1\) is the total weight,
\(\delta\) is the Kronecker symbol,
\(\gamma \ge 0\) is the resolution parameter.
- Parameters
input_matrix – Adjacency matrix or biadjacency matrix of the graph.
labels – Labels of nodes.
labels_col – Labels of column nodes (for bipartite graphs).
weights – Weighting of nodes (
'degree'
(default) or'uniform'
).resolution – Resolution parameter (default = 1).
return_all – If
True
, return modularity, fit, diversity.
- Returns
modularity (float)
fit (float, optional)
diversity (float, optional)
Example
>>> from sknetwork.clustering import get_modularity >>> from sknetwork.data import house >>> adjacency = house() >>> labels = np.array([0, 0, 1, 1, 0]) >>> np.round(get_modularity(adjacency, labels), 2) 0.11