Clustering
Clustering algorithms.
The attribute labels_
assigns a label (cluster index) to each node of the graph.
Louvain
The Louvain algorithm aims at maximizing the modularity. Several variants of modularity are available:
Modularity |
Formula |
---|---|
Newman ( |
\(Q = \frac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \frac{d_id_j}{w}\right)\delta_{c_i,c_j}\) |
Dugué ( |
\(Q = \frac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \frac{d^+_id^-_j}{w}\right)\delta_{c_i,c_j}\) |
Potts ( |
\(Q = \sum_{i,j}\left(\frac{A_{ij}}{w} - \gamma \frac{1}{n^2}\right)\delta_{c_i,c_j}\) |
- where
\(A\) is the adjacency matrix,
\(c_i\) is the cluster of node \(i\),
\(d_i\) is the degree of node \(i\),
\(d^+_i, d^-_i\) are the out-degree, in-degree of node \(i\) (for directed graphs),
\(w = 1^TA1\) is the sum of degrees,
\(\delta\) is the Kronecker symbol,
\(\gamma \ge 0\) is the resolution parameter.
Observe that for undirected graphs, the Newman and Dugué variants are equivalent.
For bipartite graphs, the considered adjacency matrix \(A\) depends on the modularity. For the Newman and Potts variants, the graph is considered as undirected so that:
\(A = \begin{pmatrix} 0 & B\\B^T & 0\end{pmatrix}\)
For the Dugué variant, the graph is considered as directed so that:
\(A = \begin{pmatrix} 0 & B\\0 & 0\end{pmatrix}\)
This is the default option and corresponds to Barber’s modularity (see reference below).
When the graph is weighted, the degree of a node is replaced by its weight (sum of edge weights).
- class sknetwork.clustering.Louvain(resolution: float = 1, modularity: str = 'dugue', tol_optimization: float = 0.001, tol_aggregation: float = 0.001, n_aggregations: int = - 1, shuffle_nodes: bool = False, sort_clusters: bool = True, return_probs: bool = True, return_aggregate: bool = True, random_state: Optional[Union[numpy.random.mtrand.RandomState, int]] = None, verbose: bool = False)[source]
Louvain algorithm for clustering graphs by maximization of modularity.
For bipartite graphs, the algorithm maximizes Barber’s modularity by default.
- Parameters
resolution – Resolution parameter.
modularity (str) – Which objective function to maximize. Can be
'Dugue'
,'Newman'
or'Potts'
(default ='dugue'
).tol_optimization – Minimum increase in the objective function to enter a new optimization pass.
tol_aggregation – Minimum increase in the objective function to enter a new aggregation pass.
n_aggregations – Maximum number of aggregations. A negative value is interpreted as no limit.
shuffle_nodes – Enables node shuffling before optimization.
sort_clusters – If
True
, sort labels in decreasing order of cluster size.return_probs – If
True
, return the probability distribution over clusters (soft clustering).return_aggregate – If
True
, return the adjacency matrix of the graph between clusters.random_state – Random number generator or random seed. If None, numpy.random is used.
verbose – Verbose mode.
- Variables
labels_ (np.ndarray, shape (n_labels,)) – Label of each node.
probs_ (sparse.csr_matrix, shape (n_row, n_labels)) – Probability distribution over labels.
labels_col (labels_row_,) – Labels of rows and columns, for bipartite graphs.
probs_col (probs_row_,) – Probability distributions over labels for rows and columns (for bipartite graphs).
aggregate_ (sparse.csr_matrix) – Aggregate adjacency matrix or biadjacency matrix between clusters.
Example
>>> from sknetwork.clustering import Louvain >>> from sknetwork.data import karate_club >>> louvain = Louvain() >>> adjacency = karate_club() >>> labels = louvain.fit_predict(adjacency) >>> len(set(labels)) 4
References
Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008.
Dugué, N., & Perez, A. (2015). Directed Louvain: maximizing modularity in directed networks (Doctoral dissertation, Université d’Orléans).
Barber, M. J. (2007). Modularity and community detection in bipartite networks Physical Review E, 76(6).
- fit(input_matrix: Union[scipy.sparse._csr.csr_matrix, numpy.ndarray], force_bipartite: bool = False) sknetwork.clustering.louvain.Louvain [source]
Fit algorithm to data.
- Parameters
input_matrix – Adjacency matrix or biadjacency matrix of the graph.
force_bipartite – If
True
, force the input matrix to be considered as a biadjacency matrix even if square.
- Returns
self
- Return type
- fit_predict(*args, **kwargs) numpy.ndarray
Fit algorithm to the data and return the labels. Same parameters as the
fit
method.- Returns
labels – Labels.
- Return type
np.ndarray
- fit_predict_proba(*args, **kwargs) numpy.ndarray
Fit algorithm to the data and return the probability distribution over labels. Same parameters as the
fit
method.- Returns
probs – Probability of each label.
- Return type
np.ndarray
- fit_transform(*args, **kwargs) numpy.ndarray
Fit algorithm to the data and return the membership matrix. Same parameters as the
fit
method.- Returns
membership – Membership matrix (distribution over clusters).
- Return type
np.ndarray
- get_params()
Get parameters as dictionary.
- Returns
params – Parameters of the algorithm.
- Return type
dict
- predict(columns=False) numpy.ndarray
Return the labels predicted by the algorithm.
- Parameters
columns (bool) – If
True
, return the prediction for columns.- Returns
labels – Labels.
- Return type
np.ndarray
- predict_proba(columns=False) numpy.ndarray
Return the probability distribution over labels as predicted by the algorithm.
- Parameters
columns (bool) – If
True
, return the prediction for columns.- Returns
probs – Probability distribution over labels.
- Return type
np.ndarray
- print_log(*args)
Fill log with text.
- set_params(params: dict) sknetwork.base.Algorithm
Set parameters of the algorithm.
- Parameters
params (dict) – Parameters of the algorithm.
- Returns
self
- Return type
Algorithm
- transform(columns=False) scipy.sparse._csr.csr_matrix
Return the probability distribution over labels in sparse format.
- Parameters
columns (bool) – If
True
, return the prediction for columns.- Returns
probs – Probability distribution over labels.
- Return type
sparse.csr_matrix
Propagation
- class sknetwork.clustering.PropagationClustering(n_iter: int = 5, node_order: str = 'decreasing', weighted: bool = True, sort_clusters: bool = True, return_probs: bool = True, return_aggregate: bool = True)[source]
Clustering by label propagation.
- Parameters
n_iter (int) – Maximum number of iterations (-1 for infinity).
node_order (str) –
‘random’: node labels are updated in random order.
’increasing’: node labels are updated by increasing order of weight.
’decreasing’: node labels are updated by decreasing order of weight.
Otherwise, node labels are updated by index order.
weighted (bool) – If
True
, the vote of each neighbor is proportional to the edge weight. Otherwise, all votes have weight 1.sort_clusters – If
True
, sort labels in decreasing order of cluster size.return_probs – If
True
, return the probability distribution over clusters (soft clustering).return_aggregate – If
True
, return the aggregate adjacency matrix or biadjacency matrix between clusters.
- Variables
labels_ (np.ndarray, shape (n_labels,)) – Label of each node.
probs_ (sparse.csr_matrix, shape (n_row, n_labels)) – Probability distribution over labels.
labels_col (labels_row_,) – Labels of rows and columns, for bipartite graphs.
probs_col (probs_row_,) – Probability distributions over labels for rows and columns (for bipartite graphs).
aggregate_ (sparse.csr_matrix) – Aggregate adjacency matrix or biadjacency matrix between clusters.
Example
>>> from sknetwork.clustering import PropagationClustering >>> from sknetwork.data import karate_club >>> propagation = PropagationClustering() >>> graph = karate_club(metadata=True) >>> adjacency = graph.adjacency >>> labels = propagation.fit_predict(adjacency) >>> len(set(labels)) 2
References
Raghavan, U. N., Albert, R., & Kumara, S. (2007). Near linear time algorithm to detect community structures in large-scale networks. Physical review E, 76(3), 036106.
- fit(input_matrix: Union[scipy.sparse._csr.csr_matrix, numpy.ndarray]) sknetwork.clustering.propagation_clustering.PropagationClustering [source]
Clustering by label propagation.
- Parameters
input_matrix – Adjacency matrix or biadjacency matrix of the graph.
- Returns
self
- Return type
- fit_predict(*args, **kwargs) numpy.ndarray
Fit algorithm to the data and return the labels. Same parameters as the
fit
method.- Returns
labels – Labels.
- Return type
np.ndarray
- fit_predict_proba(*args, **kwargs) numpy.ndarray
Fit algorithm to the data and return the probability distribution over labels. Same parameters as the
fit
method.- Returns
probs – Probability of each label.
- Return type
np.ndarray
- fit_transform(*args, **kwargs) numpy.ndarray
Fit algorithm to the data and return the membership matrix. Same parameters as the
fit
method.- Returns
membership – Membership matrix (distribution over clusters).
- Return type
np.ndarray
- get_params()
Get parameters as dictionary.
- Returns
params – Parameters of the algorithm.
- Return type
dict
- predict(columns=False) numpy.ndarray
Return the labels predicted by the algorithm.
- Parameters
columns (bool) – If
True
, return the prediction for columns.- Returns
labels – Labels.
- Return type
np.ndarray
- predict_proba(columns=False) numpy.ndarray
Return the probability distribution over labels as predicted by the algorithm.
- Parameters
columns (bool) – If
True
, return the prediction for columns.- Returns
probs – Probability distribution over labels.
- Return type
np.ndarray
- set_params(params: dict) sknetwork.base.Algorithm
Set parameters of the algorithm.
- Parameters
params (dict) – Parameters of the algorithm.
- Returns
self
- Return type
Algorithm
- transform(columns=False) scipy.sparse._csr.csr_matrix
Return the probability distribution over labels in sparse format.
- Parameters
columns (bool) – If
True
, return the prediction for columns.- Returns
probs – Probability distribution over labels.
- Return type
sparse.csr_matrix
Post-processing
- sknetwork.clustering.reindex_labels(labels: numpy.ndarray) numpy.ndarray [source]
Reindex clusters in decreasing order of size.
- Parameters
labels – Label of each node.
- Returns
new_labels – New label of each node.
- Return type
np.ndarray
Example
>>> from sknetwork.clustering import reindex_labels >>> labels = np.array([0, 1, 1]) >>> reindex_labels(labels) array([1, 0, 0])
- sknetwork.clustering.aggregate_graph(input_matrix: scipy.sparse._csr.csr_matrix, labels: Optional[numpy.ndarray] = None, labels_row: Optional[numpy.ndarray] = None, labels_col: Optional[numpy.ndarray] = None) scipy.sparse._csr.csr_matrix [source]
Aggregate graph per label. All nodes with the same label become a single node. Negative labels are ignored (corresponding nodes are not discarded).
- Parameters
input_matrix (sparse matrix) – Adjacency or biadjacency matrix of the graph.
labels (np.ndarray) – Labels of nodes.
labels_row (np.ndarray) – Labels of rows (for bipartite graphs). Alias for labels.
labels_col (np.ndarray) – Labels of columns (for bipartite graphs).
Metrics
- sknetwork.clustering.get_modularity(input_matrix: Union[scipy.sparse._csr.csr_matrix, numpy.ndarray], labels: numpy.ndarray, labels_col: Optional[numpy.ndarray] = None, weights: str = 'degree', resolution: float = 1, return_all: bool = False) Union[float, Tuple[float, float, float]] [source]
Modularity of a clustering.
The modularity of a clustering is
\(Q = \dfrac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \dfrac{w_iw_j}{w}\right)\delta_{c_i,c_j}\) for graphs,
\(Q = \dfrac{1}{w} \sum_{i,j}\left(A_{ij} - \gamma \dfrac{d^+_id^-_j}{w}\right)\delta_{c_i,c_j}\) for directed graphs,
where
\(c_i\) is the cluster of node \(i\),
\(w_i\) is the weight of node \(i\),
\(w^+_i, w^-_i\) are the out-weight, in-weight of node \(i\) (for directed graphs),
\(w = 1^TA1\) is the total weight,
\(\delta\) is the Kronecker symbol,
\(\gamma \ge 0\) is the resolution parameter.
- Parameters
input_matrix – Adjacency matrix or biadjacency matrix of the graph.
labels – Labels of nodes.
labels_col – Labels of column nodes (for bipartite graphs).
weights – Weighting of nodes (
'degree'
(default) or'uniform'
).resolution – Resolution parameter (default = 1).
return_all – If
True
, return modularity, fit, diversity.
- Returns
modularity (float)
fit (float, optional)
diversity (float, optional)
Example
>>> from sknetwork.clustering import get_modularity >>> from sknetwork.data import house >>> adjacency = house() >>> labels = np.array([0, 0, 1, 1, 0]) >>> np.round(get_modularity(adjacency, labels), 2) 0.11