Tutorial
This notebook gives an overview of scikit-network, a open-source Python package for machine learning on graphs.
1. Installation
The easiest way to install scikit-network
is through pip
.
[1]:
# uncomment to install
# !pip install scikit-network
2. Data
Adjacency matrix
Each graph with \(n\) nodes is represented by a square adjacency matrix \(A\) of size \(n \times n\). In its simplest form, the matrix \(A\) is binary and we have \(A_{ij} =1\) if and only if there is a link from node \(i\) to node \(j\). If the graph is undirected, the matrix \(A\) is symmetric.
[2]:
import numpy as np
[3]:
adjacency = np.array([[0, 1, 1, 1, 0], [1, 0, 0, 0, 1], [1, 0, 0, 1, 0], [1, 0, 1, 0, 1], [0, 1, 0, 1, 0]])
[4]:
adjacency
[4]:
array([[0, 1, 1, 1, 0],
[1, 0, 0, 0, 1],
[1, 0, 0, 1, 0],
[1, 0, 1, 0, 1],
[0, 1, 0, 1, 0]])
Sparse matrix
In scikit-network, the adajcency matrix is represented in the sparse CSR format of scipy.
[5]:
from scipy import sparse
[6]:
adjacency = sparse.csr_matrix(adjacency)
[7]:
adjacency
[7]:
<5x5 sparse matrix of type '<class 'numpy.int64'>'
with 12 stored elements in Compressed Sparse Row format>
Visualization
Here is the visualization of the above graph.
[8]:
from IPython.display import SVG
from sknetwork.visualization import visualize_graph
[9]:
# name nodes by indices
n_nodes, _ = adjacency.shape
names = np.arange(n_nodes)
[10]:
# visualization
SVG(visualize_graph(adjacency, names=names))
[10]:
Directed graphs
The adjacency matrix is not necessarily symmetric. There might be a link from node \(i\) to node \(j\) but not from node \(j\) to node \(i\).
[11]:
adjacency = np.array([[0, 1, 1, 1, 0], [0, 0, 0, 1, 1], [0, 1, 0, 0, 0], [1, 0, 0, 0, 1], [0, 1, 0, 0, 0]])
[12]:
adjacency
[12]:
array([[0, 1, 1, 1, 0],
[0, 0, 0, 1, 1],
[0, 1, 0, 0, 0],
[1, 0, 0, 0, 1],
[0, 1, 0, 0, 0]])
[13]:
adjacency = sparse.csr_matrix(adjacency)
[14]:
# visualization
SVG(visualize_graph(adjacency, names=names))
[14]:
Bipartite graphs
Bipartite graphs are special graphs with edges between two sets of nodes. A bipartite graph can be represented by its biadjacency matrix \(B\), a rectangular matrix of links between the two sets of nodes. We have \(B_{ij}=1\) if and only if there is an edge between node \(i\) of the first set (represented by the rows of \(B\)) and node \(j\) of the second set (represented by the columns of \(B\)).
[15]:
biadjacency = np.array([[1, 1, 0, 1, 0], [0, 0, 1, 0, 1], [1, 1, 1, 1, 0], [0, 0, 0, 1, 1]])
[16]:
biadjacency
[16]:
array([[1, 1, 0, 1, 0],
[0, 0, 1, 0, 1],
[1, 1, 1, 1, 0],
[0, 0, 0, 1, 1]])
[17]:
biadjacency = sparse.csr_matrix(biadjacency)
[18]:
biadjacency
[18]:
<4x5 sparse matrix of type '<class 'numpy.int64'>'
with 11 stored elements in Compressed Sparse Row format>
[19]:
from sknetwork.visualization import visualize_bigraph
[20]:
# name nodes by indices
n_row, n_col = biadjacency.shape
names_row = np.arange(n_row)
names_col = np.arange(n_col)
[21]:
# visualization
SVG(visualize_bigraph(biadjacency, names_row=names_row, names_col=names_col, reorder=False))
[21]:
Weighted graphs
Weights on edges can be represented by an adjacency matrix with non-negative entries (or a biadjacency matrix for a bipartite graph).
[22]:
adjacency = np.array([[0, 2, 1, 3, 0], [2, 0, 0, 0, 3], [1, 0, 0, 2, 0], [3, 0, 2, 0, 1], [0, 3, 0, 1, 0]])
[23]:
adjacency
[23]:
array([[0, 2, 1, 3, 0],
[2, 0, 0, 0, 3],
[1, 0, 0, 2, 0],
[3, 0, 2, 0, 1],
[0, 3, 0, 1, 0]])
[24]:
adjacency = sparse.csr_matrix(adjacency)
[25]:
# visualization
SVG(visualize_graph(adjacency, names=names))
[25]:
Subgraphs
Subgraphs can easily be obtained by slicing the adjacency matrix.
[26]:
index = [0, 2, 3, 4]
sub_adjacency = adjacency[index][:, index]
[27]:
# visualization
SVG(visualize_graph(sub_adjacency, names=names[index]))
[27]:
3. Algorithms
Basic tools
Some basic tools for loading, processing and visualizing graphs are available as functions.
Module |
Description |
Functions |
---|---|---|
Data |
Loading and saving graphs |
|
Topology |
Connectivity and structure |
|
Path |
Shortest paths and distances |
|
Linear algebra |
Matrix operations |
|
Utils |
Useful tools |
|
Visualization |
Visualization tools |
|
[28]:
from sknetwork.data import karate_club
from sknetwork.utils import get_degrees
[29]:
adjacency = karate_club()
[30]:
get_degrees(adjacency)
[30]:
array([16, 9, 10, 6, 3, 4, 4, 4, 5, 2, 3, 1, 2, 5, 2, 2, 2,
2, 2, 3, 2, 2, 2, 5, 3, 3, 2, 4, 3, 4, 4, 6, 12, 17],
dtype=int32)
Machine learning
Scikit-network is suitable for machine learning on graphs.
Each algorithm is available as an object with some methods among the following:
fit
: fit the algorithm to the graph. This method is mandatory.predict
: predict the output (e.g., labels of nodes).predict_proba
: predict the output with probability (e.g., probability distribution over labels).transform
: transform the graph.fit_predict
: apply fit + predict.fit_predict_proba
: apply fit + predict_proba.fit_transform
: apply fit + transform.
The output is an attribute of the object with an underscore (e.g., labels_
).
Module |
Description |
Algorithms |
Output |
---|---|---|---|
Clustering |
Form clusters of nodes |
|
|
Classification |
Classify nodes |
|
|
GNN |
Graph neural networks |
|
|
Regression |
Assign values to nodes |
|
|
Hierarchy |
Get a hierarchical representation of nodes |
|
|
Embedding |
Get a vectorial representation of nodes |
|
|
Ranking |
Give importance scores to nodes |
|
|
Link prediction |
Predict links between nodes |
|
|
[31]:
from sknetwork.data import karate_club
from sknetwork.classification import DiffusionClassifier
[32]:
dataset = karate_club(metadata=True)
adjacency = dataset.adjacency
position = dataset.position
labels = dataset.labels
[33]:
classifier = DiffusionClassifier()
classifier.fit(adjacency, {node: labels[node] for node in [0, 1, 33]})
[33]:
DiffusionClassifier(n_iter=10, centering=True, scale=5)
[34]:
labels_pred = classifier.predict()
[35]:
SVG(visualize_graph(adjacency, position, labels=labels_pred))
[35]:
[36]:
probs = classifier.predict_proba()
[37]:
scores = probs[:, 1]
[38]:
SVG(visualize_graph(adjacency, position, scores=scores))
[38]:
4. Example
We here show how to use scikit-network to cluster a graph, initially stored in a text file as a list of edges.
Data
[39]:
# show first lines
with open('miserables.tsv', 'r') as f:
data = f.readlines()
print(''.join(data[:5]))
Myriel Napoleon
Myriel Mlle Baptistine
Myriel Mme Magloire
Myriel Countess de Lo
Myriel Geborand
Loading
[40]:
from sknetwork.data import from_csv
[41]:
graph = from_csv('miserables.tsv')
[42]:
adjacency = graph.adjacency
names = graph.names
[43]:
adjacency
[43]:
<77x77 sparse matrix of type '<class 'numpy.int64'>'
with 508 stored elements in Compressed Sparse Row format>
Embedding
We here compute a 2D embedding for visualization.
[44]:
from sknetwork.embedding import Spring
[45]:
algorithm = Spring()
[46]:
position = algorithm.fit_transform(adjacency)
[47]:
SVG(visualize_graph(adjacency, position, names=names))
[47]:
Clustering
Finally, we cluster the graph by the Louvain algorithm.
[48]:
from sknetwork.clustering import Louvain
[49]:
algorithm = Louvain()
labels = algorithm.fit_predict(adjacency)
[50]:
SVG(visualize_graph(adjacency, position, names=names, labels=labels))
[50]:
Bipartite graphs
Observe that most algorithms apply to bipartite graphs, with the fit
method applied to the biadjacency matrix. We show an example with a bipartite graph between movies and actors.
[51]:
# show first lines
with open('movie_actor.tsv', 'r') as f:
data = f.readlines()
print(''.join(data[:5]))
Inception Leonardo DiCaprio
Inception Marion Cotillard
Inception Joseph Gordon Lewitt
The Dark Knight Rises Marion Cotillard
The Dark Knight Rises Joseph Gordon Lewitt
[52]:
graph = from_csv('movie_actor.tsv', bipartite=True)
[53]:
biadjacency = graph.biadjacency
movies = graph.names_row
actors = graph.names_col
[54]:
algorithm = Louvain()
algorithm.fit(biadjacency)
labels_row = algorithm.labels_row_
labels_col = algorithm.labels_col_
[55]:
SVG(visualize_bigraph(biadjacency, names_row=movies, names_col=actors, labels_row=labels_row, labels_col=labels_col))
[55]: