Tutorial

This notebook gives an overview of scikit-network, a open-source Python package for machine learning on graphs.

1. Installation

The easiest way to install scikit-network is through pip.

[1]:

# uncomment to install
# !pip install scikit-network

2. Data

Adjacency matrix

Each graph with \(n\) nodes is represented by a square adjacency matrix \(A\) of size \(n \times n\). In its simplest form, the matrix \(A\) is binary and we have \(A_{ij} =1\) if and only if there is a link from node \(i\) to node \(j\). If the graph is undirected, the matrix \(A\) is symmetric.

[2]:

import numpy as np

[3]:

adjacency = np.array([[0, 1, 1, 1, 0], [1, 0, 0, 0, 1], [1, 0, 0, 1, 0], [1, 0, 1, 0, 1], [0, 1, 0, 1, 0]])

[4]:

adjacency

[4]:

array([[0, 1, 1, 1, 0],
       [1, 0, 0, 0, 1],
       [1, 0, 0, 1, 0],
       [1, 0, 1, 0, 1],
       [0, 1, 0, 1, 0]])

Sparse matrix

In scikit-network, the adajcency matrix is represented in the sparse CSR format of scipy.

[5]:

from scipy import sparse

[6]:

adjacency = sparse.csr_matrix(adjacency)

[7]:

adjacency

[7]:

<5x5 sparse matrix of type '<class 'numpy.int64'>'
        with 12 stored elements in Compressed Sparse Row format>

Visualization

Here is the visualization of the above graph.

[8]:

from IPython.display import SVG
from sknetwork.visualization import visualize_graph

[9]:

# name nodes by indices
n_nodes, _ = adjacency.shape
names = np.arange(n_nodes)

[10]:

# visualization
SVG(visualize_graph(adjacency, names=names))

[10]:

../../_images/tutorials_overview_get_started_16_0.svg

Directed graphs

The adjacency matrix is not necessarily symmetric. There might be a link from node \(i\) to node \(j\) but not from node \(j\) to node \(i\).

[11]:

adjacency = np.array([[0, 1, 1, 1, 0], [0, 0, 0, 1, 1], [0, 1, 0, 0, 0], [1, 0, 0, 0, 1], [0, 1, 0, 0, 0]])

[12]:

adjacency

[12]:

array([[0, 1, 1, 1, 0],
       [0, 0, 0, 1, 1],
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 1],
       [0, 1, 0, 0, 0]])

[13]:

adjacency = sparse.csr_matrix(adjacency)

[14]:

# visualization
SVG(visualize_graph(adjacency, names=names))

[14]:

../../_images/tutorials_overview_get_started_21_0.svg

Bipartite graphs

Bipartite graphs are special graphs with edges between two sets of nodes. A bipartite graph can be represented by its biadjacency matrix \(B\), a rectangular matrix of links between the two sets of nodes. We have \(B_{ij}=1\) if and only if there is an edge between node \(i\) of the first set (represented by the rows of \(B\)) and node \(j\) of the second set (represented by the columns of \(B\)).

[15]:

biadjacency = np.array([[1, 1, 0, 1, 0], [0, 0, 1, 0, 1], [1, 1, 1, 1, 0], [0, 0, 0, 1, 1]])

[16]:

biadjacency

[16]:

array([[1, 1, 0, 1, 0],
       [0, 0, 1, 0, 1],
       [1, 1, 1, 1, 0],
       [0, 0, 0, 1, 1]])

[17]:

biadjacency = sparse.csr_matrix(biadjacency)

[18]:

biadjacency

[18]:

<4x5 sparse matrix of type '<class 'numpy.int64'>'
        with 11 stored elements in Compressed Sparse Row format>

[19]:

from sknetwork.visualization import visualize_bigraph

[20]:

# name nodes by indices
n_row, n_col = biadjacency.shape
names_row = np.arange(n_row)
names_col = np.arange(n_col)

[21]:

# visualization
SVG(visualize_bigraph(biadjacency, names_row=names_row, names_col=names_col, reorder=False))

[21]:

../../_images/tutorials_overview_get_started_29_0.svg

Weighted graphs

Weights on edges can be represented by an adjacency matrix with non-negative entries (or a biadjacency matrix for a bipartite graph).

[22]:

adjacency = np.array([[0, 2, 1, 3, 0], [2, 0, 0, 0, 3], [1, 0, 0, 2, 0], [3, 0, 2, 0, 1], [0, 3, 0, 1, 0]])

[23]:

adjacency

[23]:

array([[0, 2, 1, 3, 0],
       [2, 0, 0, 0, 3],
       [1, 0, 0, 2, 0],
       [3, 0, 2, 0, 1],
       [0, 3, 0, 1, 0]])

[24]:

adjacency = sparse.csr_matrix(adjacency)

[25]:

# visualization
SVG(visualize_graph(adjacency, names=names))

[25]:

../../_images/tutorials_overview_get_started_34_0.svg

Subgraphs

Subgraphs can easily be obtained by slicing the adjacency matrix.

[26]:

index = [0, 2, 3, 4]
sub_adjacency = adjacency[index][:, index]

[27]:

# visualization
SVG(visualize_graph(sub_adjacency, names=names[index]))

[27]:

../../_images/tutorials_overview_get_started_37_0.svg

3. Algorithms

Basic tools

Some basic tools for loading, processing and visualizing graphs are available as functions.

Module	Description	Functions
Data	Loading and saving graphs	`from_csv`, `save`, `load`, `load_netset`, …
Topology	Connectivity and structure	`get_connected_components`, `is_acyclic`, …
Path	Shortest paths and distances	`get_distances`, `get_shortest_path`, …
Linear algebra	Matrix operations	`normalize`, `diagonal_pseudo_inverse`, …
Utils	Useful tools	`directed2undirected`, `get_degrees`, `get_neighbors`, …
Visualization	Visualization tools	`visualize_graph`, `svg_bigraphs`, …

[28]:

from sknetwork.data import karate_club
from sknetwork.utils import get_degrees

[29]:

adjacency = karate_club()

[30]:

get_degrees(adjacency)

[30]:

array([16,  9, 10,  6,  3,  4,  4,  4,  5,  2,  3,  1,  2,  5,  2,  2,  2,
        2,  2,  3,  2,  2,  2,  5,  3,  3,  2,  4,  3,  4,  4,  6, 12, 17],
      dtype=int32)

Machine learning

Scikit-network is suitable for machine learning on graphs.

Each algorithm is available as an object with some methods among the following:

fit: fit the algorithm to the graph. This method is mandatory.
predict: predict the output (e.g., labels of nodes).
predict_proba: predict the output with probability (e.g., probability distribution over labels).
transform: transform the graph.
fit_predict: apply fit + predict.
fit_predict_proba: apply fit + predict_proba.
fit_transform: apply fit + transform.

The output is an attribute of the object with an underscore (e.g., labels_).

Module	Description	Algorithms	Output
Clustering	Form clusters of nodes	`Louvain`, `PropagationClustering`	`labels_`
Classification	Classify nodes	`DiffusionClassifier`, `NNClassifier`, …	`labels_`
GNN	Graph neural networks	`GNNClassifier`	`labels_`
Regression	Assign values to nodes	`Diffusion`, `Dirichlet`	`values_`
Hierarchy	Get a hierarchical representation of nodes	`Paris`, `LouvainHierarchy`	`dendrogram_`
Embedding	Get a vectorial representation of nodes	`Spectral`, `SVD`, `LouvainEmbedding`, …	`embedding_`
Ranking	Give importance scores to nodes	`PageRank`, `Katz`, `Betweenness`, …	`scores_`
Link prediction	Predict links between nodes	`NNLinker`	`links_`

[31]:

from sknetwork.data import karate_club
from sknetwork.classification import DiffusionClassifier

[32]:

dataset = karate_club(metadata=True)
adjacency = dataset.adjacency
position = dataset.position
labels = dataset.labels

[33]:

classifier = DiffusionClassifier()
classifier.fit(adjacency, {node: labels[node] for node in [0, 1, 33]})

[33]:

DiffusionClassifier(n_iter=10, centering=True, scale=5)

[34]:

labels_pred = classifier.predict()

[35]:

SVG(visualize_graph(adjacency, position, labels=labels_pred))

[35]:

../../_images/tutorials_overview_get_started_51_0.svg

[36]:

probs = classifier.predict_proba()

[37]:

scores = probs[:, 1]

[38]:

SVG(visualize_graph(adjacency, position, scores=scores))

[38]:

../../_images/tutorials_overview_get_started_54_0.svg

4. Example

We here show how to use scikit-network to cluster a graph, initially stored in a text file as a list of edges.

Data

[39]:

# show first lines
with open('miserables.tsv', 'r') as f:
    data = f.readlines()
    print(''.join(data[:5]))

Myriel  Napoleon
Myriel  Mlle Baptistine
Myriel  Mme Magloire
Myriel  Countess de Lo
Myriel  Geborand

Loading

[40]:

from sknetwork.data import from_csv

[41]:

graph = from_csv('miserables.tsv')

[42]:

adjacency = graph.adjacency
names = graph.names

[43]:

adjacency

[43]:

<77x77 sparse matrix of type '<class 'numpy.int64'>'
        with 508 stored elements in Compressed Sparse Row format>

Embedding

We here compute a 2D embedding for visualization.

[44]:

from sknetwork.embedding import Spring

[45]:

algorithm = Spring()

[46]:

position = algorithm.fit_transform(adjacency)

[47]:

SVG(visualize_graph(adjacency, position, names=names))

[47]:

../../_images/tutorials_overview_get_started_67_0.svg

Clustering

Finally, we cluster the graph by the Louvain algorithm.

[48]:

from sknetwork.clustering import Louvain

[49]:

algorithm = Louvain()
labels = algorithm.fit_predict(adjacency)

[50]:

SVG(visualize_graph(adjacency, position, names=names, labels=labels))

[50]:

../../_images/tutorials_overview_get_started_72_0.svg

Bipartite graphs

Observe that most algorithms apply to bipartite graphs, with the fit method applied to the biadjacency matrix. We show an example with a bipartite graph between movies and actors.

[51]:

# show first lines
with open('movie_actor.tsv', 'r') as f:
    data = f.readlines()
    print(''.join(data[:5]))

Inception       Leonardo DiCaprio
Inception       Marion Cotillard
Inception       Joseph Gordon Lewitt
The Dark Knight Rises   Marion Cotillard
The Dark Knight Rises   Joseph Gordon Lewitt

[52]:

graph = from_csv('movie_actor.tsv', bipartite=True)

[53]:

biadjacency = graph.biadjacency
movies = graph.names_row
actors = graph.names_col

[54]:

algorithm = Louvain()
algorithm.fit(biadjacency)
labels_row = algorithm.labels_row_
labels_col = algorithm.labels_col_

[55]:

SVG(visualize_bigraph(biadjacency, names_row=movies, names_col=actors, labels_row=labels_row, labels_col=labels_col))

[55]:

../../_images/tutorials_overview_get_started_79_0.svg