Tutorial

This notebook gives an overview of scikit-network, a open-source Python package for machine learning on graphs.

1. Installation

The easiest way to install scikit-network is through pip.

[1]:
# uncomment to install
# !pip install scikit-network

2. Data

Adjacency matrix

Each graph with \(n\) nodes is represented by a square adjacency matrix \(A\) of size \(n \times n\). In its simplest form, the matrix \(A\) is binary and we have \(A_{ij} =1\) if and only if there is a link from node \(i\) to node \(j\). If the graph is undirected, the matrix \(A\) is symmetric.

[2]:
import numpy as np
[3]:
adjacency = np.array([[0, 1, 1, 1, 0], [1, 0, 0, 0, 1], [1, 0, 0, 1, 0], [1, 0, 1, 0, 1], [0, 1, 0, 1, 0]])
[4]:
adjacency
[4]:
array([[0, 1, 1, 1, 0],
       [1, 0, 0, 0, 1],
       [1, 0, 0, 1, 0],
       [1, 0, 1, 0, 1],
       [0, 1, 0, 1, 0]])

Sparse matrix

In scikit-network, the adajcency matrix is represented in the sparse CSR format of scipy.

[5]:
from scipy import sparse
[6]:
adjacency = sparse.csr_matrix(adjacency)
[7]:
adjacency
[7]:
<5x5 sparse matrix of type '<class 'numpy.int64'>'
        with 12 stored elements in Compressed Sparse Row format>

Visualization

Here is the visualization of the above graph.

[8]:
from IPython.display import SVG
from sknetwork.visualization import visualize_graph
[9]:
# name nodes by indices
n_nodes, _ = adjacency.shape
names = np.arange(n_nodes)
[10]:
# visualization
SVG(visualize_graph(adjacency, names=names))
[10]:
../../_images/tutorials_overview_get_started_16_0.svg

Directed graphs

The adjacency matrix is not necessarily symmetric. There might be a link from node \(i\) to node \(j\) but not from node \(j\) to node \(i\).

[11]:
adjacency = np.array([[0, 1, 1, 1, 0], [0, 0, 0, 1, 1], [0, 1, 0, 0, 0], [1, 0, 0, 0, 1], [0, 1, 0, 0, 0]])
[12]:
adjacency
[12]:
array([[0, 1, 1, 1, 0],
       [0, 0, 0, 1, 1],
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 1],
       [0, 1, 0, 0, 0]])
[13]:
adjacency = sparse.csr_matrix(adjacency)
[14]:
# visualization
SVG(visualize_graph(adjacency, names=names))
[14]:
../../_images/tutorials_overview_get_started_21_0.svg

Bipartite graphs

Bipartite graphs are special graphs with edges between two sets of nodes. A bipartite graph can be represented by its biadjacency matrix \(B\), a rectangular matrix of links between the two sets of nodes. We have \(B_{ij}=1\) if and only if there is an edge between node \(i\) of the first set (represented by the rows of \(B\)) and node \(j\) of the second set (represented by the columns of \(B\)).

[15]:
biadjacency = np.array([[1, 1, 0, 1, 0], [0, 0, 1, 0, 1], [1, 1, 1, 1, 0], [0, 0, 0, 1, 1]])
[16]:
biadjacency
[16]:
array([[1, 1, 0, 1, 0],
       [0, 0, 1, 0, 1],
       [1, 1, 1, 1, 0],
       [0, 0, 0, 1, 1]])
[17]:
biadjacency = sparse.csr_matrix(biadjacency)
[18]:
biadjacency
[18]:
<4x5 sparse matrix of type '<class 'numpy.int64'>'
        with 11 stored elements in Compressed Sparse Row format>
[19]:
from sknetwork.visualization import visualize_bigraph
[20]:
# name nodes by indices
n_row, n_col = biadjacency.shape
names_row = np.arange(n_row)
names_col = np.arange(n_col)
[21]:
# visualization
SVG(visualize_bigraph(biadjacency, names_row=names_row, names_col=names_col, reorder=False))
[21]:
../../_images/tutorials_overview_get_started_29_0.svg

Weighted graphs

Weights on edges can be represented by an adjacency matrix with non-negative entries (or a biadjacency matrix for a bipartite graph).

[22]:
adjacency = np.array([[0, 2, 1, 3, 0], [2, 0, 0, 0, 3], [1, 0, 0, 2, 0], [3, 0, 2, 0, 1], [0, 3, 0, 1, 0]])
[23]:
adjacency
[23]:
array([[0, 2, 1, 3, 0],
       [2, 0, 0, 0, 3],
       [1, 0, 0, 2, 0],
       [3, 0, 2, 0, 1],
       [0, 3, 0, 1, 0]])
[24]:
adjacency = sparse.csr_matrix(adjacency)
[25]:
# visualization
SVG(visualize_graph(adjacency, names=names))
[25]:
../../_images/tutorials_overview_get_started_34_0.svg

Subgraphs

Subgraphs can easily be obtained by slicing the adjacency matrix.

[26]:
index = [0, 2, 3, 4]
sub_adjacency = adjacency[index][:, index]
[27]:
# visualization
SVG(visualize_graph(sub_adjacency, names=names[index]))
[27]:
../../_images/tutorials_overview_get_started_37_0.svg

3. Algorithms

Basic tools

Some basic tools for loading, processing and visualizing graphs are available as functions.

Module

Description

Functions

Data

Loading and saving graphs

from_csv, save, load, load_netset, …

Topology

Connectivity and structure

get_connected_components, is_acyclic, …

Path

Shortest paths and distances

get_distances, get_shortest_path, …

Linear algebra

Matrix operations

normalize, diagonal_pseudo_inverse, …

Utils

Useful tools

directed2undirected, get_degrees, get_neighbors, …

Visualization

Visualization tools

visualize_graph, svg_bigraphs, …

[28]:
from sknetwork.data import karate_club
from sknetwork.utils import get_degrees
[29]:
adjacency = karate_club()
[30]:
get_degrees(adjacency)
[30]:
array([16,  9, 10,  6,  3,  4,  4,  4,  5,  2,  3,  1,  2,  5,  2,  2,  2,
        2,  2,  3,  2,  2,  2,  5,  3,  3,  2,  4,  3,  4,  4,  6, 12, 17],
      dtype=int32)

Machine learning

Scikit-network is suitable for machine learning on graphs.

Each algorithm is available as an object with some methods among the following:

  • fit: fit the algorithm to the graph. This method is mandatory.

  • predict: predict the output (e.g., labels of nodes).

  • predict_proba: predict the output with probability (e.g., probability distribution over labels).

  • transform: transform the graph.

  • fit_predict: apply fit + predict.

  • fit_predict_proba: apply fit + predict_proba.

  • fit_transform: apply fit + transform.

The output is an attribute of the object with an underscore (e.g., labels_).

Module

Description

Algorithms

Output

Clustering

Form clusters of nodes

Louvain, PropagationClustering

labels_

Classification

Classify nodes

DiffusionClassifier, NNClassifier, …

labels_

GNN

Graph neural networks

GNNClassifier

labels_

Regression

Assign values to nodes

Diffusion, Dirichlet

values_

Hierarchy

Get a hierarchical representation of nodes

Paris, LouvainHierarchy

dendrogram_

Embedding

Get a vectorial representation of nodes

Spectral, SVD, LouvainEmbedding, …

embedding_

Ranking

Give importance scores to nodes

PageRank, Katz, Betweenness, …

scores_

Link prediction

Predict links between nodes

NNLinker

links_

[31]:
from sknetwork.data import karate_club
from sknetwork.classification import DiffusionClassifier
[32]:
dataset = karate_club(metadata=True)
adjacency = dataset.adjacency
position = dataset.position
labels = dataset.labels
[33]:
classifier = DiffusionClassifier()
classifier.fit(adjacency, {node: labels[node] for node in [0, 1, 33]})
[33]:
DiffusionClassifier(n_iter=10, centering=True, scale=5)
[34]:
labels_pred = classifier.predict()
[35]:
SVG(visualize_graph(adjacency, position, labels=labels_pred))
[35]:
../../_images/tutorials_overview_get_started_51_0.svg
[36]:
probs = classifier.predict_proba()
[37]:
scores = probs[:, 1]
[38]:
SVG(visualize_graph(adjacency, position, scores=scores))
[38]:
../../_images/tutorials_overview_get_started_54_0.svg

4. Example

We here show how to use scikit-network to cluster a graph, initially stored in a text file as a list of edges.

Data

[39]:
# show first lines
with open('miserables.tsv', 'r') as f:
    data = f.readlines()
    print(''.join(data[:5]))
Myriel  Napoleon
Myriel  Mlle Baptistine
Myriel  Mme Magloire
Myriel  Countess de Lo
Myriel  Geborand

Loading

[40]:
from sknetwork.data import from_csv
[41]:
graph = from_csv('miserables.tsv')
[42]:
adjacency = graph.adjacency
names = graph.names
[43]:
adjacency
[43]:
<77x77 sparse matrix of type '<class 'numpy.int64'>'
        with 508 stored elements in Compressed Sparse Row format>

Embedding

We here compute a 2D embedding for visualization.

[44]:
from sknetwork.embedding import Spring
[45]:
algorithm = Spring()
[46]:
position = algorithm.fit_transform(adjacency)
[47]:
SVG(visualize_graph(adjacency, position, names=names))
[47]:
../../_images/tutorials_overview_get_started_67_0.svg

Clustering

Finally, we cluster the graph by the Louvain algorithm.

[48]:
from sknetwork.clustering import Louvain
[49]:
algorithm = Louvain()
labels = algorithm.fit_predict(adjacency)
[50]:
SVG(visualize_graph(adjacency, position, names=names, labels=labels))
[50]:
../../_images/tutorials_overview_get_started_72_0.svg

Bipartite graphs

Observe that most algorithms apply to bipartite graphs, with the fit method applied to the biadjacency matrix. We show an example with a bipartite graph between movies and actors.

[51]:
# show first lines
with open('movie_actor.tsv', 'r') as f:
    data = f.readlines()
    print(''.join(data[:5]))
Inception       Leonardo DiCaprio
Inception       Marion Cotillard
Inception       Joseph Gordon Lewitt
The Dark Knight Rises   Marion Cotillard
The Dark Knight Rises   Joseph Gordon Lewitt

[52]:
graph = from_csv('movie_actor.tsv', bipartite=True)
[53]:
biadjacency = graph.biadjacency
movies = graph.names_row
actors = graph.names_col
[54]:
algorithm = Louvain()
algorithm.fit(biadjacency)
labels_row = algorithm.labels_row_
labels_col = algorithm.labels_col_
[55]:
SVG(visualize_bigraph(biadjacency, names_row=movies, names_col=actors, labels_row=labels_row, labels_col=labels_col))

[55]:
../../_images/tutorials_overview_get_started_79_0.svg