{ "cells": [ { "cell_type": "markdown", "id": "53b1c091", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "# Tutorial" ] }, { "cell_type": "markdown", "id": "b7edcced", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "This notebook gives an overview of [scikit-network](https://scikit-network.readthedocs.io/), a open-source Python package for machine learning on graphs." ] }, { "cell_type": "markdown", "id": "45cd6a2f", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## 1. Installation\n", "\n", "The easiest way to install ``scikit-network`` is through ``pip``." ] }, { "cell_type": "code", "execution_count": null, "id": "2db557a2", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# uncomment to install\n", "# !pip install scikit-network" ] }, { "cell_type": "markdown", "id": "e9fc8e15", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## 2. Data" ] }, { "cell_type": "markdown", "id": "723af34f", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Adjacency matrix\n", "\n", "Each graph with $n$ nodes is represented by a square adjacency matrix $A$ of size $n \\times n$. In its simplest form, the matrix $A$ is binary and we have $A_{ij} =1$ if and only if there is a link from node $i$ to node $j$. If the graph is undirected, the matrix $A$ is symmetric." ] }, { "cell_type": "code", "execution_count": null, "id": "4b7741f1", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "id": "362e022b", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "adjacency = np.array([[0, 1, 1, 1, 0], [1, 0, 0, 0, 1], [1, 0, 0, 1, 0], [1, 0, 1, 0, 1], [0, 1, 0, 1, 0]])" ] }, { "cell_type": "code", "execution_count": null, "id": "ce44e13b", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "adjacency" ] }, { "cell_type": "markdown", "id": "448ad02d", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Sparse matrix\n", "\n", "In scikit-network, the adajcency matrix is represented in the [sparse CSR format](https://en.wikipedia.org/wiki/Sparse_matrix) of scipy." ] }, { "cell_type": "code", "execution_count": null, "id": "7cc765b5", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from scipy import sparse" ] }, { "cell_type": "code", "execution_count": null, "id": "e7ae4627", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "adjacency = sparse.csr_matrix(adjacency)" ] }, { "cell_type": "code", "execution_count": null, "id": "56246a7c", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "adjacency" ] }, { "cell_type": "markdown", "id": "d12dcb3e", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Visualization\n", "\n", "Here is the visualization of the above graph." ] }, { "cell_type": "code", "execution_count": null, "id": "bdb5880b", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from IPython.display import SVG\n", "from sknetwork.visualization import visualize_graph" ] }, { "cell_type": "code", "execution_count": null, "id": "2e92f346", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# name nodes by indices\n", "n_nodes, _ = adjacency.shape\n", "names = np.arange(n_nodes)" ] }, { "cell_type": "code", "execution_count": null, "id": "f47e30c0", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# visualization\n", "SVG(visualize_graph(adjacency, names=names))" ] }, { "cell_type": "markdown", "id": "8d0be1ad", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Directed graphs\n", "\n", "The adjacency matrix is not necessarily symmetric. There might be a link from node $i$ to node $j$ but not from node $j$ to node $i$." ] }, { "cell_type": "code", "execution_count": null, "id": "a629f5a6", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "adjacency = np.array([[0, 1, 1, 1, 0], [0, 0, 0, 1, 1], [0, 1, 0, 0, 0], [1, 0, 0, 0, 1], [0, 1, 0, 0, 0]])" ] }, { "cell_type": "code", "execution_count": null, "id": "57450d57", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "adjacency" ] }, { "cell_type": "code", "execution_count": null, "id": "4edb7c8a", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "adjacency = sparse.csr_matrix(adjacency)" ] }, { "cell_type": "code", "execution_count": null, "id": "e1133898", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# visualization\n", "SVG(visualize_graph(adjacency, names=names))" ] }, { "cell_type": "markdown", "id": "7e6c1fb1", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Bipartite graphs\n", "\n", "Bipartite graphs are special graphs with edges between two sets of nodes. A bipartite graph can be represented by its biadjacency matrix $B$, a rectangular matrix of links between the two sets of nodes. We have $B_{ij}=1$ if and only if there is an edge between node $i$ of the first set (represented by the rows of $B$) and node $j$ of the second set (represented by the columns of $B$)." ] }, { "cell_type": "code", "execution_count": null, "id": "b6d80306", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "biadjacency = np.array([[1, 1, 0, 1, 0], [0, 0, 1, 0, 1], [1, 1, 1, 1, 0], [0, 0, 0, 1, 1]])" ] }, { "cell_type": "code", "execution_count": null, "id": "0c1b52ee", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "biadjacency" ] }, { "cell_type": "code", "execution_count": null, "id": "76a8fa0e", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "biadjacency = sparse.csr_matrix(biadjacency)" ] }, { "cell_type": "code", "execution_count": null, "id": "c936ca93", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "biadjacency" ] }, { "cell_type": "code", "execution_count": null, "id": "e6e1cecc", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from sknetwork.visualization import visualize_bigraph" ] }, { "cell_type": "code", "execution_count": null, "id": "e392bb01", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# name nodes by indices\n", "n_row, n_col = biadjacency.shape\n", "names_row = np.arange(n_row)\n", "names_col = np.arange(n_col)" ] }, { "cell_type": "code", "execution_count": null, "id": "0078a23e", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# visualization\n", "SVG(visualize_bigraph(biadjacency, names_row=names_row, names_col=names_col, reorder=False))" ] }, { "cell_type": "markdown", "id": "a62381a9", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Weighted graphs\n", "\n", "Weights on edges can be represented by an adjacency matrix with non-negative entries (or a biadjacency matrix for a bipartite graph)." ] }, { "cell_type": "code", "execution_count": null, "id": "e7802e11", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "adjacency = np.array([[0, 2, 1, 3, 0], [2, 0, 0, 0, 3], [1, 0, 0, 2, 0], [3, 0, 2, 0, 1], [0, 3, 0, 1, 0]])" ] }, { "cell_type": "code", "execution_count": null, "id": "71ca4813", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "adjacency" ] }, { "cell_type": "code", "execution_count": null, "id": "e7db1f45", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "adjacency = sparse.csr_matrix(adjacency)" ] }, { "cell_type": "code", "execution_count": null, "id": "7a6f6e83", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# visualization\n", "SVG(visualize_graph(adjacency, names=names))" ] }, { "cell_type": "markdown", "id": "b835a578", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Subgraphs\n", "\n", "Subgraphs can easily be obtained by slicing the adjacency matrix." ] }, { "cell_type": "code", "execution_count": null, "id": "6b8d1c04", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "index = [0, 2, 3, 4]\n", "sub_adjacency = adjacency[index][:, index]" ] }, { "cell_type": "code", "execution_count": null, "id": "d7e225cf", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# visualization\n", "SVG(visualize_graph(sub_adjacency, names=names[index]))" ] }, { "cell_type": "markdown", "id": "8e893281", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## 3. Algorithms" ] }, { "cell_type": "markdown", "id": "408cb8b3", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Basic tools\n", "\n", "Some basic tools for loading, processing and visualizing graphs are available as functions." ] }, { "cell_type": "markdown", "id": "141b3c9a", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "|Module|Description| Functions |\n", "|:---|:---|:-----------------------------------------------------------------|\n", "|Data|Loading and saving graphs| ``from_csv``, ``save``, ``load``, ``load_netset``, ... |\n", "|Topology|Connectivity and structure| ``get_connected_components``, ``is_acyclic``, ... |\n", "|Path|Shortest paths and distances| ``get_distances``, ``get_shortest_path``, ... |\n", "|Linear algebra|Matrix operations| ``normalize``, ``diagonal_pseudo_inverse``, ... |\n", "|Utils|Useful tools| ``directed2undirected``, ``get_degrees``, ``get_neighbors``, ... |\n", "|Visualization|Visualization tools| ``visualize_graph``, ``svg_bigraphs``, ... |" ] }, { "cell_type": "code", "execution_count": null, "id": "4d88d867", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from sknetwork.data import karate_club\n", "from sknetwork.utils import get_degrees" ] }, { "cell_type": "code", "execution_count": null, "id": "e702b8df", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "adjacency = karate_club()" ] }, { "cell_type": "code", "execution_count": null, "id": "8f0b269d", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "get_degrees(adjacency)" ] }, { "cell_type": "markdown", "id": "4fb62153", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Machine learning" ] }, { "cell_type": "markdown", "id": "2555b676", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Scikit-network is suitable for machine learning on graphs.\n", "\n", "Each algorithm is available as an object with some methods among the following:\n", "\n", "* ``fit``: fit the algorithm to the graph. This method is mandatory.\n", "* ``predict``: predict the output (e.g., labels of nodes).\n", "* ``predict_proba``: predict the output with probability (e.g., probability distribution over labels).\n", "* ``transform``: transform the graph.\n", "* ``fit_predict``: apply fit + predict.\n", "* ``fit_predict_proba``: apply fit + predict_proba.\n", "* ``fit_transform``: apply fit + transform.\n", "\n", "\n", "The output is an attribute of the object with an underscore (e.g., ``labels_``)." ] }, { "cell_type": "markdown", "id": "3b6bb1a8", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "|Module|Description|Algorithms|Output|\n", "|:---|:---|:---|:---|\n", "|Clustering|Form clusters of nodes|``Louvain``, ``PropagationClustering``|``labels_``|\n", "|Classification|Classify nodes|``DiffusionClassifier``, ``NNClassifier``, ...|``labels_``|\n", "|GNN|Graph neural networks|``GNNClassifier``|``labels_``|\n", "|Regression|Assign values to nodes|``Diffusion``, ``Dirichlet``|``values_``|\n", "|Hierarchy|Get a hierarchical representation of nodes|``Paris``, ``LouvainHierarchy``|``dendrogram_``|\n", "|Embedding|Get a vectorial representation of nodes|``Spectral``, ``SVD``, ``LouvainEmbedding``, ...|``embedding_``|\n", "|Ranking|Give importance scores to nodes|``PageRank``, ``Katz``, ``Betweenness``, ...|``scores_``|\n", "|Link prediction|Predict links between nodes|``NNLinker``|``links_``|" ] }, { "cell_type": "code", "execution_count": null, "id": "68d898ac", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from sknetwork.data import karate_club\n", "from sknetwork.classification import DiffusionClassifier" ] }, { "cell_type": "code", "execution_count": null, "id": "06ffd064", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "dataset = karate_club(metadata=True)\n", "adjacency = dataset.adjacency\n", "position = dataset.position\n", "labels = dataset.labels" ] }, { "cell_type": "code", "execution_count": null, "id": "e136ac14", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "classifier = DiffusionClassifier()\n", "classifier.fit(adjacency, {node: labels[node] for node in [0, 1, 33]})" ] }, { "cell_type": "code", "execution_count": null, "id": "9f6ea95b", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "labels_pred = classifier.predict()" ] }, { "cell_type": "code", "execution_count": null, "id": "83060d46", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "SVG(visualize_graph(adjacency, position, labels=labels_pred))" ] }, { "cell_type": "code", "execution_count": null, "id": "952120d1", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "probs = classifier.predict_proba()" ] }, { "cell_type": "code", "execution_count": null, "id": "d417da66", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "scores = probs[:, 1]" ] }, { "cell_type": "code", "execution_count": null, "id": "e106d597", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "SVG(visualize_graph(adjacency, position, scores=scores))" ] }, { "cell_type": "markdown", "id": "48aaa0c0", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## 4. Example\n", "\n", "We here show how to use scikit-network to cluster a graph, initially stored in a text file as a list of edges." ] }, { "cell_type": "markdown", "id": "8b06d92b", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Data" ] }, { "cell_type": "code", "execution_count": null, "id": "83fd1803", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# show first lines\n", "with open('miserables.tsv', 'r') as f:\n", " data = f.readlines()\n", " print(''.join(data[:5]))" ] }, { "cell_type": "markdown", "id": "140a74e9", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Loading" ] }, { "cell_type": "code", "execution_count": null, "id": "9f7dd9f3", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from sknetwork.data import from_csv" ] }, { "cell_type": "code", "execution_count": null, "id": "6b697fa3", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "graph = from_csv('miserables.tsv')" ] }, { "cell_type": "code", "execution_count": null, "id": "63b5b836", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "adjacency = graph.adjacency\n", "names = graph.names" ] }, { "cell_type": "code", "execution_count": null, "id": "df091e33", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "adjacency" ] }, { "cell_type": "markdown", "id": "82ce324e", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Embedding\n", "\n", "We here compute a 2D embedding for visualization." ] }, { "cell_type": "code", "execution_count": null, "id": "541d841a", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from sknetwork.embedding import Spring" ] }, { "cell_type": "code", "execution_count": null, "id": "f70423ee", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "algorithm = Spring()" ] }, { "cell_type": "code", "execution_count": null, "id": "88788cae", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "position = algorithm.fit_transform(adjacency)" ] }, { "cell_type": "code", "execution_count": null, "id": "54dd078a", "metadata": { "scrolled": true, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "SVG(visualize_graph(adjacency, position, names=names))" ] }, { "cell_type": "markdown", "id": "7319f9b9", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Clustering" ] }, { "cell_type": "markdown", "id": "34f715c3", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Finally, we cluster the graph by the Louvain algorithm." ] }, { "cell_type": "code", "execution_count": null, "id": "ad136b20", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from sknetwork.clustering import Louvain" ] }, { "cell_type": "code", "execution_count": null, "id": "5b4d1eca", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "algorithm = Louvain()\n", "labels = algorithm.fit_predict(adjacency)" ] }, { "cell_type": "code", "execution_count": null, "id": "4433bd05", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "SVG(visualize_graph(adjacency, position, names=names, labels=labels))" ] }, { "cell_type": "markdown", "id": "0958462b", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Bipartite graphs" ] }, { "cell_type": "markdown", "id": "0b647a00", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Observe that most algorithms apply to bipartite graphs, with the ``fit`` method applied to the biadjacency matrix. We show an example with a bipartite graph between movies and actors." ] }, { "cell_type": "code", "execution_count": null, "id": "325beff5", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# show first lines\n", "with open('movie_actor.tsv', 'r') as f:\n", " data = f.readlines()\n", " print(''.join(data[:5]))" ] }, { "cell_type": "code", "execution_count": null, "id": "8ad2f064", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "graph = from_csv('movie_actor.tsv', bipartite=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "8246d167", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "biadjacency = graph.biadjacency\n", "movies = graph.names_row\n", "actors = graph.names_col" ] }, { "cell_type": "code", "execution_count": null, "id": "bf23be8f", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "algorithm = Louvain()\n", "algorithm.fit(biadjacency)\n", "labels_row = algorithm.labels_row_\n", "labels_col = algorithm.labels_col_" ] }, { "cell_type": "code", "execution_count": null, "id": "c0c08326", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "SVG(visualize_bigraph(biadjacency, names_row=movies, names_col=actors, labels_row=labels_row, labels_col=labels_col))\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 5 }