{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# Wikipedia"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "This notebook shows how to apply [scikit-network](https://scikit-network.readthedocs.io/) to analyse the network structure of Wikipedia, through its hyperlinks.\n",
    "\n",
    "We consider the [Wikivitals](https://netset.telecom-paris.fr/pages/wikivitals.html) dataset of the [netset](https://netset.telecom-paris.fr) collection. This dataset consists of the (approximately) [top 10,000 (vital) articles of Wikipedia](https://fr.wikipedia.org/wiki/Wikipédia:Articles_vitaux/Niveau_4)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from IPython.display import SVG"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from scipy.cluster.hierarchy import linkage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from sknetwork.data import load_netset\n",
    "from sknetwork.ranking import PageRank, top_k\n",
    "from sknetwork.embedding import Spectral\n",
    "from sknetwork.clustering import Louvain\n",
    "from sknetwork.classification import DiffusionClassifier\n",
    "from sknetwork.utils import get_neighbors\n",
    "from sknetwork.visualization import visualize_dendrogram"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Data\n",
    "\n",
    "All datasets of the [netset](https://netset.telecom-paris.fr) collection can be easily imported with scikit-network."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Parsing files...\n",
      "Done.\n"
     ]
    }
   ],
   "source": [
    "wikivitals = load_netset('wikivitals')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# graph of links\n",
    "adjacency = wikivitals.adjacency\n",
    "names = wikivitals.names\n",
    "labels = wikivitals.labels\n",
    "names_labels = wikivitals.names_labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<10011x10011 sparse matrix of type '<class 'numpy.bool_'>'\n",
       "\twith 824999 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adjacency"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Arts' 'Biological and health sciences' 'Everyday life' 'Geography'\n",
      " 'History' 'Mathematics' 'People' 'Philosophy and religion'\n",
      " 'Physical sciences' 'Society and social sciences' 'Technology']\n"
     ]
    }
   ],
   "source": [
    "# categories\n",
    "print(names_labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# map each category name to its label index\n",
    "label_id = {name: i for i, name in enumerate(names_labels)}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Sample\n",
    "\n",
    "Let's have a look at an article."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Édouard Manet\n"
     ]
    }
   ],
   "source": [
    "i = 10000\n",
    "print(names[i])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "People\n"
     ]
    }
   ],
   "source": [
    "# label\n",
    "label = labels[i]\n",
    "print(names_labels[label])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Adolphe Thiers' 'American Civil War' 'Bordeaux' 'Camille Pissarro'\n",
      " 'Carmen' 'Charles Baudelaire' 'Claude Monet' 'Diego Velázquez'\n",
      " 'Edgar Allan Poe' 'Edgar Degas']\n"
     ]
    }
   ],
   "source": [
    "# some hyperlinks\n",
    "neighbors = get_neighbors(adjacency, i)\n",
    "print(names[neighbors[:10]])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "38"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(neighbors)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## PageRank\n",
    "\n",
    "We first use (personalized) [PageRank](https://en.wikipedia.org/wiki/PageRank) to select typical articles of each category."
   ]
  },
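  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "As a reminder, personalized PageRank can be sketched as the solution of a fixed-point equation. The notation here ($P$, $\\alpha$, $\\mu$) is introduced for illustration and is not part of the scikit-network API; we also assume that every article has at least one outgoing link, so that the random walk is well defined:\n",
    "\n",
    "$$\\pi = \\alpha P^\\top \\pi + (1 - \\alpha)\\,\\mu$$\n",
    "\n",
    "where $P = D^{-1}A$ is the transition matrix of the random walk on the graph ($D$ is the diagonal matrix of out-degrees), $\\alpha \\in (0, 1)$ is the damping factor and $\\mu$ is the restart distribution. In the code of this section, $\\mu$ is uniform over the articles of the selected category (the `weights` argument), and the resulting scores are then masked to rank only the articles of that category."
   ]
  },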
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "pagerank = PageRank()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# number of articles per category\n",
    "n_selection = 50"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# selection of articles\n",
    "selection = []\n",
    "for label in np.arange(len(names_labels)):\n",
    "    ppr = pagerank.fit_predict(adjacency, weights=(labels==label))\n",
    "    scores = ppr * (labels==label)\n",
    "    selection.append(top_k(scores, n_selection))\n",
    "selection = np.array(selection)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(11, 50)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "selection.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "---\n",
      "0 Arts\n",
      "['Encyclopædia Britannica' 'Romanticism' 'Jazz' 'Modernism' 'Baroque']\n",
      "---\n",
      "1 Biological and health sciences\n",
      "['Taxonomy (biology)' 'Animal' 'Chordate' 'Plant' 'Species']\n",
      "---\n",
      "2 Everyday life\n",
      "['Olympic Games' 'Association football' 'Basketball' 'Baseball' 'Softball']\n",
      "---\n",
      "3 Geography\n",
      "['Geographic coordinate system' 'United States' 'China' 'France' 'India']\n",
      "---\n",
      "4 History\n",
      "['World War II' 'World War I' 'Roman Empire' 'Ottoman Empire'\n",
      " 'Middle Ages']\n",
      "---\n",
      "5 Mathematics\n",
      "['Real number' 'Function (mathematics)' 'Complex number'\n",
      " 'Set (mathematics)' 'Integer']\n",
      "---\n",
      "6 People\n",
      "['Aristotle' 'Plato' 'Augustine of Hippo' 'Winston Churchill'\n",
      " 'Thomas Aquinas']\n",
      "---\n",
      "7 Philosophy and religion\n",
      "['Christianity' 'Islam' 'Buddhism' 'Hinduism' 'Catholic Church']\n",
      "---\n",
      "8 Physical sciences\n",
      "['Oxygen' 'Hydrogen' 'Earth' 'Kelvin' 'Density']\n",
      "---\n",
      "9 Society and social sciences\n",
      "['The New York Times' 'Latin' 'English language' 'French language'\n",
      " 'United Nations']\n",
      "---\n",
      "10 Technology\n",
      "['NASA' 'Internet' 'Operating system' 'Computer network' 'Computer']\n"
     ]
    }
   ],
   "source": [
    "# show selection\n",
    "for label, name_label in enumerate(names_labels):\n",
    "    print('---')\n",
    "    print(label, name_label)\n",
    "    print(names[selection[label, :5]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Embedding\n",
    "\n",
    "We now represent each node of the graph by a low-dimensional vector and use hierarchical clustering to visualize the structure of this embedding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# dimension of the embedding\n",
    "n_components = 20"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# embedding\n",
    "spectral = Spectral(n_components)\n",
    "embedding = spectral.fit_transform(adjacency)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(10011, 20)"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embedding.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# hierarchy of articles\n",
    "label = label_id['Physical sciences']\n",
    "index = selection[label]\n",
    "dendrogram_articles = linkage(embedding[index], method='ward')"
   ]
  },
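  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "The `method='ward'` option of SciPy's `linkage` used above merges, at each step, the two clusters whose union gives the smallest increase in within-cluster variance. As a sketch, with notation introduced here for illustration ($A$ and $B$ are two clusters of embedding vectors, $\\mu_A$ and $\\mu_B$ their centroids), the cost of a merge is:\n",
    "\n",
    "$$\\Delta(A, B) = \\frac{|A|\\,|B|}{|A| + |B|}\\,\\lVert \\mu_A - \\mu_B \\rVert^2$$\n",
    "\n",
    "The merge heights stored in the linkage matrix are, to our understanding, an increasing function of this cost, so the order of the merges is the same."
   ]
  },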
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "scrolled": true,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "image/svg+xml": [
       "<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"599.0\" height=\"620\"><text x=\"415\" y=\"26.0\" font-size=\"12\">Oxygen</text><text x=\"415\" y=\"122.0\" font-size=\"12\">Hydrogen</text><text x=\"415\" y=\"542.0\" font-size=\"12\">Earth</text><text x=\"415\" y=\"230.0\" font-size=\"12\">Kelvin</text><text x=\"415\" y=\"74.0\" font-size=\"12\">Density</text><text x=\"415\" y=\"446.0\" font-size=\"12\">Electron</text><text x=\"415\" y=\"374.0\" font-size=\"12\">Sun</text><text x=\"415\" y=\"554.0\" font-size=\"12\">Water</text><text x=\"415\" y=\"170.0\" font-size=\"12\">Nitrogen</text><text x=\"415\" y=\"50.0\" font-size=\"12\">Carbon</text><text x=\"415\" y=\"422.0\" font-size=\"12\">Quantum mechanics</text><text x=\"415\" y=\"470.0\" font-size=\"12\">Physics</text><text x=\"415\" y=\"38.0\" font-size=\"12\">Carbon dioxide</text><text x=\"415\" y=\"62.0\" font-size=\"12\">Iron</text><text x=\"415\" y=\"494.0\" font-size=\"12\">Energy</text><text x=\"415\" y=\"506.0\" font-size=\"12\">Temperature</text><text x=\"415\" y=\"86.0\" font-size=\"12\">Chemical element</text><text x=\"415\" y=\"566.0\" font-size=\"12\">Atom</text><text x=\"415\" y=\"350.0\" font-size=\"12\">International System of Units</text><text x=\"415\" y=\"134.0\" font-size=\"12\">Helium</text><text x=\"415\" y=\"518.0\" font-size=\"12\">Mass</text><text x=\"415\" y=\"578.0\" font-size=\"12\">Molecule</text><text x=\"415\" y=\"482.0\" font-size=\"12\">Pressure</text><text x=\"415\" y=\"398.0\" font-size=\"12\">Solar System</text><text x=\"415\" y=\"590.0\" font-size=\"12\">Ion</text><text x=\"415\" y=\"302.0\" font-size=\"12\">Joule</text><text x=\"415\" y=\"146.0\" font-size=\"12\">Copper</text><text x=\"415\" y=\"314.0\" font-size=\"12\">Magnetic field</text><text x=\"415\" y=\"386.0\" font-size=\"12\">Astronomy</text><text x=\"415\" y=\"266.0\" font-size=\"12\">Pascal (unit)</text><text x=\"415\" y=\"98.0\" font-size=\"12\">Sodium</text><text x=\"415\" y=\"338.0\" 
font-size=\"12\">Metre</text><text x=\"415\" y=\"278.0\" font-size=\"12\">Kilogram</text><text x=\"415\" y=\"410.0\" font-size=\"12\">Star</text><text x=\"415\" y=\"602.0\" font-size=\"12\">Solid</text><text x=\"415\" y=\"530.0\" font-size=\"12\">Gravity</text><text x=\"415\" y=\"434.0\" font-size=\"12\">Photon</text><text x=\"415\" y=\"362.0\" font-size=\"12\">Second</text><text x=\"415\" y=\"194.0\" font-size=\"12\">Magnesium</text><text x=\"415\" y=\"290.0\" font-size=\"12\">Electronvolt</text><text x=\"415\" y=\"458.0\" font-size=\"12\">Proton</text><text x=\"415\" y=\"206.0\" font-size=\"12\">Sulfur</text><text x=\"415\" y=\"326.0\" font-size=\"12\">Astronomical unit</text><text x=\"415\" y=\"218.0\" font-size=\"12\">Newton (unit)</text><text x=\"415\" y=\"110.0\" font-size=\"12\">Chlorine</text><text x=\"415\" y=\"242.0\" font-size=\"12\">Hertz</text><text x=\"415\" y=\"182.0\" font-size=\"12\">Zinc</text><text x=\"415\" y=\"254.0\" font-size=\"12\">Watt</text><text x=\"415\" y=\"14.0\" font-size=\"12\">Gold</text><text x=\"415\" y=\"158.0\" font-size=\"12\">Mercury (element)</text><path stroke-width=\"2\" stroke=\"green\" d=\"M 410 262.0 405.1638020627039 262.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 410 274.0 405.1638020627039 274.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 405.1638020627039 262.0 405.1638020627039 274.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 410 190.0 402.9801147874148 190.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 410 202.0 402.9801147874148 202.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 402.9801147874148 190.0 402.9801147874148 202.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 410 22.0 402.67424190700905 22.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 410 34.0 402.67424190700905 34.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 402.67424190700905 22.0 402.67424190700905 34.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 410 46.0 402.4993414489436 46.0\"/><path 
stroke-width=\"2\" stroke=\"blue\" d=\"M 410 58.0 402.4993414489436 58.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 402.4993414489436 46.0 402.4993414489436 58.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 410 94.0 402.37188416776524 94.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 410 106.0 402.37188416776524 106.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 402.37188416776524 94.0 402.37188416776524 106.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 410 178.0 401.60948082688725 178.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 402.9801147874148 196.0 401.60948082688725 196.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 401.60948082688725 178.0 401.60948082688725 196.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 410 142.0 400.5782241864501 142.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 410 154.0 400.5782241864501 154.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 400.5782241864501 142.0 400.5782241864501 154.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 410 250.0 400.42638436650094 250.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 405.1638020627039 268.0 400.42638436650094 268.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 400.42638436650094 250.0 400.42638436650094 268.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 410 166.0 399.95775664086295 166.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 401.60948082688725 187.0 399.95775664086295 187.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 399.95775664086295 166.0 399.95775664086295 187.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 410 118.0 399.90831979378265 118.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 410 130.0 399.90831979378265 130.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 399.90831979378265 118.0 399.90831979378265 130.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 410 346.0 398.8739768039438 346.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 410 358.0 398.8739768039438 
358.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 398.8739768039438 346.0 398.8739768039438 358.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 410 562.0 398.8193654936308 562.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 410 574.0 398.8193654936308 574.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 398.8193654936308 562.0 398.8193654936308 574.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 410 418.0 397.9511835386945 418.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 410 430.0 397.9511835386945 430.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 397.9511835386945 418.0 397.9511835386945 430.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 410 238.0 397.93491795398387 238.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 400.42638436650094 259.0 397.93491795398387 259.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 397.93491795398387 238.0 397.93491795398387 259.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 400.5782241864501 148.0 396.8053466060358 148.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 399.95775664086295 176.5 396.8053466060358 176.5\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 396.8053466060358 148.0 396.8053466060358 176.5\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 402.67424190700905 28.0 396.26066093038366 28.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 402.4993414489436 52.0 396.26066093038366 52.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 396.26066093038366 28.0 396.26066093038366 52.0\"/><path stroke-width=\"2\" stroke=\"orange\" d=\"M 410 370.0 395.85247689893464 370.0\"/><path stroke-width=\"2\" stroke=\"orange\" d=\"M 410 382.0 395.85247689893464 382.0\"/><path stroke-width=\"2\" stroke=\"orange\" d=\"M 395.85247689893464 370.0 395.85247689893464 382.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 410 490.0 395.5558997157416 490.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 410 502.0 395.5558997157416 502.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 
395.5558997157416 490.0 395.5558997157416 502.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 410 298.0 395.0425960768839 298.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 410 310.0 395.0425960768839 310.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 395.0425960768839 298.0 395.0425960768839 310.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 410 226.0 393.99020995085476 226.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 397.93491795398387 248.5 393.99020995085476 248.5\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 393.99020995085476 226.0 393.99020995085476 248.5\"/><path stroke-width=\"2\" stroke=\"orange\" d=\"M 410 394.0 392.4040233412891 394.0\"/><path stroke-width=\"2\" stroke=\"orange\" d=\"M 410 406.0 392.4040233412891 406.0\"/><path stroke-width=\"2\" stroke=\"orange\" d=\"M 392.4040233412891 394.0 392.4040233412891 406.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 410 10.0 392.1049682347298 10.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 396.26066093038366 40.0 392.1049682347298 40.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 392.1049682347298 10.0 392.1049682347298 40.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 410 478.0 391.9885794916357 478.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 395.5558997157416 496.0 391.9885794916357 496.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 391.9885794916357 478.0 391.9885794916357 496.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 410 70.0 391.7696358993685 70.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 410 82.0 391.7696358993685 82.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 391.7696358993685 70.0 391.7696358993685 82.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 410 538.0 391.65197775648807 538.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 410 550.0 391.65197775648807 550.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 391.65197775648807 538.0 391.65197775648807 550.0\"/><path stroke-width=\"2\" 
stroke=\"green\" d=\"M 410 214.0 391.49468704741685 214.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 393.99020995085476 237.25 391.49468704741685 237.25\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 391.49468704741685 214.0 391.49468704741685 237.25\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 410 442.0 391.41047457032545 442.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 410 454.0 391.41047457032545 454.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 391.41047457032545 442.0 391.41047457032545 454.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 410 334.0 390.58094281714585 334.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 398.8739768039438 352.0 390.58094281714585 352.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 390.58094281714585 334.0 390.58094281714585 352.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 410 466.0 386.95026591773694 466.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 391.9885794916357 487.0 386.95026591773694 487.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 386.95026591773694 466.0 386.95026591773694 487.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 392.1049682347298 25.0 386.49399574111067 25.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 391.7696358993685 76.0 386.49399574111067 76.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 386.49399574111067 25.0 386.49399574111067 76.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 410 586.0 386.1462583670227 586.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 410 598.0 386.1462583670227 598.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 386.1462583670227 586.0 386.1462583670227 598.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 399.90831979378265 124.0 385.7180399479052 124.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 396.8053466060358 162.25 385.7180399479052 162.25\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 385.7180399479052 124.0 385.7180399479052 162.25\"/><path stroke-width=\"2\" stroke=\"green\" 
d=\"M 410 286.0 385.58352041105104 286.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 395.0425960768839 304.0 385.58352041105104 304.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 385.58352041105104 286.0 385.58352041105104 304.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 397.9511835386945 424.0 384.4602674490779 424.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 391.41047457032545 448.0 384.4602674490779 448.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 384.4602674490779 424.0 384.4602674490779 448.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 402.37188416776524 100.0 383.1507847650436 100.0\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 385.7180399479052 143.125 383.1507847650436 143.125\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 383.1507847650436 100.0 383.1507847650436 143.125\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 410 514.0 380.6310864227479 514.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 410 526.0 380.6310864227479 526.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 380.6310864227479 514.0 380.6310864227479 526.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 410 322.0 378.4538425573628 322.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 390.58094281714585 343.0 378.4538425573628 343.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 378.4538425573628 322.0 378.4538425573628 343.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 398.8193654936308 568.0 377.36206171486157 568.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 386.1462583670227 592.0 377.36206171486157 592.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 377.36206171486157 568.0 377.36206171486157 592.0\"/><path stroke-width=\"2\" stroke=\"orange\" d=\"M 395.85247689893464 376.0 374.58173576429215 376.0\"/><path stroke-width=\"2\" stroke=\"orange\" d=\"M 392.4040233412891 400.0 374.58173576429215 400.0\"/><path stroke-width=\"2\" stroke=\"orange\" d=\"M 374.58173576429215 376.0 374.58173576429215 400.0\"/><path 
stroke-width=\"2\" stroke=\"green\" d=\"M 385.58352041105104 295.0 364.39207236731664 295.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 378.4538425573628 332.5 364.39207236731664 332.5\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 364.39207236731664 295.0 364.39207236731664 332.5\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 386.95026591773694 476.5 357.0523456061791 476.5\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 380.6310864227479 520.0 357.0523456061791 520.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 357.0523456061791 476.5 357.0523456061791 520.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 384.4602674490779 436.0 345.5679048777703 436.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 357.0523456061791 498.25 345.5679048777703 498.25\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 345.5679048777703 436.0 345.5679048777703 498.25\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 386.49399574111067 50.5 344.33642161989377 50.5\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 383.1507847650436 121.5625 344.33642161989377 121.5625\"/><path stroke-width=\"2\" stroke=\"blue\" d=\"M 344.33642161989377 50.5 344.33642161989377 121.5625\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 391.65197775648807 544.0 340.33473294919645 544.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 377.36206171486157 580.0 340.33473294919645 580.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 340.33473294919645 544.0 340.33473294919645 580.0\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 391.49468704741685 225.625 332.24746972297424 225.625\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 364.39207236731664 313.75 332.24746972297424 313.75\"/><path stroke-width=\"2\" stroke=\"green\" d=\"M 332.24746972297424 225.625 332.24746972297424 313.75\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 345.5679048777703 467.125 289.1130637520882 467.125\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 340.33473294919645 562.0 289.1130637520882 
562.0\"/><path stroke-width=\"2\" stroke=\"red\" d=\"M 289.1130637520882 467.125 289.1130637520882 562.0\"/><path stroke-width=\"2\" stroke=\"black\" d=\"M 374.58173576429215 388.0 192.30107158388304 388.0\"/><path stroke-width=\"2\" stroke=\"black\" d=\"M 289.1130637520882 514.5625 192.30107158388304 514.5625\"/><path stroke-width=\"2\" stroke=\"black\" d=\"M 192.30107158388304 388.0 192.30107158388304 514.5625\"/><path stroke-width=\"2\" stroke=\"black\" d=\"M 332.24746972297424 269.6875 143.2968527294576 269.6875\"/><path stroke-width=\"2\" stroke=\"black\" d=\"M 192.30107158388304 451.28125 143.2968527294576 451.28125\"/><path stroke-width=\"2\" stroke=\"black\" d=\"M 143.2968527294576 269.6875 143.2968527294576 451.28125\"/><path stroke-width=\"2\" stroke=\"black\" d=\"M 344.33642161989377 86.03125 10.0 86.03125\"/><path stroke-width=\"2\" stroke=\"black\" d=\"M 143.2968527294576 360.484375 10.0 360.484375\"/><path stroke-width=\"2\" stroke=\"black\" d=\"M 10.0 86.03125 10.0 360.484375\"/></svg>"
      ],
      "text/plain": [
       "<IPython.core.display.SVG object>"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# visualization\n",
    "image = visualize_dendrogram(dendrogram_articles, names=names[index], rotate=True, width=200, scale=2, n_clusters=4)\n",
    "SVG(image)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Clustering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "We now apply Louvain to get a clustering of the graph, independently of the known labels."
   ]
  },
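  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Louvain is a greedy heuristic for maximizing modularity. As a sketch for a directed graph such as this one, with notation introduced here for illustration and a resolution of 1 (which we assume matches the library default), the modularity of a clustering $c$ can be written as:\n",
    "\n",
    "$$Q = \\frac{1}{w} \\sum_{i,j} \\left(A_{ij} - \\frac{d_i^+ d_j^-}{w}\\right) \\delta_{c_i, c_j}$$\n",
    "\n",
    "where $A$ is the adjacency matrix, $w$ is the total number of links, $d_i^+$ is the out-degree of node $i$, $d_j^-$ is the in-degree of node $j$, and $\\delta_{c_i, c_j} = 1$ if nodes $i$ and $j$ belong to the same cluster and $0$ otherwise."
   ]
  },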
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "algo = Louvain()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "labels_pred = algo.fit_predict(adjacency)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),\n",
       " array([1800, 1368, 1315, 1225, 1165, 1067,  804,  672,  468,   97,   30]))"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.unique(labels_pred, return_counts=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "We again use PageRank to get the top pages of each cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# number of articles per cluster\n",
    "n_selection = 5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# selection of articles\n",
    "selection = []\n",
    "for label in np.arange(len(set(labels_pred))):\n",
    "    ppr = pagerank.fit_predict(adjacency, weights=(labels_pred==label))\n",
    "    scores = ppr * (labels_pred==label)\n",
    "    selection.append(top_k(scores, n_selection))\n",
    "selection = np.array(selection)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "---\n",
      "0\n",
      "['Taxonomy (biology)' 'Animal' 'Plant' 'Protein' 'Species']\n",
      "---\n",
      "1\n",
      "['Hydrogen' 'Oxygen' 'Kelvin' 'Electron' 'Physics']\n",
      "---\n",
      "2\n",
      "['Latin' 'World War I' 'Roman Empire' 'Middle Ages' 'Greek language']\n",
      "---\n",
      "3\n",
      "['United States' 'World War II' 'Geographic coordinate system'\n",
      " 'United Kingdom' 'France']\n",
      "---\n",
      "4\n",
      "['Christianity' 'Aristotle' 'Plato' 'Catholic Church'\n",
      " 'Age of Enlightenment']\n",
      "---\n",
      "5\n",
      "['China' 'India' 'Buddhism' 'Islam' 'Chinese language']\n",
      "---\n",
      "6\n",
      "['The New York Times' 'New York City' 'Time (magazine)' 'BBC'\n",
      " 'The Washington Post']\n",
      "---\n",
      "7\n",
      "['Earth' 'Atlantic Ocean' 'Europe' 'Drainage basin' 'Pacific Ocean']\n",
      "---\n",
      "8\n",
      "['Real number' 'Function (mathematics)' 'Complex number'\n",
      " 'Set (mathematics)' 'Mathematical analysis']\n",
      "---\n",
      "9\n",
      "['Marriage' 'Incest' 'Adoption' 'Kinship' 'Human sexuality']\n",
      "---\n",
      "10\n",
      "['Handbag' 'Hat' 'Veil' 'Uniform' 'Clothing']\n"
     ]
    }
   ],
   "source": [
    "# show selection\n",
    "for label in np.arange(len(set(labels_pred))):\n",
    "    print('---')\n",
    "    print(label)\n",
    "    print(names[selection[label]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Classification\n",
    "\n",
    "Finally, we use heat diffusion to predict the closest category, other than People itself, for each page of the People category."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "algo = DiffusionClassifier()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "people = label_id['People']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
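    "# seed the diffusion with the labels of all pages outside the People category\n",
    "labels_seed = {i: label for i, label in enumerate(labels) if label != people}\n",
    "labels_people = algo.fit_predict(adjacency, labels=labels_seed)"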
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "n_selection = 5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "selection = []\n",
    "for label in np.arange(len(names_labels)):\n",
    "    if label != people:\n",
    "        # pages of the People category predicted as closest to this label\n",
    "        mask = (labels == people) & (labels_people == label)\n",
    "        ppr = pagerank.fit_predict(adjacency, weights=mask)\n",
    "        scores = ppr * mask\n",
    "        selection.append(top_k(scores, n_selection))\n",
    "selection = np.array(selection)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "---\n",
      "0 Arts\n",
      "['Richard Wagner' 'Igor Stravinsky' 'Bob Dylan' 'Fred Astaire'\n",
      " 'Ludwig van Beethoven']\n",
      "---\n",
      "1 Biological and health sciences\n",
      "['Charles Darwin' 'Francis Crick' 'Robert Koch' 'Alexander Fleming'\n",
      " 'Carl Linnaeus']\n",
      "---\n",
      "2 Everyday life\n",
      "['Wayne Gretzky' 'Jim Thorpe' 'Jackie Robinson' 'LeBron James'\n",
      " 'Willie Mays']\n",
      "---\n",
      "3 Geography\n",
      "['Elizabeth II' 'Carl Lewis' 'Dwight D. Eisenhower' 'Vladimir Putin'\n",
      " 'Muhammad Ali']\n",
      "---\n",
      "4 History\n",
      "['Alexander the Great' 'Napoleon' 'Charlemagne' 'Philip II of Spain'\n",
      " 'Charles V, Holy Roman Emperor']\n",
      "---\n",
      "5 Mathematics\n",
      "['Euclid' 'Augustin-Louis Cauchy' 'Archimedes' 'John von Neumann'\n",
      " 'Pierre de Fermat']\n",
      "---\n",
      "7 Philosophy and religion\n",
      "['Augustine of Hippo' 'Aristotle' 'Thomas Aquinas' 'Plato' 'Immanuel Kant']\n",
      "---\n",
      "8 Physical sciences\n",
      "['Albert Einstein' 'Isaac Newton' 'J. J. Thomson' 'Marie Curie'\n",
      " 'Niels Bohr']\n",
      "---\n",
      "9 Society and social sciences\n",
      "['Barack Obama' 'Noam Chomsky' 'Karl Marx' 'Ralph Waldo Emerson'\n",
      " 'Jean-Paul Sartre']\n",
      "---\n",
      "10 Technology\n",
      "['Tim Berners-Lee' 'Donald Knuth' 'Edsger W. Dijkstra' 'Douglas Engelbart'\n",
      " 'Dennis Ritchie']\n"
     ]
    }
   ],
   "source": [
    "# show selection (the People label is skipped, so selection has its own index i)\n",
    "i = 0\n",
    "for label, name_label in enumerate(names_labels):\n",
    "    if label != people:\n",
    "        print('---')\n",
    "        print(label, name_label)\n",
    "        print(names[selection[i]])\n",
    "        i += 1"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}