{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "# Wikipedia" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "This notebook shows how to apply [scikit-network](https://scikit-network.readthedocs.io/) to analyse the network structure of Wikipedia, through its hyperlinks.\n", "\n", "We consider the [Wikivitals](https://netset.telecom-paris.fr/pages/wikivitals.html) dataset of the [netset](https://netset.telecom-paris.fr) collection. This dataset consists of the (approximately) [top 10,000 (vital) articles of Wikipedia](https://fr.wikipedia.org/wiki/Wikipédia:Articles_vitaux/Niveau_4)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from IPython.display import SVG" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import numpy as np\n", "from scipy.cluster.hierarchy import linkage" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from sknetwork.data import load_netset\n", "from sknetwork.ranking import PageRank, top_k\n", "from sknetwork.embedding import Spectral\n", "from sknetwork.clustering import Louvain\n", "from sknetwork.classification import DiffusionClassifier\n", "from sknetwork.utils import get_neighbors\n", "from sknetwork.visualization import visualize_dendrogram" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Data\n", "\n", "All datasets of the [netset](https://netset.telecom-paris.fr) collection can be easily imported with scikit-network." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parsing files...\n", "Done.\n" ] } ], "source": [ "wikivitals = load_netset('wikivitals')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# graph of links\n", "adjacency = wikivitals.adjacency\n", "names = wikivitals.names\n", "labels = wikivitals.labels\n", "names_labels = wikivitals.names_labels" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "<10011x10011 sparse matrix of type ''\n", "\twith 824999 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adjacency" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Arts' 'Biological and health sciences' 'Everyday life' 'Geography'\n", " 'History' 'Mathematics' 'People' 'Philosophy and religion'\n", " 'Physical sciences' 'Society and social sciences' 'Technology']\n" ] } ], "source": [ "# categories\n", "print(names_labels)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# get label\n", "label_id = {name: i for i, name in enumerate(names_labels)}" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Sample\n", "\n", "Let's have a look at an article." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Édouard Manet\n" ] } ], "source": [ "i = 10000\n", "print(names[i])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "People\n" ] } ], "source": [ "# label\n", "label = labels[i]\n", "print(names_labels[label])" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Adolphe Thiers' 'American Civil War' 'Bordeaux' 'Camille Pissarro'\n", " 'Carmen' 'Charles Baudelaire' 'Claude Monet' 'Diego Velázquez'\n", " 'Edgar Allan Poe' 'Edgar Degas']\n" ] } ], "source": [ "# some hyperlinks\n", "neighbors = get_neighbors(adjacency, i)\n", "print(names[neighbors[:10]])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "38" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(neighbors)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## PageRank\n", "\n", "We first use (personalized) [PageRank](https://en.wikipedia.org/wiki/PageRank) to select typical articles of each category." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "pagerank = PageRank()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# number of articles per category\n", "n_selection = 50" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# selection of articles\n", "selection = []\n", "for label in np.arange(len(names_labels)):\n", " ppr = pagerank.fit_predict(adjacency, weights=(labels==label))\n", " scores = ppr * (labels==label)\n", " selection.append(top_k(scores, n_selection))\n", "selection = np.array(selection)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "(11, 50)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "selection.shape" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "---\n", "0 Arts\n", "['Encyclopædia Britannica' 'Romanticism' 'Jazz' 'Modernism' 'Baroque']\n", "---\n", "1 Biological and health sciences\n", "['Taxonomy (biology)' 'Animal' 'Chordate' 'Plant' 'Species']\n", "---\n", "2 Everyday life\n", "['Olympic Games' 'Association football' 'Basketball' 'Baseball' 'Softball']\n", "---\n", "3 Geography\n", "['Geographic coordinate system' 'United States' 'China' 'France' 'India']\n", "---\n", "4 History\n", "['World War II' 'World War I' 'Roman Empire' 'Ottoman Empire'\n", " 'Middle Ages']\n", "---\n", "5 Mathematics\n", "['Real number' 'Function (mathematics)' 'Complex number'\n", " 'Set (mathematics)' 'Integer']\n", "---\n", "6 People\n", "['Aristotle' 'Plato' 'Augustine of Hippo' 'Winston Churchill'\n", " 'Thomas Aquinas']\n", "---\n", "7 Philosophy and religion\n", "['Christianity' 'Islam' 'Buddhism' 'Hinduism' 'Catholic Church']\n", "---\n", "8 Physical sciences\n", "['Oxygen' 'Hydrogen' 'Earth' 'Kelvin' 'Density']\n", "---\n", "9 Society and social sciences\n", "['The New York Times' 'Latin' 'English language' 'French language'\n", " 'United Nations']\n", "---\n", "10 Technology\n", "['NASA' 'Internet' 'Operating system' 'Computer network' 'Computer']\n" ] } ], "source": [ "# show selection\n", "for label, name_label in enumerate(names_labels):\n", " print('---')\n", " print(label, name_label)\n", " print(names[selection[label, :5]])" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Embedding\n", "\n", "We now represent each node of the graph by a vector in low dimension, and use hierarchical clustering to visualize the structure of this embedding." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# dimension of the embedding\n", "n_components = 20" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# embedding\n", "spectral = Spectral(n_components)\n", "embedding = spectral.fit_transform(adjacency)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "(10011, 20)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embedding.shape" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# hierarchy of articles\n", "label = label_id['Physical sciences']\n", "index = selection[label]\n", "dendrogram_articles = linkage(embedding[index], method='ward')" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "scrolled": true, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "image/svg+xml": [ "OxygenHydrogenEarthKelvinDensityElectronSunWaterNitrogenCarbonQuantum mechanicsPhysicsCarbon dioxideIronEnergyTemperatureChemical elementAtomInternational System of UnitsHeliumMassMoleculePressureSolar SystemIonJouleCopperMagnetic fieldAstronomyPascal (unit)SodiumMetreKilogramStarSolidGravityPhotonSecondMagnesiumElectronvoltProtonSulfurAstronomical unitNewton (unit)ChlorineHertzZincWattGoldMercury (element)" ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# visualization\n", "image = visualize_dendrogram(dendrogram_articles, names=names[index], rotate=True, width=200, scale=2, n_clusters=4)\n", "SVG(image)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Clustering" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "We now apply Louvain to get a clustering of the graph, independently of the known labels." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "algo = Louvain()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "labels_pred = algo.fit_predict(adjacency)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "(array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),\n", " array([1800, 1368, 1315, 1225, 1165, 1067, 804, 672, 468, 97, 30]))" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.unique(labels_pred, return_counts=True)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "We use again PageRank to get the top pages of each cluster." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "n_selection = 5" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "selection = []\n", "for label in np.arange(len(set(labels_pred))):\n", " ppr = pagerank.fit_predict(adjacency, weights=(labels_pred==label))\n", " scores = ppr * (labels_pred==label)\n", " selection.append(top_k(scores, n_selection))\n", "selection = np.array(selection)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "---\n", "0\n", "['Taxonomy (biology)' 'Animal' 'Plant' 'Protein' 'Species']\n", "---\n", "1\n", "['Hydrogen' 'Oxygen' 'Kelvin' 'Electron' 'Physics']\n", "---\n", "2\n", "['Latin' 'World War I' 'Roman Empire' 'Middle Ages' 'Greek language']\n", "---\n", "3\n", "['United States' 'World War II' 'Geographic coordinate system'\n", " 'United Kingdom' 'France']\n", "---\n", "4\n", "['Christianity' 'Aristotle' 'Plato' 'Catholic Church'\n", " 'Age of Enlightenment']\n", "---\n", "5\n", "['China' 'India' 'Buddhism' 'Islam' 'Chinese language']\n", "---\n", "6\n", "['The New York Times' 'New York City' 'Time (magazine)' 'BBC'\n", " 'The Washington Post']\n", "---\n", "7\n", "['Earth' 'Atlantic Ocean' 'Europe' 'Drainage basin' 'Pacific Ocean']\n", "---\n", "8\n", "['Real number' 'Function (mathematics)' 'Complex number'\n", " 'Set (mathematics)' 'Mathematical analysis']\n", "---\n", "9\n", "['Marriage' 'Incest' 'Adoption' 'Kinship' 'Human sexuality']\n", "---\n", "10\n", "['Handbag' 'Hat' 'Veil' 'Uniform' 'Clothing']\n" ] } ], "source": [ "# show selection\n", "for label in np.arange(len(set(labels_pred))):\n", " print('---')\n", " print(label)\n", " print(names[selection[label]])" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Classification\n", "\n", "Finally, we use heat diffusion to predict the closest category for each page in the People category." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "algo = DiffusionClassifier()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "people = label_id['People']" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "labels_people = algo.fit_predict(adjacency, labels = {i: label for i, label in enumerate(labels) if label != people})" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "n_selection = 5" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "selection = []\n", "for label in np.arange(len(names_labels)):\n", " if label != people:\n", " ppr = pagerank.fit_predict(adjacency, weights=(labels==people)*(labels_people==label))\n", " scores = ppr * (labels==people)*(labels_people==label)\n", " selection.append(top_k(scores, n_selection))\n", "selection = np.array(selection)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "---\n", "0 Arts\n", "['Richard Wagner' 'Igor Stravinsky' 'Bob Dylan' 'Fred Astaire'\n", " 'Ludwig van Beethoven']\n", "---\n", "1 Biological and health sciences\n", "['Charles Darwin' 'Francis Crick' 'Robert Koch' 'Alexander Fleming'\n", " 'Carl Linnaeus']\n", "---\n", "2 Everyday life\n", "['Wayne Gretzky' 'Jim Thorpe' 'Jackie Robinson' 'LeBron James'\n", " 'Willie Mays']\n", "---\n", "3 Geography\n", "['Elizabeth II' 'Carl Lewis' 'Dwight D. Eisenhower' 'Vladimir Putin'\n", " 'Muhammad Ali']\n", "---\n", "4 History\n", "['Alexander the Great' 'Napoleon' 'Charlemagne' 'Philip II of Spain'\n", " 'Charles V, Holy Roman Emperor']\n", "---\n", "5 Mathematics\n", "['Euclid' 'Augustin-Louis Cauchy' 'Archimedes' 'John von Neumann'\n", " 'Pierre de Fermat']\n", "---\n", "7 Philosophy and religion\n", "['Augustine of Hippo' 'Aristotle' 'Thomas Aquinas' 'Plato' 'Immanuel Kant']\n", "---\n", "8 Physical sciences\n", "['Albert Einstein' 'Isaac Newton' 'J. J. Thomson' 'Marie Curie'\n", " 'Niels Bohr']\n", "---\n", "9 Society and social sciences\n", "['Barack Obama' 'Noam Chomsky' 'Karl Marx' 'Ralph Waldo Emerson'\n", " 'Jean-Paul Sartre']\n", "---\n", "10 Technology\n", "['Tim Berners-Lee' 'Donald Knuth' 'Edsger W. Dijkstra' 'Douglas Engelbart'\n", " 'Dennis Ritchie']\n" ] } ], "source": [ "# show selection\n", "i = 0\n", "for label, name_label in enumerate(names_labels):\n", " if label != people:\n", " print('---')\n", " print(label, name_label)\n", " print(names[selection[i]])\n", " i += 1\n" ] } ], "metadata": { "anaconda-cloud": {}, "hide_input": false, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 1 }