Wikipedia

This notebook shows how to apply scikit-network to analyse the network structure of Wikipedia through its hyperlinks.

We consider the Wikivitals dataset of the netset collection. This dataset consists of approximately 10,000 articles of Wikipedia, the so-called vital articles.

[1]:
from IPython.display import SVG
[2]:
import numpy as np
[3]:
from sknetwork.data import load_netset
from sknetwork.ranking import PageRank, top_k
from sknetwork.embedding import Spectral
from sknetwork.clustering import Louvain
from sknetwork.classification import DiffusionClassifier
from sknetwork.utils import WardDense, get_neighbors
from sknetwork.visualization import svg_dendrogram

Data

All datasets of the netset collection can be easily imported with scikit-network.

[4]:
wikivitals = load_netset('wikivitals')
Parsing files...
Done.
[5]:
# graph of links
adjacency = wikivitals.adjacency
names = wikivitals.names
labels = wikivitals.labels
names_labels = wikivitals.names_labels
[6]:
adjacency
[6]:
<10011x10011 sparse matrix of type '<class 'numpy.bool_'>'
        with 824999 stored elements in Compressed Sparse Row format>
[7]:
# categories
print(names_labels)
['Arts' 'Biological and health sciences' 'Everyday life' 'Geography'
 'History' 'Mathematics' 'People' 'Philosophy and religion'
 'Physical sciences' 'Society and social sciences' 'Technology']
[8]:
# map category name to label index
label_id = {name: i for i, name in enumerate(names_labels)}

Sample

Let’s have a look at an article.

[9]:
i = 10000
print(names[i])
Édouard Manet
[10]:
# label
label = labels[i]
print(names_labels[label])
People
[11]:
# some hyperlinks
neighbors = get_neighbors(adjacency, i)
print(names[neighbors[:10]])
['Adolphe Thiers' 'American Civil War' 'Bordeaux' 'Camille Pissarro'
 'Carmen' 'Charles Baudelaire' 'Claude Monet' 'Diego Velázquez'
 'Edgar Allan Poe' 'Edgar Degas']
[12]:
len(neighbors)
[12]:
38
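Under the hood, looking up the neighbors of a node amounts to reading one row of the CSR adjacency matrix. A minimal sketch with SciPy on a toy graph (not the Wikivitals data, and not the actual `get_neighbors` implementation):

```python
import numpy as np
from scipy import sparse

# toy directed graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
row = np.array([0, 0, 1, 2])
col = np.array([1, 2, 2, 0])
adjacency = sparse.csr_matrix((np.ones(4, dtype=bool), (row, col)), shape=(3, 3))

def neighbors(adjacency, i):
    """Column indices of row i of a CSR matrix, i.e. the out-neighbors of node i."""
    return adjacency.indices[adjacency.indptr[i]:adjacency.indptr[i + 1]]

print(neighbors(adjacency, 0))  # [1 2]
```

The CSR format stores all column indices in a single array, with `indptr` marking where each row starts, so a neighbor lookup is a constant-time slice.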

PageRank

We first use (personalized) PageRank to select typical articles of each category.

[13]:
pagerank = PageRank()
[14]:
# number of articles per category
n_selection = 50
[15]:
# selection of articles
selection = []
for label in np.arange(len(names_labels)):
    ppr = pagerank.fit_predict(adjacency, seeds=(labels==label))
    scores = ppr * (labels==label)
    selection.append(top_k(scores, n_selection))
selection = np.array(selection)
[16]:
selection.shape
[16]:
(11, 50)
[17]:
# show selection
for label, name_label in enumerate(names_labels):
    print('---')
    print(label, name_label)
    print(names[selection[label, :5]])
---
0 Arts
['Opera' 'Symbolism (arts)' 'Encyclopædia Britannica'
 'Oxford English Dictionary' 'Metropolitan Museum of Art']
---
1 Biological and health sciences
['Biodiversity' 'Taxonomy (biology)' 'Binomial nomenclature' 'Fossil'
 'Metabolism']
---
2 Everyday life
['Rugby sevens' 'Handball' 'Tennis' 'Lacrosse' 'Cycle sport']
---
3 Geography
['Canada' 'Japan' 'Germany' 'United States' 'Italy']
---
4 History
['Han dynasty' 'Soviet Union' 'Ancient Rome' 'Nazi Germany'
 'British Empire']
---
5 Mathematics
['Function (mathematics)' 'Real number' 'Rational number'
 'Algebraic geometry' 'Vector space']
---
6 People
['Immanuel Kant' 'Barack Obama' 'Napoleon' 'Ralph Waldo Emerson'
 'Thomas Aquinas']
---
7 Philosophy and religion
['Buddhism' 'Catholic Church' 'Religion' 'Bible' 'Hinduism']
---
8 Physical sciences
['Iron' 'Temperature' 'Astronomy' 'Astronomical unit' 'Gravity']
---
9 Society and social sciences
['The New York Times' 'Chinese language' 'Spanish language'
 'Indo-European languages' 'Writing system']
---
10 Technology
['Wood' 'Operating system' 'Integrated circuit' 'Engineering'
 'Internal combustion engine']
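The selection above relies on personalized PageRank: the random walk restarts from the seed nodes instead of a uniform distribution, so nodes close to the seeds score higher. A minimal power-iteration sketch on a toy graph (NumPy only; illustrative, not the sknetwork implementation):

```python
import numpy as np

# toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge 2-3
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
n = 6
adjacency = np.zeros((n, n))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1

def personalized_pagerank(adjacency, seeds, damping=0.85, n_iter=100):
    """Power iteration with restart on the seed distribution."""
    out_degrees = adjacency.sum(axis=1)
    transition = adjacency / out_degrees[:, None]
    restart = seeds / seeds.sum()
    scores = restart.copy()
    for _ in range(n_iter):
        scores = damping * transition.T @ scores + (1 - damping) * restart
    return scores

seeds = np.array([1., 0, 0, 0, 0, 0])  # restart on node 0 only
scores = personalized_pagerank(adjacency, seeds)
# nodes of the seed's triangle score higher than the other triangle
print(np.round(scores, 3))
```

Masking the scores with the category indicator, as done above with `ppr * (labels==label)`, then keeps only the articles that both belong to the category and are central within it.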

Embedding

We now represent each node of the graph by a low-dimensional vector, and use hierarchical clustering to visualize the structure of this embedding.

[18]:
# dimension of the embedding
n_components = 20
[19]:
# embedding
spectral = Spectral(n_components)
embedding = spectral.fit_transform(adjacency)
[20]:
embedding.shape
[20]:
(10011, 20)
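The general idea behind a spectral embedding is to use the leading eigenvectors of a normalized adjacency matrix as coordinates; nodes in the same community get close coordinates. A minimal sketch with NumPy on a toy graph (the details differ from the `Spectral` implementation):

```python
import numpy as np

# toy graph: two triangles joined by a single edge
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
n = 6
adjacency = np.zeros((n, n))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1

# normalized adjacency D^{-1/2} A D^{-1/2}
degrees = adjacency.sum(axis=1)
d_inv_sqrt = np.diag(1 / np.sqrt(degrees))
normalized = d_inv_sqrt @ adjacency @ d_inv_sqrt

# eigh returns eigenvalues in ascending order;
# the top eigenvector (index -1) is trivial, so take the next one
eigenvalues, eigenvectors = np.linalg.eigh(normalized)
embedding = eigenvectors[:, -2]

# the sign of this coordinate separates the two triangles
print(np.sign(embedding))
```

With `n_components = 20`, each article is described by 20 such coordinates instead of one.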
[21]:
ward = WardDense()
[22]:
# hierarchy of articles
label = label_id['Physical sciences']
index = selection[label]
dendrogram_articles = ward.fit_transform(embedding[index])
[23]:
# visualization
image = svg_dendrogram(dendrogram_articles, names=names[index], rotate=True, width=200, scale=2, n_clusters=4)
SVG(image)
[23]:
../_images/use_cases_wikipedia_28_0.svg
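`WardDense` applies the Ward criterion to the embedding vectors and returns a dendrogram in the standard SciPy linkage format. The same hierarchy can be sketched directly with SciPy on a toy embedding (illustrative points, not the Wikivitals data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy embedding: two well-separated groups of points in 2D
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.1, (5, 2)),
                    rng.normal(5, 0.1, (5, 2))])

# Ward linkage: repeatedly merge the pair of clusters
# that minimizes the increase in within-cluster variance
dendrogram = linkage(points, method='ward')

# cut the dendrogram into 2 clusters
clusters = fcluster(dendrogram, t=2, criterion='maxclust')
print(clusters)
```

Cutting the dendrogram at different heights gives nested clusterings, which is what the `n_clusters` parameter of `svg_dendrogram` highlights with colors.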

Clustering

We now apply Louvain to get a clustering of the graph, independently of the known labels.

[24]:
algo = Louvain()
[25]:
labels_pred = algo.fit_transform(adjacency)
[26]:
np.unique(labels_pred, return_counts=True)
[26]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
 array([1800, 1368, 1315, 1225, 1165, 1067,  804,  672,  468,   97,   30]))
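Louvain greedily maximizes the modularity of the partition, which compares the fraction of intra-cluster edges to its expectation under a degree-preserving random rewiring. A minimal computation on a toy graph (NumPy only; `modularity` is also available in `sknetwork.clustering`):

```python
import numpy as np

# toy graph: two triangles joined by a single edge
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
n = 6
adjacency = np.zeros((n, n))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1

def modularity(adjacency, labels):
    """Q = (1/2m) * sum_ij (A_ij - d_i d_j / 2m) * 1{labels_i == labels_j}."""
    degrees = adjacency.sum(axis=1)
    two_m = degrees.sum()
    same = labels[:, None] == labels[None, :]
    return ((adjacency - np.outer(degrees, degrees) / two_m) * same).sum() / two_m

good = np.array([0, 0, 0, 1, 1, 1])  # one cluster per triangle
bad = np.array([0, 1, 0, 1, 0, 1])   # clusters mixing the triangles
print(modularity(adjacency, good), modularity(adjacency, bad))
```

The natural partition into the two triangles scores 5/14 ≈ 0.36, while the mixed partition scores negative, which is why Louvain recovers community-like clusters.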

We again use personalized PageRank to get the top articles of each cluster.

[27]:
n_selection = 5
[28]:
selection = []
for label in np.arange(len(set(labels_pred))):
    ppr = pagerank.fit_predict(adjacency, seeds=(labels_pred==label))
    scores = ppr * (labels_pred==label)
    selection.append(top_k(scores, n_selection))
selection = np.array(selection)
[29]:
# show selection
for label in np.arange(len(set(labels_pred))):
    print('---')
    print(label)
    print(names[selection[label]])
---
0
['Taxonomy (biology)' 'Animal' 'Plant' 'Protein' 'Species']
---
1
['Hydrogen' 'Oxygen' 'Kelvin' 'Electron' 'Physics']
---
2
['Latin' 'Roman Empire' 'World War I' 'Middle Ages' 'Greek language']
---
3
['United States' 'World War II' 'Geographic coordinate system'
 'United Kingdom' 'France']
---
4
['Aristotle' 'Christianity' 'Plato' 'Catholic Church'
 'Age of Enlightenment']
---
5
['Islam' 'India' 'China' 'Buddhism' 'Chinese language']
---
6
['The New York Times' 'BBC' 'New York City' 'Time (magazine)'
 'The Washington Post']
---
7
['Europe' 'Drainage basin' 'Atlantic Ocean' 'Earth' 'Pacific Ocean']
---
8
['Function (mathematics)' 'Real number' 'Complex number'
 'Set (mathematics)' 'Mathematical analysis']
---
9
['Incest' 'Adoption' 'Kinship' 'Marriage' 'Human sexuality']
---
10
['Handbag' 'Hat' 'Veil' 'Uniform' 'Clothing']

Classification

Finally, we use heat diffusion to predict the closest category for each article of the People category.
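Heat diffusion treats each known label as a heat source: one temperature vector per label is diffused over the graph, each unlabeled node repeatedly averaging the temperatures of its neighbors, and the hottest label wins. A minimal sketch on a toy graph (NumPy only; illustrative, not the `DiffusionClassifier` internals):

```python
import numpy as np

# toy path graph 0 - 1 - 2 - 3 - 4, with nodes 0 and 4 labeled
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
n = 5
adjacency = np.zeros((n, n))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1

seeds = {0: 0, 4: 1}  # node -> known label
n_labels = 2

# one temperature column per label, labeled nodes clamped to one-hot values
temperatures = np.full((n, n_labels), 0.5)
for node, label in seeds.items():
    temperatures[node] = np.eye(n_labels)[label]

transition = adjacency / adjacency.sum(axis=1)[:, None]
for _ in range(100):
    temperatures = transition @ temperatures
    for node, label in seeds.items():  # re-clamp the labeled nodes
        temperatures[node] = np.eye(n_labels)[label]

labels_pred = temperatures.argmax(axis=1)
print(labels_pred)  # nodes near node 0 get label 0, nodes near node 4 get label 1
```

In the cell below, the labeled nodes are all articles outside the People category, so each person is assigned the non-People category it is most connected to.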

[30]:
algo = DiffusionClassifier()
[31]:
people = label_id['People']
[32]:
labels_people = algo.fit_predict(adjacency, seeds={i: label for i, label in enumerate(labels) if label != people})
[33]:
n_selection = 5
[34]:
selection = []
for label in np.arange(len(names_labels)):
    if label != people:
        ppr = pagerank.fit_predict(adjacency, seeds=(labels==people)*(labels_people==label))
        scores = ppr * (labels==people)*(labels_people==label)
        selection.append(top_k(scores, n_selection))
selection = np.array(selection)
[35]:
# show selection
i = 0
for label, name_label in enumerate(names_labels):
    if label != people:
        print('---')
        print(label, name_label)
        print(names[selection[i]])
        i += 1
---
0 Arts
['Bob Dylan' 'Igor Stravinsky' 'Richard Wagner' 'Fred Astaire'
 'Ludwig van Beethoven']
---
1 Biological and health sciences
['Charles Darwin' 'Francis Crick' 'Robert Koch' 'Alexander Fleming'
 'Carl Linnaeus']
---
2 Everyday life
['Wayne Gretzky' 'LeBron James' 'Jackie Robinson' 'Jim Thorpe'
 'Willie Mays']
---
3 Geography
['Elizabeth II' 'Carl Lewis' 'Dwight D. Eisenhower' 'Muhammad Ali'
 'Vladimir Putin']
---
4 History
['Alexander the Great' 'Napoleon' 'Charlemagne' 'Philip II of Spain'
 'Charles V, Holy Roman Emperor']
---
5 Mathematics
['Euclid' 'Augustin-Louis Cauchy' 'Archimedes' 'John von Neumann'
 'Pierre de Fermat']
---
7 Philosophy and religion
['Augustine of Hippo' 'Aristotle' 'Thomas Aquinas' 'Plato' 'Immanuel Kant']
---
8 Physical sciences
['Albert Einstein' 'Isaac Newton' 'J. J. Thomson' 'Niels Bohr'
 'Marie Curie']
---
9 Society and social sciences
['Noam Chomsky' 'Ralph Waldo Emerson' 'Barack Obama' 'Karl Marx'
 'Jean-Paul Sartre']
---
10 Technology
['Donald Knuth' 'Douglas Engelbart' 'Dennis Ritchie' 'Tim Berners-Lee'
 'Edsger W. Dijkstra']