Wikipedia

This notebook shows how to apply scikit-network to analyse the network structure of Wikipedia through its hyperlinks.

We consider the Wikivitals dataset of the netset collection. This dataset consists of approximately 10,000 articles of Wikipedia, the so-called vital articles.

[1]:
from IPython.display import SVG
[2]:
import numpy as np
[3]:
from sknetwork.data import load_netset
from sknetwork.ranking import PageRank, top_k
from sknetwork.embedding import Spectral
from sknetwork.clustering import Louvain
from sknetwork.classification import DiffusionClassifier
from sknetwork.utils import WardDense, get_neighbors
from sknetwork.visualization import svg_dendrogram

Data

All datasets of the netset collection can be easily imported with scikit-network.

[4]:
wikivitals = load_netset('wikivitals')
Parsing files...
Done.
[5]:
# graph of links
adjacency = wikivitals.adjacency
names = wikivitals.names
labels = wikivitals.labels
names_labels = wikivitals.names_labels
[6]:
adjacency
[6]:
<10011x10011 sparse matrix of type '<class 'numpy.bool_'>'
        with 824999 stored elements in Compressed Sparse Row format>
[7]:
# categories
print(names_labels)
['Arts' 'Biological and health sciences' 'Everyday life' 'Geography'
 'History' 'Mathematics' 'People' 'Philosophy and religion'
 'Physical sciences' 'Society and social sciences' 'Technology']
[8]:
# map category name to label index
label_id = {name: i for i, name in enumerate(names_labels)}

Sample

Let’s have a look at an article.

[9]:
i = 10000
print(names[i])
Édouard Manet
[10]:
# label
label = labels[i]
print(names_labels[label])
People
[11]:
# some hyperlinks
neighbors = get_neighbors(adjacency, i)
print(names[neighbors[:10]])
['Adolphe Thiers' 'American Civil War' 'Bordeaux' 'Camille Pissarro'
 'Carmen' 'Charles Baudelaire' 'Claude Monet' 'Diego Velázquez'
 'Edgar Allan Poe' 'Edgar Degas']
[12]:
len(neighbors)
[12]:
38
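Under the hood, looking up the neighbors of a node amounts to reading one row of the CSR adjacency matrix. A minimal sketch with SciPy on a toy graph (not the Wikivitals data, and not the actual `get_neighbors` implementation):

```python
import numpy as np
from scipy import sparse

# toy directed graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
row = np.array([0, 0, 1, 2])
col = np.array([1, 2, 2, 0])
adjacency = sparse.csr_matrix((np.ones(4, dtype=bool), (row, col)), shape=(3, 3))

def neighbors(adjacency, i):
    """Column indices of row i of a CSR matrix, i.e. the out-neighbors of node i."""
    return adjacency.indices[adjacency.indptr[i]:adjacency.indptr[i + 1]]

print(neighbors(adjacency, 0))  # [1 2]
```

The CSR format stores all column indices in a single array, with `indptr` marking where each row starts, so a neighbor lookup is a constant-time slice.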

PageRank

We first use (personalized) PageRank to select typical articles of each category.

[13]:
pagerank = PageRank()
[14]:
# number of articles per category
n_selection = 50
[15]:
# selection of articles
selection = []
for label in np.arange(len(names_labels)):
    ppr = pagerank.fit_predict(adjacency, seeds=(labels==label))
    scores = ppr * (labels==label)
    selection.append(top_k(scores, n_selection))
selection = np.array(selection)
[16]:
selection.shape
[16]:
(11, 50)
[17]:
# show selection
for label, name_label in enumerate(names_labels):
    print('---')
    print(label, name_label)
    print(names[selection[label, :5]])
---
0 Arts
['Opera' 'Symbolism (arts)' 'Encyclopædia Britannica'
 'Oxford English Dictionary' 'Metropolitan Museum of Art']
---
1 Biological and health sciences
['Biodiversity' 'Taxonomy (biology)' 'Binomial nomenclature' 'Fossil'
 'Metabolism']
---
2 Everyday life
['Rugby sevens' 'Handball' 'Tennis' 'Lacrosse' 'Cycle sport']
---
3 Geography
['Canada' 'Japan' 'Germany' 'United States' 'Italy']
---
4 History
['Han dynasty' 'Soviet Union' 'Ancient Rome' 'Nazi Germany'
 'British Empire']
---
5 Mathematics
['Function (mathematics)' 'Real number' 'Rational number'
 'Algebraic geometry' 'Vector space']
---
6 People
['Immanuel Kant' 'Barack Obama' 'Napoleon' 'Ralph Waldo Emerson'
 'Thomas Aquinas']
---
7 Philosophy and religion
['Buddhism' 'Catholic Church' 'Religion' 'Bible' 'Hinduism']
---
8 Physical sciences
['Iron' 'Temperature' 'Astronomy' 'Astronomical unit' 'Gravity']
---
9 Society and social sciences
['The New York Times' 'Chinese language' 'Spanish language'
 'Indo-European languages' 'Writing system']
---
10 Technology
['Wood' 'Operating system' 'Integrated circuit' 'Engineering'
 'Internal combustion engine']
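The selection above relies on personalized PageRank: the random walk restarts from the seed nodes instead of a uniform distribution, so nodes close to the seeds score higher. A minimal power-iteration sketch on a toy graph (NumPy only; illustrative, not the sknetwork implementation):

```python
import numpy as np

# toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge 2-3
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
n = 6
adjacency = np.zeros((n, n))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1

def personalized_pagerank(adjacency, seeds, damping=0.85, n_iter=100):
    """Power iteration with restart on the seed distribution."""
    out_degrees = adjacency.sum(axis=1)
    transition = adjacency / out_degrees[:, None]
    restart = seeds / seeds.sum()
    scores = restart.copy()
    for _ in range(n_iter):
        scores = damping * transition.T @ scores + (1 - damping) * restart
    return scores

seeds = np.array([1., 0, 0, 0, 0, 0])  # restart on node 0 only
scores = personalized_pagerank(adjacency, seeds)
# nodes of the seed's triangle score higher than the other triangle
print(np.round(scores, 3))
```

Masking the scores with the category indicator, as done above with `ppr * (labels==label)`, then keeps only the articles that both belong to the category and are central within it.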

Embedding

We now represent each node of the graph by a low-dimensional vector, and use hierarchical clustering to visualize the structure of this embedding.

[18]:
# dimension of the embedding
n_components = 20
[19]:
# embedding
spectral = Spectral(n_components)
embedding = spectral.fit_transform(adjacency)
[20]:
embedding.shape
[20]:
(10011, 20)
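The general idea behind a spectral embedding is to use the leading eigenvectors of a normalized adjacency matrix as coordinates; nodes in the same community get close coordinates. A minimal sketch with NumPy on a toy graph (the details differ from the `Spectral` implementation):

```python
import numpy as np

# toy graph: two triangles joined by a single edge
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
n = 6
adjacency = np.zeros((n, n))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1

# normalized adjacency D^{-1/2} A D^{-1/2}
degrees = adjacency.sum(axis=1)
d_inv_sqrt = np.diag(1 / np.sqrt(degrees))
normalized = d_inv_sqrt @ adjacency @ d_inv_sqrt

# eigh returns eigenvalues in ascending order;
# the top eigenvector (index -1) is trivial, so take the next one
eigenvalues, eigenvectors = np.linalg.eigh(normalized)
embedding = eigenvectors[:, -2]

# the sign of this coordinate separates the two triangles
print(np.sign(embedding))
```

With `n_components = 20`, each article is described by 20 such coordinates instead of one.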
[21]:
ward = WardDense()
[22]:
# hierarchy of articles
label = label_id['Physical sciences']
index = selection[label]
dendrogram_articles = ward.fit_transform(embedding[index])
[23]:
# visualization
image = svg_dendrogram(dendrogram_articles, names=names[index], rotate=True, width=200, scale=2, n_clusters=4)
SVG(image)
[23]:
../_images/use_cases_wikipedia_28_0.svg
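`WardDense` applies the Ward criterion to the embedding vectors and returns a dendrogram in the standard SciPy linkage format. The same hierarchy can be sketched directly with SciPy on a toy embedding (illustrative points, not the Wikivitals data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy embedding: two well-separated groups of points in 2D
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.1, (5, 2)),
                    rng.normal(5, 0.1, (5, 2))])

# Ward linkage: repeatedly merge the pair of clusters
# that minimizes the increase in within-cluster variance
dendrogram = linkage(points, method='ward')

# cut the dendrogram into 2 clusters
clusters = fcluster(dendrogram, t=2, criterion='maxclust')
print(clusters)
```

Cutting the dendrogram at different heights gives nested clusterings, which is what the `n_clusters` parameter of `svg_dendrogram` highlights with colors.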

Clustering

We now apply Louvain to get a clustering of the graph, independently of the known labels.

[24]:
algo = Louvain()
[25]:
labels_pred = algo.fit_transform(adjacency)
[26]:
np.unique(labels_pred, return_counts=True)
[26]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
 array([1800, 1368, 1315, 1225, 1165, 1067,  804,  672,  468,   97,   30]))
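Louvain greedily maximizes the modularity of the partition, which compares the fraction of intra-cluster edges to its expectation under a degree-preserving random rewiring. A minimal computation on a toy graph (NumPy only; `modularity` is also available in `sknetwork.clustering`):

```python
import numpy as np

# toy graph: two triangles joined by a single edge
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
n = 6
adjacency = np.zeros((n, n))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1

def modularity(adjacency, labels):
    """Q = (1/2m) * sum_ij (A_ij - d_i d_j / 2m) * 1{labels_i == labels_j}."""
    degrees = adjacency.sum(axis=1)
    two_m = degrees.sum()
    same = labels[:, None] == labels[None, :]
    return ((adjacency - np.outer(degrees, degrees) / two_m) * same).sum() / two_m

good = np.array([0, 0, 0, 1, 1, 1])  # one cluster per triangle
bad = np.array([0, 1, 0, 1, 0, 1])   # clusters mixing the triangles
print(modularity(adjacency, good), modularity(adjacency, bad))
```

The natural partition into the two triangles scores 5/14 ≈ 0.36, while the mixed partition scores negative, which is why Louvain recovers community-like clusters.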

We again use personalized PageRank to get the top articles of each cluster.

[27]:
n_selection = 5
[28]:
selection = []
for label in np.arange(len(set(labels_pred))):
    ppr = pagerank.fit_predict(adjacency, seeds=(labels_pred==label))
    scores = ppr * (labels_pred==label)
    selection.append(top_k(scores, n_selection))
selection = np.array(selection)
[29]:
# show selection
for label in np.arange(len(set(labels_pred))):
    print('---')
    print(label)
    print(names[selection[label]])
---
0
['Taxonomy (biology)' 'Animal' 'Plant' 'Protein' 'Species']
---
1
['Hydrogen' 'Oxygen' 'Kelvin' 'Electron' 'Physics']
---
2
['Latin' 'Roman Empire' 'World War I' 'Middle Ages' 'Greek language']
---
3
['United States' 'World War II' 'Geographic coordinate system'
 'United Kingdom' 'France']
---
4
['Aristotle' 'Christianity' 'Plato' 'Catholic Church'
 'Age of Enlightenment']
---
5
['Islam' 'India' 'China' 'Buddhism' 'Chinese language']
---
6
['The New York Times' 'BBC' 'New York City' 'Time (magazine)'
 'The Washington Post']
---
7
['Europe' 'Drainage basin' 'Atlantic Ocean' 'Earth' 'Pacific Ocean']
---
8
['Function (mathematics)' 'Real number' 'Complex number'
 'Set (mathematics)' 'Mathematical analysis']
---
9
['Incest' 'Adoption' 'Kinship' 'Marriage' 'Human sexuality']
---
10
['Handbag' 'Hat' 'Veil' 'Uniform' 'Clothing']

Classification

Finally, we use heat diffusion to predict the closest category for each article of the People category.
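Heat diffusion treats each known label as a heat source: one temperature vector per label is diffused over the graph, each unlabeled node repeatedly averaging the temperatures of its neighbors, and the hottest label wins. A minimal sketch on a toy graph (NumPy only; illustrative, not the `DiffusionClassifier` internals):

```python
import numpy as np

# toy path graph 0 - 1 - 2 - 3 - 4, with nodes 0 and 4 labeled
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
n = 5
adjacency = np.zeros((n, n))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1

seeds = {0: 0, 4: 1}  # node -> known label
n_labels = 2

# one temperature column per label, labeled nodes clamped to one-hot values
temperatures = np.full((n, n_labels), 0.5)
for node, label in seeds.items():
    temperatures[node] = np.eye(n_labels)[label]

transition = adjacency / adjacency.sum(axis=1)[:, None]
for _ in range(100):
    temperatures = transition @ temperatures
    for node, label in seeds.items():  # re-clamp the labeled nodes
        temperatures[node] = np.eye(n_labels)[label]

labels_pred = temperatures.argmax(axis=1)
print(labels_pred)  # nodes near node 0 get label 0, nodes near node 4 get label 1
```

In the cell below, the labeled nodes are all articles outside the People category, so each person is assigned the non-People category it is most connected to.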

[30]:
algo = DiffusionClassifier()
[31]:
people = label_id['People']
[32]:
labels_people = algo.fit_predict(adjacency, seeds={i: label for i, label in enumerate(labels) if label != people})
[33]:
n_selection = 5
[34]:
selection = []
for label in np.arange(len(names_labels)):
    if label != people:
        ppr = pagerank.fit_predict(adjacency, seeds=(labels==people)*(labels_people==label))
        scores = ppr * (labels==people)*(labels_people==label)
        selection.append(top_k(scores, n_selection))
selection = np.array(selection)
[35]:
# show selection
i = 0
for label, name_label in enumerate(names_labels):
    if label != people:
        print('---')
        print(label, name_label)
        print(names[selection[i]])
        i += 1
---
0 Arts
['Bob Dylan' 'Igor Stravinsky' 'Richard Wagner' 'Fred Astaire'
 'Ludwig van Beethoven']
---
1 Biological and health sciences
['Charles Darwin' 'Francis Crick' 'Robert Koch' 'Alexander Fleming'
 'Carl Linnaeus']
---
2 Everyday life
['Wayne Gretzky' 'LeBron James' 'Jackie Robinson' 'Jim Thorpe'
 'Willie Mays']
---
3 Geography
['Elizabeth II' 'Carl Lewis' 'Dwight D. Eisenhower' 'Muhammad Ali'
 'Vladimir Putin']
---
4 History
['Alexander the Great' 'Napoleon' 'Charlemagne' 'Philip II of Spain'
 'Charles V, Holy Roman Emperor']
---
5 Mathematics
['Euclid' 'Augustin-Louis Cauchy' 'Archimedes' 'John von Neumann'
 'Pierre de Fermat']
---
7 Philosophy and religion
['Augustine of Hippo' 'Aristotle' 'Thomas Aquinas' 'Plato' 'Immanuel Kant']
---
8 Physical sciences
['Albert Einstein' 'Isaac Newton' 'J. J. Thomson' 'Niels Bohr'
 'Marie Curie']
---
9 Society and social sciences
['Noam Chomsky' 'Ralph Waldo Emerson' 'Barack Obama' 'Karl Marx'
 'Jean-Paul Sartre']
---
10 Technology
['Donald Knuth' 'Douglas Engelbart' 'Dennis Ritchie' 'Tim Berners-Lee'
 'Edsger W. Dijkstra']