Load your data

In scikit-network, a graph is represented by its adjacency matrix (or biadjacency matrix for a bipartite graph) in the Compressed Sparse Row format of SciPy.

In this tutorial, we present a few methods to instantiate a graph in this format.

[1]:
from IPython.display import SVG

import numpy as np
from scipy import sparse
import pandas as pd

from sknetwork.data import from_edge_list, from_adjacency_list, from_graphml, from_csv
from sknetwork.visualization import visualize_graph, visualize_bigraph

From a NumPy array

For small graphs, you can instantiate the adjacency matrix as a dense NumPy array and convert it into a sparse matrix in CSR format.

[2]:
adjacency = np.array([[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 0]])
adjacency = sparse.csr_matrix(adjacency)

image = visualize_graph(adjacency)
SVG(image)
[2]:
../../_images/tutorials_data_load_data_4_0.svg

From an edge list

Another natural way to build a graph is from a list of edges.

[3]:
edge_list = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
adjacency = from_edge_list(edge_list)

image = visualize_graph(adjacency)
SVG(image)
[3]:
../../_images/tutorials_data_load_data_6_0.svg

By default, the graph is undirected, but you can easily make it directed.

[4]:
adjacency = from_edge_list(edge_list, directed=True)

image = visualize_graph(adjacency)
SVG(image)
[4]:
../../_images/tutorials_data_load_data_8_0.svg

You might also want to add weights to your edges. Just use triplets instead of pairs!

[5]:
edge_list = [(0, 1, 1), (1, 2, 0.5), (2, 3, 1), (3, 0, 0.5), (0, 2, 2)]
adjacency = from_edge_list(edge_list)

image = visualize_graph(adjacency)
SVG(image)
[5]:
../../_images/tutorials_data_load_data_10_0.svg

You can instantiate a bipartite graph as well.

[6]:
edge_list = [(0, 0), (1, 0), (1, 1), (2, 1)]
biadjacency = from_edge_list(edge_list, bipartite=True)

image = visualize_bigraph(biadjacency)
SVG(image)
[6]:
../../_images/tutorials_data_load_data_12_0.svg

If nodes are not indexed, you get an object of type Bunch with graph attributes (node names).

[7]:
edge_list = [("Alice", "Bob"), ("Bob", "Carey"), ("Alice", "David"), ("Carey", "David"), ("Bob", "David")]
graph = from_edge_list(edge_list)
[8]:
graph
[8]:
{'names': array(['Alice', 'Bob', 'Carey', 'David'], dtype='<U5'),
 'adjacency': <4x4 sparse matrix of type '<class 'numpy.int64'>'
        with 10 stored elements in Compressed Sparse Row format>}
[9]:
adjacency = graph.adjacency
names = graph.names
[10]:
image = visualize_graph(adjacency, names=names)
SVG(image)
[10]:
../../_images/tutorials_data_load_data_17_0.svg

By default, the weight of each edge is the number of occurrences of the corresponding link:

[11]:
edge_list_new = edge_list + [("Alice", "Bob"), ("Alice", "David"), ("Alice", "Bob")]
graph = from_edge_list(edge_list_new)
[12]:
adjacency = graph.adjacency
names = graph.names
[13]:
image = visualize_graph(adjacency, names=names)
SVG(image)
[13]:
../../_images/tutorials_data_load_data_21_0.svg

You can make the graph unweighted.

[14]:
graph = from_edge_list(edge_list_new, weighted=False)
[15]:
adjacency = graph.adjacency
names = graph.names
[16]:
image = visualize_graph(adjacency, names=names)
SVG(image)
[16]:
../../_images/tutorials_data_load_data_25_0.svg

Again, you can make the graph directed:

[17]:
graph = from_edge_list(edge_list, directed=True)
[18]:
graph
[18]:
{'names': array(['Alice', 'Bob', 'Carey', 'David'], dtype='<U5'),
 'adjacency': <4x4 sparse matrix of type '<class 'numpy.int64'>'
        with 5 stored elements in Compressed Sparse Row format>}
[19]:
adjacency = graph.adjacency
names = graph.names
[20]:
image = visualize_graph(adjacency, names=names)
SVG(image)
[20]:
../../_images/tutorials_data_load_data_30_0.svg

The graph can also have explicit weights:

[21]:
edge_list = [("Alice", "Bob", 3), ("Bob", "Carey", 2), ("Alice", "David", 1), ("Carey", "David", 2), ("Bob", "David", 3)]
graph = from_edge_list(edge_list)
[22]:
adjacency = graph.adjacency
names = graph.names
[23]:
image = visualize_graph(adjacency, names=names, display_edge_weight=True, display_node_weight=True)
SVG(image)
[23]:
../../_images/tutorials_data_load_data_34_0.svg

For a bipartite graph:

[24]:
edge_list = [("Alice", "Football"), ("Bob", "Tennis"), ("David", "Football"), ("Carey", "Tennis"), ("Carey", "Football")]
graph = from_edge_list(edge_list, bipartite=True)
[25]:
biadjacency = graph.biadjacency
names = graph.names
names_col = graph.names_col
[26]:
image = visualize_bigraph(biadjacency, names_row=names, names_col=names_col)
SVG(image)
[26]:
../../_images/tutorials_data_load_data_38_0.svg

From an adjacency list

You can also load a graph from an adjacency list, given as a list of lists or a dictionary of lists:

[27]:
adjacency_list =[[0, 1, 2], [2, 3]]
adjacency = from_adjacency_list(adjacency_list, directed=True)
[28]:
image = visualize_graph(adjacency)
SVG(image)
[28]:
../../_images/tutorials_data_load_data_41_0.svg
[29]:
adjacency_dict = {"Alice": ["Bob", "David"], "Bob": ["Carey", "David"]}
graph = from_adjacency_list(adjacency_dict, directed=True)
[30]:
adjacency = graph.adjacency
names = graph.names
[31]:
image = visualize_graph(adjacency, names=names)
SVG(image)
[31]:
../../_images/tutorials_data_load_data_44_0.svg

From a dataframe

Your dataframe might consist of a list of edges.

[32]:
df = pd.read_csv('miserables.tsv', sep='\t', names=['character_1', 'character_2'])
[33]:
df.head()
[33]:
character_1 character_2
0 Myriel Napoleon
1 Myriel Mlle Baptistine
2 Myriel Mme Magloire
3 Myriel Countess de Lo
4 Myriel Geborand
[34]:
edge_list = list(df.itertuples(index=False))
[35]:
graph = from_edge_list(edge_list)
[36]:
graph
[36]:
{'names': array(['Anzelma', 'Babet', 'Bahorel', 'Bamatabois', 'Baroness',
        'Blacheville', 'Bossuet', 'Boulatruelle', 'Brevet', 'Brujon',
        'Champmathieu', 'Champtercier', 'Chenildieu', 'Child1', 'Child2',
        'Claquesous', 'Cochepaille', 'Combeferre', 'Cosette', 'Count',
        'Countess de Lo', 'Courfeyrac', 'Cravatte', 'Dahlia', 'Enjolras',
        'Eponine', 'Fameuil', 'Fantine', 'Fauchelevent', 'Favourite',
        'Feuilly', 'Gavroche', 'Geborand', 'Gervais', 'Gillenormand',
        'Grantaire', 'Gribier', 'Gueulemer', 'Isabeau', 'Javert', 'Joly',
        'Jondrette', 'Judge', 'Labarre', 'Listolier', 'Lt Gillenormand',
        'Mabeuf', 'Magnon', 'Marguerite', 'Marius', 'Mlle Baptistine',
        'Mlle Gillenormand', 'Mlle Vaubois', 'Mme Burgon', 'Mme Der',
        'Mme Hucheloup', 'Mme Magloire', 'Mme Pontmercy', 'Mme Thenardier',
        'Montparnasse', 'MotherInnocent', 'MotherPlutarch', 'Myriel',
        'Napoleon', 'Old man', 'Perpetue', 'Pontmercy', 'Prouvaire',
        'Scaufflaire', 'Simplice', 'Thenardier', 'Tholomyes', 'Toussaint',
        'Valjean', 'Woman1', 'Woman2', 'Zephine'], dtype='<U17'),
 'adjacency': <77x77 sparse matrix of type '<class 'numpy.int64'>'
        with 508 stored elements in Compressed Sparse Row format>}
[37]:
df = pd.read_csv('movie_actor.tsv', sep='\t', names=['movie', 'actor'])
[38]:
df.head()
[38]:
movie actor
0 Inception Leonardo DiCaprio
1 Inception Marion Cotillard
2 Inception Joseph Gordon Lewitt
3 The Dark Knight Rises Marion Cotillard
4 The Dark Knight Rises Joseph Gordon Lewitt
[39]:
edge_list = list(df.itertuples(index=False))
[40]:
graph = from_edge_list(edge_list, bipartite=True)
[41]:
graph
[41]:
{'names_row': array(['007 Spectre', 'Aviator', 'Crazy Stupid Love', 'Drive',
        'Fantastic Beasts 2', 'Inception', 'Inglourious Basterds',
        'La La Land', 'Midnight In Paris', 'Murder on the Orient Express',
        'The Big Short', 'The Dark Knight Rises',
        'The Grand Budapest Hotel', 'The Great Gatsby', 'Vice'],
       dtype='<U28'),
 'names': array(['007 Spectre', 'Aviator', 'Crazy Stupid Love', 'Drive',
        'Fantastic Beasts 2', 'Inception', 'Inglourious Basterds',
        'La La Land', 'Midnight In Paris', 'Murder on the Orient Express',
        'The Big Short', 'The Dark Knight Rises',
        'The Grand Budapest Hotel', 'The Great Gatsby', 'Vice'],
       dtype='<U28'),
 'names_col': array(['Brad Pitt', 'Carey Mulligan', 'Christian Bale',
        'Christophe Waltz', 'Emma Stone', 'Johnny Depp',
        'Joseph Gordon Lewitt', 'Jude Law', 'Lea Seydoux',
        'Leonardo DiCaprio', 'Marion Cotillard', 'Owen Wilson',
        'Ralph Fiennes', 'Ryan Gosling', 'Steve Carell', 'Willem Dafoe'],
       dtype='<U28'),
 'biadjacency': <15x16 sparse matrix of type '<class 'numpy.int64'>'
        with 41 stored elements in Compressed Sparse Row format>}

For categorical data, you can use pandas to get a bipartite graph between samples and features. We show an example taken from the Adult Income dataset.

[42]:
df = pd.read_csv('adult-income.csv')
[43]:
df.head()
[43]:
age workclass occupation relationship gender income
0 40-49 State-gov Adm-clerical Not-in-family Male <=50K
1 50-59 Self-emp-not-inc Exec-managerial Husband Male <=50K
2 40-49 Private Handlers-cleaners Not-in-family Male <=50K
3 50-59 Private Handlers-cleaners Husband Male <=50K
4 30-39 Private Prof-specialty Wife Female <=50K
[44]:
df_binary = pd.get_dummies(df, sparse=True)
[45]:
df_binary.head()
[45]:
age_20-29 age_30-39 age_40-49 age_50-59 age_60-69 age_70-79 age_80-89 age_90-99 workclass_ ? workclass_ Federal-gov ... relationship_ Husband relationship_ Not-in-family relationship_ Other-relative relationship_ Own-child relationship_ Unmarried relationship_ Wife gender_ Female gender_ Male income_ <=50K income_ >50K
0 False False True False False False False False False False ... False True False False False False False True True False
1 False False False True False False False False False False ... True False False False False False False True True False
2 False False True False False False False False False False ... False True False False False False False True True False
3 False False False True False False False False False False ... True False False False False False False True True False
4 False True False False False False False False False False ... False False False False False True True False True False

5 rows × 42 columns

[46]:
biadjacency = df_binary.sparse.to_coo()
[47]:
biadjacency = sparse.csr_matrix(biadjacency)
[48]:
# biadjacency matrix of the bipartite graph
biadjacency
[48]:
<32561x42 sparse matrix of type '<class 'numpy.bool'>'
        with 195366 stored elements in Compressed Sparse Row format>
[49]:
# names of columns
names_col = list(df_binary)
[50]:
len(names_col)
[50]:
42
[51]:
names_col[:8]
[51]:
['age_20-29',
 'age_30-39',
 'age_40-49',
 'age_50-59',
 'age_60-69',
 'age_70-79',
 'age_80-89',
 'age_90-99']

From a CSV file

You can directly load a graph from a CSV or TSV file:

[52]:
graph = from_csv('miserables.tsv')
[53]:
graph
[53]:
{'names': array(['Anzelma', 'Babet', 'Bahorel', 'Bamatabois', 'Baroness',
        'Blacheville', 'Bossuet', 'Boulatruelle', 'Brevet', 'Brujon',
        'Champmathieu', 'Champtercier', 'Chenildieu', 'Child1', 'Child2',
        'Claquesous', 'Cochepaille', 'Combeferre', 'Cosette', 'Count',
        'Countess de Lo', 'Courfeyrac', 'Cravatte', 'Dahlia', 'Enjolras',
        'Eponine', 'Fameuil', 'Fantine', 'Fauchelevent', 'Favourite',
        'Feuilly', 'Gavroche', 'Geborand', 'Gervais', 'Gillenormand',
        'Grantaire', 'Gribier', 'Gueulemer', 'Isabeau', 'Javert', 'Joly',
        'Jondrette', 'Judge', 'Labarre', 'Listolier', 'Lt Gillenormand',
        'Mabeuf', 'Magnon', 'Marguerite', 'Marius', 'Mlle Baptistine',
        'Mlle Gillenormand', 'Mlle Vaubois', 'Mme Burgon', 'Mme Der',
        'Mme Hucheloup', 'Mme Magloire', 'Mme Pontmercy', 'Mme Thenardier',
        'Montparnasse', 'MotherInnocent', 'MotherPlutarch', 'Myriel',
        'Napoleon', 'Old man', 'Perpetue', 'Pontmercy', 'Prouvaire',
        'Scaufflaire', 'Simplice', 'Thenardier', 'Tholomyes', 'Toussaint',
        'Valjean', 'Woman1', 'Woman2', 'Zephine'], dtype='<U17'),
 'adjacency': <77x77 sparse matrix of type '<class 'numpy.int64'>'
        with 508 stored elements in Compressed Sparse Row format>}
[54]:
graph = from_csv('movie_actor.tsv', bipartite=True)
[55]:
graph
[55]:
{'names_row': array(['007 Spectre', 'Aviator', 'Crazy Stupid Love', 'Drive',
        'Fantastic Beasts 2', 'Inception', 'Inglourious Basterds',
        'La La Land', 'Midnight In Paris', 'Murder on the Orient Express',
        'The Big Short', 'The Dark Knight Rises',
        'The Grand Budapest Hotel', 'The Great Gatsby', 'Vice'],
       dtype='<U28'),
 'names': array(['007 Spectre', 'Aviator', 'Crazy Stupid Love', 'Drive',
        'Fantastic Beasts 2', 'Inception', 'Inglourious Basterds',
        'La La Land', 'Midnight In Paris', 'Murder on the Orient Express',
        'The Big Short', 'The Dark Knight Rises',
        'The Grand Budapest Hotel', 'The Great Gatsby', 'Vice'],
       dtype='<U28'),
 'names_col': array(['Brad Pitt', 'Carey Mulligan', 'Christian Bale',
        'Christophe Waltz', 'Emma Stone', 'Johnny Depp',
        'Joseph Gordon Lewitt', 'Jude Law', 'Lea Seydoux',
        'Leonardo DiCaprio', 'Marion Cotillard', 'Owen Wilson',
        'Ralph Fiennes', 'Ryan Gosling', 'Steve Carell', 'Willem Dafoe'],
       dtype='<U28'),
 'biadjacency': <15x16 sparse matrix of type '<class 'numpy.int64'>'
        with 41 stored elements in Compressed Sparse Row format>}

The graph can also be given in the form of adjacency lists (check the function from_csv).

From a GraphML file

You can also load a graph stored in the GraphML format.

[56]:
graph = from_graphml('miserables.graphml')
adjacency = graph.adjacency
names = graph.names
[57]:
# Directed graph
graph = from_graphml('painters.graphml')
adjacency = graph.adjacency
names = graph.names

From NetworkX

NetworkX has import and export functions from and towards the CSR format.

Other options

  • You want to test our toy graphs

  • You want to generate a graph from a model

  • You want to load a graph from existing repositories (see NetSet and KONECT)

Take a look at the other tutorials of the data section!