Load your data

In scikit-network, a graph is represented by its adjacency matrix (or its biadjacency matrix, for a bipartite graph), stored in the Compressed Sparse Row (CSR) format of SciPy.

In this tutorial, we present a few methods to instantiate a graph in this format.
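
To make the CSR format concrete, here is a minimal sketch that builds a small sparse matrix with SciPy and inspects its three underlying arrays. The example matrix and the variable names are illustrative assumptions and are not part of scikit-network.

```python
import numpy as np
from scipy import sparse

# Illustrative 4-node undirected graph, given as a dense adjacency matrix.
dense = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 1],
                  [1, 1, 0, 0],
                  [0, 1, 0, 0]])
adjacency = sparse.csr_matrix(dense)

# CSR stores only the non-zero entries, row by row:
# - data holds the values,
# - indices holds the column index of each value,
# - indptr[i]:indptr[i + 1] delimits the entries of row i.
print(adjacency.indptr)
print(adjacency.indices)
print(adjacency.data)

# The neighbors of node 1 are the column indices stored for row 1.
start, end = adjacency.indptr[1], adjacency.indptr[2]
neighbors_of_1 = adjacency.indices[start:end]
print(neighbors_of_1)
```

Because only non-zero entries are stored, memory usage grows with the number of edges rather than with the square of the number of nodes.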

[1]:

from IPython.display import SVG

import numpy as np
from scipy import sparse
import pandas as pd

from sknetwork.data import from_edge_list, from_adjacency_list, from_graphml, from_csv
from sknetwork.visualization import visualize_graph, visualize_bigraph

From a NumPy array

For small graphs, you can instantiate the adjacency matrix as a dense NumPy array and convert it into a sparse matrix in CSR format.

[2]:

adjacency = np.array([[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 0]])
adjacency = sparse.csr_matrix(adjacency)

image = visualize_graph(adjacency)
SVG(image)

[2]:

../../_images/tutorials_data_load_data_4_0.svg
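
As a quick sanity check, the sketch below verifies that this matrix is symmetric (as expected for an undirected graph) and counts its nodes and edges. It uses SciPy only, and the variable names are illustrative.

```python
import numpy as np
from scipy import sparse

adjacency = sparse.csr_matrix(
    np.array([[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 0]])
)

# An undirected graph has a symmetric adjacency matrix:
# the matrix and its transpose differ in no entry.
is_symmetric = (adjacency != adjacency.T).nnz == 0

# Each undirected edge is stored twice, once per direction
# (this count assumes there are no self-loops, which is the case here).
n_nodes = adjacency.shape[0]
n_edges = adjacency.nnz // 2
print(is_symmetric, n_nodes, n_edges)
```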

From an edge list

Another natural way to build a graph is from a list of edges.

[3]:

edge_list = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
adjacency = from_edge_list(edge_list)

image = visualize_graph(adjacency)
SVG(image)

[3]:

../../_images/tutorials_data_load_data_6_0.svg

By default, the graph is undirected, but you can easily make it directed.

[4]:

adjacency = from_edge_list(edge_list, directed=True)

image = visualize_graph(adjacency)
SVG(image)

[4]:

../../_images/tutorials_data_load_data_8_0.svg
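
The difference between the two cases can be sketched with SciPy alone. This is an illustration of the idea under simple assumptions (integer node indices, no repeated edges), not the actual implementation of from_edge_list.

```python
import numpy as np
from scipy import sparse

edge_list = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
row, col = np.array(edge_list).T
n_nodes = int(max(row.max(), col.max())) + 1
weights = np.ones(len(edge_list), dtype=int)

# Directed graph: edge (i, j) only sets entry (i, j).
adjacency_directed = sparse.csr_matrix((weights, (row, col)), shape=(n_nodes, n_nodes))

# Undirected graph: each edge is also stored in the reverse direction,
# which makes the matrix symmetric.
adjacency_undirected = adjacency_directed + adjacency_directed.T

print(adjacency_directed.nnz, adjacency_undirected.nnz)
```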

You might also want to add weights to your edges. Just use triplets instead of pairs!

[5]:

edge_list = [(0, 1, 1), (1, 2, 0.5), (2, 3, 1), (3, 0, 0.5), (0, 2, 2)]
adjacency = from_edge_list(edge_list)

image = visualize_graph(adjacency)
SVG(image)

[5]:

../../_images/tutorials_data_load_data_10_0.svg

You can instantiate a bipartite graph as well.

[6]:

edge_list = [(0, 0), (1, 0), (1, 1), (2, 1)]
biadjacency = from_edge_list(edge_list, bipartite=True)

image = visualize_bigraph(biadjacency)
SVG(image)

[6]:

../../_images/tutorials_data_load_data_12_0.svg

If the nodes are identified by names rather than by integer indices, you get an object of type Bunch that holds the adjacency matrix together with graph attributes (here, the node names).

[7]:

edge_list = [("Alice", "Bob"), ("Bob", "Carey"), ("Alice", "David"), ("Carey", "David"), ("Bob", "David")]
graph = from_edge_list(edge_list)

[8]:

graph

[8]:

{'names': array(['Alice', 'Bob', 'Carey', 'David'], dtype='<U5'),
 'adjacency': <Compressed Sparse Row sparse matrix of dtype 'int64'
        with 10 stored elements and shape (4, 4)>}

[9]:

adjacency = graph.adjacency
names = graph.names

[10]:

image = visualize_graph(adjacency, names=names)
SVG(image)

[10]:

../../_images/tutorials_data_load_data_17_0.svg
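
The names array is what links matrix indices back to your original labels. The sketch below rebuilds an equivalent pair (names, adjacency) with NumPy and SciPy, assuming that names are sorted as in the output above, and then looks up the neighbors of a node by name. The helper dictionary index is hypothetical and is not part of scikit-network.

```python
import numpy as np
from scipy import sparse

edge_list = [("Alice", "Bob"), ("Bob", "Carey"), ("Alice", "David"),
             ("Carey", "David"), ("Bob", "David")]

# Map each name to an integer index (np.unique returns sorted unique names).
flat = np.array(edge_list).ravel()
names, inverse = np.unique(flat, return_inverse=True)
row, col = inverse.reshape(-1, 2).T

n_nodes = len(names)
weights = np.ones(len(edge_list), dtype=int)
adjacency = sparse.csr_matrix((weights, (row, col)), shape=(n_nodes, n_nodes))
adjacency = adjacency + adjacency.T  # undirected graph

# Hypothetical helper: look up a node index from its name.
index = {name: i for i, name in enumerate(names)}
bob = index["Bob"]
neighbors_of_bob = sorted(names[adjacency[bob].indices].tolist())
print(neighbors_of_bob)
```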

By default, the weight of each edge is the number of occurrences of the corresponding link:

[11]:

edge_list_new = edge_list + [("Alice", "Bob"), ("Alice", "David"), ("Alice", "Bob")]
graph = from_edge_list(edge_list_new)

[12]:

adjacency = graph.adjacency
names = graph.names

[13]:

image = visualize_graph(adjacency, names=names)
SVG(image)

[13]:

../../_images/tutorials_data_load_data_21_0.svg
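
To see which weights to expect, you can count the occurrences of each link yourself. This sketch uses only the standard library and treats each edge as an unordered pair, which is an assumption that matches the undirected case.

```python
from collections import Counter

edge_list = [("Alice", "Bob"), ("Bob", "Carey"), ("Alice", "David"),
             ("Carey", "David"), ("Bob", "David")]
edge_list_new = edge_list + [("Alice", "Bob"), ("Alice", "David"), ("Alice", "Bob")]

# frozenset makes ("Alice", "Bob") and ("Bob", "Alice") count as the same link.
occurrences = Counter(frozenset(edge) for edge in edge_list_new)

for link, count in occurrences.items():
    print(sorted(link), count)
```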

You can make the graph unweighted.

[14]:

graph = from_edge_list(edge_list_new, weighted=False)

[15]:

adjacency = graph.adjacency
names = graph.names

[16]:

image = visualize_graph(adjacency, names=names)
SVG(image)

[16]:

../../_images/tutorials_data_load_data_25_0.svg

Again, you can make the graph directed:

[17]:

graph = from_edge_list(edge_list, directed=True)

[18]:

graph

[18]:

{'names': array(['Alice', 'Bob', 'Carey', 'David'], dtype='<U5'),
 'adjacency': <Compressed Sparse Row sparse matrix of dtype 'int64'
        with 5 stored elements and shape (4, 4)>}

[19]:

adjacency = graph.adjacency
names = graph.names

[20]:

image = visualize_graph(adjacency, names=names)
SVG(image)

[20]:

../../_images/tutorials_data_load_data_30_0.svg
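
In a directed graph, the adjacency matrix is no longer symmetric, so out-degrees and in-degrees differ. The sketch below rebuilds the same 5-edge matrix with SciPy, assuming that rows correspond to source nodes and columns to target nodes, and computes both degree vectors.

```python
import numpy as np
from scipy import sparse

names = np.array(["Alice", "Bob", "Carey", "David"])

# Edges: Alice->Bob, Bob->Carey, Alice->David, Carey->David, Bob->David.
row = np.array([0, 1, 0, 2, 1])
col = np.array([1, 2, 3, 3, 3])
adjacency = sparse.csr_matrix((np.ones(5, dtype=int), (row, col)), shape=(4, 4))

# Row sums count outgoing edges; column sums count incoming edges.
out_degrees = np.asarray(adjacency.sum(axis=1)).ravel()
in_degrees = np.asarray(adjacency.sum(axis=0)).ravel()

for name, d_out, d_in in zip(names, out_degrees, in_degrees):
    print(name, d_out, d_in)
```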

The graph can also have explicit weights:

[21]:

edge_list = [("Alice", "Bob", 3), ("Bob", "Carey", 2), ("Alice", "David", 1), ("Carey", "David", 2), ("Bob", "David", 3)]
graph = from_edge_list(edge_list)

[22]:

adjacency = graph.adjacency
names = graph.names

[23]:

image = visualize_graph(adjacency, names=names, display_edge_weight=True, display_node_weight=True)
SVG(image)

[23]:

../../_images/tutorials_data_load_data_34_0.svg

The same works for a bipartite graph with named nodes:

[24]:

edge_list = [("Alice", "Football"), ("Bob", "Tennis"), ("David", "Football"), ("Carey", "Tennis"), ("Carey", "Football")]
graph = from_edge_list(edge_list, bipartite=True)

[25]:

biadjacency = graph.biadjacency
names = graph.names
names_col = graph.names_col

[26]:

image = visualize_bigraph(biadjacency, names_row=names, names_col=names_col)
SVG(image)

[26]:

../../_images/tutorials_data_load_data_38_0.svg
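
A biadjacency matrix has one row per node of the first set and one column per node of the second set, so it is generally not square. The following sketch builds it with SciPy for the example above and then assembles the adjacency matrix of the same graph seen as an ordinary undirected graph. It is an illustration with assumed variable names, not scikit-network code.

```python
import numpy as np
from scipy import sparse

edge_list = [("Alice", "Football"), ("Bob", "Tennis"), ("David", "Football"),
             ("Carey", "Tennis"), ("Carey", "Football")]

# Index the two node sets separately.
rows, cols = zip(*edge_list)
names_row, row = np.unique(rows, return_inverse=True)
names_col, col = np.unique(cols, return_inverse=True)

weights = np.ones(len(edge_list), dtype=int)
biadjacency = sparse.csr_matrix((weights, (row, col)),
                                shape=(len(names_row), len(names_col)))

# Adjacency matrix of the whole graph, as the block matrix [[0, B], [B^T, 0]].
adjacency = sparse.bmat([[None, biadjacency], [biadjacency.T, None]], format="csr")

print(biadjacency.toarray())
print(adjacency.shape)
```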

From an adjacency list

You can also load a graph from an adjacency list, given as a list of lists or a dictionary of lists:

[27]:

adjacency_list = [[0, 1, 2], [2, 3]]
adjacency = from_adjacency_list(adjacency_list, directed=True)

[28]:

image = visualize_graph(adjacency)
SVG(image)

[28]:

../../_images/tutorials_data_load_data_41_0.svg

[29]:

adjacency_dict = {"Alice": ["Bob", "David"], "Bob": ["Carey", "David"]}
graph = from_adjacency_list(adjacency_dict, directed=True)

[30]:

adjacency = graph.adjacency
names = graph.names

[31]:

image = visualize_graph(adjacency, names=names)
SVG(image)

[31]:

../../_images/tutorials_data_load_data_44_0.svg
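
An adjacency list carries the same information as an edge list. The sketch below converts both forms into edge lists in plain Python, assuming that the i-th inner list gives the neighbors of node i and that each dictionary key maps a node to its neighbors.

```python
# List of lists: position i holds the neighbors of node i.
adjacency_list = [[0, 1, 2], [2, 3]]
edge_list = [(node, neighbor)
             for node, neighbors in enumerate(adjacency_list)
             for neighbor in neighbors]
print(edge_list)

# Dictionary of lists: each key is a node, each value its neighbors.
adjacency_dict = {"Alice": ["Bob", "David"], "Bob": ["Carey", "David"]}
edge_list_named = [(node, neighbor)
                   for node, neighbors in adjacency_dict.items()
                   for neighbor in neighbors]
print(edge_list_named)
```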

From a dataframe

Your dataframe might consist of a list of edges.

[32]:

df = pd.read_csv('miserables.tsv', sep='\t', names=['character_1', 'character_2'])

[33]:

df.head()

[33]:

	character_1	character_2
0	Myriel	Napoleon
1	Myriel	Mlle Baptistine
2	Myriel	Mme Magloire
3	Myriel	Countess de Lo
4	Myriel	Geborand

[34]:

edge_list = list(df.itertuples(index=False))
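
If you do not have the file at hand, the conversion can be tried on a small dataframe built in memory. In this sketch the three rows are copied from the preview above. Passing name=None to itertuples is an optional variation that yields plain tuples instead of named tuples.

```python
import pandas as pd

# Small in-memory stand-in for the dataframe read from the TSV file.
df = pd.DataFrame({
    "character_1": ["Myriel", "Myriel", "Myriel"],
    "character_2": ["Napoleon", "Mlle Baptistine", "Mme Magloire"],
})

# Named tuples: fields can be accessed by column name.
edge_list = list(df.itertuples(index=False))
print(edge_list[0].character_1, edge_list[0].character_2)

# Plain tuples: same content, without the field names.
edge_list_plain = list(df.itertuples(index=False, name=None))
print(edge_list_plain)
```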

[35]:

graph = from_edge_list(edge_list)

[36]:

graph

[36]:

{'names': array(['Anzelma', 'Babet', 'Bahorel', 'Bamatabois', 'Baroness',
        'Blacheville', 'Bossuet', 'Boulatruelle', 'Brevet', 'Brujon',
        'Champmathieu', 'Champtercier', 'Chenildieu', 'Child1', 'Child2',
        'Claquesous', 'Cochepaille', 'Combeferre', 'Cosette', 'Count',
        'Countess de Lo', 'Courfeyrac', 'Cravatte', 'Dahlia', 'Enjolras',
        'Eponine', 'Fameuil', 'Fantine', 'Fauchelevent', 'Favourite',
        'Feuilly', 'Gavroche', 'Geborand', 'Gervais', 'Gillenormand',
        'Grantaire', 'Gribier', 'Gueulemer', 'Isabeau', 'Javert', 'Joly',
        'Jondrette', 'Judge', 'Labarre', 'Listolier', 'Lt Gillenormand',
        'Mabeuf', 'Magnon', 'Marguerite', 'Marius', 'Mlle Baptistine',
        'Mlle Gillenormand', 'Mlle Vaubois', 'Mme Burgon', 'Mme Der',
        'Mme Hucheloup', 'Mme Magloire', 'Mme Pontmercy', 'Mme Thenardier',
        'Montparnasse', 'MotherInnocent', 'MotherPlutarch', 'Myriel',
        'Napoleon', 'Old man', 'Perpetue', 'Pontmercy', 'Prouvaire',
        'Scaufflaire', 'Simplice', 'Thenardier', 'Tholomyes', 'Toussaint',
        'Valjean', 'Woman1', 'Woman2', 'Zephine'], dtype='<U17'),
 'adjacency': <Compressed Sparse Row sparse matrix of dtype 'int64'
        with 508 stored elements and shape (77, 77)>}

[37]:

df = pd.read_csv('movie_actor.tsv', sep='\t', names=['movie', 'actor'])

[38]:

df.head()

[38]:

	movie	actor
0	Inception	Leonardo DiCaprio
1	Inception	Marion Cotillard
2	Inception	Joseph Gordon Lewitt
3	The Dark Knight Rises	Marion Cotillard
4	The Dark Knight Rises	Joseph Gordon Lewitt

[39]:

edge_list = list(df.itertuples(index=False))

[40]:

graph = from_edge_list(edge_list, bipartite=True)

[41]:

graph

[41]:

{'names_row': array(['007 Spectre', 'Aviator', 'Crazy Stupid Love', 'Drive',
        'Fantastic Beasts 2', 'Inception', 'Inglourious Basterds',
        'La La Land', 'Midnight In Paris', 'Murder on the Orient Express',
        'The Big Short', 'The Dark Knight Rises',
        'The Grand Budapest Hotel', 'The Great Gatsby', 'Vice'],
       dtype='<U28'),
 'names': array(['007 Spectre', 'Aviator', 'Crazy Stupid Love', 'Drive',
        'Fantastic Beasts 2', 'Inception', 'Inglourious Basterds',
        'La La Land', 'Midnight In Paris', 'Murder on the Orient Express',
        'The Big Short', 'The Dark Knight Rises',
        'The Grand Budapest Hotel', 'The Great Gatsby', 'Vice'],
       dtype='<U28'),
 'names_col': array(['Brad Pitt', 'Carey Mulligan', 'Christian Bale',
        'Christophe Waltz', 'Emma Stone', 'Johnny Depp',
        'Joseph Gordon Lewitt', 'Jude Law', 'Lea Seydoux',
        'Leonardo DiCaprio', 'Marion Cotillard', 'Owen Wilson',
        'Ralph Fiennes', 'Ryan Gosling', 'Steve Carell', 'Willem Dafoe'],
       dtype='<U28'),
 'biadjacency': <Compressed Sparse Row sparse matrix of dtype 'int64'
        with 41 stored elements and shape (15, 16)>}

For categorical data, you can use pandas to get a bipartite graph between samples and features. We show an example taken from the Adult Income dataset.

[42]:

df = pd.read_csv('adult-income.csv')

[43]:

df.head()

[43]:

	age	workclass	occupation	relationship	gender	income
0	40-49	State-gov	Adm-clerical	Not-in-family	Male	<=50K
1	50-59	Self-emp-not-inc	Exec-managerial	Husband	Male	<=50K
2	40-49	Private	Handlers-cleaners	Not-in-family	Male	<=50K
3	50-59	Private	Handlers-cleaners	Husband	Male	<=50K
4	30-39	Private	Prof-specialty	Wife	Female	<=50K

[44]:

df_binary = pd.get_dummies(df, sparse=True)

[45]:

df_binary.head()

[45]:

	age_20-29	age_30-39	age_40-49	age_50-59	age_60-69	age_70-79	age_80-89	age_90-99	workclass_ ?	workclass_ Federal-gov	...	relationship_ Husband	relationship_ Not-in-family	relationship_ Other-relative	relationship_ Own-child	relationship_ Unmarried	relationship_ Wife	gender_ Female	gender_ Male	income_ <=50K	income_ >50K
0	False	False	True	False	False	False	False	False	False	False	...	False	True	False	False	False	False	False	True	True	False
1	False	False	False	True	False	False	False	False	False	False	...	True	False	False	False	False	False	False	True	True	False
2	False	False	True	False	False	False	False	False	False	False	...	False	True	False	False	False	False	False	True	True	False
3	False	False	False	True	False	False	False	False	False	False	...	True	False	False	False	False	False	False	True	True	False
4	False	True	False	False	False	False	False	False	False	False	...	False	False	False	False	False	True	True	False	True	False

5 rows × 42 columns

[46]:

biadjacency = df_binary.sparse.to_coo()

[47]:

biadjacency = sparse.csr_matrix(biadjacency)

[48]:

# biadjacency matrix of the bipartite graph
biadjacency

[48]:

<Compressed Sparse Row sparse matrix of dtype 'bool'
        with 195366 stored elements and shape (32561, 42)>
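
The same steps can be tried on a toy dataframe. In this sketch, the two columns and three rows are made up for illustration. Each sample is linked to exactly one feature per categorical column, which is consistent with the output above: 195366 = 32561 × 6 stored elements for 6 categorical columns.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy categorical data (hypothetical values, same spirit as the dataset above).
df = pd.DataFrame({
    "age": ["40-49", "50-59", "40-49"],
    "gender": ["Male", "Male", "Female"],
})

# One binary column per (column, category) pair, stored as sparse columns.
df_binary = pd.get_dummies(df, sparse=True)
names_col = list(df_binary)

# Samples are rows, features are columns of the biadjacency matrix.
biadjacency = sparse.csr_matrix(df_binary.sparse.to_coo())

# Number of features linked to each sample.
degrees = np.asarray(biadjacency.astype(int).sum(axis=1)).ravel()
print(names_col)
print(degrees)
```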

[49]:

# names of columns
names_col = list(df_binary)

[50]:

len(names_col)

[50]:

42

[51]:

names_col[:8]

[51]:

['age_20-29',
 'age_30-39',
 'age_40-49',
 'age_50-59',
 'age_60-69',
 'age_70-79',
 'age_80-89',
 'age_90-99']

From a CSV file

You can directly load a graph from a CSV or TSV file:

[52]:

graph = from_csv('miserables.tsv')

[53]:

graph

[53]:

{'names': array(['Anzelma', 'Babet', 'Bahorel', 'Bamatabois', 'Baroness',
        'Blacheville', 'Bossuet', 'Boulatruelle', 'Brevet', 'Brujon',
        'Champmathieu', 'Champtercier', 'Chenildieu', 'Child1', 'Child2',
        'Claquesous', 'Cochepaille', 'Combeferre', 'Cosette', 'Count',
        'Countess de Lo', 'Courfeyrac', 'Cravatte', 'Dahlia', 'Enjolras',
        'Eponine', 'Fameuil', 'Fantine', 'Fauchelevent', 'Favourite',
        'Feuilly', 'Gavroche', 'Geborand', 'Gervais', 'Gillenormand',
        'Grantaire', 'Gribier', 'Gueulemer', 'Isabeau', 'Javert', 'Joly',
        'Jondrette', 'Judge', 'Labarre', 'Listolier', 'Lt Gillenormand',
        'Mabeuf', 'Magnon', 'Marguerite', 'Marius', 'Mlle Baptistine',
        'Mlle Gillenormand', 'Mlle Vaubois', 'Mme Burgon', 'Mme Der',
        'Mme Hucheloup', 'Mme Magloire', 'Mme Pontmercy', 'Mme Thenardier',
        'Montparnasse', 'MotherInnocent', 'MotherPlutarch', 'Myriel',
        'Napoleon', 'Old man', 'Perpetue', 'Pontmercy', 'Prouvaire',
        'Scaufflaire', 'Simplice', 'Thenardier', 'Tholomyes', 'Toussaint',
        'Valjean', 'Woman1', 'Woman2', 'Zephine'], dtype='<U17'),
 'adjacency': <Compressed Sparse Row sparse matrix of dtype 'int64'
        with 508 stored elements and shape (77, 77)>}
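
For reference, such a file simply contains one edge per line, with the two node names separated by a tab. The sketch below parses a few lines of this kind with the standard library only. It illustrates the file layout and is not how from_csv is implemented.

```python
import csv
import io

# First lines of a tab-separated edge list (taken from the preview above).
content = "Myriel\tNapoleon\nMyriel\tMlle Baptistine\nMyriel\tMme Magloire\n"

# Each line becomes one (source, target) pair.
reader = csv.reader(io.StringIO(content), delimiter="\t")
edge_list = [tuple(line) for line in reader]
print(edge_list)
```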

[54]:

graph = from_csv('movie_actor.tsv', bipartite=True)

[55]:

graph

[55]:

{'names_row': array(['007 Spectre', 'Aviator', 'Crazy Stupid Love', 'Drive',
        'Fantastic Beasts 2', 'Inception', 'Inglourious Basterds',
        'La La Land', 'Midnight In Paris', 'Murder on the Orient Express',
        'The Big Short', 'The Dark Knight Rises',
        'The Grand Budapest Hotel', 'The Great Gatsby', 'Vice'],
       dtype='<U28'),
 'names': array(['007 Spectre', 'Aviator', 'Crazy Stupid Love', 'Drive',
        'Fantastic Beasts 2', 'Inception', 'Inglourious Basterds',
        'La La Land', 'Midnight In Paris', 'Murder on the Orient Express',
        'The Big Short', 'The Dark Knight Rises',
        'The Grand Budapest Hotel', 'The Great Gatsby', 'Vice'],
       dtype='<U28'),
 'names_col': array(['Brad Pitt', 'Carey Mulligan', 'Christian Bale',
        'Christophe Waltz', 'Emma Stone', 'Johnny Depp',
        'Joseph Gordon Lewitt', 'Jude Law', 'Lea Seydoux',
        'Leonardo DiCaprio', 'Marion Cotillard', 'Owen Wilson',
        'Ralph Fiennes', 'Ryan Gosling', 'Steve Carell', 'Willem Dafoe'],
       dtype='<U28'),
 'biadjacency': <Compressed Sparse Row sparse matrix of dtype 'int64'
        with 41 stored elements and shape (15, 16)>}

The file can also store the graph as adjacency lists rather than as an edge list; see the documentation of the function from_csv.

From a GraphML file

You can also load a graph stored in the GraphML format.

[56]:

graph = from_graphml('miserables.graphml')
adjacency = graph.adjacency
names = graph.names

[57]:

# Directed graph
graph = from_graphml('painters.graphml')
adjacency = graph.adjacency
names = graph.names
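
GraphML is an XML format in which the graph element declares whether edges are directed by default. The sketch below parses a tiny, made-up GraphML document with the standard library to show where nodes, edges and the edgedefault attribute live. It is only an illustration of the file structure, not a replacement for from_graphml.

```python
import xml.etree.ElementTree as ET

# Minimal hypothetical GraphML document with three nodes and two directed edges.
GRAPHML = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="directed">
    <node id="n0"/>
    <node id="n1"/>
    <node id="n2"/>
    <edge source="n0" target="n1"/>
    <edge source="n1" target="n2"/>
  </graph>
</graphml>"""

# GraphML elements live in this XML namespace.
ns = {"g": "http://graphml.graphdrawing.org/xmlns"}
root = ET.fromstring(GRAPHML)
graph_element = root.find("g:graph", ns)

is_directed = graph_element.get("edgedefault") == "directed"
nodes = [node.get("id") for node in graph_element.findall("g:node", ns)]
edge_list = [(edge.get("source"), edge.get("target"))
             for edge in graph_element.findall("g:edge", ns)]
print(is_directed, nodes, edge_list)
```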

From NetworkX

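NetworkX provides functions to import a graph from a SciPy sparse matrix and to export a graph to one, so graphs can be exchanged with scikit-network through the CSR format.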
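
A minimal round trip might look as follows. This sketch assumes NetworkX 2.7 or later, where the conversion functions are named from_scipy_sparse_array and to_scipy_sparse_array; earlier versions use different names.

```python
import networkx as nx
import numpy as np
from scipy import sparse

adjacency = sparse.csr_matrix(
    np.array([[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 0]])
)

# CSR matrix -> NetworkX graph (matrix entries become the "weight" edge attribute).
graph_nx = nx.from_scipy_sparse_array(adjacency)

# NetworkX graph -> CSR matrix, with rows and columns ordered as in nodelist.
nodelist = sorted(graph_nx.nodes)
exported = nx.to_scipy_sparse_array(graph_nx, nodelist=nodelist, format="csr")
adjacency_back = sparse.csr_matrix(exported)

print(graph_nx.number_of_nodes(), graph_nx.number_of_edges())
```

Passing an explicit nodelist keeps the row order under your control, which matters if you later match rows against an array of node names.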

Other options

You want to test our toy graphs
You want to generate a graph from a model
You want to load a graph from existing repositories (see NetSet and KONECT)

Take a look at the other tutorials of the data section!