Load your data
In scikit-network, a graph is represented by its adjacency matrix (or biadjacency matrix for a bipartite graph) in the Compressed Sparse Row format of SciPy.
In this tutorial, we present a few methods to instantiate a graph in this format.
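To make the target format concrete, here is a minimal sketch of what a CSR matrix stores, using SciPy only. The small matrix is an illustrative assumption and does not come from the datasets used in this tutorial.

```python
import numpy as np
from scipy import sparse

# Dense adjacency matrix of a small undirected graph (illustrative values).
dense = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [1, 0, 0]])
adjacency = sparse.csr_matrix(dense)

# CSR stores three arrays: row pointers, column indices and values.
indptr = adjacency.indptr
indices = adjacency.indices
data = adjacency.data

# The neighbors of node i are indices[indptr[i]:indptr[i + 1]].
neighbors_of_0 = indices[indptr[0]:indptr[1]]
print(neighbors_of_0)
```

Row slicing is cheap in this layout, which is why it suits algorithms that repeatedly look up the neighbors of a node.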
[1]:
from IPython.display import SVG
import numpy as np
from scipy import sparse
import pandas as pd
from sknetwork.data import from_edge_list, from_adjacency_list, from_graphml, from_csv
from sknetwork.visualization import visualize_graph, visualize_bigraph
From a NumPy array
For small graphs, you can instantiate the adjacency matrix as a dense NumPy array and convert it into a sparse matrix in CSR format.
[2]:
adjacency = np.array([[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 0]])
adjacency = sparse.csr_matrix(adjacency)
image = visualize_graph(adjacency)
SVG(image)
[2]:
From an edge list
Another natural way to build a graph is from a list of edges.
[3]:
edge_list = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
adjacency = from_edge_list(edge_list)
image = visualize_graph(adjacency)
SVG(image)
[3]:
By default, the graph is undirected, but you can easily make it directed.
[4]:
adjacency = from_edge_list(edge_list, directed=True)
image = visualize_graph(adjacency)
SVG(image)
[4]:
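The difference between the undirected and directed cases can be sketched with SciPy alone. This is only an illustration of the resulting matrices under the assumption that an undirected edge is stored in both directions; it is not the implementation of from_edge_list.

```python
import numpy as np
from scipy import sparse

edge_list = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
rows, cols = np.array(edge_list).T
n = int(max(rows.max(), cols.max())) + 1
ones = np.ones(len(edge_list), dtype=int)

# Directed: one stored entry per (source, target) pair.
directed = sparse.csr_matrix((ones, (rows, cols)), shape=(n, n))

# Undirected: add the transpose so that each edge appears in both directions.
undirected = directed + directed.T

print(directed.nnz, undirected.nnz)
```

The undirected matrix is symmetric and holds twice as many stored entries as the directed one, since no edge of this list appears in both orientations.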
You might also want to add weights to your edges. Just use triplets instead of pairs!
[5]:
edge_list = [(0, 1, 1), (1, 2, 0.5), (2, 3, 1), (3, 0, 0.5), (0, 2, 2)]
adjacency = from_edge_list(edge_list)
image = visualize_graph(adjacency)
SVG(image)
[5]:
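As a rough check of what the triplets mean, the sketch below builds the same weighted, undirected matrix directly with SciPy. It assumes that the third element of each triplet is stored as the entry of the adjacency matrix in both directions.

```python
import numpy as np
from scipy import sparse

edge_list = [(0, 1, 1), (1, 2, 0.5), (2, 3, 1), (3, 0, 0.5), (0, 2, 2)]
rows = np.array([source for source, _, _ in edge_list])
cols = np.array([target for _, target, _ in edge_list])
weights = np.array([weight for _, _, weight in edge_list], dtype=float)
n = int(max(rows.max(), cols.max())) + 1

# One weighted entry per triplet, then symmetrize for an undirected graph.
directed = sparse.csr_matrix((weights, (rows, cols)), shape=(n, n))
adjacency = directed + directed.T

print(adjacency.toarray())
```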
You can instantiate a bipartite graph as well.
[6]:
edge_list = [(0, 0), (1, 0), (1, 1), (2, 1)]
biadjacency = from_edge_list(edge_list, bipartite=True)
image = visualize_bigraph(biadjacency)
SVG(image)
[6]:
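In the bipartite case, rows and columns index two different sets of nodes, so the matrix is rectangular in general. A minimal SciPy sketch of the biadjacency matrix for the edge list above, given as an illustration rather than as the library's code:

```python
import numpy as np
from scipy import sparse

edge_list = [(0, 0), (1, 0), (1, 1), (2, 1)]
rows, cols = np.array(edge_list).T
ones = np.ones(len(edge_list), dtype=int)

# Rows and columns are indexed independently: 3 row nodes, 2 column nodes.
shape = (int(rows.max()) + 1, int(cols.max()) + 1)
biadjacency = sparse.csr_matrix((ones, (rows, cols)), shape=shape)

print(biadjacency.toarray())
```

No symmetrization is needed here: entry (i, j) links row node i to column node j.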
If nodes are not indexed (for instance, if they are given by name), you get an object of type Bunch with graph attributes (node names).
[7]:
edge_list = [("Alice", "Bob"), ("Bob", "Carey"), ("Alice", "David"), ("Carey", "David"), ("Bob", "David")]
graph = from_edge_list(edge_list)
[8]:
graph
[8]:
{'names': array(['Alice', 'Bob', 'Carey', 'David'], dtype='<U5'),
'adjacency': <4x4 sparse matrix of type '<class 'numpy.int64'>'
with 10 stored elements in Compressed Sparse Row format>}
[9]:
adjacency = graph.adjacency
names = graph.names
[10]:
image = visualize_graph(adjacency, names=names)
SVG(image)
[10]:
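The mapping from names to indices can be sketched with NumPy. The sorted order of names matches the output shown above, but the use of np.unique here is an assumption made for illustration, not a statement about how from_edge_list works internally.

```python
import numpy as np
from scipy import sparse

edge_list = [("Alice", "Bob"), ("Bob", "Carey"), ("Alice", "David"),
             ("Carey", "David"), ("Bob", "David")]

# Flatten the pairs, then map each distinct name to its rank in sorted order.
flat = np.array(edge_list).ravel()
names, inverse = np.unique(flat, return_inverse=True)
rows, cols = inverse.reshape(-1, 2).T

n = len(names)
ones = np.ones(len(edge_list), dtype=int)
directed = sparse.csr_matrix((ones, (rows, cols)), shape=(n, n))
adjacency = directed + directed.T  # undirected graph

print(names)
print(adjacency.nnz)
```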
By default, the weight of each edge is the number of occurrences of the corresponding link:
[11]:
edge_list_new = edge_list + [("Alice", "Bob"), ("Alice", "David"), ("Alice", "Bob")]
graph = from_edge_list(edge_list_new)
[12]:
adjacency = graph.adjacency
names = graph.names
[13]:
image = visualize_graph(adjacency, names=names)
SVG(image)
[13]:
You can make the graph unweighted.
[14]:
graph = from_edge_list(edge_list_new, weighted=False)
[15]:
adjacency = graph.adjacency
names = graph.names
[16]:
image = visualize_graph(adjacency, names=names)
SVG(image)
[16]:
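The two weighting conventions above can be reproduced in a few lines of SciPy. In this sketch the names are replaced by integer indices (Alice = 0, Bob = 1, Carey = 2, David = 3), which is an assumption made to keep the example short.

```python
import numpy as np
from scipy import sparse

# Integer-indexed edges with repeats: (0, 1) appears three times, (0, 3) twice.
edges = [(0, 1), (1, 2), (0, 3), (2, 3), (1, 3), (0, 1), (0, 3), (0, 1)]
rows, cols = np.array(edges).T
ones = np.ones(len(edges), dtype=int)

# Duplicate (row, col) entries are summed when the CSR matrix is built,
# so each weight is the number of occurrences of the link.
directed = sparse.csr_matrix((ones, (rows, cols)), shape=(4, 4))
counted = directed + directed.T

# Unweighted variant: keep the sparsity pattern, set every stored weight to 1.
unweighted = counted.copy()
unweighted.data = np.ones_like(unweighted.data)

print(counted.toarray())
print(unweighted.toarray())
```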
Again, you can make the graph directed:
[17]:
graph = from_edge_list(edge_list, directed=True)
[18]:
graph
[18]:
{'names': array(['Alice', 'Bob', 'Carey', 'David'], dtype='<U5'),
'adjacency': <4x4 sparse matrix of type '<class 'numpy.int64'>'
with 5 stored elements in Compressed Sparse Row format>}
[19]:
adjacency = graph.adjacency
names = graph.names
[20]:
image = visualize_graph(adjacency, names=names)
SVG(image)
[20]:
The graph can also have explicit weights:
[21]:
edge_list = [("Alice", "Bob", 3), ("Bob", "Carey", 2), ("Alice", "David", 1), ("Carey", "David", 2), ("Bob", "David", 3)]
graph = from_edge_list(edge_list)
[22]:
adjacency = graph.adjacency
names = graph.names
[23]:
image = visualize_graph(adjacency, names=names, display_edge_weight=True, display_node_weight=True)
SVG(image)
[23]:
For a bipartite graph:
[24]:
edge_list = [("Alice", "Football"), ("Bob", "Tennis"), ("David", "Football"), ("Carey", "Tennis"), ("Carey", "Football")]
graph = from_edge_list(edge_list, bipartite=True)
[25]:
biadjacency = graph.biadjacency
names = graph.names
names_col = graph.names_col
[26]:
image = visualize_bigraph(biadjacency, names_row=names, names_col=names_col)
SVG(image)
[26]:
From an adjacency list
You can also load a graph from an adjacency list, given as a list of lists or a dictionary of lists:
[27]:
adjacency_list = [[0, 1, 2], [2, 3]]
adjacency = from_adjacency_list(adjacency_list, directed=True)
[28]:
image = visualize_graph(adjacency)
SVG(image)
[28]:
[29]:
adjacency_dict = {"Alice": ["Bob", "David"], "Bob": ["Carey", "David"]}
graph = from_adjacency_list(adjacency_dict, directed=True)
[30]:
adjacency = graph.adjacency
names = graph.names
[31]:
image = visualize_graph(adjacency, names=names)
SVG(image)
[31]:
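An adjacency list carries the same information as an edge list. The plain-Python sketch below flattens the dictionary above into (source, target) pairs; passing these pairs to from_edge_list with directed=True should presumably describe the same graph, although that equivalence is an assumption and is not checked here.

```python
adjacency_dict = {"Alice": ["Bob", "David"], "Bob": ["Carey", "David"]}

# One (source, target) pair per neighbor listed under each source node.
edge_list = [(source, target)
             for source, targets in adjacency_dict.items()
             for target in targets]

print(edge_list)
```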
From a dataframe
Your dataframe might consist of a list of edges.
[32]:
df = pd.read_csv('miserables.tsv', sep='\t', names=['character_1', 'character_2'])
[33]:
df.head()
[33]:
| | character_1 | character_2 |
---|---|---|
0 | Myriel | Napoleon |
1 | Myriel | Mlle Baptistine |
2 | Myriel | Mme Magloire |
3 | Myriel | Countess de Lo |
4 | Myriel | Geborand |
[34]:
edge_list = list(df.itertuples(index=False))
[35]:
graph = from_edge_list(edge_list)
[36]:
graph
[36]:
{'names': array(['Anzelma', 'Babet', 'Bahorel', 'Bamatabois', 'Baroness',
'Blacheville', 'Bossuet', 'Boulatruelle', 'Brevet', 'Brujon',
'Champmathieu', 'Champtercier', 'Chenildieu', 'Child1', 'Child2',
'Claquesous', 'Cochepaille', 'Combeferre', 'Cosette', 'Count',
'Countess de Lo', 'Courfeyrac', 'Cravatte', 'Dahlia', 'Enjolras',
'Eponine', 'Fameuil', 'Fantine', 'Fauchelevent', 'Favourite',
'Feuilly', 'Gavroche', 'Geborand', 'Gervais', 'Gillenormand',
'Grantaire', 'Gribier', 'Gueulemer', 'Isabeau', 'Javert', 'Joly',
'Jondrette', 'Judge', 'Labarre', 'Listolier', 'Lt Gillenormand',
'Mabeuf', 'Magnon', 'Marguerite', 'Marius', 'Mlle Baptistine',
'Mlle Gillenormand', 'Mlle Vaubois', 'Mme Burgon', 'Mme Der',
'Mme Hucheloup', 'Mme Magloire', 'Mme Pontmercy', 'Mme Thenardier',
'Montparnasse', 'MotherInnocent', 'MotherPlutarch', 'Myriel',
'Napoleon', 'Old man', 'Perpetue', 'Pontmercy', 'Prouvaire',
'Scaufflaire', 'Simplice', 'Thenardier', 'Tholomyes', 'Toussaint',
'Valjean', 'Woman1', 'Woman2', 'Zephine'], dtype='<U17'),
'adjacency': <77x77 sparse matrix of type '<class 'numpy.int64'>'
with 508 stored elements in Compressed Sparse Row format>}
[37]:
df = pd.read_csv('movie_actor.tsv', sep='\t', names=['movie', 'actor'])
[38]:
df.head()
[38]:
| | movie | actor |
---|---|---|
0 | Inception | Leonardo DiCaprio |
1 | Inception | Marion Cotillard |
2 | Inception | Joseph Gordon Lewitt |
3 | The Dark Knight Rises | Marion Cotillard |
4 | The Dark Knight Rises | Joseph Gordon Lewitt |
[39]:
edge_list = list(df.itertuples(index=False))
[40]:
graph = from_edge_list(edge_list, bipartite=True)
[41]:
graph
[41]:
{'names_row': array(['007 Spectre', 'Aviator', 'Crazy Stupid Love', 'Drive',
'Fantastic Beasts 2', 'Inception', 'Inglourious Basterds',
'La La Land', 'Midnight In Paris', 'Murder on the Orient Express',
'The Big Short', 'The Dark Knight Rises',
'The Grand Budapest Hotel', 'The Great Gatsby', 'Vice'],
dtype='<U28'),
'names': array(['007 Spectre', 'Aviator', 'Crazy Stupid Love', 'Drive',
'Fantastic Beasts 2', 'Inception', 'Inglourious Basterds',
'La La Land', 'Midnight In Paris', 'Murder on the Orient Express',
'The Big Short', 'The Dark Knight Rises',
'The Grand Budapest Hotel', 'The Great Gatsby', 'Vice'],
dtype='<U28'),
'names_col': array(['Brad Pitt', 'Carey Mulligan', 'Christian Bale',
'Christophe Waltz', 'Emma Stone', 'Johnny Depp',
'Joseph Gordon Lewitt', 'Jude Law', 'Lea Seydoux',
'Leonardo DiCaprio', 'Marion Cotillard', 'Owen Wilson',
'Ralph Fiennes', 'Ryan Gosling', 'Steve Carell', 'Willem Dafoe'],
dtype='<U28'),
'biadjacency': <15x16 sparse matrix of type '<class 'numpy.int64'>'
with 41 stored elements in Compressed Sparse Row format>}
For categorical data, you can use pandas to get a bipartite graph between samples and features. We show an example taken from the Adult Income dataset.
[42]:
df = pd.read_csv('adult-income.csv')
[43]:
df.head()
[43]:
| | age | workclass | occupation | relationship | gender | income |
---|---|---|---|---|---|---|
0 | 40-49 | State-gov | Adm-clerical | Not-in-family | Male | <=50K |
1 | 50-59 | Self-emp-not-inc | Exec-managerial | Husband | Male | <=50K |
2 | 40-49 | Private | Handlers-cleaners | Not-in-family | Male | <=50K |
3 | 50-59 | Private | Handlers-cleaners | Husband | Male | <=50K |
4 | 30-39 | Private | Prof-specialty | Wife | Female | <=50K |
[44]:
df_binary = pd.get_dummies(df, sparse=True)
[45]:
df_binary.head()
[45]:
| | age_20-29 | age_30-39 | age_40-49 | age_50-59 | age_60-69 | age_70-79 | age_80-89 | age_90-99 | workclass_ ? | workclass_ Federal-gov | ... | relationship_ Husband | relationship_ Not-in-family | relationship_ Other-relative | relationship_ Own-child | relationship_ Unmarried | relationship_ Wife | gender_ Female | gender_ Male | income_ <=50K | income_ >50K |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | True | False | False | False | False | False | False | False | ... | False | True | False | False | False | False | False | True | True | False |
1 | False | False | False | True | False | False | False | False | False | False | ... | True | False | False | False | False | False | False | True | True | False |
2 | False | False | True | False | False | False | False | False | False | False | ... | False | True | False | False | False | False | False | True | True | False |
3 | False | False | False | True | False | False | False | False | False | False | ... | True | False | False | False | False | False | False | True | True | False |
4 | False | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | True | True | False | True | False |
5 rows × 42 columns
[46]:
biadjacency = df_binary.sparse.to_coo()
[47]:
biadjacency = sparse.csr_matrix(biadjacency)
[48]:
# biadjacency matrix of the bipartite graph
biadjacency
[48]:
<32561x42 sparse matrix of type '<class 'numpy.bool'>'
with 195366 stored elements in Compressed Sparse Row format>
[49]:
# names of columns
names_col = list(df_binary)
[50]:
len(names_col)
[50]:
42
[51]:
names_col[:8]
[51]:
['age_20-29',
'age_30-39',
'age_40-49',
'age_50-59',
'age_60-69',
'age_70-79',
'age_80-89',
'age_90-99']
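The same steps can be tried without the CSV file. The sketch below applies them to a tiny, made-up dataframe (the values are illustrative assumptions) and checks a simple property: each sample is linked to exactly one feature per original column.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Tiny categorical table standing in for the real dataset (made-up values).
df = pd.DataFrame({
    "age": ["40-49", "50-59", "40-49"],
    "gender": ["Male", "Male", "Female"],
})

# One binary column per (column, category) pair, stored as sparse columns.
df_binary = pd.get_dummies(df, sparse=True)

# Samples are rows, features are columns of the biadjacency matrix.
biadjacency = sparse.csr_matrix(df_binary.sparse.to_coo()).astype(int)
names_col = list(df_binary)

# Each sample has one feature per original column, here 2.
degrees = np.asarray(biadjacency.sum(axis=1)).ravel()
print(names_col)
print(degrees)
```

The cast to int is optional; it only avoids working with a boolean matrix in later computations.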
From a CSV file
You can directly load a graph from a CSV or TSV file:
[52]:
graph = from_csv('miserables.tsv')
[53]:
graph
[53]:
{'names': array(['Anzelma', 'Babet', 'Bahorel', 'Bamatabois', 'Baroness',
'Blacheville', 'Bossuet', 'Boulatruelle', 'Brevet', 'Brujon',
'Champmathieu', 'Champtercier', 'Chenildieu', 'Child1', 'Child2',
'Claquesous', 'Cochepaille', 'Combeferre', 'Cosette', 'Count',
'Countess de Lo', 'Courfeyrac', 'Cravatte', 'Dahlia', 'Enjolras',
'Eponine', 'Fameuil', 'Fantine', 'Fauchelevent', 'Favourite',
'Feuilly', 'Gavroche', 'Geborand', 'Gervais', 'Gillenormand',
'Grantaire', 'Gribier', 'Gueulemer', 'Isabeau', 'Javert', 'Joly',
'Jondrette', 'Judge', 'Labarre', 'Listolier', 'Lt Gillenormand',
'Mabeuf', 'Magnon', 'Marguerite', 'Marius', 'Mlle Baptistine',
'Mlle Gillenormand', 'Mlle Vaubois', 'Mme Burgon', 'Mme Der',
'Mme Hucheloup', 'Mme Magloire', 'Mme Pontmercy', 'Mme Thenardier',
'Montparnasse', 'MotherInnocent', 'MotherPlutarch', 'Myriel',
'Napoleon', 'Old man', 'Perpetue', 'Pontmercy', 'Prouvaire',
'Scaufflaire', 'Simplice', 'Thenardier', 'Tholomyes', 'Toussaint',
'Valjean', 'Woman1', 'Woman2', 'Zephine'], dtype='<U17'),
'adjacency': <77x77 sparse matrix of type '<class 'numpy.int64'>'
with 508 stored elements in Compressed Sparse Row format>}
[54]:
graph = from_csv('movie_actor.tsv', bipartite=True)
[55]:
graph
[55]:
{'names_row': array(['007 Spectre', 'Aviator', 'Crazy Stupid Love', 'Drive',
'Fantastic Beasts 2', 'Inception', 'Inglourious Basterds',
'La La Land', 'Midnight In Paris', 'Murder on the Orient Express',
'The Big Short', 'The Dark Knight Rises',
'The Grand Budapest Hotel', 'The Great Gatsby', 'Vice'],
dtype='<U28'),
'names': array(['007 Spectre', 'Aviator', 'Crazy Stupid Love', 'Drive',
'Fantastic Beasts 2', 'Inception', 'Inglourious Basterds',
'La La Land', 'Midnight In Paris', 'Murder on the Orient Express',
'The Big Short', 'The Dark Knight Rises',
'The Grand Budapest Hotel', 'The Great Gatsby', 'Vice'],
dtype='<U28'),
'names_col': array(['Brad Pitt', 'Carey Mulligan', 'Christian Bale',
'Christophe Waltz', 'Emma Stone', 'Johnny Depp',
'Joseph Gordon Lewitt', 'Jude Law', 'Lea Seydoux',
'Leonardo DiCaprio', 'Marion Cotillard', 'Owen Wilson',
'Ralph Fiennes', 'Ryan Gosling', 'Steve Carell', 'Willem Dafoe'],
dtype='<U28'),
'biadjacency': <15x16 sparse matrix of type '<class 'numpy.int64'>'
with 41 stored elements in Compressed Sparse Row format>}
The graph can also be given in the form of adjacency lists (check the function from_csv).
From a GraphML file
You can also load a graph stored in the GraphML format.
[56]:
graph = from_graphml('miserables.graphml')
adjacency = graph.adjacency
names = graph.names
[57]:
# Directed graph
graph = from_graphml('painters.graphml')
adjacency = graph.adjacency
names = graph.names
From NetworkX
NetworkX provides functions to import graphs from, and export them to, SciPy sparse matrices in CSR format.
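A minimal round trip might look as follows. It assumes a recent NetworkX release (2.7 or later), where the conversion functions are named from_scipy_sparse_array and to_scipy_sparse_array; older releases expose differently named functions.

```python
import networkx as nx
import numpy as np
from scipy import sparse

# Same small undirected graph as in the NumPy example of this tutorial.
adjacency = sparse.csr_matrix(np.array([[0, 1, 1, 0],
                                        [1, 0, 1, 1],
                                        [1, 1, 0, 0],
                                        [0, 1, 0, 0]]))

# CSR matrix -> NetworkX graph (stored values become 'weight' attributes).
G = nx.from_scipy_sparse_array(adjacency)

# NetworkX graph -> CSR, with an explicit node order to keep indices aligned.
adjacency_back = nx.to_scipy_sparse_array(G, nodelist=list(range(4)), format="csr")

print(G.number_of_edges())
print(np.array_equal(adjacency_back.toarray(), adjacency.toarray()))
```

Passing nodelist explicitly is a precaution: without it, the row order follows the iteration order of the graph's nodes.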
Other options
You want to test our toy graphs
You want to generate a graph from a model
You want to load a graph from existing repositories (see NetSet and KONECT)
Take a look at the other tutorials of the data section!