Recommendation
This notebook shows how to apply scikit-network for content recommendation.
We consider the MovieLens dataset of the NetSet collection, corresponding to ratings of 9066 movies by 671 users.
[1]:
from IPython.display import SVG
[2]:
import numpy as np
from scipy.cluster.hierarchy import linkage
[3]:
from sknetwork.data import load_netset
from sknetwork.ranking import PageRank, top_k
from sknetwork.embedding import Spectral
from sknetwork.utils import get_neighbors
from sknetwork.visualization import visualize_dendrogram
Data
[4]:
dataset = load_netset('movielens')
Downloading movielens from NetSet...
Unpacking archive...
Parsing files...
Done.
[5]:
biadjacency = dataset.biadjacency
names = dataset.names
labels = dataset.labels
names_labels = dataset.names_labels
[6]:
biadjacency
[6]:
<Compressed Sparse Row sparse matrix of dtype 'float64'
with 100004 stored elements and shape (9066, 671)>
[7]:
n_movies, n_users = biadjacency.shape
[8]:
# ratings
np.unique(biadjacency.data, return_counts=True)
[8]:
(array([0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ]),
array([ 1101, 3326, 1687, 7271, 4449, 20064, 10538, 28750, 7723,
15095]))
[9]:
# positive ratings
positive = biadjacency >= 3
[10]:
positive
[10]:
<Compressed Sparse Row sparse matrix of dtype 'bool'
with 82170 stored elements and shape (9066, 671)>
[11]:
names_labels
[11]:
array(['Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime',
'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'IMAX',
'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War',
'Western'], dtype='<U11')
[12]:
labels.shape
[12]:
(9066, 19)
PageRank
We first use PageRank to get the most popular movies overall, then personalized PageRank to get the most popular movies of each genre.
[13]:
pagerank = PageRank()
[14]:
# top-10 movies
scores = pagerank.fit_predict(positive)
names[top_k(scores, 10)]
[14]:
array(['Forrest Gump (1994)', 'Pulp Fiction (1994)',
'Shawshank Redemption, The (1994)',
'Silence of the Lambs, The (1991)',
'Star Wars: Episode IV - A New Hope (1977)', 'Matrix, The (1999)',
'Jurassic Park (1993)', "Schindler's List (1993)",
'Back to the Future (1985)',
'Star Wars: Episode V - The Empire Strikes Back (1980)'],
dtype=object)
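To make these scores more concrete, here is a minimal sketch of personalized PageRank computed by power iteration on a small dense biadjacency matrix. It is not scikit-network's implementation: the helper name `bipartite_pagerank`, the choice to restart only on rows (movies), the damping factor of 0.85 and the number of iterations are all assumptions made for illustration.

```python
import numpy as np

def bipartite_pagerank(biadjacency, weights=None, damping=0.85, n_iter=100):
    """Hypothetical helper: PageRank scores of the rows of a bipartite graph."""
    biadj = np.asarray(biadjacency, dtype=float)
    n_row, n_col = biadj.shape
    # Adjacency matrix of the bipartite graph: rows (movies) first, then columns (users).
    adjacency = np.block([
        [np.zeros((n_row, n_row)), biadj],
        [biadj.T, np.zeros((n_col, n_col))],
    ])
    degrees = adjacency.sum(axis=1)
    safe_degrees = np.where(degrees > 0, degrees, 1.0)
    transition = adjacency / safe_degrees[:, None]
    # Restart distribution: assumed to live on rows only (uniform if no weights are given).
    restart = np.zeros(n_row + n_col)
    restart[:n_row] = 1.0 if weights is None else np.asarray(weights, dtype=float)
    restart /= restart.sum()
    scores = restart.copy()
    for _ in range(n_iter):
        scores = damping * (transition.T @ scores) + (1 - damping) * restart
        scores /= scores.sum()
    # Keep the scores of rows only, normalized to sum to 1.
    row_scores = scores[:n_row]
    return row_scores / row_scores.sum()

# Toy data: movie 0 is liked by all three users, movies 1 and 2 by one user each.
toy = np.array([[1, 1, 1],
                [1, 0, 0],
                [0, 0, 1]])
global_scores = bipartite_pagerank(toy)
ppr_scores = bipartite_pagerank(toy, weights=[0, 1, 0])  # restart on movie 1 only
```

On this toy graph, the global ranking favors movie 0, while the difference `ppr_scores - global_scores` is largest for movie 1, the restart node.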
[15]:
# number of movies per genre
n_selection = 10
[16]:
# selection
selection = []
for label in np.arange(len(names_labels)):
    ppr = pagerank.fit_predict(positive, weights=labels[:, label])
    scores = ppr * labels[:, label]
    selection.append(top_k(scores, n_selection))
selection = np.array(selection)
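The selection step can also be written by restricting the ranking to the members of each genre, rather than multiplying the scores by the genre indicator. The sketch below assumes a dense label matrix and, for simplicity, a single score vector shared by all genres (the cell above recomputes personalized PageRank scores for each genre); `top_per_genre` is a hypothetical helper, not part of scikit-network.

```python
import numpy as np

def top_per_genre(scores, labels, k):
    """Hypothetical helper: indices of the k best-scored items within each genre."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    selection = []
    for genre in range(labels.shape[1]):
        # Only items that carry this genre label are candidates.
        members = np.flatnonzero(labels[:, genre])
        ranked = members[np.argsort(-scores[members], kind="stable")]
        selection.append(ranked[:k].tolist())
    return selection

# Toy data: 4 items, 2 genres (item 2 belongs to both).
toy_scores = [0.4, 0.3, 0.2, 0.1]
toy_labels = [[1, 0],
              [0, 1],
              [1, 1],
              [0, 1]]
toy_selection = top_per_genre(toy_scores, toy_labels, k=2)
```

Filtering by membership means a genre with fewer than `k` items simply yields a shorter list, instead of being padded with items from other genres.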
[17]:
# show selection (some movies may have several genres)
for label, name_label in enumerate(names_labels):
    print('---')
    print(label, name_label)
    print(names[selection[label, :5]])
---
0 Action
['Star Wars: Episode IV - A New Hope (1977)' 'Matrix, The (1999)'
'Jurassic Park (1993)'
'Star Wars: Episode V - The Empire Strikes Back (1980)'
'Terminator 2: Judgment Day (1991)']
---
1 Adventure
['Star Wars: Episode IV - A New Hope (1977)' 'Jurassic Park (1993)'
'Star Wars: Episode V - The Empire Strikes Back (1980)'
'Back to the Future (1985)' 'Toy Story (1995)']
---
2 Animation
['Rio 2 (2014)' 'Tangled Ever After (2012)' 'Planes: Fire & Rescue (2014)'
'Werner - Beinhart! (1990)' 'Planes (2013)']
---
3 Children
['Adventures of Mary-Kate and Ashley, The: The Case of the United States Navy Adventure (1997)'
'Baby Take a Bow (1934)' 'Snowman, The (1982)'
'Wild Thornberrys Movie, The (2002)'
'Pokemon 4 Ever (a.k.a. Pokémon 4: The Movie) (2002)']
---
4 Comedy
['Forrest Gump (1994)' 'Pulp Fiction (1994)' 'Back to the Future (1985)'
'Toy Story (1995)' 'Fargo (1996)']
---
5 Crime
['Pulp Fiction (1994)' 'Shawshank Redemption, The (1994)'
'Silence of the Lambs, The (1991)' 'Fargo (1996)' 'Godfather, The (1972)']
---
6 Documentary
['My Friend Rockefeller (2015)' 'Cruise, The (1998)'
'Other Shore, The (2013)' "Stephen Tobolowsky's Birthday Party (2005)"
'Unmade Beds (1997)']
---
7 Drama
['Pulp Fiction (1994)' 'Forrest Gump (1994)'
'Shawshank Redemption, The (1994)' "Schindler's List (1993)"
'American Beauty (1999)']
---
8 Fantasy
["The Huntsman Winter's War (2016)" "Winter's Tale (2014)" 'Solace (2015)'
'Bogus (1996)' 'Knights of Badassdom (2013)']
---
9 Film-Noir
['Bullet to the Head (2012)' 'No Way Out (1950)'
'Lady from Shanghai, The (1947)' 'Johnny Eager (1942)'
'Kiss Before Dying, A (1956)']
---
10 Horror
['Silence of the Lambs, The (1991)' 'Sharknado 4: The 4th Awakens (2016)'
'The Purge: Election Year (2016)' 'Body (2015)' 'Infini (2015)']
---
11 IMAX
['Jack Ryan: Shadow Recruit (2014)' 'I, Frankenstein (2014)'
'White House Down (2013)' 'Man of Tai Chi (2013)' 'After Earth (2013)']
---
12 Musical
['First Nudie Musical, The (1976)' 'Stand Up and Cheer! (1934)'
'Dance Flick (2009)' 'Stowaway (1936)' 'Gypsy (1993)']
---
13 Mystery
['Narcopolis (2014)'
'Adventures of Mary-Kate and Ashley, The: The Case of the United States Navy Adventure (1997)'
'Raven, The (2012)' 'Solace (2015)' 'Blackhat (2015)']
---
14 Romance
['Forrest Gump (1994)' 'American Beauty (1999)'
'Princess Bride, The (1987)' 'Beauty and the Beast (1991)'
'Good Will Hunting (1997)']
---
15 Sci-Fi
['Star Wars: Episode IV - A New Hope (1977)' 'Matrix, The (1999)'
'Star Wars: Episode V - The Empire Strikes Back (1980)'
'Jurassic Park (1993)' 'Back to the Future (1985)']
---
16 Thriller
['Pulp Fiction (1994)' 'Silence of the Lambs, The (1991)'
'Matrix, The (1999)' 'Jurassic Park (1993)' 'Fargo (1996)']
---
17 War
['Inescapable (2012)' 'Monuments Men, The (2014)'
'Starship Troopers 3: Marauder (2008)' '100 Rifles (1969)'
'Wind That Shakes the Barley, The (2006)']
---
18 Western
['The Ridiculous 6 (2015)' 'Hearts of the West (1975)' 'Stagecoach (1966)'
'Rainmaker, The (1956)' 'Bandidas (2006)']
We now apply PageRank to get the most relevant movies associated with a given movie.
[18]:
target = {i: name for i, name in enumerate(names) if 'Cherbourg' in name}
[19]:
target
[19]:
{175: 'Umbrellas of Cherbourg, The (Parapluies de Cherbourg, Les) (1964)'}
[20]:
scores_ppr = pagerank.fit_predict(positive, weights={175:1})
[21]:
names[top_k(scores_ppr - scores, 10)]
[21]:
array(['Umbrellas of Cherbourg, The (Parapluies de Cherbourg, Les) (1964)',
'Fargo (1996)', 'Pulp Fiction (1994)',
'Star Wars: Episode IV - A New Hope (1977)',
'L.A. Confidential (1997)', 'Matrix, The (1999)',
'Shawshank Redemption, The (1994)', 'American Beauty (1999)',
'Clockwork Orange, A (1971)', 'Jurassic Park (1993)'], dtype=object)
We can also apply PageRank to recommend movies to a user.
[22]:
user = 1
targets = get_neighbors(positive, user, transpose=True)
[23]:
# seen movies (sample)
names[targets][:10]
[23]:
array(['GoldenEye (1995)', 'Sense and Sensibility (1995)',
'Clueless (1995)', 'Seven (a.k.a. Se7en) (1995)',
'Usual Suspects, The (1995)', 'Mighty Aphrodite (1995)',
"Mr. Holland's Opus (1995)", 'Braveheart (1995)',
'Brothers McMullen, The (1995)', 'Apollo 13 (1995)'], dtype=object)
[24]:
mask = np.zeros(len(names), dtype=bool)
mask[targets] = 1
[25]:
scores_ppr = pagerank.fit_predict(positive, weights=mask)
[26]:
# top-10 recommendation
names[top_k((scores_ppr - scores) * (1 - mask), 10)]
[26]:
array(['Shawshank Redemption, The (1994)', 'True Lies (1994)',
'Star Wars: Episode IV - A New Hope (1977)',
'Beauty and the Beast (1991)', 'Toy Story (1995)',
'Twelve Monkeys (a.k.a. 12 Monkeys) (1995)', 'Fargo (1996)',
'Independence Day (a.k.a. ID4) (1996)', 'Matrix, The (1999)',
'Star Wars: Episode V - The Empire Strikes Back (1980)'],
dtype=object)
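With the multiplicative mask used above, seen movies end up with a score of exactly 0, so they can still outrank unseen movies whose score difference is negative. A possible alternative, sketched here with NumPy only, is to set the score of seen movies to minus infinity before ranking. The helper `recommend_unseen` and the toy values are hypothetical and only illustrate the masking step.

```python
import numpy as np

def recommend_unseen(scores_ppr, scores, seen, k):
    """Hypothetical helper: indices of the k unseen items with the largest score gain."""
    gain = np.asarray(scores_ppr, dtype=float) - np.asarray(scores, dtype=float)
    seen = np.asarray(seen, dtype=bool)
    # Seen items get -inf so that they can never outrank an unseen item.
    gain = np.where(seen, -np.inf, gain)
    order = np.argsort(-gain, kind="stable")
    # Never return more items than there are unseen ones.
    n_unseen = int((~seen).sum())
    return order[:min(k, n_unseen)]

# Toy data: item 0 has already been seen by the user.
toy_ppr = [0.40, 0.30, 0.10, 0.20]
toy_global = [0.25, 0.25, 0.25, 0.25]
toy_seen = [True, False, False, False]
recommended = recommend_unseen(toy_ppr, toy_global, toy_seen, k=3)
```

In this toy case, items 2 and 3 have negative gains; multiplying by `1 - mask` would rank the seen item 0 (gain set to 0) above both of them, whereas the `-inf` mask keeps it out of the list.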
Embedding
We now represent each movie by a low-dimensional vector and use hierarchical clustering to visualize the structure of this embedding for the top-100 movies.
[27]:
# embedding
spectral = Spectral(10)
embedding = spectral.fit_transform(positive)
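One common way to obtain such low-dimensional vectors is a truncated singular value decomposition of the degree-normalized biadjacency matrix. The sketch below illustrates that general idea with NumPy on a toy matrix; it is an assumption about the technique, not a description of what `Spectral` computes internally, and `svd_embedding` is a hypothetical helper.

```python
import numpy as np

def svd_embedding(biadjacency, n_components):
    """Hypothetical helper: embed rows with singular vectors of the normalized matrix."""
    biadj = np.asarray(biadjacency, dtype=float)
    row_deg = biadj.sum(axis=1)
    col_deg = biadj.sum(axis=0)
    # Inverse square roots of the degrees; zero-degree nodes keep a scale of 0.
    row_scale = np.zeros_like(row_deg)
    row_scale[row_deg > 0] = 1 / np.sqrt(row_deg[row_deg > 0])
    col_scale = np.zeros_like(col_deg)
    col_scale[col_deg > 0] = 1 / np.sqrt(col_deg[col_deg > 0])
    normalized = row_scale[:, None] * biadj * col_scale[None, :]
    u, s, vt = np.linalg.svd(normalized, full_matrices=False)
    # For a connected graph, the first singular pair is trivial (singular value 1),
    # so it is skipped and the next n_components left singular vectors are kept.
    return u[:, 1:n_components + 1]

# Toy data: 4 movies, 3 users; movies 0 and 1 have exactly the same audience.
toy = np.array([[1, 1, 0],
                [1, 1, 0],
                [0, 1, 1],
                [0, 0, 1]])
toy_embedding = svd_embedding(toy, n_components=2)
```

Movies 0 and 1, which are liked by the same users, receive the same vector, while movie 3 lands elsewhere; this is the kind of structure that the hierarchical clustering then groups together.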
[28]:
# top-100 movies
scores = pagerank.fit_predict(positive)
index = top_k(scores, 100)
dendrogram = linkage(embedding[index], method='ward')
[29]:
# visualization
image = visualize_dendrogram(dendrogram, names=names[index], rotate=True, width=200, height=1000, n_clusters=6)
SVG(image)
[29]:
[SVG output: dendrogram of the top-100 movies, with 6 clusters]