Recommendation

This notebook shows how to apply scikit-network for content recommendation.

We use the MovieLens dataset from the NetSet collection, which contains ratings of 9066 movies by 671 users.

[1]:
from IPython.display import SVG
[2]:
import numpy as np
[3]:
from sknetwork.data import load_netset
from sknetwork.ranking import PageRank, top_k
from sknetwork.embedding import Spectral
from sknetwork.utils import WardDense, get_neighbors
from sknetwork.visualization import svg_dendrogram

Data

[4]:
dataset = load_netset('movielens')
Downloading movielens from NetSet...
Unpacking archive...
Parsing files...
Done.
[5]:
biadjacency = dataset.biadjacency
names = dataset.names
labels = dataset.labels
names_labels = dataset.names_labels
[6]:
biadjacency
[6]:
<9066x671 sparse matrix of type '<class 'numpy.float64'>'
        with 100004 stored elements in Compressed Sparse Row format>
[7]:
n_movies, n_users = biadjacency.shape
[8]:
# ratings
np.unique(biadjacency.data, return_counts=True)
[8]:
(array([0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ]),
 array([ 1101,  3326,  1687,  7271,  4449, 20064, 10538, 28750,  7723,
        15095]))
[9]:
# positive ratings (rating of 3 or more)
positive = biadjacency >= 3
[10]:
positive
[10]:
<9066x671 sparse matrix of type '<class 'numpy.bool_'>'
        with 82170 stored elements in Compressed Sparse Row format>
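
The thresholding step turns the weighted rating matrix into an unweighted graph of "liked" movies. As a minimal sketch, here is the same comparison on a small dense matrix; the values are made up for illustration, and the notebook itself works on a SciPy sparse matrix.

```python
import numpy as np

# Toy rating matrix (rows: movies, columns: users); 0 means "not rated".
# These values are illustrative and not taken from MovieLens.
ratings = np.array([[5.0, 0.0],
                    [2.5, 3.0],
                    [0.0, 4.5]])

# Keep an edge only when the rating is at least 3.
liked = ratings >= 3

n_rated = int((ratings > 0).sum())
n_liked = int(liked.sum())
print(liked.astype(int))
print(n_liked, 'positive ratings out of', n_rated)
```

In the notebook output above, the same filter keeps 82170 of the 100004 ratings, which is the sum of the counts for ratings 3 to 5.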
[11]:
names_labels
[11]:
array(['Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime',
       'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'IMAX',
       'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War',
       'Western'], dtype='<U11')
[12]:
labels.shape
[12]:
(9066, 19)

PageRank

We first use PageRank to get the most popular movies overall, then personalized PageRank, seeded with the movies of each genre, to get the most popular movies of each category.

[13]:
pagerank = PageRank()
[14]:
# top-10 movies
scores = pagerank.fit_predict(positive)
names[top_k(scores, 10)]
[14]:
array(['Silence of the Lambs, The (1991)', 'Jurassic Park (1993)',
       'Star Wars: Episode IV - A New Hope (1977)', 'Forrest Gump (1994)',
       'Pulp Fiction (1994)', 'Matrix, The (1999)',
       'Shawshank Redemption, The (1994)', "Schindler's List (1993)",
       'Star Wars: Episode V - The Empire Strikes Back (1980)',
       'Back to the Future (1985)'], dtype=object)
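
To give an idea of what these scores measure, here is a rough sketch of PageRank by power iteration on a tiny bipartite graph. It is not the solver used by scikit-network: the function `bipartite_pagerank`, the dense matrices, the uniform default restart distribution and the toy data are all assumptions made for illustration, and the damping factor of 0.85 is simply a common choice.

```python
import numpy as np


def bipartite_pagerank(biadjacency, seeds=None, damping=0.85, n_iter=100):
    """Row scores of a (personalized) PageRank on a small dense bipartite graph."""
    b = np.asarray(biadjacency, dtype=float)
    n_row, n_col = b.shape
    n = n_row + n_col
    # Adjacency matrix of the bipartite graph: row nodes first, then column nodes.
    adjacency = np.block([[np.zeros((n_row, n_row)), b],
                          [b.T, np.zeros((n_col, n_col))]])
    # Restart distribution: uniform by default, otherwise concentrated on seed rows.
    restart = np.full(n, 1 / n)
    if seeds is not None:
        restart = np.zeros(n)
        restart[:n_row] = np.asarray(seeds, dtype=float)
        restart /= restart.sum()
    # Row-stochastic transition matrix (isolated nodes keep a zero row).
    degrees = adjacency.sum(axis=1)
    transition = adjacency / np.maximum(degrees, 1)[:, None]
    scores = restart.copy()
    for _ in range(n_iter):
        scores = damping * (transition.T @ scores) + (1 - damping) * restart
        scores /= scores.sum()  # compensate for mass lost at isolated nodes
    row_scores = scores[:n_row]
    return row_scores / row_scores.sum()


# Toy graph (made up): movie 0 is liked by all three users, the others by one user each.
liked = np.array([[1, 1, 1],
                  [1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1]])

global_scores = bipartite_pagerank(liked)
personalized = bipartite_pagerank(liked, seeds=[0, 1, 0, 0])
print(np.round(global_scores, 3))
print(np.round(personalized, 3))
```

Movie 0 gets the highest global score because every user likes it, while seeding the walk at movie 1 raises the score of movie 1. Dense matrices would be impractical on the full dataset, which is why the notebook relies on the sparse implementation of the library.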
[15]:
# number of movies per genre
n_selection = 10
[16]:
# selection
selection = []
for label in np.arange(len(names_labels)):
    ppr = pagerank.fit_predict(positive, seeds=labels[:, label])
    scores = ppr * labels[:, label]
    selection.append(top_k(scores, n_selection))
selection = np.array(selection)
[17]:
# show selection (some movies may have several genres)
for label, name_label in enumerate(names_labels):
    print('---')
    print(label, name_label)
    print(names[selection[label, :5]])
---
0 Action
['Star Wars: Episode V - The Empire Strikes Back (1980)'
 'Jurassic Park (1993)' 'Terminator 2: Judgment Day (1991)'
 'Star Wars: Episode VI - Return of the Jedi (1983)'
 'Star Wars: Episode IV - A New Hope (1977)']
---
1 Adventure
['Aladdin (1992)' 'Toy Story (1995)'
 'Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)'
 'Lord of the Rings: The Two Towers, The (2002)'
 'Star Wars: Episode IV - A New Hope (1977)']
---
2 Animation
['SpongeBob SquarePants Movie, The (2004)' 'Tangled Ever After (2012)'
 'Space Chimps (2008)' 'Pokémon 3: The Movie (2001)' 'Valiant (2005)']
---
3 Children
['Thomas and the Magic Railroad (2000)' 'Smurfs 2, The (2013)'
 'Like Mike (2002)' 'Hey Arnold! The Movie (2002)'
 'Race to Witch Mountain (2009)']
---
4 Comedy
['Back to the Future (1985)' 'Fargo (1996)' 'Forrest Gump (1994)'
 'Pulp Fiction (1994)' 'Toy Story (1995)']
---
5 Crime
['Pulp Fiction (1994)' 'Shawshank Redemption, The (1994)'
 'Silence of the Lambs, The (1991)' 'Fargo (1996)' 'Godfather, The (1972)']
---
6 Documentary
['SOMM: Into the Bottle (2016)' 'Cocaine Cowboys: Reloaded (2014)'
 "Cocaine Cowboys II: Hustlin' With the Godmother (2008)"
 'Agony and the Ecstasy of Phil Spector, The (2009)' 'Promises (2001)']
---
7 Drama
['American Beauty (1999)' 'Fight Club (1999)' 'Braveheart (1995)'
 'Fargo (1996)' "Schindler's List (1993)"]
---
8 Fantasy
['Twilight Saga: Eclipse, The (2010)' 'Fat Albert (2004)'
 'Nightbreed (1990)' 'Beastmaster 2: Through the Portal of Time (1991)'
 'Solace (2015)']
---
9 Film-Noir
['Kiss Before Dying, A (1956)' 'T-Men (1947)' 'No Way Out (1950)'
 'Force of Evil (1948)' 'Bullet to the Head (2012)']
---
10 Horror
['Rogue (2007)' 'Paranormal Activity: The Marked Ones (2014)'
 'Ring of Terror (1962)' 'Silence of the Lambs, The (1991)'
 'Carnosaur 3: Primal Species (1996)']
---
11 IMAX
['Jack the Giant Slayer (2013)' "Dr. Seuss' The Lorax (2012)"
 'After Earth (2013)' 'Resident Evil: Retribution (2012)'
 'Mars Needs Moms (2011)']
---
12 Musical
['First Nudie Musical, The (1976)' 'Zoot Suit (1981)' 'Yentl (1983)'
 "Dr. Seuss' The Lorax (2012)" 'Singing Detective, The (2003)']
---
13 Mystery
['Spirits of the Dead (1968)' 'Oscar (1991)' 'Solace (2015)'
 'Nomads (1986)'
 'Adventures of Mary-Kate and Ashley, The: The Case of the United States Navy Adventure (1997)']
---
14 Romance
['Forrest Gump (1994)' 'Beauty and the Beast (1991)'
 'Princess Bride, The (1987)' 'Good Will Hunting (1997)'
 'True Lies (1994)']
---
15 Sci-Fi
['Star Wars: Episode IV - A New Hope (1977)' 'Matrix, The (1999)'
 'Beginning of the End (1957)'
 'Star Wars: Episode VI - Return of the Jedi (1983)'
 'Jurassic Park (1993)']
---
16 Thriller
['Silence of the Lambs, The (1991)' 'Fargo (1996)' 'Fight Club (1999)'
 'Pulp Fiction (1994)' 'Matrix, The (1999)']
---
17 War
['Iron Eagle II (1988)' 'Dark Blue World (Tmavomodrý svet) (2001)'
 'Wind That Shakes the Barley, The (2006)' 'Pathfinder (2007)'
 'Night of the Generals, The (1967)']
---
18 Western
['The Ridiculous 6 (2015)' 'Shakiest Gun in the West, The (1968)'
 "'Neath the Arizona Skies (1934)" 'Stagecoach (1966)'
 'Missing, The (2003)']
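
The selection above multiplies the scores by the genre indicator before taking the top entries, so that only movies of the genre can be selected. A minimal sketch with made-up scores, using `np.argsort` as a stand-in for `top_k`:

```python
import numpy as np

# Made-up scores for five movies and a binary indicator for one genre.
scores = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
in_genre = np.array([0, 1, 0, 1, 1])

# Movies outside the genre get a score of 0 and drop to the bottom.
masked = scores * in_genre

# Indices of the k largest masked scores, in decreasing order.
k = 2
top = np.argsort(-masked)[:k]
print(top)  # → [1 3]
```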

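We now apply personalized PageRank to get the movies most relevant to a given movie. Subtracting the global scores from the personalized scores reduces the weight of movies that are popular with everyone.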

[18]:
target = {i: name for i, name in enumerate(names) if 'Cherbourg' in name}
[19]:
target
[19]:
{175: 'Umbrellas of Cherbourg, The (Parapluies de Cherbourg, Les) (1964)'}
[20]:
scores_ppr = pagerank.fit_predict(positive, seeds={175: 1})
[21]:
names[top_k(scores_ppr - scores, 10)]
[21]:
array(['Fargo (1996)', 'Pulp Fiction (1994)',
       'Umbrellas of Cherbourg, The (Parapluies de Cherbourg, Les) (1964)',
       'Star Wars: Episode IV - A New Hope (1977)',
       'American Beauty (1999)', 'Shawshank Redemption, The (1994)',
       'Matrix, The (1999)', 'L.A. Confidential (1997)',
       'Clockwork Orange, A (1971)', 'Jurassic Park (1993)'], dtype=object)
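
The effect of ranking by the difference `scores_ppr - scores` rather than by `scores_ppr` alone can be seen on a small example. The two score vectors below are made up for illustration; `np.argsort` stands in for `top_k`.

```python
import numpy as np

# Made-up scores for four movies (each vector sums to 1).
global_scores = np.array([0.50, 0.30, 0.15, 0.05])
personalized = np.array([0.40, 0.28, 0.12, 0.20])

# Ranking by personalized score alone still favors the globally popular movie 0.
ranking_raw = np.argsort(-personalized)

# Ranking by the gain over the global score favors movie 3,
# the only movie whose score increases with personalization.
gain = personalized - global_scores
ranking_gain = np.argsort(-gain)

print(ranking_raw)   # → [0 1 3 2]
print(ranking_gain)  # → [3 1 2 0]
```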

We can also apply PageRank to recommend movies to a user.

[22]:
user = 1
targets = get_neighbors(positive, user, transpose=True)
[23]:
# seen movies (sample)
names[targets][:10]
[23]:
array(['GoldenEye (1995)', 'Sense and Sensibility (1995)',
       'Clueless (1995)', 'Seven (a.k.a. Se7en) (1995)',
       'Usual Suspects, The (1995)', 'Mighty Aphrodite (1995)',
       "Mr. Holland's Opus (1995)", 'Braveheart (1995)',
       'Brothers McMullen, The (1995)', 'Apollo 13 (1995)'], dtype=object)
[24]:
mask = np.zeros(len(names), dtype=bool)
mask[targets] = 1
[25]:
scores_ppr = pagerank.fit_predict(positive, seeds=mask)
[26]:
# top-10 recommendation
names[top_k((scores_ppr - scores) * (1 - mask), 10)]
[26]:
array(['Matrix, The (1999)', 'Fargo (1996)',
       'Star Wars: Episode IV - A New Hope (1977)',
       'Beauty and the Beast (1991)',
       'Twelve Monkeys (a.k.a. 12 Monkeys) (1995)', 'True Lies (1994)',
       'Toy Story (1995)', 'Shawshank Redemption, The (1994)',
       'Independence Day (a.k.a. ID4) (1996)',
       'Star Wars: Episode V - The Empire Strikes Back (1980)'],
      dtype=object)
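
The masking used in the recommendation step can be sketched on a toy example. Everything below (the matrix, the gains, the variable names) is made up for illustration; `np.flatnonzero` on a column plays the role of `get_neighbors(..., transpose=True)`.

```python
import numpy as np

# Toy "liked" matrix (rows: movies, columns: users).
liked = np.array([[1, 0],
                  [1, 1],
                  [0, 1],
                  [0, 0]], dtype=bool)
user = 0

# Movies liked by the user: non-zero entries of the user's column.
seen = np.flatnonzero(liked[:, user])
mask = np.zeros(liked.shape[0], dtype=bool)
mask[seen] = True

# Made-up gains (personalized score minus global score).
gain = np.array([0.4, 0.3, 0.2, -0.1])

# Zeroing the seen movies, as in the cell above.
zeroed_top = np.argsort(-(gain * ~mask))[:2]
print(zeroed_top)  # movie 2, then a seen movie (a gain of 0 beats -0.1)

# Stricter variant: seen movies can never be selected.
strict_top = np.argsort(-np.where(mask, -np.inf, gain))[:2]
print(strict_top)  # → [2 3]
```

Zeroing the seen movies is enough as long as at least as many unseen movies have a positive gain as there are recommendations requested. Otherwise, setting the seen entries to `-np.inf` is a safer choice.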

Embedding

We now represent each movie by a low-dimensional vector and use hierarchical clustering to visualize the structure of this embedding for the top-100 movies.

[27]:
# embedding
spectral = Spectral(10)
embedding = spectral.fit_transform(positive)
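
To make the embedding step more concrete, here is a rough sketch that embeds the rows of a tiny biadjacency matrix with a truncated SVD of the degree-normalized matrix. This is a simplified stand-in and not necessarily the exact computation performed by `Spectral`; the toy matrix, the normalization and the dimension are assumptions.

```python
import numpy as np

# Toy biadjacency matrix (rows: movies, columns: users), made up for illustration.
# Movies 0 and 1 are liked by exactly the same users.
b = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]], dtype=float)

# Normalize each entry by the square roots of the row and column degrees.
d_row = b.sum(axis=1)
d_col = b.sum(axis=0)
normalized = b / np.sqrt(d_row)[:, None] / np.sqrt(d_col)[None, :]

# Singular value decomposition; singular values come in decreasing order.
u, s, vt = np.linalg.svd(normalized, full_matrices=False)

# The first singular vector only reflects the degrees (singular value 1),
# so we skip it and keep the next `dim` components.
dim = 2
embedding = u[:, 1:dim + 1] * s[1:dim + 1]
print(np.round(embedding, 3))
```

Movies 0 and 1 end up at the same point of the embedding space, while movie 2, liked by different users, lies further away. Hierarchical clustering then groups movies based on such distances.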
[28]:
ward = WardDense()
[29]:
# top-100 movies
scores = pagerank.fit_predict(positive)
index = top_k(scores, 100)
dendrogram = ward.fit_transform(embedding[index])
[30]:
# visualization
image = svg_dendrogram(dendrogram, names=names[index], rotate=True, width=200, height=1000, n_clusters=6)
SVG(image)
[30]:
[Figure: dendrogram of the top-100 movies, cut into 6 clusters]