Recommendation

This notebook shows how to apply scikit-network for content recommendation.

We consider the MovieLens dataset of the NetSet collection, corresponding to ratings of 9066 movies by 671 users.

[1]:

from IPython.display import SVG

[2]:

import numpy as np
from scipy.cluster.hierarchy import linkage

[3]:

from sknetwork.data import load_netset
from sknetwork.ranking import PageRank, top_k
from sknetwork.embedding import Spectral
from sknetwork.utils import get_neighbors
from sknetwork.visualization import visualize_dendrogram

Data

[4]:

dataset = load_netset('movielens')

Downloading movielens from NetSet...

Unpacking archive...
Parsing files...
Done.

[5]:

biadjacency = dataset.biadjacency
names = dataset.names
labels = dataset.labels
names_labels = dataset.names_labels

[6]:

biadjacency

[6]:

<9066x671 sparse matrix of type '<class 'numpy.float64'>'
        with 100004 stored elements in Compressed Sparse Row format>

[7]:

n_movies, n_users = biadjacency.shape

[8]:

# ratings
np.unique(biadjacency.data, return_counts=True)

[8]:

(array([0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ]),
 array([ 1101,  3326,  1687,  7271,  4449, 20064, 10538, 28750,  7723,
        15095]))

[9]:

# positive ratings
positive = biadjacency >= 3

[10]:

positive

[10]:

<9066x671 sparse matrix of type '<class 'numpy.bool_'>'
        with 82170 stored elements in Compressed Sparse Row format>
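The thresholding step can be checked on a toy matrix. The sketch below uses a small, made-up ratings matrix (not the MovieLens data) to show that comparing a SciPy sparse matrix with a scalar keeps only the entries that satisfy the condition, stored as booleans. In the notebook, the 82170 stored elements of positive match the number of ratings of 3 or more counted in cell [8] (20064 + 10538 + 28750 + 7723 + 15095).

```python
import numpy as np
from scipy import sparse

# toy ratings matrix (3 movies x 3 users); zeros mean "no rating"
ratings = sparse.csr_matrix(np.array([
    [5.0, 0.0, 2.5],
    [0.0, 3.0, 0.0],
    [1.0, 4.5, 0.0],
]))

# keep only ratings of 3 or more; the result is a boolean sparse matrix
positive_toy = ratings >= 3

# dense view, for inspection only (avoid this on large matrices)
dense_view = positive_toy.toarray()
n_positive = int(dense_view.sum())
```

Here n_positive counts the three ratings (5.0, 3.0 and 4.5) that pass the threshold.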

[11]:

names_labels

[11]:

array(['Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime',
       'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'IMAX',
       'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War',
       'Western'], dtype='<U11')

[12]:

labels.shape

[12]:

(9066, 19)

PageRank

We first use PageRank to get the most popular movies overall, then personalized PageRank to get the most popular movies of each genre.
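As a rough illustration of what personalization means, here is a minimal power-iteration sketch in NumPy. It is not scikit-network's implementation: the function name personalized_pagerank, the dense adjacency matrix, the damping factor of 0.85 and the fixed number of iterations are all assumptions made for this example. The weights argument plays the role of the restart distribution: the random walk returns to the weighted nodes instead of a uniformly chosen node.

```python
import numpy as np

def personalized_pagerank(adjacency, weights=None, damping=0.85, n_iter=100):
    """Illustrative power-iteration PageRank on a dense adjacency matrix."""
    adjacency = np.asarray(adjacency, dtype=float)
    n = adjacency.shape[0]
    # restart distribution: uniform by default, else proportional to weights
    restart = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    restart = restart / restart.sum()
    # row-normalize to get transition probabilities
    out_degrees = adjacency.sum(axis=1)
    dangling = out_degrees == 0
    transition = adjacency / np.where(dangling, 1.0, out_degrees)[:, None]
    scores = restart.copy()
    for _ in range(n_iter):
        # probability mass sitting on nodes without out-links restarts
        lost = scores[dangling].sum()
        walk = transition.T @ scores + lost * restart
        scores = damping * walk + (1 - damping) * restart
    return scores

# toy example: undirected path graph 0 - 1 - 2 - 3
path = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
uniform_scores = personalized_pagerank(path)
seeded_scores = personalized_pagerank(path, weights=[1, 0, 0, 0])
```

With the restart concentrated on node 0, the scores of nodes close to node 0 increase and those of distant nodes decrease, which is the effect exploited for the per-genre rankings.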

[13]:

pagerank = PageRank()

[14]:

# top-10 movies
scores = pagerank.fit_predict(positive)
names[top_k(scores, 10)]

[14]:

array(['Forrest Gump (1994)', 'Pulp Fiction (1994)',
       'Shawshank Redemption, The (1994)',
       'Silence of the Lambs, The (1991)',
       'Star Wars: Episode IV - A New Hope (1977)', 'Matrix, The (1999)',
       'Jurassic Park (1993)', "Schindler's List (1993)",
       'Back to the Future (1985)',
       'Star Wars: Episode V - The Empire Strikes Back (1980)'],
      dtype=object)
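The call top_k(scores, 10) returns the indices of the 10 highest scores, which are then used to index names. A plain NumPy equivalent might look like the sketch below; the helper name top_k_indices is hypothetical and this is not necessarily how scikit-network implements top_k.

```python
import numpy as np

def top_k_indices(scores, k):
    """Indices of the k largest scores, sorted by decreasing score (sketch)."""
    scores = np.asarray(scores)
    k = min(k, len(scores))
    if k == 0:
        return np.array([], dtype=int)
    # argpartition finds the k largest entries without sorting everything
    candidates = np.argpartition(-scores, k - 1)[:k]
    # sort only the k candidates by decreasing score
    return candidates[np.argsort(-scores[candidates])]

toy_scores = np.array([0.1, 0.7, 0.3, 0.9, 0.5])
best = top_k_indices(toy_scores, 3)
```

On this toy input the three best indices are 3, 1 and 4, in that order.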

[15]:

# number of movies per genre
n_selection = 10

[16]:

# selection
selection = []
for label in np.arange(len(names_labels)):
    ppr = pagerank.fit_predict(positive, weights=labels[:, label])
    scores = ppr * labels[:, label]
    selection.append(top_k(scores, n_selection))
selection = np.array(selection)

[17]:

# show selection (some movies may have several genres)
for label, name_label in enumerate(names_labels):
    print('---')
    print(label, name_label)
    print(names[selection[label, :5]])

---
0 Action
['Star Wars: Episode IV - A New Hope (1977)' 'Matrix, The (1999)'
 'Jurassic Park (1993)'
 'Star Wars: Episode V - The Empire Strikes Back (1980)'
 'Terminator 2: Judgment Day (1991)']
---
1 Adventure
['Star Wars: Episode IV - A New Hope (1977)' 'Jurassic Park (1993)'
 'Star Wars: Episode V - The Empire Strikes Back (1980)'
 'Back to the Future (1985)' 'Toy Story (1995)']
---
2 Animation
['SpongeBob SquarePants Movie, The (2004)' 'Tangled Ever After (2012)'
 'Space Chimps (2008)' 'Pokémon 3: The Movie (2001)' 'Valiant (2005)']
---
3 Children
['Thomas and the Magic Railroad (2000)' 'Smurfs 2, The (2013)'
 'Like Mike (2002)' 'Hey Arnold! The Movie (2002)'
 'Race to Witch Mountain (2009)']
---
4 Comedy
['Forrest Gump (1994)' 'Pulp Fiction (1994)' 'Back to the Future (1985)'
 'Toy Story (1995)' 'Fargo (1996)']
---
5 Crime
['Pulp Fiction (1994)' 'Shawshank Redemption, The (1994)'
 'Silence of the Lambs, The (1991)' 'Fargo (1996)' 'Godfather, The (1972)']
---
6 Documentary
['SOMM: Into the Bottle (2016)' 'Cocaine Cowboys: Reloaded (2014)'
 "Cocaine Cowboys II: Hustlin' With the Godmother (2008)"
 'Agony and the Ecstasy of Phil Spector, The (2009)' 'Promises (2001)']
---
7 Drama
['Pulp Fiction (1994)' 'Forrest Gump (1994)'
 'Shawshank Redemption, The (1994)' "Schindler's List (1993)"
 'American Beauty (1999)']
---
8 Fantasy
['Twilight Saga: Eclipse, The (2010)' 'Fat Albert (2004)'
 'Nightbreed (1990)' 'Beastmaster 2: Through the Portal of Time (1991)'
 'Solace (2015)']
---
9 Film-Noir
['Kiss Before Dying, A (1956)' 'T-Men (1947)' 'No Way Out (1950)'
 'Force of Evil (1948)' 'Bullet to the Head (2012)']
---
10 Horror
['Silence of the Lambs, The (1991)' 'Rogue (2007)'
 'Paranormal Activity: The Marked Ones (2014)' 'Ring of Terror (1962)'
 'Carnosaur 3: Primal Species (1996)']
---
11 IMAX
['Jack the Giant Slayer (2013)' "Dr. Seuss' The Lorax (2012)"
 'After Earth (2013)' 'Resident Evil: Retribution (2012)'
 'Mars Needs Moms (2011)']
---
12 Musical
['First Nudie Musical, The (1976)' 'Zoot Suit (1981)' 'Yentl (1983)'
 "Dr. Seuss' The Lorax (2012)" 'Singing Detective, The (2003)']
---
13 Mystery
['Spirits of the Dead (1968)' 'Oscar (1991)' 'Solace (2015)'
 'Nomads (1986)'
 'Adventures of Mary-Kate and Ashley, The: The Case of the United States Navy Adventure (1997)']
---
14 Romance
['Forrest Gump (1994)' 'American Beauty (1999)'
 'Princess Bride, The (1987)' 'Beauty and the Beast (1991)'
 'Good Will Hunting (1997)']
---
15 Sci-Fi
['Star Wars: Episode IV - A New Hope (1977)' 'Matrix, The (1999)'
 'Star Wars: Episode V - The Empire Strikes Back (1980)'
 'Jurassic Park (1993)' 'Back to the Future (1985)']
---
16 Thriller
['Pulp Fiction (1994)' 'Silence of the Lambs, The (1991)'
 'Matrix, The (1999)' 'Jurassic Park (1993)' 'Fargo (1996)']
---
17 War
['Iron Eagle II (1988)' 'Dark Blue World (Tmavomodrý svet) (2001)'
 'Wind That Shakes the Barley, The (2006)' 'Pathfinder (2007)'
 'Night of the Generals, The (1967)']
---
18 Western
['The Ridiculous 6 (2015)' 'Shakiest Gun in the West, The (1968)'
 "'Neath the Arizona Skies (1934)" 'Stagecoach (1966)'
 'Missing, The (2003)']

We now apply PageRank to get the most relevant movies associated with a given movie.

[18]:

target = {i: name for i, name in enumerate(names) if 'Cherbourg' in name}

[19]:

target

[19]:

{175: 'Umbrellas of Cherbourg, The (Parapluies de Cherbourg, Les) (1964)'}

[20]:

scores_ppr = pagerank.fit_predict(positive, weights={175:1})

[21]:

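# note: `scores` was reassigned inside the genre loop of cell [16], so this subtracts
# the scores of the last genre, not the global PageRank scores of cell [14]
names[top_k(scores_ppr - scores, 10)]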

[21]:

array(['Umbrellas of Cherbourg, The (Parapluies de Cherbourg, Les) (1964)',
       'Fargo (1996)', 'Pulp Fiction (1994)',
       'Star Wars: Episode IV - A New Hope (1977)',
       'L.A. Confidential (1997)', 'Matrix, The (1999)',
       'Shawshank Redemption, The (1994)', 'American Beauty (1999)',
       'Clockwork Orange, A (1971)', 'Jurassic Park (1993)'], dtype=object)

We can also apply PageRank to recommend movies to a user.

[22]:

user = 1
targets = get_neighbors(positive, user, transpose=True)

[23]:

# seen movies (sample)
names[targets][:10]

[23]:

array(['GoldenEye (1995)', 'Sense and Sensibility (1995)',
       'Clueless (1995)', 'Seven (a.k.a. Se7en) (1995)',
       'Usual Suspects, The (1995)', 'Mighty Aphrodite (1995)',
       "Mr. Holland's Opus (1995)", 'Braveheart (1995)',
       'Brothers McMullen, The (1995)', 'Apollo 13 (1995)'], dtype=object)
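In this notebook, get_neighbors with transpose=True is used to retrieve the movies linked to a given user, that is, the rows with a stored entry in that user's column. The sketch below shows one way to obtain the same kind of result directly with SciPy on a toy matrix; the helper name column_neighbors is hypothetical, and the sketch assumes the matrix has no explicitly stored zeros.

```python
import numpy as np
from scipy import sparse

def column_neighbors(biadjacency, column):
    """Row indices with a stored entry in the given column (sketch)."""
    # CSC format stores the row indices of each column contiguously
    csc = sparse.csc_matrix(biadjacency)
    start, end = csc.indptr[column], csc.indptr[column + 1]
    return np.sort(csc.indices[start:end])

# toy biadjacency: 3 movies (rows) x 2 users (columns)
toy = sparse.csr_matrix(np.array([
    [1, 0],
    [0, 1],
    [1, 1],
], dtype=bool))
movies_of_user_0 = column_neighbors(toy, 0)
movies_of_user_1 = column_neighbors(toy, 1)
```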

[24]:

mask = np.zeros(len(names), dtype=bool)
mask[targets] = 1

[25]:

scores_ppr = pagerank.fit_predict(positive, weights=mask)

[26]:

# top-10 recommendations; seen movies get a zero score
# (`scores` is the same array as in cell [21])
names[top_k((scores_ppr - scores) * (1 - mask), 10)]

[26]:

array(['Shawshank Redemption, The (1994)', 'True Lies (1994)',
       'Star Wars: Episode IV - A New Hope (1977)',
       'Beauty and the Beast (1991)', 'Toy Story (1995)',
       'Twelve Monkeys (a.k.a. 12 Monkeys) (1995)', 'Fargo (1996)',
       'Independence Day (a.k.a. ID4) (1996)', 'Matrix, The (1999)',
       'Star Wars: Episode V - The Empire Strikes Back (1980)'],
      dtype=object)
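Multiplying by (1 - mask) sets the score of seen movies to zero. If some unseen movies have a negative score difference, a seen movie could in principle still rank above them. A possible variant, sketched below with made-up scores, sets seen items to minus infinity so that they can never be selected. The function name recommend_unseen and the toy values are assumptions for illustration only.

```python
import numpy as np

def recommend_unseen(scores_personalized, scores_baseline, seen, k):
    """Top-k unseen items by gain over a baseline score (sketch)."""
    gain = np.asarray(scores_personalized) - np.asarray(scores_baseline)
    # seen items get -inf so that they always rank last
    gain = np.where(np.asarray(seen, dtype=bool), -np.inf, gain)
    order = np.argsort(-gain, kind='stable')
    return order[:k]

# toy example with 4 items, the first one already seen
toy_personalized = np.array([0.4, 0.3, 0.2, 0.1])
toy_baseline = np.array([0.25, 0.25, 0.25, 0.25])
toy_seen = np.array([True, False, False, False])
recommended = recommend_unseen(toy_personalized, toy_baseline, toy_seen, k=2)
```

Item 2 has a negative gain here; with the zeroing approach the seen item 0 would rank above it, whereas this variant returns items 1 and 2.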

Embedding

We now represent each movie by a low-dimensional vector, and use hierarchical clustering to visualize the structure of this embedding for the top-100 movies.
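To make the clustering step concrete, the sketch below applies Ward linkage to a small synthetic embedding (two well-separated groups of random points, not the movie embedding). For n points, scipy's linkage returns an array with n - 1 rows and 4 columns, where each row records one merge: the two merged clusters, their distance and the size of the new cluster. The use of fcluster to cut the tree into flat clusters is an addition for this example and is not part of the notebook.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# synthetic 3-dimensional embedding: two tight groups far from each other
group_a = rng.normal(loc=0.0, scale=0.1, size=(5, 3))
group_b = rng.normal(loc=5.0, scale=0.1, size=(5, 3))
toy_embedding = np.vstack([group_a, group_b])

# Ward linkage merges, at each step, the pair of clusters whose merge
# increases the within-cluster variance the least
toy_dendrogram = linkage(toy_embedding, method='ward')

# cut the tree into at most 2 flat clusters
toy_clusters = fcluster(toy_dendrogram, t=2, criterion='maxclust')
```

The two groups end up in different flat clusters, which is the kind of structure the dendrogram of the top-100 movies is meant to reveal.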

[27]:

# embedding
spectral = Spectral(10)
embedding = spectral.fit_transform(positive)

[28]:

# top-100 movies
scores = pagerank.fit_predict(positive)
index = top_k(scores, 100)
dendrogram = linkage(embedding[index], method='ward')

[29]:

# visualization
image = visualize_dendrogram(dendrogram, names=names[index], rotate=True, width=200, height=1000, n_clusters=6)
SVG(image)

[29]: