Recommendation

This notebook shows how to apply scikit-network for content recommendation.

We consider the MovieLens dataset of the NetSet collection, which contains ratings of 9066 movies by 671 users.

[1]:

from IPython.display import SVG

[2]:

import numpy as np

[3]:

from sknetwork.data import load_netset
from sknetwork.ranking import PageRank, top_k
from sknetwork.embedding import Spectral
from sknetwork.utils import WardDense, get_neighbors
from sknetwork.visualization import svg_dendrogram

Data

[4]:

dataset = load_netset('movielens')

Downloading movielens from NetSet...
Unpacking archive...
Parsing files...
Done.

[5]:

biadjacency = dataset.biadjacency
names = dataset.names
labels = dataset.labels
names_labels = dataset.names_labels

[6]:

biadjacency

[6]:

<9066x671 sparse matrix of type '<class 'numpy.float64'>'
        with 100004 stored elements in Compressed Sparse Row format>

[7]:

n_movies, n_users = biadjacency.shape

[8]:

# ratings
np.unique(biadjacency.data, return_counts=True)

[8]:

(array([0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ]),
 array([ 1101,  3326,  1687,  7271,  4449, 20064, 10538, 28750,  7723,
        15095]))

[9]:

# positive ratings
positive = biadjacency >= 3

[10]:

positive

[10]:

<9066x671 sparse matrix of type '<class 'numpy.bool_'>'
        with 82170 stored elements in Compressed Sparse Row format>
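
In the notebook, the threshold keeps 82170 of the 100004 ratings. The same binarization can be sketched on a small dense array; the toy matrix below is made up for illustration and is not part of the dataset.

```python
import numpy as np

# Toy rating matrix (rows: movies, columns: users); 0 means "no rating".
# These values are invented for illustration.
ratings = np.array([
    [5.0, 0.0, 2.5],
    [3.0, 4.5, 0.0],
    [0.0, 1.0, 3.5],
])

threshold = 3  # same cut-off as in the notebook
positive_toy = ratings >= threshold  # boolean matrix of positive ratings

n_ratings = int((ratings > 0).sum())
n_positive = int(positive_toy.sum())
print(n_positive, "of", n_ratings, "ratings are positive")
```

Missing ratings (zeros) never pass a positive threshold, so only the stored ratings can be counted as positive.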

[11]:

names_labels

[11]:

array(['Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime',
       'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'IMAX',
       'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War',
       'Western'], dtype='<U11')

[12]:

labels.shape

[12]:

(9066, 19)

PageRank

We first use PageRank to get the most popular movies overall, then personalized PageRank to get the most popular movies of each genre.

[13]:

pagerank = PageRank()

[14]:

# top-10 movies
scores = pagerank.fit_transform(positive)
names[top_k(scores, 10)]

[14]:

array(['Forrest Gump (1994)', 'Pulp Fiction (1994)',
       'Shawshank Redemption, The (1994)',
       'Silence of the Lambs, The (1991)',
       'Star Wars: Episode IV - A New Hope (1977)', 'Matrix, The (1999)',
       'Jurassic Park (1993)', "Schindler's List (1993)",
       'Back to the Future (1985)',
       'Star Wars: Episode V - The Empire Strikes Back (1980)'],
      dtype=object)
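
For intuition, PageRank scores can be obtained by power iteration on a random walk that follows an edge with some probability and otherwise restarts from a set of seed nodes. The following is a minimal NumPy sketch on a toy bipartite graph, not scikit-network's implementation: the helper `pagerank_power`, the damping value 0.85, the number of iterations and the toy graph are all assumptions made for illustration.

```python
import numpy as np

def pagerank_power(adjacency, restart=None, damping=0.85, n_iter=100):
    """Personalized PageRank by power iteration (hypothetical helper, for illustration).

    Assumes every node has at least one neighbor.
    """
    n = adjacency.shape[0]
    # Restart distribution: uniform by default, otherwise the normalized seed weights.
    restart = np.ones(n) if restart is None else np.asarray(restart, dtype=float)
    restart = restart / restart.sum()
    # Row-stochastic transition matrix of the random walk.
    transition = adjacency / adjacency.sum(axis=1, keepdims=True)
    scores = restart.copy()
    for _ in range(n_iter):
        # Follow an edge with probability `damping`, otherwise jump back to the seeds.
        scores = damping * (transition.T @ scores) + (1 - damping) * restart
    return scores

# Toy bipartite graph: 3 movies (rows) and 2 users (columns).
toy_biadjacency = np.array([[1, 1],
                            [1, 0],
                            [0, 1]], dtype=float)
n_row, n_col = toy_biadjacency.shape
# Adjacency of the full graph, with movies first and users last.
toy_adjacency = np.block([
    [np.zeros((n_row, n_row)), toy_biadjacency],
    [toy_biadjacency.T, np.zeros((n_col, n_col))],
])

global_scores = pagerank_power(toy_adjacency)[:n_row]
seeds = np.array([0, 1, 0, 0, 0])  # restart on movie 1 only
personalized_scores = pagerank_power(toy_adjacency, restart=seeds)[:n_row]

print(global_scores)        # movie 0, rated by both users, scores highest
print(personalized_scores)  # movie 1 now scores higher than movie 2
```

With a uniform restart, movies 1 and 2 are symmetric and get the same score; seeding the walk on movie 1 breaks this symmetry in its favor.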

[15]:

# number of movies per genre
n_selection = 10

[16]:

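# top movies of each genre: personalized PageRank seeded on the genre,
# then restricted to the movies of that genre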
selection = []
for label in np.arange(len(names_labels)):
    ppr = pagerank.fit_transform(positive, seeds=labels[:, label])
    scores = ppr * labels[:, label]
    selection.append(top_k(scores, n_selection))
selection = np.array(selection)

[17]:

# show the top 5 of each selection (a movie may have several genres)
for label, name_label in enumerate(names_labels):
    print('---')
    print(label, name_label)
    print(names[selection[label, :5]])

---
0 Action
['Star Wars: Episode IV - A New Hope (1977)' 'Matrix, The (1999)'
 'Jurassic Park (1993)'
 'Star Wars: Episode V - The Empire Strikes Back (1980)'
 'Terminator 2: Judgment Day (1991)']
---
1 Adventure
['Star Wars: Episode IV - A New Hope (1977)' 'Jurassic Park (1993)'
 'Star Wars: Episode V - The Empire Strikes Back (1980)'
 'Back to the Future (1985)' 'Toy Story (1995)']
---
2 Animation
['Gnomeo & Juliet (2011)' 'Hop (2011)'
 "Lion King II: Simba's Pride, The (1998)" 'Mars Needs Moms (2011)'
 'Once Upon a Forest (1993)']
---
3 Children
['Spy Kids 3-D: Game Over (2003)' 'Race to Witch Mountain (2009)'
 'G-Force (2009)' 'Prancer (1989)' 'Diary of a Wimpy Kid (2010)']
---
4 Comedy
['Forrest Gump (1994)' 'Pulp Fiction (1994)' 'Back to the Future (1985)'
 'Toy Story (1995)' 'Fargo (1996)']
---
5 Crime
['Pulp Fiction (1994)' 'Shawshank Redemption, The (1994)'
 'Silence of the Lambs, The (1991)' 'Fargo (1996)' 'Godfather, The (1972)']
---
6 Documentary
['Queen of Versailles, The (2012)' 'Powaqqatsi (1988)'
 'Fragile Trust: Plagiarism, Power, and Jayson Blair at the New York Times, A (2013)'
 'African Cats (2011)' 'Eddie Murphy Delirious (1983)']
---
7 Drama
['Pulp Fiction (1994)' 'Forrest Gump (1994)'
 'Shawshank Redemption, The (1994)' "Schindler's List (1993)"
 'American Beauty (1999)']
---
8 Fantasy
['The Pumaman (1980)'
 'Golem, The (Golem, wie er in die Welt kam, Der) (1920)'
 'Twilight Saga: New Moon, The (2009)'
 'Highlander: Endgame (Highlander IV) (2000)'
 'Ghost Rider: Spirit of Vengeance (2012)']
---
9 Film-Noir
['No Way Out (1950)' 'Johnny Eager (1942)'
 'Lady from Shanghai, The (1947)' 'This World, Then the Fireworks (1997)'
 'T-Men (1947)']
---
10 Horror
['Silence of the Lambs, The (1991)' 'Carnosaur 3: Primal Species (1996)'
 "Devil's Chair, The (2006)" 'AVPR: Aliens vs. Predator - Requiem (2007)'
 'Jason Goes to Hell: The Final Friday (1993)']
---
11 IMAX
["Madagascar 3: Europe's Most Wanted (2012)" 'Final Destination 5 (2011)'
 'Jack the Giant Slayer (2013)' "Dr. Seuss' The Lorax (2012)"
 'After Earth (2013)']
---
12 Musical
['Gypsy (1993)' "Breakin' (1984)" "Breakin' 2: Electric Boogaloo (1984)"
 'True Stories (1986)' 'Camp Rock (2008)']
---
13 Mystery
['Double, The (2011)' 'In the Electric Mist (2009)'
 'Spirits of the Dead (1968)' 'Nomads (1986)'
 'Fast and the Furious, The (1955)']
---
14 Romance
['Forrest Gump (1994)' 'American Beauty (1999)'
 'Princess Bride, The (1987)' 'Beauty and the Beast (1991)'
 'Good Will Hunting (1997)']
---
15 Sci-Fi
['Star Wars: Episode IV - A New Hope (1977)' 'Matrix, The (1999)'
 'Star Wars: Episode V - The Empire Strikes Back (1980)'
 'Jurassic Park (1993)' 'Back to the Future (1985)']
---
16 Thriller
['Pulp Fiction (1994)' 'Silence of the Lambs, The (1991)'
 'Matrix, The (1999)' 'Jurassic Park (1993)' 'Fargo (1996)']
---
17 War
['Pathfinder (2007)' 'Green Berets, The (1968)'
 'They Were Expendable (1945)' 'Legionnaire (1998)' 'Iron Eagle II (1988)']
---
18 Western
['Bandidas (2006)' "'Neath the Arizona Skies (1934)"
 'American Outlaws (2001)' 'The Missouri Breaks (1976)'
 'Stagecoach (1966)']

We now apply PageRank to get the most relevant movies associated with a given movie.

[18]:

target = {i: name for i, name in enumerate(names) if 'Cherbourg' in name}

[19]:

target

[19]:

{175: 'Umbrellas of Cherbourg, The (Parapluies de Cherbourg, Les) (1964)'}

[20]:

# personalized PageRank seeded on the target movie
scores_ppr = pagerank.fit_transform(positive, seeds={175: 1})

[21]:

names[top_k(scores_ppr - scores, 10)]

[21]:

array(['Umbrellas of Cherbourg, The (Parapluies de Cherbourg, Les) (1964)',
       'Fargo (1996)', 'Pulp Fiction (1994)',
       'Star Wars: Episode IV - A New Hope (1977)',
       'L.A. Confidential (1997)', 'Matrix, The (1999)',
       'Shawshank Redemption, The (1994)', 'American Beauty (1999)',
       'Clockwork Orange, A (1971)', 'Jurassic Park (1993)'], dtype=object)

We can also apply PageRank to recommend movies to a user.

[22]:

user = 1
targets = get_neighbors(positive, user, transpose=True)

[23]:

# seen movies (sample)
names[targets][:10]

[23]:

array(['GoldenEye (1995)', 'Sense and Sensibility (1995)',
       'Clueless (1995)', 'Seven (a.k.a. Se7en) (1995)',
       'Usual Suspects, The (1995)', 'Mighty Aphrodite (1995)',
       "Mr. Holland's Opus (1995)", 'Braveheart (1995)',
       'Brothers McMullen, The (1995)', 'Apollo 13 (1995)'], dtype=object)

[24]:

mask = np.zeros(len(names), dtype=bool)
mask[targets] = 1

[25]:

scores_ppr = pagerank.fit_transform(positive, seeds=mask)

[26]:

# top-10 recommendation
names[top_k((scores_ppr - scores) * (1 - mask), 10)]

[26]:

array(['Shawshank Redemption, The (1994)', 'True Lies (1994)',
       'Star Wars: Episode IV - A New Hope (1977)',
       'Beauty and the Beast (1991)', 'Toy Story (1995)',
       'Twelve Monkeys (a.k.a. 12 Monkeys) (1995)', 'Fargo (1996)',
       'Independence Day (a.k.a. ID4) (1996)', 'Matrix, The (1999)',
       'Star Wars: Episode V - The Empire Strikes Back (1980)'],
      dtype=object)
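
In cell [26], multiplying by `1 - mask` sets the score of every seen movie to zero. This keeps seen movies out of the top 10 as long as at least 10 unseen movies have a positive score. An unseen movie with a negative score would rank below a seen one. A stricter variant sets the score of seen items to minus infinity, as in the sketch below. The helper `top_k_unseen` and the toy scores are hypothetical, introduced only for illustration.

```python
import numpy as np

def top_k_unseen(scores, seen, k):
    """Indices of the k highest-scoring items not flagged in `seen` (hypothetical helper)."""
    masked = np.where(seen, -np.inf, scores)
    n_candidates = int((~seen).sum())
    order = np.argsort(-masked, kind="stable")
    return order[:min(k, n_candidates)]

# Invented score differences for 6 items, two of which were already seen.
diff = np.array([0.30, -0.05, 0.10, -0.20, 0.25, -0.01])
seen = np.array([True, False, False, False, True, False])

# Masking by multiplication, as in the notebook: seen items get score 0.
naive = diff * (1 - seen.astype(int))
naive_top = np.argsort(-naive, kind="stable")[:3]

# Masking with -inf: seen items can never be returned.
recommended = top_k_unseen(diff, seen, k=3)

print(naive_top)    # contains the seen items 0 and 4
print(recommended)  # contains unseen items only
```

On this toy input only one unseen item has a positive score, so the multiplicative mask lets the two seen items back into the top 3.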

Embedding

We now represent each movie by a low-dimensional vector (here of dimension 10) and use hierarchical clustering to visualize the structure of this embedding for the top-100 movies.

[27]:

# embedding
spectral = Spectral(10)
embedding = spectral.fit_transform(positive)
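
To illustrate how a spectral method can turn a biadjacency matrix into low-dimensional vectors, here is a minimal NumPy sketch based on the singular value decomposition of the degree-normalized matrix. This is a generic construction and an assumption on our part, not necessarily what `Spectral` computes internally; the toy matrix is invented for illustration.

```python
import numpy as np

# Toy positive-rating matrix (rows: movies, columns: users).
# Movies 0 and 1 share the same audience; movies 2 and 3 share another one.
B = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

# Normalize each entry by the square roots of the row and column degrees.
d_row = B.sum(axis=1)
d_col = B.sum(axis=0)
B_norm = B / np.sqrt(d_row)[:, None] / np.sqrt(d_col)[None, :]

# Singular value decomposition, with singular values in decreasing order.
U, s, Vt = np.linalg.svd(B_norm, full_matrices=False)

# Skip the first singular pair, which only reflects the degrees,
# and keep the next `dim` components as the embedding of the movies.
dim = 1
embedding_toy = U[:, 1:1 + dim] * s[1:1 + dim]
print(embedding_toy.ravel())
```

Movies 0 and 1, rated by the same users, get the same coordinate, with a sign opposite to that of movies 2 and 3. The overall sign of a singular vector is arbitrary, so only relative signs are meaningful.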

[28]:

ward = WardDense()

[29]:

# top-100 movies
scores = pagerank.fit_transform(positive)
index = top_k(scores, 100)
dendrogram = ward.fit_transform(embedding[index])
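
Ward's method builds the hierarchy bottom-up: at each step it merges the two clusters whose merge least increases the total within-cluster sum of squares. The sketch below is a naive NumPy version of this criterion on four toy points. It is not `WardDense` itself: the helper `ward_merges`, its output format and the toy points are assumptions made for illustration.

```python
import numpy as np

def ward_merges(points):
    """Naive Ward agglomeration (hypothetical helper, for illustration).

    Returns a list of merges (cluster a, cluster b, cost), where new clusters
    are numbered after the initial points.
    """
    # Each cluster is stored as (centroid, size).
    clusters = {i: (np.asarray(p, dtype=float), 1) for i, p in enumerate(points)}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                (ca, na), (cb, nb) = clusters[a], clusters[b]
                # Increase of the within-cluster sum of squares if a and b are merged.
                cost = na * nb / (na + nb) * np.sum((ca - cb) ** 2)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        (ca, na), (cb, nb) = clusters.pop(a), clusters.pop(b)
        # The merged cluster keeps the weighted centroid and the total size.
        clusters[next_id] = ((na * ca + nb * cb) / (na + nb), na + nb)
        merges.append((a, b, float(cost)))
        next_id += 1
    return merges

# Two well-separated pairs of points.
toy_points = [[0, 0], [0, 1], [5, 5], [5, 7]]
merges = ward_merges(toy_points)
for a, b, cost in merges:
    print(a, b, cost)
```

The two close pairs are merged first, and the two resulting clusters (numbered 4 and 5) are merged last at a much higher cost. This quadratic search over pairs is only practical for small inputs such as the 100 movies selected here.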

[30]:

# visualization
image = svg_dendrogram(dendrogram, names=names[index], rotate=True,
                       width=200, height=1000, n_clusters=6)
SVG(image)

[30]: