Recommending Movies at Scale (Python)

In this article, we will cover the following recipes:

  • Modeling preference expressions
  • Understanding the data
  • Ingesting the movie review data
  • Finding the highest-scoring movies
  • Improving the movie-rating system
  • Measuring the distance between users in the preference space
  • Computing the correlation between users
  • Finding the best critic for a user
  • Predicting movie ratings for users
  • Collaboratively filtering item by item
  • Building a nonnegative matrix factorization model
  • Loading the entire dataset into the memory
  • Dumping the SVD-based model to the disk
  • Training the SVD-based model
  • Testing the SVD-based model

(For more resources related to this topic, see here.)

Introduction

From books to movies to people to follow on Twitter, recommender systems carve the deluge of information on the Internet into a more personalized flow, thus improving the performance of e-commerce, web, and social applications. It is no great surprise, given the success of Amazon-monetizing recommendations and the Netflix Prize, that any discussion of personalization or data-theoretic prediction would involve a recommender. What is surprising is how simple recommenders are to implement yet how susceptible they are to vagaries of sparse data and overfitting.

Consider a non-algorithmic approach to eliciting recommendations; one of the easiest ways to garner a recommendation is to look at the preferences of someone we trust. We are implicitly comparing our preferences to theirs, and the more similarities you share, the more likely you are to discover novel, shared preferences. However, everyone is unique, and our preferences exist across a variety of categories and domains. What if you could leverage the preferences of a great number of people and not just those you trust? In the aggregate, you would be able to see patterns, not just of people like you, but also "anti-recommendations"— things to stay away from, cautioned by the people not like you. You would, hopefully, also see subtle delineations across the shared preference space of groups of people who share parts of your own unique experience.

It is this basic premise that a group of techniques called "collaborative filtering" use to make recommendations. Simply stated, this premise can be boiled down to the assumption that those who have similar past preferences will share the same preferences in the future. This is from a human perspective, of course, and a typical corollary to this assumption is from the perspective of the things being preferred—sets of items that are preferred by the same people will be more likely to preferred together in the future—and this is the basis for what is commonly described in the literature as user-centric collaborative filtering versus item-centric collaborative filtering.

The term collaborative filtering was coined by David Goldberg in a paper titled Using collaborative filtering to weave an information tapestry, ACM, where he proposed a system called Tapestry, which was designed at Xerox PARC in 1992, to annotate documents as interesting or uninteresting and to give document recommendations to people who are searching for good reads.

Collaborative filtering algorithms search large groupings of preference expressions to find similarities to some input preference or preferences. The output from these algorithms is a ranked list of suggestions that is a subset of all possible preferences, and hence, it's called "filtering". The "collaborative" comes from the use of many other peoples' preferences in order to find suggestions for themselves. This can be seen either as a search of the space of preferences (for brute-force techniques), a clustering problem (grouping similarly preferred items), or even some other predictive model. Many algorithmic attempts have been created in order to optimize or solve this problem across sparse or large datasets, and we will discuss a few of them in this article.

The goals of this article are:

  • Understanding how to model preferences from a variety of sources
  • Learning how to compute similarities using distance metrics
  • Modeling recommendations using matrix factorization for star ratings

These two different models will be implemented in Python using readily available datasets on the Web. To demonstrate the techniques in this article, we will use the oft-cited MovieLens database from the University of Minnesota that contains star ratings of moviegoers for their preferred movies.

Modeling preference expressions

We have already pointed out that companies such as Amazon track purchases and page views to make recommendations, Goodreads and Yelp use 5 star ratings and text reviews, and sites such as Reddit or Stack Overflow use simple up/down voting. You can see that preference can be expressed in the data in different ways, from Boolean flags to voting to ratings. However, these preferences are expressed by attempting to find groups of similarities in preference expressions in which you are leveraging the core assumption of collaborative filtering.

More formally, we understand that two people, Bob and Alice, share a preference for a specific item or widget. If Alice too has a preference for a different item, say, sprocket, then Bob has a better than random chance of also sharing a preference for a sprocket. We believe that Bob and Alice's taste similarities can be expressed in an aggregate via a large number of preferences, and by leveraging the collaborative nature of groups, we can filter the world of products.

How to do it…

We will model preference expressions over the next few recipes, including:

  • Understanding the data
  • Ingesting the movie review data
  • Finding the highest rated movies
  • Improving the movie-rating system

How it works…

A preference expression is an instance of a model of demonstrable relative selection. That is to say, preference expressions are data points that are used to show subjective ranking between a group of items for a person. Even more formally, we should say that preference expressions are not simply relative, but also temporal—for example, the statement of preference also has a fixed time relativity as well as item relativity.

Preference expression is an instance of a model of demonstrable relative selection.

While it would be nice to think that we can subjectively and accurately express our preferences in a global context (for example, rate a movie as compared to all other movies), our tastes, in fact, change over time, and we can really only consider how we rank items relative to each other. Models of preference must take this into account and attempt to alleviate biases that are caused by it. The most common types of preference expression models simplify the problem of ranking by causing the expression to be numerically fuzzy, for example:

  • Boolean expressions (yes or no)
  • Up and down voting (such as abstain, dislike)
  • Weighted signaling (the number of clicks or actions)
  • Broad ranked classification (stars, hated or loved)

The idea is to create a preference model for an individual user—a numerical model of the set of preference expressions for a particular individual. Models build the individual preference expressions into a useful user-specific context that can be computed against. Further reasoning can be performed on the models in order to alleviate time-based biases or to perform ontological reasoning or other categorizations.

As the relationships between entities get more complex, you can express their relative preferences by assigning behavioral weights to each type of semantic connection. However, choosing the weight is difficult and requires research to decide relative weights, which is
why fuzzy generalizations are preferred. As an example, the following table shows you some well-known ranking preference systems:

Reddit Voting

 

Online Shopping

 

Star Reviews

 

Up Vote

1

Bought

2

Love

5

No Vote

0

Viewed

1

Liked

4

Down Vote

-1

No purchase

0

Neutral

3

 

 

 

 

Dislike

2

 

 

 

 

Hate

1

For the rest of this article, we will only consider a single, very common preference expression: star ratings on a scale of 1 to 5.

Understanding the data

Understanding your data is critical to all data-related work. In this recipe, we acquire and take a first look at the data that we will be using to build our recommendation engine.

Getting ready

To prepare for this recipe, and the rest of the article, download the MovieLens data from the GroupLens website of the University of Minnesota. You can find the data at http://grouplens.org/datasets/movielens/.

In this article, we will use the smaller MoveLens 100k dataset (4.7 MB in size) in order to load the entire model into the memory with ease.

How to do it…

Perform the following steps to better understand the data that we will be working with throughout this article:

  1. Download the data from http://grouplens.org/datasets/movielens/. The 100K dataset is the one that you want (ml-100k.zip).
  2. Unzip the downloaded data into the directory of your choice.
  3. The two files that we are mainly concerned with are u.data, which contains the user movie ratings, and u.item, which contains movie information and details. To get a sense of each file, use the head command at the command prompt for Mac and Linux or the more command for Windows:
    head -n 5 u.item

    Note that if you are working on a computer running the Microsoft Windows operating system and not using a virtual machine (not recommended), you do not have access to the head command; instead, use the following command:

    more u.item 2 n

    The preceding command gives you the following output:

    1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-
    exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0
    |0
    2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-
    exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
    3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-
    exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|
    0|0
    4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-
    exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|
    0|0
    5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-
    exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0

    The following command will produce the given output:

    head -n 5 u.data

    For Windows, you can use the following command:

    more u.item 2 n
    196 242 3 881250949
    186 302 3 891717742
    22 377 1 878887116
    244 51 2 880606923
    166 346 1 886397596

How it works…

The two main files that we will be using are as follows:

  • u.data: This contains the user moving ratings
  • u.item: This contains the movie information and other details

Both are character-delimited files; u.data, which is the main file, is tab delimited, and u.item is pipe delimited.

For u.data, the first column is the user ID, the second column is the movie ID, the third is the star rating, and the last is the timestamp. The u.item file contains much more information, including the ID, title, release date, and even a URL to IMDB. Interestingly, this file also has a Boolean array indicating the genre(s) of each movie, including (in order) action, adventure, animation, children, comedy, crime, documentary, drama, fantasy, film-noir, horror, musical, mystery, romance, sci-fi, thriller, war, and western.

There's more…

Free, web-scale datasets that are appropriate for building recommendation engines are few and far between. As a result, the movie lens dataset is a very popular choice for such a task but there are others as well. The well-known Netflix Prize dataset has been pulled down by Netflix. However, there is a dump of all user-contributed content from the Stack Exchange network (including Stack Overflow) available via the Internet Archive (https://archive.org/details/stackexchange). Additionally, there is a book-crossing dataset that contains over a million ratings of about a quarter million different books (http://www2.informatik.uni-freiburg.de/~cziegler/BX/).

Ingesting the movie review data

Recommendation engines require large amounts of training data in order to do a good job, which is why they're often relegated to big data projects. However, to build a recommendation engine, we must first get the required data into memory and, due to the size of the data, must do so in a memory-safe and efficient way. Luckily, Python has all of the tools to get the job done, and this recipe shows you how.

Getting ready

You will need to have the appropriate movie lens dataset downloaded, as specified in the preceding recipe. Ensure that you have NumPy correctly installed.

How to do it…

The following steps guide you through the creation of the functions that we will need in order to load the datasets into the memory:

  1. Open your favorite Python editor or IDE. There is a lot of code, so it should be far simpler to enter directly into a text file than Read Eval Print Loop (REPL).
  2. We create a function to import the movie reviews:
    import csv
    import datetime
    
    def load_reviews(path, **kwargs):
         """
         Loads MovieLens reviews
         """
         options = {
             'fieldnames': ('userid', 'movieid', 'rating',
    'timestamp'),
             'delimiter': '\t',
         }
         options.update(kwargs)
    
         parse_date = lambda r,k:
    datetime.fromtimestamp(float(r[k]))
         parse_int = lambda r,k: int(r[k])
    
         with open(path, 'rb') as reviews:
            reader = csv.DictReader(reviews, **options)
             for row in reader:
                 row['userid'] = parse_int(row, 'userid')
                 row['movieid'] = parse_int(row, 'movieid')
                 row['rating'] = parse_int(row, 'rating')
                 row['timestamp'] = parse_date(row, 'timestamp')
                 yield row

    We create a helper function to help import the data:

    import os
    
    def relative_path(path):
         """
         Returns a path relative from this code file
         """
         dirname = os.path.dirname(os.path.realpath('__file__'))
         path = os.path.join(dirname, path)
        return os.path.normpath(path)
  1.  We create another function to load the movie information:
    def load_movies(path, **kwargs):
         """
         Loads MovieLens movies
         """
    
         options = {
             'fieldnames': ('movieid', 'title', 'release',
    'video', 'url'),'delimiter': '|','restkey': 'genre',
         }
         options.update(kwargs)
    
         parse_int = lambda r,k: int(r[k])
         parse_date = lambda r,k: datetime.strptime(r[k], '%d-%b-
    %Y') if r[k] else None
    
         with open(path, 'rb') as movies:
             reader = csv.DictReader(movies, **options)
             for row in reader:
                 row['movieid'] = parse_int(row, 'movieid')
                 row['release'] = parse_date(row, 'release')
                row['video']   = parse_date(row, 'video')
                 yield row
  1. Finally, we start creating a MovieLens class that will be augmented in later recipes:
    from collections import defaultdict
    
    class MovieLens(object):
         """
         Data structure to build our recommender model on.
         """
    
         def __init__(self, udata, uitem):
             """
             Instantiate with a path to u.data and u.item
             """
             self.udata   = udata
             self.uitem   = uitem
             self.movies = {}
             self.reviews = defaultdict(dict)
             self.load_dataset()
    
         def load_dataset(self):
             """
             Loads the two datasets into memory, indexed on the ID.
             """
             for movie in load_movies(self.uitem):
                 self.movies[movie['movieid']] = movie
    
             for review in load_reviews(self.udata):
                self.reviews[review['userid']][review['movieid']]
    = review
  1.  Ensure that the functions have been imported into your REPL or the IPython workspace, and type the following, making sure that the path to the data files is appropriate for your system:
    data = relative_path('data/ml-100k/u.data')
    item = relative_path('data/ml-100k/u.item')
    model = MovieLens(data, item)

How it works…

The methodology that we use for the two data-loading functions (load_reviews and load_movies) is simple, but it takes care of the details of parsing the data from the disk. We created a function that takes a path to our dataset and then any optional keywords. We know that we have specific ways in which we need to interact with the csv module, so we create default options, passing in the field names of the rows along with the delimiter, which is \t. The options.update(kwargs) line means that we'll accept whatever users pass to this function.

We then created internal parsing functions using a lambda function in Python. These simple parsers take a row and a key as input and return the converted input. This is an example of using lambda as internal, reusable code blocks and is a common technique in Python. Finally, we open our file and create a csv.DictReader function with our options. Iterating through the rows in the reader, we parse the fields that we want to be int and datetime, respectively, and then yield the row.

Note that as we are unsure about the actual size of the input file, we are doing this in a memory-safe manner using Python generators. Using yield instead of return ensures that Python creates a generator under the hood and does not load the entire dataset into the memory.

We'll use each of these methodologies to load the datasets at various times through our computation that uses this dataset. We'll need to know where these files are at all times, which can be a pain, especially in larger code bases; in the There's more… section, we'll discuss a Python pro-tip to alleviate this concern.

Finally, we created a data structure, which is the MovieLens class, with which we can hold our reviews' data. This structure takes the udata and uitem paths, and then, it loads the movies and reviews into two Python dictionaries that are indexed by movieid and userid, respectively. To instantiate this object, you will execute something as follows:

data = relative_path('../data/ml-100k/u.data')
item = relative_path('../data/ml-100k/u.item')
model = MovieLens(data, item)

Note that the preceding commands assume that you have your data in a folder called data. We can now load the whole dataset into the memory, indexed on the various IDs specified in the dataset.

Did you notice the use of the relative_path function? When dealing with fixtures such as these to build models, the data is often included with the code. When you specify a path in Python, such as data/ml-100k/u.data, it looks it up relative to the current working directory where you ran the script. To help ease this trouble, you can specify the paths that are relative to the code itself:

 import os

def relative_path(path):
     """
     Returns a path relative from this code file
     """
     dirname = os.path.dirname(os.path.realpath('__file__'))
     path = os.path.join(dirname, path)
     return os.path.normpath(path)

Keep in mind that this holds the entire data structure in memory; in the case of the 100k dataset, this will require 54.1 MB, which isn't too bad for modern machines. However, we should also keep in mind that we'll generally build recommenders using far more than just 100,000 reviews. This is why we have configured the data structure the way we have—very similar to a database. To grow the system, you will replace the reviews and movies properties with database access functions or properties, which will yield data types expected by our methods.

Finding the highest-scoring movies

If you're looking for a good movie, you'll often want to see the most popular or best rated movies overall. Initially, we'll take a naïve approach to compute a movie's aggregate rating by averaging the user reviews for each movie. This technique will also demonstrate how to access the data in our MovieLens class.

Getting ready

These recipes are sequential in nature. Thus, you should have completed the previous recipes in the article before starting with this one.

How to do it…

Follow these steps to output numeric scores for all movies in the dataset and compute a top-10 list:

  1. Augment the MovieLens class with a new method to get all reviews for a particular movie:
    class MovieLens(object):
    
         ...
    
         def reviews_for_movie(self, movieid):
             """
             Yields the reviews for a given movie
             """
             for review in self.reviews.values():
                 if movieid in review:
                     yield review[movieid]
  2. Then, add an additional method to compute the top 10 movies reviewed by users:
    import heapq
    from operator import itemgetter
    
    class MovieLens(object):
    
         ...
    
         def average_reviews(self):
            """
             Averages the star rating for all movies.
       Yields a tuple of movieid,
             the average rating, and the number of reviews.
             """
             for movieid in self.movies:
                 reviews = list(r['rating'] for r in
    self.reviews_for_movie(movieid))
                 average = sum(reviews) / float(len(reviews))
                 yield (movieid, average, len(reviews))
    
         def top_rated(self, n=10):
             """
             Yields the n top rated movies
             """
             return heapq.nlargest(n, self.average_reviews(),
    key=itemgetter(1))

    Note that the notation just below class MovieLens(object): signifies that we will be appending the average_reviews method to the existing MovieLens class.

  3. Now, let's print the top-rated results:
    for mid, avg, num in model.top_rated(10):
         title = model.movies[mid]['title']
         print "[%0.3f average rating (%i reviews)] %s" % (avg,
    num,title)
  4. Executing the preceding commands in your REPL should produce the following output:
    [5.000 average rating (1 reviews)] Entertaining Angels: The
    Dorothy Day Story (1996)
    [5.000 average rating (2 reviews)] Santa with Muscles (1996)
    [5.000 average rating (1 reviews)] Great Day in Harlem, A (1994)
    [5.000 average rating (1 reviews)] They Made Me a Criminal (1939)
    [5.000 average rating (1 reviews)] Aiqing wansui (1994)
    [5.000 average rating (1 reviews)] Someone Else's America (1995)
    [5.000 average rating (2 reviews)] Saint of Fort Washington,
    The (1993)
    [5.000 average rating (3 reviews)] Prefontaine (1997)
    [5.000 average rating (3 reviews)] Star Kid (1997)
    [5.000 average rating (1 reviews)] Marlene Dietrich: Shadow
    and Light (1996)

How it works…

The new reviews_for_movie() method that is added to the MovieLens class iterates through our review dictionary values (which are indexed by the userid parameter), checks whether the movieid value has been reviewed by the user, and then presents that review dictionary. We will need such functionality for the next method.

With the average_review() method, we have created another generator function that goes through all of our movies and all of their reviews and presents the movie ID, the average rating, and the number of reviews. The top_rated function uses the heapq module to quickly sort the reviews based on the average.

The heapq data structure, also known as the priority queue algorithm, is the Python implementation of an abstract data structure with interesting and useful properties. Heaps are binary trees that are built so that every parent node has a value that is either less than or equal to any of its children nodes. Thus, the smallest element is the root of the tree, which can be accessed in constant time, which is a very desirable property. With heapq, Python developers have an efficient means to insert new values in an ordered data structure and also return sorted values.

There's more…

Here, we run into our first problem—some of the top-rated movies only have one review (and conversely, so do the worst-rated movies). How do you compare Casablanca, which has a 4.457 average rating (243 reviews), with Santa with Muscles, which has a 5.000 average rating (2 reviews)? We are sure that those two reviewers really liked Santa with Muscles, but the high rating for Casablanca is probably more meaningful because more people liked it. Most recommenders with star ratings will simply output the average rating along with the number of reviewers, allowing the user to determine their quality; however, as data scientists, we can do better in the next recipe.

See also

Improving the movie-rating system

We don't want to build a recommendation engine with a system that considers the likely straight-to-DVD Santa with Muscles as generally superior to Casablanca. Thus, the naïve scoring approach used previously must be improved upon and is the focus of this recipe.

Getting ready

Make sure that you have completed the previous recipes in this article first.

How to do it…

The following steps implement and test a new movie-scoring algorithm:

  1. Let's implement a new Bayesian movie-scoring algorithm as shown in the following function, adding it to the MovieLens class:
    def bayesian_average(self, c=59, m=3):
         """
         Reports the Bayesian average with parameters c and m.
         """
         for movieid in self.movies:
             reviews = list(r['rating'] for r in self.reviews_for_movie(movieid))
             average = ((c * m) + sum(reviews)) /
                   float(c + len(reviews))
           yield (movieid, average, len(reviews))
  2.   Next, we will replace the top_rated method in the MovieLens class with the version in the following commands that uses the new Bayesian_average method from the preceding step:
    def top_rated(self, n=10):
         """
         Yields the n top rated movies
         """
        return heapq.nlargest(n, self.bayesian_average(),
    key=itemgetter(1))
  3.  Printing our new top-10 list looks a bit more familiar to us and Casablanca is now happily rated number 4:
    [4.234 average rating (583 reviews)] Star Wars (1977)
    [4.224 average rating (298 reviews)] Schindler's List (1993)
    [4.196 average rating (283 reviews)] Shawshank Redemption,
    The (1994)
    [4.172 average rating (243 reviews)] Casablanca (1942)
    [4.135 average rating (267 reviews)] Usual Suspects, The (1995)
    [4.123 average rating (413 reviews)] Godfather, The (1972)
    [4.120 average rating (390 reviews)] Silence of the Lambs,
    The (1991)
    [4.098 average rating (420 reviews)] Raiders of the Lost Ark (1981)
    [4.082 average rating (209 reviews)] Rear Window (1954)
    [4.066 average rating (350 reviews)] Titanic (1997)

How it works…

Taking the average of movie reviews, as in shown the previous recipe, simply did not work because some movies did not have enough ratings to give a meaningful comparison to movies with more ratings. What we'd really like is to have every single movie critic rate every single movie. Given that this is impossible, we could derive an estimate for how the movie would be rated if an infinite number of people rated the movie; this is hard to infer from one data point, so we should say that we would like to estimate the movie rating if the same number of people gave it a rating on an average (for example, filtering our results based on the number of reviews).

This estimate can be computed with a Bayesian average, implemented in the bayesian_average() function, to infer these ratings based on the following equation:

 

Here, m is our prior for the average of stars, and C is a confidence parameter that is equivalent to the number of observations in our posterior.

Determining priors can be a complicated and magical art. Rather than taking the complex path of fitting a Dirichlet distribution to our data, we can simply choose an m prior of 3 with our 5-star rating system, which means that our prior assumes that star ratings tend to be reviewed around the median value. In choosing C, you are expressing how many reviews are needed to get away from the prior; we can compute this by looking at the average number of reviews per movie:

print float(sum(num for mid, avg, num in model.average_reviews())) /
len(model.movies)

This gives us an average number of 59.4, which we use as the default value in our
function definition.

There's more…

Play around with the C parameter. You should find that if you change the parameter so that C = 50, the top-10 list subtly shifts; in this case, Schindler's List and Star Wars are swapped in rankings, as are Raiders of the Lost Ark and Rear Window— note that both the swapped movies have far more reviews than the former, which means that the higher C parameter was balancing the fewer ratings of the other movie.

See also

Measuring the distance between users in the preference space

The two most recognizable types of collaborative filtering systems are user-based recommenders and item-based recommenders. If one were to imagine that the preference space is an N-dimensional feature space where either users or items are plotted, then we would say that similar users or items tend to cluster near each other in this preference space; hence, an alternative name for this type of collaborative filtering is nearest neighbor recommenders.

A crucial step in this process is to come up with a similarity or distance metric with which we can compare critics to each other or mutually preferred items. This metric is then used to perform pairwise comparisons of a particular user to all other users, or conversely, for an item to be compared to all other items. Normalized comparisons are then used to determine recommendations. Although the computational space can become exceedingly large, distance metrics themselves are not difficult to compute, and in this recipe, we will explore a few as well as implement our first recommender system.

In this recipe, we will measure the distance between users; in the recipe after this one, we will look at another similarity distance indicator.

Getting ready

We will continue to build on the MovieLens class from the section titled Modeling Preference. If you have not had the opportunity to review this section, please have the code for that class ready. Importantly, we will want to access the data structures, MovieLens.movies and MovieLens.reviews, that have been loaded from the CSV files on the disk.

How to do it…

The following set of steps provide instructions on how to compute the Euclidean distance between users:

  1. Augment the MovieLens class with a new method, shared_preferences, to pull out movies that have been rated by two critics, A and B:
    class MovieLens(objects):
    
         ...
    
         def shared_preferences(self, criticA, criticB):
             """
             Returns the intersection of ratings for two critics
             """
             if criticA not in self.reviews:
                raise KeyError("Couldn't find critic '%s' in
    data" % criticA)
             if criticB not in self.reviews:
                 raise KeyError("Couldn't find critic '%s' in
    data" % criticB)
    
             moviesA = set(self.reviews[criticA].keys())
             moviesB = set(self.reviews[criticB].keys())
             shared   = moviesA & moviesB # Intersection operator
    
             # Create a reviews dictionary to return
             reviews = {}
             for movieid in shared:
                 reviews[movieid] = (
                   self.reviews[criticA][movieid]['rating'],
                     self.reviews[criticB][movieid]['rating'],
                 )
             return reviews
  2. Then, implement a function that computes the Euclidean distance between two critics using their shared movie preferences as a vector for the computation. This method will also be part of the MovieLens class:
    from math import sqrt
    
    ...
    
         def euclidean_distance(self, criticA, criticB):
             """
             Reports the Euclidean distance of two critics, A&B by
            performing a J-dimensional Euclidean calculation of
       each of their preference vectors for the intersection
       of movies the critics have rated.
             """
             # Get the intersection of the rated titles in the
    data.
             preferences = self.shared_preferences(criticA,
    criticB)
    
             # If they have no rankings in common, return 0.
             if len(preferences) == 0: return 0
    
             # Sum the squares of the differences
             sum_of_squares = sum([pow(a-b, 2) for a, b in
    preferences.values()])
    
             # Return the inverse of the distance to give a higher
    score to
             # folks who are more similar (e.g. less distance) add
    1 to prevent
             # division by zero errors and normalize ranks in [0,
    1]
             return 1 / (1 + sqrt(sum_of_squares))
  3. With the preceding code implemented, test it in REPL:
    >>> data = relative_path('data/ml-100k/u.data')
    >>> item = relative_path('data/ml-100k/u.item')
    >>> model = MovieLens(data, item)
    >>> print model.euclidean_distance(232, 532)
    0.1023021629920016

How it works…

The new shared_preferences() method of the MovieLens class determines the shared preference space of two users. Critically, we can only compare users (the criticA and criticB input parameters) based on the things that they have both rated. This function uses Python sets to determine the list of movies that both A and B reviewed (the intersection of the movies A has rated and the movies B has rated). The function then iterates over this set, returning a dictionary whose keys are the movie IDs and the values are a tuple of ratings, for example, (ratingA, ratingB) for each movie that both users have rated. We can now use this dataset to compute similarity scores, which is done by the second function.

The euclidean_distance() function takes two critics as the input, A and B, and computes the distance between users in preference space. Here, we have chosen to implement the Euclidean distance metric (the two-dimensional variation is well known to those who remember the Pythagorean theorem), but we could have implemented other metrics as well. This function will return a real number from 0 to 1, where 0 is less similar (farther apart) critics and 1 is more similar (closer together) critics.

There's more…

The Manhattan distance is another very popular metric and a very simple one to understand. It can simply sum the absolute values of the pairwise differences between elements of each vector. Or, in code, it can be executed in this manner:

manhattan = sum([abs(a-b) for a, b in preferences.values()])

This metric is also called the city-block distance because, conceptually, it is as if you were counting the number of blocks north/south and east/west one would have to walk between two points in the city. Before implementing it for this recipe, you would also want to invert and normalize the value in some fashion to return a value in the [0, 1] range.

See also

Computing the correlation between users

In the previous recipe, we used one out of many possible distance measures to capture the distance between the movie reviews of users. This distance between two specific users is not changed even if there are five or five million other users.

In this recipe, we will compute the correlation between users in the preference space. Like distance metrics, there are many correlation metrics. The most popular of these are Pearson or Spearman correlations or Cosine distance. Unlike distance metrics, the correlation will change depending on the number of users and movies.

Getting ready

We will be continuing the efforts of the previous recipes again, so make sure you understand each one.

How to do it…

The following function implements the computation of the pearson_correlation function for two critics, which are criticA and criticB, and is added to the MovieLens class:

     def pearson_correlation(self, criticA, criticB):
         """
         Returns the Pearson Correlation of two critics, A and B by
         performing the PPMC calculation on the scatter plot of (a, b)
         ratings on the shared set of critiqued titles.
         """

         # Get the set of mutually rated items
         preferences = self.shared_preferences(criticA, criticB)

         # Store the length to save traversals of the len computation.
         # If they have no rankings in common, return 0.
         length = len(preferences)
         if length == 0: return 0

         # Loop through the preferences of each critic once and
compute the
         # various summations that are required for our final
calculation.
         sumA = sumB = sumSquareA = sumSquareB = sumProducts = 0
         for a, b in preferences.values():
             sumA += a
             sumB += b
             sumSquareA += pow(a, 2)
             sumSquareB += pow(b, 2)
             sumProducts += a*b

         # Calculate Pearson Score
         numerator   = (sumProducts*length) - (sumA*sumB)
         denominator = sqrt(((sumSquareA*length) - pow(sumA, 2)) *
((sumSquareB*length) - pow(sumB, 2)))

         # Prevent division by zero.
         if denominator == 0: return 0

         return abs(numerator / denominator)

How it works…

The Pearson correlation computes the "product moment", which is the mean of the product of mean adjusted random variables and is defined as the covariance of two variables (a and b, in our case) divided by the product of the standard deviation of a and the standard deviation of b. As a formula, this looks like the following:

 

For a finite sample, which is what we have, the detailed formula, which was implemented in the preceding function, is as follows:

 

Another way to think about the Pearson correlation is as a measure of the linear dependence between two variables. It returns a score of -1 to 1, where negative scores closer to -1 indicate a stronger negative correlation, and positive scores closer to 1 indicate a stronger, positive correlation. A score of 0 means that the two variables are not correlated.

In order for us to perform comparisons, we want to normalize our similarity metrics in the space of [0, 1] so that 0 means less similar and 1 means more similar, so we return the absolute value:

 >>> print model.pearson_correlation(232, 532)
0.06025793538385047

There's more…

We have explored two distance metrics: the Euclidean distance and the Pearson correlation. There are many more, including the Spearman correlation, Tantimoto scores, Jaccard distance, Cosine similarity, and Manhattan distance, to name a few. Choosing the right distance metric for the dataset of your recommender along with the type of preference expression used is crucial to ensuring success in this style of recommender. It's up to the reader to explore this space further based on his or her interest and particular dataset.

Finding the best critic for a user

Now that we have two different ways to compute a similarity distance between users, we
can determine the best critics for a particular user and see how similar they are to an individual's preferences.

Getting ready

Make sure that you have completed the previous recipes before tackling this one.

How to do it…

Implement a new method for the MovieLens class, similar_critics(), that locates the best match for a user:

import heapq

...

     def similar_critics(self, user, metric='euclidean', n=None):
         """
         Finds, ranks similar critics for the user according to the
         specified distance metric. Returns the top n similar critics
         if n is specified.
         """

         # Metric jump table
         metrics = {
             'euclidean': self.euclidean_distance,
             'pearson':   self.pearson_correlation,
         }

         distance = metrics.get(metric, None)

         # Handle problems that might occur
         if user not in self.reviews:
             raise KeyError("Unknown user, '%s'." % user)
         if not distance or not callable(distance):
             raise KeyError("Unknown or unprogrammed distance metric
'%s'." % metric)

         # Compute user to critic similarities for all critics
         critics = {}
         for critic in self.reviews:
             # Don't compare against yourself!
             if critic == user:
                 continue
           critics[critic] = distance(user, critic)

         if n:
             return heapq.nlargest(n, critics.items(),
key=itemgetter(1))
         return critics

How it works…

The similar_critics method, added to the MovieLens class, serves as the heart of this recipe. It takes as parameters the targeted user and two optional parameters: the metric to be used, which defaults to euclidean, and the number of results to be returned, which defaults to None. As you can see, this flexible method uses a jump table to determine what algorithm is to be used (you can pass in euclidean or pearson to choose the distance metric). Every other critic is compared to the current user (except a comparison of the user against themselves). The results are then sorted using the flexible heapq module and the top n results are returned.

To test out our implementation, print out the results of the run for both similarity distances:

>>> for item in model.similar_critics(232, 'euclidean', n=10):
print "%4i: %0.3f" % item
688: 1.000
914: 1.000
   47: 0.500
   78: 0.500
170: 0.500
335: 0.500
341: 0.500
101: 0.414
155: 0.414
309: 0.414

>>> for item in model.similar_critics(232, 'pearson', n=10):
   print "%4i: %0.3f" % item
   33: 1.000
   36: 1.000
155: 1.000
260: 1.000
289: 1.000
302: 1.000
309: 1.000
317: 1.000
511: 1.000
769: 1.000

These scores are clearly very different, and it appears that Pearson thinks that there are much more similar users than the Euclidean distance metric. The Euclidean distance metric tends to favor users who have rated fewer items exactly the same. Pearson correlation favors more scores that fit well linearly, and therefore, Pearson corrects grade inflation where two critics might rate movies very similarly, but one user rates them consistently one star higher than the other.

If you plot out how many shared rankings each critic has, you'll see that the data is very sparse. Here is the preceding data with the number of rankings appended:

Euclidean scores:
688: 1.000 (1 shared rankings)
914: 1.000 (2 shared rankings)
   47: 0.500 (5 shared rankings)
   78: 0.500 (3 shared rankings)
170: 0.500 (1 shared rankings)

Pearson scores:
   33: 1.000 (2 shared rankings)
   36: 1.000 (3 shared rankings)
155: 1.000 (2 shared rankings)
260: 1.000 (3 shared rankings)
289: 1.000 (3 shared rankings)

Therefore, it is not enough to find similar critics and use their ratings to predict our users' scores; instead, we will have to aggregate the scores of all of the critics, regardless of similarity, and predict ratings for the movies we haven't rated.

Predicting movie ratings for users

To predict how we might rate a particular movie, we can compute a weighted average of critics who have also rated the same movies as the user. The weight will be the similarity of the critic to user—if a critic has not rated a movie, then their similarity will not contribute to the overall ranking of the movie.

Getting ready

Ensure that you have completed the previous recipes in this large, cumulative article.

How to do it…

The following steps walk you through the prediction of movie ratings for users:

  1. First, add the predict_ranking function to the MovieLens class in order to predict the ranking a user might give a particular movie with similar critics:
        def predict_ranking(self, user, movie, metric='euclidean', critics=None):
             """
             Predicts the ranking a user might give a movie based on the
             weighted average of the critics similar to the that user.
             """
    
             critics = critics or self.similar_critics(user, metric=metric)
             total   = 0.0
             simsum = 0.0
    
             for critic, similarity in critics.items():
                 if movie in self.reviews[critic]:
                     total += similarity * self.reviews[critic][movie]['rating']
                     simsum += similarity
    
             if simsum == 0.0: return 0.0
             return total / simsum
  2. Next, add the predict_all_rankings method to the MovieLens class:
    def predict_all_rankings(self, user, metric='euclidean',
    n=None):
             """
             Predicts all rankings for all movies, if n is
    specified returns
             the top n movies and their predicted ranking.
             """
             critics = self.similar_critics(user, metric=metric)
             movies = {
                movie: self.predict_ranking(user, movie, metric,
    critics)
                 for movie in self.movies
             }
    
             if n:
                 return heapq.nlargest(n, movies.items(),
    key=itemgetter(1))
             return movies

How it works…

The predict_ranking method takes a user and a movie along with a string specifying the distance metric and returns the predicted rating for that movie for that particular user. A fourth argument, critics, is meant to be an optimization for the predict_all_rankings method, which we'll discuss shortly. The prediction gathers all critics who are similar to the user and computes the weighted total rating of the critics, filtered by those who actually did rate the movie in question. The weights are simply their similarity to the user, computed by the distance metric. This total is then normalized by the sum of the similarities to move the rating back into the space of 1 to 5 stars:

>>> print model.predict_ranking(422, 50, 'euclidean')
4.35413151722
>>> print model.predict_ranking(422, 50, 'pearson')
4.3566797826

Here, we can see the predictions for Star Wars (ID 50 in our MovieLens dataset) for the user 422. The Euclidean and Pearson computations are very close to each other (which isn't necessarily to be expected), but the prediction is also very close to the user's actual rating, which is 4.

The predict_all_rankings method computes the ranking predictions for all movies
for a particular user according to the passed-in metric. It optionally takes a value, n, to
return the top n best matches. This function optimizes the similar critics' lookup by only executing it once and then passing those discovered critics to the predict_ranking function in order to improve the performance. However, this method must be run on every single movie in the dataset:

>>> for mid, rating in model.predict_all_rankings(578, 'pearson',
10):
...     print "%0.3f: %s" % (rating, model.movies[mid]['title'])
5.000: Prefontaine (1997)
5.000: Santa with Muscles (1996)
5.000: Marlene Dietrich: Shadow and Light (1996)
5.000: Star Kid (1997)
5.000: Aiqing wansui (1994)
5.000: Someone Else's America (1995)
5.000: Great Day in Harlem, A (1994)
5.000: Saint of Fort Washington, The (1993)
4.954: Anna (1996)
4.817: Innocents, The (1961)

As you can see, we have now computed what our recommender thinks the top movies for
this particular user are, along with what we think the user will rate the movie! The top-10
list of average movie ratings plays a huge rule here and a potential improvement could be to use the Bayesian averaging in addition to the similarity weighting, but that is left for the reader to implement.

Collaboratively filtering item by item

So far, we have compared users to other users in order to make our predictions. However, the similarity space can be partitioned in two ways. User-centric collaborative filtering plots users in the preference space and discovers how similar users are to each other. These similarities are then used to predict rankings, aligning the user with similar critics. Item-centric collaborative filtering does just the opposite; it plots the items together in the preference space and makes recommendations according to how similar a group of items are to another group.

Item-based collaborative filtering is a common optimization as the similarity of items changes slowly. Once enough data has been gathered, reviewers adding reviews does not necessarily change the fact that Toy Story is more similar to Babe than The Terminator, and users who prefer Toy Story might prefer the former to the latter. Therefore, you can simply compute item similarities once in a single offline-process and use that as a static mapping for recommendations, updating the results on a semi-regular basis.

This recipe will walk you through item-by-item collaborative filtering.

Getting ready

This recipe requires the completion of the previous recipes in this article.

How to do it…

Construct the following function to perform item-by-item collaborative filtering:

     def shared_critics(self, movieA, movieB):
         """
         Returns the intersection of critics for two items, A and B
         """

         if movieA not in self.movies:
             raise KeyError("Couldn't find movie '%s' in data" %movieA)
         if movieB not in self.movies:
            raise KeyError("Couldn't find movie '%s' in data" %movieB)

         criticsA = set(critic for critic in self.reviews if movieA
in self.reviews[critic])
         criticsB = set(critic for critic in self.reviews if movieB
in self.reviews[critic])
        shared   = criticsA & criticsB # Intersection operator

         # Create the reviews dictionary to return
         reviews = {}
         for critic in shared:
             reviews[critic] = (
                 self.reviews[critic][movieA]['rating'],
               self.reviews[critic][movieB]['rating'],
             )
         return reviews

     def similar_items(self, movie, metric='euclidean', n=None):
         # Metric jump table
         metrics = {
             'euclidean': self.euclidean_distance,
             'pearson':   self.pearson_correlation,
         }

         distance = metrics.get(metric, None)

         # Handle problems that might occur
         if movie not in self.reviews:
             raise KeyError("Unknown movie, '%s'." % movie)
         if not distance or not callable(distance):
             raise KeyError("Unknown or unprogrammed distance metric
'%s'." % metric)

         items = {}
         for item in self.movies:
             if item == movie:
                 continue

             items[item] = distance(item, movie, prefs='movies')

         if n:
             return heapq.nlargest(n, items.items(),
key=itemgetter(1))
         return items

How it works…

To perform item-by-item collaborative filtering, the same distance metrics can be used but must be updated to use the preferences from shared_critics rather than shared_preferences (for example, item similarity versus user similarity). Update the functions to accept a prefs parameter that determines which preferences are to be used, but I'll leave that to the reader as it is only two lines of code.

If you print out the list of similar items for a particular movie, you can see some interesting results. For example, review the similarity results for The Crying Game (1992), which has an ID of 631:

for movie, similarity in model.similar_items(631, 'pearson').items():
   print "%0.3f: %s" % (similarity, model.movies[movie]['title'])

0.127: Toy Story (1995)
0.209: GoldenEye (1995)
0.069: Four Rooms (1995)
0.039: Get Shorty (1995)
0.340: Copycat (1995)
0.225: Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)
0.232: Twelve Monkeys (1995)
...

This crime thriller is not very similar to Toy Story, which is a children's movie, but is more similar to Copycat, which is another crime thriller. Of course, critics who have rated many movies skew the results, and more movie reviews are needed before this normalizes into something more compelling.

It is presumed that the item similarity scores are run regularly but do not need to be
computed in real time. Given a set of computed item similarities, computing recommendations are as follows:

     def predict_ranking(self, user, movie, metric='euclidean'):
         movies = self.similar_items(movie, metric=metric)
         total = 0.0
         simsum = 0.0

         for relmovie, similarity in movies.items():
             # Ignore movies already reviewed by user
             if relmovie in self.reviews[user]:
                 total += similarity *
self.reviews[user][relmovie]['rating']
                 simsum += similarity

         if simsum == 0.0: return 0.0
         return total / simsum

This method simply uses the inverted item-to-item similarity scores rather than the
user-to-user similarity scores. Since similar items can be computed offline, the lookup for movies via the self.similar_items method should be a database lookup rather than a real-time computation.

 >>> print model.predict_ranking(232, 52, 'pearson')
3.980443976

You can then compute a ranked list of all possible recommendations in a similar way as the user-to-user recommendations.

Building a nonnegative matrix factorization model

A general improvement on the basic cross-wise nearest-neighbor similarity scoring of collaborative filtering is a matrix factorization method, which is also known as Singular Value Decomposition (SVD). Matrix factorization methods attempt to explain the ratings through the discovery of latent features that are not easily identifiable by analysts. For instance, this technique can expose possible features such as the amount of action, family friendliness, or fine-tuned genre discovery in our movies dataset.

What's especially interesting about these features is that they are continuous and not discrete values and can represent an individual's preference along a continuum. In this sense, the model can explore shades of characteristics, for example, perhaps a critic in the movie reviews' dataset, such as action flicks with a strong female lead that are set in European countries. A James Bond movie might represent a shade of that type of movie even though it only ticks the set in European countries and action genre boxes. Depending on how similarly reviewers rate the movie, the strength of the female counterpart to James Bond will determine how they might like the movie.

Also, extremely helpfully, the matrix factorization model does well on sparse data, that is data with few recommendation and movie pairs. Reviews' data is particularly sparse because not everyone has rated the same movies and there is a massive set of available movies. SVD can also be performed in parallel, making it a good choice for much larger datasets.

In the remaining recipes in this article, we will build a nonnegative matrix factorization model in order to improve our recommendation engine.

How to do it…

  1. Loading the entire dataset into the memory.
  2. Dumping the SVD-based model to the disk.
  3. Training the SVD-based model.
  4. Testing the SVD-based model.

How it works…

Matrix factorization, or SVD works, by finding two matrices such that when you take their dot product (also known as the inner product or scalar product), you will get a close approximation of the original matrix. We have expressed our training matrix as a sparse N x M matrix of users to movies where the values are the 5-star rating if it exists, otherwise, the value is blank or 0. By factoring the model with the values that we have and then taking the dot product of the two matrices produced by the factorization, we hope to fill in the blank spots in our original matrix with a prediction of how the user would have rated the movie in that column.

The intuition is that there should be some latent features that determine how users rate an item, and these latent features are expressed through the semantics of their previous ratings. If we can discover the latent features, we will be able to predict new ratings. Additionally, there should be fewer features than there are users and movies (otherwise, each movie or user would be a unique feature). This is why we compose our factored matrices by some feature length before taking their dot product.

Mathematically, this task is expressed as follows. If we have a set of U users and M movies, let R of size |U| x |M| be the matrix that contains the ratings of users. Assuming that we have K latent features, find two matrices, P and Q, where P is |U| x K and Q is |M| x K such that the dot product of P and Q transpose approximates R. P, which therefore represent the strength of the associations between users and features and Q represents the association of movies with features.

There are a few ways to go about factorization, but the choice we made was to perform gradient descent. Gradient descent initializes two random P and Q matrices, computes their dot product, and then minimizes the error compared to the original matrix by traveling down a slope of an error function (the gradient). This way, the algorithm hopes to find a local minimum where the error is within an acceptable threshold.

Our function computed the error as the squared difference between the predicted value and the actual value.

To minimize the error, we modify the values pik and qkj by descending along the gradient of the current error slope, differentiating our error equation with respect to p yields:

We then differentiate our error equation with respect to the variable q yields in the following equation:

We can then derive our learning rule, which updates the values in P and Q by a constant learning rate, which is α. This learning rate, α, should not be too large because it determines how big of a step we take towards the minimum, and it is possible to step across to the other side of the error curve. It should also not be too small, otherwise it will take forever to converge.

We continue to update our P and Q matrices, minimizing the error until the sum of the error squared is below some threshold, 0.001 in our code, or until we have performed a maximum number of iterations.

Matrix factorization has become an important technique for recommender systems, particularly those that leverage Likert-scale-like preference expressions—notably, star ratings. The Netflix Prize challenge has shown us that matrix-factored approaches perform with a high degree of accuracy for ratings prediction tasks. Additionally, matrix factorization is a compact, memory-efficient representation of the parameter space for a model and can be trained in parallel, can support multiple feature vectors, and can be improved with confidence levels. Generally, they are used to solve cold-start problems with sparse reviews and in an ensemble with more complex hybrid-recommenders that also compute content-based recommenders.

See also

Loading the entire dataset into the memory

The first step in building a nonnegative factorization model is to load the entire dataset in the memory. For this task, we will be leveraging NumPy highly.

Getting ready

In order to complete this recipe, you'll have to download the MovieLens database from the University of Minnesota GroupLens page at http://grouplens.org/datasets/movielens/ and unzip it in a working directory where your code will be. We will also use NumPy in this code significantly, so please ensure that you have this numerical analysis package downloaded and ready. Additionally, we will use the load_reviews function from the previous recipes. If you have not had the opportunity to review the appropriate section, please have the code for that function ready.

How to do it…

To build our matrix factorization model, we'll need to create a wrapper for the predictor that loads the entire dataset into memory. We will perform the following steps:

  1. We create the following Recommender class as shown. Please note that this class depends on the previously created and discussed load_reviews function:
    import numpy as np
    import csv
    
    class Recommender(object):
         def __init__(self, udata):
             self.udata   = udata
             self.users   = None
             self.movies = None
             self.reviews = None
            self.load_dataset()
    
         def load_dataset(self):
             """
             Load an index of users & movies as a heap
       and reviews table as a N x M array where N is
       the number of users and M is the number of movies.
       Note that order matters so that we can look up values
             outside of the matrix!
             """
             self.users = set([])
             self.movies = set([])
             for review in load_reviews(self.udata):
                 self.users.add(review['userid'])
                 self.movies.add(review['movieid'])
    
             self.users = sorted(self.users)
             self.movies = sorted(self.movies)
    
             self.reviews = np.zeros(shape=(len(self.users),
    len(self.movies)))
             for review in load_reviews(self.udata):
                 uid = self.users.index(review['userid'])
                 mid = self.movies.index(review['movieid'])
                 self.reviews[uid, mid] = review['rating']
  2. With this defined, we can instantiate a model by typing the following command:
    data_path = '../data/ml-100k/u.data'
    model = Recommender(data_path)

How it works…

Let's go over this code line by line. The instantiation of our recommender requires a path to the u.data file; creates holders for our list of users, movies, and reviews; and then loads the dataset. We need to hold the entire dataset in memory for reasons that we will see later.

The basic data structure to perform our matrix factorization on is an N x M matrix where N is the number of users and M is the number of movies. To create this, we will first load all the movies and users into an ordered list so that we can look up the index of the user or movie by its ID. In the case of MovieLens, all of the IDs are contiguous from 1; however, this might not always be the case. It is good practice to have an index lookup table. Otherwise, you will be unable to fetch recommendations from our computation!

Once we have our index lookup lists, we create a NumPy array of all zeroes in the size of the length of our users' list by the length of our movies list. Keep in mind that the rows are users and the columns are movies! We then go through the ratings data a second time and then add the value of the rating at the uid, mid index location of our matrix. Note that if a user hasn't rated a movie, their rating is 0. This is important! Print the array out by entering model.reviews, and you should see something as follows:

[[ 5. 3. 4. ..., 0. 0. 0.]
[ 4. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 5. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 5. 0. ..., 0. 0. 0.]]

There's more…

Let's get a sense of how sparse or dense our dataset is by adding the following two methods to the Recommender class:

def sparsity(self):
     """
     Report the percent of elements that are zero in the array
     """
     return 1 - self.density()

def density(self):
     """
     Return the percent of elements that are nonzero in the array
     """
     nonzero = float(np.count_nonzero(self.reviews))
     return nonzero / self.reviews.size

Adding these methods to our Recommender class will help us evaluate our recommender, and it will also help us identify recommenders in the future.

Print out the results:

print "%0.3f%% sparse" % model.sparsity()
print "%0.3f%% dense" % model.density()

You should see that the MovieLens 100k dataset is 0.937 percent sparse and 0.063
percent dense.

This is very important to keep note of along with the size of the reviews dataset. Sparsity, which is common to most recommender systems, means that we might be able to use sparse matrix algorithms and optimizations. Additionally, as we begin to save models, this will help us identify the models as we load them from serialized files on the disk.

Dumping the SVD-based model to the disk

Before we build our model, which will take a long time to train, we should create a mechanism for us to load and dump our model to the disk. If we have a way of saving the parameterization of the factored matrix, then we can reuse our model without having to train it every time we want to use it—this is a very big deal since this model will take hours to train! Luckily, Python has a built-in tool for serializing and deserializing Python objects—the pickle module.

How to do it…

Update the Recommender class as follows:

 import pickle
class Recommender(object):

     @classmethod
     def load(klass, pickle_path):
         """
         Instantiates the class by deserializing the pickle.
         Note that the object returned may not be an exact match
         to the code in this class (if it was saved
   before updates).
         """
         with open(pickle_path, 'rb') as pkl:
            return pickle.load(pkl)

   def __init__(self, udata, description=None):
         self.udata   = udata
         self.users   = None
         self.movies = None
         self.reviews = None

         # Descriptive properties
         self.build_start = None
         self.build_finish = None
         self.description = None

         # Model properties
         self.model       = None
         self.features   = 2
         self.steps       = 5000
         self.alpha       = 0.0002
         self.beta         = 0.02

         self.load_dataset()

     def dump(self, pickle_path):
         """
         Dump the object into a serialized file using the pickle module.
         This will allow us to quickly reload our model in the future.
         """
         with open(pickle_path, 'wb') as pkl:
             pickle.dump(self, pkl)

How it works…

The @classmethod feature is a decorator in Python for declaring a class method instead of an instance method. The first argument that is passed in is the type instead of an instance (which we usually refer to as self). The load class method takes a path to a file on the disk that contains a serialized pickle object, which it then loads using the pickle module. Note that the class that is returned might not be an exact match with the Recommender class at the time you run the code—this is because the pickle module saves the class, including methods and properties, exactly as it was when you dumped it.

Speaking of dumping, the dump method provides the opposite functionality, allowing you to serialize the methods, properties, and data to disk in order to be loaded again in the future. To help us identify the objects that we're dumping and loading from disk, we've also added some descriptive properties including a description, some build parameters, and some timestamps to our __init__ function.

Training the SVD-based model

We're now ready to write our functions that factor our training dataset and build our recommender model. You can see the required functions in this recipe.

How to do it…

We construct the following functions to train our model. Note that these functions are not part of the Recommender class:

def initialize(R, K):
   """
     Returns initial matrices for an N X M matrix,
     R and K features.

     :param R: the matrix to be factorized
     :param K: the number of latent features

     :returns: P, Q initial matrices of N x K and M x K sizes
     """
     N, M = R.shape
     P = np.random.rand(N,K)
     Q = np.random.rand(M,K)

     return P, Q

def factor(R, P=None, Q=None, K=2, steps=5000, alpha=0.0002,
beta=0.02):
     """
     Performs matrix factorization on R with given parameters.

     :param R: A matrix to be factorized, dimension N x M
     :param P: an initial matrix of dimension N x K
     :param Q: an initial matrix of dimension M x K
     :param K: the number of latent features
     :param steps: the maximum number of iterations to optimize in
     :param alpha: the learning rate for gradient descen
     :param beta: the regularization parameter

     :returns: final matrices P and Q
     """

     if not P or not Q:
         P, Q = initialize(R, K)
     Q = Q.T

     rows, cols = R.shape
     for step in xrange(steps):
         for i in xrange(rows):
             for j in xrange(cols):
                 if R[i,j] > 0:
                     eij = R[i,j] - np.dot(P[i,:], Q[:,j])
                     for k in xrange(K):
                         P[i,k] = P[i,k] + alpha * (2 * eij * Q[k,j]
- beta * P[i,k])
                         Q[k,j] = Q[k,j] + alpha * (2 * eij * P[i,k]
- beta * Q[k,j])

         e = 0

         for i in xrange(rows):
             for j in xrange(cols):
                 if R[i,j] > 0:
                   e = e + pow(R[i,j] - np.dot(P[i,:], Q[:,j]), 2)
                     for k in xrange(K):
                         e = e + (beta/2) * (pow(P[i,k], 2) +
pow(Q[k,j], 2))
         if e < 0.001:
             break

     return P, Q.T

How it works…

We discussed the theory and the mathematics of what we are doing in the previous recipe, Building a non-negative matrix factorization model, so let's talk about the code. The initialize function creates two matrices, P and Q, that have a size related to the reviews matrix and the number of features, namely N x K and M x K, where N is the number of users and M is the number of movies. Their values are initialized to random numbers that are between 0.0 and 1.0. The factor function computes P and Q using gradient descent such that the dot product of P and Q is within a mean squared error of less than 0.001 or 5000 steps that have gone by, whichever comes first. Especially note that only values that are greater than 0 are computed. These are the values that we're trying to predict; therefore, we do not want to attempt to match them in our code (otherwise, the model will be trained on zero ratings)! This is also the reason that you can't use NumPy's built-in Singular Value Decomposition (SVD) function, which is np.linalg.svd or np.linalg.solve.

There's more…

Let's use these factorization functions to build our model and to save the model to disk once it has been built—this way, we can load the model at our convenience using the dump and load methods in the class. Add the following method to the Recommender class:

def build(self, output=None):
     """
     Trains the model by employing matrix factorization on training
     data set, (sparse reviews matrix). The model is the dot product
     of the P and Q decomposed matrices from the factorization.
     """
     options = {
         'K':     self.features,
         'steps': self.steps,
         'alpha': self.alpha,
         'beta': self.beta,
     }

     self.build_start = time.time()
     self.P, self.Q = factor(self.reviews, **options)
     self.model = np.dot(self.P, self.Q.T)
     self.build_finish = time.time()

     if output:
         self.dump(output)

This helper function will allow us to quickly build our model. Note that we're also saving P and Q—the parameters of our latent features. This isn't necessary, as our predictive model is the dot product of the two factored matrices. Deciding whether or not to save this information in your model is a trade-off between re-training time (you can potentially start from the current P and Q parameters although you must beware of the overfit) and disk space, as pickle will be larger on the disk with these matrices saved. To build this model and dump the data to the disk, run the following code:

model = Recommender(relative_path('../data/ml-100k/u.data'))
model.build('reccod.pickle')

Warning! This will take a long time to build! On a 2013 MacBook Pro with a 2.8 GHz processor, this process took roughly 9 hours 15 minutes and required 23.1 MB of memory; this is not insignificant for most of the Python scripts you might be used to writing! It is not a bad idea to continue through the rest of the recipe before building your model. It is also probably not a bad idea to test your code on a smaller test set of 100 records before moving on to the entire process! Additionally, if you don't have the time to train the model, you can find the pickle module of our model in the errata of this book.

Testing the SVD-based model

This recipe brings this article on recommendation engines to a close. We use our new nonnegative matrix factorization-based model and take a look at some of the predicted reviews.

How to do it…

The final step in leveraging our model is to access the predicted reviews for a movie based
on our model:

       def predict_ranking(self, user, movie):
           uidx = self.users.index(user)
           midx = self.movies.index(movie)
           if self.reviews[uidx, midx] > 0:
               return None
           return self.model[uidx, midx]

How it works…

Computing the ranking is relatively easy; we simply need to look up the index of the user and the index of the movie and look up the predicted rating in our model. This is why it is so essential to save an ordered list of the users and movies in our pickle module; this way, if the data changes (we add users or movies) but the change isn't reflected in our model, an exception is raised. Because models are historical predictions and not sensitive to changes in time, we need to ensure that we continually retrain our model with new data. This method also returns None if we know the ranking of the user (for example, it's not a prediction); we'll leverage this in the next step.

There's more…

To predict the highest-ranked movies, we can leverage the previous function to order the highest predicted rankings for our user:

      import heapq
       from operator import itemgetter

           def top_rated(self, user, n=12):
               movies = [(mid, self.predict_ranking(user, mid)) for
mid in self.movies]
                return heapq.nlargest(n, movies, key=itemgetter(1))
We can now print out the top-predicted movies that have not been rated by the user:
       >>> rec = Recommender.load('reccod.pickle')
       >>> for item in rec.top_rated(234):
       ...    print "%i: %0.3f" % item
         814: 4.437
       1642: 4.362
       1491: 4.361
       1599: 4.343
       1536: 4.324
       1500: 4.323
       1449: 4.281
       1650: 4.147
       1645: 4.135
       1467: 4.133
       1636: 4.133
       1651: 4.132

It's then simply a matter of using the movie ID to look up the movie in our movies database.

Summary

To learn more about Data Science, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended:

Resources for Article:


Further resources on this subject:


You've been reading an excerpt of:

Practical Data Science Cookbook

Explore Title