Building a Movie Recommendation Engine with Naïve Bayes
As promised, in this chapter we will kick off our supervised learning journey with machine learning classification, and specifically binary classification. The goal of the chapter is to build a movie recommendation system. It is a good starting point to learn classification from a real-life example: movie streaming service providers are already doing this, and we can do the same. You will learn the fundamental concepts of classification, including what it does and its various types and applications, with a focus on solving a binary classification problem using a simple yet powerful algorithm: Naïve Bayes. Finally, the chapter will demonstrate how to fine-tune a model, an important skill that every data science or machine learning practitioner should master.
We will go into detail on the following topics:
- What is machine learning classification?
- Types of classification
- Applications...
Getting started with classification
Movie recommendation can be framed as a machine learning classification problem. For example, if the model predicts that you will like a movie, it will be on your recommended list; otherwise, it won't. Let's get started by learning the important concepts of machine learning classification.
Classification is one of the main instances of supervised learning. Given a training set of data containing observations and their associated categorical outputs, the goal of classification is to learn a general rule that correctly maps the observations (also called features or predictive variables) to the target categories (also called labels or classes). Putting it another way, a trained classification model will be generated after the model learns from the features and targets of training samples, as shown in the first half of Figure 2.1. When new or unseen data comes in, the trained model will be able to determine their...
Exploring Naïve Bayes
The Naïve Bayes classifier belongs to the family of probabilistic classifiers. It computes the probabilities of each predictive feature (also referred to as an attribute or signal) of the data belonging to each class in order to predict a probability distribution over all classes. From the resulting probability distribution, we can then conclude the most likely class that the data sample is associated with. What Naïve Bayes does specifically, as its name indicates, is as follows:
- Bayes: As in, it maps the probability of observed input features given a possible class to the probability of the class given observed pieces of evidence based on Bayes' theorem.
- Naïve: As in, it simplifies probability computation by assuming that predictive features are mutually independent.
I will explain Bayes' theorem with examples in the next section.
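As a compact preview, the two properties above combine into a single formula; the symbols here (a feature vector $\mathbf{x} = (x_1, \ldots, x_n)$ and a class $y$) are generic notation rather than the notation used later in the chapter:

$$
P(y \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y)\,P(y)}{P(\mathbf{x})} \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The proportionality on the right uses the naïve independence assumption, and the denominator $P(\mathbf{x})$ is the same for every class, so it can be ignored when ranking classes.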
Learning Bayes' theorem...
Implementing Naïve Bayes
After calculating the movie preference prediction example by hand, we are now, as promised, going to code Naïve Bayes from scratch. After that, we will implement it using the scikit-learn package.
Implementing Naïve Bayes from scratch
Before we develop the model, let's define the toy dataset we just worked with:
>>> import numpy as np
>>> # Each row of X_train is a training sample with three binary
>>> # features (for example, whether a user liked each of three
>>> # movies); each label in Y_train indicates whether that user
>>> # liked a target movie
>>> X_train = np.array([
... [0, 1, 1],
... [0, 0, 1],
... [0, 0, 0],
... [1, 1, 0]])
>>> Y_train = ['Y', 'N', 'Y', 'Y']
>>> # A new sample whose label we want to predict
>>> X_test = np.array([[1, 1, 0]])
To train the model, we start with the prior. First, we group the data by label and record the corresponding sample indices by class:
>>> from collections import defaultdict
>>> def get_label_indices(labels):
...     """
...     Group samples based on their labels and return indices
...     @param labels: list of labels
...     @return: dict, {class1: [indices], class2: [indices]}
...     """
...     label_indices = defaultdict(list)
...     for index, label in enumerate(labels):
...         label_indices[label].append(index)
...     return label_indices
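Let's see the helper in action on the toy dataset and take the promised first step of computing the prior. The following is a minimal sketch; the helper name get_prior is illustrative, not fixed by the text above:

>>> label_indices = get_label_indices(Y_train)
>>> print('label_indices:\n', label_indices)
label_indices:
 defaultdict(<class 'list'>, {'Y': [0, 2, 3], 'N': [1]})
>>> def get_prior(label_indices):
...     """
...     Compute prior based on training samples
...     @param label_indices: grouped sample indices by class
...     @return: dict, with class label as key, prior as value
...     """
...     prior = {label: len(indices)
...              for label, indices in label_indices.items()}
...     total_count = sum(prior.values())
...     for label in prior:
...         prior[label] /= total_count
...     return prior
>>> prior = get_prior(label_indices)
>>> print('Prior:', prior)
Prior: {'Y': 0.75, 'N': 0.25}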
Building a movie recommender with Naïve Bayes
After the toy example, it is now time to build a movie recommender (or, more specifically, movie preference classifier) using a real dataset. We herein use a movie rating dataset (https://grouplens.org/datasets/movielens/). The movie rating data was collected by the GroupLens Research group from the MovieLens website (http://movielens.org).
For demonstration purposes, we will use the ml-1m dataset (downloaded from the following link: http://files.grouplens.org/datasets/movielens/ml-1m.zip) as an example. It has around one million ratings, ranging from 1 to 5, given by 6,040 users on 3,706 movies.
Unzip the ml-1m.zip file and you will see the following four files:
- movies.dat: It contains the movie information in the format of MovieID::Title::Genres.
- ratings.dat: It contains user movie ratings in the format of UserID...
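As a quick aside on working with this "::"-delimited layout, here is a minimal sketch of loading the ratings file with pandas. The column names (and the full record format UserID::MovieID::Rating::Timestamp) follow the standard ml-1m layout and are an assumption here, since the listing above is truncated:

>>> import pandas as pd
>>> # '::' is a multi-character separator, which requires the
>>> # Python parsing engine; the column names below assume the
>>> # standard ml-1m record layout
>>> ratings = pd.read_csv('ml-1m/ratings.dat', sep='::',
...                       engine='python', header=None,
...                       names=['UserID', 'MovieID',
...                              'Rating', 'Timestamp'])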
Evaluating classification performance
Beyond accuracy, there are several metrics we can use to gain more insight and to avoid being misled by class imbalance. These are as follows:
- Confusion matrix
- Precision
- Recall
- F1 score
- Area under the curve (AUC)
A confusion matrix summarizes testing instances by their predicted values and true values, presented as a contingency table:

|                 | Predicted as positive | Predicted as negative |
|-----------------|-----------------------|-----------------------|
| Actual positive | True positive (TP)    | False negative (FN)   |
| Actual negative | False positive (FP)   | True negative (TN)    |

Table 2.3: Contingency table for a confusion matrix
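From the four counts in the contingency table, the precision, recall, and F1 score listed above follow directly; these are the standard definitions:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$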
To illustrate this, we can compute the confusion matrix of our Naïve Bayes classifier. We use the confusion_matrix function from scikit-learn to compute it, but it is very easy to code it ourselves:
>>> from sklearn.metrics import confusion_matrix
>>> print(confusion_matrix(Y_test, prediction, labels=[0, 1]))
[[ 60  47]
 [148 431]]
As you can see from the resulting confusion matrix, there are 47 false positive cases (where the model misinterprets a dislike as a like...
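Since it is indeed easy to code the matrix ourselves, here is a minimal from-scratch sketch; the function name compute_confusion_matrix is illustrative, and y_true and y_pred are assumed to be sequences of binary labels like Y_test and prediction above:

>>> import numpy as np
>>> def compute_confusion_matrix(y_true, y_pred, labels=(0, 1)):
...     """
...     Count testing instances for each (true, predicted) pair
...     @return: numpy array of shape (len(labels), len(labels))
...     """
...     matrix = np.zeros((len(labels), len(labels)), dtype=int)
...     for true, pred in zip(y_true, y_pred):
...         matrix[labels.index(true), labels.index(pred)] += 1
...     return matrix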
Tuning models with cross-validation
So far, we have adopted the classification results from one fixed testing set. Instead of relying on a single split, we usually apply the k-fold cross-validation technique to assess how a model will generally perform in practice.
In the k-fold cross-validation setting, the original data is first randomly divided into k equal-sized subsets, often with class proportions preserved. Each of these k subsets is then successively retained as the testing set for evaluating the model. During each trial, the remaining k-1 subsets form the training set used to fit the model. Finally, the average performance across all k trials is calculated to generate an overall result:

Figure 2.9: Diagram of 3-fold cross-validation
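To make the diagram concrete, here is a minimal sketch of 3-fold cross-validation with scikit-learn's StratifiedKFold, which preserves class proportions in each fold. The feature matrix X, label array Y, and the BernoulliNB classifier stand in for the movie recommender's data and model, so treat those names as assumptions:

>>> import numpy as np
>>> from sklearn.model_selection import StratifiedKFold
>>> from sklearn.naive_bayes import BernoulliNB
>>> k_fold = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
>>> scores = []
>>> # Each split retains one fold as the testing set and uses the
>>> # remaining folds as the training set
>>> for train_indices, test_indices in k_fold.split(X, Y):
...     model = BernoulliNB(alpha=1.0)
...     model.fit(X[train_indices], Y[train_indices])
...     scores.append(model.score(X[test_indices], Y[test_indices]))
>>> print(f'Average accuracy across 3 folds: {np.mean(scores):.3f}')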
Statistically, the averaged performance of k-fold cross-validation is a better estimate of how a model performs in general. Given...
Summary
In this chapter, you learned the fundamental and important concepts of machine learning classification, including types of classification, classification performance evaluation, cross-validation, and model tuning. You also learned about the simple, yet powerful, classifier Naïve Bayes. We went in depth through the mechanics and implementations of Naïve Bayes with a couple of examples, the most important one being the movie recommendation project.
Binary classification was the main talking point of this chapter, and multiclass classification will be the subject of the next chapter. Specifically, we will talk about SVMs for image classification.
Exercise
- As mentioned earlier, we extracted user-movie relationships only from the movie rating data, where most ratings are unknown. Can you also utilize data from the files movies.dat and users.dat?
- Practice makes perfect: another great project to deepen your understanding could be heart disease classification. The dataset can be downloaded directly at https://www.kaggle.com/ronitf/heart-disease-uci, or from the original page at https://archive.ics.uci.edu/ml/datasets/Heart+Disease.
- Don't forget to fine-tune the model you obtained from Exercise 2 using the techniques you learned in this chapter. What is the best AUC it achieves?
References
To acknowledge the use of the MovieLens dataset in this chapter, I would like to cite the following paper:
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI: https://doi.org/10.1145/2827872