Chapter 9. Building a Recommendation System
In the last chapter, we covered the concepts around deploying Spark across various clusters. Over the course of this and the next chapter, we will look at some practical use cases. In this chapter, we will look at building a recommendation system, something many of us end up building in one form or another. We'll cover the following topics:
- Overview of a recommendation system
- Why do you need a recommendation system?
- The long tail phenomenon
- Types of recommendations
- Key problems in recommendations
- Content-based recommendations
- Collaborative filtering
- Latent factor models
This chapter will hopefully give you a good introduction to recommender systems, and then follow up with specific code examples to solve a real-world use case of movie recommendations.
Let's get started.
What is a recommendation system?
We come across recommendation systems on almost a daily basis, whether we are buying products on Amazon, watching movies on Netflix, playing games on Xbox, finding news articles on Google, or listening to music on Spotify. These online applications recommend items based on your previous history, or on the history of users who have similar interests.
Figure 9.1: Recommendation system on Amazon
Why has recommendation become such a big part of our lives when, 15-20 years ago, it was unheard of in a typical brick-and-mortar store? The answer lies in the fact that we are now in an era of abundance rather than scarcity. Let's drill down a bit more into this. Twenty years ago, the number of products that a typical retailer stocked was limited, because shelf space was limited and real estate was expensive.
Similarly, our favourite movie shop would only stock a limited number of movies and our bookseller a limited number of books. I still remember 20 years ago...
User-specific recommendations
During the remainder of this chapter, we will focus on user-specific ratings. Let's start by considering a model of the recommendation system.
Let's assume:
C = Set of customers.
I = Set of items (could be movies, books, news items, and so on).
R = Set of ratings. This is an ordered set, where higher values indicate a stronger liking for a particular item and lower values indicate a weaker liking. Generally, a rating is represented by a real value between 0 and 1.
Let's define a utility function u, which looks at every pair of customers and items and maps it to a specific rating:
u: C × I → R
Let's give an example of a utility matrix, for a set of movies and users:
A utility matrix is generally a sparse matrix, as users rate fewer movies than they watch. The areas where ratings are missing can be either...
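The utility function and the sparsity of the utility matrix can be sketched in a few lines of plain Scala. This is only an illustration under stated assumptions: the customer names, movie titles, and rating values are invented, and the names UtilityMatrixSketch, u, and density are hypothetical helpers rather than part of any library. Only known ratings are stored, so a missing entry stands for an unknown rating.

```scala
// Hypothetical data: the customers, movies, and ratings below are invented for illustration.
object UtilityMatrixSketch {
  type Customer = String
  type Item = String

  // Only known ratings are stored; a missing (customer, item) key is an unknown rating.
  val ratings: Map[(Customer, Item), Double] = Map(
    ("alice", "Movie A") -> 1.0,
    ("alice", "Movie C") -> 0.2,
    ("bob", "Movie B") -> 0.5,
    ("carol", "Movie A") -> 0.8
  )

  // u: C × I → R, returning None where the rating is unknown.
  def u(c: Customer, i: Item): Option[Double] = ratings.get((c, i))

  // Fraction of the |C| × |I| cells that actually hold a rating (assumes non-empty sets).
  def density(customers: Set[Customer], items: Set[Item]): Double =
    ratings.size.toDouble / (customers.size * items.size)

  def main(args: Array[String]): Unit = {
    val customers = ratings.keySet.map(_._1)
    val items = ratings.keySet.map(_._2)
    println(u("alice", "Movie A")) // a known rating
    println(u("bob", "Movie A"))   // an unknown rating
    println(density(customers, items))
  }
}
```

Storing the ratings as a map keyed by (customer, item) keeps memory proportional to the number of known ratings rather than to the full matrix, which matters when most cells are empty.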
Key issues with recommendation systems
There are three key issues with recommender systems in general:
- Gathering known input data
- Predicting unknown from known ratings
- Evaluating prediction methods
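To make the third issue concrete: one widely used way to evaluate a prediction method is to hold back some known ratings, predict them, and compute the root mean squared error (RMSE) between predictions and actual values. The text above does not name a metric, so RMSE is an assumption here, and the object and function names are hypothetical. A minimal sketch:

```scala
object RmseSketch {
  // Each pair is (predicted rating, actual held-out rating).
  def rmse(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one (predicted, actual) pair")
    // Mean of the squared differences, then the square root.
    val meanSquaredError =
      pairs.map { case (predicted, actual) => math.pow(predicted - actual, 2) }.sum / pairs.size
    math.sqrt(meanSquaredError)
  }

  def main(args: Array[String]): Unit = {
    // Invented values on the 0-to-1 rating scale, for illustration only.
    val heldOut = Seq((0.9, 1.0), (0.4, 0.2), (0.5, 0.5))
    println(rmse(heldOut))
  }
}
```

A lower RMSE means the predictions are closer to the held-out ratings; because errors are squared, large misses are penalized more heavily than small ones.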
Gathering known input data
The first interim milestone in building a recommendation system is to gather the input data, that is, customers, products, and the relevant ratings. While you already have customers and products in your CRM and other systems, you still need to collect the users' ratings of the products. There are two methods to collect product ratings:
- Explicit: With explicit ratings, users directly rate a particular item, for example, a movie on Netflix or a book or other product on Amazon. This is a very direct way to engage with users, and it typically provides the highest quality data. In real life, despite the incentives given to rate an item, very few users actually leave ratings for the products. Getting explicit ratings is therefore not scalable for any meaningful prediction...
Recommendation system in Spark
We are now going to move on to a practical example of building a recommendation system with Spark. Since most users are familiar with movies, we are going to use the MovieLens dataset: we will have a look at the data and then look at some of the options for building a recommendation system on it. Together, the theory behind recommendation systems and this practical example should give you a good starting point for building one.
We are going to use the MovieLens 100k dataset, which at the time of writing was last updated in October 2016. This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens (https://movielens.org/), a movie recommendation service. It contains 100,004 ratings and 1,296 tag applications across 9,125 movies. This data was created by 671 users between January 09, 1995 and October 16, 2016. This dataset was generated on October 17, 2016 and it can be found at http://bit.ly/24PV0hK. Further details...
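Before handing the ratings to any algorithm, each line of the ratings file has to be turned into a typed record. The sketch below uses plain Scala and assumes the ratings file is a comma-separated file with a header row and the columns userId, movieId, rating, and timestamp; check this against the README shipped with the dataset. The Rating case class and the parseLine helper are hypothetical names, and the sample lines are made up rather than taken from the dataset.

```scala
import scala.util.Try

object RatingsParserSketch {
  // Assumed column layout: userId,movieId,rating,timestamp
  final case class Rating(userId: Int, movieId: Int, rating: Double, timestamp: Long)

  // Returns None for the header row or for any malformed line.
  def parseLine(line: String): Option[Rating] =
    line.split(",") match {
      case Array(u, m, r, t) =>
        Try(Rating(u.trim.toInt, m.trim.toInt, r.trim.toDouble, t.trim.toLong)).toOption
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    // Made-up sample lines in the assumed layout, not actual rows from the dataset.
    val sample = Seq(
      "userId,movieId,rating,timestamp",
      "1,10,4.0,1000000000",
      "2,20,3.5,1000000001"
    )
    val parsed = sample.flatMap(parseLine)
    parsed.foreach(println)
  }
}
```

Returning an Option keeps the header row and any corrupt lines out of the result without throwing, so the same function can be applied to every line of the file with flatMap.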
The following articles, blogs, and videos have been used for the contents of this chapter. They have also been included to provide users with further reading material:
This concludes the chapter. We have gone through recommendation systems, including the theory and some practical examples in Scala. I have learned a lot of this theory from some of the data mining courses on Coursera, which is an amazing platform. I hope we have been able to do justice to the topic. We have tried to focus on the design and the factors involved in a recommendation system, as I believe that engineering the solution is the easy part once you understand what you are up against.
The next chapter focuses on another case study, churn prediction, which is one of the most popular use cases in any customer-driven organization that understands the cost of acquiring a new customer versus retaining an existing one.