You're reading from Learning Apache Spark 2

Product type Book

Published in Mar 2017

Publisher Packt

ISBN-13 9781785885136

Pages 356 pages

Edition 1st Edition

Languages

Python

Concepts

Data Processing

Table of Contents (18) Chapters

Learning Apache Spark 2

Credits

About the Author

About the Reviewers

www.packtpub.com

Customer Feedback

Preface

1. Architecture and Installation

2. Transformations and Actions with Spark RDDs

3. ETL with Spark

4. Spark SQL

5. Spark Streaming

6. Machine Learning with Spark

7. GraphX

8. Operating in Clustered Mode

9. Building a Recommendation System

10. Customer Churn Prediction

11. There's More with Spark

Chapter 9. Building a Recommendation System

In the last chapter, we covered the concepts behind deploying Spark across various cluster types. Over this chapter and the next, we will look at some practical use cases. In this chapter, we will build a recommendation system, something most of us end up building in one way or another. We'll cover the following topics:

Overview of a recommendation system
Why do you need a recommendation system?
The long tail phenomenon
Types of recommendations
Key problems in recommendations
Content-based recommendations
Collaborative filtering
Latent factor models

This chapter should give you a good introduction to recommender systems, followed by specific code examples that solve a real-world use case: movie recommendations.

Let's get started.

What is a recommendation system?

We come across recommendation systems almost daily, whether we are buying products from Amazon, watching movies on Netflix, playing games on Xbox, finding news articles on Google, or listening to music on Spotify. These online applications recommend items based on your own history, or on the history of users who have similar interests.

Figure 9.1: Recommendation system on Amazon

Why has recommendation become such a big part of our lives when, 15-20 years ago, it was unheard of in a typical brick and mortar store? The answer lies in the fact that we are now in an era of abundance rather than scarcity. Let's drill down a bit more into this. 20 years ago, the number of products that a typical retailer stocked was limited, because shelf space was finite and real estate was expensive.

Similarly, our favourite movie shop would stock only a limited number of movies, and our bookseller only a limited number of books. I still remember 20 years ago...

User specific recommendations

During the remainder of this chapter, we will focus on user-specific ratings. Let's start by considering a model of the recommendation system.

Let's assume:

C = Set of customers.

I = Set of items (could be movies, books, news items, and so on).

R = Set of ratings. This is an ordered set, where a higher value indicates a stronger liking for a particular item and a lower value indicates a weaker liking. Generally this is represented by a real value between 0 and 1.

Let's define a utility function u, which looks at every pair of customers and items and maps it to a specific rating:

u: C × I → R

Let's give an example of a utility matrix, for a set of movies and users:

User      Godfather I   Godfather II   Good Will Hunting   A Beautiful Mind
Roger                                  1                   0.5
Aznan     1             0.7            0.2
Fawad     0.9           0.8            0.1
Adrian                                 1                   0.8
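
As a minimal sketch, the utility function and the example matrix above could be modelled in plain Python as a dictionary that stores only the known ratings. The names `ratings`, `u`, and `density` are illustrative assumptions and are not taken from the chapter's own code.

```python
from typing import Dict, Optional, Tuple

# Known ratings from the example matrix, keyed by (customer, item).
# Missing cells are simply absent, which keeps the structure sparse.
ratings: Dict[Tuple[str, str], float] = {
    ("Roger", "Good Will Hunting"): 1.0,
    ("Roger", "A Beautiful Mind"): 0.5,
    ("Aznan", "Godfather I"): 1.0,
    ("Aznan", "Godfather II"): 0.7,
    ("Aznan", "Good Will Hunting"): 0.2,
    ("Fawad", "Godfather I"): 0.9,
    ("Fawad", "Godfather II"): 0.8,
    ("Fawad", "Good Will Hunting"): 0.1,
    ("Adrian", "Good Will Hunting"): 1.0,
    ("Adrian", "A Beautiful Mind"): 0.8,
}

# C and I, derived from the pairs that have at least one rating.
customers = sorted({customer for customer, _ in ratings})
items = sorted({item for _, item in ratings})


def u(customer: str, item: str) -> Optional[float]:
    """Utility function u: C x I -> R; returns None for an unknown rating."""
    return ratings.get((customer, item))


def density() -> float:
    """Fraction of (customer, item) cells that hold a known rating."""
    return len(ratings) / (len(customers) * len(items))
```

Calling `u("Roger", "Godfather I")` returns `None` because Roger has not rated that movie; cells like this are the ones a recommender would try to predict.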

A utility matrix is generally a sparse matrix, as users rate fewer movies than they watch. The areas where ratings are missing can be either...

Key issues with recommendation systems

There are three key issues with recommender systems in general:

Gathering known input data
Predicting unknown from known ratings
Evaluating prediction methods
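
To make the third issue more concrete, here is a small sketch of one common way to evaluate predictions: hold back some known ratings, predict them, and compute the root mean squared error (RMSE). The choice of RMSE is an assumption made for illustration, since this excerpt does not name a metric, and the rating values below are made up.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between known and predicted ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical held-out ratings and the model's predictions for them.
held_out = [1.0, 0.5, 0.2]
predicted = [0.8, 0.5, 0.6]
error = rmse(held_out, predicted)
```

A lower value means the predictions are closer to the held-out ratings, which makes it possible to compare two prediction methods on the same data.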

Gathering known input data

The first interim milestone in building a recommendation system is to gather the input data, that is, customers, products, and the relevant ratings. While you already have customers and products in your CRM and other systems, you would like to get the ratings of the products from the users. There are two methods to collect product ratings:

Explicit: With explicit ratings, users directly rate a particular item, for example, a movie on Netflix or a book or product on Amazon. This is a very direct way to engage with users, and it typically provides the highest quality data. In practice, despite the incentives offered for rating an item, very few users actually leave ratings for products. Getting explicit ratings is therefore not scalable for any meaningful prediction...

Recommendation system in Spark

We are now going to move on to a practical example of building a recommendation system with Spark. Since most users are familiar with movies, we are going to use the MovieLens dataset: we will have a look at the data and then consider some of the options for building a recommender on top of it. The theory behind recommendation systems, combined with this practical example, should give you a good starting point for building one.

Sample dataset

We are going to use the MovieLens 100k dataset, which at the time of writing was last updated in October 2016. This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens (https://movielens.org/), a movie recommendation service. It contains 100,004 ratings and 1,296 tag applications across 9,125 movies. This data was created by 671 users between January 9, 1995 and October 16, 2016. The dataset was generated on October 17, 2016 and can be found at http://bit.ly/24PV0hK. Further details...
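
Before bringing Spark into the picture, it can help to look at the shape of the ratings data. The sketch below parses ratings in plain Python and assumes a CSV file with a `userId,movieId,rating,timestamp` header; that column layout is an assumption, as it is not stated in this excerpt. The `Rating` type, the `parse_ratings` function, and the sample rows are illustrative and are not taken from the dataset.

```python
import csv
from typing import Iterable, List, NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: float
    timestamp: int


def parse_ratings(lines: Iterable[str]) -> List[Rating]:
    """Parse CSV lines that start with a userId,movieId,rating,timestamp header."""
    reader = csv.DictReader(lines)
    return [
        Rating(
            int(row["userId"]),
            int(row["movieId"]),
            float(row["rating"]),
            int(row["timestamp"]),
        )
        for row in reader
    ]


# Inline sample rows, made up for illustration only.
sample = [
    "userId,movieId,rating,timestamp",
    "1,10,4.0,1000000000",
    "1,20,2.5,1000000100",
    "2,10,5.0,1000000200",
]
parsed = parse_ratings(sample)
```

To read the real data, an open file handle can be passed in place of `sample`, for example `open(path, newline="")`, where `path` points to wherever the ratings file was extracted.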

References

The following articles, blogs, and videos have been used for the contents of this chapter. They have also been included to provide users with further reading material:

Coursera course on Mining Massive Datasets by Stanford University
The Long Tail - https://www.wired.com/2004/10/tail/
Harvard CS50 - Recommender Systems - https://www.youtube.com/watch?v=Eeg1DEeWUjA
Cosine similarity - https://en.wikipedia.org/wiki/Cosine_similarity
Jaccard index - https://en.wikipedia.org/wiki/Jaccard_index
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI: http://dx.doi.org/10.1145/2827872
Movie Recommendation with MLlib - http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html

Summary

This concludes the chapter. We have gone through recommendation systems, including the theory and some practical examples in Scala. I have learned a lot of this theory from some of the courses on data mining at Coursera, which is an amazing platform. I hope we have been able to do justice to the topic. We have tried to focus a lot on the design and the factors involved in a recommendation system as I always believe that engineering the solution is the easy part once you understand what you are up against.

The next chapter focuses on another case study, churn prediction, which is one of the most popular use cases in any customer-driven organization that understands the cost of acquiring a new customer versus retaining an existing one.