Learning Apache Spark 2

You're reading from  Learning Apache Spark 2

Product type Book
Published in Mar 2017
Publisher Packt
ISBN-13 9781785885136
Pages 356 pages
Edition 1st Edition
Table of Contents (18) Chapters

Learning Apache Spark 2
Credits
About the Author
About the Reviewers
www.packtpub.com
Customer Feedback
Preface
1. Architecture and Installation
2. Transformations and Actions with Spark RDDs
3. ETL with Spark
4. Spark SQL
5. Spark Streaming
6. Machine Learning with Spark
7. GraphX
8. Operating in Clustered Mode
9. Building a Recommendation System
10. Customer Churn Prediction
11. There's More with Spark

Chapter 9. Building a Recommendation System

In the last chapter, we covered the concepts around deploying Spark across various clusters. Over the course of this and the next chapter, we will look at some practical use cases. In this chapter, we will look at building a recommendation system, which is what most of us are building in one way or another. We'll cover the following topics:

  • Overview of a recommendation system
  • Why do you need a recommendation system?
  • The long tail phenomenon
  • Types of recommendations
  • Key problems in recommendations
  • Content-based recommendations
  • Collaborative filtering
  • Latent factor models

This chapter introduces recommender systems and then follows up with specific code examples that solve a real-world use case: movie recommendations.

Let's get started.

What is a recommendation system?


We come across recommendation systems on almost a daily basis, whether we are buying products from Amazon, watching movies on Netflix, playing games on Xbox, finding news articles on Google, or listening to music on Spotify. These online applications recommend items based on your previous history, or on the history of users who have similar interests.

Figure 9.1: Recommendation system on Amazon

Why has recommendation become such a big thing in our lives when 15-20 years ago, in a typical brick and mortar store, it was unheard of? The answer lies in the fact that we are now in an era of abundance rather than scarcity. Let's drill down a bit more into this. 20 years ago, the number of products that a typical retailer stocked was limited. The reasons were limited shelf space and expensive real estate.

Similarly, our favourite movie shop would only stock a limited number of movies and our bookseller would only stock a limited number of books. I still remember 20 years ago...

User specific recommendations


During the remainder of this chapter, we will focus on user-specific ratings. Let's start by considering a model of the recommendation system.

Let's assume:

C = Set of customers.

I = Set of items (could be movies, books, news items, and so on).

R = Set of ratings. This is an ordered set, where a higher value indicates a stronger liking for a particular item and a lower value indicates a weaker liking. Generally, a rating is represented by a real value between 0 and 1.

Let's define a utility function u, which looks at every pair of customers and items and maps it to a specific rating:

u: C × I → R

Let's give an example of a utility matrix, for a set of movies and users:

Movies (columns): Godfather I, Godfather II, Good Will Hunting, A Beautiful Mind

Known ratings per user (empty cells omitted):

  • Roger: 1, 0.5
  • Aznan: 1, 0.7, 0.2
  • Fawad: 0.9, 0.8, 0.1
  • Adrian: 1, 0.8

A utility matrix is generally a sparse matrix, as users rate fewer movies than they watch. The areas where ratings are missing can be either...
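The utility function u and the sparsity of the utility matrix can be sketched in a few lines of Scala. This is only a minimal illustration: the object name, the customer and item identifiers, and the sample ratings are all hypothetical and not taken from the book. The idea is to store only the known cells, so that a missing key stands for a missing rating.

```scala
// Sketch of a sparse utility matrix u: C × I → R.
// All names and sample ratings are hypothetical, chosen only for illustration.
object UtilityMatrixSketch {
  type Customer = String
  type Item = String

  // Only known ratings are stored; a missing key means "no rating available".
  val known: Map[(Customer, Item), Double] = Map(
    ("alice", "movie-a") -> 1.0,
    ("alice", "movie-b") -> 0.5,
    ("bob", "movie-a") -> 0.9
  )

  // u(c, i): Some(rating) when the cell is filled, None when it is empty.
  def u(c: Customer, i: Item): Option[Double] = known.get((c, i))

  // Fraction of the |C| x |I| cells that hold no rating.
  // Assumes every key in `known` belongs to the given customers and items.
  def sparsity(customers: Set[Customer], items: Set[Item]): Double = {
    val cells = customers.size.toDouble * items.size
    1.0 - known.size / cells
  }
}
```

With two customers and three items, three known ratings leave half of the six cells empty, which is the kind of gap a recommender tries to fill with predictions.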

Key issues with recommendation systems


There are three key issues with recommender systems in general:

  1. Gathering known input data
  2. Predicting unknown from known ratings
  3. Evaluating prediction methods
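To make the third issue concrete, here is a hedged sketch of one common way to evaluate predictions: hold back some known ratings, predict them, and compute the root mean squared error (RMSE). The text above does not say which metric the chapter uses, so RMSE is an assumption here, and the object and function names are hypothetical.

```scala
// Sketch: RMSE over (predicted, actual) rating pairs.
// RMSE is an assumed metric; the names below are illustrative only.
object RmseSketch {
  // Each pair is (predicted rating, actual rating) for a held-out known rating.
  def rmse(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one (predicted, actual) pair")
    // Mean of the squared differences between prediction and truth.
    val meanSquaredError =
      pairs.map { case (predicted, actual) =>
        val diff = predicted - actual
        diff * diff
      }.sum / pairs.size
    math.sqrt(meanSquaredError)
  }
}
```

A lower value means the predictions sit closer to the held-out ratings; a value of 0 means every held-out rating was predicted exactly.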

Gathering known input data

The first interim milestone in building a recommendation system is to gather the input data, that is, customers, products, and the relevant ratings. While you already have customers and products in your CRM and other systems, you would like to get the ratings of the products from the users. There are two methods to collect product ratings:

  • Explicit: Explicit ratings means the users would explicitly rate a particular item, for example, a movie on Netflix, a book/product on Amazon, and so on. This is a very direct way to engage with users and it typically provides the highest quality data. In real life, despite the incentives given to rate an item, very few users actually leave ratings for the products. Getting explicit ratings is therefore not scalable for any meaningful prediction...

Recommendation system in Spark


We are now going to move ahead with a practical example of building a recommendation system with Spark. Since most users are familiar with movies, we are going to use the MovieLens dataset to build the recommendation system, have a look at the data, and look at some of the options. The theory behind recommendation systems and this practical example should give you a good starting point for building one.

Sample dataset

We are going to use the MovieLens 100k dataset, which at the time of writing was last updated in October 2016. This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens (https://movielens.org/), a movie recommendation service. It contains 100,004 ratings and 1,296 tag applications across 9,125 movies. This data was created by 671 users between January 09, 1995 and October 16, 2016. This dataset was generated on October 17, 2016 and it can be found at http://bit.ly/24PV0hK. Further details...
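As a starting point for looking at the data, the following sketch parses rating rows into a small case class. It assumes the ratings file in ml-latest-small is a comma-separated file whose header is userId,movieId,rating,timestamp; check the README shipped with the dataset before relying on that. The object, case class, and function names are hypothetical, and the sketch deliberately uses plain Scala rather than any Spark API.

```scala
import scala.util.Try

// Sketch: parse rows of a MovieLens-style ratings file.
// Assumed column layout: userId,movieId,rating,timestamp (verify against the dataset README).
object RatingsParserSketch {
  // One rating row; the field names mirror the assumed header.
  final case class Rating(userId: Int, movieId: Int, rating: Double, timestamp: Long)

  // Returns None for the header row or for any malformed line.
  def parseLine(line: String): Option[Rating] =
    line.split(",").map(_.trim) match {
      case Array(u, m, r, t) =>
        Try(Rating(u.toInt, m.toInt, r.toDouble, t.toLong)).toOption
      case _ => None
    }

  // Keeps only the lines that parse cleanly.
  def parseAll(lines: Iterator[String]): List[Rating] =
    lines.flatMap(line => parseLine(line)).toList
}
```

For a local file, the lines could come from scala.io.Source.fromFile(path).getLines(), where path points at wherever you unpacked the dataset. In a Spark job, the same parseLine function could be applied to each line of a text file that Spark has loaded.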

References


The following articles, blogs, and videos have been used for the contents of this chapter. They have also been included to provide users with further reading material:

  1. Coursera course on Mining Massive Datasets by Stanford University.
  2. The Long Tail - https://www.wired.com/2004/10/tail/
  3. Harvard CS50 - Recommender Systems - https://www.youtube.com/watch?v=Eeg1DEeWUjA
  4. https://en.wikipedia.org/wiki/Cosine_similarity
  5. https://en.wikipedia.org/wiki/Jaccard_index
  6. F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872
  7. http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html

Summary


This concludes the chapter. We have gone through recommendation systems, including the theory and some practical examples in Scala. I have learned a lot of this theory from some of the courses on data mining at Coursera, which is an amazing platform. I hope we have been able to do justice to the topic. We have tried to focus a lot on the design and the factors involved in a recommendation system as I always believe that engineering the solution is the easy part once you understand what you are up against.

The next chapter focuses on another case study, churn prediction, which is one of the most popular use cases in any customer-driven organization that understands the cost of acquiring a new customer versus retaining an existing one.
