
You're reading from Machine Learning with Spark - Second Edition

Product type: Book
Published in: Apr 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781785889936
Edition: 2nd
Authors (2):

Rajdeep Dua

Rajdeep Dua has over 18 years' experience in the cloud and big data space. He has taught Spark and big data at some of the most prestigious tech schools in India: IIIT Hyderabad, ISB, IIIT Delhi, and Pune College of Engineering. He currently leads the developer relations team at Salesforce India. He has also presented BigQuery and Google App Engine at the W3C conference in Hyderabad. He led the developer relations teams at Google, VMware, and Microsoft, and has spoken at hundreds of other conferences on the cloud. Some of the other references to his work can be seen at Your Story and in the ACM Digital Library. His contributions to the open source community relate to Docker, Kubernetes, Android, OpenStack, and Cloud Foundry.
Read more about Rajdeep Dua

Manpreet Singh Ghotra

Manpreet Singh Ghotra has more than 15 years' experience in software development for both enterprise and big data software. He is currently working at Salesforce on developing a machine learning platform/APIs using open source libraries and frameworks such as Keras, Apache Spark, and TensorFlow. He has worked on various machine learning systems, including sentiment analysis, spam detection, and anomaly detection. He was part of the machine learning group at one of the largest online retailers in the world, working on transit-time calculations using Apache Mahout and on the R recommendation system, again using Apache Mahout. With a master's and postgraduate degree in machine learning, he has contributed to, and worked for, the machine learning community.
Read more about Manpreet Singh Ghotra


Building a Recommendation Engine with Spark

Now that you have learned the basics of data processing and feature extraction, we will move on to explore individual machine learning models in detail, starting with recommendation engines.

Recommendation engines are probably among the best types of machine learning models known to the general public. Even if people do not know exactly what a recommendation engine is, they have most likely experienced one through the use of popular websites, such as Amazon, Netflix, YouTube, Twitter, LinkedIn, and Facebook. Recommendations are a core part of all these businesses, and in some cases, they drive significant percentages of their revenue.

The idea behind recommendation engines is to predict what people might like and to uncover relationships between items to aid in the discovery process; in this way, they are similar and, in fact, often complementary to search engines, which...

Types of recommendation models

Recommender systems are widely studied, and while many approaches are used, two are probably the most prevalent: content-based filtering and collaborative filtering. Recently, other approaches, such as ranking models, have also gained in popularity. In practice, many approaches are hybrids, incorporating elements of many different methods into a model or combination of models.

Content-based filtering

Content-based methods try to use the content or attributes of an item, together with some notion of similarity between two pieces of content, to generate items similar to a given item. These attributes are often textual content, such as titles, names, tags, and other metadata attached to an item, or in the case of media,...
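To make this concrete, here is a minimal, self-contained sketch of the idea in plain Scala (not the book's code): each item is described by a tag-count vector, and similarity between two items is the cosine of the angle between their vectors. The movie names and tags are invented for illustration:

```scala
object ContentSimilarity {
  // Cosine similarity between two sparse tag-count vectors.
  def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
    val dot = a.keySet.intersect(b.keySet).toSeq.map(k => a(k) * b(k)).sum
    val norm = (v: Map[String, Double]) => math.sqrt(v.values.map(x => x * x).sum)
    if (norm(a) == 0.0 || norm(b) == 0.0) 0.0 else dot / (norm(a) * norm(b))
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical item metadata: each movie described by its tags.
    val items = Map(
      "Movie A" -> Map("sci-fi" -> 1.0, "space" -> 1.0, "drama" -> 1.0),
      "Movie B" -> Map("sci-fi" -> 1.0, "space" -> 1.0),
      "Movie C" -> Map("romance" -> 1.0, "drama" -> 1.0)
    )
    // Rank the other items by similarity to "Movie A";
    // Movie B shares two tags and ranks above Movie C.
    val ranked = (items - "Movie A").toSeq
      .map { case (name, tags) => name -> cosine(items("Movie A"), tags) }
      .sortBy(-_._2)
    ranked.foreach { case (name, score) => println(f"$name: $score%.3f") }
  }
}
```

A real content-based recommender would build such vectors from titles, tags, and other metadata (for example, via TF-IDF), but the scoring step is essentially this dot-product-over-norms computation.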

Extracting the right features from your data

In this section, we will use explicit rating data, without any additional user or item metadata or other information related to the user-item interactions. Hence, the features we need as inputs are simply the user IDs, movie IDs, and the rating assigned to each user-movie pair.

Extracting features from the MovieLens 100k dataset

In this example, we will use the same MovieLens dataset that we used in the previous chapter. Use the directory in which you placed the MovieLens 100k dataset as the input path in the following code.

First, let's inspect the raw ratings dataset:

import org.apache.spark.sql.{Dataset, SparkSession}

object FeatureExtraction {

  // Schema of the MovieLens 100k u.data file (tab-separated).
  case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)

  def getFeatures(): Dataset[FeatureExtraction.Rating] = {
    val spark = SparkSession.builder.master("local[2]").getOrCreate()
    import spark.implicits._
    // Point this at the directory where you placed the dataset.
    spark.read.textFile("PATH_TO_MOVIELENS_100K/u.data").map { line =>
      val fields = line.split("\t")
      Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
    }
  }
}

Training the recommendation model

Once we have extracted these simple features from our raw data, we are ready to proceed with model training; Spark ML takes care of this for us. All we have to do is provide the correctly parsed input dataset we just created, as well as our chosen model parameters.

Split the dataset into training and test sets with an 80:20 ratio, as shown in the following lines of code:

def createALSModel(): Unit = {
  val ratings = FeatureExtraction.getFeatures()

  val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))
  println(training.first())
}

You will see the following output:

16/09/07 13:23:28 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1768 bytes result sent to driver

16/09/07 13:23:28 INFO TaskSetManager...

Using the recommendation model

Now that we have our trained model, we're ready to use it to make predictions.

ALS Model recommendations

Starting with Spark v2.0, org.apache.spark.ml.recommendation.ALS provides a blocked implementation of the factorization algorithm that groups the user and product factors into blocks and reduces communication by sending only one copy of each user vector to each product block per iteration, and only to the product blocks that need that user's feature vector.
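What the factorization produces is easy to sketch in plain Scala (this is an illustration of the model's prediction rule, not Spark's blocked implementation): ALS learns a low-dimensional factor vector per user and per item, and a predicted rating is simply the dot product of the two. The factor values below are invented:

```scala
object FactorizationSketch {
  // Predicted rating = dot product of the user and item factor vectors.
  def predict(userFactors: Array[Double], itemFactors: Array[Double]): Double =
    userFactors.zip(itemFactors).map { case (u, i) => u * i }.sum

  def main(args: Array[String]): Unit = {
    // Invented rank-3 factors for one user and two movies.
    val user   = Array(0.8, 0.1, 0.5)
    val movie1 = Array(1.0, 0.0, 1.0)  // aligned with the user's tastes
    val movie2 = Array(0.0, 1.0, 0.0)  // orthogonal to them
    println(predict(user, movie1))     // higher score
    println(predict(user, movie2))     // lower score
  }
}
```

ALS alternates between solving for the user factors with the item factors held fixed and vice versa, which is what makes the blocked, distributed formulation above possible.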

Here, we will load the rating data from the movies dataset, where each row consists of a user, a movie, a rating, and a timestamp. We will then train an ALS model, which by default works on explicit preferences (implicitPrefs is false). We will evaluate...
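A training call along these lines can be sketched with Spark ML's ALS estimator; the hyperparameter values below (rank, iterations, regularization) are illustrative choices, not the book's exact settings, and `training`/`test` are the splits produced earlier:

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Configure ALS for explicit feedback (implicitPrefs defaults to false).
val als = new ALS()
  .setRank(10)        // size of the latent factor vectors
  .setMaxIter(10)
  .setRegParam(0.1)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")

val model = als.fit(training)

// Drop predictions for users/items unseen during training
// (coldStartStrategy requires Spark 2.2+).
model.setColdStartStrategy("drop")
val predictions = model.transform(test)

val rmse = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")
  .evaluate(predictions)
println(s"RMSE = $rmse")
```

This fragment assumes a running SparkSession and the column names from the Rating schema used in this chapter.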

Evaluating the performance of recommendation models

How do we know whether the model we have trained is a good model? We will need to be able to evaluate its predictive performance in some way. Evaluation metrics are measures of a model's predictive capability or accuracy. Some are direct measures of how well a model predicts its target variable, such as Mean Squared Error, while others are concerned with how well the model performs at predicting things that might not be directly optimized in the model, but are often closer to what we care about in the real world, such as Mean Average Precision.
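As a quick illustration of the first kind of metric, here is a plain-Scala sketch of Mean Squared Error over (predicted, actual) rating pairs; the numbers are made up:

```scala
object Metrics {
  // Mean Squared Error: average of the squared prediction errors.
  def mse(pairs: Seq[(Double, Double)]): Double =
    pairs.map { case (predicted, actual) =>
      (predicted - actual) * (predicted - actual)
    }.sum / pairs.size

  def main(args: Array[String]): Unit = {
    // (predicted rating, actual rating) for three user-movie pairs.
    val predictionsAndRatings = Seq((4.5, 5.0), (3.0, 3.0), (2.0, 4.0))
    println(Metrics.mse(predictionsAndRatings))  // (0.25 + 0.0 + 4.0) / 3
  }
}
```

Taking the square root of this value gives RMSE, which is on the same scale as the ratings themselves and is the metric Spark's RegressionEvaluator reports by default for this task.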

Evaluation metrics provide a standardized way of comparing the performance of the same model with different parameter settings and of comparing performance across different models. Using these metrics, we can perform model selection to choose the best-performing model from the set of models we wish to evaluate...

FP-Growth algorithm

We will apply the FP-Growth algorithm to find frequently recommended movies.

The FP-Growth algorithm is described in the paper by Han et al., Mining frequent patterns without candidate generation, available at http://dx.doi.org/10.1145/335191.335372, where FP stands for frequent pattern. Given a dataset of transactions, the first step of FP-Growth is to calculate item frequencies and identify the frequent items. The second step uses a suffix-tree (FP-tree) structure to encode the transactions; this avoids explicitly generating candidate sets, which is usually expensive for large datasets.
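To illustrate what the first step computes, here is a brute-force plain-Scala sketch (not the FP-tree implementation itself) that counts, for each item, the number of transactions containing it and keeps the items meeting a minimum support threshold; the transactions are invented:

```scala
object FrequentItems {
  // Count how many transactions contain each item;
  // keep items whose count meets the minimum support.
  def frequentItems(transactions: Seq[Set[String]],
                    minSupport: Int): Map[String, Int] =
    transactions.flatMap(_.toSeq)
      .groupBy(identity)
      .map { case (item, occurrences) => item -> occurrences.size }
      .filter { case (_, count) => count >= minSupport }

  def main(args: Array[String]): Unit = {
    val transactions = Seq(
      Set("m1", "m2", "m3"),
      Set("m1", "m2"),
      Set("m2", "m4")
    )
    // m2 appears in all three transactions, m1 in two; m3 and m4 are filtered out.
    println(frequentItems(transactions, minSupport = 2))
  }
}
```

FP-Growth then extends this idea from single items to frequent itemsets, using the FP-tree to avoid enumerating every candidate combination.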

FP-Growth Basic Sample

Let's start with a very simple dataset of random numbers:

val transactions...

Summary

In this chapter, we used Spark's ML and MLlib libraries to train a collaborative filtering recommendation model, and you learned how to use this model to make predictions for the items that a given user may have a preference for. We also used our model to find items that are similar or related to a given item. Finally, we explored common metrics to evaluate the predictive capability of our recommendation model.

In the next chapter, you will learn how to use Spark to train a model to classify your data and to use standard evaluation mechanisms to gauge the performance of your model.

