You're reading from Learning Apache Spark 2

Product type Book

Published in Mar 2017

Publisher Packt

ISBN-13 9781785885136

Pages 356 pages

Edition 1st Edition

Languages

Python

Concepts

Data Processing

Table of Contents (18) Chapters

Learning Apache Spark 2

Credits

About the Author

About the Reviewers

www.packtpub.com

Customer Feedback

Preface

1. Architecture and Installation

2. Transformations and Actions with Spark RDDs

3. ETL with Spark

4. Spark SQL

5. Spark Streaming

6. Machine Learning with Spark

7. GraphX

8. Operating in Clustered Mode

9. Building a Recommendation System

10. Customer Churn Prediction

11. There's More with Spark

Chapter 6. Machine Learning with Spark

We have spent a considerable amount of time on the architecture of Spark, RDDs, the DataFrame and Dataset APIs, Spark SQL, and Streaming. All of that was groundwork for the subject of this chapter: machine learning. So far, our focus has been on getting data onto the Spark platform, either in batch or in streaming fashion, and transforming it into the desired state.

Once you have the data in the platform, what do you do with it? You can use it for reporting, build dashboards, or let your data scientists analyze it to detect patterns, identify the reasons for specific events, understand customer behavior, group customers into segments to aid decision making, or predict the future.

The power of Spark's MLlib stems from the fact that it lets you run your algorithms over a distributed dataset, which can sometimes be its weakness too, as not all algorithms...

What is machine learning?

Machine learning is a branch of AI that gives computers the ability to learn new patterns with little to no human intervention. Machine learning models learn from previous computations and produce more accurate results as more data is crunched. A very simple example is Facebook's face detection algorithm, which uses machine learning techniques to identify the people in pictures and is refined over time. Machine learning has its roots in computational statistics and has also been referred to as data mining, although data mining focuses more on the unsupervised learning part of machine learning. To some people machine learning is still science fiction; however, it is now used in everyday life, from predicting fraud, to recommending new products and services to customers, to predicting when your car needs a service.

Is machine learning a new phenomenon? Almost 75 years ago in the Bulletin of Mathematical Biophysics, Warren S. McCulloch (http://bit.ly/2eSkb1q...

Why machine learning?

While we have given some examples of why you need machine learning, it is helpful to look at some sample use cases. Machine learning touches us on a daily basis, from fraud detection, banking, and credit risk assessment to predicting customer churn and sales volumes. People from a statistics background might say, "Hey, I have done all of that using simple statistics". The answer is that you have probably used many of the techniques discussed in this book under a different name, as there is a huge overlap between statistics, data mining, and machine learning.

Some example use cases include:

Credit risk: Financial institutions predict how likely a borrower is to meet its debt obligations under the agreed terms. They need to manage the credit risk inherent in the whole portfolio, in addition to the risk in individual credits or transactions.
Self-driving cars: They are the talk of the town, with everyone planning...

Types of machine learning

There are four major categories of machine learning algorithms:

Supervised learning: In supervised learning, we have data that has already been labeled and can be used to train a model, which can later be used to predict the labels of new, unlabeled data. A simple example is data on customers who have previously churned, or people who have defaulted on their loans. We can use this data to train a model and understand the behaviors demonstrated by churners or loan defaulters. Once the model is trained, we can use it to detect churners or loan defaulters by looking at similar attributes and estimating the likelihood that a person will churn or default. This is also sometimes known as predictive modeling or predictive analytics. Example algorithms include:
- Decision trees
- Regression
- Neural networks
- SVM
Figure 6.3: Supervised learning
Unsupervised learning: In unsupervised learning, there is no pre-existing data with...

Introduction to Spark MLlib

MLlib stands for Machine Learning Library in Spark and is designed to make machine learning scalable, approachable, and easy for data scientists and engineers. It was created in the Berkeley AMPLab and shipped with Spark 0.8.

Spark MLlib is a very active project with substantial contributions from the community and ever-growing coverage of machine learning algorithms in the areas of classification, regression, clustering, and recommendation, along with utilities such as feature extraction, feature selection, summary statistics, linear algebra, and frequent pattern mining.

Version 0.8 started small with the introduction of limited algorithms, such as:

KMeans
Alternating Least Squares (ALS)
Gradient Descent (Optimization Technique)

From an API perspective, support for these algorithms was made available in the following programming languages:

Java
Scala

The amazing pace of MLlib development can be gauged from the fact that within 3 months, version 0.9 was launched, which added the following...

Why do we need the Pipeline API?

Before digging into the details of the Pipeline API, it is important to understand what a machine learning pipeline means, and why we need a Pipeline API.

It is important to understand that you cannot have an efficient machine learning platform if the only thing you provide is a collection of algorithms for people to use. Machine learning is an involved process with multiple steps, and the machine learning algorithm itself is just one (though very important) step in that process. As an example, consider text classification: you have a corpus of text, and you want to classify whether each document is a sports article or not. We would like to simplify the label to a 1 or a 0, where 1 indicates the article is about sports and 0 indicates it is not. This is a supervised machine learning flow, where we use data with existing labels to predict the labels for data with no labels.

You would need to collect this data. Preprocess...

How does it work?

A pipeline is a sequence of stages, and each stage is either a Transformer or an Estimator. The stages are run in order, and the input DataFrame is transformed as it passes through each stage:

Transformer stages: The transform() method is called on the DataFrame
Estimator stages: The fit() method is called on the DataFrame to produce a Transformer, which then transforms the DataFrame

A pipeline is created by declaring its stages, configuring their parameters, and then chaining them in a pipeline object. For example, to create a simple text classification pipeline we would tokenize the text into words, use the hashing term-frequency feature extractor (HashingTF) to extract features, and then build a logistic regression model.
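To make the Transformer and Estimator contract concrete, here is a toy sketch in plain Python (no Spark required). The class names mirror Spark ML's Tokenizer, HashingTF, LogisticRegression, and Pipeline, but the bodies are simplified stand-ins written for this illustration. The assumptions are that rows are plain dictionaries rather than a DataFrame, that CRC32 serves as the hash function, that training uses simple stochastic gradient descent, and that the tiny data set is made up.

```python
import math
import zlib
from collections import Counter


class Tokenizer:
    """Transformer: adds a 'words' column derived from the 'text' column."""
    def transform(self, rows):
        return [{**row, "words": row["text"].lower().split()} for row in rows]


class HashingTF:
    """Transformer: hashes each word to a bucket and counts term frequencies."""
    def __init__(self, num_features=1 << 16):
        self.num_features = num_features
    def transform(self, rows):
        out = []
        for row in rows:
            buckets = (zlib.crc32(w.encode("utf-8")) % self.num_features
                       for w in row["words"])
            out.append({**row, "features": dict(Counter(buckets))})
        return out


class LogisticRegressionModel:
    """Transformer returned by LogisticRegression.fit(): adds 'prediction'."""
    def __init__(self):
        self.weights, self.bias = {}, 0.0
    def probability(self, features):
        z = self.bias + sum(self.weights.get(i, 0.0) * v for i, v in features.items())
        return 1.0 / (1.0 + math.exp(-z))
    def transform(self, rows):
        return [{**row, "prediction": int(self.probability(row["features"]) >= 0.5)}
                for row in rows]


class LogisticRegression:
    """Estimator: fit() learns from labeled rows and returns a Transformer."""
    def __init__(self, max_iter=100, step_size=0.1):
        self.max_iter, self.step_size = max_iter, step_size
    def fit(self, rows):
        model = LogisticRegressionModel()
        for _ in range(self.max_iter):
            for row in rows:  # stochastic gradient descent on the log loss
                step = self.step_size * (model.probability(row["features"]) - row["label"])
                model.bias -= step
                for i, v in row["features"].items():
                    model.weights[i] = model.weights.get(i, 0.0) - step * v
        return model


class PipelineModel:
    """Fitted pipeline: every stage is now a Transformer."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Runs the stages in order, fitting Estimators along the way."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):       # Estimator: fit() yields a Transformer
                stage = stage.fit(rows)
            rows = stage.transform(rows)    # Transformer: rewrite the rows
            fitted.append(stage)
        return PipelineModel(fitted)


# Made-up training data: label 1 = sports, 0 = not sports.
training = [
    {"text": "The team won the football match", "label": 1},
    {"text": "The striker scored a late goal", "label": 1},
    {"text": "Parliament passed the new budget", "label": 0},
    {"text": "The bank raised interest rates", "label": 0},
]
pipeline = Pipeline([Tokenizer(), HashingTF(), LogisticRegression()])
model = pipeline.fit(training)
unseen = [{"text": "football match goal"}, {"text": "parliament budget interest rates"}]
predictions = [row["prediction"] for row in model.transform(unseen)]
print(predictions)
```

The point of the sketch is the shape of the flow, not the classifier. Fitting the pipeline turns the one Estimator into a Transformer, and the resulting model applies the same three transformations, in the same order, to unseen text.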

Tip

Make sure the Apache Spark ML JAR is on the classpath, or include it as a dependency when you do the initial build.

Scala syntax - building a pipeline

This pipeline can be built as follows using the Scala API:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg...

Feature engineering

Feature engineering is perhaps the most important topic in machine learning. Whether a model succeeds or fails at predicting the future depends primarily on how you engineer features to get a better lift. What separates an experienced data scientist from a novice is the ability to engineer features from the given data sets, and this is perhaps the most difficult and time-consuming aspect of machine learning. This is where understanding the business problem is key. Feature engineering is more an art than a science, and it is needed to frame the problem. So what is feature engineering?

Feature engineering is the process of transforming raw data into features that better represent the underlying business problem to the predictive models, resulting in improved model accuracy on unseen data.
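As a small illustration of what transforming raw data into features can look like, the sketch below turns raw customer records into numeric feature vectors using two common transformations, min-max scaling and one-hot encoding. It is plain Python rather than Spark code, and the records, field names, and helper functions are hypothetical, invented for this example.

```python
def min_max_scale(values):
    """Rescale numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Encode each category as a 0/1 vector, one slot per distinct category."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


# Hypothetical raw records: (monthly_spend, plan)
raw = [(20.0, "basic"), (60.0, "premium"), (40.0, "basic")]

spend = min_max_scale([record[0] for record in raw])
plan = one_hot([record[1] for record in raw])
features = [[s] + p for s, p in zip(spend, plan)]
print(features)
# [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.5, 1.0, 0.0]]
```

Spark ML ships comparable transformers, such as MinMaxScaler and OneHotEncoder, which do the same kind of work on DataFrame columns.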

Due to the importance of feature engineering, Spark provides algorithms for working with features divided into three major groups:

Feature...

Classification and regression

Apache Spark provides a number of classification and regression algorithms. The main algorithms are listed as follows.

Classification

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Typically in classification cases, the dependent variables are categorical. A very common example is classification of e-mail as spam versus not spam. The major algorithms that come with Spark include the following:

Logistic regression
Decision tree classifier
Random forest classifier
Gradient-boosted tree classifier
Multilayer perceptron classifier
One-vs-Rest classifier
Naïve Bayes

Regression

In machine learning and statistics, regression is the process of estimating or predicting a response using a model trained on previous data sets....

Clustering

For most of this chapter, we have focused on supervised machine learning techniques, where we train a model on labeled data before using it for predictions. Clustering is an unsupervised machine learning technique, used in customer segmentation, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.

Apache Spark provides various clustering algorithms, including:

K-Means
Latent Dirichlet Allocation (LDA)
Bisecting K-Means
Gaussian Mixture Models
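To give a feel for what the first algorithm in this list does, here is a minimal single-machine sketch of K-Means (Lloyd's algorithm) in plain Python. It is not Spark's implementation, which is distributed and uses a smarter initialization. The function names, the naive choice of the first k points as starting centroids, and the sample points are all assumptions made for this illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=20):
    """Return (centroids, labels) after running Lloyd's algorithm."""
    centroids = [points[i] for i in range(k)]   # naive init: first k points
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:                          # leave an empty cluster where it is
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
        if new_labels == labels:                 # assignments stable: converged
            break
        labels = new_labels
    return centroids, labels


# Hypothetical data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, labels = kmeans(points, k=2)
print(labels)
# [0, 0, 0, 1, 1, 1]
```

No labels are supplied anywhere. The grouping emerges purely from the distances between points, which is what makes this an unsupervised technique.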

Collaborative filtering

Most of us will have used eBay, Amazon, or another popular web retailer, and most of us will have seen recommendations based on our past choices. For example, if you buy an electric toothbrush, Amazon will recommend some extra brush heads, as these items normally go together. All of these suggestions come from recommender systems, which are typically based on collaborative filtering techniques.

Collaborative filtering algorithms recommend items (this is the filtering part) based on preference information from many users (this is the collaborative part). The collaborative filtering approach is based on similarity; the basic idea is that people who liked similar items in the past will like similar items in the future.

In the following example, Adrian likes the movies Mission Impossible, Skyfall, Casino Royale, and Spectre. Bob likes the movies Skyfall, Casino Royale, and Spectre. Andrew likes the movies Skyfall and Spectre.

To recommend a movie to Andrew, we calculate...
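One way to make the Andrew example concrete is the following plain-Python sketch. It scores each movie Andrew has not seen by the summed similarity of the users who liked it. The use of Jaccard similarity is an assumption made here for simplicity and is not necessarily the calculation the chapter goes on to describe. Spark's own recommender is based on Alternating Least Squares rather than this neighborhood approach.

```python
likes = {
    "Adrian": {"Mission Impossible", "Skyfall", "Casino Royale", "Spectre"},
    "Bob": {"Skyfall", "Casino Royale", "Spectre"},
    "Andrew": {"Skyfall", "Spectre"},
}


def jaccard(a, b):
    """Size of the overlap divided by size of the union, between 0 and 1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, likes):
    """Rank unseen movies by the summed similarity of the users who liked them."""
    scores = {}
    for other, movies in likes.items():
        if other == user:
            continue
        similarity = jaccard(likes[user], movies)
        for movie in movies - likes[user]:       # only movies the user has not seen
            scores[movie] = scores.get(movie, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranking = recommend("Andrew", likes)
print([movie for movie, _ in ranking])
# ['Casino Royale', 'Mission Impossible']
```

Casino Royale ranks first because both Adrian and Bob liked it, and Bob, whose taste overlaps most with Andrew's, contributes the larger share of its score.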

ML-tuning - model selection and hyperparameter tuning

Model development is one of the major tasks in machine learning. An equally important task is selecting the best model from a list of candidates and tuning it for optimal performance. Tuning can be done for individual steps or for the entire pipeline, which may include multiple algorithms, feature engineering, transformations, and selections.

MLlib supports model selection using the following tools:

CrossValidator
TrainValidationSplit

We will look at model tuning in Chapter 9, Building a Recommendation System, to see how we can minimize mean squared error, one of the characteristics of a good model.
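The idea behind cross-validated tuning can be sketched in a few lines of plain Python. The example below is not the Spark API. It picks a regularization strength for a one-parameter ridge regression by comparing the average held-out mean squared error across three folds. The data, the parameter grid, and the helper names are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def fit_ridge(xs, ys, reg_param):
    """Closed-form fit of y = w * x with an L2 penalty (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + reg_param)


def mean_squared_error(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_validate(xs, ys, reg_param, num_folds=3):
    """Average held-out MSE over num_folds train/validation splits."""
    fold_errors = []
    for fold in range(num_folds):
        train = [i for i in range(len(xs)) if i % num_folds != fold]
        held_out = [i for i in range(len(xs)) if i % num_folds == fold]
        w = fit_ridge([xs[i] for i in train], [ys[i] for i in train], reg_param)
        fold_errors.append(
            mean_squared_error(w, [xs[i] for i in held_out], [ys[i] for i in held_out])
        )
    return sum(fold_errors) / num_folds


# Hypothetical data: y is roughly 2 * x with a little noise.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
noise = [0.1, -0.1, 0.2, -0.2, 0.1, -0.1, 0.2, -0.2, 0.0]
ys = [2 * x + n for x, n in zip(xs, noise)]

param_grid = [0.0, 10.0, 100.0]                  # candidate regularization values
scores = {p: cross_validate(xs, ys, p) for p in param_grid}
best_param = min(scores, key=scores.get)         # lowest average held-out MSE wins
print(best_param)
# 0.0
```

In Spark, CrossValidator plays the role of this loop. You hand it an estimator (often a whole pipeline), a grid of parameter combinations, and an evaluator, and it returns the model fitted with the best-scoring combination. TrainValidationSplit does the same with a single split instead of several folds, which is cheaper but gives a less reliable estimate.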

References

The following articles, blog posts, and videos were used as resources during the preparation of this chapter. They are also useful as further reading.

Summary

In this chapter we have covered machine learning basics, the types of machine learning, an introduction to Spark MLlib and the Pipeline API, an example of building a pipeline, and the algorithms Spark provides for feature engineering, classification, regression, clustering, and collaborative filtering.

Machine learning is an advanced topic, and it is impossible to cover the depth and breadth of the topic in such a small chapter. However, I hope this chapter gives you a flavor of what is available within Spark and where you can go for further information. The references section contains the details of the topics. For machine learning, I would recommend Practical Machine Learning or Master Machine Learning with Spark, both of which have been published by Packt Publishing and are really good books to give you a more in-depth understanding of machine learning.

The next chapter covers GraphX, which is quite a hot topic. We'll cover the basics of...