You're reading from Apache Spark 2.x Machine Learning Cookbook
Product type: Book | Published: Sep 2017 | Reading level: Intermediate | Publisher: Packt | ISBN-13: 9781783551606 | Edition: 1st
Authors (5):

Mohammed Guller
Author of Big Data Analytics with Spark - http://www.apress.com/9781484209653

Siamak Amirghodsi
Siamak Amirghodsi (Sammy) is interested in building advanced technical teams, executive management, Spark, Hadoop, big data analytics, AI, deep learning nets, TensorFlow, cognitive models, swarm algorithms, real-time streaming systems, quantum computing, financial risk management, trading signal discovery, econometrics, long-term financial cycles, IoT, blockchain, probabilistic graphical models, cryptography, and NLP.

Shuen Mei
Shuen Mei is a big data analytics platforms expert with 15+ years of experience in designing, building, and executing large-scale, enterprise-distributed financial systems with mission-critical low-latency requirements. He is certified in Apache Spark and the Cloudera Big Data platform (Developer, Admin, and HBase), and is also a certified AWS solutions architect with an emphasis on petabyte-range real-time data platform systems.

Meenakshi Rajendran
Meenakshi Rajendran is experienced in the end-to-end delivery of data analytics and data science products for leading financial institutions. She holds a master's degree in business administration and is a certified PMP with over 13 years of experience in global software delivery environments. Her areas of research and interest are Apache Spark, cloud, regulatory data governance, machine learning, Cassandra, and managing global data teams at scale.

Broderick Hall
Broderick Hall is a hands-on big data analytics expert who holds a master's degree in computer science and has 20 years of experience in designing and developing complex enterprise-wide software applications with real-time and regulatory requirements at a global scale. He is a deep learning early adopter and is currently working on a large-scale cloud-based data platform with deep learning net augmentation.

Chapter 4. Common Recipes for Implementing a Robust Machine Learning System

In this chapter, we will cover:

  • Spark's basic statistical API to help you build your own algorithms
  • ML pipelines for real-life machine learning applications
  • Normalizing data with Spark
  • Splitting data for training and testing
  • Common operations with the new Dataset API
  • Creating and using RDD versus DataFrame versus Dataset from a text file in Spark 2.0
  • LabeledPoint data structure for Spark ML
  • Getting access to Spark cluster in Spark 2.0+
  • Getting access to Spark cluster pre-Spark 2.0
  • Getting access to SparkContext vis-a-vis SparkSession object in Spark 2.0
  • New model export and PMML markup in Spark 2.0
  • Regression model evaluation using Spark 2.0
  • Binary classification model evaluation using Spark 2.0
  • Multilabel classification model evaluation using Spark 2.0
  • Multiclass classification model evaluation using Spark 2.0
  • Using the Scala Breeze library to do graphics in Spark 2.0

Introduction


In every line of business, from running a small company to creating and managing a mission-critical application, there is a set of common tasks that needs to be included as part of almost every workflow. This is true even for building robust machine learning systems. In Spark machine learning, some of these tasks range from splitting the data for model development (train, test, validate) to normalizing input feature vector data to creating ML pipelines via the Spark API. We provide a set of recipes in this chapter to enable the reader to think about what is actually required to implement an end-to-end machine learning system.

This chapter attempts to demonstrate a number of common tasks that are present in any robust Spark machine learning system implementation. To avoid redundantly referencing these common tasks in every recipe covered in this book, we have factored out such common tasks as short...

Spark's basic statistical API to help you build your own algorithms


In this recipe, we cover Spark's basic statistical API, including multivariate summary statistics (that is, Statistics.colStats), correlation, stratified sampling, hypothesis testing, random data generation, kernel density estimators, and much more, which can be applied to extremely large datasets while taking advantage of both parallelism and resiliency via RDDs.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
  3. Import the necessary packages for the Spark session to gain access to the cluster and log4j.Logger to reduce the amount of output produced by Spark:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.SparkSession
import org.apache.log4j.Logger
import org.apache.log4j.Level
  4. Set the output level to ERROR to reduce Spark...
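
The remaining steps are part of the locked content, but the core idea of this recipe can be sketched in a few lines. The following minimal, self-contained sketch (the toy vectors and the application name are invented for illustration and are not the book's data) shows Statistics.colStats() computing column-wise summary statistics over an RDD of vectors:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("colStatsSketch").getOrCreate()

// Hypothetical toy data standing in for a large RDD[Vector]
val data = spark.sparkContext.parallelize(Seq(
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(3.0, 30.0, 300.0)
))

// Column-wise (multivariate) summary computed in parallel across the cluster
val summary: MultivariateStatisticalSummary = Statistics.colStats(data)
println(summary.mean)        // mean of each column
println(summary.variance)    // variance of each column
println(summary.numNonzeros) // non-zero count per column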

ML pipelines for real-life machine learning applications


This is the first of two recipes that cover the ML pipeline in Spark 2.0. For a more advanced treatment of ML pipelines, with additional details such as the API and parameter extraction, see later chapters in this book.

In this recipe, we attempt to have a single pipeline that can tokenize text, use HashingTF (an old trick) to map term frequencies, run a regression to fit a model, and then predict which group a new term belongs to (for example, news filtering, gesture classification, and so on).

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
  3. Import the necessary packages for the Spark session to gain access to the cluster and log4j.Logger to reduce the amount of output produced by Spark:

 

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression...
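
The rest of the recipe is locked, but as a rough, self-contained sketch of the idea (the toy documents, labels, and parameter values below are invented for illustration and are not the book's data), a Tokenizer, HashingTF, and LogisticRegression can be chained into a single Pipeline like this:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("pipelineSketch").getOrCreate()

// Hypothetical training data: (id, text, label)
val training = spark.createDataFrame(Seq(
  (0L, "spark rdd dataset", 1.0),
  (1L, "cooking pasta recipe", 0.0),
  (2L, "spark ml pipeline", 1.0),
  (3L, "gardening tips", 0.0)
)).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1000)
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

// The pipeline runs tokenization, term-frequency hashing, and the classifier as one unit
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)

// Predict the group a new piece of text belongs to
val test = spark.createDataFrame(Seq((4L, "new spark streaming article"))).toDF("id", "text")
model.transform(test).select("id", "probability", "prediction").show(false)

The fitted PipelineModel applies all three stages to new text in a single transform() call, which is what makes pipelines convenient for real-life applications.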

Normalizing data with Spark


In this recipe, we normalize (scale) the data prior to feeding it into an ML algorithm. There are a good number of ML algorithms, such as Support Vector Machine (SVM), that work with scaled input vectors rather than with the raw values.

How to do it...

  1. Go to the UCI Machine Learning Repository and download the http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data file.
  2. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  3. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
  4. Import the necessary packages for the Spark session to gain access to the cluster and log4j.Logger to reduce the amount of output produced by Spark:
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.feature.MinMaxScaler
  5. Define a method to parse wine data into a tuple:
def parseWine(str: String): (Int, Vector...
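
For readers who only need the gist, here is a minimal sketch of the scaling step on its own (the tiny feature vectors below are placeholders rather than the parsed wine data): MinMaxScaler learns the per-feature minimum and maximum from the input and rescales each feature to the [0, 1] range.

import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("minMaxSketch").getOrCreate()

// Hypothetical feature vectors standing in for the parsed wine records
val df = spark.createDataFrame(Seq(
  (1, Vectors.dense(14.2, 1.7, 2.4)),
  (2, Vectors.dense(13.2, 1.8, 2.1)),
  (3, Vectors.dense(12.4, 0.9, 1.9))
)).toDF("label", "features")

val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")   // each feature rescaled to [0, 1]

val scaled = scaler.fit(df).transform(df)
scaled.select("features", "scaledFeatures").show(false)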

Splitting data for training and testing


In this recipe, you will learn to use Spark's API to split your input data into different datasets that can be used for the training and validation phases. It is common to use an 80/20 split, but other ways of splitting the data can be considered as well, based on your preference.

How to do it...

  1. Go to the UCI Machine Learning Repository and download the http://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip file.

 

  2. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  3. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
  4. Import the necessary packages for the Spark session to gain access to the cluster and log4j.Logger to reduce the amount of output produced by Spark:
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{ Level, Logger}
  5. Set the output level to ERROR to reduce Spark's logging output:
Logger.getLogger("org"...
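
As a minimal sketch of the splitting step itself (the file path below is a placeholder for wherever you unzip the downloaded dataset), one common way is Spark's randomSplit(), which divides a Dataset by the supplied weights; passing a seed makes the split reproducible:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("splitSketch").getOrCreate()

// Hypothetical path to the unzipped news aggregator file
val data = spark.read.textFile("../data/newsCorpora.csv")

// 80/20 split with a fixed seed for reproducibility
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

println(s"total=${data.count()}, train=${train.count()}, test=${test.count()}")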

Common operations with the new Dataset API


In this recipe, we cover the Dataset API, which is the preferred way of working with data in Spark 2.0 and beyond. In Chapter 3, Spark's Three Data Musketeers for Machine Learning - Perfect Together, we covered three detailed recipes for the Dataset, and in this chapter we cover some of the common, repetitive operations that are required to work with this new API. Additionally, we demonstrate the query plan generated by the Spark SQL Catalyst optimizer.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. We will use a JSON data file named cars.json, which has been created for this example:
name,city
Bears,Chicago
Packers,Green Bay
Lions,Detroit
Vikings,Minnesota
  3. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
  4. Import the necessary packages for the Spark session to get access to the cluster and log4j.Logger to reduce the amount of output produced...
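
As a minimal sketch of the kind of operations this recipe covers (the Team case class and rows below are made up for illustration), a typed Dataset supports filter and map with ordinary Scala lambdas, and explain(true) prints the logical and physical plans produced by the Catalyst optimizer:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("datasetOpsSketch").getOrCreate()
import spark.implicits._

// Hypothetical typed record; any small case class will do for the sketch
case class Team(name: String, city: String)

val ds = Seq(Team("Bears", "Chicago"), Team("Packers", "Green Bay"), Team("Lions", "Detroit")).toDS()

// Common operations: a typed filter followed by a typed map
val filtered = ds.filter(_.city.startsWith("C")).map(t => t.name.toUpperCase)
filtered.show(false)

// Inspect the query plan generated by the Catalyst optimizer
filtered.explain(true)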

Creating and using RDD versus DataFrame versus Dataset from a text file in Spark 2.0


In this recipe, we explore the differences in creating an RDD, a DataFrame, and a Dataset from a text file, and their relationship to each other, via a short sample code:

Dataset: spark.read.textFile()
RDD: spark.sparkContext.textFile()
DataFrame: spark.read.text()

Note

Assume that spark is the name of the SparkSession object.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
  3. Import the necessary packages for the Spark session to gain access to the cluster and log4j.Logger to reduce the amount of output produced by Spark:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
  4. We also define a case class to host the data used:
case class Beatle(id: Long, name: String)
  5. Set the output level to ERROR to reduce Spark's logging output:
Logger.getLogger("org").setLevel...
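
A short, self-contained sketch of the three creation calls listed above (the file path is a placeholder for any plain text file) makes the difference in return types explicit:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("rddDfDsSketch").getOrCreate()
import spark.implicits._

// Hypothetical path to a plain text file
val path = "../data/sample.txt"

val rdd = spark.sparkContext.textFile(path)   // RDD[String]
val df  = spark.read.text(path)               // DataFrame with a single "value" column
val ds  = spark.read.textFile(path)           // Dataset[String]

println(rdd.count())
df.printSchema()
ds.map(_.length).show(5)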

LabeledPoint data structure for Spark ML


LabeledPoint is a data structure that has been around since the early days of Spark ML; it pairs a feature vector with a label so that it can be used in supervised learning algorithms. We demonstrate a short recipe that uses LabeledPoint, the Seq data structure, and a DataFrame to run a logistic regression for binary classification of the data. The emphasis here is on LabeledPoint; the regression algorithms are covered in more depth in Chapter 5, Practical Machine Learning with Regression and Classification in Spark 2.0 - Part I, and Chapter 6, Practical Machine Learning with Regression and Classification in Spark 2.0 - Part II.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
  3. Import the necessary packages for SparkContext to get access to the cluster:
import org.apache.spark.ml.feature.LabeledPoint...
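
A minimal sketch of the idea (the four labeled points and the classifier parameters are invented for illustration and are not the book's data): a Seq of LabeledPoint converts directly to a DataFrame with label and features columns, which is exactly the shape LogisticRegression expects:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("labeledPointSketch").getOrCreate()
import spark.implicits._

// Hypothetical binary-labeled points: label first, then the feature vector
val data = Seq(
  LabeledPoint(1.0, Vectors.dense(2.0, 3.0)),
  LabeledPoint(0.0, Vectors.dense(0.5, 0.2)),
  LabeledPoint(1.0, Vectors.dense(3.1, 2.2)),
  LabeledPoint(0.0, Vectors.dense(0.1, 0.9))
).toDF()   // columns: label, features

val lr = new LogisticRegression().setMaxIter(10)
val model = lr.fit(data)
model.transform(data).select("label", "prediction").show()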

Getting access to Spark cluster in Spark 2.0


In this recipe, we demonstrate how to get access to a cluster using a single point of access named SparkSession. Spark 2.0 abstracts multiple contexts (such as SQLContext and HiveContext) into a single entry point, SparkSession, which allows you to access all of Spark's functionality in a unified way.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
  3. Import the necessary packages. In Spark 2.x, SparkSession is the preferred entry point to the cluster:
import org.apache.spark.sql.SparkSession
  4. Create Spark's configuration and SparkSession so we can have access to the cluster:
val spark = SparkSession
.builder
.master("local[*]") // if using a cluster: master("spark://master:7077")
.appName("myAccesSparkCluster20")
.config("spark.sql.warehouse.dir", ".")
.getOrCreate()

The preceding...
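
As a short supplementary note to this recipe: once the SparkSession exists, the older entry points can still be reached from it, which is handy when mixing new and legacy code. A minimal usage sketch, continuing from the spark value created above:

// The legacy contexts hang off the unified SparkSession
val sc = spark.sparkContext     // SparkContext, for RDD-based code
val sqlCtx = spark.sqlContext   // SQLContext, kept for backward compatibility
println(sc.master)              // for example, local[*] in this configuration
spark.stop()                    // release cluster resources when done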

Getting access to Spark cluster pre-Spark 2.0


This is a pre-Spark 2.0 recipe, but it will be helpful for those who want to quickly compare and contrast cluster access when porting pre-Spark 2.0 programs to Spark 2.0's new paradigm.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
  3. Import the necessary packages for SparkContext to get access to the cluster:
import org.apache.spark.{SparkConf, SparkContext}
  4. Create Spark's configuration and SparkContext so we can have access to the cluster:
val conf = new SparkConf()
.setAppName("MyAccessSparkClusterPre20")
.setMaster("local[4]") // if using a cluster: setMaster("spark://MasterHostIP:7077")
.set("spark.sql.warehouse.dir", ".")

val sc = new SparkContext(conf)

The preceding code utilizes the setMaster() function to set the cluster master location. As you can see, we are running the code in...
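
To round out the comparison with the previous recipe, here is a minimal sketch (the toy RDD is just for illustration) of how pre-2.0 code typically obtained SQL functionality: a separate SQLContext had to be constructed on top of the SparkContext, whereas in Spark 2.0 everything hangs off the SparkSession:

import org.apache.spark.sql.SQLContext

// Pre-Spark 2.0 style: SQL support requires its own context built on sc
val sqlContext = new SQLContext(sc)

val rdd = sc.parallelize(1 to 10)
println(rdd.sum())   // simple RDD action to confirm the context works

sc.stop()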

Getting access to SparkContext vis-a-vis SparkSession object in Spark 2.0


In this recipe, we demonstrate how to get hold of the SparkContext using the SparkSession object in Spark 2.0. The recipe demonstrates the creation, usage, and back-and-forth conversion between RDD and Dataset. The reason this is important is that, even though we prefer the Dataset going forward, we must still be able to use and augment legacy (pre-Spark 2.0) code that mostly utilizes RDDs.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
  3. Import the necessary packages for the Spark session to gain access to the cluster and log4j.Logger to reduce the amount of output produced by Spark:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import scala.util.Random
  4. Set the output level to ERROR to reduce Spark's logging output:
Logger.getLogger("org").setLevel...
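
The remaining steps are locked, but the heart of the recipe can be sketched as follows (the random numbers are placeholder data): the SparkContext is obtained from the session, an RDD is converted to a Dataset with toDS(), and the .rdd accessor converts it back:

import org.apache.spark.sql.SparkSession
import scala.util.Random

val spark = SparkSession.builder.master("local[*]").appName("rddToDsSketch").getOrCreate()
import spark.implicits._

// The SparkContext is reachable directly from the session
val sc = spark.sparkContext

// RDD -> Dataset
val rdd = sc.parallelize(Seq.fill(5)(Random.nextInt(100)))
val ds = rdd.toDS()

// Dataset -> RDD
val backToRdd = ds.rdd

ds.show()
println(backToRdd.count())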

New model export and PMML markup in Spark 2.0


In this recipe, we explore the model export facility available in Spark 2.0 that uses the Predictive Model Markup Language (PMML). This standard XML-based language allows you to export and run your models on other systems (some limitations apply). You can explore the There's more... section for more information.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
  3. Import the necessary packages for SparkContext to get access to the cluster:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.clustering.KMeans
  4. Create Spark's configuration and SparkSession:
val spark = SparkSession
.builder
.master("local[*]")   // if using a cluster: master("spark://master:7077")
.appName("myPMMLExport")
.config("spark.sql.warehouse.dir", ".")
.getOrCreate()
  5. We read...
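
The rest of the recipe is locked, but as a minimal sketch of the PMML export idea (the four 2-D points and the value of k are invented here; the book's recipe reads its data from a file): MLlib's KMeansModel mixes in PMMLExportable, so a trained model can be serialized to PMML as a string or written out to a path:

// Hypothetical tiny dataset; this continues from the spark session created above
val points = spark.sparkContext.parallelize(Seq(
  Vectors.dense(1.0, 1.0), Vectors.dense(1.5, 2.0),
  Vectors.dense(8.0, 8.0), Vectors.dense(9.0, 8.5)
))

val model = KMeans.train(points, k = 2, maxIterations = 10)

println(model.toPMML())                 // the PMML document as a String
model.toPMML("/tmp/kmeansModel.pmml")   // or write it to a local path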

Regression model evaluation using Spark 2.0


In this recipe, we explore how to evaluate a regression model (a regression decision tree in this example). Spark provides the RegressionMetrics facility, which offers statistical measures such as Mean Squared Error (MSE), R-squared, and so on, right out of the box.

The objective in this recipe is to understand the metrics that Spark provides out of the box. It is best to concentrate on step 8, since we cover regression in more detail in Chapter 5, Practical Machine Learning with Regression and Classification in Spark 2.0 - Part I, and Chapter 6, Practical Machine Learning with Regression and Classification in Spark 2.0 - Part II.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
  3. Import the necessary packages for SparkContext to get access to the cluster:
import org.apache.spark...
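
The remaining steps are locked, but a compact, self-contained sketch of what RegressionMetrics reports looks like this (the (prediction, label) pairs are made-up numbers rather than the output of the recipe's decision tree):

import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("regressionMetricsSketch").getOrCreate()

// Hypothetical (prediction, label) pairs
val predictionAndLabels = spark.sparkContext.parallelize(Seq(
  (2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0)
))

val metrics = new RegressionMetrics(predictionAndLabels)
println(s"MSE  = ${metrics.meanSquaredError}")
println(s"RMSE = ${metrics.rootMeanSquaredError}")
println(s"R2   = ${metrics.r2}")
println(s"MAE  = ${metrics.meanAbsoluteError}")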

Binary classification model evaluation using Spark 2.0


In this recipe, we demonstrate the use of the BinaryClassificationMetrics facility in Spark 2.0 and its application to evaluating a model that has a binary outcome (for example, a logistic regression).

The purpose here is not to showcase the regression itself, but to demonstrate how to go about evaluating it using common metrics such as receiver operating characteristic (ROC), Area Under ROC Curve, thresholds, and so on.

We recommend that you concentrate on step 8, since we cover regression in detail in Chapter 5, Practical Machine Learning with Regression and Classification in Spark 2.0 - Part I, and Chapter 6, Practical Machine Learning with Regression and Classification in Spark 2.0 - Part II.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
  3. Import the necessary packages...
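
As a minimal, self-contained sketch of the evaluation step (the (score, label) pairs are invented; in the recipe they come from the fitted logistic regression), BinaryClassificationMetrics produces the ROC curve and the areas under the ROC and precision-recall curves:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("binaryMetricsSketch").getOrCreate()

// Hypothetical (score, label) pairs
val scoreAndLabels = spark.sparkContext.parallelize(Seq(
  (0.9, 1.0), (0.8, 1.0), (0.4, 0.0), (0.3, 1.0), (0.1, 0.0)
))

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println(s"Area under ROC = ${metrics.areaUnderROC()}")
println(s"Area under PR  = ${metrics.areaUnderPR()}")
metrics.roc().collect().foreach(println)   // (false positive rate, true positive rate) points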

Multiclass classification model evaluation using Spark 2.0


In this recipe, we explore MulticlassMetrics, which allows you to evaluate a model that classifies the output into more than two labels (for example, red, blue, green, purple, do-not-know). It highlights the use of the confusion matrix (confusionMatrix) and model accuracy.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
  3. Import the necessary packages for SparkContext to get access to the cluster:
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
  4. Create Spark's configuration and SparkSession:
val spark = SparkSession
.builder
.master("local[*]")
.appName("myMulticlass...
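
The remaining code is locked, but the essence of MulticlassMetrics can be sketched as follows (the three-class (prediction, label) pairs are invented for illustration): it exposes the confusion matrix, overall accuracy, and per-label precision and recall:

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Hypothetical (prediction, label) pairs over three classes; continues from the spark session above
val predictionAndLabels = spark.sparkContext.parallelize(Seq(
  (0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (1.0, 0.0), (2.0, 1.0), (0.0, 0.0)
))

val metrics = new MulticlassMetrics(predictionAndLabels)
println(metrics.confusionMatrix)            // rows = actual labels, columns = predictions
println(s"accuracy = ${metrics.accuracy}")
metrics.labels.foreach { l =>
  println(s"label $l: precision = ${metrics.precision(l)}, recall = ${metrics.recall(l)}")
}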

Multilabel classification model evaluation using Spark 2.0


In this recipe, we explore multilabel classification evaluation with MultilabelMetrics in Spark 2.0, which should not be mixed up with the previous recipe dealing with multiclass classification (MulticlassMetrics). The key to this recipe is to focus on evaluation metrics such as Hamming loss, accuracy, F1-measure, and so on, and what they measure.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
  3. Import the necessary packages for SparkContext to get access to the cluster:
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.evaluation.MultilabelMetrics
import org.apache.spark.rdd.RDD
  4. Create Spark's configuration and SparkSession:
val spark = SparkSession
.builder
.master("local[*]")
.appName("myMultilabel")
.config("spark.sql.warehouse.dir", ".")
.getOrCreate()
  5. We create the...
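
The rest of the recipe is locked, but a minimal sketch of the evaluation itself (the (predicted labels, true labels) arrays below are invented; each record can carry several labels at once) shows the metrics the recipe focuses on:

import org.apache.spark.mllib.evaluation.MultilabelMetrics

// Hypothetical (predicted labels, true labels) pairs; continues from the spark session above
val predictionAndLabels = spark.sparkContext.parallelize(Seq(
  (Array(0.0, 1.0), Array(0.0, 2.0)),
  (Array(0.0, 2.0), Array(0.0, 2.0)),
  (Array(1.0),      Array(1.0, 2.0))
))

val metrics = new MultilabelMetrics(predictionAndLabels)
println(s"Hamming loss = ${metrics.hammingLoss}")
println(s"Accuracy     = ${metrics.accuracy}")
println(s"F1-measure   = ${metrics.f1Measure}")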

Using the Scala Breeze library to do graphics in Spark 2.0


In this recipe, we will use the scatter() and plot() functions from the Scala Breeze linear algebra library (part of ScalaNLP) to draw a scatter plot from two-dimensional data. Once the results are computed on the Spark cluster, either the actionable data can be brought back to the driver for drawing, or a JPEG or GIF can be generated on the backend and pushed forward for speed (an approach popular with GPU-based analytical databases such as MapD).

How to do it...

  1. First, we need to download the necessary ScalaNLP library. Download the JAR from the Maven repository available at https://repo1.maven.org/maven2/org/scalanlp/breeze-viz_2.11/0.12/breeze-viz_2.11-0.12.jar.
  2. Place the JAR in the C:\spark-2.0.0-bin-hadoop2.7\examples\jars directory on a Windows machine:
  3. In macOS, please put the JAR in its correct path. For our setting examples, the path is /Users/USERNAME/spark/spark-2.0.0-bin-hadoop2.7/examples/jars/.
  4. The following is the sample screenshot showing the JARs:
  5. Start...
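
The remaining steps are locked, but the plotting call itself can be sketched in a few lines of pure Breeze (the random points are placeholder data assumed to have been collected back to the driver, and the output file name is arbitrary):

import breeze.linalg.DenseVector
import breeze.plot._

// Hypothetical two-dimensional data collected back to the driver (for example, via rdd.collect())
val rand = new scala.util.Random(42)
val x = DenseVector.fill(100)(rand.nextDouble())
val y = DenseVector.fill(100)(rand.nextDouble())

val fig = Figure("Scatter sketch")
val p = fig.subplot(0)
p += scatter(x, y, _ => 0.01)   // third argument sets each point's size in data units
p.xlabel = "x"
p.ylabel = "y"
fig.saveas("scatter.png")       // writes a PNG; use fig.refresh() for an interactive window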