Chapter 13. Spark Streaming and Machine Learning Library

In this chapter, we will cover the following recipes:

  • Structured streaming for near real-time machine learning
  • Streaming DataFrames for real-time machine learning
  • Streaming Datasets for real-time machine learning
  • Streaming data and debugging with queueStream
  • Downloading and understanding the famous Iris data for unsupervised classification
  • Streaming KMeans for a real-time online classifier
  • Downloading wine quality data for streaming regression
  • Streaming linear regression for a real-time regression
  • Downloading Pima Diabetes data for supervised classification
  • Streaming logistic regression for an online classifier

Introduction


Spark streaming is a journey toward unification and structuring of the APIs in order to address the concerns of batch versus stream processing. Spark streaming has been available since Spark 1.3 with Discretized Streams (DStream). The new direction is to abstract away the underlying details using an unbounded table model, in which users query the table with SQL or functional programming and write the output to another output table in multiple modes (complete, delta, and append). The Spark SQL Catalyst optimizer and Tungsten (the off-heap memory manager) are now an intrinsic part of Spark streaming, which leads to much more efficient execution.
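
To make the unbounded table model concrete, here is a minimal sketch of ours (not code from the book): a directory of CSV files is read as an unbounded input table, queried with an ordinary DataFrame filter, and written to an output table in append mode. The schema, path, and threshold are illustrative assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object UnboundedTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("UnboundedTableSketch")
      .getOrCreate()

    // File sources require an explicit schema for streaming reads.
    val schema = new StructType()
      .add("name", StringType)
      .add("score", DoubleType)

    // The directory is treated as an unbounded table: every new CSV file
    // dropped into it appends rows. The path is a hypothetical example.
    val input = spark.readStream
      .schema(schema)
      .csv("/tmp/streaming-input")

    // Query the unbounded table with ordinary DataFrame operations...
    val highScores = input.filter("score > 0.5")

    // ...and write the result to an output table. "append" emits only new
    // rows; "complete" would re-emit the whole result on each trigger.
    highScores.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}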

In this chapter, we not only cover the streaming facilities available out of the box in Spark's machine learning library, but also provide four introductory recipes that we found useful on our journey toward a better understanding of Spark 2.0.

The following figure depicts what is covered in this chapter:

Spark 2.0+ builds on the success of the previous...

Structured streaming for near real-time machine learning


In this recipe, we explore the new structured streaming paradigm introduced in Spark 2.0. We explore real-time streaming using sockets and the structured streaming API to cast and tabulate votes accordingly; a sketch of the consumer side follows the steps below.

We also explore the newly introduced subsystem by simulating a stream of randomly generated votes to pick the most unpopular comic book villain.

Note

There are two distinct programs (VoteCountStream.scala and CountStreamproducer.scala) that make up this recipe.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter13
  3. Import the necessary packages for the Spark context to get access to the cluster and log4j.Logger to reduce the amount of output produced by Spark:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import java.io.{BufferedOutputStream, PrintWriter...
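
The remaining steps are truncated here. For orientation, the following is a minimal sketch of what the vote-consuming side could look like; the localhost:9999 socket and the one-vote-per-line format are our assumptions, not the book's exact VoteCountStream.scala:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object VoteCountStreamSketch {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    val spark = SparkSession.builder
      .master("local[*]")
      .appName("VoteCountStream")
      .getOrCreate()

    // Assumption: each line arriving on the socket is one villain name,
    // that is, one vote; the producer writes to localhost:9999.
    val votes = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    // Tally votes per villain; "complete" mode reprints the full running
    // totals on every trigger, so the current leader is always visible.
    votes.groupBy("value").count()
      .writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}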

Streaming DataFrames for real-time machine learning


In this recipe, we explore the concept of a streaming DataFrame. We create a DataFrame consisting of the name and age of individuals, which we will stream across a wire. A streaming DataFrame is a popular technique to use with Spark ML, since at the time of writing we do not have full integration between structured streaming and the ML library.

We limit this recipe to demonstrating a streaming DataFrame only, and leave it to the reader to adapt it to their own custom ML pipelines. While a streaming DataFrame is not available out of the box in Spark 2.1.0, it will be a natural evolution to see it in later versions of Spark. A hedged sketch of the idea follows the steps below.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter13
  3. Import the necessary packages:
import java.util.concurrent.TimeUnit
import org.apache.log4j.{Level, Logger}
import...
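
The listing is truncated here. As a hedged sketch of the idea, assuming a localhost:9999 socket carrying comma-separated name,age lines, a streaming DataFrame of individuals could be built as follows:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object PersonStreamSketch {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    val spark = SparkSession.builder
      .master("local[*]")
      .appName("PersonStreamSketch")
      .getOrCreate()
    import spark.implicits._

    // Raw lines such as "Alice,42" arrive on the socket.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    // Split each line into a (name, age) DataFrame that downstream
    // ML code could consume.
    val people = lines
      .withColumn("name", split($"value", ",")(0))
      .withColumn("age", split($"value", ",")(1).cast("int"))
      .select("name", "age")

    people.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}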

Streaming Datasets for real-time machine learning


In this recipe, we create a streaming Dataset to demonstrate the use of Datasets with the Spark 2.0 structured programming paradigm. We stream stock prices from a file using a Dataset and apply a filter to select the stocks that closed above $100 on a given day. A typed sketch follows the steps below.

The recipe demonstrates how streams can be used to filter and to act on the incoming data using a simple structured streaming programming model. While it is similar to a DataFrame, there are some differences in the syntax. The recipe is written in a generalized manner so the user can customize it for their own Spark ML programming projects.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter13
  3. Import the necessary packages:
import java.util.concurrent.TimeUnit
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import...
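
The listing is truncated here. The following is a minimal sketch under our own assumptions (a StockPrice case class for the schema and a hypothetical /tmp/stocks input directory) of streaming a typed Dataset and filtering for closes above $100:

import org.apache.spark.sql.{Encoders, SparkSession}

case class StockPrice(date: String, open: Double, high: Double,
                      low: Double, close: Double, volume: Long)

object StockStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("StockStreamSketch")
      .getOrCreate()
    import spark.implicits._

    // Derive the schema from the case class rather than typing it twice.
    val schema = Encoders.product[StockPrice].schema

    // Typed streaming Dataset over an unbounded directory of CSV files.
    val stocks = spark.readStream
      .schema(schema)
      .option("header", "true")
      .csv("/tmp/stocks")
      .as[StockPrice]

    // Keep only the days on which the stock closed above $100; note the
    // typed lambda, one of the syntactic differences from a DataFrame.
    val above100 = stocks.filter(_.close > 100.0)

    above100.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}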

Streaming data and debugging with queueStream


In this recipe, we explore the concept of queueStream(), which is a valuable tool for getting a streaming program working during the development cycle. We found the queueStream() API very useful and felt that other developers could benefit from a recipe that fully demonstrates its usage; a self-contained sketch follows the steps below.

We start by simulating a user browsing various URLs associated with different web pages using the ClickGenerator.scala program, and then proceed to consume and tabulate the data (user behavior/visits) using the ClickStream.scala program.

We use Spark's streaming API with DStream, which requires the use of a streaming context. We call this out explicitly to highlight one of the differences between Spark streaming and the Spark structured streaming programming model.

Note

There are two distinct programs (ClickGenerator.scala and ClickStream.scala) that make up this recipe.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the...
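
The rest of the steps are truncated here. Below is a minimal, self-contained sketch of queueStream() itself, with made-up click data standing in for ClickGenerator.scala and ClickStream.scala:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.Queue

object QueueStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("QueueStreamSketch")
      .getOrCreate()
    // Note the explicit StreamingContext: this is classic DStream-based
    // Spark streaming, not structured streaming.
    val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

    // Pre-load a queue of RDDs; queueStream dequeues one per batch interval,
    // which makes the job deterministic and easy to debug.
    val rddQueue = new Queue[RDD[String]]()
    val urls = Seq("/home", "/cart", "/home", "/checkout")
    for (_ <- 1 to 3) rddQueue += spark.sparkContext.parallelize(urls)

    val clicks = ssc.queueStream(rddQueue)

    // Tabulate visits per URL for each micro-batch.
    clicks.map(url => (url, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop()
  }
}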

Downloading and understanding the famous Iris data for unsupervised classification


In this recipe, we download and inspect the well-known Iris dataset in preparation for the upcoming streaming KMeans recipe, which lets you see classification/clustering in real time.

The data is housed on the UCI machine learning repository, which is a great source of data to prototype algorithms on. You will notice that R bloggers tend to love this dataset.

How to do it...

  1. You can start by downloading the dataset using any of the following three commands:
wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

You can also use the following command:

curl https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data -o iris.data

Alternatively, you can download the file directly from the following URL in a browser:

https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
  2. Now we begin our first step of data exploration by examining how the data in iris.data is formatted (a parsing sketch follows this output):
head -5 iris.data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2...
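
Each row holds four numeric measurements followed by the species label. Since the upcoming clustering recipe is unsupervised, a parsing sketch (file path and column layout assumed from the sample above) would keep the four features and drop the label:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SparkSession

object IrisParseSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("IrisParseSketch")
      .getOrCreate()

    // iris.data rows: four numeric measurements, then the species label.
    val features = spark.sparkContext.textFile("iris.data")
      .filter(_.nonEmpty)
      .map { line =>
        val cols = line.split(",")
        // Drop the species label; clustering is unsupervised.
        Vectors.dense(cols.dropRight(1).map(_.toDouble))
      }

    features.take(3).foreach(println)
  }
}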

Streaming KMeans for a real-time online classifier


In this recipe, we explore the streaming version of KMeans used in unsupervised learning schemes. The purpose of the streaming KMeans algorithm is to classify or group a set of data points into a number of clusters based on their similarity.

There are two implementations of the KMeans classification method, one for static/offline data and another version for continuously arriving, real-time updating data.

We will be clustering the Iris dataset as new data streams into our streaming context. A minimal wiring sketch follows the steps below.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter13
  3. Import the necessary packages:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import scala.collection.mutable.Queue...
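
The listing is truncated here. For orientation, here is a minimal sketch of the streaming KMeans wiring; k = 3 (one cluster per Iris species), the decay factor, and the queue-backed stream are our illustrative choices, and the two sample vectors come from the head of iris.data:

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.Queue

object StreamingKMeansSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("StreamingKMeansSketch")
      .getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

    // Simulated stream of 4-dimensional Iris feature vectors.
    val queue = new Queue[RDD[Vector]]()
    queue += spark.sparkContext.parallelize(Seq(
      Vectors.dense(5.1, 3.5, 1.4, 0.2),
      Vectors.dense(4.9, 3.0, 1.4, 0.2)
    ))
    val trainingStream = ssc.queueStream(queue)

    // k = 3 matches the three Iris species; decayFactor = 1.0 weights all
    // batches equally; random centers seed the model in 4 dimensions.
    val model = new StreamingKMeans()
      .setK(3)
      .setDecayFactor(1.0)
      .setRandomCenters(4, 0.0)

    model.trainOn(trainingStream)            // update centers per batch
    model.predictOn(trainingStream).print()  // emit cluster assignments

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop()
  }
}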

Downloading wine quality data for streaming regression


In this recipe, we download and inspect the wine quality dataset from the UCI machine learning repository to prepare the data for Spark's streaming linear regression algorithm from MLlib.

How to do it...

You will need one of the following command-line tools, curl or wget, to retrieve the specified data:

  1. You can start by downloading the dataset using any of the following three commands. The first one is as follows:
wget http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv

You can also use the following command:

curl http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv -o winequality-white.csv

Alternatively, you can download the file directly from the following URL in a browser:

http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
  2. Now we begin our first steps of data exploration by seeing how the data in winequality-white.csv is formatted (a parsing sketch follows this output):
head -5 winequality-white.csv

"fixed acidity";"volatile...

Streaming linear regression for a real-time regression


In this recipe, we will use the wine quality dataset from UCI and Spark's streaming linear regression algorithm from MLlib to predict the quality of a wine based on a group of wine features. A minimal wiring sketch follows the steps below.

The difference between this recipe and the traditional recipes we saw before is the use of Spark ML streaming to score the quality of the wine in real time using a linear regression model.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter13
  3. Import the necessary packages:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.streaming...
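
The listing is truncated here. The following is a minimal sketch of the streaming regression wiring; the zero initial weights over the 11 wine features, the step size, and the single made-up training point are our assumptions, not the book's code:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.Queue

object StreamingWineRegressionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("StreamingWineRegressionSketch")
      .getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

    // In the real recipe this queue would be fed with LabeledPoints parsed
    // from winequality-white.csv; this single made-up point is a placeholder.
    val trainQueue = new Queue[RDD[LabeledPoint]]()
    trainQueue += spark.sparkContext.parallelize(Seq(
      LabeledPoint(6.0, Vectors.dense(Array.fill(11)(0.5)))
    ))
    val trainingStream = ssc.queueStream(trainQueue)

    // Start from zero weights over the 11 wine features; the model keeps
    // updating with SGD as each micro-batch arrives.
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(11))
      .setStepSize(0.01)

    model.trainOn(trainingStream)
    // Score incoming (actual quality, features) pairs in real time.
    model.predictOnValues(trainingStream.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop()
  }
}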

Downloading Pima Diabetes data for supervised classification


In this recipe, we download and inspect the Pima Diabetes dataset from the UCI machine learning repository. We will use the dataset later with Spark's streaming logistic regression algorithm.

How to do it...

You will need one of the following command-line tools, curl or wget, to retrieve the specified data:

  1. You can start by downloading the dataset using either of the following two commands. The first option is to fetch the file directly from the following URL (for example, in a browser):
http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data

This is an alternative that you can use:

wget http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data -O pima-indians-diabetes.data
  2. Now we begin our first steps of data exploration by seeing how the data in pima-indians-diabetes.data is formatted (from a Mac or Linux Terminal; a parsing sketch follows this output):
head -5 pima-indians-diabetes.data
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0...
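
Each row holds eight comma-separated features followed by a 0/1 outcome in the last column. A parsing sketch under that assumption follows:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SparkSession

object PimaParseSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("PimaParseSketch")
      .getOrCreate()

    val points = spark.sparkContext
      .textFile("pima-indians-diabetes.data")
      .filter(_.nonEmpty)
      .map { line =>
        val cols = line.split(",").map(_.toDouble)
        // The last value is the outcome (1 = tested positive for diabetes).
        LabeledPoint(cols.last, Vectors.dense(cols.dropRight(1)))
      }

    points.take(3).foreach(println)
  }
}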

Streaming logistic regression for an online classifier


In this recipe, we will be using the Pima Diabetes dataset we downloaded in the previous recipe, together with Spark's streaming logistic regression algorithm with SGD, to predict whether a Pima subject with various features will test positive as a diabetic. It is an online classifier that learns and predicts based on the streamed data; a minimal wiring sketch follows the steps below.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter13
  3. Import the necessary packages:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection...
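
The listing is truncated here. To close the loop, here is a minimal sketch of the streaming logistic regression wiring; the eight-feature zero weight vector, the step size, and the queue-backed stream are illustrative assumptions, with one sample row taken from the head of the dataset:

import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.Queue

object StreamingPimaLogRegSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("StreamingPimaLogRegSketch")
      .getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

    // One sample row from pima-indians-diabetes.data: eight features, label 1.
    val trainQueue = new Queue[RDD[LabeledPoint]]()
    trainQueue += spark.sparkContext.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0))
    ))
    val trainingStream = ssc.queueStream(trainQueue)

    // Online logistic regression: weights update as each batch arrives.
    val model = new StreamingLogisticRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(8))
      .setStepSize(0.5)

    model.trainOn(trainingStream)
    // Emit (actual label, predicted label) pairs for streamed patients.
    model.predictOnValues(trainingStream.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop()
  }
}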