Chapter 13. Spark Streaming and Machine Learning Library

In this chapter, we will cover the following recipes:

  • Structured streaming for near real-time machine learning
  • Streaming DataFrames for real-time machine learning
  • Streaming Datasets for real-time machine learning
  • Streaming data and debugging with queueStream
  • Downloading and understanding the famous Iris data for unsupervised classification
  • Streaming KMeans for a real-time online classifier
  • Downloading wine quality data for streaming regression
  • Streaming linear regression for a real-time regression
  • Downloading Pima Diabetes data for supervised classification
  • Streaming logistic regression for an online classifier

Introduction


Spark streaming is a journey toward unification and structuring of the APIs in order to address the concerns of batch versus stream processing. Spark streaming has been available since Spark 1.3 with Discretized Streams (DStream). The new direction is to abstract away the underlying details using an unbounded table model, in which users query the table with SQL or functional programming and write the output to another output table in multiple modes (complete, delta, and append). The Spark SQL Catalyst optimizer and Tungsten (the off-heap memory manager) are now an intrinsic part of Spark streaming, which leads to much more efficient execution.
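
To make the unbounded table model concrete, here is a minimal sketch of ours (not code from the book): a directory of CSV files is read as an unbounded input table, queried with an ordinary DataFrame filter, and written to an output table in append mode. The schema, path, and threshold are illustrative assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object UnboundedTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("UnboundedTableSketch")
      .getOrCreate()

    // File sources require an explicit schema for streaming reads.
    val schema = new StructType()
      .add("name", StringType)
      .add("score", DoubleType)

    // The directory is treated as an unbounded table: every new CSV file
    // dropped into it appends rows. The path is a hypothetical example.
    val input = spark.readStream
      .schema(schema)
      .csv("/tmp/streaming-input")

    // Query the unbounded table with ordinary DataFrame operations...
    val highScores = input.filter("score > 0.5")

    // ...and write the result to an output table. "append" emits only new
    // rows; "complete" would re-emit the whole result on each trigger.
    highScores.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}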

In this chapter, we not only cover the streaming facilities available out of the box in Spark's machine learning library, but also provide four introductory recipes that we found useful on our journey toward a better understanding of Spark 2.0.

The following figure depicts what is covered in this chapter:

Spark 2.0+ builds on the success of the previous...

Structured streaming for near real-time machine learning


In this recipe, we explore the new structured streaming paradigm introduced in Spark 2.0. We explore real-time streaming using sockets and the structured streaming API to cast and tabulate votes accordingly; a sketch of the consumer side follows the steps below.

We also explore the newly introduced subsystem by simulating a stream of randomly generated votes to pick the most unpopular comic book villain.

Note

There are two distinct programs (VoteCountStream.scala and CountStreamproducer.scala) that make up this recipe.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter13
  3. Import the necessary packages for the Spark context to get access to the cluster and log4j.Logger to reduce the amount of output produced by Spark:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import java.io.{BufferedOutputStream, PrintWriter...
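
The remaining steps are truncated here. For orientation, the following is a minimal sketch of what the vote-consuming side could look like; the localhost:9999 socket and the one-vote-per-line format are our assumptions, not the book's exact VoteCountStream.scala:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object VoteCountStreamSketch {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    val spark = SparkSession.builder
      .master("local[*]")
      .appName("VoteCountStream")
      .getOrCreate()

    // Assumption: each line arriving on the socket is one villain name,
    // that is, one vote; the producer writes to localhost:9999.
    val votes = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    // Tally votes per villain; "complete" mode reprints the full running
    // totals on every trigger, so the current leader is always visible.
    votes.groupBy("value").count()
      .writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}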

Streaming DataFrames for real-time machine learning


In this recipe, we explore the concept of a streaming DataFrame. We create a DataFrame consisting of the name and age of individuals, which we will stream across a wire. A streaming DataFrame is a popular technique to use with Spark ML, since at the time of writing we do not have full integration between structured streaming and the ML library.

We limit this recipe to demonstrating a streaming DataFrame only, and leave it to the reader to adapt it to their own custom ML pipelines. While a streaming DataFrame is not available out of the box in Spark 2.1.0, it will be a natural evolution to see it in later versions of Spark. A hedged sketch of the idea follows the steps below.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter13
  3. Import the necessary packages:
import java.util.concurrent.TimeUnit
import org.apache.log4j.{Level, Logger}
import...
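
The listing is truncated here. As a hedged sketch of the idea, assuming a localhost:9999 socket carrying comma-separated name,age lines, a streaming DataFrame of individuals could be built as follows:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object PersonStreamSketch {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    val spark = SparkSession.builder
      .master("local[*]")
      .appName("PersonStreamSketch")
      .getOrCreate()
    import spark.implicits._

    // Raw lines such as "Alice,42" arrive on the socket.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    // Split each line into a (name, age) DataFrame that downstream
    // ML code could consume.
    val people = lines
      .withColumn("name", split($"value", ",")(0))
      .withColumn("age", split($"value", ",")(1).cast("int"))
      .select("name", "age")

    people.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}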

Streaming Datasets for real-time machine learning


In this recipe, we create a streaming Dataset to demonstrate the use of Datasets with the Spark 2.0 structured programming paradigm. We stream stock prices from a file using a Dataset and apply a filter to select the stocks that closed above $100 on a given day. A typed sketch follows the steps below.

The recipe demonstrates how streams can be used to filter and to act on the incoming data using a simple structured streaming programming model. While it is similar to a DataFrame, there are some differences in the syntax. The recipe is written in a generalized manner so the user can customize it for their own Spark ML programming projects.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter13
  3. Import the necessary packages:
import java.util.concurrent.TimeUnit
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import...
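
The listing is truncated here. The following is a minimal sketch under our own assumptions (a StockPrice case class for the schema and a hypothetical /tmp/stocks input directory) of streaming a typed Dataset and filtering for closes above $100:

import org.apache.spark.sql.{Encoders, SparkSession}

case class StockPrice(date: String, open: Double, high: Double,
                      low: Double, close: Double, volume: Long)

object StockStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("StockStreamSketch")
      .getOrCreate()
    import spark.implicits._

    // Derive the schema from the case class rather than typing it twice.
    val schema = Encoders.product[StockPrice].schema

    // Typed streaming Dataset over an unbounded directory of CSV files.
    val stocks = spark.readStream
      .schema(schema)
      .option("header", "true")
      .csv("/tmp/stocks")
      .as[StockPrice]

    // Keep only the days on which the stock closed above $100; note the
    // typed lambda, one of the syntactic differences from a DataFrame.
    val above100 = stocks.filter(_.close > 100.0)

    above100.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}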

Streaming data and debugging with queueStream


In this recipe, we explore the concept of queueStream(), which is a valuable tool for getting a streaming program working during the development cycle. We found the queueStream() API very useful and felt that other developers could benefit from a recipe that fully demonstrates its usage; a self-contained sketch follows the steps below.

We start by simulating a user browsing various URLs associated with different web pages using the ClickGenerator.scala program, and then proceed to consume and tabulate the data (user behavior/visits) using the ClickStream.scala program.

We use Spark's streaming API with DStream, which requires the use of a streaming context. We call this out explicitly to highlight one of the differences between Spark streaming and the Spark structured streaming programming model.

Note

There are two distinct programs (ClickGenerator.scala and ClickStream.scala) that make up this recipe.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the...
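
The rest of the steps are truncated here. Below is a minimal, self-contained sketch of queueStream() itself, with made-up click data standing in for ClickGenerator.scala and ClickStream.scala:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.Queue

object QueueStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("QueueStreamSketch")
      .getOrCreate()
    // Note the explicit StreamingContext: this is classic DStream-based
    // Spark streaming, not structured streaming.
    val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

    // Pre-load a queue of RDDs; queueStream dequeues one per batch interval,
    // which makes the job deterministic and easy to debug.
    val rddQueue = new Queue[RDD[String]]()
    val urls = Seq("/home", "/cart", "/home", "/checkout")
    for (_ <- 1 to 3) rddQueue += spark.sparkContext.parallelize(urls)

    val clicks = ssc.queueStream(rddQueue)

    // Tabulate visits per URL for each micro-batch.
    clicks.map(url => (url, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop()
  }
}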

Downloading and understanding the famous Iris data for unsupervised classification


In this recipe, we download and inspect the well-known Iris dataset in preparation for the upcoming streaming KMeans recipe, which lets you see classification/clustering in real time.

The data is housed on the UCI machine learning repository, which is a great source of data to prototype algorithms on. You will notice that R bloggers tend to love this dataset.

How to do it...

  1. You can start by downloading the dataset using any of the following three commands:
wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

You can also use the following command:

curl https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data -o iris.data

Alternatively, you can download the file directly from the following URL in a browser:

https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
  2. Now we begin our first step of data exploration by examining how the data in iris.data is formatted (a parsing sketch follows this output):
head -5 iris.data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2...
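
Each row holds four numeric measurements followed by the species label. Since the upcoming clustering recipe is unsupervised, a parsing sketch (file path and column layout assumed from the sample above) would keep the four features and drop the label:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SparkSession

object IrisParseSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("IrisParseSketch")
      .getOrCreate()

    // iris.data rows: four numeric measurements, then the species label.
    val features = spark.sparkContext.textFile("iris.data")
      .filter(_.nonEmpty)
      .map { line =>
        val cols = line.split(",")
        // Drop the species label; clustering is unsupervised.
        Vectors.dense(cols.dropRight(1).map(_.toDouble))
      }

    features.take(3).foreach(println)
  }
}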

Streaming KMeans for a real-time online classifier


In this recipe, we explore the streaming version of KMeans used in unsupervised learning schemes. The purpose of the streaming KMeans algorithm is to classify or group a set of data points into a number of clusters based on their similarity.

There are two implementations of the KMeans classification method, one for static/offline data and another version for continuously arriving, real-time updating data.

We will be clustering the Iris dataset as new data streams into our streaming context. A minimal wiring sketch follows the steps below.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter13
  3. Import the necessary packages:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import scala.collection.mutable.Queue...
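
The listing is truncated here. For orientation, here is a minimal sketch of the streaming KMeans wiring; k = 3 (one cluster per Iris species), the decay factor, and the queue-backed stream are our illustrative choices, and the two sample vectors come from the head of iris.data:

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.Queue

object StreamingKMeansSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("StreamingKMeansSketch")
      .getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

    // Simulated stream of 4-dimensional Iris feature vectors.
    val queue = new Queue[RDD[Vector]]()
    queue += spark.sparkContext.parallelize(Seq(
      Vectors.dense(5.1, 3.5, 1.4, 0.2),
      Vectors.dense(4.9, 3.0, 1.4, 0.2)
    ))
    val trainingStream = ssc.queueStream(queue)

    // k = 3 matches the three Iris species; decayFactor = 1.0 weights all
    // batches equally; random centers seed the model in 4 dimensions.
    val model = new StreamingKMeans()
      .setK(3)
      .setDecayFactor(1.0)
      .setRandomCenters(4, 0.0)

    model.trainOn(trainingStream)            // update centers per batch
    model.predictOn(trainingStream).print()  // emit cluster assignments

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop()
  }
}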

Downloading wine quality data for streaming regression


In this recipe, we download and inspect the wine quality dataset from the UCI machine learning repository to prepare the data for Spark's streaming linear regression algorithm from MLlib.

How to do it...

You will need one of the following command-line tools, curl or wget, to retrieve the specified data:

  1. You can start by downloading the dataset using any of the following three commands. The first one is as follows:
wget http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv

You can also use the following command:

curl http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv -o winequality-white.csv

Alternatively, you can download the file directly from the following URL in a browser:

http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
  2. Now we begin our first steps of data exploration by seeing how the data in winequality-white.csv is formatted (a parsing sketch follows this output):
head -5 winequality-white.csv

"fixed acidity";"volatile...

Streaming linear regression for a real-time regression


In this recipe, we will use the wine quality dataset from UCI and Spark's streaming linear regression algorithm from MLlib to predict the quality of a wine based on a group of wine features. A minimal wiring sketch follows the steps below.

The difference between this recipe and the traditional recipes we saw before is the use of Spark ML streaming to score the quality of the wine in real time using a linear regression model.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter13
  3. Import the necessary packages:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.streaming...
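
The listing is truncated here. The following is a minimal sketch of the streaming regression wiring; the zero initial weights over the 11 wine features, the step size, and the single made-up training point are our assumptions, not the book's code:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.Queue

object StreamingWineRegressionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("StreamingWineRegressionSketch")
      .getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

    // In the real recipe this queue would be fed with LabeledPoints parsed
    // from winequality-white.csv; this single made-up point is a placeholder.
    val trainQueue = new Queue[RDD[LabeledPoint]]()
    trainQueue += spark.sparkContext.parallelize(Seq(
      LabeledPoint(6.0, Vectors.dense(Array.fill(11)(0.5)))
    ))
    val trainingStream = ssc.queueStream(trainQueue)

    // Start from zero weights over the 11 wine features; the model keeps
    // updating with SGD as each micro-batch arrives.
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(11))
      .setStepSize(0.01)

    model.trainOn(trainingStream)
    // Score incoming (actual quality, features) pairs in real time.
    model.predictOnValues(trainingStream.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop()
  }
}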

Downloading Pima Diabetes data for supervised classification


In this recipe, we download and inspect the Pima Diabetes dataset from the UCI machine learning repository. We will use the dataset later with Spark's streaming logistic regression algorithm.

How to do it...

You will need one of the following command-line tools, curl or wget, to retrieve the specified data:

  1. You can start by downloading the dataset using either of the following two commands. The first option is to fetch the file directly from the following URL (for example, in a browser):
http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data

This is an alternative that you can use:

wget http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data -O pima-indians-diabetes.data
  2. Now we begin our first steps of data exploration by seeing how the data in pima-indians-diabetes.data is formatted (from a Mac or Linux Terminal; a parsing sketch follows this output):
head -5 pima-indians-diabetes.data
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0...
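
Each row holds eight comma-separated features followed by a 0/1 outcome in the last column. A parsing sketch under that assumption follows:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SparkSession

object PimaParseSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("PimaParseSketch")
      .getOrCreate()

    val points = spark.sparkContext
      .textFile("pima-indians-diabetes.data")
      .filter(_.nonEmpty)
      .map { line =>
        val cols = line.split(",").map(_.toDouble)
        // The last value is the outcome (1 = tested positive for diabetes).
        LabeledPoint(cols.last, Vectors.dense(cols.dropRight(1)))
      }

    points.take(3).foreach(println)
  }
}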

Streaming logistic regression for an online classifier


In this recipe, we will be using the Pima Diabetes dataset we downloaded in the previous recipe, together with Spark's streaming logistic regression algorithm with SGD, to predict whether a Pima subject with various features will test positive as a diabetic. It is an online classifier that learns and predicts based on the streamed data; a minimal wiring sketch follows the steps below.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter13
  3. Import the necessary packages:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection...
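
The listing is truncated here. To close the loop, here is a minimal sketch of the streaming logistic regression wiring; the eight-feature zero weight vector, the step size, and the queue-backed stream are illustrative assumptions, with one sample row taken from the head of the dataset:

import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.Queue

object StreamingPimaLogRegSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("StreamingPimaLogRegSketch")
      .getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

    // One sample row from pima-indians-diabetes.data: eight features, label 1.
    val trainQueue = new Queue[RDD[LabeledPoint]]()
    trainQueue += spark.sparkContext.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0))
    ))
    val trainingStream = ssc.queueStream(trainQueue)

    // Online logistic regression: weights update as each batch arrives.
    val model = new StreamingLogisticRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(8))
      .setStepSize(0.5)

    model.trainOn(trainingStream)
    // Emit (actual label, predicted label) pairs for streamed patients.
    model.predictOnValues(trainingStream.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop()
  }
}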