
You're reading from Machine Learning with Spark - Second Edition

Product type: Book
Published in: Apr 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781785889936
Edition: 2nd Edition
Authors (2):

Rajdeep Dua

Rajdeep Dua has over 18 years of experience in the cloud and big data space. He has taught Spark and big data at some of the most prestigious tech schools in India: IIIT Hyderabad, ISB, IIIT Delhi, and Pune College of Engineering. He currently leads the developer relations team at Salesforce India. He has also presented BigQuery and Google App Engine at the W3C conference in Hyderabad. He led the developer relations teams at Google, VMware, and Microsoft, and has spoken at hundreds of other conferences on the cloud. Some other references to his work can be found at Your Story and in the ACM Digital Library. His contributions to the open source community relate to Docker, Kubernetes, Android, OpenStack, and Cloud Foundry.

Manpreet Singh Ghotra

Manpreet Singh Ghotra has more than 15 years of experience in software development for both enterprise and big data software. He is currently working at Salesforce on developing a machine learning platform/APIs using open source libraries and frameworks such as Keras, Apache Spark, and TensorFlow. He has worked on various machine learning systems, including sentiment analysis, spam detection, and anomaly detection. He was part of the machine learning group at one of the largest online retailers in the world, working on transit time calculations using Apache Mahout, and the R recommendation system, again using Apache Mahout. With a master's and postgraduate degree in machine learning, he has contributed to, and worked for, the machine learning community.


Designing a Machine Learning System

In this chapter, we will design a high-level architecture for an intelligent, distributed machine learning system that uses Spark as its core computation engine. The problem we will focus on will be taking the existing architecture for a web-based business and redesigning it to use automated machine learning systems to power key areas of the business.

Before we dig deeper into our scenario, we will spend some time understanding what machine learning is.

Then we will:

  • Introduce a hypothetical business scenario
  • Provide an overview of the current architecture
  • Explore various ways in which machine learning systems can enhance or replace certain business functions
  • Provide a new architecture based on these ideas

A modern large-scale data environment includes the following requirements:

  • It must integrate with the other components of the system, especially with data collection and...

What is Machine Learning?

Machine learning is closely related to data mining. While data mining has been around for more than 50 years, machine learning is the part of the field in which clusters of machines are used to analyze and extract knowledge from large datasets.

Machine learning is closely related to computational statistics. It also has strong ties to mathematical optimization, which provides methods, theory, and application domains to the field. Machine learning is employed in various types of computing tasks where designing and programming explicit algorithms is infeasible. Example applications are spam filtering, optical character recognition (OCR), search engines, and computer vision. Machine learning is sometimes combined with data mining, which focuses more on exploratory data analysis and is known as unsupervised learning.

Machine learning systems can be classified into three categories, depending on the nature of the...

Introducing MovieStream

To better illustrate the design of our architecture, we will introduce a practical scenario. Let's assume that we have just been appointed to head the data science team of MovieStream, a fictitious Internet business that streams movies and television shows to its users.

The MovieStream system is outlined in the following diagram:

MovieStream's current architecture

As we can see in the preceding diagram, currently, MovieStream's content editorial team is responsible for deciding which movies and shows are promoted and shown in various parts of the site. They are also responsible for creating the content for MovieStream's bulk marketing campaigns, which include e-mail and other direct marketing channels. Currently, MovieStream collects basic data on what titles are viewed by users on an aggregate basis and has access to some demographic data collected from users when they sign...

Business use cases for a machine learning system

Perhaps the first question we should answer is: why use machine learning at all?

Why doesn't MovieStream simply continue with human-driven decisions? There are many reasons to use machine learning (and certainly some reasons not to), but the most important ones are mentioned here:

  • The scale of data involved means that full human involvement quickly becomes infeasible as MovieStream grows
  • Model-driven approaches such as machine learning and statistics can often uncover patterns that humans cannot see (due to the size and complexity of the datasets)
  • Model-driven approaches can avoid human and emotional biases (as long as the correct processes are carefully applied)

However, there is no reason why both model-driven and human-driven processes and decision making cannot coexist. For example, many machine learning systems rely on receiving...

Types of machine learning models

While we have covered one example, there are many others, some of which we will touch on in the relevant chapters as we introduce each machine learning task.

However, we can broadly divide the preceding use cases and methods into two categories of machine learning:

  • Supervised learning: These types of models use labeled data to learn. Recommendation engines, regression, and classification are examples of supervised learning methods. The labels in these models can be user-movie ratings (for the recommendation engine), movie tags (in the case of the preceding classification example), or revenue figures (for regression). We will cover supervised learning models in Chapter 4, Building a Recommendation Engine with Spark, Chapter 6, Building a Classification Model with Spark, and Chapter 7, Building a Regression Model with Spark.
  • Unsupervised learning: When a model does not require labeled...
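The distinction between the two categories can be illustrated with a small plain-Python toy (this is not MLlib code; the data and function names are invented for illustration). The supervised snippet learns from labeled examples, while the unsupervised one groups unlabeled values with no labels in sight:

```python
# Supervised: labeled examples (minutes watched, liked?) train a classifier.
labeled = [(5, 0), (12, 0), (95, 1), (110, 1)]  # (feature, label)

def predict(minutes):
    # 1-nearest-neighbor on the single feature: copy the label of the
    # closest training example.
    return min(labeled, key=lambda ex: abs(ex[0] - minutes))[1]

# Unsupervised: no labels; group raw values around two centroids.
def two_means(values, iters=10):
    c1, c2 = min(values), max(values)
    for _ in range(iters):
        g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
        g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return sorted(g1), sorted(g2)
```

The supervised model needs the `0`/`1` labels to make a prediction; the unsupervised one discovers the grouping structure from the raw values alone.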

The components of a data-driven machine learning system

The high-level components of our machine learning system are outlined in the following diagram, which illustrates the machine learning pipeline: we obtain data and store it; we then transform it into a form usable as input to a machine learning model; we train, test, and refine our model; and then we deploy the final model to our production system. The process is repeated as new data is generated.

A general machine-learning pipeline
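As a rough sketch of the stages above (plain Python, not MLlib code; the stage names and toy data are invented for illustration), the pipeline can be expressed as functions chained together:

```python
def ingest():
    # In practice: logs, events, and user data landing in distributed storage.
    return [(1.0, 2.1), (2.0, 4.2), (3.0, 6.0)]  # (feature, target)

def transform(raw):
    # Turn raw records into model-ready features, e.g. scale to [0, 1].
    max_x = max(x for x, _ in raw)
    return [(x / max_x, y) for x, y in raw]

def train(data):
    # Fit y ~ w * x by least squares (closed form, no intercept).
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)

def evaluate(model, data):
    # Mean squared error; in the real pipeline this drives model refinement.
    return sum((model * x - y) ** 2 for x, y in data) / len(data)

def deploy(model):
    # The live serving function handed to the production system.
    return lambda x: model * x

data = transform(ingest())
model = train(data)
serve = deploy(model)
```

When new data arrives, the same chain is re-run, which is the feedback loop the diagram describes.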

Data ingestion and storage

The first step in our machine learning pipeline will be taking in the data that we require for training our models. Like many other businesses, MovieStream's data is typically generated by user activity, other...

An architecture for a machine learning system

Now that we have explored how our machine learning system might work in the context of MovieStream, we can outline a possible architecture for our system:

MovieStream's future architecture

As we can see, our system incorporates the machine learning pipeline outlined in the preceding diagram; this system also includes:

  • Collecting data about users, their behavior, and our content titles
  • Transforming this data into features
  • Training our models, including our training-testing and model-selection phases
  • Deploying the trained models to our live model-serving system, as well as using them in offline processes
  • Feeding back the model results into the MovieStream website through recommendation and targeting pages
  • Feeding back the model results into MovieStream's personalized marketing channels
  • Using the offline models to provide tools to MovieStream's...

Spark MLlib

Apache Spark is an open source platform for processing large datasets. It is well suited to iterative machine learning tasks because it leverages in-memory data structures such as RDDs. MLlib is Spark's machine learning library. It provides functionality for various learning algorithms, both supervised and unsupervised, and includes various statistical and linear algebra optimizations. Because it ships with Apache Spark, it avoids the installation headaches of some other libraries. MLlib supports several high-level languages, such as Scala, Java, Python, and R, and provides a high-level API for building machine learning pipelines.

MLlib's integration with Spark has quite a few benefits. Spark is designed for iterative computation, which makes it an efficient platform for implementing large-scale machine learning algorithms, as these algorithms are themselves iterative.

Any improvement in Spark's...

Performance improvements in Spark ML over Spark MLlib

Spark 2.0 uses the Tungsten engine, which is built on ideas from modern compilers and MPP databases. It emits optimized bytecode at runtime that collapses a query into a single function, eliminating the need for virtual function calls. It also uses CPU registers to store intermediate data. This technique is called whole-stage code generation.
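The idea can be illustrated with a small plain-Python analogy (Spark actually emits JVM bytecode; this is only a conceptual sketch with made-up data): the first function evaluates a query as a chain of per-row operator calls, while the second collapses the same query into one loop.

```python
def volcano_style(rows):
    # Each operator is a separate generator; every row crosses an
    # operator boundary (analogous to a virtual function call per row).
    filtered = (r for r in rows if r % 2 == 0)
    projected = (r * 10 for r in filtered)
    return sum(projected)

def whole_stage(rows):
    # The same query collapsed into one loop with no operator boundaries;
    # the intermediate value stays in a local variable (cf. CPU registers).
    total = 0
    for r in rows:
        if r % 2 == 0:
            total += r * 10
    return total
```

Both compute the same result; the fused version simply avoids the per-row dispatch overhead, which is where Tungsten's speedups come from.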

Source: https://databricks.com/blog/2016/05/11/apache-spark-2-0-technical-preview-easier-faster-and-smarter.html

The following table and graph show single-function performance improvements between Spark 1.6 and Spark 2.0:

Chart comparing performance improvements in single-line functions between Spark 1.6 and Spark 2.0
Table comparing performance improvements in single-line functions between Spark 1.6 and Spark...

Comparing algorithms supported by MLlib

In this section, we look at the algorithms supported by various MLlib versions.

Classification

In 1.6, over 10 algorithms are supported for classification, whereas when Spark ML version 1.0 was announced, only 3 algorithms were supported.

Clustering

There has been quite a bit of investment in clustering algorithms, moving from support for one algorithm in 1.0.0 to six implementations in 1.6.0.

Regression

Traditionally, regression...

MLlib supported methods and developer APIs

MLlib provides fast, distributed implementations of learning algorithms, including various linear models, Naive Bayes, SVMs, and ensembles of decision trees (also known as random forests) for classification and regression problems, and Alternating Least Squares (with explicit and implicit feedback) for collaborative filtering. It also supports k-means clustering and principal component analysis (PCA) for clustering and dimensionality reduction.
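To make concrete what MLlib's distributed k-means computes, here is a minimal single-machine sketch (stdlib only, toy data; MLlib parallelizes the assignment and update steps across a cluster rather than looping over all points on one node):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 2-D points: alternate assignment and update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster
        # (keep the old center if a cluster ends up empty).
        centers = [(sum(x for x, _ in cl) / len(cl),
                    sum(y for _, y in cl) / len(cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers
```

The distributed version performs exactly these two steps, but the per-point distance computations and the per-cluster sums are spread over the cluster's workers.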

The library provides some low-level primitives and basic utilities for convex optimization (http://spark.apache.org/docs/latest/mllib-optimization.html), distributed linear algebra (with support for vectors and matrices), statistical analysis (using Breeze as well as native functions), and feature extraction, and supports various I/O formats, including native support for the LIBSVM format.

It also supports data integration via Spark SQL...

MLlib vision

MLlib's vision is to provide a scalable machine learning platform that handles large datasets at scale with faster processing times than existing systems such as Hadoop.

It also strives to support as many algorithms as possible across supervised and unsupervised learning, such as classification, regression, and clustering.

MLlib versions compared

In this section, we will compare various versions of MLlib and the new functionality added in each.

Spark 1.6 to 2.0

The DataFrame-based API will be the primary API.

The RDD-based API is entering maintenance mode. The MLlib guide (http://spark.apache.org/docs/2.0.0/ml-guide.html) provides more details.

The following are the new features introduced in Spark 2.0:

  • ML persistence: The DataFrame-based API provides support for saving and loading ML models and Pipelines in Scala, Java, Python, and R
  • MLlib in R: SparkR offers MLlib APIs for generalized linear models, naive Bayes, k-means clustering, and survival regression in this release
  • Python: PySpark in 2.0 supports new MLlib algorithms: LDA, Generalized Linear Regression, Gaussian Mixture...

Summary

In this chapter, we learned about the components inherent in a data-driven, automated machine learning system, and outlined how a possible high-level architecture for such a system might look in a real-world situation. We also got an overview of MLlib, Spark's machine learning library, and compared it to other machine learning implementations from a performance perspective. Finally, we looked at the new features in various versions of Spark, from Spark 1.6 to Spark 2.0.

In the next chapter, we will discuss how to obtain publicly available datasets for common machine learning tasks. We will also explore general concepts for processing, cleaning, and transforming data so that it can be used to train a machine learning model.

