You're reading from  Apache Spark 2.x Machine Learning Cookbook

Product typeBook
Published inSep 2017
Reading LevelIntermediate
PublisherPackt
ISBN-139781783551606
Edition1st Edition
Authors (5):

Mohammed Guller

Author of Big Data Analytics with Spark - http://www.apress.com/9781484209653
Read more about Mohammed Guller

Siamak Amirghodsi

Siamak Amirghodsi (Sammy) is interested in building advanced technical teams, executive management, Spark, Hadoop, big data analytics, AI, deep learning nets, TensorFlow, cognitive models, swarm algorithms, real-time streaming systems, quantum computing, financial risk management, trading signal discovery, econometrics, long-term financial cycles, IoT, blockchain, probabilistic graphical models, cryptography, and NLP.
Read more about Siamak Amirghodsi

Shuen Mei

Shuen Mei is a big data analytics platform expert with 15+ years of experience in designing, building, and running large-scale, enterprise-distributed financial systems with mission-critical low-latency requirements. He holds Apache Spark and Cloudera Big Data platform certifications, including Developer, Admin, and HBase. He is also a certified AWS solutions architect with an emphasis on petabyte-range real-time data platform systems.
Read more about Shuen Mei

Meenakshi Rajendran

Meenakshi Rajendran is experienced in the end-to-end delivery of data analytics and data science products for leading financial institutions. Meenakshi holds a master's degree in business administration and is a certified PMP with over 13 years of experience in global software delivery environments. Her areas of research and interest are Apache Spark, cloud, regulatory data governance, machine learning, Cassandra, and managing global data teams at scale.
Read more about Meenakshi Rajendran

Broderick Hall

Broderick Hall is a hands-on big data analytics expert who holds a master's degree in computer science, with 20 years of experience in designing and developing complex enterprise-wide software applications with real-time and regulatory requirements at a global scale. He is a deep learning early adopter and is currently working on a large-scale cloud-based data platform with deep learning net augmentation.
Read more about Broderick Hall


Chapter 8. Unsupervised Clustering with Apache Spark 2.0

In this chapter, we will cover:

  • Building a KMeans classification system in Spark 2.0
  • Bisecting KMeans, the new kid on the block in Spark 2.0
  • Using Gaussian Mixture and Expectation Maximization (EM) in Spark 2.0 to classify data
  • Classifying the vertices of a graph using Power Iteration Clustering (PIC) in Spark 2.0
  • Using Latent Dirichlet Allocation (LDA) to classify documents and text into topics
  • Streaming KMeans to classify data in near real time

Introduction


Unsupervised machine learning is a type of learning technique in which we try to draw inferences either directly or indirectly (through latent factors) from a set of unlabeled observations. In simple terms, we are trying to find the hidden knowledge or structures in a set of data without initially labeling the training data.

While most machine learning library implementations break down when applied to large datasets (iterative, multi-pass, with many intermediate writes), the Apache Spark machine learning library succeeds by providing, out of the box, algorithms designed for parallelism and extremely large datasets, using memory for intermediate writes.

At the most abstract level, we can think of unsupervised learning as:

  • Clustering systems: Group the inputs into clusters using either hard (each point belongs to only a single cluster) or soft (probabilistic membership and overlaps) categorization.
  • Dimensionality reduction systems: Find hidden structure using a condensed representation of the original data.

The...

Building a KMeans classifying system in Spark 2.0


In this recipe, we will load a set of features (for example, x, y, z coordinates) from a LIBSVM file and then use KMeans() to instantiate an object. We will then set the number of desired clusters to three and use kmeans.fit() to run the algorithm. Finally, we will print the centers of the three clusters that we found.

It is really important to note that, contrary to popular literature, Spark does not implement KMeans++; instead, it implements KMeans|| (pronounced KMeans Parallel). See the following recipe and the sections following the code for a complete explanation of the algorithm as it is implemented in Spark.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter8
  3. Import the necessary packages for Spark context to get access to the cluster and Log4j.Logger to reduce...
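Putting the steps above together, the recipe's flow can be sketched end to end as follows. This is a minimal sketch, not the book's exact listing: the data file path and object name are placeholder assumptions, and setK(3) reflects the three clusters named in the recipe.

```scala
package spark.ml.cookbook.chapter8

import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.SparkSession

object MyKMeansCluster {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR) // reduce log noise

    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("myKMeansCluster")
      .config("spark.sql.warehouse.dir", ".")
      .getOrCreate()

    // Placeholder path: a LIBSVM file of feature vectors (e.g. x, y, z coordinates)
    val trainingData = spark.read.format("libsvm").load("../data/kmeans_data.txt")

    val kmeans = new KMeans()
      .setK(3)    // three desired clusters, as in the recipe
      .setSeed(1L)

    // fit() runs KMeans|| under the hood and returns the trained model
    val model = kmeans.fit(trainingData)

    // Print the centers of the three clusters we found
    model.clusterCenters.foreach(println)

    spark.stop()
  }
}
```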

Bisecting KMeans, the new kid on the block in Spark 2.0


In this recipe, we will download the glass dataset and try to identify and label each glass type using the bisecting KMeans algorithm. Bisecting KMeans is a hierarchical version of the KMeans algorithm, implemented in Spark using the BisectingKMeans() API. While this algorithm is conceptually similar to KMeans, it can offer considerable speed-ups for some use cases where a hierarchical path is present.

The dataset we used for this recipe is the Glass Identification Database. The study of the classification of types of glass was motivated by criminological research. Glass could be considered as evidence if it is correctly identified. The data can be found at NTU (Taiwan), already in LIBSVM format.

How to do it...

  1. We downloaded the prepared data file in LIBSVM format from: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/glass.scale

The dataset contains 11 features and 214 rows.

  2. The original dataset and data dictionary are also available at the UCI website...
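With the glass.scale file downloaded, the core of the recipe can be sketched as follows. This is a hedged sketch under stated assumptions: the file path is a placeholder, and setK(6) assumes one cluster per glass type present in the file.

```scala
package spark.ml.cookbook.chapter8

import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.sql.SparkSession

object MyBisectingKMeans {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("myBisectingKMeans")
      .config("spark.sql.warehouse.dir", ".")
      .getOrCreate()

    // Placeholder path to the downloaded glass.scale LIBSVM file
    val dataset = spark.read.format("libsvm").load("../data/glass.scale")

    val bkm = new BisectingKMeans()
      .setK(6)      // assumption: one cluster per glass type in the file
      .setSeed(123L)

    val model = bkm.fit(dataset)

    // Sum of squared distances of points to their nearest center
    println(s"Compute cost: ${model.computeCost(dataset)}")
    model.clusterCenters.zipWithIndex.foreach { case (center, i) =>
      println(s"Cluster $i center: $center")
    }

    spark.stop()
  }
}
```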

Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data


In this recipe, we will use Spark's implementation of expectation maximization (EM), GaussianMixture(), which calculates the maximum likelihood given a set of features as input. It assumes a Gaussian mixture in which each point can be sampled from one of K sub-distributions (cluster memberships).

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.

  2. Set up the package location where the program will reside:

package spark.ml.cookbook.chapter8
  3. Import the necessary packages for vector and matrix manipulation:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SparkSession
  4. Create Spark's session object:

val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("myGaussianMixture")
  .config("spark.sql.warehouse.dir", ".")
  .getOrCreate()
  5. Let us take...
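Continuing from the imports and session created above, the remaining steps of this recipe can be sketched roughly as follows; the data file path, its whitespace-separated format, and the choice of K = 2 are illustrative assumptions, not the book's exact values.

```scala
package spark.ml.cookbook.chapter8

import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MyGaussianMixture {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("myGaussianMixture")
      .config("spark.sql.warehouse.dir", ".")
      .getOrCreate()

    // Placeholder: whitespace-separated numeric rows, e.g. "1.2 3.4"
    val data = spark.sparkContext
      .textFile("../data/gmm_data.txt")
      .map(line => Vectors.dense(line.trim.split(' ').map(_.toDouble)))
      .cache()

    // Assumption: fit a mixture of K = 2 Gaussian sub-distributions via EM
    val gmm = new GaussianMixture().setK(2).run(data)

    // Each component has a mixing weight, mean vector, and covariance matrix
    for (i <- 0 until gmm.k) {
      println(s"weight=${gmm.weights(i)}")
      println(s"mu=${gmm.gaussians(i).mu}")
      println(s"sigma=\n${gmm.gaussians(i).sigma}")
    }

    spark.stop()
  }
}
```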

Classifying the vertices of a graph using Power Iteration Clustering (PIC) in Spark 2.0


This is a classification method for the vertices of a graph given their similarities as defined by their edges. It uses the GraphX library, which ships out of the box with Spark, to implement the algorithm. Power Iteration Clustering is similar to other eigenvector/eigenvalue algorithms, but without the overhead of matrix decomposition. It is suitable when you have a large sparse matrix (for example, a graph depicted as a sparse similarity matrix).

GraphFrames will be the replacement/interface proper for the GraphX library going forward (https://databricks.com/blog/2016/03/03/introducing-graphframes.html).

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.

  2. Set up the package location where the program will reside:

package spark.ml.cookbook.chapter8
  3. Import the necessary packages for Spark context to get access to the cluster and Log4j.Logger to reduce...
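The shape of this recipe can be sketched as follows: PIC takes an RDD of (srcId, dstId, similarity) edges and assigns each vertex to a cluster. The tiny hand-made graph and the values of K and maxIterations below are illustrative assumptions.

```scala
package spark.ml.cookbook.chapter8

import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.clustering.PowerIterationClustering
import org.apache.spark.sql.SparkSession

object MyPowerIterationClustering {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("myPowerIterationClustering")
      .config("spark.sql.warehouse.dir", ".")
      .getOrCreate()

    // A toy similarity graph: (srcId, dstId, similarity), two obvious groups
    val similarities = spark.sparkContext.parallelize(Seq(
      (0L, 1L, 1.0), (0L, 2L, 0.9), (1L, 2L, 0.9),
      (3L, 4L, 1.0), (3L, 5L, 0.8), (4L, 5L, 0.9)
    ))

    val pic = new PowerIterationClustering()
      .setK(2)              // two clusters, assumed for this toy graph
      .setMaxIterations(15) // assumed iteration budget

    val model = pic.run(similarities)
    model.assignments.collect().foreach { a =>
      println(s"vertex ${a.id} -> cluster ${a.cluster}")
    }

    spark.stop()
  }
}
```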

Latent Dirichlet Allocation (LDA) to classify documents and text into topics


In this recipe, we will explore the Latent Dirichlet Allocation (LDA) algorithm in Spark 2.0. The LDA we use in this recipe is completely different from linear discriminant analysis. Both Latent Dirichlet Allocation and linear discriminant analysis are referred to as LDA, but they are extremely different techniques. In this recipe, when we use LDA, we refer to Latent Dirichlet Allocation. The chapter on text analytics is also relevant to understanding LDA.

LDA is often used in natural language processing to classify a large body of documents (for example, emails from the Enron fraud case) into a discrete number of topics or themes so it can be understood. LDA is also a good candidate for selecting articles based on one's interests (for example, as you turn pages and spend time on a specific topic) in a given magazine or page.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice...
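The essence of the recipe can be sketched as follows: LDA in Spark's MLlib consumes (documentId, termCountVector) pairs and infers K topics over the vocabulary. The corpus file path, its format, and K = 3 are placeholder assumptions for illustration.

```scala
package spark.ml.cookbook.chapter8

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MyLDA {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("myLDA")
      .config("spark.sql.warehouse.dir", ".")
      .getOrCreate()

    // Placeholder corpus: each row is a space-separated vector of term counts
    val parsed = spark.sparkContext
      .textFile("../data/lda_term_counts.txt")
      .map(line => Vectors.dense(line.trim.split(' ').map(_.toDouble)))

    // LDA expects (documentId, termCountVector) pairs
    val corpus = parsed.zipWithIndex().map { case (v, id) => (id, v) }.cache()

    val ldaModel = new LDA().setK(3).run(corpus) // three topics, assumed

    // topicsMatrix is a vocabSize x K matrix of per-topic term weights
    println(s"Inferred ${ldaModel.k} topics over ${ldaModel.vocabSize} terms")

    spark.stop()
  }
}
```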

Streaming KMeans to classify data in near real-time


Spark Streaming is a powerful facility that lets you combine near real-time and batch processing in the same paradigm. The streaming KMeans interface lives at the intersection of ML clustering and Spark Streaming, and takes full advantage of the core facilities provided by Spark Streaming itself (for example, fault tolerance, exactly-once delivery semantics, and so on).

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.
  2. Set up the package location where the program will reside:

package spark.ml.cookbook.chapter8

  3. Import the necessary packages for streaming KMeans:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
  4. We set up the following parameters...
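Using the imports above, the recipe's flow can be sketched end to end as follows. The monitored directories, batch interval, K = 3, decay factor, and three-dimensional random centers are all illustrative assumptions: StreamingKMeans trains on one stream of vectors and predicts on a second stream of labeled points.

```scala
package spark.ml.cookbook.chapter8

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MyStreamingKMeans {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("myStreamingKMeans")
      .config("spark.sql.warehouse.dir", ".")
      .getOrCreate()

    // One-second micro-batches, assumed
    val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

    // Placeholder directories monitored for new text files
    val trainingStream = ssc.textFileStream("../data/streaming/train")
      .map(Vectors.parse)      // lines like "[1.0,2.0,3.0]"
    val testStream = ssc.textFileStream("../data/streaming/test")
      .map(LabeledPoint.parse) // lines like "(1.0,[1.0,2.0,3.0])"

    val model = new StreamingKMeans()
      .setK(3)                  // three clusters, assumed
      .setDecayFactor(1.0)      // 1.0 weights all batches equally
      .setRandomCenters(3, 0.0) // 3-dimensional data, assumed

    // Update centers as training batches arrive; predict on the test stream
    model.trainOn(trainingStream)
    model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```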

