
You're reading from Machine Learning with Spark - Second Edition

Product type: Book
Published in: Apr 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781785889936
Edition: 2nd Edition
Authors (2):
Rajdeep Dua

Rajdeep Dua has over 18 years' experience in the cloud and big data space. He has taught Spark and big data at some of the most prestigious tech schools in India: IIIT Hyderabad, ISB, IIIT Delhi, and Pune College of Engineering. He currently leads the developer relations team at Salesforce India. He has also presented BigQuery and Google App Engine at the W3C conference in Hyderabad. He led the developer relations teams at Google, VMware, and Microsoft, and has spoken at hundreds of other conferences on the cloud. Other references to his work can be found on Your Story and in the ACM Digital Library. His contributions to the open source community relate to Docker, Kubernetes, Android, OpenStack, and Cloud Foundry.

Manpreet Singh Ghotra

Manpreet Singh Ghotra has more than 15 years' experience in software development for both enterprise and big data software. He is currently working at Salesforce on developing a machine learning platform/APIs using open source libraries and frameworks such as Keras, Apache Spark, and TensorFlow. He has worked on various machine learning systems, including sentiment analysis, spam detection, and anomaly detection. He was part of the machine learning group at one of the largest online retailers in the world, working on transit time calculations using Apache Mahout, and the R recommendation system, again using Apache Mahout. With a master's and postgraduate degree in machine learning, he has contributed to, and worked for, the machine learning community.


Building a Clustering Model with Spark

In the last few chapters, we covered supervised learning methods, where the training data is labeled with the true outcome that we would like to predict (for example, a rating for recommendations, a class assignment for classification, or a real target variable in the case of regression).

Next, we will consider the case where we do not have labeled data available. This is called unsupervised learning, as the model is not supervised with the true target label. The unsupervised case is very common in practice, since obtaining labeled training data can be very difficult or expensive in many real-world scenarios (for example, having humans label training data with class labels for classification). However, we would still like to learn some underlying structure in the data and use this structure to make predictions.

This is where unsupervised learning approaches can be useful. Unsupervised...

Types of clustering models

There are many different forms of clustering models available, ranging from simple to extremely complex. Spark MLlib currently provides k-means clustering, which is among the simplest approaches available. However, it is often very effective, and its simplicity means it is relatively easy to understand and scales well.

k-means clustering

k-means attempts to partition a set of data points into K distinct clusters (where K is an input parameter for the model).

More formally, k-means tries to find clusters so as to minimize the sum of squared errors (or distances) within each cluster. This objective function is known as the within-cluster sum of squared errors (WCSS).

It is the sum, over each cluster, of the squared errors between...
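On toy data, this objective can be computed directly. A minimal plain-Scala sketch (illustrative only, not the book's code; each point is charged the squared distance to its nearest centre):

```scala
// WCSS: sum, over all points, of the squared Euclidean distance
// from each point to its nearest cluster centre.
def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def wcss(points: Seq[Array[Double]], centres: Seq[Array[Double]]): Double =
  points.map(p => centres.map(c => squaredDistance(p, c)).min).sum
```

For example, two points each at distance 1 from a single centre give a WCSS of 2.0.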

Extracting the right features from your data

Like most of the machine learning models we have encountered so far, k-means clustering requires numerical vectors as input. The same feature extraction and transformation approaches that we have seen for classification and regression are applicable for clustering.

As k-means, like least squares regression, uses a squared error function as the optimization objective, it tends to be impacted by outliers and features with large variance.

Conversely, clustering can itself be used to detect outliers, since outliers can cause a lot of problems for the model.

As in the regression and classification cases, input data can be normalized and standardized to overcome this, which might improve accuracy. In some cases, however, it might be desirable not to standardize data, if, for example, the objective is to find segmentations according to certain specific features.
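As an illustration of what standardization does to a single feature column, here is a toy plain-Scala sketch (it uses the population standard deviation; Spark's StandardScaler performs an equivalent z-score transformation at scale):

```scala
// z-score standardization: subtract the column mean and divide by the
// column standard deviation, so the feature has mean 0 and unit variance.
def standardize(xs: Seq[Double]): Seq[Double] = {
  val mean = xs.sum / xs.size
  val std  = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / xs.size)
  xs.map(x => (x - mean) / std)
}
```

After this transformation, features with large variance no longer dominate the squared-error objective.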

...

K-means - training a clustering model

Training for K-means in Spark ML takes an approach similar to the other models -- we pass a DataFrame that contains our training data to the fit method of the KMeans object.

Here we use the libsvm data format.

Training a clustering model on the MovieLens dataset

We will train a model for both the movie and user factors that we generated by running our recommendation model.

We need to pass in the number of clusters K and the maximum number of iterations for the algorithm to run. Model training might run for less than the maximum number of iterations if the change in the objective function from one iteration to the next is less than the tolerance level (the default for this tolerance is 0.0001).
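A toy plain-Scala sketch of this training loop (Lloyd's algorithm with the two stopping rules just described: an iteration cap plus an objective-change tolerance; this is an illustration, not Spark's implementation, and the deterministic first-k initialization is an assumption made for brevity):

```scala
def sqDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def kMeans(points: Seq[Array[Double]], k: Int,
           maxIter: Int, tol: Double): Seq[Array[Double]] = {
  var centres: Seq[Array[Double]] = points.take(k)  // naive initialization
  var prevCost = Double.MaxValue
  var iter = 0
  var converged = false
  while (iter < maxIter && !converged) {
    // Assignment step: group each point with its nearest centre.
    val assigned =
      points.groupBy(p => centres.indices.minBy(i => sqDist(p, centres(i))))
    // Update step: move each centre to the mean of its assigned points.
    centres = centres.indices.map { i =>
      assigned.get(i)
        .map(ps => ps.map(_.toSeq).transpose.map(col => col.sum / col.size).toArray)
        .getOrElse(centres(i))
    }
    // Stop early if the objective (WCSS) improved by less than tol.
    val cost = points.map(p => centres.map(c => sqDist(p, c)).min).sum
    converged = prevCost - cost < tol
    prevCost = cost
    iter += 1
  }
  centres
}
```

With k = 2 and well-separated one-dimensional points, the loop converges to the two group means well before the iteration cap.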

Spark ML's k-means provides...

K-means - evaluating the performance of clustering models

As with regression, classification, and recommendation models, there are many evaluation metrics that can be applied to clustering models to analyze their performance and the goodness of the clustering of the data points. Clustering evaluation is generally divided into internal and external evaluation. Internal evaluation refers to the case where the same data used to train the model is used for evaluation. External evaluation refers to using data external to the training data for evaluation purposes.

Internal evaluation metrics

Common internal evaluation metrics include the WCSS we covered earlier (which is exactly the k-means objective function), the Davies-Bouldin Index, the Dunn Index...
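Of these, the Dunn Index has a compact definition: the smallest distance between points in different clusters divided by the largest distance between points in the same cluster (higher is better). A naive O(n^2) plain-Scala sketch, not a Spark API:

```scala
def dist(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// Dunn Index: min inter-cluster distance / max intra-cluster distance.
def dunnIndex(clusters: Seq[Seq[Array[Double]]]): Double = {
  val inter = for {
    (ci, i) <- clusters.zipWithIndex
    (cj, j) <- clusters.zipWithIndex if i < j
    p <- ci; q <- cj
  } yield dist(p, q)
  val intra = for (c <- clusters; p <- c; q <- c if p ne q) yield dist(p, q)
  inter.min / intra.max
}
```

Compact, well-separated clusters push the numerator up and the denominator down, so a larger Dunn Index indicates a better clustering.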

Effect of iterations on WSSSE

Let us find out the effect of the number of iterations on the WSSSE (within set sum of squared errors, Spark's name for the WCSS objective) for the MovieLens dataset. We will calculate the WSSSE for various numbers of iterations and plot the output.

The code listing is:

object MovieLensKMeansMetrics {
  case class RatingX(userId: Int, movieId: Int, rating: Float,
      timestamp: Long)
  val DATA_PATH = "../../../data/ml-100k"
  val PATH_MOVIES = DATA_PATH + "/u.item"

  def main(args: Array[String]): Unit = {

    val spConfig = new SparkConf()
      .setMaster("local[1]")
      .setAppName("SparkApp")
      .set("spark.driver.allowMultipleContexts", "true")

    val spark = SparkSession
      .builder()
      .appName("Spark SQL Example")
      .config(spConfig)
      .getOrCreate()

    val datasetUsers = spark.read.format("libsvm").load(
      "./data/movie_lens_libsvm/movie_lens_users_libsvm...

Bisecting KMeans

It is a variation of standard k-means that repeatedly splits clusters in two, combining a hierarchical, divisive approach with the k-means update.

The steps of the algorithm are:

  1. Initialize by randomly selecting a point, say cL, then compute the centroid w of M and compute cR = w - (cL - w), the reflection of cL through w.

The centroid is the center of the cluster. A centroid is a vector containing one number for each variable, where each number is the mean of a variable for the observations in that cluster.

  2. Divide M = [x1, x2, ... xn] into two sub-clusters, ML and MR, according to the following rule: xi belongs to ML if ||xi - cL|| <= ||xi - cR||; otherwise, xi belongs to MR.

  3. Compute the centroids of ML and MR, wL and wR, as in step 1.

  4. If wL = cL and wR = cR, stop.
Otherwise, let cL = wL, cR = wR, and go to step 2.
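The steps above can be sketched in plain Scala (a toy illustration, not Spark's BisectingKMeans; for determinism, the "random" point of step 1 is taken to be the first point of M):

```scala
def sqDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def centroid(m: Seq[Array[Double]]): Array[Double] =
  m.map(_.toSeq).transpose.map(col => col.sum / col.size).toArray

// One bisecting step: split cluster M into sub-clusters ML and MR.
def bisect(m: Seq[Array[Double]]): (Seq[Array[Double]], Seq[Array[Double]]) = {
  val w = centroid(m)                                          // step 1
  var cL = m.head                                              // "random" seed point
  var cR = w.zip(cL).map { case (wi, ci) => wi - (ci - wi) }   // reflect cL through w
  var stable = false
  while (!stable) {
    val (mL, mR) = m.partition(x => sqDist(x, cL) <= sqDist(x, cR)) // step 2
    val (wL, wR) = (centroid(mL), centroid(mR))                     // step 3
    stable = wL.sameElements(cL) && wR.sameElements(cR)             // step 4
    cL = wL; cR = wR
  }
  m.partition(x => sqDist(x, cL) <= sqDist(x, cR))
}
```

To build a full tree, this split would be applied recursively to the sub-cluster chosen for further division.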

Bisecting K-means - training a clustering model

Training bisecting K-means in Spark ML takes an approach similar to the other models -- we pass a DataFrame that contains our training data to the fit method of the BisectingKMeans object. Note that here we use the libsvm data format:

  1. Instantiate the cluster object:

        val spConfig = new SparkConf()
          .setMaster("local[1]")
          .setAppName("SparkApp")
          .set("spark.driver.allowMultipleContexts", "true")

        val spark = SparkSession
          .builder()
          .appName("Spark SQL Example")
          .config(spConfig)
          .getOrCreate()

        val datasetUsers = spark.read.format("libsvm").load(
          BASE + "/movie_lens_2f_users_libsvm/part-00000")
        datasetUsers.show(3)

     The output of the command show(3) is shown here:
         ...

Gaussian Mixture Model

A mixture model is a probabilistic model for representing sub-populations within an overall population. Such models are used to make statistical inferences about the sub-populations, given only observations of the pooled population.

A Gaussian Mixture Model (GMM) is a mixture model represented as a weighted sum of Gaussian component densities. Its parameters are estimated from training data using the iterative Expectation-Maximization (EM) algorithm, or by Maximum A Posteriori (MAP) estimation from a well-trained prior model.
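The weighted-sum form can be written down directly. A one-dimensional plain-Scala sketch (illustrative only; Spark's GaussianMixture handles the multivariate case and learns the parameters via EM):

```scala
// One Gaussian component with its mixture weight (weights sum to 1).
case class Component(weight: Double, mean: Double, variance: Double)

// Univariate Gaussian density N(x | mean, variance).
def gaussian(x: Double, mean: Double, variance: Double): Double =
  math.exp(-(x - mean) * (x - mean) / (2 * variance)) /
    math.sqrt(2 * math.Pi * variance)

// GMM density: the weighted sum of the component densities.
def gmmDensity(x: Double, components: Seq[Component]): Double =
  components.map(c => c.weight * gaussian(x, c.mean, c.variance)).sum
```

Unlike k-means, which makes hard assignments, a GMM gives each point a probability under every component.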

The spark.ml implementation uses the EM algorithm.

It has the following parameters:

  • k: Number of desired clusters
  • convergenceTol: Maximum change in log-likelihood at which one considers convergence achieved
  • maxIterations: Maximum number of iterations to perform without reaching convergence
  • initialModel: Optional starting point from which to start the EM algorithm
(if this parameter is omitted, a random...

Summary

In this chapter, we explored a new class of model that learns structure from unlabeled data -- unsupervised learning. We worked through the required input data and feature extraction, and saw how to use the output of one model (a recommendation model in our example) as the input to another model (our k-means clustering model). Finally, we evaluated the performance of the clustering model, using both manual interpretation of the cluster assignments and mathematical performance metrics.

In the next chapter, we will cover another type of unsupervised learning used to reduce our data down to its most important features or components -- dimensionality reduction models.

