
You're reading from Machine Learning with Spark - Second Edition

Product type: Book
Published in: Apr 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781785889936
Edition: 2nd Edition
Authors (2):
Rajdeep Dua

Rajdeep Dua has over 18 years' experience in the cloud and big data space. He has taught Spark and big data at some of the most prestigious tech schools in India: IIIT Hyderabad, ISB, IIIT Delhi, and Pune College of Engineering. He currently leads the developer relations team at Salesforce India. He has also presented BigQuery and Google App Engine at the W3C conference in Hyderabad. He led the developer relations teams at Google, VMware, and Microsoft, and has spoken at hundreds of other conferences on the cloud. Other references to his work can be found on Your Story and in the ACM Digital Library. His contributions to the open source community relate to Docker, Kubernetes, Android, OpenStack, and Cloud Foundry.

Manpreet Singh Ghotra

Manpreet Singh Ghotra has more than 15 years' experience in software development for both enterprise and big data software. He is currently working at Salesforce on developing a machine learning platform/APIs using open source libraries and frameworks such as Keras, Apache Spark, and TensorFlow. He has worked on various machine learning systems, including sentiment analysis, spam detection, and anomaly detection. He was part of the machine learning group at one of the largest online retailers in the world, working on transit time calculations using Apache Mahout, and the R recommendation system, again using Apache Mahout. With a master's and postgraduate degree in machine learning, he has contributed to, and worked for, the machine learning community.


Building a Clustering Model with Spark

In the last few chapters, we covered supervised learning methods, where the training data is labeled with the true outcome that we would like to predict (for example, a rating for recommendations, a class assignment for classification, or a real target variable in the case of regression).

Next, we will consider the case where we do not have labeled data available. This is called unsupervised learning, as the model is not supervised with the true target label. The unsupervised case is very common in practice, since obtaining labeled training data can be very difficult or expensive in many real-world scenarios (for example, having humans label training data with class labels for classification). However, we would still like to learn some underlying structure in the data and use this structure to make predictions.

This is where unsupervised learning approaches can be useful. Unsupervised...

Types of clustering models

There are many different forms of clustering models available, ranging from simple to extremely complex. Spark MLlib currently provides k-means clustering, which is among the simplest approaches available. However, it is often very effective, and its simplicity means it is relatively easy to understand and scales well.

k-means clustering

k-means attempts to partition a set of data points into K distinct clusters (where K is an input parameter for the model).

More formally, k-means tries to find clusters so as to minimize the sum of squared errors (or distances) within each cluster. This objective function is known as the within-cluster sum of squared errors (WCSS).

It is the sum, over each cluster, of the squared errors between...
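On toy data, this objective can be computed directly. A minimal plain-Scala sketch (illustrative only, not the book's code; each point is charged the squared distance to its nearest centre):

```scala
// WCSS: sum, over all points, of the squared Euclidean distance
// from each point to its nearest cluster centre.
def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def wcss(points: Seq[Array[Double]], centres: Seq[Array[Double]]): Double =
  points.map(p => centres.map(c => squaredDistance(p, c)).min).sum
```

For example, two points each at distance 1 from a single centre give a WCSS of 2.0.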

Extracting the right features from your data

Like most of the machine learning models we have encountered so far, k-means clustering requires numerical vectors as input. The same feature extraction and transformation approaches that we have seen for classification and regression are applicable for clustering.

As k-means, like least squares regression, uses a squared error function as the optimization objective, it tends to be impacted by outliers and features with large variance.

Conversely, clustering can itself be used to detect outliers, since outliers can cause a lot of problems for the model.

As in the regression and classification cases, input data can be normalized and standardized to overcome this, which might improve accuracy. In some cases, however, it might be desirable not to standardize data, if, for example, the objective is to find segmentations according to certain specific features.
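As an illustration of what standardization does to a single feature column, here is a toy plain-Scala sketch (it uses the population standard deviation; Spark's StandardScaler performs an equivalent z-score transformation at scale):

```scala
// z-score standardization: subtract the column mean and divide by the
// column standard deviation, so the feature has mean 0 and unit variance.
def standardize(xs: Seq[Double]): Seq[Double] = {
  val mean = xs.sum / xs.size
  val std  = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / xs.size)
  xs.map(x => (x - mean) / std)
}
```

After this transformation, features with large variance no longer dominate the squared-error objective.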

...

K-means - training a clustering model

Training for K-means in Spark ML takes an approach similar to the other models -- we pass a DataFrame that contains our training data to the fit method of the KMeans object.

Here we use the libsvm data format.

Training a clustering model on the MovieLens dataset

We will train a model for both the movie and user factors that we generated by running our recommendation model.

We need to pass in the number of clusters K and the maximum number of iterations for the algorithm to run. Model training might run for less than the maximum number of iterations if the change in the objective function from one iteration to the next is less than the tolerance level (the default for this tolerance is 0.0001).
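A toy plain-Scala sketch of this training loop (Lloyd's algorithm with the two stopping rules just described: an iteration cap plus an objective-change tolerance; this is an illustration, not Spark's implementation, and the deterministic first-k initialization is an assumption made for brevity):

```scala
def sqDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def kMeans(points: Seq[Array[Double]], k: Int,
           maxIter: Int, tol: Double): Seq[Array[Double]] = {
  var centres: Seq[Array[Double]] = points.take(k)  // naive initialization
  var prevCost = Double.MaxValue
  var iter = 0
  var converged = false
  while (iter < maxIter && !converged) {
    // Assignment step: group each point with its nearest centre.
    val assigned =
      points.groupBy(p => centres.indices.minBy(i => sqDist(p, centres(i))))
    // Update step: move each centre to the mean of its assigned points.
    centres = centres.indices.map { i =>
      assigned.get(i)
        .map(ps => ps.map(_.toSeq).transpose.map(col => col.sum / col.size).toArray)
        .getOrElse(centres(i))
    }
    // Stop early if the objective (WCSS) improved by less than tol.
    val cost = points.map(p => centres.map(c => sqDist(p, c)).min).sum
    converged = prevCost - cost < tol
    prevCost = cost
    iter += 1
  }
  centres
}
```

With k = 2 and well-separated one-dimensional points, the loop converges to the two group means well before the iteration cap.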

Spark ML's k-means provides...

K-means - evaluating the performance of clustering models

As with regression, classification, and recommendation models, there are many evaluation metrics that can be applied to clustering models to analyze their performance and the goodness of the clustering of the data points. Clustering evaluation is generally divided into internal and external evaluation. Internal evaluation refers to the case where the same data used to train the model is used for evaluation. External evaluation refers to using data external to the training data for evaluation purposes.

Internal evaluation metrics

Common internal evaluation metrics include the WCSS we covered earlier (which is exactly the k-means objective function), the Davies-Bouldin Index, the Dunn Index...
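Of these, the Dunn Index has a compact definition: the smallest distance between points in different clusters divided by the largest distance between points in the same cluster (higher is better). A naive O(n^2) plain-Scala sketch, not a Spark API:

```scala
def dist(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// Dunn Index: min inter-cluster distance / max intra-cluster distance.
def dunnIndex(clusters: Seq[Seq[Array[Double]]]): Double = {
  val inter = for {
    (ci, i) <- clusters.zipWithIndex
    (cj, j) <- clusters.zipWithIndex if i < j
    p <- ci; q <- cj
  } yield dist(p, q)
  val intra = for (c <- clusters; p <- c; q <- c if p ne q) yield dist(p, q)
  inter.min / intra.max
}
```

Compact, well-separated clusters push the numerator up and the denominator down, so a larger Dunn Index indicates a better clustering.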

Effect of iterations on WSSSE

Let us find out the effect of the number of iterations on the WSSSE (within set sum of squared errors, Spark's name for the WCSS objective) for the MovieLens dataset. We will calculate the WSSSE for various numbers of iterations and plot the output.

The code listing is:

object MovieLensKMeansMetrics {
  case class RatingX(userId: Int, movieId: Int, rating: Float,
      timestamp: Long)
  val DATA_PATH = "../../../data/ml-100k"
  val PATH_MOVIES = DATA_PATH + "/u.item"

  def main(args: Array[String]): Unit = {

    val spConfig = new SparkConf()
      .setMaster("local[1]")
      .setAppName("SparkApp")
      .set("spark.driver.allowMultipleContexts", "true")

    val spark = SparkSession
      .builder()
      .appName("Spark SQL Example")
      .config(spConfig)
      .getOrCreate()

    val datasetUsers = spark.read.format("libsvm").load(
      "./data/movie_lens_libsvm/movie_lens_users_libsvm...

Bisecting KMeans

It is a variation of standard k-means that repeatedly splits clusters in two, combining a hierarchical, divisive approach with the k-means update.

The steps of the algorithm are:

  1. Initialize by randomly selecting a point, say cL, then compute the centroid w of M and compute cR = w - (cL - w), the reflection of cL through w.

The centroid is the center of the cluster. A centroid is a vector containing one number for each variable, where each number is the mean of a variable for the observations in that cluster.

  2. Divide M = [x1, x2, ... xn] into two sub-clusters, ML and MR, according to the following rule: xi belongs to ML if ||xi - cL|| <= ||xi - cR||; otherwise, xi belongs to MR.

  3. Compute the centroids of ML and MR, wL and wR, as in step 1.

  4. If wL = cL and wR = cR, stop.
Otherwise, let cL = wL, cR = wR, and go to step 2.
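The steps above can be sketched in plain Scala (a toy illustration, not Spark's BisectingKMeans; for determinism, the "random" point of step 1 is taken to be the first point of M):

```scala
def sqDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def centroid(m: Seq[Array[Double]]): Array[Double] =
  m.map(_.toSeq).transpose.map(col => col.sum / col.size).toArray

// One bisecting step: split cluster M into sub-clusters ML and MR.
def bisect(m: Seq[Array[Double]]): (Seq[Array[Double]], Seq[Array[Double]]) = {
  val w = centroid(m)                                          // step 1
  var cL = m.head                                              // "random" seed point
  var cR = w.zip(cL).map { case (wi, ci) => wi - (ci - wi) }   // reflect cL through w
  var stable = false
  while (!stable) {
    val (mL, mR) = m.partition(x => sqDist(x, cL) <= sqDist(x, cR)) // step 2
    val (wL, wR) = (centroid(mL), centroid(mR))                     // step 3
    stable = wL.sameElements(cL) && wR.sameElements(cR)             // step 4
    cL = wL; cR = wR
  }
  m.partition(x => sqDist(x, cL) <= sqDist(x, cR))
}
```

To build a full tree, this split would be applied recursively to the sub-cluster chosen for further division.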

Bisecting K-means - training a clustering model

Training bisecting K-means in Spark ML takes an approach similar to the other models -- we pass a DataFrame that contains our training data to the fit method of the BisectingKMeans object. Note that here we use the libsvm data format:

  1. Instantiate the cluster object:

        val spConfig = new SparkConf()
          .setMaster("local[1]")
          .setAppName("SparkApp")
          .set("spark.driver.allowMultipleContexts", "true")

        val spark = SparkSession
          .builder()
          .appName("Spark SQL Example")
          .config(spConfig)
          .getOrCreate()

        val datasetUsers = spark.read.format("libsvm").load(
          BASE + "/movie_lens_2f_users_libsvm/part-00000")
        datasetUsers.show(3)

     The output of the command show(3) is shown here:
         ...

Gaussian Mixture Model

A mixture model is a probabilistic model for representing sub-populations within an overall population. Such models are used to make statistical inferences about the sub-populations, given only observations of the pooled population.

A Gaussian Mixture Model (GMM) is a mixture model represented as a weighted sum of Gaussian component densities. Its parameters are estimated from training data using the iterative Expectation-Maximization (EM) algorithm, or by Maximum A Posteriori (MAP) estimation from a well-trained prior model.
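The weighted-sum form can be written down directly. A one-dimensional plain-Scala sketch (illustrative only; Spark's GaussianMixture handles the multivariate case and learns the parameters via EM):

```scala
// One Gaussian component with its mixture weight (weights sum to 1).
case class Component(weight: Double, mean: Double, variance: Double)

// Univariate Gaussian density N(x | mean, variance).
def gaussian(x: Double, mean: Double, variance: Double): Double =
  math.exp(-(x - mean) * (x - mean) / (2 * variance)) /
    math.sqrt(2 * math.Pi * variance)

// GMM density: the weighted sum of the component densities.
def gmmDensity(x: Double, components: Seq[Component]): Double =
  components.map(c => c.weight * gaussian(x, c.mean, c.variance)).sum
```

Unlike k-means, which makes hard assignments, a GMM gives each point a probability under every component.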

The spark.ml implementation uses the EM algorithm.

It has the following parameters:

  • k: Number of desired clusters
  • convergenceTol: Maximum change in log-likelihood at which one considers convergence achieved
  • maxIterations: Maximum number of iterations to perform without reaching convergence
  • initialModel: Optional starting point from which to start the EM algorithm
(if this parameter is omitted, a random...

Summary

In this chapter, we explored a new class of model that learns structure from unlabeled data -- unsupervised learning. We worked through the required input data and feature extraction, and saw how to use the output of one model (a recommendation model in our example) as the input to another model (our k-means clustering model). Finally, we evaluated the performance of the clustering model, using both manual interpretation of the cluster assignments and mathematical performance metrics.

In the next chapter, we will cover another type of unsupervised learning used to reduce our data down to its most important features or components -- dimensionality reduction models.

