Time to Put Some Order - Cluster Your Data with Spark MLlib

"If you take a galaxy and try to make it bigger, it becomes a cluster of galaxies, not a galaxy. If you try to make it smaller than that, it seems to blow itself apart"

- Jeremiah P. Ostriker

In this chapter, we will delve deeper into machine learning and find out how we can take advantage of it to cluster the records of an unlabeled dataset into groups or classes. In a nutshell, the following topics will be covered in this chapter:

  • Unsupervised learning
  • Clustering techniques
  • Hierarchical clustering (HC)
  • Centroid-based clustering (CC)
  • Distribution-based clustering (DC)
  • Determining the number of clusters
  • A comparative analysis between clustering algorithms
  • Submitting jobs on computing clusters

Unsupervised learning

In this section, we will provide a brief introduction to unsupervised machine learning techniques with appropriate examples. Let's start the discussion with a practical example. Suppose you have a large collection of not-pirated, totally-legal MP3s in a crowded and massive folder on your hard drive. What if you could build a predictive model that automatically groups similar songs together and organizes them into your favorite categories, such as country, rap, and rock? This act of assigning an item to a group, so that an MP3 is added to the respective playlist, is done in an unsupervised way. In the previous chapters, we assumed you were given a training dataset of correctly labeled data. Unfortunately, we don't always have that luxury when we collect data in the real world. For example, suppose we would like to divide up a large...

Clustering techniques

In this section, we will discuss clustering techniques along with challenges and suitable examples. A brief overview of hierarchical clustering, centroid-based clustering, and distribution-based clustering will be provided too.

Unsupervised learning and clustering

Cluster analysis is about dividing data samples or data points into corresponding homogeneous classes or clusters. Thus, a simple definition of clustering is the process of organizing objects into groups whose members are similar in some way.
A cluster is, therefore, a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.
As shown in Figure 2...

Centroid-based clustering (CC)

In this section, we discuss the centroid-based clustering technique and its computational challenges. An example of using K-means with Spark MLlib will be shown for a better understanding of centroid-based clustering.

Challenges in CC algorithm

As discussed previously, in a centroid-based clustering algorithm like K-means, setting the optimal value of the number of clusters K is an optimization problem. This problem is NP-hard (that is, non-deterministic polynomial-time hard) with high algorithmic complexity, so the common approach is to settle for only an approximate solution. Consequently, solving this optimization problem imposes an extra burden and...
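To make this concrete, the following is a minimal sketch of training a K-means model with Spark MLlib's RDD-based API. It assumes an active SparkContext named sc and a comma-separated file of numeric features; the file name Saratoga_NY_Homes.txt is reused from the submission example later in this chapter, but the column layout, K = 5, and the iteration count are purely illustrative assumptions, not the book's exact settings:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse each line of a comma-separated file into a dense feature vector
// (the file name and column layout are assumptions for illustration)
val landRDD = sc.textFile("Saratoga_NY_Homes.txt")
  .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
  .cache()

// Train K-means with K = 5 clusters and at most 20 iterations (illustrative values)
val numClusters = 5
val numIterations = 20
val model = KMeans.train(landRDD, numClusters, numIterations)

// Inspect the learned centroids and assign every point to its nearest cluster
model.clusterCenters.foreach(println)
val assignments = landRDD.map(point => (model.predict(point), point))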

Hierarchical clustering (HC)

In this section, we discuss the hierarchical clustering technique and its computational challenges. An example of using the bisecting K-means algorithm with Spark MLlib will also be shown for a better understanding of hierarchical clustering.

An overview of HC algorithm and challenges

A hierarchical clustering technique is computationally different from centroid-based clustering in the way the distances are computed. It is one of the most popular and widely used cluster analysis techniques and seeks to build a hierarchy of clusters. Since a cluster usually consists of multiple objects, there will be other candidates for computing the distance too. Therefore, with...
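As a rough sketch of what such an example can look like, the RDD-based BisectingKMeans API in Spark MLlib can be run on the same feature RDD prepared earlier; K = 5 and the iteration budget below are again illustrative assumptions:

import org.apache.spark.mllib.clustering.BisectingKMeans

// A minimal sketch, assuming landRDD is the RDD[Vector] prepared earlier
val bkm = new BisectingKMeans()
  .setK(5)               // illustrative number of leaf clusters
  .setMaxIterations(20)  // illustrative iteration budget

val bkmModel = bkm.run(landRDD)

// computeCost returns the within-cluster sum of squares (WCSS); lower is better
println("Bisecting K-means WCSS = " + bkmModel.computeCost(landRDD))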

Distribution-based clustering (DC)

In this section, we will discuss the distribution-based clustering technique and its computational challenges. An example of using Gaussian mixture models (GMMs) with Spark MLlib will be shown for a better understanding of distribution-based clustering.

Challenges in DC algorithm

A distribution-based clustering algorithm like GMM is fitted using an expectation-maximization (EM) algorithm. To avoid the overfitting problem, GMM usually models the dataset with a fixed number of Gaussian distributions. The distributions are initialized randomly, and the related parameters are iteratively optimized to fit the model better to the training dataset. This is the most robust feature of GMM and helps the model to...
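As a brief, hedged sketch of such an example, a Gaussian mixture model can be trained on the same feature RDD using Spark MLlib's RDD-based GaussianMixture API; the number of components and the iteration cap below are illustrative assumptions:

import org.apache.spark.mllib.clustering.GaussianMixture

// A minimal sketch, again assuming landRDD is an RDD[Vector] of numeric features
val gmm = new GaussianMixture()
  .setK(5)                // illustrative number of Gaussian components
  .setMaxIterations(100)  // illustrative cap on EM iterations

val gmmModel = gmm.run(landRDD)

// Each component is described by a weight, a mean vector (mu), and a covariance matrix (sigma)
for (i <- 0 until gmmModel.k) {
  println(s"weight=${gmmModel.weights(i)}, mu=${gmmModel.gaussians(i).mu}")
}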

Determining the number of clusters

The beauty of clustering algorithms such as K-means is that they can cluster data with an unlimited number of features. They are great tools to use when you have raw data and would like to discover patterns in it. However, deciding on the number of clusters before running the experiment may not be successful and can sometimes lead to an overfitting or underfitting problem. On the other hand, one thing common to all three algorithms (that is, K-means, bisecting K-means, and Gaussian mixture) is that the number of clusters must be determined in advance and supplied to the algorithm as a parameter. Hence, informally, determining the number of clusters is a separate optimization problem to be solved.

In this section, we will use a heuristic approach based on the Elbow method. We start with K = 2 clusters, and then run the...
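A minimal sketch of this heuristic is shown below, assuming landRDD is the training RDD[Vector] used earlier: train K-means for increasing values of K, record the WCSS for each, and look for the "elbow" where the cost stops dropping sharply (the range 2 to 10 and the 20 iterations are illustrative assumptions):

import org.apache.spark.mllib.clustering.KMeans

// Train a K-means model for each candidate K and record its WCSS
val costs = (2 to 10).map { k =>
  val model = KMeans.train(landRDD, k, 20) // 20 iterations, illustrative
  (k, model.computeCost(landRDD))
}

// Printing (or plotting) these values reveals the "elbow" in the cost curve
costs.foreach { case (k, wcss) => println(s"K = $k, WCSS = $wcss") }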

A comparative analysis between clustering algorithms

A Gaussian mixture model is fitted mainly by expectation maximization, which is an example of an optimization algorithm. Bisecting K-means, which is faster than regular K-means, also produces slightly different clustering results. Below, we try to compare these three algorithms. We will show a performance comparison in terms of model building time and the computational cost of each algorithm. As shown in the following code, we can compute the cost in terms of the within-cluster sum of squares (WCSS). The following lines of code can be used to compute the WCSS for the K-means and bisecting K-means algorithms:

val WCSSS = model.computeCost(landRDD) // landRDD is the training set
println("Within-Cluster Sum of Squares = " + WCSSS) // Less is better

For the dataset we used throughout this chapter, we got the following values of WCSS:

Within-Cluster Sum of Squares of Bisecting...
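For the model building time part of the comparison, one simple option (an assumption on our part, not necessarily the exact timing method used for the numbers reported in this chapter) is to wrap each training call with a wall-clock timer:

// Measure wall-clock model building time for a single training run (illustrative)
val start = System.nanoTime()
val kmModel = org.apache.spark.mllib.clustering.KMeans.train(landRDD, 5, 20)
val elapsedSeconds = (System.nanoTime() - start) / 1e9
println(s"K-means model building time: $elapsedSeconds seconds")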

Submitting a Spark job for cluster analysis

The examples shown in this chapter can be scaled up to even larger datasets to serve different purposes. You can package all three clustering algorithms with all the required dependencies and submit them as a Spark job to a cluster. Now, use the following lines of code to submit your Spark job for K-means clustering, for example (use similar syntax for the other classes), for the Saratoga NY Homes dataset:

# Run the application locally on 8 cores
$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.examples.KMeansDemo \
--master local[8] \
KMeansDemo-0.1-SNAPSHOT-jar-with-dependencies.jar \
Saratoga_NY_Homes.txt

# Run on a YARN cluster
export HADOOP_CONF_DIR=XXX
$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.examples.KMeansDemo \
--master yarn \
--deploy-mode cluster \ # can be client for client mode...

Summary

In this chapter, we delved even deeper into machine learning and found out how we can take advantage of it to cluster the records of an unlabeled dataset. Consequently, you learned the practical know-how needed to quickly and effectively apply supervised and unsupervised techniques to new problems on the available data, through some widely used examples building on the understanding gained in the previous chapters. These examples are demonstrated from the Spark perspective. For any of the K-means, bisecting K-means, and Gaussian mixture algorithms, it is not guaranteed that the algorithm will produce the same clusters if run multiple times. For example, we observed that running the K-means algorithm multiple times with the same parameters generated slightly different results at each run.

For a performance comparison between...
