You're reading from Essential PySpark for Scalable Data Analytics

Product type: Book
Published in: Oct 2021
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800568877
Edition: 1st Edition
Author (1)
Sreeram Nudurupati

Sreeram Nudurupati is a data analytics professional with years of experience in designing and optimizing data analytics pipelines at scale. He has a history of helping enterprises, as well as digital natives, build optimized analytics pipelines by using the knowledge of the organization, infrastructure environment, and current technologies.

Chapter 8: Unsupervised Machine Learning

In the previous two chapters, you were introduced to the supervised learning class of machine learning algorithms, their real-world applications, and how to implement them at scale using Spark MLlib. This chapter introduces the unsupervised learning category of machine learning, covering both parametric and non-parametric unsupervised algorithms. A few real-world applications of clustering and association algorithms are presented to show how unsupervised learning solves real-life problems, and you will gain a basic understanding of clustering and association problems along the way. We will also look at the implementation details of a few clustering algorithms in Spark ML, such as K-means clustering, hierarchical clustering, and latent Dirichlet allocation, as well as an association algorithm called alternating least squares.

In this chapter, we'...

Technical requirements

In this chapter, we will be using Databricks Community Edition to run our code (https://community.cloud.databricks.com).

Introduction to unsupervised machine learning

Unsupervised learning is a machine learning technique in which the learning algorithm receives no guidance in the form of known label values in the training data. It is useful for categorizing unknown data points into groups based on patterns, similarities, or differences that are inherent in the data, without any prior knowledge of that data.

In supervised learning, a model is trained on known data, and inferences are then drawn from the model using new, unseen data. In unsupervised learning, by contrast, the model training process is itself the end goal: patterns hidden within the training data are discovered during training. Unsupervised learning is harder than supervised learning because, without external evaluation, it is difficult to ascertain whether the results of an unsupervised learning algorithm are meaningful, especially without access to any correctly labeled data...

Clustering using machine learning

In machine learning, clustering deals with identifying patterns or structures within uncategorized data without needing any external guidance. Clustering algorithms parse the given data to identify clusters, or groups, with matching patterns that exist in the dataset. The result of a clustering algorithm is a set of clusters, each of which can be defined as a collection of objects that are similar in a certain way. The following diagram illustrates how clustering works:

Figure 8.1 – Clustering

In the previous diagram, an uncategorized dataset is passed through a clustering algorithm, which categorizes the data into smaller clusters or groups based on each data point's proximity to other data points in a two-dimensional Euclidean space.

Thus, the clustering algorithm groups data based on the Euclidean distance between data points on a two-dimensional plane. Clustering algorithms consider the Euclidean distance...
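
To make the idea of grouping by Euclidean distance concrete, here is a minimal sketch of K-means in plain Python. It is an illustration only and not the chapter's Spark code: the sample points, the function names `euclidean` and `kmeans`, and the deterministic choice of starting centroids are all assumptions made for this example (production implementations typically use randomized or smarter initialization).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=10):
    """Toy K-means: returns (centroids, assignments).

    Assumption for this sketch: initial centroids are evenly spaced
    picks from the input list, which keeps the example deterministic.
    """
    centroids = [points[i * len(points) // k] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assignments


# Hypothetical two-dimensional data with two visually separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
print(centroids)
```

The two steps inside the loop are the whole algorithm: points are assigned to the closest centroid, and each centroid is then recomputed as the mean of its assigned points. At scale, the same idea is available through the `KMeans` estimator in `pyspark.ml.clustering`.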

Building association rules using machine learning

Association rule learning is a data mining technique whose goal is to identify relationships between entities in a given dataset by finding entities that frequently occur together. Association rules are useful for recommending new items based on the relationships between existing items that frequently appear together. In data mining, association rules are expressed as a series of if-then statements that indicate how likely it is that entities are related. The technique is widely used in recommender systems, market basket analysis, and affinity analysis problems.
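
The rules described above are usually scored with two measures: support (how often an itemset appears) and confidence (how often the "then" part appears when the "if" part does). The following is a brute-force sketch in plain Python over a made-up set of shopping baskets. The data, thresholds, and function names are assumptions for illustration, and the pairwise search shown here is not an efficient mining algorithm such as FP-Growth.

```python
from itertools import combinations

# Hypothetical market baskets; each set is one transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) from the transactions."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))


def pair_rules(transactions, min_support=0.4, min_confidence=0.6):
    """Return (if_item, then_item, support, confidence) for item pairs."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, transactions)
        if pair_support < min_support:
            continue  # The pair is too rare to form a useful rule.
        for x, y in ((a, b), (b, a)):
            conf = confidence({x}, {y}, transactions)
            if conf >= min_confidence:
                rules.append((x, y, pair_support, conf))
    return rules


for x, y, sup, conf in pair_rules(transactions):
    print(f"if {x} then {y}: support={sup:.2f}, confidence={conf:.2f}")
```

Each printed line reads as an if-then rule. For larger datasets, Spark MLlib provides an `FPGrowth` estimator in `pyspark.ml.fpm` that mines frequent itemsets and rules without enumerating every pair.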

Collaborative filtering using alternating least squares

In machine learning, collaborative filtering is most commonly used in recommender systems. A recommender system is a technique for filtering information by taking user preferences into account. Based on user preference and taking into consideration their...
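
As a rough illustration of the "alternating" idea behind alternating least squares, the sketch below factorizes a tiny ratings table into one number per user and one number per item (a rank-1 model), solving for the user factors while the item factors are held fixed and then the reverse. This is a simplified, assumption-laden example in plain Python: the ratings, the function name `als_rank1`, the regularization value, and the restriction to a single latent factor are all choices made for this sketch and do not reflect the chapter's Spark implementation.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iterations=30):
    """Rank-1 alternating least squares on observed ratings only.

    ratings maps (user_index, item_index) -> rating.
    Returns (user_factors, item_factors); a prediction is their product.
    """
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iterations):
        # Hold item factors fixed; each user factor has a closed-form
        # regularized least-squares solution over that user's ratings.
        for usr in range(n_users):
            seen = [(i, r) for (u, i), r in ratings.items() if u == usr]
            num = sum(r * item_f[i] for i, r in seen)
            den = reg + sum(item_f[i] ** 2 for i, _ in seen)
            user_f[usr] = num / den
        # Hold user factors fixed; solve for each item factor likewise.
        for itm in range(n_items):
            seen = [(u, r) for (u, i), r in ratings.items() if i == itm]
            num = sum(r * user_f[u] for u, r in seen)
            den = reg + sum(user_f[u] ** 2 for u, _ in seen)
            item_f[itm] = num / den
    return user_f, item_f


# Hypothetical ratings: 3 users, 2 items; user 2 has not rated item 1.
ratings = {
    (0, 0): 1.0, (0, 1): 2.0,
    (1, 0): 2.0, (1, 1): 4.0,
    (2, 0): 3.0,
}
user_f, item_f = als_rank1(ratings, n_users=3, n_items=2)

# Predicted rating for the missing (user 2, item 1) cell.
predicted = user_f[2] * item_f[1]
print(round(predicted, 2))
```

Because the other users rate item 1 about twice as highly as item 0, the model predicts a rating for user 2 on item 1 that is roughly double that user's rating of item 0. Spark MLlib exposes a distributed, multi-factor version of this approach as the `ALS` estimator in `pyspark.ml.recommendation`.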

Real-world applications of unsupervised learning

Unsupervised learning algorithms are being used today to solve some real-world business challenges. We will take a look at a few such challenges in this section.

Clustering applications

This section presents some of the real-world business applications of clustering algorithms.

Customer segmentation

Retail marketing teams, as well as business-to-consumer organizations, are always trying to optimize their marketing spend. Marketing teams in particular are concerned with one specific metric called cost per acquisition (CPA). CPA indicates the amount that an organization needs to spend to acquire a single customer, and an optimal CPA means a better return on marketing investment. The best way to optimize CPA is via customer segmentation, as this improves the effectiveness of marketing campaigns. Traditional customer segmentation takes standard customer features such as demographic, geographic, and social information...

Summary

This chapter introduced you to unsupervised learning algorithms, as well as how to categorize unlabeled data and identify associations between data entities. Two main areas of unsupervised learning, namely clustering and association rules, were presented, along with the most popular clustering and collaborative filtering algorithms. You were also presented with working code examples of clustering algorithms such as K-means, bisecting K-means, LDA, and GMM in Spark MLlib, and you saw code examples for building a recommendation engine using the alternating least squares algorithm in Spark MLlib. Finally, a few real-world business applications of unsupervised learning algorithms were presented. We looked at several concepts, techniques, and code examples surrounding unsupervised learning algorithms so that you can train your models at scale using Spark MLlib.

So far, in this and the previous chapter, you have only explored the data wrangling...
