You're reading from Essential PySpark for Scalable Data Analytics

Product type: Book
Published in: Oct 2021
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800568877
Edition: 1st Edition
Languages: Python
Tools: PySpark
Concepts: Big Data
Author: Sreeram Nudurupati

Chapter 8: Unsupervised Machine Learning

In the previous two chapters, you were introduced to the supervised learning class of machine learning algorithms, their real-world applications, and how to implement them at scale using Spark MLlib. In this chapter, you will be introduced to the unsupervised learning category of machine learning, where you will learn about parametric and non-parametric unsupervised algorithms. A few real-world applications of clustering and association algorithms will be presented to show how unsupervised learning solves real-life problems, and you will gain a basic understanding of clustering and association problems. We will also look at the implementation details of a few clustering algorithms in Spark ML, such as K-means clustering, hierarchical clustering, and latent Dirichlet allocation, as well as alternating least squares, an algorithm used for collaborative filtering.

In this chapter, we'...

Technical requirements

In this chapter, we will be using Databricks Community Edition to run our code (https://community.cloud.databricks.com).

Sign-up instructions can be found at https://databricks.com/try-databricks.
The code for this chapter can be downloaded from https://github.com/PacktPublishing/Essential-PySpark-for-Data-Analytics/tree/main/Chapter08.
The datasets for this chapter can be found at https://github.com/PacktPublishing/Essential-PySpark-for-Data-Analytics/tree/main/data.

Introduction to unsupervised machine learning

Unsupervised learning is a machine learning technique where no guidance is available to the learning algorithm in the form of known label values in the training data. Unsupervised learning is useful in categorizing unknown data points into groups based on patterns, similarities, or differences that are inherent within the data, without any prior knowledge of the data.

In supervised learning, a model is trained on known data, and then inferences are drawn from the model using new, unseen data. On the other hand, in unsupervised learning, the model training process in itself is the end goal, where patterns hidden within the training data are discovered during the model training process. Unsupervised learning is harder compared to supervised learning since it is difficult to ascertain if the results of an unsupervised learning algorithm are meaningful without any external evaluation, especially without access to any correctly labeled data...

Clustering using machine learning

In machine learning, clustering deals with identifying patterns or structures within uncategorized data without needing any external guidance. Clustering algorithms parse the given data to identify clusters, or groups with matching patterns, that exist in the dataset. The result of a clustering algorithm is a set of clusters, each of which can be defined as a collection of objects that are similar in a certain way. The following diagram illustrates how clustering works:

Figure 8.1 – Clustering

In the preceding diagram, an uncategorized dataset is passed through a clustering algorithm, which categorizes the data into smaller clusters or groups based on each data point's proximity to other data points in a two-dimensional Euclidean space.

Thus, the clustering algorithm groups data based on the Euclidean distance between the data on a two-dimensional plane. Clustering algorithms consider the Euclidean distance...
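The idea of grouping by Euclidean distance can be illustrated with a minimal sketch in plain Python. This is not Spark MLlib code; the function names, the sample points, and the two candidate cluster centers are all hypothetical and chosen only for illustration. It shows a single assignment step, where each point is attached to its closest center.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign_to_nearest(points, centers):
    """Return, for each point, the index of the closest center."""
    return [
        min(range(len(centers)), key=lambda i: euclidean_distance(p, centers[i]))
        for p in points
    ]


# Hypothetical 2-D data: two loose groups and two candidate cluster centers.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers = [(1.0, 1.5), (8.5, 9.0)]

# Each entry is the index of the center that the matching point belongs to.
labels = assign_to_nearest(points, centers)
print(labels)
```

Iterative algorithms such as K-means repeat an assignment step like this one and then recompute the centers, but that loop is left out here to keep the sketch short.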

Building association rules using machine learning

Association rule mining is a data mining technique where the goal is to identify relationships between entities within a given dataset by finding entities that frequently occur together. Association rules are useful for making new item recommendations based on the relationship between existing items that frequently appear together. In data mining, association rules are expressed as if-then statements that help show the probability of relationships between entities. The association rules technique is widely used in recommender systems, market basket analysis, and affinity analysis problems.
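To make the "probability of relationships" concrete, here is a small sketch in plain Python that computes two measures commonly used to score a rule of the form "if antecedent, then consequent": support (how often the items appear together) and confidence (how often the consequent appears when the antecedent does). The basket data and function names are hypothetical, and this is not how Spark MLlib implements rule mining.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

# Rule under test: if a basket contains bread, then it also contains butter.
rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
print(rule_support, rule_confidence)
```

Here bread and butter appear together in two of the four baskets, and two of the three baskets containing bread also contain butter.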

Collaborative filtering using alternating least squares

In machine learning, collaborative filtering is most commonly used for recommender systems. A recommender system is a technique that's used to filter information by considering user preference. Based on user preference and taking into consideration their...
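The alternating least squares idea named in this section's title can be sketched in plain Python under strong simplifying assumptions: each user and each item gets a single latent factor (rank 1), and the ratings are a small hypothetical dictionary keyed by (user, item). With the item factors held fixed, each user factor has a closed-form least-squares solution, and vice versa, so the two updates alternate. This is an illustrative sketch only, not Spark MLlib's distributed implementation; all names and values are assumptions.

```python
def regularized_loss(ratings, u, v, lam):
    """Squared error on observed ratings plus an L2 penalty on the factors."""
    err = sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())
    penalty = lam * (sum(x * x for x in u) + sum(x * x for x in v))
    return err + penalty


def als_rank1(ratings, n_users, n_items, lam=0.1, iterations=10):
    """Rank-1 alternating least squares on a sparse ratings dictionary."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iterations):
        # Hold item factors fixed; solve each user factor in closed form.
        for i in range(n_users):
            num = sum(r * v[j] for (a, j), r in ratings.items() if a == i)
            den = sum(v[j] ** 2 for (a, j), _ in ratings.items() if a == i) + lam
            u[i] = num / den
        # Hold user factors fixed; solve each item factor in closed form.
        for j in range(n_items):
            num = sum(r * u[i] for (i, b), r in ratings.items() if b == j)
            den = sum(u[i] ** 2 for (i, b), _ in ratings.items() if b == j) + lam
            v[j] = num / den
    return u, v


# Hypothetical observed ratings: (user, item) -> rating. Missing pairs are unknown.
ratings = {
    (0, 0): 5.0, (0, 1): 4.0,
    (1, 0): 4.0, (1, 2): 1.0,
    (2, 1): 2.0, (2, 2): 1.0,
}

u, v = als_rank1(ratings, n_users=3, n_items=3)

# Predicted rating for a pair that was never observed: user 0, item 2.
predicted = u[0] * v[2]
print(predicted)
```

A single factor per user and item is rarely enough in practice; higher ranks replace the scalar divisions with small linear systems, which is omitted here for brevity.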

Real-world applications of unsupervised learning

Unsupervised learning algorithms are being used today to solve some real-world business challenges. We will take a look at a few such challenges in this section.

Clustering applications

This section presents some of the real-world business applications of clustering algorithms.

Customer segmentation

Retail marketing teams, as well as business-to-consumer organizations, are always trying to optimize their marketing spend. Marketing teams in particular are concerned with one specific metric called cost per acquisition (CPA). CPA indicates the amount that an organization needs to spend to acquire a single customer, and an optimal CPA means a better return on marketing investment. The best way to optimize CPA is via customer segmentation, as this improves the effectiveness of marketing campaigns. Traditional customer segmentation takes standard customer features such as demographic, geographic, and social information...

Summary

This chapter introduced you to unsupervised learning algorithms, as well as how to categorize unlabeled data and identify associations between data entities. Two main areas of unsupervised learning algorithms, namely clustering and association rules, were presented. You were introduced to the most popular clustering and collaborative filtering algorithms. You were also presented with working code examples of clustering algorithms such as K-means, bisecting K-means, LDA, and GMM in Spark MLlib. You also saw code examples for building a recommendation engine using the alternating least squares algorithm in Spark MLlib. Finally, a few real-world business applications of unsupervised learning algorithms were presented. We looked at several concepts, techniques, and code examples surrounding unsupervised learning algorithms so that you can train your models at scale using Spark MLlib.

So far, in this and the previous chapter, you have only explored the data wrangling...


Sreeram Nudurupati is a data analytics professional with years of experience in designing and optimizing data analytics pipelines at scale. He has a history of helping enterprises, as well as digital natives, build optimized analytics pipelines by using the knowledge of the organization, infrastructure environment, and current technologies.