You're reading from Essential PySpark for Scalable Data Analytics

Product type: Book
Published in: Oct 2021
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800568877
Edition: 1st Edition
Languages: Python
Tools: PySpark
Concepts: Big Data
Author: Sreeram Nudurupati

Chapter 7: Supervised Machine Learning

In the previous two chapters, you were introduced to the machine learning process, the various stages involved, and the first step of the process, namely feature engineering. Equipped with the fundamental knowledge of the machine learning process and with a usable set of machine learning features, you are ready to move on to the core part of the machine learning process, namely model training.

In this chapter, you will be introduced to the supervised learning category of machine learning algorithms, where you will learn about parametric and non-parametric algorithms, as well as gain the knowledge required to solve regression and classification problems using machine learning. Finally, you will implement a few regression algorithms using the Spark machine learning library, such as linear regression and decision trees, and a few classification algorithms such as logistic regression, naïve Bayes, and support vector machines. Tree ensemble...

Technical requirements

In this chapter, we will be using Databricks Community Edition to run our code (https://community.cloud.databricks.com).

Sign-up instructions can be found at https://databricks.com/try-databricks.
The code for this chapter can be downloaded from https://github.com/PacktPublishing/Essential-PySpark-for-Data-Analytics/tree/main/Chapter07.
The datasets for this chapter can be found at https://github.com/PacktPublishing/Essential-PySpark-for-Data-Analytics/tree/main/data.

Introduction to supervised machine learning

A machine learning problem can be viewed as a process in which an unknown variable is derived from a set of known variables using a mathematical or statistical function. What sets machine learning apart is that the algorithm learns this mapping function from a given dataset instead of having it specified in advance.

Supervised learning is a class of machine learning algorithms where a model is trained on a dataset in which the outcome for each set of inputs is already known. It is called supervised learning because the known outcomes act like a teacher, guiding the training process until the desired level of model performance is achieved. Supervised learning therefore requires data that is already labeled. Supervised learning algorithms can be further classified as parametric and non-parametric algorithms. We will look at these in the following sections.

Parametric machine learning

A machine learning algorithm that simplifies the learning process by summarizing the data with a fixed...

Regression

Regression is a supervised learning technique that helps us learn the relationship between a continuous output parameter called Label and a set of input parameters called Features. Regression produces machine learning models that predict a continuous label, given a feature vector. The concept of regression can be best explained using the following diagram:

Figure 7.1 – Linear regression

In the preceding diagram, the scatterplot represents data points spread across a two-dimensional space. The linear regression algorithm, being a parametric learning algorithm, assumes that the learning function will have a linear form. Thus, it learns the coefficients that are required to represent a straight line that approximately fits the data points on the scatterplot.
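To make the idea of learning coefficients concrete, the following minimal sketch fits a straight line to a handful of points using the closed-form least-squares formulas. It is plain Python rather than Spark MLlib, and the data points as well as the fit_line and predict names are made up for illustration; it is not necessarily how MLlib arrives at its coefficients.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through the point (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data points that lie roughly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(xs, ys)


def predict(x):
    """Predict a continuous label for a single feature value."""
    return slope * x + intercept
```

With these points, the fitted slope is about 1.97 and the intercept about 1.09, which is close to the line the data was made up to follow.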

Spark MLlib has distributed and scalable implementations of a few prominent regression algorithms, such as linear regression, decision trees, random forests, and gradient boosted trees...

Classification

Classification is another type of supervised learning technique, where the task is to categorize a given dataset into different classes. Machine learning classifiers learn a mapping function from input parameters called Features to a discrete output parameter called Label. Here, the learning function tries to predict which of several known classes the label belongs to. The following diagram depicts the concept of classification:

Figure 7.2 – Logistic regression

In the preceding diagram, a logistic regression algorithm is learning a mapping function that divides the data points in a two-dimensional space into two distinct classes. The learning algorithm learns the coefficients of a sigmoid function, which classifies a set of input parameters into one of two possible classes. This type of classification, where the data is split into two distinct classes, is known as binary classification or binomial classification.
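The role of the sigmoid function can be sketched in a few lines of plain Python. The coefficients and intercept below are hypothetical stand-ins for values that a training run would learn, and the function names and the 0.5 threshold are assumptions made for this illustration, not part of the Spark MLlib API.

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, coefficients, intercept):
    """Probability that the example belongs to class 1."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return sigmoid(z)


def predict_class(features, coefficients, intercept, threshold=0.5):
    """Assign class 1 when the probability reaches the threshold, else class 0."""
    probability = predict_proba(features, coefficients, intercept)
    return 1 if probability >= threshold else 0


# Hypothetical coefficients; in practice these are learned from labeled data.
coefficients = [1.5, -2.0]
intercept = 0.25

# z = 0.25 + 1.5 * 2.0 - 2.0 * 0.5 = 2.25, so the probability is above 0.5.
label_a = predict_class([2.0, 0.5], coefficients, intercept)
# z = 0.25 + 1.5 * 0.5 - 2.0 * 2.0 = -3.0, so the probability is below 0.5.
label_b = predict_class([0.5, 2.0], coefficients, intercept)
```

Here, label_a is assigned to class 1 and label_b to class 0, because the weighted sum of the features is positive in the first case and negative in the second.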

Logistic regression...

Tree ensembles

Non-parametric learning algorithms such as decision trees do not make any assumptions about the form of the function being learned and try to fit a model to the data at hand. However, decision trees run the risk of overfitting training data. Tree ensemble methods are a great way to leverage the benefits of decision trees while minimizing the risk of overfitting. Tree ensemble methods combine several decision trees to produce better-performing predictive models. Some popular tree ensemble methods include random forests and gradient boosted trees. We will explore how these ensemble methods can be used to build regression and classification models using Spark MLlib.
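The idea of combining several trees can be sketched in plain Python as follows. This is a deliberately simplified illustration, assuming one-level regression trees (stumps) on a single feature, each trained on a bootstrap sample and then averaged. A random forest additionally considers only a random subset of features at each split, which is omitted here. The data, function names, tree count, and seed are all made up for the example.

```python
import random
from statistics import mean


def fit_stump(points):
    """Fit a depth-1 regression tree: one threshold on x, one mean per side."""
    xs = sorted({x for x, _ in points})
    overall = mean(y for _, y in points)
    best = (float("inf"), None, overall, overall)  # (error, threshold, left, right)
    # Try a split halfway between each pair of neighbouring x values.
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in points if x <= threshold]
        right = [y for x, y in points if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    if threshold is None:  # all x values identical: predict the overall mean
        return lambda x: overall
    return lambda x: left_mean if x <= threshold else right_mean


def fit_ensemble(points, n_trees=25, seed=42):
    """Train each stump on a bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    stumps = [fit_stump([rng.choice(points) for _ in points])
              for _ in range(n_trees)]
    return lambda x: mean(stump(x) for stump in stumps)


# Hypothetical data: the label jumps from 1.0 to 5.0 once x reaches 5.
points = [(float(x), 1.0 if x < 5 else 5.0) for x in range(10)]
ensemble = fit_ensemble(points)
```

Calling ensemble(0.0) returns a value near the lower level and ensemble(9.0) a value near the upper level. Because every stump sees a different resampling of the data, the averaged prediction depends less on any single tree's choice of split.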

Regression using random forests

Random forests build multiple decision trees and merge them to produce a more accurate model and reduce the risk of overfitting. Random forests can be used to train regression models, as shown in the following code example:

from pyspark.ml.regression import RandomForestRegressor...

Real-world supervised learning applications

In the past, data science and machine learning were used largely for academic research purposes. However, over the past decade, this field has found its use in actual business applications, helping businesses find their competitive edge, improve overall business performance, and become more profitable. In this section, we will look at some real-world applications of machine learning.

Regression applications

This section presents some applications of machine learning regression models and shows how they help improve business performance.

Customer lifetime value estimation

In any retail or consumer packaged goods (CPG) business where customer churn is a huge factor, it is necessary to direct marketing spend at those customers who are profitable. In non-subscription businesses, typically 20% of the customer base generates up to 80% of revenue. Machine learning models can be leveraged to model and predict each customer's lifetime...

Summary

In this chapter, you were introduced to a class of machine learning algorithms called supervised learning algorithms, which can learn from well-labeled existing data. You explored the concepts of parametric and non-parametric learning algorithms and their pros and cons. Two major use cases of supervised learning algorithms, regression and classification, were presented. Model training examples, along with code from Spark MLlib, were explored so that we could look at a few prominent types of regression and classification models. Tree ensemble methods, which improve the stability, accuracy, and performance of decision tree models by combining several models and preventing overfitting, were also presented.

Finally, you explored some real-world business applications of the various machine learning models presented in this chapter. We explained how supervised learning can be leveraged for business use cases, and working code samples were presented to help you train your...


Author (1)

Sreeram Nudurupati

Sreeram Nudurupati is a data analytics professional with years of experience in designing and optimizing data analytics pipelines at scale. He has a history of helping enterprises, as well as digital natives, build optimized analytics pipelines by using the knowledge of the organization, infrastructure environment, and current technologies.