Chapter 6: Feature Engineering – Extraction, Transformation, and Selection

In the previous chapter, you were introduced to Apache Spark's native, scalable machine learning library, called MLlib, and you were provided with an overview of its major architectural components, including transformers, estimators, and pipelines.

This chapter takes you through the first stage of the scalable machine learning journey: feature engineering. Feature engineering is the process of deriving machine learning features from preprocessed, clean data so that the data is ready for machine learning. You will learn about the concepts of feature extraction, feature transformation, feature scaling, and feature selection, and you will implement these techniques using the algorithms available in Spark MLlib, along with code examples. By the end of this chapter, you will have learned the necessary techniques to implement scalable feature engineering pipelines that convert...

Technical requirements

In this chapter, we will be using the Databricks Community Edition to run our code. This can be found at https://community.cloud.databricks.com.

The machine learning process

A typical data analytics and data science process involves gathering raw data and then cleaning, consolidating, and integrating it. Following this, we apply statistical and machine learning techniques to the preprocessed data in order to generate a machine learning model and, finally, summarize and communicate the results to business stakeholders in the form of data products. A high-level overview of the machine learning process is presented in the following diagram:

Figure 6.1 – The data analytics and data science process

As you can see from the preceding diagram, the actual machine learning process is just a small portion of the entire data analytics process. Data teams spend most of their time curating and preprocessing data, and only a fraction of that time is devoted to building actual machine learning models.

The actual machine learning process involves stages that allow you to carry out...

Feature extraction

A machine learning model is equivalent to a function in mathematics or a method in computer programming. A machine learning model takes one or more parameters or variables as input and yields an output, called a prediction. In machine learning terminology, these input parameters or variables are called features. A feature is a column of the input dataset within a machine learning algorithm or model. A feature is a measurable data point, such as an individual's name, gender, or age, or it can be time-related data, weather, or some other piece of data that is useful for analysis.

Machine learning algorithms leverage linear algebra, a field of mathematics, and use structures such as matrices and vectors to represent data, both internally and in the code-level implementation of algorithms. Real-world data, even after undergoing the data engineering process, rarely arrives in the form of matrices and vectors. Therefore, the feature engineering...
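To make this concrete, here is a minimal sketch (the DataFrame and column names are illustrative, and spark refers to the SparkSession that a Databricks notebook provides) that uses MLlib's VectorAssembler to combine raw numeric columns into the single feature vector that most MLlib algorithms expect:

from pyspark.ml.feature import VectorAssembler

# Illustrative DataFrame with raw numeric columns
raw_df = spark.createDataFrame(
    [(34, 65000.0, 2), (29, 48000.0, 5)],
    ["age", "income", "num_purchases"])

# Combine the raw columns into a single vector column named "features"
assembler = VectorAssembler(
    inputCols=["age", "income", "num_purchases"],
    outputCol="features")
features_df = assembler.transform(raw_df)
features_df.show(truncate=False)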

Feature transformation

Feature transformation is the process of carefully reviewing the various variable types, such as categorical variables and continuous variables, present in the training data and determining the best type of transformation to achieve optimal model performance. This section will describe, with code examples, how to transform a few common types of variables found in machine learning datasets, such as text and numerical variables.

Transforming categorical variables

Categorical variables are pieces of data that have discrete values with a limited, finite range. They are usually text-based, but they can also be numerical. Examples include country codes and the months of the year. The previous section covered a few techniques for extracting features from text variables; in this section, we will explore a few other algorithms for transforming categorical variables, starting with the sketch that follows.
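As a brief sketch of what such a transformation can look like (the data and column names are hypothetical), StringIndexer maps each distinct category to a numeric index, and OneHotEncoder then expands that index into a binary vector:

from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Hypothetical DataFrame with a categorical country-code column
df = spark.createDataFrame(
    [("US",), ("IN",), ("US",), ("DE",)], ["country_code"])

# Map each distinct category to a numeric index
indexer = StringIndexer(inputCol="country_code", outputCol="country_index")
indexed_df = indexer.fit(df).transform(df)

# Expand the index into a sparse one-hot encoded vector
encoder = OneHotEncoder(inputCols=["country_index"], outputCols=["country_vec"])
encoded_df = encoder.fit(indexed_df).transform(indexed_df)
encoded_df.show(truncate=False)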

The tokenization of text into individual terms

The Tokenizer class...
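A minimal usage sketch (the data and column names are illustrative): Tokenizer lowercases a text column and splits it on whitespace into an array of terms:

from pyspark.ml.feature import Tokenizer

sentences_df = spark.createDataFrame(
    [(1, "Spark makes feature engineering scalable")],
    ["id", "sentence"])

# Split the lowercased sentence into individual terms
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokens_df = tokenizer.transform(sentences_df)
tokens_df.select("words").show(truncate=False)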

Feature selection

Feature selection is a technique for reducing the number of features used in the machine learning process, so that the model trains on less data while maintaining, or even improving, its accuracy. It is the process of either automatically or manually selecting only those features that contribute the most to the prediction variable that you are interested in. Feature selection is an important aspect of machine learning, as irrelevant or semi-relevant features can adversely affect model accuracy.

Apache Spark MLlib comes packaged with a few feature selectors, including VectorSlicer, ChiSqSelector, UnivariateFeatureSelector, and VarianceThresholdSelector. Let's explore how to implement feature selection within Apache Spark using the following code example, which utilizes ChiSqSelector to select the optimal features given the label column that we are trying to predict (the column names and parameter values shown here are illustrative):

from pyspark.ml.feature import ChiSqSelector

# Column names and the numTopFeatures value are illustrative
chisq_selector = ChiSqSelector(numTopFeatures=5, featuresCol="features",
    labelCol="label", outputCol="selected_features")
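As with other MLlib estimators, the selector is first fit on labeled training data and then used to transform it. A minimal usage sketch, assuming a training DataFrame named train_df with features and label columns:

# Fit the selector on labeled training data and apply it
selector_model = chisq_selector.fit(train_df)
selected_df = selector_model.transform(train_df)
selected_df.select("selected_features", "label").show(5)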

Feature store as a central feature repository

A large percentage of the time spent on any machine learning problem goes into data cleansing and data wrangling, to ensure that models are built on clean and meaningful data. Feature engineering is another critical part of the machine learning process, and curating machine learning features is complex and time-consuming work that occupies a huge chunk of a data scientist's time. It is therefore counter-intuitive to have to create the same features again and again for each new machine learning problem.

Typically, feature engineering takes place on existing historical data, and the resulting features are readily reusable across different machine learning problems. In fact, data scientists spend a good amount of time searching for the right features for the problem at hand. So, it would be tremendously beneficial to have a centralized repository of features that is searchable and carries metadata to identify features. This central repository...

Delta Lake as an offline feature store

In Chapter 3, Data Cleansing and Integration, we established data lakes as the scalable and relatively inexpensive choice for the long-term storage of historical data. Some challenges with the reliability of cloud-based data lakes were presented, and you learned how Delta Lake has been designed to overcome them. The benefits of Delta Lake as an abstraction layer on top of cloud-based data lakes extend beyond data engineering workloads to data science workloads as well, and we will explore those benefits in this section.

Delta Lake is an ideal candidate for an offline feature store on cloud-based data lakes because of the data reliability features and the novel time travel capabilities that it offers. We will discuss these in the following sections.
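For instance, time travel makes it possible to reproduce the exact feature values that a model was trained on by reading an earlier version of a feature table. A minimal sketch, assuming a Delta table stored at a hypothetical path:

# Read the current state of the feature table
current_df = spark.read.format("delta").load("/mnt/feature-store/features")

# Time travel: read the table as of an earlier version number...
v1_df = (spark.read.format("delta")
    .option("versionAsOf", 1)
    .load("/mnt/feature-store/features"))

# ...or as of a point in time
dated_df = (spark.read.format("delta")
    .option("timestampAsOf", "2021-06-01")
    .load("/mnt/feature-store/features"))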

Structure and metadata with Delta tables

Delta Lake supports structured data with well-defined data types for columns. This makes Delta tables strongly typed...
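As a brief, hypothetical illustration (assuming a feature DataFrame named features_df and an illustrative storage path), writing a DataFrame out in Delta format captures its schema, which Delta Lake then enforces on subsequent writes:

# Persist a feature DataFrame as a Delta table; its schema is stored with the data
(features_df.write.format("delta")
    .mode("overwrite")
    .save("/mnt/feature-store/customer_features"))

# Register the table in the metastore so it is discoverable by name
spark.sql("""
    CREATE TABLE IF NOT EXISTS customer_features
    USING DELTA
    LOCATION '/mnt/feature-store/customer_features'
""")

# A later append whose columns or types do not match the stored schema
# fails with an error instead of silently corrupting the table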

Summary

In this chapter, you learned about the concept of feature engineering and why it is an important part of the whole machine learning process. Additionally, you learned why features must be created before machine learning models can be trained.

You explored various feature engineering techniques such as feature extraction and how they can be used to convert text-based data into features. Feature transformation techniques useful in dealing with categorical and continuous variables were introduced, and examples of how to convert them into features were presented. You also explored feature scaling techniques that are useful for normalizing features to help prevent some features from unduly biasing the trained model.

Finally, you were introduced to techniques for selecting the right features to optimize the model performance for the label being predicted via feature selection techniques. The skills learned in this chapter will help you to implement scalable and performant feature...
