Chapter 6: Feature Engineering – Extraction, Transformation, and Selection

In the previous chapter, you were introduced to Apache Spark's native, scalable machine learning library, called MLlib, and you were provided with an overview of its major architectural components, including transformers, estimators, and pipelines.

This chapter takes you through the first stage of the scalable machine learning journey: feature engineering. Feature engineering is the process of deriving machine learning features from preprocessed, clean data so that the data is ready for machine learning. You will learn about the concepts of feature extraction, feature transformation, feature scaling, and feature selection, and you will implement these techniques using the algorithms available in Spark MLlib, along with code examples. By the end of this chapter, you will have learned the necessary techniques to implement scalable feature engineering pipelines that convert...

Technical requirements

In this chapter, we will be using the Databricks Community Edition to run our code. This can be found at https://community.cloud.databricks.com.

The machine learning process

A typical data analytics and data science process involves gathering raw data and then cleaning, consolidating, and integrating it. Following this, we apply statistical and machine learning techniques to the preprocessed data in order to generate a machine learning model and, finally, summarize and communicate the results to business stakeholders in the form of data products. A high-level overview of the machine learning process is presented in the following diagram:

Figure 6.1 – The data analytics and data science process

As you can see from the preceding diagram, the actual machine learning process is just a small portion of the entire data analytics process. Data teams spend most of their time curating and preprocessing data, and only a fraction of that time is devoted to building actual machine learning models.

The actual machine learning process involves stages that allow you to carry out...

Feature extraction

A machine learning model is equivalent to a function in mathematics or a method in computer programming. A machine learning model takes one or more parameters or variables as input and yields an output, called a prediction. In machine learning terminology, these input parameters or variables are called features. A feature is a column of the input dataset within a machine learning algorithm or model. A feature is a measurable data point, such as an individual's name, gender, or age, or it can be time-related data, weather, or some other piece of data that is useful for analysis.

Machine learning algorithms leverage linear algebra, a field of mathematics, and use structures such as matrices and vectors to represent data, both internally and in the code-level implementation of algorithms. Real-world data, even after undergoing the data engineering process, rarely arrives in the form of matrices and vectors. Therefore, the feature engineering...
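To make this concrete, here is a minimal sketch (the DataFrame and column names are illustrative, and spark refers to the SparkSession that a Databricks notebook provides) that uses MLlib's VectorAssembler to combine raw numeric columns into the single feature vector that most MLlib algorithms expect:

from pyspark.ml.feature import VectorAssembler

# Illustrative DataFrame with raw numeric columns
raw_df = spark.createDataFrame(
    [(34, 65000.0, 2), (29, 48000.0, 5)],
    ["age", "income", "num_purchases"])

# Combine the raw columns into a single vector column named "features"
assembler = VectorAssembler(
    inputCols=["age", "income", "num_purchases"],
    outputCol="features")
features_df = assembler.transform(raw_df)
features_df.show(truncate=False)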

Feature transformation

Feature transformation is the process of carefully reviewing the various variable types, such as categorical variables and continuous variables, present in the training data and determining the best type of transformation to achieve optimal model performance. This section will describe, with code examples, how to transform a few common types of variables found in machine learning datasets, such as text and numerical variables.

Transforming categorical variables

Categorical variables are pieces of data that have discrete values with a limited, finite range. They are usually text-based, but they can also be numerical. Examples include country codes and the months of the year. The previous section covered a few techniques for extracting features from text variables; in this section, we will explore a few other algorithms for transforming categorical variables, starting with the sketch that follows.
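As a brief sketch of what such a transformation can look like (the data and column names are hypothetical), StringIndexer maps each distinct category to a numeric index, and OneHotEncoder then expands that index into a binary vector:

from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Hypothetical DataFrame with a categorical country-code column
df = spark.createDataFrame(
    [("US",), ("IN",), ("US",), ("DE",)], ["country_code"])

# Map each distinct category to a numeric index
indexer = StringIndexer(inputCol="country_code", outputCol="country_index")
indexed_df = indexer.fit(df).transform(df)

# Expand the index into a sparse one-hot encoded vector
encoder = OneHotEncoder(inputCols=["country_index"], outputCols=["country_vec"])
encoded_df = encoder.fit(indexed_df).transform(indexed_df)
encoded_df.show(truncate=False)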

The tokenization of text into individual terms

The Tokenizer class...
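A minimal usage sketch (the data and column names are illustrative): Tokenizer lowercases a text column and splits it on whitespace into an array of terms:

from pyspark.ml.feature import Tokenizer

sentences_df = spark.createDataFrame(
    [(1, "Spark makes feature engineering scalable")],
    ["id", "sentence"])

# Split the lowercased sentence into individual terms
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokens_df = tokenizer.transform(sentences_df)
tokens_df.select("words").show(truncate=False)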

Feature selection

Feature selection is a technique for reducing the number of features used in the machine learning process, so that the model trains on less data while maintaining, or even improving, its accuracy. It is the process of either automatically or manually selecting only those features that contribute the most to the prediction variable that you are interested in. Feature selection is an important aspect of machine learning, as irrelevant or semi-relevant features can adversely affect model accuracy.

Apache Spark MLlib comes packaged with a few feature selectors, including VectorSlicer, ChiSqSelector, UnivariateFeatureSelector, and VarianceThresholdSelector. Let's explore how to implement feature selection within Apache Spark using the following code example, which utilizes ChiSqSelector to select the optimal features given the label column that we are trying to predict (the column names and parameter values shown here are illustrative):

from pyspark.ml.feature import ChiSqSelector

# Column names and the numTopFeatures value are illustrative
chisq_selector = ChiSqSelector(numTopFeatures=5, featuresCol="features",
    labelCol="label", outputCol="selected_features")
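As with other MLlib estimators, the selector is first fit on labeled training data and then used to transform it. A minimal usage sketch, assuming a training DataFrame named train_df with features and label columns:

# Fit the selector on labeled training data and apply it
selector_model = chisq_selector.fit(train_df)
selected_df = selector_model.transform(train_df)
selected_df.select("selected_features", "label").show(5)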

Feature store as a central feature repository

A large percentage of the time spent on any machine learning problem goes into data cleansing and data wrangling, to ensure that models are built on clean and meaningful data. Feature engineering is another critical part of the machine learning process, and curating machine learning features is complex and time-consuming work that occupies a huge chunk of a data scientist's time. It is therefore counter-intuitive to have to create the same features again and again for each new machine learning problem.

Typically, feature engineering takes place on existing historical data, and the resulting features are readily reusable across different machine learning problems. In fact, data scientists spend a good amount of time searching for the right features for the problem at hand. So, it would be tremendously beneficial to have a centralized repository of features that is searchable and carries metadata to identify features. This central repository...

Delta Lake as an offline feature store

In Chapter 3, Data Cleansing and Integration, we established data lakes as the scalable and relatively inexpensive choice for the long-term storage of historical data. Some challenges with the reliability of cloud-based data lakes were presented, and you learned how Delta Lake has been designed to overcome them. The benefits of Delta Lake as an abstraction layer on top of cloud-based data lakes extend beyond data engineering workloads to data science workloads as well, and we will explore those benefits in this section.

Delta Lake is an ideal candidate for an offline feature store on cloud-based data lakes because of the data reliability features and the novel time travel capabilities that it offers. We will discuss these in the following sections.
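For instance, time travel makes it possible to reproduce the exact feature values that a model was trained on by reading an earlier version of a feature table. A minimal sketch, assuming a Delta table stored at a hypothetical path:

# Read the current state of the feature table
current_df = spark.read.format("delta").load("/mnt/feature-store/features")

# Time travel: read the table as of an earlier version number...
v1_df = (spark.read.format("delta")
    .option("versionAsOf", 1)
    .load("/mnt/feature-store/features"))

# ...or as of a point in time
dated_df = (spark.read.format("delta")
    .option("timestampAsOf", "2021-06-01")
    .load("/mnt/feature-store/features"))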

Structure and metadata with Delta tables

Delta Lake supports structured data with well-defined data types for columns. This makes Delta tables strongly typed...
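As a brief, hypothetical illustration (assuming a feature DataFrame named features_df and an illustrative storage path), writing a DataFrame out in Delta format captures its schema, which Delta Lake then enforces on subsequent writes:

# Persist a feature DataFrame as a Delta table; its schema is stored with the data
(features_df.write.format("delta")
    .mode("overwrite")
    .save("/mnt/feature-store/customer_features"))

# Register the table in the metastore so it is discoverable by name
spark.sql("""
    CREATE TABLE IF NOT EXISTS customer_features
    USING DELTA
    LOCATION '/mnt/feature-store/customer_features'
""")

# A later append whose columns or types do not match the stored schema
# fails with an error instead of silently corrupting the table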

Summary

In this chapter, you learned about the concept of feature engineering and why it is an important part of the whole machine learning process. Additionally, you learned why features must be created before machine learning models can be trained.

You explored various feature engineering techniques such as feature extraction and how they can be used to convert text-based data into features. Feature transformation techniques useful in dealing with categorical and continuous variables were introduced, and examples of how to convert them into features were presented. You also explored feature scaling techniques that are useful for normalizing features to help prevent some features from unduly biasing the trained model.

Finally, you were introduced to techniques for selecting the right features to optimize the model performance for the label being predicted via feature selection techniques. The skills learned in this chapter will help you to implement scalable and performant feature...
