Feature Selection and Feature Engineering

Feature selection – also known as variable selection, attribute selection, or variable subset selection – is a method used to select a subset of features (variables, dimensions) from an initial dataset. Feature selection is a key step in building machine learning models and can have a huge impact on a model's performance. Using correct and relevant features as the input to your model also reduces the chance of overfitting, because the model has less opportunity to rely on noisy features that add no signal. Finally, having fewer input features decreases the time it takes to train a model. Learning which features to select is a skill that data scientists usually develop over months and years of experience, and it can be more of an art than a science (a brief code sketch follows the list below). Feature selection is important because it can:

  • Shorten training times...
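The list above is truncated in this excerpt, but as a minimal, hedged sketch of automated feature selection (not code from the book), scikit-learn's SelectKBest can score each feature and keep only the strongest ones. The synthetic dataset and the choice of k=5 below are assumptions made purely for illustration:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic dataset: 20 features, only 5 of which carry real signal (assumed for illustration).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=2, random_state=0)

# Keep the 5 features with the highest ANOVA F-scores.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print("Original shape:", X.shape)           # (500, 20)
print("Reduced shape: ", X_selected.shape)  # (500, 5)
print("Kept feature indices:", selector.get_support(indices=True))

Fewer columns means faster training and fewer opportunities for the model to latch onto noise.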

Feature selection

In the previous chapter, we explored the components of a machine learning pipeline. A critical component of the pipeline is deciding which features will be used as inputs to the model. For many models, a small subset of the input variables provides the lion's share of the predictive ability. In most datasets, a few features are responsible for the majority of the information signal, while the rest are mostly noise.

It is important to lower the number of input features for a variety of reasons, including:

  • Reducing the multicollinearity of the input features makes the machine learning model's parameters easier to interpret. Multicollinearity (also collinearity) is a phenomenon in which one predictor feature in a regression model can be linearly predicted from the other features with a substantial degree of accuracy (a short detection sketch follows this list).
  • Reducing the time required to run the model...
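The chapter's own list of reasons is truncated here, but as a minimal sketch (not from the book's text), one common way to spot highly collinear features is to inspect the pairwise correlation matrix with pandas. The column names, the synthetic data, and the 0.9 threshold are illustrative assumptions:

import numpy as np
import pandas as pd

# Hypothetical housing features; 'sqm' is almost a linear copy of 'sqft' by construction.
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 3000, size=200)
df = pd.DataFrame({
    "sqft": sqft,
    "sqm": sqft * 0.0929 + rng.normal(0, 5, size=200),
    "bedrooms": rng.integers(1, 5, size=200),
})

# Absolute correlations of the upper triangle (each pair counted once).
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Flag pairs above an (assumed) 0.9 threshold as candidates for removal.
print(pairs[pairs > 0.9])   # expected to flag ('sqft', 'sqm')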

Feature engineering

According to a recent survey performed by the folks at Forbes, data scientists spend around 80% of their time on data preparation:

Figure 4: Breakdown of time spent by data scientists (source: Forbes)

This statistic highlights the importance of data preparation and feature engineering in data science.

Just as judicious and systematic feature selection can make models faster and more performant by removing features, feature engineering can accomplish the same by adding new features. This seems contradictory at first blush, but the features being added are not the ones removed by the feature selection process; they are features that might not have been in the initial dataset at all. You might have the most powerful and well-designed machine learning algorithm in the world, but if your input features are not relevant, you will never be able to produce useful results. Let's analyze a couple of simple examples to get...
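As a small, hypothetical illustration (not one of the book's worked examples), a new feature such as price per square foot can be derived from columns that already exist; the column names and values below are made up:

import pandas as pd

# Made-up listings; 'price' and 'sqft' are assumed column names.
listings = pd.DataFrame({
    "price": [250_000, 320_000, 185_000],
    "sqft": [1_800, 2_400, 1_100],
    "bedrooms": [3, 4, 2],
})

# Engineer a feature that was not in the initial dataset.
listings["price_per_sqft"] = listings["price"] / listings["sqft"]
print(listings)

A ratio like this can carry more signal for some models than either raw column on its own.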

Outlier management

Home prices are a good domain to analyze to understand why we need to pay special attention to outliers. Regardless of what region of the world you live in, most houses in your neighborhood fall within a certain price range and share certain characteristics. Maybe something like this:

  • 1 to 4 bedrooms
  • 1 kitchen
  • 500 to 3000 square feet
  • 1 to 3 bathrooms

The average home price in the US in 2019 was about $226,800, and you can guess that a house at that price will probably share some of the characteristics above. But there might also be a couple of houses that are outliers, maybe a house with 10 or 20 bedrooms. Some of these houses might be worth 1 million or 10 million dollars, depending on how many extravagant customizations they have. As you might imagine, these outliers are going to distort the mean of the dataset, and the more extreme they are, the further they pull the average away from the typical home. For this reason, and given that there are not too many of these...
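The discussion above is truncated, but as a minimal sketch of one common way to manage outliers (not necessarily the approach the book goes on to describe), the snippet below clips home prices to an interquartile-range fence. The prices are made up for illustration:

import pandas as pd

# Made-up home prices with two extreme outliers.
prices = pd.Series([210_000, 225_000, 198_000, 240_000, 260_000,
                    1_000_000, 10_000_000])

# Tukey-style fences: 1.5 * IQR beyond the first and third quartiles.
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print("Mean before clipping:", prices.mean())
print("Mean after clipping: ", prices.clip(lower, upper).mean())

Clipping (or simply removing) the extreme values keeps a handful of mansions from dragging the average far away from the typical home.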

One-hot encoding

One-hot encoding is an often-used feature engineering technique in machine learning. Some machine learning algorithms cannot handle categorical features, so one-hot encoding is a way to convert these categorical features into numerical features. Let's say that you have a feature labeled "status" that can take one of three values (red, green, or yellow). Because these values are categorical, there is no concept of one value being higher or lower than another. We could convert these values to numerical values, but that would impose such an ordering. For example:

  • Yellow = 1
  • Red = 2
  • Green = 3

But this seems somewhat arbitrary. If we knew that red is bad and green is good, and yellow is somewhere in the middle, we might change the mapping to something like:

  • Red = -1
  • Yellow = 0
  • Green = 1

And that might produce better performance. But now let's see how this example can be one-hot encoded. To achieve the one-hot encoding of...
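The book's walkthrough is truncated above, but as a hedged sketch of the general technique, pandas' get_dummies produces one binary indicator column per category, with no implied ordering (the feature name and values mirror the example in the text):

import pandas as pd

# The 'status' feature from the example above.
df = pd.DataFrame({"status": ["red", "green", "yellow", "red"]})

# One-hot encode: one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["status"], dtype=int)
print(encoded)
#    status_green  status_red  status_yellow
# 0             0           1              0
# 1             1           0              0
# 2             0           0              1
# 3             0           1              0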

Log transform

Logarithm transformation (or log transform) is a common feature engineering transformation. A log transform helps flatten highly skewed values; after it is applied, the distribution of the data typically becomes much closer to normal.

Let's go over another example to again gain some intuition. Remember when you were 10 years old, looking at 15-year-old boys and girls and thinking, "They are so much older than me!"? Now think of a 50-year-old person and another who is 55. In this case, you might think that the age difference is not that much. In both cases, the age difference is 5 years. However, in the first case the 15-year-old is 50 percent older than the 10-year-old, while in the second case the 55-year-old is only 10 percent older than the 50-year-old.

If we apply a log transform to all of these data points, it evens out these relative magnitude differences, as the following sketch shows.
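A quick numeric check of that intuition (a minimal sketch using natural logarithms; the ages come from the example above):

import numpy as np

ages = np.array([10, 15, 50, 55])
log_ages = np.log(ages)

# Both raw gaps are 5 years, but on the log scale the gap reflects relative change:
# 10 -> 15 is a much bigger jump than 50 -> 55.
print(log_ages[1] - log_ages[0])   # ~0.405, i.e. log(15/10)
print(log_ages[3] - log_ages[2])   # ~0.095, i.e. log(55/50)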

Applying a log transform also decreases the effect of the outliers, due to the normalization...

Scaling

In many instances, numerical features in a dataset vary greatly in scale from other features. For example, the typical square footage of a house might be a number between 1,000 and 3,000 square feet, whereas 2, 3, or 4 is a more typical number of bedrooms. If we leave these values as they are, the features with the larger scale might implicitly be given a higher weighting. How can this issue be fixed?

Scaling is a way to solve this problem: after scaling is applied, continuous features become comparable in terms of their range. Not all algorithms require scaled values (Random Forest comes to mind), but other algorithms will produce meaningless results if the dataset is not scaled beforehand (k-nearest neighbors and k-means are examples). We will now explore the two most common scaling methods.

Normalization (or min-max normalization) scales all values of a feature to a fixed range between 0 and 1. More formally, each value for...
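The formal definition is elided in this excerpt, but as a minimal sketch, min-max normalization maps each value to (x - min) / (max - min) column by column; scikit-learn's MinMaxScaler implements this. The tiny square-footage/bedroom matrix below is an assumption for illustration:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Square footage and bedroom counts live on very different scales.
X = np.array([[1000, 2],
              [1800, 3],
              [3000, 4]], dtype=float)

# x' = (x - min) / (max - min), applied to each column independently.
scaled = MinMaxScaler().fit_transform(X)
print(scaled)
# [[0.   0. ]
#  [0.4  0.5]
#  [1.   1. ]]

After scaling, both columns lie in [0, 1] and are comparable in range.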

Date manipulation

Time features can be of critical importance for some data science problems. In time series analysis, dates are obviously critical. Predicting that the S&P 500 is going to 3,000 means nothing if you don't attach a date to the prediction.

Dates without any processing might not provide much value to most models; the raw values are too unique to offer any predictive power. Why is 10/21/2019 different from 10/19/2019? If we apply some domain knowledge, we might be able to greatly increase the information value of the feature. For example, converting the date to a categorical variable might help. If the target you are trying to predict is when rent is going to get paid, convert the date to a binary value where the possible values are (a brief code sketch follows the list):

  • Before the 5th of the month = 1
  • After the 5th of the month = 0
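A minimal sketch of this kind of date transformation with pandas (the day-5 cutoff mirrors the rent example above; the DataFrame and column names are assumptions):

import pandas as pd

# Hypothetical rent-payment dates.
payments = pd.DataFrame({
    "paid_on": pd.to_datetime(["2019-10-03", "2019-10-19",
                               "2019-11-02", "2019-11-21"]),
})

# Binary feature: 1 if paid before the 5th of the month, 0 otherwise.
payments["before_5th"] = (payments["paid_on"].dt.day < 5).astype(int)

# Other commonly derived date features.
payments["month"] = payments["paid_on"].dt.month
payments["day_of_week"] = payments["paid_on"].dt.dayofweek
print(payments)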

If you are asked to predict foot traffic and sales at a restaurant, there might not be any...

Summary

In this chapter we analyzed two important steps in the machine learning pipeline:

  • Feature selection
  • Feature engineering

As we saw, these two processes are currently as much an art as they are a science. Picking a model to use in the pipeline is potentially an easier task than deciding which features to drop and which features to generate and add to the model. This chapter is not meant to be a comprehensive analysis of feature selection and feature engineering; rather, it's a small taste, and hopefully it whets your appetite to explore the topic further.

In the next chapter, we'll start getting into the meat of machine learning. We will be building machine learning models starting with supervised learning models.
