Python Feature Engineering Cookbook

Over 70 recipes for creating, engineering, and transforming features to build machine learning models

Product type Paperback
Published in Jan 2020
Publisher Packt
ISBN-13 9781789806311
Length 372 pages
Edition 1st Edition
Author (1):
Soledad Galli

Table of Contents

Preface
1. Foreseeing Variable Problems When Building ML Models
2. Imputing Missing Data
3. Encoding Categorical Variables
4. Transforming Numerical Variables
5. Performing Variable Discretization
6. Working with Outliers
7. Deriving Features from Dates and Time Variables
8. Performing Feature Scaling
9. Applying Mathematical Computations to Features
10. Creating Features with Transactional and Time Series Data
11. Extracting Features from Text Variables
12. Other Books You May Enjoy

Implementing mode or frequent category imputation

Mode imputation consists of replacing missing values with the mode. We normally use this procedure in categorical variables, hence the frequent category imputation name. Frequent categories are estimated using the train set and then used to impute values in train, test, and future datasets. Thus, we need to learn and store these parameters, which we can do using scikit-learn and Feature-engine's transformers; in the following recipe, we will learn how to do so.

If the percentage of missing values is high, frequent category imputation may distort the original distribution of categories.
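
To make the distortion concrete, here is a minimal sketch on a made-up toy variable (not the Credit Approval data) in which 40% of the values are missing. It compares the share of each category before and after mode imputation; the series and the names s, before, and after are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy categorical variable (made up for illustration): 4 of 10 values are missing.
s = pd.Series(["a", "a", "a", "b", "b", "c", np.nan, np.nan, np.nan, np.nan])

# Share of each category among the observed values only.
before = s.value_counts(normalize=True)

# Mode imputation: every missing value becomes the most frequent category.
after = s.fillna(s.mode()[0]).value_counts(normalize=True)

# Side-by-side comparison of the category proportions.
comparison = pd.DataFrame({"before": before, "after": after})
print(comparison)
```

In this toy case, the most frequent category grows from half of the observed values to 70% of all values, while the other categories shrink accordingly.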

How to do it...

To begin, let's make a few imports and prepare the data:

  1. Let's import pandas and the required functions and classes from scikit-learn and Feature-engine:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.missing_data_imputers import FrequentCategoryImputer
  2. Let's load the dataset:
data = pd.read_csv('creditApprovalUCI.csv')
  3. Frequent categories should be calculated using the train set variables, so let's separate the data into train and test sets and their respective targets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3,
    random_state=0)
Remember that you can check the percentage of missing values in the train set with X_train.isnull().mean().
  4. Let's replace missing values with the frequent category, that is, the mode, in four categorical variables:
for var in ['A4', 'A5', 'A6', 'A7']:
    value = X_train[var].mode()[0]
    X_train[var] = X_train[var].fillna(value)
    X_test[var] = X_test[var].fillna(value)

Note how we calculate the mode in the train set and use that value to replace the missing data in the train and test sets.

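By default, pandas fillna() returns a new Series or dataframe with the imputed values and leaves the original untouched. fillna() also accepts inplace=True, as in X_train[var].fillna(value, inplace=True), but in recent pandas versions calling it on a column selected with X_train[var] may not update the parent dataframe, so reassigning the result, as we did in step 4, is the safer pattern.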
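
As an alternative to the loop, fillna() also accepts a dictionary that maps column names to replacement values, so the modes of several variables can be learned and applied in one call. The following is a minimal sketch on made-up toy dataframes; the names train, test, and modes are assumptions for illustration and not part of the recipe.

```python
import numpy as np
import pandas as pd

# Toy train and test sets (made up for illustration).
train = pd.DataFrame({"A4": ["u", "u", "y", np.nan],
                      "A5": ["g", np.nan, "g", "p"]})
test = pd.DataFrame({"A4": [np.nan, "y"],
                     "A5": ["p", np.nan]})

# mode() returns one row per tied mode; the first row holds the
# first mode of each column. The modes come from the train set only.
modes = train.mode().iloc[0].to_dict()

# fillna() returns new dataframes; train and test are left untouched.
train_imputed = train.fillna(modes)
test_imputed = test.fillna(modes)

print(modes)
print(test_imputed)
```

Because the result is assigned to new names, the original dataframes keep their missing values, which makes it easy to compare them before and after imputation.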

Now, let's impute missing values by the most frequent category using scikit-learn.

  5. First, let's separate the original dataset into train and test sets and only retain the categorical variables:
X_train, X_test, y_train, y_test = train_test_split(
    data[['A4', 'A5', 'A6', 'A7']], data['A16'], test_size=0.3,
    random_state=0)
  6. Let's create a frequent category imputer with SimpleImputer() from scikit-learn:
imputer = SimpleImputer(strategy='most_frequent')
SimpleImputer() from scikit-learn will learn the mode for numerical and categorical variables alike. But in practice, mode imputation is done for categorical variables only.
  7. Let's fit the imputer to the train set so that it learns the most frequent values:
imputer.fit(X_train)
  8. Let's inspect the most frequent values learned by the imputer:
imputer.statistics_

The most frequent values are stored in the statistics_ attribute of the imputer, as follows:

array(['u', 'g', 'c', 'v'], dtype=object)
  9. Let's replace missing values with frequent categories:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
Note that SimpleImputer() will return a NumPy array and not a pandas dataframe.
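
If a dataframe is needed downstream, one option is to wrap the returned array with the original column names and index. This is a minimal sketch on a made-up toy dataframe rather than the recipe's data; the names train and train_imputed are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train set (made up for illustration) with a non-default index.
train = pd.DataFrame({"A4": ["u", "u", np.nan],
                      "A5": ["g", np.nan, "g"]},
                     index=[10, 11, 12])

imputer = SimpleImputer(strategy="most_frequent")
imputer.fit(train)

# transform() returns a NumPy array, so we restore the column
# names and the index by building a new dataframe around it.
train_imputed = pd.DataFrame(imputer.transform(train),
                             columns=train.columns,
                             index=train.index)

print(train_imputed)
```

Keeping the original index matters when the imputed features are later aligned with the target or joined with other columns.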

Finally, let's impute missing values using Feature-engine. First, we need to load and separate the data into train and test sets, just like we did in step 2 and step 3 in this recipe.

  10. Next, let's create a frequent category imputer with FrequentCategoryImputer() from Feature-engine, specifying the categorical variables that should be imputed:
mode_imputer = FrequentCategoryImputer(variables=['A4', 'A5', 'A6', 'A7'])
FrequentCategoryImputer() will select all categorical variables in the train set by default; that is, unless we pass a list of variables to impute.
  11. Let's fit the imputation transformer to the train set so that it learns the most frequent categories:
mode_imputer.fit(X_train)
  12. Let's inspect the learned frequent categories:
mode_imputer.imputer_dict_

We can see the dictionary with the most frequent values in the following output:

{'A4': 'u', 'A5': 'g', 'A6': 'c', 'A7': 'v'}
  13. Finally, let's replace the missing values with frequent categories:
X_train = mode_imputer.transform(X_train)
X_test = mode_imputer.transform(X_test)

FrequentCategoryImputer() returns a pandas dataframe with the imputed values.

Remember that you can check that the categorical variables do not contain missing values by using X_train[['A4', 'A5', 'A6', 'A7']].isnull().mean().

How it works...

In this recipe, we replaced the missing values of the categorical variables in the Credit Approval Data Set with the most frequent categories using pandas, scikit-learn, and Feature-engine. Frequent categories should be learned from the train set, so we divided the dataset into train and test sets using train_test_split() from scikit-learn, as described in the Performing mean or median imputation recipe.

To impute missing data with pandas in multiple categorical variables, in step 4 we created a for loop over the categorical variables A4 to A7, and for each variable, we calculated the most frequent value using the pandas mode() method in the train set. Then, we used this value to replace the missing values with pandas fillna() in the train and test sets. Pandas fillna() returned a pandas Series without missing values, which we reassigned to the original variable in the dataframe.

To replace missing values using scikit-learn, we divided the data into train and test sets but only kept categorical variables. Next, we set up SimpleImputer() and specified most_frequent as the imputation method in the strategy parameter. With the fit() method, the imputer learned and stored the frequent categories in its statistics_ attribute. With the transform() method, the missing values in the train and test sets were replaced with the learned statistics, returning NumPy arrays.
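
Instead of subsetting the dataframe by hand, one way to restrict SimpleImputer() to the categorical columns of a mixed dataset is scikit-learn's ColumnTransformer. The sketch below uses a made-up toy dataframe with one numerical column (named A2 here purely for illustration) and one categorical column; it is an alternative to the recipe's approach, not the method the recipe uses.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy train set (made up for illustration).
train = pd.DataFrame({
    "A2": [30.0, np.nan, 25.0, 41.0],  # numerical, left untouched
    "A4": ["u", "u", np.nan, "y"],     # categorical, imputed with the mode
})

# Apply mode imputation to A4 only; pass every other column through as-is.
ct = ColumnTransformer(
    transformers=[("mode", SimpleImputer(strategy="most_frequent"), ["A4"])],
    remainder="passthrough",
)

# The output is a NumPy array with the transformed columns first
# (A4), followed by the passed-through columns (A2).
result = ct.fit_transform(train)

# The fitted imputer and its learned mode remain accessible by name.
learned_mode = ct.named_transformers_["mode"].statistics_[0]

print(result)
print(learned_mode)
```

Note that the column order of the output follows the order of the transformers, not the order of the input dataframe, and that the missing value in the numerical column is still present after the transformation.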

Finally, to replace the missing values via Feature-engine, we set up FrequentCategoryImputer(), specifying the variables to impute in a list. With fit(), the FrequentCategoryImputer() learned and stored frequent categories in a dictionary in the imputer_dict_ attribute. With the transform() method, missing values in the train and test sets were replaced with stored parameters, which allowed us to obtain pandas dataframes without missing data.

Note that, unlike SimpleImputer() from scikit-learn, FrequentCategoryImputer() only imputes categorical variables and ignores numerical ones.
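
The learn-and-store pattern described above can be sketched with plain pandas. The class below, ModeImputerSketch, is a hypothetical stand-in written for illustration; it is not Feature-engine's implementation and only mimics the behavior described in this recipe: it learns a dictionary of modes in fit(), applies it in transform(), and leaves numerical variables alone.

```python
import numpy as np
import pandas as pd


class ModeImputerSketch:
    """Hypothetical illustration only, not Feature-engine's code."""

    def __init__(self, variables=None):
        self.variables = variables

    def fit(self, X):
        # Without an explicit list, use all non-numerical columns.
        cols = self.variables or list(X.select_dtypes(exclude="number").columns)
        # Learn and store the most frequent category of each variable.
        self.imputer_dict_ = {col: X[col].mode()[0] for col in cols}
        return self

    def transform(self, X):
        # Only columns present in the dictionary are imputed;
        # a new dataframe is returned.
        return X.fillna(self.imputer_dict_)


# Toy data (made up for illustration).
train = pd.DataFrame({"A2": [30.0, np.nan, 25.0],
                      "A4": ["u", np.nan, "u"],
                      "A5": ["g", "g", np.nan]})

sketch = ModeImputerSketch().fit(train)
train_t = sketch.transform(train)

print(sketch.imputer_dict_)
print(train_t)
```

The numerical column keeps its missing value, which mirrors the note above about categorical-only imputation.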
