Python Feature Engineering Cookbook

You're reading from Python Feature Engineering Cookbook: Over 70 recipes for creating, engineering, and transforming features to build machine learning models

Product type: Paperback
Published in: Jan 2020
Publisher: Packt
ISBN-13: 9781789806311
Length: 372 pages
Edition: 1st Edition

Author: Soledad Galli

Table of Contents (13 chapters)

Preface
1. Foreseeing Variable Problems When Building ML Models
2. Imputing Missing Data
3. Encoding Categorical Variables
4. Transforming Numerical Variables
5. Performing Variable Discretization
6. Working with Outliers
7. Deriving Features from Dates and Time Variables
8. Performing Feature Scaling
9. Applying Mathematical Computations to Features
10. Creating Features with Transactional and Time Series Data
11. Extracting Features from Text Variables
12. Other Books You May Enjoy

Replacing missing values with an arbitrary number

Arbitrary number imputation consists of replacing missing values with an arbitrary value. Some commonly used values include 999, 9999, or -1 for positive distributions. This method is suitable for numerical variables. A similar method for categorical variables will be discussed in the Capturing missing values in a bespoke category recipe.

When replacing missing values with an arbitrary number, we need to be careful not to select a value close to the mean or the median, or any other common value of the distribution.

Arbitrary number imputation can be used when data is not missing at random, when we are building non-linear models, and when the percentage of missing data is high. This imputation technique distorts the original variable distribution.
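To see that distortion concretely, here is a minimal, self-contained sketch (using a made-up normally distributed variable, not the recipe's dataset) that compares summary statistics before and after imputing with an arbitrary number:

import numpy as np
import pandas as pd

# Toy variable with roughly 20% missing values (illustrative data only).
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(loc=10, scale=2, size=1000))
s[s.sample(frac=0.2, random_state=0).index] = np.nan

print(s.mean(), s.std())              # statistics of the observed values
imputed = s.fillna(99)                # arbitrary number imputation
print(imputed.mean(), imputed.std())  # the mean and spread shift markedly

Because 99 sits far in the tail, the imputed distribution gains a spike at that value, inflating both the mean and the variance.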

In this recipe, we will impute missing data by arbitrary numbers using pandas, scikit-learn, and Feature-engine.

How to do it...

Let's begin by importing the necessary tools and loading and preparing the data:

  1. Import pandas and the required functions and classes from scikit-learn and Feature-engine:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.missing_data_imputers import ArbitraryNumberImputer
  2. Let's load the dataset:
data = pd.read_csv('creditApprovalUCI.csv')
  3. Let's separate the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3,
    random_state=0)
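
Before imputing, it is useful to know how much data is actually missing. A quick check (the exact fractions depend on your copy of the dataset):

# Fraction of missing values per variable in the train set.
X_train[['A2', 'A3', 'A8', 'A11']].isnull().mean()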

Normally, we select arbitrary values that are bigger than the maximum value of the distribution.

  4. Let's find the maximum value of four numerical variables:
X_train[['A2','A3', 'A8', 'A11']].max()

The following is the output of the preceding code block:

A2     76.750
A3     26.335
A8     20.000
A11    67.000
dtype: float64
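
If you prefer to derive a suitable value programmatically rather than eyeballing the printout, a minimal sketch (an illustrative choice, not part of the recipe) is:

# Any value above the largest maximum will do; the recipe simply
# picks the round number 99.
arbitrary = X_train[['A2', 'A3', 'A8', 'A11']].max().max() + 1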
  5. Let's replace the missing values with 99 in the numerical variables that we specified in step 4:
for var in ['A2', 'A3', 'A8', 'A11']:
    X_train[var].fillna(99, inplace=True)
    X_test[var].fillna(99, inplace=True)
We chose 99 as the arbitrary value because it is bigger than the maximum value of these variables.

We can check the percentage of missing values using X_train[['A2','A3', 'A8', 'A11']].isnull().mean(), which should be 0 after step 5.

Now, we'll impute missing values with an arbitrary number using scikit-learn instead.

  6. First, let's separate the data into train and test sets while keeping only the numerical variables:
X_train, X_test, y_train, y_test = train_test_split(
    data[['A2', 'A3', 'A8', 'A11']], data['A16'], test_size=0.3,
    random_state=0)
  7. Let's set up SimpleImputer() so that it replaces any missing values with 99:
imputer = SimpleImputer(strategy='constant', fill_value=99)
If your dataset contains categorical variables, SimpleImputer() will add 99 to those variables as well if any values are missing.
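If that is a concern, one option, sketched here under the assumption that you want to impute numerical columns only, is to subset the dataframe with pandas before fitting the imputer:

# Keep only the numeric columns so that the constant fill value of 99
# is never written into categorical variables.
X_train_num = X_train.select_dtypes(include='number')
X_test_num = X_test.select_dtypes(include='number')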
  8. Let's fit the imputer to the train set:
imputer.fit(X_train)
  9. Let's replace the missing values with 99:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
Note that SimpleImputer() will return a NumPy array. Be mindful of the order of the variables if you're transforming the array back into a dataframe.
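To keep working with dataframes, we can rebuild one from the array; a minimal sketch, assuming the column order used in step 6:

# Rebuild dataframes from the NumPy arrays returned by SimpleImputer(),
# reusing the column order from step 6.
X_train = pd.DataFrame(X_train, columns=['A2', 'A3', 'A8', 'A11'])
X_test = pd.DataFrame(X_test, columns=['A2', 'A3', 'A8', 'A11'])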

To finish, let's impute missing values using Feature-engine. First, we need to load the data and separate it into train and test sets, just like we did in step 2 and step 3.

  10. Next, let's create an imputation transformer with Feature-engine's ArbitraryNumberImputer() in order to replace any missing values with 99 and specify the variables from which missing data should be imputed:
imputer = ArbitraryNumberImputer(arbitrary_number=99,
    variables=['A2', 'A3', 'A8', 'A11'])
ArbitraryNumberImputer() will automatically select all numerical variables in the train set unless we pass a list of the variables to impute, as we did here.
  11. Let's fit the arbitrary number imputer to the train set:
imputer.fit(X_train)
  12. Finally, let's replace the missing values with 99:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

The variables specified in step 10 should now have missing data replaced with the number 99.
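
As a final sanity check, we can confirm that no missing values remain in the imputed variables:

# Should print 0.0 for every imputed variable.
X_train[['A2', 'A3', 'A8', 'A11']].isnull().mean()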

How it works...

In this recipe, we replaced missing values in numerical variables in the Credit Approval Data Set with an arbitrary number, 99, using pandas, scikit-learn, and Feature-engine. We loaded the data and divided it into train and test sets using train_test_split() from scikit-learn, as described in the Performing mean or median imputation recipe.

To determine which arbitrary value to use, we inspected the maximum values of four numerical variables using the pandas max() method. Next, we chose a value, 99, that was bigger than the maximum values of the selected variables. In step 5, we looped over the numerical variables and replaced any missing data with the pandas fillna() method, passing 99 as the replacement value and setting the inplace argument to True so that the values were replaced in the original dataframe.

To replace missing values using scikit-learn, we called SimpleImputer(), set the strategy to 'constant', and specified 99 as the arbitrary value in the fill_value argument. Next, we fitted the imputer to the train set with the fit() method and replaced the missing values in the train and test sets using the transform() method. SimpleImputer() returned a NumPy array with the missing data replaced by 99.

Finally, we replaced missing values with ArbitraryNumberImputer() from Feature-engine, specifying the value 99 in the arbitrary_number argument. We also passed the variables to impute as a list to the variables argument. Next, we applied the fit() method, during which ArbitraryNumberImputer() checked that the selected variables were numerical. With the transform() method, the missing values in the train and test sets were replaced with 99, thus returning dataframes without missing values in the selected variables.

There's more...

Scikit-learn released the ColumnTransformer() object, which allows us to apply a certain imputation method to a specific subset of variables. To learn how to use ColumnTransformer(), check out the Assembling an imputation pipeline with scikit-learn recipe.
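
As a taste of what that looks like, here is a minimal sketch (not from the recipe itself) that routes a constant-fill imputer to the four variables used above, while remainder='passthrough' leaves any other columns untouched:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Apply arbitrary number imputation only to the listed numerical variables;
# every remaining column passes through unchanged.
ct = ColumnTransformer(
    transformers=[
        ('imputer',
         SimpleImputer(strategy='constant', fill_value=99),
         ['A2', 'A3', 'A8', 'A11']),
    ],
    remainder='passthrough',
)

X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)

Note that ColumnTransformer() returns an array in which the transformed columns come first, so keep track of the column order if you convert back to a dataframe.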

See also
