Python Feature Engineering Cookbook
Over 70 recipes for creating, engineering, and transforming features to build machine learning models
By Soledad Galli. Paperback, 1st Edition, 372 pages. Published by Packt, Jan 2020. ISBN-13: 9781789806311.

Implementing random sample imputation

Random sample imputation consists of extracting random observations from the pool of available values in the variable and using them to replace the missing data. Unlike the other imputation techniques we've discussed in this chapter, random sample imputation preserves the variable's original distribution, and it is suitable for both numerical and categorical variables. In this recipe, we will implement random sample imputation with pandas and Feature-engine.
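To see the idea in isolation, here is a minimal, self-contained sketch on a made-up pandas Series (the values are illustrative only and are not part of the dataset used in this recipe):

import pandas as pd
import numpy as np

# a made-up numerical variable with missing values
s = pd.Series([2.1, np.nan, 3.5, 4.0, np.nan, 2.9, 3.1, np.nan])

# draw as many random donors from the observed values as there are NAs
donors = s.dropna().sample(s.isnull().sum(), random_state=0)

# align the donors' index with the missing entries before assignment
donors.index = s[s.isnull()].index
s[s.isnull()] = donors
print(s)

Because every imputed value comes from the observed pool, the variable's distribution is preserved.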

How to do it...

Let's begin by importing the required libraries and tools and preparing the dataset:

  1. Let's import pandas, the train_test_split function from scikit-learn, and RandomSampleImputer from Feature-engine:
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.missing_data_imputers import RandomSampleImputer
  2. Let's load the dataset:
data = pd.read_csv('creditApprovalUCI.csv')
  3. The random values that will be used to replace missing data should be extracted from the train set, so let's separate the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3,
    random_state=0)
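Before imputing, we can optionally inspect how many values are missing in each variable of the train set (a quick check, not part of the original recipe):

X_train.isnull().sum()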

First, we will run the commands line by line to understand their output. Then, we will execute them in a loop to impute several variables. In random sample imputation, we extract as many random values as there are missing values in the variable.

  4. Let's calculate the number of missing values in the A2 variable:
number_na = X_train['A2'].isnull().sum()
  5. If you print the number_na variable, you will obtain 11 as output, which is the number of missing values in A2. Thus, let's extract 11 values at random from A2 for the imputation:
random_sample_train = X_train['A2'].dropna().sample(
    number_na, random_state=0)
  6. We can only use one pandas Series to replace values in another pandas Series if their indexes are identical, so let's re-index the extracted random values so that they match the index of the missing values in the original dataframe:
random_sample_train.index = X_train[X_train['A2'].isnull()].index
  7. Now, let's replace the missing values in the original dataset with the randomly extracted values:
X_train.loc[X_train['A2'].isnull(), 'A2'] = random_sample_train
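If we now repeat the count from step 4, it should return 0, confirming that A2 no longer contains missing values (a quick sanity check, not part of the original recipe):

X_train['A2'].isnull().sum()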
  8. Now, let's combine step 4 to step 7 in a loop to replace missing data in several variables in both the train and test sets:
for var in ['A1', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8']:

    # extract a random sample
    random_sample_train = X_train[var].dropna().sample(
        X_train[var].isnull().sum(), random_state=0)

    random_sample_test = X_train[var].dropna().sample(
        X_test[var].isnull().sum(), random_state=0)

    # re-index the randomly extracted sample
    random_sample_train.index = X_train[X_train[var].isnull()].index
    random_sample_test.index = X_test[X_test[var].isnull()].index

    # replace the NA
    X_train.loc[X_train[var].isnull(), var] = random_sample_train
    X_test.loc[X_test[var].isnull(), var] = random_sample_test
Note how we always extract values from the train set, but we calculate the number of missing values and the index using the train or the test set, respectively.
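As another quick sanity check (not part of the original recipe), we can confirm that the imputed variables no longer contain missing values in either dataset:

X_train[['A1', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8']].isnull().sum()
X_test[['A1', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8']].isnull().sum()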

To finish, let's impute the missing values using Feature-engine. First, we need to separate the data into train and test sets, just like we did in step 3 of this recipe.

  9. Next, let's set up RandomSampleImputer() and fit it to the train set:
imputer = RandomSampleImputer()
imputer.fit(X_train)
RandomSampleImputer() will replace the missing values in all variables in the dataset by default.

We can restrict the imputation to a subset of variables by passing their names in a list to the imputer, using imputer = RandomSampleImputer(variables=['A2', 'A3']).
  10. Finally, let's replace the missing values:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
To obtain reproducibility between code runs, we can set random_state to a number when we initialize RandomSampleImputer(). The imputer will use this random_state at each run of the transform() method.
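For example, a reproducible set-up could look as follows; the seed value of 0 is arbitrary:

imputer = RandomSampleImputer(random_state=0)
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)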

How it works...

In this recipe, we replaced missing values in the numerical and categorical variables of the Credit Approval Data Set with values extracted at random from the same variables using pandas and Feature-engine. First, we loaded the data and divided it into train and test sets using train_test_split(), as described in the Performing mean or median imputation recipe.

To perform random sample imputation using pandas, we calculated the number of missing values in the variable using pandas' isnull() followed by sum(). Next, we used pandas' dropna() to drop the missing information from the original variable in the train set, so that we extracted values only from observations with data using pandas' sample(). We extracted as many observations as there were missing values in the variable to impute. Next, we re-indexed the pandas Series with the randomly extracted values so that we could assign them to the missing observations in the original dataframe. Finally, we replaced the missing values with the values extracted at random using pandas' loc, which takes as arguments the location of the rows with missing data and the name of the column to which the new values are to be assigned.

We also carried out random sample imputation with RandomSampleImputer() from Feature-engine. With the fit() method, the RandomSampleImputer() stores a copy of the train set. With transform(), the imputer extracts values at random from the stored dataset and replaces the missing information with them, thereby returning complete pandas dataframes.
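Conceptually, the imputer's transform() applies the same pandas logic we used earlier in this recipe. The following is a simplified sketch of that idea, not Feature-engine's actual source code:

def random_sample_impute(X, train, variables, random_state=0):
    # X: dataframe to transform; train: the stored copy of the train set
    X = X.copy()
    for var in variables:
        n_missing = X[var].isnull().sum()
        if n_missing == 0:
            continue
        # draw donors from the observed values of the train set
        donors = train[var].dropna().sample(n_missing, random_state=random_state)
        # align the donors' index with the rows to impute
        donors.index = X[X[var].isnull()].index
        X.loc[X[var].isnull(), var] = donors
    return X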

See also
