Python Feature Engineering Cookbook

Over 70 recipes for creating, engineering, and transforming features to build machine learning models

Product type: Paperback
Published in: Jan 2020
Publisher: Packt
ISBN-13: 9781789806311
Length: 372 pages
Edition: 1st Edition

Author: Soledad Galli
Table of Contents (13)

Preface
1. Foreseeing Variable Problems When Building ML Models
2. Imputing Missing Data
3. Encoding Categorical Variables
4. Transforming Numerical Variables
5. Performing Variable Discretization
6. Working with Outliers
7. Deriving Features from Dates and Time Variables
8. Performing Feature Scaling
9. Applying Mathematical Computations to Features
10. Creating Features with Transactional and Time Series Data
11. Extracting Features from Text Variables
12. Other Books You May Enjoy

Assembling an imputation pipeline with scikit-learn

Datasets often contain a mix of numerical and categorical variables. In addition, some variables may contain only a few missing data points, while others may contain a large proportion. The mechanisms by which data is missing may also vary from variable to variable. Thus, we may wish to apply different imputation procedures to different variables. In this recipe, we will learn how to apply a different imputation procedure to each feature subset using scikit-learn.
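
Before committing to an imputation plan, it helps to quantify how much data each variable is actually missing. The following is a minimal sketch, assuming the creditApprovalUCI.csv file that we load in this recipe; the exact column names will depend on your dataset:

import pandas as pd

# Load the Credit Approval dataset used in this recipe
data = pd.read_csv('creditApprovalUCI.csv')

# Fraction of missing values per variable, from most to least affected
print(data.isnull().mean().sort_values(ascending=False))

Variables with only a few missing values often tolerate the median or the most frequent category, while heavily affected variables may warrant an arbitrary value or a dedicated 'Missing' category.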

How to do it...

To proceed with the recipe, let's import the required libraries and classes and prepare the dataset:

  1. Let's import pandas and the required classes from scikit-learn:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
  2. Let's load the dataset:
data = pd.read_csv('creditApprovalUCI.csv')
  3. Let's divide the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3,
    random_state=0)
  4. Let's group the columns to which we want to apply different imputation techniques into lists:
features_num_arbitrary = ['A3', 'A8']
features_num_median = ['A2', 'A14']
features_cat_frequent = ['A4', 'A5', 'A6', 'A7']
features_cat_missing = ['A1', 'A9', 'A10']
  5. Let's create different imputation transformers using SimpleImputer(), each within a scikit-learn pipeline:
imputer_num_arbitrary = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=99)),
])
imputer_num_median = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
])
imputer_cat_frequent = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
])
imputer_cat_missing = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
])
We have covered all these imputation strategies in dedicated recipes throughout this chapter.
  6. Now, let's assemble the pipelines with the imputers within ColumnTransformer() and assign them to the different feature subsets we created in step 4:
preprocessor = ColumnTransformer(transformers=[
    ('imp_num_arbitrary', imputer_num_arbitrary, features_num_arbitrary),
    ('imp_num_median', imputer_num_median, features_num_median),
    ('imp_cat_frequent', imputer_cat_frequent, features_cat_frequent),
    ('imp_cat_missing', imputer_cat_missing, features_cat_missing),
], remainder='passthrough')
  7. Next, we need to fit the preprocessor to the train set so that the imputation parameters are learned:
preprocessor.fit(X_train)
  8. Finally, let's replace the missing values in the train and test sets:
X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)

Remember that scikit-learn transformers return NumPy arrays. The beauty of this procedure is that we can save the preprocessor as a single object that preserves all the parameters learned by the different transformers.
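
Because the fitted preprocessor holds every learned parameter, we can persist it to disk and reuse it later to impute new data. Here is a minimal sketch, assuming joblib is available (it is installed together with scikit-learn); the filename is arbitrary:

import joblib

# Save the fitted ColumnTransformer, including all learned imputation values
joblib.dump(preprocessor, 'preprocessor.joblib')

# Later, reload it and impute unseen data with the same learned values
preprocessor = joblib.load('preprocessor.joblib')
X_new = preprocessor.transform(X_test)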

How it works...

In this recipe, we applied different imputation techniques to different variable groups using scikit-learn's SimpleImputer() and ColumnTransformer().

After loading and dividing the dataset, we created four lists of features. The first list contained numerical variables to impute with an arbitrary value. The second list contained numerical variables to impute by the median. The third list contained categorical variables to impute by a frequent category. Finally, the fourth list contained categorical variables to impute with the Missing string.

Next, we created multiple imputation objects using SimpleImputer(), each within a scikit-learn pipeline. To assemble each Pipeline(), we gave the step a name with a string; in our example, we used imputer. Alongside the name, we created the imputation object with SimpleImputer(), varying the strategy across the different imputation techniques.

Next, we arranged the pipelines with the different imputation strategies within ColumnTransformer(). To set up ColumnTransformer(), we gave each transformer a name with a string. Then, we added one of the pipelines we created, together with the list of features that should be imputed with it.
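
Once the preprocessor has been fitted (step 7), we can inspect what each imputer learned: ColumnTransformer() exposes the fitted pipelines through its named_transformers_ attribute, and each Pipeline() exposes its steps through named_steps. A small sketch, using the names we assigned in steps 5 and 6:

# Retrieve the fitted median imputer from the preprocessor
median_imputer = preprocessor.named_transformers_[
    'imp_num_median'].named_steps['imputer']

# One learned median per variable in features_num_median
print(median_imputer.statistics_)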

Next, we fitted ColumnTransformer() to the train set, where the imputers learned the values that will replace the missing data. Finally, we imputed the missing values in the train and test sets using the transform() method of ColumnTransformer(), obtaining complete NumPy arrays.
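
To verify the result, we can wrap the returned NumPy arrays back into a DataFrame and confirm that the imputed columns contain no missing values. A minimal sketch, reusing the feature lists from step 4 (in the output, the imputed columns come first, in the order of the transformers in step 6, with the remainder columns appended at the end):

import pandas as pd

# The imputed columns come first, in the order of the transformers in step 6
imputed_columns = (features_num_arbitrary + features_num_median +
    features_cat_frequent + features_cat_missing)

# Wrap the imputed columns in a DataFrame and count the remaining NaNs
train_df = pd.DataFrame(
    X_train[:, :len(imputed_columns)], columns=imputed_columns)
print(train_df.isnull().sum())  # expect 0 for every imputed column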

See also
