Python Feature Engineering Cookbook

You're reading from Python Feature Engineering Cookbook: Over 70 recipes for creating, engineering, and transforming features to build machine learning models

Product type Paperback
Published in Jan 2020
Publisher Packt
ISBN-13 9781789806311
Length 372 pages
Edition 1st Edition
Author: Soledad Galli

Assembling an imputation pipeline with Feature-engine

Feature-engine is an open source Python library that allows us to easily apply different imputation techniques to different feature subsets. Datasets often contain a mix of numerical and categorical variables, each with few or many missing values. We therefore normally apply different imputation techniques to different variables, depending on the nature of the variable and the machine learning algorithm we want to build. With Feature-engine, we can assemble multiple imputation techniques in a single step. In this recipe, we will learn how to do this.

How to do it...

Let's begin by importing the necessary Python libraries and preparing the data:

  1. Let's import pandas and the required function and class from scikit-learn, and the missing data imputation module from Feature-engine:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
# Module name in the Feature-engine release this recipe was written for;
# later releases expose the imputers under feature_engine.imputation.
import feature_engine.missing_data_imputers as mdi
  2. Let's load the dataset:
data = pd.read_csv('creditApprovalUCI.csv')
  3. Let's divide the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3,
    random_state=0)
  4. Let's create lists with the names of the variables that we want to apply specific imputation techniques to:
features_num_arbitrary = ['A3', 'A8']
features_num_median = ['A2', 'A14']
features_cat_frequent = ['A4', 'A5', 'A6', 'A7']
features_cat_missing = ['A1', 'A9', 'A10']
  5. Let's assemble an arbitrary value imputer, a median imputer, a frequent category imputer, and an imputer to replace any missing values with the Missing string within a scikit-learn pipeline:
pipe = Pipeline(steps=[
    # Replace missing values with the arbitrary number 999
    ('imp_num_arbitrary', mdi.ArbitraryNumberImputer(
        arbitrary_number=999, variables=features_num_arbitrary)),
    # Replace missing values with the median learned from the train set
    ('imp_num_median', mdi.MeanMedianImputer(
        imputation_method='median', variables=features_num_median)),
    # Replace missing values with the most frequent category
    ('imp_cat_frequent', mdi.FrequentCategoryImputer(
        variables=features_cat_frequent)),
    # Replace missing values with the Missing string
    ('imp_cat_missing', mdi.CategoricalVariableImputer(
        variables=features_cat_missing))
])
Note how we pass the feature lists we created in step 4 to the imputers.
  6. Let's fit the pipeline to the train set so that each imputer learns and stores the imputation parameters:
pipe.fit(X_train)
  7. Finally, let's replace missing values in the train and test sets:
X_train = pipe.transform(X_train)
X_test = pipe.transform(X_test)

After fitting, we can save the pipeline as a single object so that the learned imputation parameters can be reused later on new data.
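One way to store a fitted pipeline is Python's built-in pickle module. The following is a minimal sketch, not the recipe's own code: it uses toy data with made-up values and scikit-learn's SimpleImputer as a stand-in step, on the assumption that a pipeline of Feature-engine imputers can be serialized in the same way.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Toy numeric data standing in for the train and test sets (hypothetical values).
train = pd.DataFrame({"A2": [20.0, np.nan, 40.0], "A14": [100.0, 200.0, np.nan]})
test = pd.DataFrame({"A2": [np.nan], "A14": [np.nan]})

# Stand-in pipeline with a single median imputation step.
pipe = Pipeline(steps=[("imp_num_median", SimpleImputer(strategy="median"))])
pipe.fit(train)

# Serialize the fitted pipeline to bytes and restore it.
# pickle.dump(pipe, file_handle) would write it to disk instead.
restored = pickle.loads(pickle.dumps(pipe))

# The restored pipeline reuses the medians learned from the train set.
imputed = restored.transform(test)
print(imputed)
```

Because the medians are stored inside the fitted object, the restored pipeline can impute new data without access to the original train set.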

How it works...

In this recipe, we performed different imputation techniques on different variable groups from the Credit Approval Data Set by utilizing Feature-engine within a single scikit-learn pipeline.

After loading and dividing the dataset, we created four lists of features. The first list contained numerical variables to impute with an arbitrary value. The second list contained numerical variables to impute by the median. The third list contained categorical variables to impute with a frequent category. Finally, the fourth list contained categorical variables to impute with the Missing string.

Next, we assembled the Feature-engine imputers within a single scikit-learn pipeline. ArbitraryNumberImputer() replaced missing values with the number 999; MeanMedianImputer() performed median imputation; FrequentCategoryImputer() replaced missing values with the mode; and CategoricalVariableImputer() replaced missing values with the Missing string. Each imputer received the list of features it should impute.
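To make the four techniques concrete, here is a rough pandas-only sketch of what each imputation does to a column. The data frame and its values are invented for illustration, and the code mirrors the idea of each imputer rather than Feature-engine's actual implementation.

```python
import numpy as np
import pandas as pd

# Toy train set with one column per imputation technique (hypothetical values).
train = pd.DataFrame({
    "A3": [1.5, np.nan, 3.0],   # numerical, arbitrary value imputation
    "A2": [20.0, np.nan, 40.0], # numerical, median imputation
    "A4": ["u", "u", np.nan],   # categorical, frequent category imputation
    "A1": ["a", np.nan, "b"],   # categorical, Missing string imputation
})

imputed = train.copy()
# Arbitrary number: replace missing values with 999.
imputed["A3"] = imputed["A3"].fillna(999)
# Median: replace missing values with the median of the observed values.
imputed["A2"] = imputed["A2"].fillna(train["A2"].median())
# Frequent category: replace missing values with the mode.
imputed["A4"] = imputed["A4"].fillna(train["A4"].mode()[0])
# Missing string: replace missing values with the label "Missing".
imputed["A1"] = imputed["A1"].fillna("Missing")

print(imputed)
```

In the recipe, the median and the mode are learned from the train set during fit() and then applied unchanged to the test set, which this single-frame sketch does not show.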

When assembling a scikit-learn pipeline, we gave each step a name using a string, and next to it we set up each of the Feature-engine imputers, specifying the feature subset within each imputer.

With the fit() method, each imputer learned and stored its imputation parameters from the train set. With transform(), the missing values were replaced using those stored parameters, returning complete pandas dataframes.
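The fit and transform behavior described above can be sketched with a small custom transformer. MedianImputerSketch and its medians_ attribute are hypothetical names invented for this illustration; the class is not part of Feature-engine and only imitates the idea of an imputer that is restricted to a list of variables.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class MedianImputerSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a median imputer limited to a variable list."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # Learn one median per selected variable from the data passed to fit().
        self.medians_ = X[self.variables].median().to_dict()
        return self

    def transform(self, X):
        # Fill missing values in the selected variables only; return a DataFrame.
        return X.fillna(value=self.medians_)


# Toy data (hypothetical values); A3 is deliberately left out of the imputer.
X_train = pd.DataFrame({"A2": [20.0, np.nan, 40.0], "A3": [1.0, np.nan, 3.0]})
X_test = pd.DataFrame({"A2": [np.nan, 25.0], "A3": [np.nan, 2.0]})

# Each step is a (name, transformer) pair, as in the recipe's pipeline.
pipe = Pipeline(steps=[("imp_num_median", MedianImputerSketch(variables=["A2"]))])
pipe.fit(X_train)

# The fitted step can be retrieved by the name it was given in the pipeline.
learned = pipe.named_steps["imp_num_median"].medians_

# The test set is imputed with the median learned from the train set.
X_test_t = pipe.transform(X_test)
print(learned)
print(X_test_t)
```

Column A3 keeps its missing value because it was not in the variable list, which is the same mechanism that lets the recipe route different variables to different imputers.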

We can store the fitted scikit-learn pipeline, together with its Feature-engine transformers, as a single object so that the learned parameters can be reused later.
