Python Feature Engineering Cookbook - Second Edition

By Soledad Galli
About this book

Feature engineering, the process of transforming variables and creating features, albeit time-consuming, ensures that your machine learning models perform seamlessly. This second edition of Python Feature Engineering Cookbook will take the struggle out of feature engineering by showing you how to use open source Python libraries to accelerate the process via a plethora of practical, hands-on recipes.

This updated edition begins by addressing fundamental data challenges such as missing data and categorical values, before moving on to strategies for dealing with skewed distributions and outliers. The concluding chapters show you how to develop new features from various types of data, including text, time series, and relational databases. With the help of numerous open source Python libraries, you'll learn how to implement each feature engineering method in a performant, reproducible, and elegant manner.

By the end of this Python book, you will have the tools and expertise needed to confidently build end-to-end and reproducible feature engineering pipelines that can be deployed into production.

Publication date: October 2022
Publisher: Packt
Pages: 386
ISBN: 9781804611302

 

Encoding Categorical Variables

Categorical variables are those whose values are selected from a group of categories or labels. For example, the Gender variable with the values of Male and Female is categorical, and so is the marital status variable with the values of never married, married, divorced, and widowed. In some categorical variables, the labels have an intrinsic order; for example, in the Student’s grade variable, the values of A, B, C, and Fail are ordered, with A being the highest grade and Fail being the lowest. These are called ordinal categorical variables. Variables in which the categories do not have an intrinsic order are called nominal categorical variables, such as the City variable, with the values of London, Manchester, Bristol, and so on.

The values of categorical variables are often encoded as strings. To train mathematical or machine learning models, we need to transform those strings into numbers. The act of replacing strings with numbers is called categorical encoding. In this chapter, we will discuss multiple categorical encoding methods.

This chapter will cover the following recipes:

  • Creating binary variables through one-hot encoding
  • Performing one-hot encoding of frequent categories
  • Replacing categories with counts or the frequency of observations
  • Replacing categories with ordinal numbers
  • Performing ordinal encoding based on the target value
  • Implementing target mean encoding
  • Encoding with the Weight of Evidence
  • Grouping rare or infrequent categories
  • Performing binary encoding
 

Technical requirements

In this chapter, we will use the pandas, NumPy, and Matplotlib Python libraries, as well as scikit-learn and Feature-engine. For guidelines on how to obtain these libraries, visit the Technical requirements section of Chapter 1, Imputing Missing Data.

We will also use the open-source Category Encoders Python library, which can be installed using pip:

pip install category_encoders

To learn more about Category Encoders, visit the following link: https://contrib.scikit-learn.org/category_encoders/.

We will also use the Credit Approval dataset, which is available in the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/credit+approval.

To prepare the dataset, follow these steps:

  1. Visit http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/ and click on crx.data to download the data:
Figure 2.1 – The index directory for the Credit Approval dataset

  2. Save crx.data to the folder where you will run the following commands.

After downloading the data, open up a Jupyter Notebook and run the following commands.

  1. Import the required libraries:
    import random
    import numpy as np
    import pandas as pd
  2. Load the data:
    data = pd.read_csv("crx.data", header=None)
  3. Create a list containing the variable names:
    varnames = [f"A{s}" for s in range(1, 17)]
  4. Add the variable names to the DataFrame:
    data.columns = varnames
  5. Replace the question marks in the dataset with NumPy NaN values:
    data = data.replace("?", np.nan)
  6. Cast some numerical variables as float data types:
    data["A2"] = data["A2"].astype("float")
    data["A14"] = data["A14"].astype("float")
  7. Encode the target variable as binary:
    data["A16"] = data["A16"].map({"+": 1, "-": 0})
  8. Rename the target variable:
    data.rename(columns={"A16": "target"}, inplace=True)
  9. Make lists that contain categorical and numerical variables:
    cat_cols = [
        c for c in data.columns if data[c].dtypes=="O"] 
    num_cols = [
        c for c in data.columns if data[c].dtypes!= "O"]
  10. Fill in the missing data:
    data[num_cols] = data[num_cols].fillna(0)
    data[cat_cols] = data[cat_cols].fillna("Missing")
  11. Save the prepared data:
    data.to_csv("credit_approval_uci.csv", index=False)

You can find a Jupyter Notebook that contains these commands in this book’s GitHub repository at https://github.com/PacktPublishing/Python-Feature-Engineering-Cookbook-Second-Edition/blob/main/ch02-categorical-encoding/donwload-prepare-store-credit-approval-dataset.ipynb.

Note

Some libraries require that you have already imputed missing data, for which you can use any of the recipes from Chapter 1, Imputing Missing Data.

 

Creating binary variables through one-hot encoding

In one-hot encoding, we represent a categorical variable as a group of binary variables, where each binary variable represents one category. The binary variable takes a value of 1 if the category is present in an observation, or 0 otherwise.

The following table shows the one-hot encoded representation of the Gender variable with the categories of Male and Female:

Figure 2.2 – One-hot encoded representation of the Gender variable

As shown in Figure 2.2, from the Gender variable, we can derive the binary variable of Female, which shows the value of 1 for females, or the binary variable of Male, which takes the value of 1 for the males in the dataset.

For the categorical variable of Color with the values of red, blue, and green, we can create three variables called red, blue, and green. These variables will take the value of 1 if the observation is red, blue, or green, respectively, or 0 otherwise.

A categorical variable with k unique categories can be encoded using k-1 binary variables. For Gender, k is 2 as it contains two labels (male and female), so we only need to create one binary variable (k - 1 = 1) to capture all of the information. For the Color variable, which has three categories (k=3; red, blue, and green), we need to create two (k - 1 = 2) binary variables to capture all the information so that the following occurs:

  • If the observation is red, it will be captured by the red variable (red = 1, blue = 0).
  • If the observation is blue, it will be captured by the blue variable (red = 0, blue = 1).
  • If the observation is green, it will be captured by the combination of red and blue (red = 0, blue = 0).
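
The k-1 scheme for Color described in the preceding list can be checked on a small, made-up series. The following is a minimal sketch; the data and the variable names are illustrative assumptions, not part of the Credit Approval dataset:

```python
import pandas as pd

# Toy data (hypothetical): one observation per color.
color = pd.Series(["red", "blue", "green"], name="Color")

# Build k - 1 = 2 binary variables by hand; green is the omitted category.
dummies = pd.DataFrame({
    "red": (color == "red").astype(int),
    "blue": (color == "blue").astype(int),
})

# Green shows up as the row where both binary variables are 0.
print(pd.concat([color, dummies], axis=1))
```

No information is lost by omitting green, because it is the only category for which both binary variables equal 0.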

Encoding into k-1 binary variables is well-suited for linear models. There are a few occasions in which we may prefer to encode the categorical variables with k binary variables:

  • When training decision trees since they do not evaluate the entire feature space at the same time
  • When selecting features recursively
  • When determining the importance of each category within a variable

In this recipe, we will compare the one-hot encoding implementations of pandas, scikit-learn, Feature-engine, and Category Encoders.

How to do it...

First, let’s make a few imports and get the data ready:

  1. Import pandas and the train_test_split function from scikit-learn:
    import pandas as pd
    from sklearn.model_selection import train_test_split
  2. Let’s load the Credit Approval dataset:
    data = pd.read_csv("credit_approval_uci.csv")
  3. Let’s separate the data into train and test sets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  4. Let’s inspect the unique categories of the A4 variable:
    X_train["A4"].unique()

We can see the unique values of A4 in the following output:

array(['u', 'y', 'Missing', 'l'], dtype=object)
  5. Let’s encode A4 into k-1 binary variables using pandas and then inspect the first five rows of the resulting DataFrame:
    dummies = pd.get_dummies(X_train["A4"], drop_first=True)
    dummies.head()

Note

With pandas get_dummies(), we can either ignore or encode missing data through the dummy_na parameter. By setting dummy_na=True, missing data will be encoded in a new binary variable. To encode the variable into k dummies, use drop_first=False instead.
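
To see what dummy_na changes, here is a small sketch on a made-up series with one missing value (the data is an assumption for illustration; in the prepared Credit Approval data, missing values were already replaced with the Missing string). Passing dtype=int is an optional choice that keeps the output as 0/1 integers across pandas versions:

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical) with one missing value.
s = pd.Series(["u", "y", np.nan, "u"])

# dummy_na=True adds an extra indicator column for missing values;
# dummy_na=False leaves the missing row as all zeros.
with_na = pd.get_dummies(s, dummy_na=True, dtype=int)
without_na = pd.get_dummies(s, dummy_na=False, dtype=int)

print(with_na)
print(without_na)
```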

Here, we can see the output of step 5, where each label is now a binary variable:

	l	u	y
596	0	1	0
303	0	1	0
204	0	0	1
351	0	0	1
118	0	1	0
  6. Now, let’s encode all of the categorical variables into k-1 binaries, capturing the result in a new DataFrame:
    X_train_enc = pd.get_dummies(X_train, drop_first=True)
    X_test_enc = pd.get_dummies(X_test, drop_first=True)

Note

The get_dummies method from pandas will automatically encode all variables of the object or categorical type. We can encode a subset of the variables by passing the variable names in a list to the columns parameter.

  7. Let’s inspect the first five rows of the binary variables created in step 6:
    X_train_enc.head()

Note

When encoding more than one variable, get_dummies() captures the variable name – say, A1 – and places an underscore followed by the category name to identify the resulting binary variables.

We can see the binary variables in the following output:

Figure 2.3 – Transformed DataFrame showing the dummy variables on the right

Note

The get_dummies() method will create one binary variable per seen category. Hence, if there are more categories in the train set than in the test set, get_dummies() will return more columns in the transformed train set than in the transformed test set, and vice versa. To avoid this, it is better to carry out one-hot encoding with scikit-learn or Feature-engine, as we will discuss later in this recipe.
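
If you do stay with pandas, one possible workaround for the column mismatch is to reindex the encoded test set to the columns seen in the train set. This is a sketch on made-up data, not a step of the recipe; the frames and values below are assumptions:

```python
import pandas as pd

# Toy train and test sets (hypothetical): "l" appears only in train,
# "t" appears only in test.
train = pd.DataFrame({"A4": ["u", "y", "l", "u"]})
test = pd.DataFrame({"A4": ["u", "t"]})

train_enc = pd.get_dummies(train, columns=["A4"], dtype=int)
test_enc = pd.get_dummies(test, columns=["A4"], dtype=int)

# Reindex the test columns to those seen in train: dummies missing from
# test are added and filled with 0, and dummies for categories that were
# not seen in train are dropped.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc)
```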

  8. Let’s concatenate the binary variables to the original dataset:
    X_test_enc = pd.concat([X_test, X_test_enc], axis=1)
  9. Now, let’s drop the categorical variables from the data:
    X_test_enc.drop(
        labels=X_test_enc.select_dtypes(
            include="O").columns,
        axis=1,
        inplace=True,
    )

And that’s it! Now, we can use our categorical variables to train mathematical models. To inspect the result, use X_test_enc.head().

Now, let’s do one-hot encoding using scikit-learn.

  10. Import the encoder from scikit-learn:
    from sklearn.preprocessing import OneHotEncoder
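  11. Let’s set up the transformer. By setting drop to "first", we encode into k-1 binary variables, and by setting sparse to False, the transformer will return a NumPy array instead of a sparse matrix (from scikit-learn 1.2 onward, this parameter is named sparse_output, and sparse was removed in version 1.4):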
    encoder = OneHotEncoder(drop="first", sparse=False)

Tip

We can encode variables into k dummies by setting the drop parameter to None. We can also encode into k-1 if variables contain two categories and into k if variables contain more than two categories by setting the drop parameter to "if_binary". The latter is useful because encoding binary variables into k dummies is redundant.

  12. First, let’s create a list containing the variable names:
    vars_categorical = X_train.select_dtypes(
        include="O").columns.to_list()
  13. Let’s fit the encoder to a slice of the train set with the categorical variables:
    encoder.fit(X_train[vars_categorical])
  14. Let’s inspect the categories for which dummy variables will be created:
    encoder.categories_

We can see the result of the preceding command here:

Figure 2.4 – Arrays with the categories that will be encoded into binary variables, one array per variable

Note

Scikit-learn’s OneHotEncoder() will only encode the categories learned from the train set. If there are new categories in the test set, we can instruct the encoder to ignore them or to return an error by setting the handle_unknown parameter to 'ignore' or 'error', respectively.

  15. Let’s create the NumPy arrays with the binary variables for the train and test sets:
    X_train_enc = encoder.transform(
        X_train[vars_categorical])
    X_test_enc = encoder.transform(
        X_test[vars_categorical])
  16. Let’s extract the names of the binary variables:
    encoder.get_feature_names_out()

We can see the binary variable names that were returned in the following output:

Figure 2.5 – Arrays with the names of the one-hot encoded variables

  17. Let’s convert the array into a pandas DataFrame and add the variable names:
    X_test_enc = pd.DataFrame(X_test_enc)
    X_test_enc.columns = encoder.get_feature_names_out()
  18. To concatenate the one-hot encoded data to the original dataset, we need to make their indexes match:
    X_test_enc.index = X_test.index

Now, we are ready to concatenate the one-hot encoded variables to the original data and then remove the categorical variables using steps 8 and 9 from this recipe.

To follow up, let’s perform one-hot encoding with Feature-engine.

  19. Let’s import the encoder from Feature-engine:
    from feature_engine.encoding import OneHotEncoder
  20. Next, let’s set up the encoder so that it returns k-1 binary variables:
    ohe_enc = OneHotEncoder(drop_last=True)

Tip

Feature-engine automatically finds the categorical variables. To encode only a subset of the variables, we can pass the variable names in a list: OneHotEncoder(variables=["A1", "A4"]). To encode numerical variables, we can set the ignore_format parameter to True or cast the variables as the object type. This is useful because sometimes, numerical variables are used to represent categories, such as postcodes.

  21. Let’s fit the encoder to the train set so that it learns the categories and variables to encode:
    ohe_enc.fit(X_train)
  22. Let’s explore the variables that will be encoded:
    ohe_enc.variables_

The transformer found and stored the variables of the object or categorical type, as shown in the following output:

['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']

Note

Feature-engine’s OneHotEncoder has the option to encode most variables into k dummies, while only returning k-1 dummies for binary variables. For this behavior, set the drop_last_binary parameter to True.

  23. Let’s explore the categories for which dummy variables will be created:
    ohe_enc.encoder_dict_

The following dictionary contains the categories that will be encoded in each variable:

{'A1': ['a', 'b'],
 'A4': ['u', 'y', 'Missing'],
 'A5': ['g', 'p', 'Missing'],
 'A6': ['c', 'q', 'w', 'ff', 'm', 'i', 'e', 'cc', 'x',
        'd', 'k', 'j', 'Missing', 'aa'],
 'A7': ['v', 'ff', 'h', 'dd', 'z', 'bb', 'j', 'Missing',
        'n'],
 'A9': ['t'],
 'A10': ['t'],
 'A12': ['t'],
 'A13': ['g', 's']}
  24. Let’s encode the categorical variables in train and test sets:
    X_train_enc = ohe_enc.transform(X_train)
    X_test_enc = ohe_enc.transform(X_test)

Tip

Feature-engine’s OneHotEncoder() returns a copy of the original dataset plus the binary variables and without the original categorical variables. Thus, this data is ready to train machine learning models.

If we execute X_train_enc.head(), we will see the following DataFrame:

Figure 2.6 – Transformed DataFrame with the one-hot encoded variables on the right

Note how the A4 categorical variable was replaced with A4_u, A4_y, and so on.

Note

We can get the names of all the variables in the transformed dataset by executing ohe_enc.get_feature_names_out().

How it works...

In this recipe, we performed one-hot encoding of categorical variables using pandas, scikit-learn, and Feature-engine.

With get_dummies() from pandas, we automatically created binary variables for each of the categories in the categorical variables.

The OneHotEncoder transformers from the scikit-learn and Feature-engine libraries share the fit() and transform() methods. With fit(), the encoders learned the categories for which the dummy variables should be created. With transform(), they returned the binary variables either in a NumPy array or added them to the original DataFrame.

Tip

One-hot encoding expands the feature space. From nine original categorical variables, we created 36 binary ones. If our datasets contain many categorical variables or highly cardinal variables, we will easily increase the feature space dramatically, which increases the computational cost of training machine learning models or obtaining their predictions and may also deteriorate their performance.
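
Before encoding, it can help to estimate how many binary variables will be created. The following sketch counts the unique categories per variable on made-up data; the column names and the explicit list of categorical columns are assumptions for illustration:

```python
import pandas as pd

# Toy data (hypothetical): two categorical variables and one numerical.
X = pd.DataFrame({
    "color": ["red", "blue", "green", "red"],
    "city": ["London", "Bristol", "London", "London"],
    "age": [25, 40, 31, 58],
})
cat_cols = ["color", "city"]

# Number of unique categories (k) per categorical variable.
cardinality = X[cat_cols].nunique()

# One-hot encoding creates k dummies per variable, or k - 1 each
# when one category per variable is dropped.
n_dummies_k = int(cardinality.sum())
n_dummies_k_minus_1 = int((cardinality - 1).sum())

print(cardinality)
print(n_dummies_k, n_dummies_k_minus_1)
```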

There’s more...

We can also perform one-hot encoding using OneHotEncoder from the Category Encoders library.

OneHotEncoder() from Feature-engine and Category Encoders can automatically identify and encode categorical variables – that is, those of the object or categorical type. So does pandas get_dummies(). Scikit-learn’s OneHotEncoder(), on the other hand, will encode all variables in the dataset.

With pandas, Feature-engine, and Category Encoders, we can encode only a subset of the variables by indicating their names in a list. With scikit-learn, we need to use an additional class, ColumnTransformer(), to slice the data before the transformation.

With Feature-engine and Category Encoders, the dummy variables are added to the original dataset and the categorical variables are removed after the encoding. With scikit-learn and pandas, we need to manually perform these procedures.

Finally, using OneHotEncoder() from scikit-learn, Feature-engine, and Category Encoders, we can perform the encoding step within a scikit-learn pipeline, which is more convenient if we have various feature engineering steps or want to put the pipelines into production. pandas get_dummies(), on the other hand, is well suited for data analysis and visualization.
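
The ColumnTransformer() and pipeline points above can be sketched together with scikit-learn. This is a minimal example on made-up data, not the setup used in this recipe; the column names, the values, and the choice of LogisticRegression are assumptions:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical column names and values).
X = pd.DataFrame({
    "city": ["London", "Bristol", "London", "Manchester", "Bristol", "London"],
    "age": [25, 40, 31, 58, 22, 47],
})
y = pd.Series([1, 0, 1, 0, 0, 1])

# Encode only "city" into k-1 dummies; pass "age" through untouched.
# sparse_threshold=0 forces a dense array as output.
preprocessor = ColumnTransformer(
    transformers=[("ohe", OneHotEncoder(drop="first"), ["city"])],
    remainder="passthrough",
    sparse_threshold=0,
)

pipe = Pipeline([
    ("encode", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

X_enc = pipe.named_steps["encode"].transform(X)
print(X_enc.shape)
print(pipe.predict(X))
```

Because the encoder is fitted inside the pipeline, the categories learned from the train set are reused automatically whenever the pipeline transforms or scores new data.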

 

Performing one-hot encoding of frequent categories

One-hot encoding represents each variable’s category with a binary variable. Hence, one-hot encoding of highly cardinal variables or datasets with multiple categorical features can expand the feature space dramatically. This, in turn, may increase the computational cost of using machine learning models or deteriorate their performance. To reduce the number of binary variables, we can perform one-hot encoding of the most frequent categories. One-hot encoding the top categories is equivalent to treating the remaining, less frequent categories as a single, unique category.

In this recipe, we will implement one-hot encoding of the most popular categories using pandas and Feature-engine.

How to do it...

First, let’s import the necessary Python libraries and get the dataset ready:

  1. Import the required Python libraries, functions, and classes:
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from feature_engine.encoding import OneHotEncoder
  2. Let’s load the dataset and divide it into train and test sets:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(
            labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )

Tip

The most frequent categories need to be determined in the train set. This is to avoid data leakage.

  3. Let’s inspect the unique categories of the A6 variable:
    X_train["A6"].unique()

The unique values of A6 are displayed in the following output:

array(['c', 'q', 'w', 'ff', 'm', 'i', 'e', 'cc', 'x', 'd', 'k', 'j', 'Missing', 'aa', 'r'], dtype=object)
  4. Let’s count the number of observations per category of A6, sort them in decreasing order, and then display the five most frequent categories:
    X_train["A6"].value_counts().sort_values(
        ascending=False).head(5)

We can see the five most frequent categories and the number of observations per category in the following output:

c     93
q     56
w     48
i     41
ff    38
Name: A6, dtype: int64
  5. Now, let’s capture the most frequent categories of A6 in a list by using the code in step 4 inside a list comprehension:
    top_5 = [
        x for x in X_train["A6"].value_counts().sort_values(
            ascending=False).head(5).index
    ]
  6. Now, let’s add a binary variable per top category to the train and test sets:
    for label in top_5:
        X_train[f"A6_{label}"] = np.where(
            X_train["A6"] == label, 1, 0)
        X_test[f"A6_{label}"] = np.where(
            X_test["A6"] == label, 1, 0)
  7. Let’s display the first 10 rows of the original and encoded variable, A6, in the train set:
    X_train[["A6"] + [f"A6_{label}" for label in top_5]].head(10)

In the output of step 7, we can see the A6 variable, followed by the binary variables:

     A6  A6_c  A6_q  A6_w  A6_i  A6_ff
596   c     1     0     0     0      0
303   q     0     1     0     0      0
204   w     0     0     1     0      0
351  ff     0     0     0     0      1
118   m     0     0     0     0      0
247   q     0     1     0     0      0
652   i     0     0     0     1      0
513   e     0     0     0     0      0
230  cc     0     0     0     0      0
250   e     0     0     0     0      0

We can automate one-hot encoding of frequent categories with Feature-engine. First, let’s load and divide the dataset, as we did in step 2.

  8. Let’s set up the one-hot encoder to encode the five most frequent categories of the A6 and A7 variables:
    ohe_enc = OneHotEncoder(
        top_categories=5,
        variables=["A6", "A7"]
    )

Tip

Feature-engine’s OneHotEncoder() will encode all categorical variables in the dataset by default unless we specify the variables to encode, as we did in step 8.

  9. Let’s fit the encoder to the train set so that it learns and stores the most frequent categories of A6 and A7:
    ohe_enc.fit(X_train)

Note

The number of frequent categories to encode is arbitrarily determined by the user.

  10. Finally, let’s encode A6 and A7 in the train and test sets:
    X_train_enc = ohe_enc.transform(X_train)
    X_test_enc = ohe_enc.transform(X_test)

You can view the new binary variables in the DataFrame by executing X_train_enc.head(). You can also find the top five categories learned by the encoder by executing ohe_enc.encoder_dict_.

Note

Feature-engine replaces the original variable with the binary ones returned by one-hot encoding, leaving the dataset ready to use in machine learning.

How it works...

In this recipe, we performed one-hot encoding of the five most popular categories using pandas, NumPy, and Feature-engine.

In the first part of this recipe, we worked with the A6 categorical variable. We inspected its unique categories with pandas unique(). Next, we counted the number of observations per category using pandas value_counts(), which returned a pandas Series with the categories as the index and the number of observations as values. We then sorted the categories from the one with the most observations to the one with the fewest using pandas sort_values(), and reduced the Series to the five most popular categories with pandas head(). We used this Series in a list comprehension to capture the names of the most frequent categories. Finally, we looped over each category and, with NumPy’s where() function, created binary variables that take a value of 1 if the observation shows the category, or 0 otherwise.
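
The manual procedure described above can be wrapped in a small helper. The function name one_hot_top_categories and the toy data are hypothetical, introduced only for illustration; the sketch relies on value_counts() already returning the categories in decreasing order of frequency:

```python
import numpy as np
import pandas as pd


def one_hot_top_categories(train, test, variable, n):
    """Add one binary column per top-n category of `variable`.

    The top categories are learned from `train` only, to avoid leakage.
    """
    top = train[variable].value_counts().head(n).index.tolist()
    train, test = train.copy(), test.copy()
    for label in top:
        train[f"{variable}_{label}"] = np.where(train[variable] == label, 1, 0)
        test[f"{variable}_{label}"] = np.where(test[variable] == label, 1, 0)
    return train, test, top


# Toy data (hypothetical values): "x" appears only in the test set.
train = pd.DataFrame({"A6": ["c", "c", "c", "q", "q", "w"]})
test = pd.DataFrame({"A6": ["q", "w", "x"]})

train_enc, test_enc, top = one_hot_top_categories(train, test, "A6", n=2)
print(top)
print(test_enc)
```

Categories outside the top n, including those that only appear in the test set, end up with 0 in every binary variable, which is what treating them as a single remaining category amounts to.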

To perform a one-hot encoding of the five most popular categories of the A6 and A7 variables with Feature-engine, we used OneHotEncoder(), indicating 5 in the top_categories argument, and passing the variable names in a list to the variables argument. With fit(), the encoder learned the top categories from the train set and stored them in its encoder_dict_ attribute. Then, with transform(), OneHotEncoder() replaced the original variables with the set of binary ones.

There’s more...

This recipe is based on the winning solution of the KDD 2009 cup, Winning the KDD Cup Orange Challenge with Ensemble Selection (http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf), where the authors limited one-hot encoding to the 10 most frequent categories of each variable.

 

Replacing categories with counts or the frequency of observations

In count or frequency encoding, we replace the categories with the count or the fraction of observations showing that category. That is, if 10 out of 100 observations show the category blue for the Color variable, we would replace blue with 10 when doing count encoding, or with 0.1 if performing frequency encoding. These encoding methods, which capture the representation of each label in a dataset, are very popular in data science competitions. The assumption is that the number of observations per category is somewhat predictive of the target.

Tip

Note that if two different categories are present in the same number of observations, they will be replaced by the same value, which leads to information loss.
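
The blue example and the tie described in the preceding tip can be reproduced with a short sketch on made-up data (the Color values and their proportions are assumptions chosen to match the 10-in-100 example):

```python
import pandas as pd

# Toy data (hypothetical): 100 observations of a Color variable.
color = pd.Series(["blue"] * 10 + ["red"] * 45 + ["green"] * 45)

# Count encoding uses the number of observations per category;
# frequency encoding uses the fraction of observations.
counts = color.value_counts().to_dict()
freqs = color.value_counts(normalize=True).to_dict()

count_encoded = color.map(counts)
freq_encoded = color.map(freqs)

print(counts["blue"], freqs["blue"])

# red and green appear equally often, so they share one encoded value
# and can no longer be told apart after encoding.
print(count_encoded.nunique())
```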

In this recipe, we will perform count and frequency encoding using pandas, Feature-engine, and Category Encoders.

How to do it...

Let’s begin by making some imports and preparing the data:

  1. Import pandas and the required function:
    import pandas as pd
    from sklearn.model_selection import train_test_split
  2. Let’s load the dataset and divide it into train and test sets:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  3. Let’s count the number of observations per category of the A7 variable and capture it in a dictionary:
    counts = X_train["A7"].value_counts().to_dict()

Tip

To encode categories with their frequency, execute X_train["A7"].value_counts(normalize=True).to_dict().

If we execute print(counts), we can observe the count of observations per category:

{'v': 277, 'h': 101, 'ff': 41, 'bb': 39, 'z': 7, 'dd': 5, 'j': 5, 'Missing': 4, 'n': 3, 'o': 1}
  4. Let’s replace the categories in A7 with the counts:
    X_train["A7"] = X_train["A7"].map(counts)
    X_test["A7"] = X_test["A7"].map(counts)

Go ahead and inspect the data by executing X_train.head() to corroborate that the categories have been replaced by the counts.
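
When inspecting the result, keep in mind that pandas map() returns NaN for any category that is not a key of the dictionary, for example, a category that appears only in the test set. The sketch below shows this on made-up data; filling with 0 is one possible choice and an assumption, not part of the recipe:

```python
import pandas as pd

# Toy data (hypothetical): "o" appears only in the test set.
train = pd.Series(["v", "v", "h", "v", "h", "ff"])
test = pd.Series(["v", "ff", "o"])

counts = train.value_counts().to_dict()

# Categories missing from the dictionary are mapped to NaN.
test_enc = test.map(counts)
print(test_enc)

# One possible choice (an assumption): treat unseen categories as
# having zero observations in the train set.
test_enc_filled = test_enc.fillna(0).astype(int)
print(test_enc_filled.tolist())
```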

Now, let’s carry out count encoding using Feature-engine. First, let’s load and divide the dataset, as we did in step 2.

  5. Let’s import the count encoder from Feature-engine:
    from feature_engine.encoding import CountFrequencyEncoder
  6. Let’s set up the encoder so that it encodes all categorical variables with the count of observations:
    count_enc = CountFrequencyEncoder(
        encoding_method="count", variables=None,
    )

Tip

CountFrequencyEncoder() will automatically find and encode all categorical variables in the train set. To encode only a subset of the variables, we can pass the variable names in a list to the variables argument.

  7. Let’s fit the encoder to the train set so that it stores the number of observations per category per variable:
    count_enc.fit(X_train)

Tip

The dictionaries with the category-to-counts pairs are stored in the encoder_dict_ attribute and can be displayed by executing count_enc.encoder_dict_.

  8. Finally, let’s replace the categories with counts in the train and test sets:
    X_train_enc = count_enc.transform(X_train)
    X_test_enc = count_enc.transform(X_test)

Tip

If there are categories in the test set that were not present in the train set, the transformer will replace those with np.nan and return a warning to make you aware of this. A good idea to prevent this behavior is to group infrequent labels, as described in the Grouping rare or infrequent categories recipe.

The encoder returns pandas DataFrames with the strings of the categorical variables replaced with the counts of observations, leaving the variables ready to use in machine learning models.

To wrap up this recipe, let’s encode the variables using Category Encoders.

  9. Let’s import the encoder from Category Encoders:
    from category_encoders.count import CountEncoder
  10. Let’s set up the encoder so that it encodes all categorical variables with the count of observations:
    count_enc = CountEncoder(cols=None)

Note

CountEncoder() automatically finds and encodes all categorical variables in the train set. To encode only a subset of the categorical variables, we can pass the variable names in a list to the cols argument. To replace the categories by frequency instead, we need to set the normalize parameter to True.

  11. Let’s fit the encoder to the train set so that it counts and stores the number of observations per category per variable:
    count_enc.fit(X_train)

Tip

The values used to replace the categories are stored in the mapping attribute and can be displayed by executing count_enc.mapping.

  12. Finally, let’s replace the categories with counts in the train and test sets:
    X_train_enc = count_enc.transform(X_train)
    X_test_enc = count_enc.transform(X_test)

Note

Categories present in the test set that were not seen in the train set are referred to as unknown categories. CountEncoder() has different options to handle unknown categories, including returning an error, treating them as missing data, or replacing them with an indicated integer. CountEncoder() can also automatically group categories with few observations.

The encoder returns pandas DataFrames with the strings of the categorical variables replaced with the counts of observations, leaving the variables ready to use in machine learning models.

How it works...

In this recipe, we replaced categories by the count of observations using pandas, Feature-engine, and Category Encoders.

Using pandas value_counts(), we determined the number of observations per category of the A7 variable, and with pandas to_dict(), we captured these values in a dictionary, where each key was a unique category, and each value the number of observations for that category. With pandas map() and using this dictionary, we replaced the categories with the observation counts in both the train and test sets.

To perform count encoding with Feature-engine, we used CountFrequencyEncoder() and set encoding_method to 'count'. We left the variables argument set to None so that the encoder automatically finds all of the categorical variables in the dataset. With the fit() method, the transformer found the categorical variables and stored the observation counts per category in the encoder_dict_ attribute. With the transform() method, the transformer replaced the categories with the counts, returning a pandas DataFrame.

Finally, we performed count encoding with CountEncoder() from Category Encoders, leaving its normalize parameter at the default value of False. We left the cols argument set to None so that the encoder automatically finds the categorical variables in the dataset. With the fit() method, the transformer found the categorical variables and stored the category-to-count mappings in the mapping attribute. With the transform() method, the transformer replaced the categories with the counts, returning a pandas DataFrame.

 

Replacing categories with ordinal numbers

Ordinal encoding consists of replacing the categories with digits from 1 to k (or 0 to k-1, depending on the implementation), where k is the number of distinct categories of the variable. The numbers are assigned arbitrarily. Ordinal encoding is better suited for non-linear machine learning models, which can navigate through the arbitrarily assigned digits to find patterns that relate to the target.

In this recipe, we will perform ordinal encoding using pandas, scikit-learn, and Feature-engine.

How to do it...

First, let’s import the necessary Python libraries and prepare the dataset:

  1. Import pandas and the data split function:
    import pandas as pd
    from sklearn.model_selection import train_test_split
  2. Let’s load the dataset and divide it into train and test sets:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  3. To encode the A7 variable, let’s make a dictionary of category-to-integer pairs:
    ordinal_mapping = {k: i for i, k in enumerate(
        X_train["A7"].unique(), 0)
    }

If we execute print(ordinal_mapping), we will see the digits that will replace each category:

{'v': 0, 'ff': 1, 'h': 2, 'dd': 3, 'z': 4, 'bb': 5, 'j': 6, 'Missing': 7, 'n': 8, 'o': 9}
  4. Now, let’s replace the categories with numbers in the original variables:
    X_train["A7"] = X_train["A7"].map(ordinal_mapping)
    X_test["A7"] = X_test["A7"].map(ordinal_mapping)

With print(X_train["A7"].head(10)), we can see the result of the preceding operation, where the original categories were replaced by numbers:

596	0
303	0
204	0
351	1
118	0
247	2
652	0
513	3
230	0
250	4
Name:	A7, dtype: int64

Next, let’s carry out ordinal encoding using scikit-learn. First, we need to divide the data into train and test sets, as we did in step 2.

  5. Let’s import the required classes:
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.compose import ColumnTransformer

Tip

Do not confuse OrdinalEncoder() with LabelEncoder() from scikit-learn. The former is intended to encode predictive features, whereas the latter is intended to modify the target variable.

  6. Let’s set up the encoder:
    enc = OrdinalEncoder()

Note

Scikit-learn’s OrdinalEncoder() will encode the entire dataset. To encode only a selection of variables, we need to use scikit-learn’s ColumnTransformer().

  7. Let’s make a list containing the categorical variables to encode:
    vars_categorical = X_train.select_dtypes(
        include="O").columns.to_list()
  8. Let’s make a list containing the remaining variables:
    vars_remainder = X_train.select_dtypes(
        exclude="O").columns.to_list()
  9. Now, let’s set up ColumnTransformer() to encode the categorical variables. By setting the remainder parameter to "passthrough", we make ColumnTransformer() concatenate the variables that are not encoded at the back of the encoded features:
    ct = ColumnTransformer(
        [("encoder", enc, vars_categorical)],
        remainder="passthrough",
    )
  10. Let’s fit the encoder to the train set so that it creates and stores representations of categories to digits:
    ct.fit(X_train)

By executing ct.named_transformers_["encoder"].categories_, you can visualize the unique categories per variable.

  11. Now, let’s encode the categorical variables in the train and test sets:
    X_train_enc = ct.transform(X_train)
    X_test_enc = ct.transform(X_test)

Remember that scikit-learn returns a NumPy array.

  12. Let’s transform the arrays into pandas DataFrames by adding the columns:
    X_train_enc = pd.DataFrame(
        X_train_enc, columns=vars_categorical+vars_remainder)
    X_test_enc = pd.DataFrame(
        X_test_enc, columns=vars_categorical+vars_remainder)

Note

Note that, with ColumnTransformer(), the variables that were not encoded will be returned to the right of the DataFrame, following the encoded variables. You can visualize the output of step 12 with X_train_enc.head().

Now, let’s do ordinal encoding with Feature-engine. First, we must divide the dataset, as we did in step 2.

  13. Let’s import the encoder:
    from feature_engine.encoding import OrdinalEncoder
  14. Let’s set up the encoder so that it replaces categories with arbitrary integers in the categorical variables specified in step 7:
    enc = OrdinalEncoder(encoding_method="arbitrary", variables=vars_categorical)

Note

Feature-engine’s OrdinalEncoder automatically finds and encodes all categorical variables if the variables parameter is left set to None. Alternatively, it will encode the variables indicated in the list. In addition, Feature-engine’s OrdinalEncoder() can assign the integers according to the target mean value (see the Performing ordinal encoding based on the target value recipe).

  15. Let’s fit the encoder to the train set so that it learns and stores the category-to-integer mappings:
    enc.fit(X_train)

Tip

The category to integer mappings are stored in the encoder_dict_ attribute and can be accessed by executing enc.encoder_dict_.

  16. Finally, let’s encode the categorical variables in the train and test sets:
    X_train_enc = enc.transform(X_train)
    X_test_enc = enc.transform(X_test)

Feature-engine returns pandas DataFrames where the values of the original variables are replaced with numbers, leaving the DataFrame ready to use in machine learning models.

How it works...

In this recipe, we replaced categories with integers assigned arbitrarily.

With pandas unique(), we returned the unique values of the A7 variable, and using Python’s dictionary comprehension syntax, we created a dictionary of key-value pairs, where each key was one of the A7 variable’s unique categories, and each value was the digit that would replace the category. Finally, we used pandas map() to replace the strings in A7 with the integers.
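The mapping logic can be illustrated without pandas. The following is a small sketch on a hypothetical list of category values, where dict.fromkeys() is used as a stand-in for pandas unique() because it also keeps the values in order of first appearance:

```python
# Hypothetical category values, in order of appearance.
values = ["v", "ff", "v", "h", "ff", "dd"]

# Keep the first occurrence of each value, preserving order.
unique_values = list(dict.fromkeys(values))

# Dictionary comprehension: category -> integer, starting at 0.
ordinal_mapping = {k: i for i, k in enumerate(unique_values, 0)}

# Replace each category with its integer.
encoded = [ordinal_mapping[v] for v in values]

print(ordinal_mapping)  # {'v': 0, 'ff': 1, 'h': 2, 'dd': 3}
print(encoded)          # [0, 1, 0, 2, 1, 3]
```

Because the integers follow the order in which the categories first appear, a different ordering of the rows would produce a different, equally arbitrary mapping.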

Next, we carried out ordinal encoding using scikit-learn’s OrdinalEncoder() and used ColumnTransformer() to select the columns to encode. With the fit() method, the transformer created the category-to-integer mappings based on the categories in the train set. With the transform() method, the categories were replaced with integers, returning a NumPy array. ColumnTransformer() sliced the DataFrame into the categorical variables to encode, and then concatenated the remaining variables at the right of the encoded features.

To perform ordinal encoding with Feature-engine, we used OrdinalEncoder(), indicating that the integers should be assigned arbitrarily in encoding_method and passing a list with the variables to encode in the variables argument. With the fit() method, the encoder assigned integers to each variable’s categories, which were stored in the encoder_dict_ attribute. These mappings were then used by the transform() method to replace the categories in the train and test sets, returning DataFrames.

There’s more...

You can also carry out ordinal encoding with OrdinalEncoder() from Category Encoders.

The transformers from Feature-engine and Category Encoders can automatically identify and encode categorical variables – that is, those of the object or categorical type. They also allow us to encode only a subset of the variables.

By contrast, scikit-learn’s OrdinalEncoder() encodes all variables in the dataset. To encode just a subset, we need to use an additional class, ColumnTransformer(), to slice the data before the transformation.

Feature-engine and Category Encoders return pandas DataFrames, whereas scikit-learn returns NumPy arrays.

Finally, each class has additional functionality. For example, with scikit-learn, we can encode only a subset of the categories, whereas Feature-engine allows us to replace categories with integers that are assigned based on the target mean value. On the other hand, Category Encoders can automatically handle missing data and offers alternative options to work with unseen categories.

 

Performing ordinal encoding based on the target value

In the previous recipe, we replaced categories with integers, which were assigned arbitrarily. We can also assign integers to the categories given the target values. To do this, first, we must calculate the mean value of the target per category. Next, we must order the categories from the one with the lowest to the one with the highest target mean value. Finally, we must assign digits to the ordered categories, starting with 0 to the first category up to k-1 to the last category, where k is the number of distinct categories.

This encoding method creates a monotonic relationship between the encoded variable and the target, which makes the variable more suitable for use in linear models.

In this recipe, we will encode categories according to the target value using pandas and Feature-engine.
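The three steps described above can be sketched in plain Python. The toy categories and binary target below are hypothetical and only serve to show the order of operations:

```python
from collections import defaultdict

# Hypothetical toy data: one categorical feature and a binary target.
categories = ["a", "b", "a", "c", "b", "c", "c", "a"]
target = [1, 0, 1, 0, 1, 0, 1, 0]

# Step 1: mean target value per category.
sums, counts = defaultdict(float), defaultdict(int)
for cat, y in zip(categories, target):
    sums[cat] += y
    counts[cat] += 1
target_mean = {cat: sums[cat] / counts[cat] for cat in counts}

# Step 2: order the categories from lowest to highest target mean.
ordered = sorted(target_mean, key=target_mean.get)

# Step 3: assign the integers 0 to k-1 following that order.
ordinal_mapping = {cat: i for i, cat in enumerate(ordered)}

print(ordinal_mapping)  # {'c': 0, 'b': 1, 'a': 2}
```

Here, c has the lowest target mean (1/3) and a the highest (2/3), so the assigned integers increase with the target mean.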

How to do it...

First, let’s import the necessary Python libraries and get the dataset ready:

  1. Import the required Python libraries, functions, and classes:
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
  2. Let’s load the dataset and divide it into train and test sets:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  3. Let’s determine the mean target value per category in A7, then sort the categories from that with the lowest to that with the highest target value:
    y_train.groupby(X_train["A7"]).mean().sort_values()

The following is the output of the preceding command:

A7
o          0.000000
ff         0.146341
j          0.200000
dd         0.400000
v          0.418773
bb         0.512821
h          0.603960
n          0.666667
z          0.714286
Missing    1.000000
Name: target, dtype: float64
  4. Now, let’s repeat the computation in step 3, but this time, let’s retain the ordered category names:
    ordered_labels = y_train.groupby(
        X_train["A7"]).mean().sort_values().index

To display the output of the preceding command, we can execute print(ordered_labels):

Index(['o', 'ff', 'j', 'dd', 'v', 'bb', 'h', 'n', 'z', 'Missing'], dtype='object', name='A7')
  5. Let’s create a dictionary of category-to-integer pairs, using the ordered list we created in step 4:
    ordinal_mapping = {
        k: i for i, k in enumerate(
            ordered_labels, 0)
    }

We can visualize the result of the preceding code by executing print(ordinal_mapping):

{'o': 0, 'ff': 1, 'j': 2, 'dd': 3, 'v': 4, 'bb': 5, 'h': 6, 'n': 7, 'z': 8, 'Missing': 9}
  6. Let’s use the dictionary we created in step 5 to replace the categories in A7 in the train and test sets, returning the encoded features as new columns:
    X_train["A7_enc"] = X_train["A7"].map(ordinal_mapping)
    X_test["A7_enc"] = X_test["A7"].map(ordinal_mapping)

Tip

Note that if the test set contains a category not present in the train set, the preceding code will introduce np.nan.

To better understand the monotonic relationship concept, let’s plot the relationship of the categories of the A7 variable with the target before and after the encoding.

  7. Let’s plot the mean target response per category of the A7 variable:
    y_train.groupby(X_train["A7"]).mean().plot()
    plt.title("Relationship between A7 and the target")
    plt.ylabel("Mean of target")
    plt.show()

We can see the non-monotonic relationship between categories of A7 and the target in the following plot:

Figure 2.7 – Relationship between the categories of A7 and the target

  8. Let’s plot the mean target value per category in the encoded variable:
    y_train.groupby(X_train["A7_enc"]).mean().plot()
    plt.title("Relationship between A7 and the target")
    plt.ylabel("Mean of target")
    plt.show()

The encoded variable shows a monotonic relationship with the target – the higher the mean target value, the higher the digit assigned to the category:

Figure 2.8 – Relationship between A7 and the target after the encoding

Now, let’s perform ordered ordinal encoding using Feature-engine. First, we must divide the dataset into train and test sets, as we did in step 2.

  9. Let’s import the encoder:
    from feature_engine.encoding import OrdinalEncoder
  10. Next, let’s set up the encoder so that it assigns integers by following the target value to all categorical variables in the dataset:
    ordinal_enc = OrdinalEncoder(
        encoding_method="ordered",
        variables=None)

Tip

OrdinalEncoder() will find and encode all categorical variables automatically. Alternatively, we can indicate which variables to encode by passing their names in a list to the variables argument.

  11. Let’s fit the encoder to the train set so that it finds the categorical variables, and then stores the category and integer mappings:
    ordinal_enc.fit(X_train, y_train)

Tip

When fitting the encoder, we need to pass the train set and the target, like with many scikit-learn predictor classes.

  12. Finally, let’s replace the categories with numbers in the train and test sets:
    X_train_enc = ordinal_enc.transform(X_train)
    X_test_enc = ordinal_enc.transform(X_test)

Tip

A list of the categorical variables is stored in the variables_ attribute of OrdinalEncoder() and the dictionaries with the category-to-integer mappings in the encoder_dict_ attribute.

Go ahead and check the monotonic relationship between other encoded categorical variables and the target by using the code in step 7 and changing the variable name in the groupby() method.

How it works...

In this recipe, we replaced the categories with integers according to the target mean.

In the first part of this recipe, we worked with the A7 categorical variable. With pandas groupby(), we grouped the data based on the categories of A7, and with pandas mean(), we determined the mean value of the target for each of the categories of A7. Next, we ordered the categories with pandas sort_values() from the ones with the lowest to the ones with the highest target mean response. The output of this operation was a pandas Series, with the categories as indices and the target mean as values. With pandas index, we captured the ordered categories in an array; then, with Python dictionary comprehension, we created a dictionary of category-to-integer pairs. Finally, we used this dictionary to replace the category with integers using pandas map() in the train and test sets.
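As noted in the tip earlier in this recipe, pandas map() returns missing values for categories that are not in the dictionary. The following sketch, which uses a hypothetical mapping and a hypothetical test column, shows one way to detect this situation after encoding:

```python
import pandas as pd

# Hypothetical mapping learned from a train set.
ordinal_mapping = {"o": 0, "ff": 1, "v": 2}

# Hypothetical test column with a category ("zz") not seen during training.
test_col = pd.Series(["v", "zz", "o"])

encoded = test_col.map(ordinal_mapping)

# Unseen categories show up as missing values after map().
n_unseen = int(encoded.isna().sum())

print(encoded.tolist())
print(n_unseen)
```

Checking for missing values right after the encoding is a simple safeguard; how to handle them (for example, with imputation or by grouping rare categories beforehand) depends on the use case.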

Then, we plotted the relationship of the original and encoded variables with the target to visualize the monotonic relationship after the transformation. We determined the mean target value per category of A7 using pandas groupby(), followed by pandas mean(), as described in the preceding paragraph. We followed up with pandas plot() to create a plot of category versus target mean value. We added a title and y labels with Matplotlib’s title() and ylabel() methods.

To perform the encoding with Feature-engine, we used OrdinalEncoder() and indicated "ordered" in the encoding_method argument. We left the argument variables set to None so that the encoder automatically detects all categorical variables in the dataset. With the fit() method, the encoder found the categorical variables to encode and assigned digits to their categories, according to the target mean value. The variables to encode and dictionaries with category-to-digit pairs were stored in the variables_ and encoder_dict_ attributes, respectively. Finally, using the transform() method, the transformer replaced the categories with digits in the train and test sets, returning pandas DataFrames.

See also

For an implementation of this recipe with Category Encoders, visit this book’s GitHub repository.

 

Implementing target mean encoding

Mean encoding or target encoding maps each category to the probability estimate of the target attribute. If the target is binary, the numerical mapping is the posterior probability of the target conditioned on the value of the category. If the target is continuous, the numerical representation is given by the expected value of the target given the value of the category.

In its simplest form, the numerical representation for each category is given by the mean value of the target variable for a particular category group. For example, if we have a City variable, with the categories of London, Manchester, and Bristol, and we want to predict the default rate (the target takes values 0 and 1); if the default rate for London is 30%, we replace London with 0.3; if the default rate for Manchester is 20%, we replace Manchester with 0.2; and so on. If the target is continuous – say we want to predict income – then we would replace London, Manchester, and Bristol with the mean income earned in each city.

In mathematical terms, if the target is binary, the replacement value, S, is determined like so:

$$S_i = \frac{n_{i(Y=1)}}{n_i}$$

Here, the numerator is the number of observations with a target value of 1 for category i and the denominator is the number of observations with a category value of i.

If the target is continuous, S is determined by the following formula:

$$S_i = \frac{\sum_{j \in i} y_j}{n_i}$$

Here, the numerator is the sum of the target across observations in category i and ni is the total number of observations in category i.

These formulas provide a good approximation of the target estimate if there is a sufficiently large number of observations with each category value – in other words, if ni is large. However, in most datasets, categorical variables will have categories that are present in only a few observations. In these cases, target estimates derived from the preceding formulas can be unreliable.

To mitigate poor estimates returned for rare categories, the target estimates can be determined as a mixture of two probabilities: those returned by the preceding formulas and the prior probability of the target based on the entire training set. The two probabilities are blended using a weighting factor, which is a function of the category group size:

$$S_i = \lambda(n_i)\,\frac{n_{i(Y=1)}}{n_i} + \bigl(1 - \lambda(n_i)\bigr)\,\frac{n_Y}{N}$$

In this formula, ny is the total number of cases where the target takes a value of 1, N is the size of the train set, and 𝛌 is the weighting factor.

When the category group is large, 𝛌 approaches 1, so more weight is given to the first term of the equation. When the category group size is small, 𝛌 tends to 0, so the estimate is mostly driven by the second term of the equation – that is, the target’s prior probability. In other words, if the group size is small, knowing the value of the category does not tell us anything about the value of the target.

The weighting factor, 𝛌, is a function of the group size, ni, a threshold, k, and a smoothing parameter, f, which controls the rate of transition between the first and second terms of the preceding equation:

$$\lambda(n_i) = \frac{1}{1 + e^{-(n_i - k)/f}}$$

Here, k is half of the minimal size for which we fully trust the first term of the equation. The f parameter is selected by the user either arbitrarily or with optimization.
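To make the blending concrete, here is a minimal sketch in plain Python. It assumes the sigmoid form of the weighting factor proposed in the Micci-Barreca article, 𝛌 = 1 / (1 + exp(-(n - k) / f)); the function names and the example numbers are hypothetical:

```python
import math

def weighting_factor(n: int, k: float, f: float) -> float:
    """Sigmoid weight between 0 and 1 that grows with the group size n.

    Assumed form: 1 / (1 + exp(-(n - k) / f)).
    """
    return 1.0 / (1.0 + math.exp(-(n - k) / f))

def blended_estimate(n_pos: int, n: int, prior: float, k: float, f: float) -> float:
    """Blend the category's target mean with the prior probability."""
    lam = weighting_factor(n, k, f)
    return lam * (n_pos / n) + (1.0 - lam) * prior

prior = 0.4  # hypothetical prior probability of the target (nY / N)

# Large group: the estimate stays close to the category mean (150 / 200).
large = blended_estimate(n_pos=150, n=200, prior=prior, k=25, f=1.0)

# Small group: the estimate falls back to the prior.
small = blended_estimate(n_pos=2, n=2, prior=prior, k=25, f=1.0)

print(round(large, 4), round(small, 4))
```

With these values, the large group keeps an estimate of roughly 0.75, whereas the group with only two observations is encoded with roughly the prior, 0.4, even though its raw target mean is 1.0.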

Tip

Mean encoding was designed to encode highly cardinal categorical variables without expanding the feature space. For more details, check out the following article: Micci-Barreca D. A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. ACM SIGKDD Explorations Newsletter, 2001.

In this recipe, we will perform mean encoding using pandas, Feature-engine, and Category Encoders.

How to do it...

In the first part of this recipe, we will replace categories with the target mean value, regardless of the number of observations per category. We will use pandas and Feature-engine to do this. In the second part of this recipe, we will introduce the weighting factor using Category Encoders. Let’s begin with this recipe:

  1. Import pandas and the data split function:
    import pandas as pd
    from sklearn.model_selection import train_test_split
  2. Let’s load the dataset and divide it into train and test sets:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  3. Let’s determine the mean target value per category of the A7 variable and then store them in a dictionary:
    mapping = y_train.groupby(X_train["A7"]).mean().to_dict()

We can display the content of the dictionary by executing print(mapping):

{'Missing': 1.0,
 'bb': 0.5128205128205128,
 'dd': 0.4,
 'ff': 0.14634146341463414,
 'h': 0.6039603960396039,
 'j': 0.2,
 'n': 0.6666666666666666,
 'o': 0.0,
 'v': 0.4187725631768953,
 'z': 0.7142857142857143}
  4. Let’s replace the categories with the mean target value using the dictionary we created in step 3 in the train and test sets:
    X_train["A7"] = X_train["A7"].map(mapping)
    X_test["A7"] = X_test["A7"].map(mapping)

You can inspect the encoded A7 variable by executing X_train["A7"].head().

Now, let’s perform target encoding with Feature-engine. First, we must split the data, as we did in step 2.

  5. Let’s import the encoder:
    from feature_engine.encoding import MeanEncoder
  6. Let’s set up the target mean encoder to encode all categorical variables:
    mean_enc = MeanEncoder(variables=None)

Tip

MeanEncoder() will find and encode all categorical variables by default. Alternatively, we can indicate the variables to encode by passing their names in a list to the variables argument.

  7. Let’s fit the transformer to the train set so that it learns and stores the mean target value per category per variable. Note that we need to pass both the train set and target to fit the encoder:
    mean_enc.fit(X_train, y_train)
  8. Finally, let’s encode the train and test sets:
    X_train_enc = mean_enc.transform(X_train)
    X_test_enc = mean_enc.transform(X_test)

Tip

The category-to-number pairs are stored as a dictionary of dictionaries in the encoder_dict_ attribute. To display the stored parameters, execute mean_enc.encoder_dict_.

Feature-engine returns pandas DataFrames containing the encoded categorical variables, ready to use in machine learning models.

To wrap up, let’s implement mean encoding with Category Encoders blending the probabilities.

  9. Let’s import the encoder:
    from category_encoders.target_encoder import TargetEncoder
  10. Let’s set up the encoder so that it encodes all categorical variables using blended probabilities when there are fewer than 25 observations in the category group:
    mean_enc = TargetEncoder(
        cols=None, min_samples_leaf=25,
        smoothing=1.0
    )

Tip

TargetEncoder() finds categorical variables automatically by default. Alternatively, we can indicate the variables to encode by passing their names in a list to the cols argument. The smoothing parameter controls the blend of the prior and posterior probability. Higher values decrease the contribution of the posterior probability to the encoding.

  11. Let’s fit the transformer to the train set so that it learns and stores the numerical representations for each category:
    mean_enc.fit(X_train, y_train)

Note

The min_samples_leaf parameter refers to the minimum number of observations per category that a group should have to solely use the posterior probability. It is the equivalent of k in our weighting factor formula. In the original article, k was set to ½ of min_samples_leaf. Category Encoders exposes this value, so we can optimize it with cross-validation.

  12. Finally, let’s encode the train and test sets:
    X_train_enc = mean_enc.transform(X_train)
    X_test_enc = mean_enc.transform(X_test)

Category Encoders returns pandas DataFrames by default, where the original categorical variable values are replaced by their numerical representation. You can inspect the results by executing X_train_enc.head().

How it works…

In this recipe, we replaced the categories with the mean target value using pandas, Feature-engine, and Category Encoders.

With pandas groupby(), using the A7 categorical variable, followed by pandas mean() over the target variable, we created a pandas Series with the categories as indices and the target mean as values. With pandas to_dict(), we converted this Series into a dictionary. Finally, we used this dictionary to replace the categories in the train and test sets using pandas map().

To perform the encoding with Feature-engine, we used MeanEncoder(). With fit(), the transformer found and stored the categorical variables and the mean target value per category. With transform(), categories were replaced with numbers in the train and test sets, returning pandas DataFrames.

Finally, we used TargetEncoder() from Category Encoders to replace categories with a blend of prior and posterior probability estimates of the target. We set min_samples_leaf to 25, which meant that if a category group had 25 observations or more, then the posterior probability was used for the encoding; otherwise, a blend of probabilities was used. With fit(), the transformer found the categorical variables and the numerical representation of the categories, while with transform(), the categories were replaced with numbers, returning pandas DataFrames with their encoded values.

There’s more…

There is an alternative way to return better target estimates when the category groups are small. The replacement value for each category is determined as follows:

$$S_i = \frac{n_{i(Y=1)} + p_Y \times m}{n_i + m}$$

Here, ni(Y=1) is the number of observations with a target value of 1 in category i and ni is the number of observations with category i. The target prior is given by pY and m is the weighting factor. With this adjustment, the only parameter that we have to set is the weight, m. If m is large, then more importance is given to the target’s prior probability. This adjustment affects target estimates for all categories but mostly for those with fewer observations because, in such cases, m could be much larger than ni in the formula’s denominator.
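The effect of the weight, m, can be checked with a few lines of plain Python. This is a sketch that assumes the estimate takes the form (positives in the category + prior × m) / (observations in the category + m); the function name and the numbers are hypothetical:

```python
def m_estimate(n_pos: int, n: int, prior: float, m: float) -> float:
    """Target estimate shrunk towards the prior with weight m (assumed form)."""
    return (n_pos + prior * m) / (n + m)

prior = 0.4  # hypothetical prior probability of the target

# Rare category: 2 observations, both positive. The raw mean would be 1.0.
rare = m_estimate(n_pos=2, n=2, prior=prior, m=10)

# Frequent category: 200 observations, 150 positive. The raw mean is 0.75.
frequent = m_estimate(n_pos=150, n=200, prior=prior, m=10)

print(round(rare, 4), round(frequent, 4))
```

With m set to 10, the rare category is pulled from 1.0 down to 0.5, close to the prior, whereas the frequent category barely moves from 0.75 to roughly 0.733.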

For an implementation of this encoding using MEstimateEncoder(), visit this book’s GitHub repository.

 

Encoding with the Weight of Evidence

The Weight of Evidence (WoE) was developed primarily for credit and financial industries to facilitate variable screening and exploratory analysis and to build more predictive linear models to evaluate the risk of loan defaults.

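The WoE is computed from the basic odds ratio:

$$\mathrm{WoE} = \log\left(\frac{\text{proportion of positive cases}}{\text{proportion of negative cases}}\right)$$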

Here, positive and negative refer to the values of the target being 1 or 0, respectively. The proportion of positive cases per category is determined as the sum of positive cases per category group divided by the total positive cases in the training set, and the proportion of negative cases per category is determined as the sum of negative cases per category group divided by the total number of negative observations in the training set.
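As a small worked example, the following sketch computes the WoE per category in plain Python, taking the WoE as the natural logarithm of the proportion of positive cases divided by the proportion of negative cases, as described above. The toy categories and target are hypothetical:

```python
import math
from collections import Counter

# Hypothetical toy data: one categorical feature and a binary target.
categories = ["a", "a", "a", "b", "b", "b", "b", "a"]
target = [1, 1, 0, 0, 0, 1, 0, 1]

# Positive and negative cases per category.
pos = Counter(c for c, y in zip(categories, target) if y == 1)
neg = Counter(c for c, y in zip(categories, target) if y == 0)

# Total positive and negative cases in the data.
total_pos, total_neg = sum(pos.values()), sum(neg.values())

# WoE = log(proportion of positives / proportion of negatives).
woe = {
    c: math.log((pos[c] / total_pos) / (neg[c] / total_neg))
    for c in sorted(set(categories))
}

print(woe)
```

Category a holds 3 of the 4 positive cases and 1 of the 4 negative cases, so its WoE is log(3), which is positive; category b is the mirror image, so its WoE is -log(3).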

The WoE has the following characteristics:

  • WoE = 0 if p(positive) / p(negative) = 1; that is, if the outcome is random
  • WoE > 0 if p(positive) > p(negative)
  • WoE < 0 if p(negative) > p(positive)

This allows us to directly visualize the predictive power of the category in the variable: the higher the WoE, the more likely the event will occur. If the WoE is positive, the event is likely to occur.

Logistic regression models a binary response, Y, based on X predictor variables, assuming that there is a linear relationship between X and the log of odds of Y:

$$\log\left(\frac{p(Y=1)}{p(Y=0)}\right) = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$$

Here, log (p(Y=1)/p(Y=0)) is the log of odds. As you can see, the WoE encodes the categories in the same scale – that is, the log of odds – as the outcome of the logistic regression.

Therefore, by using WoE, the predictors are prepared and coded on the same scale, and the parameters in the logistic regression model – that is, the coefficients – can be directly compared.
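To see why the WoE lives on the log of odds scale, it can help to expand the definition. In the sketch below, the symbols are introduced for illustration only: the numerators inside the first logarithm are the positive and negative cases in category i, and the denominators are the total positive and negative cases in the train set:

$$
\mathrm{WoE}_i
= \log\left(\frac{n_{i(Y=1)} / n_{Y=1}}{n_{i(Y=0)} / n_{Y=0}}\right)
= \log\left(\frac{n_{i(Y=1)}}{n_{i(Y=0)}}\right) - \log\left(\frac{n_{Y=1}}{n_{Y=0}}\right)
$$

In other words, the WoE of a category is the log of odds observed within that category minus the log of odds observed in the entire train set, so a WoE of 0 corresponds to a category whose odds match the overall odds.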

In this recipe, we will perform WoE encoding using pandas and Feature-engine.

How to do it...

Let’s begin by making some imports and preparing the data:

  1. Import the required libraries and functions:
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
  2. Let’s load the dataset and divide it into train and test sets:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  3. Let’s get the inverse of the target values to be able to calculate the negative cases:
    neg_y_train = pd.Series(
        np.where(y_train == 1, 0, 1),
        index=y_train.index
    )
  4. Let’s determine the number of observations where the target variable takes a value of 1 or 0:
    total_pos = y_train.sum()
    total_neg = neg_y_train.sum()
  5. Now, let’s calculate the numerator and denominator of the WoE’s formula, which we discussed earlier in this recipe:
    pos = y_train.groupby(
        X_train["A1"]).sum() / total_pos
    neg = neg_y_train.groupby(
        X_train["A1"]).sum() / total_neg
  6. Now, let’s calculate the WoE per category:
    woe = np.log(pos/neg)

We can display the series with the category to WoE pairs by executing print(woe):

A1
Missing    0.203599
a          0.092373
b         -0.042410
dtype: float64
  7. Finally, let’s replace the categories of A1 with the WoE:
    X_train["A1"] = X_train["A1"].map(woe)
    X_test["A1"] = X_test["A1"].map(woe)

You can inspect the encoded variable by executing X_train["A1"].head().

Now, let’s perform WoE encoding using Feature-engine. First, we need to separate the data into train and test sets, as we did in step 2.

  8. Let’s import the encoder:
    from feature_engine.encoding import WoEEncoder
  9. Next, let’s set up the encoder so that we can encode three categorical variables:
    woe_enc = WoEEncoder(variables = ["A1", "A9", "A12"])

Tip

Feature-engine’s WoEEncoder() will return an error if p(0)=0 for any category because the division by 0 is not defined. To avoid this error, we can group infrequent categories, as we will discuss in the next recipe, Grouping rare or infrequent categories.

  10. Let’s fit the transformer to the train set so that it learns and stores the WoE of the different categories:
    woe_enc.fit(X_train, y_train)

Tip

We can display the dictionaries with the categories to WoE pairs by executing woe_enc.encoder_dict_.

  11. Finally, let’s encode the three categorical variables in the train and test sets:
    X_train_enc = woe_enc.transform(X_train)
    X_test_enc = woe_enc.transform(X_test)

Feature-engine returns pandas DataFrames containing the encoded categorical variables ready to use in machine learning models.

How it works...

First, with pandas sum(), we determined the total number of positive and negative cases. Next, using pandas groupby(), we determined the fraction of positive and negative cases per category. And with that, we calculated the WoE per category.

Finally, we automated the procedure with Feature-engine. We used WoEEncoder(), which learned the WoE per category with the fit() method, and then used transform(), which replaced the categories with the corresponding numbers.
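Because the WoE involves a ratio and a logarithm, it is undefined for categories that have no positive or no negative observations in the train set. The following sketch shows one way to flag such categories before encoding; the helper name, undefined_woe_categories, and the toy data are hypothetical and not part of any library:

```python
from collections import Counter

def undefined_woe_categories(categories, target):
    """Return the categories with no positive or no negative observations."""
    pos = Counter(c for c, y in zip(categories, target) if y == 1)
    neg = Counter(c for c, y in zip(categories, target) if y == 0)
    return sorted(c for c in set(categories) if pos[c] == 0 or neg[c] == 0)

# Hypothetical toy data: category "c" only has positive cases.
categories = ["a", "a", "b", "b", "c"]
target = [1, 0, 1, 0, 1]

problematic = undefined_woe_categories(categories, target)
print(problematic)  # ['c']
```

Categories flagged in this way are typically infrequent, which is why grouping rare categories before applying WoE encoding tends to resolve the problem.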

See also

For an implementation of WoE with Category Encoders, visit this book’s GitHub repository.

 

Grouping rare or infrequent categories

Rare categories are those present only in a small fraction of the observations. There is no rule of thumb to determine how small a small fraction is, but typically, any value below 5% can be considered rare.

Infrequent labels often appear only on the train set or only on the test set, thus making the algorithms prone to overfitting or being unable to score an observation. In addition, when encoding categories to numbers, we only create mappings for those categories observed in the train set, so we won’t know how to encode new labels. To avoid these complications, we can group infrequent categories into a single category called Rare or Other.
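The idea can be sketched on toy data with pandas. The series below and the 5% threshold are hypothetical choices for illustration; note how a label that never appears in the train set is also mapped to Rare in the test set:

```python
import pandas as pd

# Hypothetical toy data: "z" is infrequent in train, "w" only appears in test.
train = pd.Series(["x"] * 20 + ["y"] * 9 + ["z"] * 1)
test = pd.Series(["x", "z", "w"])

# Learn the frequent categories from the train set only.
freqs = train.value_counts(normalize=True)
frequent = freqs[freqs > 0.05].index

# Keep frequent categories and replace everything else with "Rare".
train_grouped = train.where(train.isin(frequent), "Rare")
test_grouped = test.where(test.isin(frequent), "Rare")

print(test_grouped.tolist())  # ['x', 'Rare', 'Rare']
```

Because the list of frequent categories is learned from the train set, any later encoder only needs a mapping for the frequent categories plus Rare.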

In this recipe, we will group infrequent categories using pandas and Feature-engine.

How to do it...

First, let’s import the necessary Python libraries and get the dataset ready:

  1. Import the necessary Python libraries, functions, and classes:
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from feature_engine.encoding import RareLabelEncoder
  2. Let’s load the dataset and divide it into train and test sets:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  3. Let’s capture the fraction of observations per category in A7 in a variable:
    freqs = X_train["A7"].value_counts(normalize=True)

We can see the percentage of observations per category of A7, expressed as decimals, in the following output after executing print(freqs):

v	0.573499
h	0.209110
ff	0.084886
bb	0.080745
z	0.014493
dd	0.010352
j	0.010352
Missing	0.008282
n	0.006211
o	0.002070
Name: A7, dtype: float64

If we consider those labels present in less than 5% of the observations as rare, then z, dd, j, Missing, n, and o are rare categories.

  4. Let’s create a list containing the names of the categories present in more than 5% of the observations:
    frequent_cat = [
        x for x in freqs.loc[freqs > 0.05].index.values]

If we execute print(frequent_cat), we will see the frequent categories of A7:

['v', 'h', 'ff', 'bb'].
  5. Let’s replace rare labels – that is, those present in <= 5% of the observations – with the "Rare" string:
    X_train["A7"] = np.where(
        X_train["A7"].isin(frequent_cat),
        X_train["A7"], "Rare"
    )
    X_test["A7"] = np.where(
        X_test["A7"].isin(frequent_cat),
        X_test["A7"], "Rare"
    )
  6. Let’s determine the percentage of observations in the encoded variable:
    X_train["A7"].value_counts(normalize=True)

We can see that the infrequent labels have now been re-grouped into the Rare category:

v       0.573499
h       0.209110
ff      0.084886
bb      0.080745
Rare    0.051760
Name: A7, dtype: float64
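
The pandas steps above can be wrapped into two small helper functions so that the frequent categories are learned once from the train set and then reused on any other set. This is a minimal sketch; the function names and the toy data are hypothetical, and it uses the pandas where() method instead of NumPy's where() so that the result remains a pandas Series:

```python
import pandas as pd

def find_frequent_labels(series, tol=0.05):
    """Return the categories present in more than `tol` of the observations."""
    freqs = series.value_counts(normalize=True)
    return freqs[freqs > tol].index.tolist()

def group_rare_labels(series, frequent, rare_label="Rare"):
    """Keep frequent categories and replace every other value with `rare_label`."""
    return series.where(series.isin(frequent), rare_label)

# Hypothetical toy data; "new" appears only in the test set.
train = pd.Series(["v"] * 60 + ["h"] * 30 + ["z"] * 4 + ["j"] * 3 + ["o"] * 3)
test = pd.Series(["v", "h", "z", "new"])

frequent = find_frequent_labels(train)        # learned from train only
train_grouped = group_rare_labels(train, frequent)
test_grouped = group_rare_labels(test, frequent)
```

Because the list of frequent categories comes from the train set, a label that appears only in the test set is also replaced with "Rare".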

Now, let’s group rare labels using Feature-engine. First, we must divide the dataset into train and test sets again, as we did in step 2, because the previous steps overwrote A7 in both sets.

  7. Let’s create a rare label encoder that groups categories present in less than 5% of the observations, provided that the categorical variable has more than four distinct values:
    rare_encoder = RareLabelEncoder(tol=0.05, n_categories=4)
  8. Let’s fit the encoder so that it finds the categorical variables and then learns their most frequent categories:
    rare_encoder.fit(X_train)

Tip

Upon fitting, the transformer will raise warnings, indicating that many categorical variables have fewer than four categories, so their values will not be grouped. The transformer just lets you know that this is happening.

We can display the frequent categories per variable by executing rare_encoder.encoder_dict_, as well as the variables that will be encoded by executing rare_encoder.variables_.

  9. Finally, let’s group rare labels in the train and test sets:
    X_train_enc = rare_encoder.transform(X_train)
    X_test_enc = rare_encoder.transform(X_test)

Now that we have grouped rare labels, we are ready to encode the categorical variables, as we’ve done in other recipes in this chapter.

How it works...

In this recipe, we grouped infrequent categories using pandas and Feature-engine.

We determined the fraction of observations per category of the A7 variable using pandas value_counts() by setting the normalize parameter to True. Using a list comprehension, we captured the names of the categories present in more than 5% of the observations. Finally, using NumPy’s where(), we searched each row of A7, and if the observation was one of the frequent categories in the list, which we checked using the pandas isin() method, its value was kept; otherwise, its original value was replaced with "Rare".

We automated the preceding steps for multiple categorical variables using Feature-engine. For this, we used Feature-engine’s RareLabelEncoder(). By setting tol to 0.05, we retained categories present in more than 5% of the observations. By setting n_categories to 4, we only group rare categories in variables with more than four unique values. With the fit() method, the transformer identified the categorical variables and then learned and stored their frequent categories. With the transform() method, the transformer replaced infrequent categories with the "Rare" string.
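
To make the roles of the two thresholds concrete, the following pandas sketch mimics the fit and transform logic described above for several columns at once. It is a simplified illustration, not Feature-engine's implementation: the function names and data are hypothetical, and the handling of categories whose frequency is exactly equal to the threshold is a choice made here for illustration:

```python
import pandas as pd

def learn_frequent(df, tol=0.05, n_categories=4):
    """Learn the frequent categories of each column (the 'fit' step)."""
    frequent = {}
    for col in df.columns:
        freqs = df[col].value_counts(normalize=True)
        if df[col].nunique() > n_categories:
            # Enough distinct values: keep only the frequent categories.
            frequent[col] = freqs[freqs >= tol].index.tolist()
        else:
            # Too few distinct values: treat every category as frequent.
            frequent[col] = freqs.index.tolist()
    return frequent

def apply_grouping(df, frequent, rare_label="Rare"):
    """Replace infrequent categories with `rare_label` (the 'transform' step)."""
    out = df.copy()
    for col, labels in frequent.items():
        out[col] = out[col].where(out[col].isin(labels), rare_label)
    return out

# Hypothetical toy data: "many" has five categories, "few" has two.
train = pd.DataFrame({
    "many": ["a"] * 50 + ["b"] * 40 + ["c"] * 4 + ["d"] * 3 + ["e"] * 3,
    "few": ["x"] * 97 + ["y"] * 3,
})

frequent = learn_frequent(train, tol=0.05, n_categories=4)
train_enc = apply_grouping(train, frequent)
```

Here, the rare categories of the many column are grouped, whereas the few column is left untouched even though y is present in only 3% of the rows, because the column has no more than four distinct values.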

 

Performing binary encoding

Binary encoding is a categorical encoding technique that uses binary code – that is, a sequence of zeroes and ones – to represent the different categories of the variable. How does it work? First, the categories are arbitrarily replaced with ordinal numbers, as shown in the intermediate step of the following table. Then, those numbers are converted into binary code. For example, integer 1 can be represented as sequence 10, integer 2 as 01, integer 3 as 11, and integer 0 as 00 (here, the least significant bit is written first). The digits in the two positions of the binary string become the columns, which are the encoded representations of the original variable:

Figure 2.9 – Table showing the steps required for binary encoding of the color variable

Binary encoding encodes the data in fewer dimensions than one-hot encoding. In our example, the Color variable would be encoded into k-1 binary variables by one-hot encoding – that is, three variables – but with binary encoding, we can represent the variable with only two features. More generally, the number of binary features needed to encode a variable is log2(number of distinct categories), rounded up to the nearest integer; in our example, log2(4) = 2 binary features.
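
The two steps, categories to integers and integers to binary digits, can be sketched by hand with pandas. The color values below are hypothetical, and the order of the digits (most significant bit first) is an arbitrary choice made for this illustration; any fixed order carries the same information:

```python
import math
import pandas as pd

# Hypothetical color variable with four distinct categories.
colors = pd.Series(["blue", "red", "yellow", "green", "red"], name="color")

# Step 1: replace the categories with integers 0..k-1, in order of appearance.
codes, uniques = pd.factorize(colors)

# Step 2: number of binary digits needed, log2(k) rounded up.
n_bits = max(1, math.ceil(math.log2(len(uniques))))

# Step 3: write each integer in binary, one digit per column.
bits = pd.DataFrame(
    [[(code >> i) & 1 for i in reversed(range(n_bits))] for code in codes],
    columns=[f"color_{i}" for i in range(n_bits)],
)
print(pd.concat([colors, bits], axis=1))
```

With four categories, two columns are enough, and every category receives a distinct combination of zeroes and ones.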

Binary encoding is an alternative to one-hot encoding that does not lose information about the variable, yet returns fewer features. This is particularly useful for high-cardinality variables. For example, if a variable contains 128 unique categories, with one-hot encoding, we would need 127 features to encode the variable, whereas with binary encoding, we would only need 7 (log2(128)=7). Thus, this encoding prevents the feature space from exploding. Binary-encoded features are also suitable for linear models. On the downside, the derived binary features lack human interpretability, so if we need to interpret the decisions made by our models, this encoding method may not be a suitable option.
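
The arithmetic behind this comparison can be checked with a few lines of Python. The helper names are hypothetical, and the counts are the theoretical minimum; a library implementation may add columns, for example, to reserve a code for unseen categories:

```python
import math

def one_hot_features(k):
    """Number of features from one-hot encoding into k-1 variables."""
    return k - 1

def binary_features(k):
    """Number of features from binary encoding: log2(k), rounded up."""
    return max(1, math.ceil(math.log2(k)))

# Compare both encodings for variables with 4, 10, and 128 categories.
for k in (4, 10, 128):
    print(k, one_hot_features(k), binary_features(k))
```

The gap between the two encodings widens quickly as the number of categories grows.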

In this recipe, we will learn how to perform binary encoding using Category Encoders.

How to do it...

First, let’s import the necessary Python libraries and get the dataset ready:

  1. Import the required Python library, function, and class:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from category_encoders.binary import BinaryEncoder
  2. Let’s load the dataset and divide it into train and test sets:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  3. Let’s inspect the unique categories in A7:
    X_train["A7"].unique()

In the following output, we can see that A7 has 10 different categories:

array(['v', 'ff', 'h', 'dd', 'z', 'bb', 'j', 'Missing', 'n', 'o'], dtype=object)
  4. Let’s create a binary encoder to encode A7:
    encoder = BinaryEncoder(cols=["A7"], drop_invariant=True)

Tip

BinaryEncoder(), as well as other encoders from the Category Encoders package, allows us to select the variables to encode. We simply pass the column names in a list to the cols argument.

  5. Let’s fit the transformer to the train set so that it calculates how many binary variables it needs and creates the category-to-binary code representations:
    encoder.fit(X_train)
  6. Finally, let’s encode A7 in the train and test sets:
    X_train_enc = encoder.transform(X_train)
    X_test_enc = encoder.transform(X_test)

We can display the top rows of the transformed train set by executing print(X_train_enc.head()), which returns the following output:

Figure 2.10 – DataFrame with the variables after binary encoding

Binary encoding returned four binary variables for A7, which are A7_0, A7_1, A7_2, and A7_3, instead of the nine that would have been returned by one-hot encoding.

How it works...

In this recipe, we performed binary encoding using the Category Encoders package. First, we loaded the dataset and divided it into train and test sets using train_test_split() from scikit-learn. Next, we used BinaryEncoder() to encode the A7 variable. With the fit() method, BinaryEncoder() created a mapping from each category to a set of binary columns, and with the transform() method, the encoder encoded the A7 variable in both the train and test sets.
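
The idea of learning a mapping with fit() and applying it with transform() can be sketched in plain Python. This is an illustration only and not the Category Encoders implementation: the function names are hypothetical, and numbering the categories from 1 so that the all-zeroes code is left for categories not seen during fitting is an assumption made for this sketch:

```python
import math

def fit_binary_codes(train_values):
    """Learn a category-to-binary-string mapping from the train values."""
    # Unique categories, in order of appearance.
    categories = list(dict.fromkeys(train_values))
    # Codes start at 1; code 0 is kept for unseen categories.
    n_bits = max(1, math.ceil(math.log2(len(categories) + 1)))
    codes = {
        cat: format(i, f"0{n_bits}b")
        for i, cat in enumerate(categories, start=1)
    }
    return codes, n_bits

def transform_binary(values, codes, n_bits):
    """Replace each value with its list of binary digits."""
    unseen = "0" * n_bits
    return [[int(bit) for bit in codes.get(v, unseen)] for v in values]

# The ten categories of A7 shown in step 3.
a7_categories = ["v", "ff", "h", "dd", "z", "bb", "j", "Missing", "n", "o"]

codes, n_bits = fit_binary_codes(a7_categories)
encoded = transform_binary(["v", "o", "unseen"], codes, n_bits)
```

With ten categories plus the reserved code, four digits are needed, so each value is represented by four binary columns.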

Tip

With one-hot encoding, we would have created nine binary variables (k-1 = 10 unique categories - 1 = 9) to encode all of the information in A7. With binary encoding, we can represent the variable in fewer dimensions: log2(10) ≈ 3.3, which we round up, so we only need four binary variables.

See also

For more information about BinaryEncoder(), visit https://contrib.scikit-learn.org/category_encoders/binary.html.

For a nice example of the output of binary encoding, check out the following resource: https://stats.stackexchange.com/questions/325263/binary-encoding-vs-one-hot-encoding.

For a comparative study of categorical encoding techniques for neural network classifiers, visit https://www.researchgate.net/publication/320465713_A_Comparative_Study_of_Categorical_Variable_Encoding_Techniques_for_Neural_Network_Classifiers.

About the Author
  • Soledad Galli

    Soledad Galli is a lead data scientist with more than 10 years of experience in world-class academic institutions and renowned businesses. She has researched, developed, and put into production machine learning models for insurance claims, credit risk assessment, and fraud prevention. Soledad received a Data Science Leaders' award in 2018 and was named one of LinkedIn's voices in data science and analytics in 2019. She is passionate about enabling people to step into and excel in data science, which is why she mentors data scientists and speaks at data science meetings regularly. She also teaches online courses on machine learning in a prestigious Massive Open Online Course platform, which have reached more than 10,000 students worldwide.
