# Encoding Categorical Variables

Categorical variables are those whose values are selected from a group of categories or labels. For example, the **Gender** variable with the values of **Male** and **Female** is categorical, and so is the **marital status** variable with the values of **never married**, **married**, **divorced**, and **widowed**. In some categorical variables, the labels have an intrinsic order; for example, in the **Student’s grade** variable, the values of **A**, **B**, **C**, and **Fail** are ordered, with **A** being the highest grade and **Fail** being the lowest. These are called ordinal categorical variables. Variables in which the categories do not have an intrinsic order are called nominal categorical variables, such as the **City** variable, with the values of **London**, **Manchester**, **Bristol**, and so on.

The values of categorical variables are often encoded as strings. To train mathematical or machine learning models, we need to transform those strings into numbers. The act of replacing strings with numbers is called **categorical encoding**. In this chapter, we will discuss multiple categorical encoding methods.

This chapter will cover the following recipes:

- Creating binary variables through one-hot encoding
- Performing one-hot encoding of frequent categories
- Replacing categories with counts or the frequency of observations
- Replacing categories with ordinal numbers
- Performing ordinal encoding based on the target value
- Implementing target mean encoding
- Encoding with the Weight of Evidence
- Grouping rare or infrequent categories
- Performing binary encoding

# Technical requirements

In this chapter, we will use the pandas, NumPy, and Matplotlib Python libraries, as well as scikit-learn and Feature-engine. For guidelines on how to obtain these libraries, visit the *Technical requirements* section of *Chapter 1*, *Imputing Missing Data*.

We will also use the open-source `Category Encoders` Python library, which can be installed using `pip`:

    pip install category_encoders

To learn more about `Category Encoders`, visit the following link: https://contrib.scikit-learn.org/category_encoders/.

We will also use the Credit Approval dataset, which is available in the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/credit+approval.

To prepare the dataset, follow these steps:

- Visit http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/ and click on `crx.data` to download the data:

Figure 2.1 – The index directory for the Credit Approval dataset

- Save `crx.data` to the folder where you will run the following commands.

After downloading the data, open up a Jupyter Notebook and run the following commands.

- Import the required libraries:

      import random
      import numpy as np
      import pandas as pd

- Load the data:

      data = pd.read_csv("crx.data", header=None)

- Create a list containing the variable names:

      varnames = [f"A{s}" for s in range(1, 17)]

- Add the variable names to the DataFrame:

      data.columns = varnames

- Replace the question marks in the dataset with NumPy NaN values:

      data = data.replace("?", np.nan)

- Cast some numerical variables as `float` data types:

      data["A2"] = data["A2"].astype("float")
      data["A14"] = data["A14"].astype("float")

- Encode the target variable as binary:

      data["A16"] = data["A16"].map({"+": 1, "-": 0})

- Rename the target variable:

      data.rename(columns={"A16": "target"}, inplace=True)

- Make lists that contain categorical and numerical variables:

      cat_cols = [c for c in data.columns if data[c].dtypes == "O"]
      num_cols = [c for c in data.columns if data[c].dtypes != "O"]

- Fill in the missing data:

      data[num_cols] = data[num_cols].fillna(0)
      data[cat_cols] = data[cat_cols].fillna("Missing")

- Save the prepared data:

      data.to_csv("credit_approval_uci.csv", index=False)

You can find a Jupyter Notebook that contains these commands in this book’s GitHub repository at https://github.com/PacktPublishing/Python-Feature-Engineering-Cookbook-Second-Edition/blob/main/ch02-categorical-encoding/donwload-prepare-store-credit-approval-dataset.ipynb.

Note

Some libraries require that you have **already imputed missing data**, for which you can use any of the recipes from *Chapter 1*, *Imputing Missing Data*.

# Creating binary variables through one-hot encoding

In one-hot encoding, we represent a categorical variable as a group of binary variables, where each binary variable represents one category. The binary variable takes a value of 1 if the category is present in an observation, or 0 otherwise.

The following table shows the one-hot encoded representation of the **Gender** variable with the categories of **Male** and **Female**:

Figure 2.2 – One-hot encoded representation of the Gender variable

As shown in *Figure 2.2*, from the **Gender** variable, we can derive the binary variable of **Female**, which shows the value of **1** for females, or the binary variable of **Male**, which takes the value of **1** for the males in the dataset.

For the categorical variable of **Color** with the values of **red**, **blue**, and **green**, we can create three variables called red, blue, and green. These variables will take the value of **1** if the observation is red, blue, or green, respectively, or 0 otherwise.

A categorical variable with *k* unique categories can be encoded using *k-1* binary variables. For **Gender**, *k* is 2 as it contains two labels (male and female), so we only need to create one binary variable (*k - 1 = 1*) to capture all of the information. For the **Color** variable, which has three categories (*k=3*; red, blue, and green), we need to create two (*k - 1 = 2*) binary variables to capture all the information so that the following occurs:

- If the observation is red, it will be captured by the **red** variable (red = 1, blue = 0).
- If the observation is blue, it will be captured by the **blue** variable (red = 0, blue = 1).
- If the observation is green, it will be captured by the combination of **red** and **blue** (red = 0, blue = 0).
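The *k-1* scheme for **Color** can be checked on a toy example. The following is a minimal sketch with made-up data (not taken from the Credit Approval dataset): it builds all *k* dummies with pandas and then keeps only the red and blue columns, so that green is represented by zeros in both.

```python
import pandas as pd

# Toy data, invented for illustration only
colors = pd.Series(["red", "blue", "green", "blue"], name="Color")

# k binary variables: one column per category
k_dummies = pd.get_dummies(colors, dtype=int)

# k-1 binary variables: keep red and blue, drop green
k_minus_1 = k_dummies[["red", "blue"]]

print(k_minus_1)
```

The green observation (third row) shows 0 in both columns, so no information is lost by dropping the green dummy.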

Encoding into *k-1* binary variables is well-suited for linear models. There are a few occasions in which we may prefer to encode the categorical variables with *k* binary variables:

- When training decision trees since they do not evaluate the entire feature space at the same time
- When selecting features recursively
- When determining the importance of each category within a variable

In this recipe, we will compare the one-hot encoding implementations of pandas, scikit-learn, Feature-engine, and Category Encoders.

## How to do it...

First, let’s make a few imports and get the data ready:

- Import `pandas` and the `train_test_split` function from scikit-learn:

      import pandas as pd
      from sklearn.model_selection import train_test_split

- Let’s load the Credit Approval dataset:

      data = pd.read_csv("credit_approval_uci.csv")

- Let’s separate the data into train and test sets:

      X_train, X_test, y_train, y_test = train_test_split(
          data.drop(labels=["target"], axis=1),
          data["target"],
          test_size=0.3,
          random_state=0,
      )

- Let’s inspect the unique categories of the `A4` variable:

      X_train["A4"].unique()

We can see the unique values of `A4` in the following output:

    array(['u', 'y', 'Missing', 'l'], dtype=object)

- Let’s encode `A4` into *k-1* binary variables using pandas and then inspect the first five rows of the resulting DataFrame:

      dummies = pd.get_dummies(X_train["A4"], drop_first=True)
      dummies.head()

Note

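With pandas `get_dummies()`, we can either ignore or encode missing data through the `dummy_na` parameter. By setting `dummy_na=True`, missing data will be encoded in a new binary variable. To encode the variable into *k* dummies, use `drop_first=False` instead. Note that from pandas 2.0 onward, `get_dummies()` returns Boolean columns by default; pass `dtype=int` to obtain the 0 and 1 values shown in this recipe.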

Here, we can see the output of *step 5*, where each label is now a binary variable:

         l  u  y
    596  0  1  0
    303  0  1  0
    204  0  0  1
    351  0  0  1
    118  0  1  0

- Now, let’s encode all of the categorical variables into *k-1* binaries, capturing the result in a new DataFrame:

      X_train_enc = pd.get_dummies(X_train, drop_first=True)
      X_test_enc = pd.get_dummies(X_test, drop_first=True)

Note

The `get_dummies()` function from pandas will automatically encode all variables of the object or categorical type. We can encode a subset of the variables by passing the variable names in a list to the `columns` parameter.

Note

When encoding more than one variable, `get_dummies()` captures the variable name (say, `A1`) and places an underscore followed by the category name to identify the resulting binary variables.

We can see the binary variables in the following output:

Figure 2.3 – Transformed DataFrame showing the dummy variables on the right

Note

The `get_dummies()` function will create one binary variable per **seen** category. Hence, if there are more categories in the train set than in the test set, `get_dummies()` will return more columns in the transformed train set than in the transformed test set, and vice versa. To avoid this, it is better to carry out one-hot encoding with scikit-learn or Feature-engine, as we will discuss later in this recipe.

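- Let’s concatenate the binary variables to the original dataset. This step and the next one are only needed when the binary variables are stored separately from the rest of the data, as is the case with the scikit-learn output later in this recipe; `pd.get_dummies()` applied to a whole DataFrame already keeps the numerical variables and drops the encoded ones:

      X_test_enc = pd.concat([X_test, X_test_enc], axis=1)

- Now, let’s drop the categorical variables from the data:

      X_test_enc.drop(
          labels=X_test_enc.select_dtypes(include="O").columns,
          axis=1,
          inplace=True,
      )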

And that’s it! Now, we can use our categorical variables to train mathematical models. To inspect the result, use `X_test_enc.head()`.
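Because `get_dummies()` only creates columns for the categories it sees, the train and test outputs can end up with different columns. One possible workaround, sketched here on made-up data, is to reindex the encoded test set to the columns of the encoded train set. The city names and the choice of filling absent columns with 0 are assumptions made for illustration.

```python
import pandas as pd

# Toy train and test sets with partially different categories
train = pd.DataFrame({"city": ["London", "Bristol", "Leeds"]})
test = pd.DataFrame({"city": ["London", "York"]})

train_enc = pd.get_dummies(train, dtype=int)
test_enc = pd.get_dummies(test, dtype=int)

# Keep only the columns learned from the train set; categories that are
# absent from the test set get a column of zeros, and categories that were
# not in the train set (here, York) are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_aligned)
```

After alignment, both DataFrames expose the same columns in the same order, which is what most models expect at prediction time.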

Now, let’s do one-hot encoding using scikit-learn.

- Import the encoder from scikit-learn:
from sklearn.preprocessing import OneHotEncoder

- Let’s set up the transformer. By setting `drop` to `"first"`, we encode into *k-1* binary variables, and by setting `sparse` to `False`, the transformer will return a NumPy array (instead of a sparse matrix). In scikit-learn 1.2 and later, this parameter is named `sparse_output`:

      encoder = OneHotEncoder(drop="first", sparse=False)

Tip

We can encode variables into *k* dummies by setting the `drop` parameter to `None`. We can also encode into *k-1* if variables contain two categories and into *k* if variables contain more than two categories by setting the `drop` parameter to `"if_binary"`. The latter is useful because encoding binary variables into *k* dummies is redundant.
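The effect of the `drop` parameter can be seen by counting the columns each setting produces. The following is a small sketch on an invented DataFrame with one binary and one three-category variable; the column names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: one binary variable and one variable with three categories
toy = pd.DataFrame({
    "binary": ["yes", "no", "yes", "no"],
    "multi": ["a", "b", "c", "a"],
})

enc_k = OneHotEncoder(drop=None).fit(toy)
enc_if_binary = OneHotEncoder(drop="if_binary").fit(toy)

# The encoders return a sparse matrix by default; toarray() makes it dense
n_k = enc_k.transform(toy).toarray().shape[1]
n_if_binary = enc_if_binary.transform(toy).toarray().shape[1]

print(n_k, n_if_binary)  # prints: 5 4
```

With `drop=None`, the binary variable contributes two columns; with `drop="if_binary"`, it contributes one, while the three-category variable keeps all three.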

- First, let’s create a list containing the variable names:

      vars_categorical = X_train.select_dtypes(include="O").columns.to_list()

- Let’s fit the encoder to a slice of the train set with the categorical variables:

      encoder.fit(X_train[vars_categorical])

- Let’s inspect the categories for which dummy variables will be created:

      encoder.categories_

We can see the result of the preceding command here:

Figure 2.4 – Arrays with the categories that will be encoded into binary variables, one array per variable

Note

Scikit-learn’s `OneHotEncoder()` will only encode the categories learned from the train set. If there are new categories in the test set, we can instruct the encoder to ignore them or to return an error by setting the `handle_unknown` parameter to `'ignore'` or `'error'`, respectively.
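To make the two `handle_unknown` settings concrete, here is a minimal sketch on invented data in which the test set contains a category (`"t"`) that the encoder never saw during `fit()`. The toy values are assumptions and do not come from the Credit Approval dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"A4": ["u", "y", "u", "l"]})
test = pd.DataFrame({"A4": ["y", "t"]})  # "t" is not present in train

# With "ignore", an unknown category is encoded as all zeros
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = lenient.transform(test).toarray()
print(encoded)

# With "error", transform() raises a ValueError for unknown categories
strict = OneHotEncoder(handle_unknown="error").fit(train)
try:
    strict.transform(test)
    raised = False
except ValueError:
    raised = True
print(raised)
```

Note that a row of zeros produced by `"ignore"` is indistinguishable from the dropped reference category when `drop="first"` is also used, so the two options are best combined with care.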

- Let’s create the NumPy arrays with the binary variables for the train and test sets:

      X_train_enc = encoder.transform(X_train[vars_categorical])
      X_test_enc = encoder.transform(X_test[vars_categorical])

- Let’s extract the names of the binary variables:

      encoder.get_feature_names_out()

We can see the binary variable names that were returned in the following output:

Figure 2.5 – Arrays with the names of the one-hot encoded variables

- Let’s convert the array into a pandas DataFrame and add the variable names:

      X_test_enc = pd.DataFrame(X_test_enc)
      X_test_enc.columns = encoder.get_feature_names_out()

- To concatenate the one-hot encoded data to the original dataset, we need to make their indexes match:

      X_test_enc.index = X_test.index

Now, we are ready to concatenate the one-hot encoded variables to the original data and then remove the categorical variables using *steps 7* and *8* from this recipe.

To follow up, let’s perform one-hot encoding with Feature-engine.

- Let’s import the encoder from Feature-engine:

      from feature_engine.encoding import OneHotEncoder

- Next, let’s set up the encoder so that it returns *k-1* binary variables:

      ohe_enc = OneHotEncoder(drop_last=True)

Tip

Feature-engine automatically finds the categorical variables. To encode only a subset of the variables, we can pass the variable names in a list: `OneHotEncoder(variables=["A1", "A4"])`. To encode numerical variables, we can set the `ignore_format` parameter to `True` or cast the variables as the object type. This is useful because sometimes, numerical variables are used to represent categories, such as postcodes.

- Let’s fit the encoder to the train set so that it learns the categories and variables to encode:

      ohe_enc.fit(X_train)

- Let’s explore the variables that will be encoded:

      ohe_enc.variables_

The transformer found and stored the variables of the object or categorical type, as shown in the following output:

    ['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']

Note

Feature-engine’s `OneHotEncoder` has the option to encode most variables into *k* dummies, while only returning *k-1* dummies for binary variables. For this behavior, set the `drop_last_binary` parameter to `True`.

The categories that will be encoded in each variable are stored in the encoder’s `encoder_dict_` attribute, which contains the following dictionary:

{'A1': ['a', 'b'], 'A4': ['u', 'y', 'Missing'], 'A5': ['g', 'p', 'Missing'], 'A6': ['c', 'q', 'w', 'ff', 'm', 'i', 'e', 'cc', 'x', 'd', 'k', 'j', 'Missing', 'aa'], 'A7': ['v', 'ff', 'h', 'dd', 'z', 'bb', 'j', 'Missing', 'n'], 'A9': ['t'], 'A10': ['t'], 'A12': ['t'], 'A13': ['g', 's']}

- Let’s encode the categorical variables in train and test sets:

      X_train_enc = ohe_enc.transform(X_train)
      X_test_enc = ohe_enc.transform(X_test)

Tip

Feature-engine’s `OneHotEncoder()` returns a copy of the original dataset plus the binary variables and without the original categorical variables. Thus, this data is ready to train machine learning models.

If we execute `X_train_enc.head()`, we will see the following DataFrame:

Figure 2.6 – Transformed DataFrame with the one-hot encoded variables on the right

Note how the **A4** categorical variable was replaced with **A4_u**, **A4_y**, and so on.

Note

We can get the names of all the variables in the transformed dataset by executing `ohe_enc.get_feature_names_out()`.

## How it works...

In this recipe, we performed one-hot encoding of categorical variables using pandas, scikit-learn, and Feature-engine.

With `get_dummies()` from pandas, we automatically created binary variables for each of the categories in the categorical variables.

The `OneHotEncoder` transformers from the scikit-learn and Feature-engine libraries share the `fit()` and `transform()` methods. With `fit()`, the encoders learned the categories for which the dummy variables should be created. With `transform()`, they returned the binary variables either in a NumPy array or added them to the original DataFrame.

Tip

One-hot encoding expands the feature space. From nine original categorical variables, we created 36 binary ones. If our datasets contain many categorical variables or highly cardinal variables, we will easily increase the feature space dramatically, which increases the computational cost of training machine learning models or obtaining their predictions and may also deteriorate their performance.

## There’s more...

We can also perform one-hot encoding using `OneHotEncoder` from the Category Encoders library.

`OneHotEncoder()` from Feature-engine and Category Encoders can automatically identify and encode categorical variables, that is, those of the object or categorical type. So does pandas `get_dummies()`. Scikit-learn’s `OneHotEncoder()`, on the other hand, will encode all variables in the dataset.

With pandas, Feature-engine, and Category Encoders, we can encode only a subset of the variables by indicating their names in a list. With scikit-learn, we need to use an additional class, `ColumnTransformer()`, to slice the data before the transformation.

With Feature-engine and Category Encoders, the dummy variables are added to the original dataset and the categorical variables are removed after the encoding. With scikit-learn and pandas, we need to manually perform these procedures.

Finally, using `OneHotEncoder()` from scikit-learn, Feature-engine, and Category Encoders, we can perform the encoding step within a scikit-learn pipeline, which is more convenient if we have various feature engineering steps or want to put the pipelines into production. pandas `get_dummies()`, on the other hand, is well suited for data analysis and visualization.
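As an illustration of the pipeline approach, the following sketch wraps scikit-learn’s `OneHotEncoder()` in a `ColumnTransformer()` and chains it with a classifier. The toy data, the column names, and the choice of logistic regression are assumptions made for the example; they are not part of this recipe.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration only
X = pd.DataFrame({
    "city": ["London", "Bristol", "London", "Leeds", "Bristol", "Leeds"],
    "age": [25, 40, 31, 58, 47, 22],
})
y = pd.Series([1, 0, 1, 0, 0, 1])

# One-hot encode "city" and pass "age" through unchanged
preprocess = ColumnTransformer(
    [("ohe", OneHotEncoder(handle_unknown="ignore"), ["city"])],
    remainder="passthrough",
)

pipe = Pipeline([
    ("encode", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

pipe.fit(X, y)
preds = pipe.predict(X)
print(preds)
```

Because the encoder is fitted inside the pipeline, the categories are learned from the training data only, and the same mapping is reused whenever `predict()` is called on new data.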

# Performing one-hot encoding of frequent categories

One-hot encoding represents each variable’s category with a binary variable. Hence, one-hot encoding of highly cardinal variables or datasets with multiple categorical features can expand the feature space dramatically. This, in turn, may increase the computational cost of using machine learning models or deteriorate their performance. To reduce the number of binary variables, we can perform one-hot encoding of the most frequent categories. One-hot encoding the top categories is equivalent to treating the remaining, less frequent categories as a single, unique category.

In this recipe, we will implement one-hot encoding of the most popular categories using pandas and Feature-engine.

## How to do it...

First, let’s import the necessary Python libraries and get the dataset ready:

- Import the required Python libraries, functions, and classes:

      import pandas as pd
      import numpy as np
      from sklearn.model_selection import train_test_split
      from feature_engine.encoding import OneHotEncoder

- Let’s load the dataset and divide it into train and test sets:

      data = pd.read_csv("credit_approval_uci.csv")
      X_train, X_test, y_train, y_test = train_test_split(
          data.drop(labels=["target"], axis=1),
          data["target"],
          test_size=0.3,
          random_state=0,
      )

Tip

The most frequent categories need to be determined in the train set. This is to avoid data leakage.

- Let’s inspect the unique categories of the `A6` variable:

      X_train["A6"].unique()

The unique values of `A6` are displayed in the following output:

    array(['c', 'q', 'w', 'ff', 'm', 'i', 'e', 'cc', 'x', 'd', 'k', 'j', 'Missing', 'aa', 'r'], dtype=object)

- Let’s count the number of observations per category of `A6`, sort them in decreasing order, and then display the five most frequent categories:

      X_train["A6"].value_counts().sort_values(ascending=False).head(5)

We can see the five most frequent categories and the number of observations per category in the following output:

    c     93
    q     56
    w     48
    i     41
    ff    38
    Name: A6, dtype: int64

- Now, let’s capture the most frequent categories of `A6` in a list by using the code in *step 4* inside a list comprehension:

      top_5 = [
          x for x in X_train["A6"].value_counts().sort_values(
              ascending=False).head(5).index
      ]

- Now, let’s add a binary variable per top category to the train and test sets:

      for label in top_5:
          X_train[f"A6_{label}"] = np.where(X_train["A6"] == label, 1, 0)
          X_test[f"A6_{label}"] = np.where(X_test["A6"] == label, 1, 0)

- Let’s display the top `10` rows of the original and encoded variable, `A6`, in the train set:

      X_train[["A6"] + [f"A6_{label}" for label in top_5]].head(10)

In the output of *step 7*, we can see the `A6` variable, followed by the binary variables:

         A6  A6_c  A6_q  A6_w  A6_i  A6_ff
    596   c     1     0     0     0      0
    303   q     0     1     0     0      0
    204   w     0     0     1     0      0
    351  ff     0     0     0     0      1
    118   m     0     0     0     0      0
    247   q     0     1     0     0      0
    652   i     0     0     0     1      0
    513   e     0     0     0     0      0
    230  cc     0     0     0     0      0
    250   e     0     0     0     0      0

We can automate one-hot encoding of frequent categories with Feature-engine. First, let’s load and divide the dataset, as we did in *step 2*.

- Let’s set up the one-hot encoder to encode the five most frequent categories of the `A6` and `A7` variables:

      ohe_enc = OneHotEncoder(
          top_categories=5,
          variables=["A6", "A7"],
      )

Tip

Feature-engine’s `OneHotEncoder()` will encode all categorical variables in the dataset by default unless we specify the variables to encode, as we did in *step 8*.

- Let’s fit the encoder to the train set so that it learns and stores the most frequent categories of `A6` and `A7`:

      ohe_enc.fit(X_train)

Note

The number of frequent categories to encode is arbitrarily determined by the user.

- Finally, let’s encode `A6` and `A7` in the train and test sets:

      X_train_enc = ohe_enc.transform(X_train)
      X_test_enc = ohe_enc.transform(X_test)

You can view the new binary variables in the DataFrame by executing `X_train_enc.head()`. You can also find the top five categories learned by the encoder by executing `ohe_enc.encoder_dict_`.

Note

Feature-engine replaces the original variable with the binary ones returned by one-hot encoding, leaving the dataset ready to use in machine learning.

## How it works...

In this recipe, we performed one-hot encoding of the five most popular categories using pandas, NumPy, and Feature-engine.

In the first part of this recipe, we worked with the `A6` categorical variable. We inspected its unique categories with pandas `unique()`. Next, we counted the number of observations per category using pandas `value_counts()`, which returned a pandas Series with the categories as the index and the number of observations as values. We then sorted the categories from the one with the most to the one with the fewest observations using pandas `sort_values()`, and reduced the Series to the five most popular categories by using pandas `head()`. We used this Series in a list comprehension to capture the names of the most frequent categories. After that, we looped over each category, and with NumPy’s `where()` function, we created binary variables by placing a value of `1` if the observation showed the category, or `0` otherwise.
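The manual procedure described above can be wrapped into a small reusable function. The following is a sketch only: `encode_top_categories` is a hypothetical helper name, the toy data is invented, and the function uses a Boolean comparison cast to integers instead of NumPy’s `where()`, which gives the same 0 and 1 values.

```python
import pandas as pd


def encode_top_categories(train, test, variable, n=5):
    """Add one binary column per top-n category of `variable` (hypothetical helper)."""
    # value_counts() sorts from most to least frequent by default;
    # the top categories are learned from the train set only
    top = train[variable].value_counts().head(n).index.tolist()
    train_out, test_out = train.copy(), test.copy()
    for label in top:
        train_out[f"{variable}_{label}"] = (train_out[variable] == label).astype(int)
        test_out[f"{variable}_{label}"] = (test_out[variable] == label).astype(int)
    return train_out, test_out, top


# Toy data, invented for illustration only
train = pd.DataFrame({"A6": ["c", "c", "c", "q", "q", "w"]})
test = pd.DataFrame({"A6": ["q", "x"]})

train_enc, test_enc, top = encode_top_categories(train, test, "A6", n=2)
print(top)
print(test_enc)
```

Categories outside the top *n*, including those that appear only in the test set, receive 0 in every binary column, which is how the remaining categories end up treated as a single group.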

To perform one-hot encoding of the five most popular categories of the `A6` and `A7` variables with Feature-engine, we used `OneHotEncoder()`, indicating `5` in the `top_categories` argument, and passing the variable names in a list to the `variables` argument. With `fit()`, the encoder learned the top categories from the train set and stored them in its `encoder_dict_` attribute. Then, with `transform()`, `OneHotEncoder()` replaced the original variables with the set of binary ones.

## There’s more...

This recipe is based on the winning solution of the KDD 2009 cup, *Winning the KDD Cup Orange Challenge with Ensemble Selection* (http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf), where the authors limited one-hot encoding to the 10 most frequent categories of each variable.

# Replacing categories with counts or the frequency of observations

In count or frequency encoding, we replace the categories with the count or the fraction of observations showing that category. That is, if 10 out of 100 observations show the category **blue** for the **Color** variable, we would replace **blue** with 10 when doing count encoding, or with 0.1 if performing frequency encoding. These encoding methods, which capture the representation of each label in a dataset, are very popular in data science competitions. The assumption is that the number of observations per category is somewhat predictive of the target.

Tip

Note that if two different categories are present in the same number of observations, they will be replaced by the same value, which leads to information loss.
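The arithmetic in the **Color** example, and the information loss mentioned in the tip, can be reproduced with a few lines of pandas. This is a minimal sketch on made-up data; the category proportions are chosen only to match the 10-out-of-100 example.

```python
import pandas as pd

# 100 toy observations: 10 blue, 60 red, 30 green
colors = pd.Series(["blue"] * 10 + ["red"] * 60 + ["green"] * 30, name="Color")

counts = colors.value_counts().to_dict()
freqs = colors.value_counts(normalize=True).to_dict()

count_encoded = colors.map(counts)  # blue -> 10
freq_encoded = colors.map(freqs)    # blue -> 0.1

# Two categories with the same number of observations get the same value
ties = pd.Series(["a", "a", "b", "b"])
ties_encoded = ties.map(ties.value_counts().to_dict())

print(counts)
print(ties_encoded.tolist())
```

In the second Series, both `a` and `b` are replaced by 2, so the encoded variable can no longer tell them apart.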

In this recipe, we will perform count and frequency encoding using pandas, Feature-engine, and Category Encoders.

## How to do it...

Let’s begin by making some imports and preparing the data:

- Import `pandas` and the required function:

      import pandas as pd
      from sklearn.model_selection import train_test_split

- Let’s load the dataset and divide it into train and test sets:

      data = pd.read_csv("credit_approval_uci.csv")
      X_train, X_test, y_train, y_test = train_test_split(
          data.drop(labels=["target"], axis=1),
          data["target"],
          test_size=0.3,
          random_state=0,
      )

- Let’s count the number of observations per category of the `A7` variable and capture it in a dictionary:

      counts = X_train["A7"].value_counts().to_dict()

Tip

To encode categories with their frequency, execute `X_train["A7"].value_counts(normalize=True).to_dict()`.

If we execute `print(counts)`, we can observe the count of observations per category:

    {'v': 277, 'h': 101, 'ff': 41, 'bb': 39, 'z': 7, 'dd': 5, 'j': 5, 'Missing': 4, 'n': 3, 'o': 1}

- Let’s replace the categories in `A7` with the counts:

      X_train["A7"] = X_train["A7"].map(counts)
      X_test["A7"] = X_test["A7"].map(counts)

Go ahead and inspect the data by executing `X_train.head()` to corroborate that the categories have been replaced by the counts.
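One detail worth checking with the pandas approach is what happens to categories that appear in the test set but not in the train set: `map()` has no entry for them in the dictionary and returns `NaN`. The following sketch, on invented data, shows this and one possible way to handle it. Filling with 0 is an assumption made for illustration, not a recommendation from this recipe.

```python
import pandas as pd

# Toy data: category "o" appears only in the test set
train = pd.Series(["v", "v", "h", "ff"], name="A7")
test = pd.Series(["v", "o"], name="A7")

counts = train.value_counts().to_dict()

# Categories missing from the dictionary are mapped to NaN
test_enc = test.map(counts)
n_unseen = int(test_enc.isna().sum())

# One option: treat unseen categories as having zero training observations
test_filled = test_enc.fillna(0)

print(test_enc.tolist())
print(test_filled.tolist())
```

Counting the `NaN` values after the mapping is a quick way to detect unseen categories before the data reaches a model.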

Now, let’s carry out count encoding using Feature-engine. First, let’s load and divide the dataset, as we did in *step 2*.

- Let’s import the count encoder from Feature-engine:

      from feature_engine.encoding import CountFrequencyEncoder

- Let’s set up the encoder so that it encodes all categorical variables with the count of observations:

      count_enc = CountFrequencyEncoder(
          encoding_method="count",
          variables=None,
      )

Tip

`CountFrequencyEncoder()` will automatically find and encode all categorical variables in the train set. To encode only a subset of the variables, we can pass the variable names in a list to the `variables` argument.

- Let’s fit the encoder to the train set so that it stores the number of observations per category per variable:

      count_enc.fit(X_train)

Tip

The dictionaries with the category-to-counts pairs are stored in the `encoder_dict_` attribute and can be displayed by executing `count_enc.encoder_dict_`.

- Finally, let’s replace the categories with counts in the train and test sets:

      X_train_enc = count_enc.transform(X_train)
      X_test_enc = count_enc.transform(X_test)

Tip

If there are categories in the test set that were not present in the train set, the transformer will replace those with `np.nan` and return a warning to make you aware of this. A good idea to prevent this behavior is to group infrequent labels, as described in the *Grouping rare or infrequent categories* recipe.

The encoder returns pandas DataFrames with the strings of the categorical variables replaced with the counts of observations, leaving the variables ready to use in machine learning models.

To wrap up this recipe, let’s encode the variables using Category Encoders.

- Let’s import the encoder from Category Encoders:

      from category_encoders.count import CountEncoder

- Let’s set up the encoder so that it encodes all categorical variables with the count of observations:

      count_enc = CountEncoder(cols=None)

Note

`CountEncoder()` automatically finds and encodes *all* categorical variables in the train set. To encode only a subset of the categorical variables, we can pass the variable names in a list to the `cols` argument. To replace the categories by frequency instead, we need to set the `normalize` parameter to `True`.

- Let’s fit the encoder to the train set so that it counts and stores the number of observations per category per variable:

      count_enc.fit(X_train)

Tip

The values used to replace the categories are stored in the `mapping` attribute and can be displayed by executing `count_enc.mapping`.

- Finally, let’s replace the categories with counts in the train and test sets:

      X_train_enc = count_enc.transform(X_train)
      X_test_enc = count_enc.transform(X_test)

Note

Categories present in the test set that were not seen in the train set are referred to as unknown categories. `CountEncoder()` has different options to handle unknown categories, including returning an error, treating them as missing data, or replacing them with an indicated integer. `CountEncoder()` can also automatically group categories with few observations.

The encoder returns pandas DataFrames with the strings of the categorical variables replaced with the counts of observations, leaving the variables ready to use in machine learning models.

## How it works...

In this recipe, we replaced categories by the count of observations using pandas, Feature-engine, and Category Encoders.

Using pandas `value_counts()`, we determined the number of observations per category of the `A7` variable, and with pandas `to_dict()`, we captured these values in a dictionary, where each key was a unique category, and each value the number of observations for that category. With pandas `map()` and using this dictionary, we replaced the categories with the observation counts in both the train and test sets.

To perform count encoding with Feature-engine, we used `CountFrequencyEncoder()` and set `encoding_method` to `'count'`. We left the `variables` argument set to `None` so that the encoder automatically finds all of the categorical variables in the dataset. With the `fit()` method, the transformer found the categorical variables and stored the observation counts per category in the `encoder_dict_` attribute. With the `transform()` method, the transformer replaced the categories with the counts, returning a pandas DataFrame.

Finally, we performed count encoding with `CountEncoder()`, leaving `normalize` at its default value of `False`. We left the `cols` argument set to `None` so that the encoder automatically finds the categorical variables in the dataset. With the `fit()` method, the transformer found the categorical variables and stored the category-to-count mappings in the `mapping` attribute. With the `transform()` method, the transformer replaced the categories with the counts, returning a pandas DataFrame.

# Replacing categories with ordinal numbers

Ordinal encoding consists of replacing the categories with digits from 1 to *k* (or 0 to *k-1*, depending on the implementation), where *k* is the number of distinct categories of the variable. The numbers are assigned arbitrarily. Ordinal encoding is better suited for non-linear machine learning models, which can navigate through the arbitrarily assigned digits to find patterns that relate to the target.
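To see what "assigned arbitrarily" means in practice, consider the following sketch on invented data. It numbers the categories in their order of appearance, which is one common convention, and the resulting integers bear no relation to any intrinsic order of the labels.

```python
import pandas as pd

# Toy data, invented for illustration only
grades = pd.Series(["B", "A", "C", "A", "B"], name="grade")

# unique() returns the categories in order of appearance
mapping = {category: i for i, category in enumerate(grades.unique())}
encoded = grades.map(mapping)

print(mapping)           # prints: {'B': 0, 'A': 1, 'C': 2}
print(encoded.tolist())  # prints: [0, 1, 2, 1, 0]
```

Here, **B** receives 0 and **A** receives 1 simply because **B** appears first, which is why this encoding is better left to models that do not assume a linear relationship between the digits and the target.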

In this recipe, we will perform ordinal encoding using pandas, scikit-learn, and Feature-engine.

## How to do it...

First, let’s import the necessary Python libraries and prepare the dataset:

- Import `pandas` and the data split function:

      import pandas as pd
      from sklearn.model_selection import train_test_split

- Let’s load the dataset and divide it into train and test sets:

      data = pd.read_csv("credit_approval_uci.csv")
      X_train, X_test, y_train, y_test = train_test_split(
          data.drop(labels=["target"], axis=1),
          data["target"],
          test_size=0.3,
          random_state=0,
      )

- To encode the `A7` variable, let’s make a dictionary of category-to-integer pairs:

      ordinal_mapping = {
          k: i for i, k in enumerate(X_train["A7"].unique(), 0)
      }

If we execute `print(ordinal_mapping)`, we will see the digits that will replace each category:

    {'v': 0, 'ff': 1, 'h': 2, 'dd': 3, 'z': 4, 'bb': 5, 'j': 6, 'Missing': 7, 'n': 8, 'o': 9}

- Now, let’s replace the categories with numbers in the original variables:

      X_train["A7"] = X_train["A7"].map(ordinal_mapping)
      X_test["A7"] = X_test["A7"].map(ordinal_mapping)

With `print(X_train["A7"].head(10))`, we can see the result of the preceding operation, where the original categories were replaced by numbers:

    596    0
    303    0
    204    0
    351    1
    118    0
    247    2
    652    0
    513    3
    230    0
    250    4
    Name: A7, dtype: int64

Next, let’s carry out ordinal encoding using scikit-learn. First, we need to divide the data into train and test sets, as we did in *step 2*.

- Let’s import the required classes:

      from sklearn.preprocessing import OrdinalEncoder
      from sklearn.compose import ColumnTransformer

Tip

Do not confuse `OrdinalEncoder()` with `LabelEncoder()` from scikit-learn. The former is intended to encode predictive features, whereas the latter is intended to modify the target variable.
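The difference between the two classes shows up in the shape of the input they accept. The following is a small sketch on invented arrays: `OrdinalEncoder()` works on a two-dimensional array of features, one column at a time, while `LabelEncoder()` works on a one-dimensional target.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy feature matrix (2 columns) and target vector, invented for illustration
X = np.array([["b", "x"], ["a", "y"], ["b", "y"]])
y = np.array(["yes", "no", "yes"])

# Each feature column is encoded independently, in sorted category order
X_enc = OrdinalEncoder().fit_transform(X)

# The target is encoded as a single vector of integer labels
y_enc = LabelEncoder().fit_transform(y)

print(X_enc)
print(y_enc)
```

Both classes number the categories in sorted order by default, so `a` becomes 0 and `b` becomes 1 in the first feature, and `no` becomes 0 and `yes` becomes 1 in the target.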

- Let’s set up the encoder:
enc = OrdinalEncoder()

Note

Scikit-learn’s `OrdinalEncoder()`

will encode the entire dataset. To encode only a selection of variables, we need to use scikit-learn’s `ColumnTransformer()`

.

- Let’s make a list containing the categorical variables to encode:
vars_categorical = X_train.select_dtypes( include="O").columns.to_list()

- Let’s make a list containing the remaining variables:
vars_remainder = X_train.select_dtypes( exclude="O").columns.to_list()

- Now, let’s set up `ColumnTransformer()` to encode the categorical variables. By setting the `remainder` parameter to `"passthrough"`, we make `ColumnTransformer()` concatenate the variables that are not encoded at the back of the encoded features:

      ct = ColumnTransformer(
          [("encoder", enc, vars_categorical)],
          remainder="passthrough",
      )

- Let’s fit the encoder to the train set so that it creates and stores representations of categories to digits:

      ct.fit(X_train)

By executing `ct.named_transformers_["encoder"].categories_`, you can visualize the unique categories per variable.

- Now, let’s encode the categorical variables in the train and test sets:

      X_train_enc = ct.transform(X_train)
      X_test_enc = ct.transform(X_test)

Remember that scikit-learn returns a NumPy array.

- Let’s transform the arrays into pandas DataFrames by adding the columns:

      X_train_enc = pd.DataFrame(
          X_train_enc, columns=vars_categorical+vars_remainder)
      X_test_enc = pd.DataFrame(
          X_test_enc, columns=vars_categorical+vars_remainder)

Note

Note that, with `ColumnTransformer()`, the variables that were not encoded will be returned to the right of the DataFrame, following the encoded variables. You can visualize the output of *step 12* with `X_train_enc.head()`.

Now, let’s do ordinal encoding with Feature-engine. First, we must divide the dataset, as we did in *step 2*.

- Let’s import the encoder:

      from feature_engine.encoding import OrdinalEncoder

- Let’s set up the encoder so that it replaces categories with arbitrary integers in the categorical variables specified in *step 7*:

      enc = OrdinalEncoder(
          encoding_method="arbitrary",
          variables=vars_categorical,
      )

Note

Feature-engine’s `OrdinalEncoder()` automatically finds and encodes all categorical variables if the `variables` parameter is left set to `None`. Alternatively, it will encode the variables indicated in the list. In addition, Feature-engine’s `OrdinalEncoder()` can assign the integers according to the target mean value (see the *Performing ordinal encoding based on the target value* recipe).

- Let’s fit the encoder to the train set so that it learns and stores the category-to-integer mappings:

      enc.fit(X_train)

Tip

The category-to-integer mappings are stored in the `encoder_dict_` attribute and can be accessed by executing `enc.encoder_dict_`.

- Finally, let’s encode the categorical variables in the train and test sets:

      X_train_enc = enc.transform(X_train)
      X_test_enc = enc.transform(X_test)

Feature-engine returns pandas DataFrames where the values of the original variables are replaced with numbers, leaving the DataFrame ready to use in machine learning models.

## How it works...

In this recipe, we replaced categories with integers assigned arbitrarily.

With pandas `unique()`, we returned the unique values of the `A7` variable, and using Python’s dictionary comprehension syntax, we created a dictionary of key-value pairs, where each key was one of the `A7` variable’s unique categories, and each value was the digit that would replace the category. Finally, we used pandas `map()` to replace the strings in `A7` with the integers.
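One consequence of `map()` is worth seeing in isolation: a category that appears only in the test set has no entry in the dictionary and becomes `NaN`. The following is a minimal sketch on made-up data; the `train` and `test` Series and the `-1` placeholder are illustrative assumptions, not part of the recipe’s dataset:

```python
import pandas as pd

# Toy data standing in for a categorical column (hypothetical values).
train = pd.Series(["v", "ff", "h", "v", "ff"], name="A7")
test = pd.Series(["h", "v", "zz"], name="A7")  # "zz" never appears in train

# Learn the mapping from the train set only, in order of first appearance.
ordinal_mapping = {k: i for i, k in enumerate(train.unique(), 0)}

train_enc = train.map(ordinal_mapping)
test_enc = test.map(ordinal_mapping)  # the unseen "zz" becomes NaN

# One possible convention (an assumption): flag unseen categories with -1.
test_enc_filled = test_enc.fillna(-1).astype(int)
```

Whether `-1`, a dedicated integer, or a grouped rare category is the right fallback depends on the model; the sketch only shows where the missing values come from.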

Next, we carried out ordinal encoding using scikit-learn’s `OrdinalEncoder()` and used `ColumnTransformer()` to select the columns to encode. With the `fit()` method, the transformer created the category-to-integer mappings based on the categories in the train set. With the `transform()` method, the categories were replaced with integers, returning a NumPy array. `ColumnTransformer()` sliced the DataFrame into the categorical variables to encode, and then concatenated the remaining variables at the right of the encoded features.
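The column ordering and the handling of unseen categories can be checked on a small example. This is a sketch under stated assumptions: the `city` and `age` columns are invented, and the `handle_unknown="use_encoded_value"` and `unknown_value=-1` settings are one option scikit-learn offers, not something the recipe itself uses:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: one categorical and one numerical column.
train = pd.DataFrame({
    "city": ["London", "Bristol", "London", "Manchester"],
    "age": [30, 41, 25, 58],
})
test = pd.DataFrame({"city": ["Bristol", "Leeds"], "age": [33, 47]})

# Categories unseen during fit are encoded as -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
ct = ColumnTransformer([("encoder", enc, ["city"])], remainder="passthrough")
ct.fit(train)

# Encoded columns come first, followed by the passed-through columns.
test_enc = pd.DataFrame(ct.transform(test), columns=["city", "age"])
learned = list(ct.named_transformers_["encoder"].categories_[0])
```

Note that scikit-learn sorts the categories before numbering them, so the integers here follow alphabetical order rather than order of appearance.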

To perform ordinal encoding with Feature-engine, we used `OrdinalEncoder()`, indicating that the integers should be assigned arbitrarily in `encoding_method` and passing a list with the variables to encode in the `variables` argument. With the `fit()` method, the encoder assigned integers to each variable’s categories, which were stored in the `encoder_dict_` attribute. These mappings were then used by the `transform()` method to replace the categories in the train and test sets, returning DataFrames.

## There’s more...

You can also carry out ordinal encoding with `OrdinalEncoder()` from Category Encoders.

The transformers from Feature-engine and Category Encoders can automatically identify and encode categorical variables – that is, those of the object or categorical type. They also allow us to encode only a subset of the variables.

In contrast, scikit-learn’s transformer encodes all variables in the dataset. To encode just a subset, we need to use an additional class, `ColumnTransformer()`, to slice the data before the transformation.

Feature-engine and Category Encoders return pandas DataFrames, whereas scikit-learn returns NumPy arrays.

Finally, each class has additional functionality. For example, with scikit-learn, we can encode only a subset of the categories, whereas Feature-engine allows us to replace categories with integers that are assigned based on the target mean value. On the other hand, Category Encoders can automatically handle missing data and offers alternative options to work with unseen categories.

# Performing ordinal encoding based on the target value

In the previous recipe, we replaced categories with integers, which were assigned arbitrarily. We can also assign integers to the categories given the target values. To do this, first, we must calculate the mean value of the target per category. Next, we must order the categories from the one with the lowest to the one with the highest target mean value. Finally, we must assign digits to the ordered categories, starting with 0 to the first category up to *k-1* to the last category, where *k* is the number of distinct categories.

This encoding method creates a monotonic relationship between the categorical variable and the response and therefore makes the variables more adequate for use in linear models.

In this recipe, we will encode categories while following the target value using pandas and Feature-engine.

## How to do it...

First, let’s import the necessary Python libraries and get the dataset ready:

- Import the required Python libraries, functions, and classes:

      import pandas as pd
      import matplotlib.pyplot as plt
      from sklearn.model_selection import train_test_split

- Let’s load the dataset and divide it into train and test sets:

      data = pd.read_csv("credit_approval_uci.csv")
      X_train, X_test, y_train, y_test = train_test_split(
          data.drop(labels=["target"], axis=1),
          data["target"],
          test_size=0.3,
          random_state=0,
      )

- Let’s determine the mean target value per category in `A7`, then sort the categories from that with the lowest to that with the highest target value:

      y_train.groupby(X_train["A7"]).mean().sort_values()

The following is the output of the preceding command:

    A7
    o          0.000000
    ff         0.146341
    j          0.200000
    dd         0.400000
    v          0.418773
    bb         0.512821
    h          0.603960
    n          0.666667
    z          0.714286
    Missing    1.000000
    Name: target, dtype: float64

- Now, let’s repeat the computation in *step 3*, but this time, let’s retain the ordered category names:

      ordered_labels = y_train.groupby(
          X_train["A7"]).mean().sort_values().index

To display the output of the preceding command, we can execute `print(ordered_labels)`:

    Index(['o', 'ff', 'j', 'dd', 'v', 'bb', 'h', 'n', 'z', 'Missing'], dtype='object', name='A7')

- Let’s create a dictionary of category-to-integer pairs, using the ordered list we created in *step 4*:

      ordinal_mapping = {
          k: i for i, k in enumerate(ordered_labels, 0)
      }

We can visualize the result of the preceding code by executing `print(ordinal_mapping)`:

    {'o': 0, 'ff': 1, 'j': 2, 'dd': 3, 'v': 4, 'bb': 5, 'h': 6, 'n': 7, 'z': 8, 'Missing': 9}

- Let’s use the dictionary we created in *step 5* to replace the categories in `A7` in the train and test sets, returning the encoded features as new columns:

      X_train["A7_enc"] = X_train["A7"].map(ordinal_mapping)
      X_test["A7_enc"] = X_test["A7"].map(ordinal_mapping)

Tip

Note that if the test set contains a category not present in the train set, the preceding code will introduce `np.nan`.

To better understand the monotonic relationship concept, let’s plot the relationship of the categories of the `A7` variable with the target before and after the encoding.

- Let’s plot the mean target response per category of the `A7` variable:

      y_train.groupby(X_train["A7"]).mean().plot()
      plt.title("Relationship between A7 and the target")
      plt.ylabel("Mean of target")
      plt.show()

We can see the non-monotonic relationship between categories of `A7` and the target in the following plot:

Figure 2.7 – Relationship between the categories of A7 and the target

- Let’s plot the mean target value per category in the encoded variable:

      y_train.groupby(X_train["A7_enc"]).mean().plot()
      plt.title("Relationship between A7 and the target")
      plt.ylabel("Mean of target")
      plt.show()

The encoded variable shows a monotonic relationship with the target – the higher the mean target value, the higher the digit assigned to the category:

Figure 2.8 – Relationship between A7 and the target after the encoding

Now, let’s perform ordered ordinal encoding using Feature-engine. First, we must divide the dataset into train and test sets, as we did in *step 2*.

- Let’s import the encoder:

      from feature_engine.encoding import OrdinalEncoder

- Next, let’s set up the encoder so that it assigns integers by following the target value to all categorical variables in the dataset:

      ordinal_enc = OrdinalEncoder(
          encoding_method="ordered",
          variables=None,
      )

Tip

`OrdinalEncoder()` will find and encode all categorical variables automatically. Alternatively, we can indicate which variables to encode by passing their names in a list to the `variables` argument.

- Let’s fit the encoder to the train set so that it finds the categorical variables, and then stores the category and integer mappings:

      ordinal_enc.fit(X_train, y_train)

Tip

When fitting the encoder, we need to pass the train set and the target, like with many scikit-learn predictor classes.

- Finally, let’s replace the categories with numbers in the train and test sets:

      X_train_enc = ordinal_enc.transform(X_train)
      X_test_enc = ordinal_enc.transform(X_test)

Tip

A list of the categorical variables is stored in the `variables_` attribute of `OrdinalEncoder()` and the dictionaries with the category-to-integer mappings in the `encoder_dict_` attribute.

Go ahead and check the monotonic relationship between other encoded categorical variables and the target by using the code in *step 7* and changing the variable name in the `groupby()` method.

## How it works...

In this recipe, we replaced the categories with integers according to the target mean.

In the first part of this recipe, we worked with the `A7` categorical variable. With pandas `groupby()`, we grouped the data based on the categories of `A7`, and with pandas `mean()`, we determined the mean value of the target for each of the categories of `A7`. Next, we ordered the categories with pandas `sort_values()` from the ones with the lowest to the ones with the highest target mean response. The output of this operation was a pandas Series, with the categories as indices and the target mean as values. With pandas `index`, we captured the ordered categories in an array; then, with Python dictionary comprehension, we created a dictionary of category-to-integer pairs. Finally, we used this dictionary to replace the category with integers using pandas `map()` in the train and test sets.
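The same sequence of `groupby()`, `mean()`, `sort_values()`, and `map()` can be traced on a handful of rows. The data below is invented for illustration (it is not the credit approval dataset), and the final check is simply one way to confirm that the encoding is monotonic with respect to the target mean:

```python
import pandas as pd

# Hypothetical toy data: three categories and a binary target.
X = pd.Series(["a", "a", "b", "b", "c", "c"], name="cat")
y = pd.Series([1, 1, 0, 0, 1, 0], name="target")

# Mean target per category, sorted from lowest to highest.
ordered_labels = y.groupby(X).mean().sort_values().index

# Assign 0..k-1 following that order.
ordinal_mapping = {k: i for i, k in enumerate(ordered_labels, 0)}
X_enc = X.map(ordinal_mapping)

# The mean target per encoded value should never decrease.
means_by_code = y.groupby(X_enc).mean()
is_monotonic = means_by_code.is_monotonic_increasing
```

Here, `b` has the lowest target mean and receives 0, while `a` has the highest and receives 2.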

Then, we plotted the relationship of the original and encoded variables with the target to visualize the monotonic relationship after the transformation. We determined the mean target value per category of `A7` using pandas `groupby()`, followed by pandas `mean()`, as described in the preceding paragraph. We followed up with pandas `plot()` to create a plot of category versus target mean value. We added a title and *y* labels with Matplotlib’s `title()` and `ylabel()` methods.

To perform the encoding with Feature-engine, we used `OrdinalEncoder()` and indicated `"ordered"` in the `encoding_method` argument. We left the `variables` argument set to `None` so that the encoder automatically detects all categorical variables in the dataset. With the `fit()` method, the encoder found the categorical variables to encode and assigned digits to their categories, according to the target mean value. The variables to encode and dictionaries with category-to-digit pairs were stored in the `variables_` and `encoder_dict_` attributes, respectively. Finally, using the `transform()` method, the transformer replaced the categories with digits in the train and test sets, returning pandas DataFrames.

## See also

For an implementation of this recipe with Category Encoders, visit this book’s GitHub repository.

# Implementing target mean encoding

Mean encoding or target encoding maps each category to the probability estimate of the target attribute. If the target is binary, the numerical mapping is the posterior probability of the target conditioned to the value of the category. If the target is continuous, the numerical representation is given by the expected value of the target given the value of the category.

In its simplest form, the numerical representation for each category is given by the mean value of the target variable for a particular category group. For example, if we have a **City** variable, with the categories of **London**, **Manchester**, and **Bristol**, and we want to predict the default rate (the target takes values 0 and 1); if the default rate for **London** is 30%, we replace **London** with 0.3; if the default rate for **Manchester** is 20%, we replace **Manchester** with 0.2; and so on. If the target is continuous – say we want to predict income – then we would replace London, Manchester, and Bristol with the mean income earned in each city.

In mathematical terms, if the target is binary, the replacement value, *S*, is determined like so:

$$S_i = \frac{n_{i(Y=1)}}{n_i}$$

Here, the numerator is the number of observations with a target value of 1 for category *i* and the denominator is the number of observations with a category value of *i*.

If the target is continuous, *S* is determined by the following formula:

$$S_i = \frac{\sum_{j \in i} y_j}{n_i}$$

Here, the numerator is the sum of the target across observations in category *i* and $n_i$ is the total number of observations in category *i*.

These formulas provide a good approximation of the target estimate if there is a sufficiently large number of observations with each category value – in other words, if $n_i$ is large. However, in most datasets, categorical variables will have some categories that are present in only a few observations. In these cases, target estimates derived from the preceding formulas can be unreliable.

To mitigate poor estimates returned for rare categories, the target estimates can be determined as a mixture of two probabilities: those returned by the preceding formulas and the prior probability of the target based on the entire training set. The two probabilities are *blended* using a weighting factor, which is a function of the category group size:

$$S_i = \lambda \frac{n_{i(Y=1)}}{n_i} + (1 - \lambda) \frac{n_Y}{N}$$

In this formula, $n_Y$ is the total number of cases where the target takes a value of 1, *N* is the size of the train set, and 𝛌 is the weighting factor.

When the category group is large, 𝛌 approximates 1, so more weight is given to the first term of the equation. When the category group size is small, then 𝛌 tends to 0, so the estimate is mostly driven by the second term of the equation – that is, the target’s prior probability. In other words, if the group size is small, knowing the value of the category does not tell us anything about the value of the target.

The weighting factor, 𝛌, is a function of the group size, $n_i$, and of two parameters, *k* and *f*. The smoothing parameter, *f*, controls the rate of transition between the first and second terms of the preceding equation:

$$\lambda = \frac{1}{1 + e^{-(n_i - k)/f}}$$

Here, *k* is half of the minimal size for which we *fully trust* the first term of the equation. The *f* parameter is selected by the user either arbitrarily or with optimization.
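To make the blending concrete, the following sketch computes the weighting factor and the blended estimate with pandas and NumPy. The function name `blended_target_mean`, the toy `city` and `default` Series, and the values chosen for `k` and `f` are assumptions made for illustration; this is not the implementation used by any particular library:

```python
import numpy as np
import pandas as pd

def blended_target_mean(categories, target, k=2.0, f=1.0):
    """Blend per-category target means with the prior, weighted by group size."""
    stats = target.groupby(categories).agg(["mean", "count"])
    prior = target.mean()  # n_Y / N for a 0/1 target
    # Sigmoid weight: close to 1 for large groups, close to 0 for small ones.
    lam = 1.0 / (1.0 + np.exp(-(stats["count"] - k) / f))
    return lam * stats["mean"] + (1.0 - lam) * prior

# Hypothetical data: a frequent category and a rare one.
city = pd.Series(["London"] * 8 + ["Bristol"] * 2)
default = pd.Series([1, 0, 0, 0, 0, 0, 0, 0, 1, 1])

encoding = blended_target_mean(city, default, k=2.0, f=1.0)
```

With these numbers, the prior is 0.3. Bristol has exactly `k` observations, so its weight is 0.5 and its raw mean of 1.0 is pulled halfway toward the prior, whereas London’s estimate stays close to its raw mean of 0.125.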

Tip

Mean encoding was designed to encode highly cardinal categorical variables without expanding the feature space. For more details, check out the following article: Micci-Barreca D. *A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems*. ACM SIGKDD Explorations Newsletter, 2001.

In this recipe, we will perform mean encoding using pandas, Feature-engine, and Category Encoders.

## How to do it...

In the first part of this recipe, we will replace categories with the target mean value, regardless of the number of observations per category. We will use pandas and Feature-engine to do this. In the second part of this recipe, we will introduce the weighting factor using Category Encoders. Let’s begin with this recipe:

- Import `pandas` and the data split function:

      import pandas as pd
      from sklearn.model_selection import train_test_split

- Let’s load the dataset and divide it into train and test sets, as we did in *step 2* of the *Performing ordinal encoding based on the target value* recipe.
- Let’s determine the mean target value per category of the `A7` variable and then store them in a dictionary:

      mapping = y_train.groupby(X_train["A7"]).mean().to_dict()

We can display the content of the dictionary by executing `print(mapping)`:

    {'Missing': 1.0, 'bb': 0.5128205128205128, 'dd': 0.4, 'ff': 0.14634146341463414, 'h': 0.6039603960396039, 'j': 0.2, 'n': 0.6666666666666666, 'o': 0.0, 'v': 0.4187725631768953, 'z': 0.7142857142857143}

- Let’s replace the categories with the mean target value using the dictionary we created in *step 3* in the train and test sets:

      X_train["A7"] = X_train["A7"].map(mapping)
      X_test["A7"] = X_test["A7"].map(mapping)

You can inspect the encoded `A7` variable by executing `X_train["A7"].head()`.

Now, let’s perform target encoding with Feature-engine. First, we must split the data, as we did in *step 2*.

- Let’s import the encoder:

      from feature_engine.encoding import MeanEncoder

- Let’s set up the target mean encoder to encode all categorical variables:

      mean_enc = MeanEncoder(variables=None)

Tip

`MeanEncoder()` will find and encode all categorical variables by default. Alternatively, we can indicate the variables to encode by passing their names in a list to the `variables` argument.

- Let’s fit the transformer to the train set so that it learns and stores the mean target value per category per variable. Note that we need to pass both the train set and target to fit the encoder:

      mean_enc.fit(X_train, y_train)

- Finally, let’s encode the train and test sets:

      X_train_enc = mean_enc.transform(X_train)
      X_test_enc = mean_enc.transform(X_test)

Tip

The category-to-number pairs are stored as a dictionary of dictionaries in the `encoder_dict_` attribute. To display the stored parameters, execute `mean_enc.encoder_dict_`.

Feature-engine returns pandas DataFrames containing the encoded categorical variables, ready to use in machine learning models.

To wrap up, let’s implement mean encoding with Category Encoders blending the probabilities.

- Let’s import the encoder:

      from category_encoders.target_encoder import TargetEncoder

- Let’s set up the encoder so that it encodes all categorical variables using blended probabilities when there are fewer than 25 observations in the category group:

      mean_enc = TargetEncoder(
          cols=None,
          min_samples_leaf=25,
          smoothing=1.0,
      )

Tip

`TargetEncoder()` finds categorical variables automatically by default. Alternatively, we can indicate the variables to encode by passing their names in a list to the `cols` argument. The `smoothing` parameter controls the blend of the prior and posterior probability. Higher values decrease the contribution of the posterior probability to the encoding.

- Let’s fit the transformer to the train set so that it learns and stores the numerical representations for each category:

      mean_enc.fit(X_train, y_train)

Note

The `min_samples_leaf` parameter refers to the minimum number of observations per category that a group should have to solely use the posterior probability. It is the equivalent of `k` in our weighting factor formula. In the original article, `k` was set to ½ of `min_samples_leaf`. Category Encoders exposes this value and thus, we can optimize it with cross-validation.

- Finally, let’s encode the train and test sets:

      X_train_enc = mean_enc.transform(X_train)
      X_test_enc = mean_enc.transform(X_test)

Category Encoders returns pandas DataFrames by default, where the original categorical variable values are replaced by their numerical representation. You can inspect the results by executing `X_train_enc.head()`.

## How it works…

In this recipe, we replaced the categories with the mean target value using pandas, Feature-engine, and Category Encoders.

With pandas `groupby()`, using the `A7` categorical variable, followed by pandas `mean()` over the target variable, we created a pandas Series with the categories as indices and the target mean as values. With pandas `to_dict()`, we converted this Series into a dictionary. Finally, we used this dictionary to replace the categories in the train and test sets using pandas `map()`.

To perform the encoding with Feature-engine, we used `MeanEncoder()`. With `fit()`, the transformer found and stored the categorical variables and the mean target value per category. With `transform()`, categories were replaced with numbers in the train and test sets, returning pandas DataFrames.

Finally, we used `TargetEncoder()` from Category Encoders to replace categories with a blend of prior and posterior probability estimates of the target. We set `min_samples_leaf` to 25, which meant that if a category group had 25 observations or more, then the posterior probability was used for the encoding; otherwise, a blend of probabilities was used. With `fit()`, the transformer found the categorical variables and learned the numerical representation of the categories, while with `transform()`, the categories were replaced with numbers, returning pandas DataFrames with their encoded values.

## There’s more…

There is an alternative way to return *better* target estimates when the category groups are small. The replacement value for each category is determined as follows:

$$S_i = \frac{n_{i(Y=1)} + p_Y \, m}{n_i + m}$$

Here, $n_{i(Y=1)}$ is the number of observations with a target value of 1 in category *i* and $n_i$ is the number of observations with category *i*. The target prior is given by $p_Y$ and *m* is the weighting factor. With this adjustment, the only parameter that we have to set is the weight, *m*. If *m* is large, then more importance is given to the target’s prior probability. This adjustment affects target estimates for all categories but mostly for those with fewer observations because, in such cases, *m* could be much larger than $n_i$ in the formula’s denominator.

For an implementation of this encoding using `MEstimateEncoder()`, visit this book’s GitHub repository.
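As a quick numerical check of the adjustment described above, here is a plain pandas sketch. The `m_estimate` helper and the toy data are hypothetical and only illustrate the arithmetic; they are not the code behind `MEstimateEncoder()`:

```python
import pandas as pd

def m_estimate(categories, target, m=2.0):
    """Shrink per-category positive rates toward the prior with weight m."""
    stats = target.groupby(categories).agg(["sum", "count"])
    prior = target.mean()  # p_Y
    return (stats["sum"] + prior * m) / (stats["count"] + m)

# Hypothetical data: a frequent category and a rare one.
city = pd.Series(["London"] * 8 + ["Bristol"] * 2)
default = pd.Series([1, 0, 0, 0, 0, 0, 0, 0, 1, 1])

encoding = m_estimate(city, default, m=2.0)
```

With a prior of 0.3 and `m=2`, the rare category moves from a raw rate of 1.0 to 0.65, while the frequent one moves only from 0.125 to 0.16.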

# Encoding with the Weight of Evidence

The **Weight of Evidence** (**WoE**) was developed primarily for credit and financial industries to facilitate variable screening and exploratory analysis and to build more predictive linear models to evaluate the risk of loan defaults.

The WoE is computed from the basic odds ratio:

$$\mathrm{WoE} = \log\left(\frac{p(\text{positive})}{p(\text{negative})}\right)$$

Here, positive and negative refer to the values of the target being 1 or 0, respectively. The proportion of positive cases per category, p(positive), is determined as the sum of positive cases per category group divided by the total positive cases in the training set, and the proportion of negative cases per category, p(negative), is determined as the sum of negative cases per category group divided by the total number of negative observations in the training set.

The WoE has the following characteristics:

- WoE = 0 if p(positive) / p(negative) = 1; that is, if the outcome is random
- WoE > 0 if p(positive) > p(negative)
- WoE < 0 if p(negative) > p(positive)

This allows us to directly visualize the predictive power of the category in the variable: the higher the WoE, the more likely the event will occur. If the WoE is positive, the event is likely to occur.

Logistic regression models a binary response, *Y*, based on *X* predictor variables, assuming that there is a linear relationship between *X* and the log of odds of *Y*:

$$\log\left(\frac{p(Y=1)}{p(Y=0)}\right) = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$$

Here, *log (p(Y=1)/p(Y=0))* is the log of odds. As you can see, the WoE encodes the categories in the same scale – that is, the log of odds – as the outcome of the logistic regression.

Therefore, by using WoE, the predictors are prepared and coded on the same scale, and the parameters in the logistic regression model – that is, the coefficients – can be directly compared.
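The sign behavior of the WoE can be verified on a small, invented example. In this sketch, the `X` and `y` Series are assumptions chosen so that one category has mostly positive cases, one mostly negative cases, and one is balanced:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a" is mostly positive, "b" mostly negative, "c" balanced.
X = pd.Series(list("aaaabbbbcccc"), name="cat")
y = pd.Series([1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0], name="target")

# Share of all positive (or negative) cases that falls in each category.
pos = y.groupby(X).sum() / y.sum()
neg = (1 - y).groupby(X).sum() / (1 - y).sum()

woe = np.log(pos / neg)
```

Category `a` holds three times as many of the positives as of the negatives, so its WoE is log(3); `b` mirrors it with a negative value, and `c` sits at 0.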

In this recipe, we will perform WoE encoding using pandas and Feature-engine.

## How to do it...

Let’s begin by making some imports and preparing the data:

- Import the required libraries and functions:

      import numpy as np
      import pandas as pd
      from sklearn.model_selection import train_test_split

- Let’s load the dataset and divide it into train and test sets, as we did in *step 2* of the *Performing ordinal encoding based on the target value* recipe.
- Let’s get the inverse of the target values to be able to calculate the negative cases:

      neg_y_train = pd.Series(
          np.where(y_train == 1, 0, 1),
          index=y_train.index,
      )

- Let’s determine the number of observations where the target variable takes a value of 1 or 0:

      total_pos = y_train.sum()
      total_neg = neg_y_train.sum()

- Now, let’s calculate the numerator and denominator of the WoE’s formula, which we discussed earlier in this recipe:

      pos = y_train.groupby(
          X_train["A1"]).sum() / total_pos
      neg = neg_y_train.groupby(
          X_train["A1"]).sum() / total_neg

- Now, let’s calculate the WoE per category:

      woe = np.log(pos/neg)

We can display the series with the category-to-WoE pairs by executing `print(woe)`:

    A1
    Missing    0.203599
    a          0.092373
    b         -0.042410
    dtype: float64

- Finally, let’s replace the categories of `A1` with the WoE:

      X_train["A1"] = X_train["A1"].map(woe)
      X_test["A1"] = X_test["A1"].map(woe)

You can inspect the encoded variable by executing `X_train["A1"].head()`.

Now, let’s perform WoE encoding using Feature-engine. First, we need to separate the data into train and test sets, as we did in *step 2*.

- Let’s import the encoder:

      from feature_engine.encoding import WoEEncoder

- Next, let’s set up the encoder so that we can encode three categorical variables:

      woe_enc = WoEEncoder(variables=["A1", "A9", "A12"])

Tip

Feature-engine’s `WoEEncoder()` will return an error if `p(0)=0` for any category because the division by `0` is not defined. To avoid this error, we can group infrequent categories, as we will discuss in the next recipe, *Grouping rare or infrequent categories*.

- Let’s fit the transformer to the train set so that it learns and stores the WoE of the different categories:

      woe_enc.fit(X_train, y_train)

Tip

We can display the dictionaries with the category-to-WoE pairs by executing `woe_enc.encoder_dict_`.

- Finally, let’s encode the three categorical variables in the train and test sets:

      X_train_enc = woe_enc.transform(X_train)
      X_test_enc = woe_enc.transform(X_test)

Feature-engine returns pandas DataFrames containing the encoded categorical variables ready to use in machine learning models.

## How it works...

First, with pandas `sum()`, we determined the total number of positive and negative cases. Next, using pandas `groupby()`, we determined the fraction of positive and negative cases per category. And with that, we calculated the WoE per category.
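The manual computation has one edge case that is easy to miss: when a category contains no negative (or no positive) cases, the ratio or its logarithm is not finite. The sketch below, on invented data, shows one way to detect such categories before encoding; the `undefined` list is a hypothetical helper variable, not part of the recipe:

```python
import numpy as np
import pandas as pd

# Hypothetical data in which category "a" has no negative cases.
X = pd.Series(["a", "a", "a", "b", "b", "b"], name="cat")
y = pd.Series([1, 1, 1, 1, 0, 0], name="target")

pos = y.groupby(X).sum() / y.sum()
neg = (1 - y).groupby(X).sum() / (1 - y).sum()

# Categories with a zero proportion on either side have an undefined WoE.
undefined = pos.index[(pos == 0) | (neg == 0)].tolist()

with np.errstate(divide="ignore"):
    woe = np.log(pos / neg)  # "a" evaluates to infinity
```

Grouping infrequent categories before computing the WoE is one common way to avoid this situation.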

Finally, we automated the procedure with Feature-engine. We used `WoEEncoder()`, which learned the WoE per category with the `fit()` method, and then used `transform()`, which replaced the categories with the corresponding numbers.

## See also

For an implementation of WoE with Category Encoders, visit this book’s GitHub repository.

# Grouping rare or infrequent categories

Rare categories are those present only in a small fraction of the observations. There is no rule of thumb to determine how small a small fraction is, but typically, any value below 5% can be considered rare.

Infrequent labels often appear only on the train set or only on the test set, thus making the algorithms prone to overfitting or being unable to score an observation. In addition, when encoding categories to numbers, we only create mappings for those categories observed in the train set, so we won’t know how to encode new labels. To avoid these complications, we can group infrequent categories into a single category called **Rare** or **Other**.

In this recipe, we will group infrequent categories using pandas and Feature-engine.

## How to do it...

First, let’s import the necessary Python libraries and get the dataset ready:

- Import the necessary Python libraries, functions, and classes:

      import numpy as np
      import pandas as pd
      from sklearn.model_selection import train_test_split
      from feature_engine.encoding import RareLabelEncoder

- Let’s load the dataset and divide it into train and test sets, as we did in *step 2* of the *Performing ordinal encoding based on the target value* recipe.
- Let’s capture the fraction of observations per category in `A7` in a variable:

      freqs = X_train["A7"].value_counts(normalize=True)

We can see the percentage of observations per category of `A7`, expressed as decimals, in the following output after executing `print(freqs)`:

    v          0.573499
    h          0.209110
    ff         0.084886
    bb         0.080745
    z          0.014493
    dd         0.010352
    j          0.010352
    Missing    0.008282
    n          0.006211
    o          0.002070
    Name: A7, dtype: float64

If we consider those labels present in less than 5% of the observations as rare, then `z`, `dd`, `j`, `Missing`, `n`, and `o` are rare categories.

- Let’s create a list containing the names of the categories present in more than 5% of the observations:

      frequent_cat = [
          x for x in freqs.loc[freqs > 0.05].index.values]

If we execute `print(frequent_cat)`, we will see the frequent categories of `A7`:

    ['v', 'h', 'ff', 'bb']

- Let’s replace rare labels – that is, those present in *<= 5%* of the observations – with the `"Rare"` string:

      X_train["A7"] = np.where(
          X_train["A7"].isin(frequent_cat),
          X_train["A7"], "Rare"
      )
      X_test["A7"] = np.where(
          X_test["A7"].isin(frequent_cat),
          X_test["A7"], "Rare"
      )

- Let’s determine the percentage of observations in the encoded variable:

      X_train["A7"].value_counts(normalize=True)

We can see that the infrequent labels have now been re-grouped into the `Rare` category:

    v       0.573499
    h       0.209110
    ff      0.084886
    bb      0.080745
    Rare    0.051760
    Name: A7, dtype: float64

Now, let’s group rare labels using Feature-engine. First, we must divide the dataset into train and test sets, as we did in *step 2*.

- Let’s create a rare label encoder that groups categories present in less than 5% of the observations, provided that the categorical variable has more than four distinct values:
rare_encoder = RareLabelEncoder(tol=

**0.05**, n_categories=**4**) - Let’s fit the encoder so that it finds the categorical variables and then learns their most frequent categories:
rare_encoder.fit(X_train)

Tip

Upon fitting, the transformer will raise warnings, indicating that many categorical variables have less than four categories, thus their values will not be grouped. The transformer just lets you know that this is happening.

We can display the frequent categories per variable by executing `rare_encoder.encoder_dict_`, as well as the variables that will be encoded by executing `rare_encoder.variables_`.

- Finally, let’s group rare labels in the train and test sets:
X_train_enc = rare_encoder.transform(X_train)
X_test_enc = rare_encoder.transform(X_test)

Now that we have grouped rare labels, we are ready to encode the categorical variables, as we’ve done in other recipes in this chapter.

## How it works...

In this recipe, we grouped infrequent categories using pandas and Feature-engine.

We determined the fraction of observations per category of the `A7` variable using pandas `value_counts()` by setting the `normalize` parameter to `True`. Using a list comprehension, we captured the names of the categories present in more than 5% of the observations. Finally, using NumPy’s `where()`, we checked each row of `A7`: if the observation was one of the frequent categories in the list, which we tested with the pandas `isin()` method, its value was kept; otherwise, its original value was replaced with `"Rare"`.
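The pandas steps described above can be wrapped into two small helpers. The following is a minimal sketch on made-up data; the function names `find_frequent_labels` and `group_rare_labels` and the toy series are assumptions introduced for illustration and are not part of the recipe:

```python
import numpy as np
import pandas as pd


def find_frequent_labels(series, tol=0.05):
    """Return the categories whose relative frequency is greater than tol."""
    freqs = series.value_counts(normalize=True)
    return list(freqs[freqs > tol].index)


def group_rare_labels(series, frequent):
    """Keep frequent categories and replace every other value with 'Rare'."""
    return pd.Series(
        np.where(series.isin(frequent), series, "Rare"),
        index=series.index,
        name=series.name,
    )


# Toy data (made up): 'a' and 'b' are frequent, 'c' and 'd' are rare.
train = pd.Series(["a"] * 60 + ["b"] * 36 + ["c"] * 3 + ["d"] * 1, name="toy")
test = pd.Series(["a", "b", "c", "e"], name="toy")

# The frequent categories are learned from the train set only.
frequent = find_frequent_labels(train, tol=0.05)

train_grouped = group_rare_labels(train, frequent)
test_grouped = group_rare_labels(test, frequent)

print(sorted(frequent))
print(train_grouped.value_counts(normalize=True))
print(test_grouped.tolist())
```

Note that the category `e`, which appears only in the toy test set, is also mapped to `Rare`, because it is not in the list learned from the train set.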

We automated the preceding steps for multiple categorical variables using Feature-engine. For this, we used Feature-engine’s `RareLabelEncoder()`. By setting `tol` to `0.05`, we retained categories present in more than 5% of the observations. By setting `n_categories` to `4`, we only grouped rare categories in variables with more than four unique values. With the `fit()` method, the transformer identified the categorical variables and then learned and stored their frequent categories. With the `transform()` method, the transformer replaced infrequent categories with the `"Rare"` string.
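To make the roles of the tolerance and the minimum number of categories more concrete, here is a rough pandas-only emulation of the fit and transform logic described above. It is a sketch under stated assumptions, not Feature-engine's actual implementation: the helper names, the `variables` argument, and the toy data are all hypothetical.

```python
import pandas as pd


def learn_frequent_categories(df, variables, tol=0.05, n_categories=4):
    """For each variable, store the categories that should be kept as they are."""
    frequent = {}
    for var in variables:
        if df[var].nunique() > n_categories:
            freqs = df[var].value_counts(normalize=True)
            frequent[var] = list(freqs[freqs > tol].index)
        else:
            # Too few categories: treat all of them as frequent (nothing is grouped).
            frequent[var] = list(df[var].unique())
    return frequent


def apply_grouping(df, frequent, replace_with="Rare"):
    """Replace categories that are not in the learned lists with replace_with."""
    out = df.copy()
    for var, keep in frequent.items():
        out[var] = out[var].where(out[var].isin(keep), replace_with)
    return out


# Toy data (made up): 'city' has five categories, 'size' has only three.
train = pd.DataFrame({
    "city": ["London"] * 50 + ["Leeds"] * 30 + ["Bristol"] * 17
            + ["York"] * 2 + ["Bath"] * 1,
    "size": ["S"] * 49 + ["M"] * 49 + ["L"] * 2,
})
test = pd.DataFrame({"city": ["London", "York", "Hull"], "size": ["L", "S", "M"]})

frequent = learn_frequent_categories(train, ["city", "size"], tol=0.05, n_categories=4)
test_enc = apply_grouping(test, frequent)

print(frequent)
print(test_enc)
```

In this toy example, `L` is present in only 2% of the rows of `size`, yet it is left untouched because `size` has fewer distinct values than the threshold.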

# Performing binary encoding

Binary encoding is a categorical encoding technique that uses binary code – that is, a sequence of zeroes and ones – to represent the different categories of the variable. How does it work? First, the categories are arbitrarily replaced with ordinal numbers, as shown in the intermediate step of the following table. Then, those numbers are converted into binary code. For example, writing the least significant digit first, integer 1 can be represented as the sequence 10, integer 2 as 01, integer 3 as 11, and integer 0 as 00. The digits in the two positions of the binary string become the columns, which are the encoded representations of the original variable:

Figure 2.9 – Table showing the steps required for binary encoding of the color variable
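The two steps can be reproduced by hand with pandas. The following is a minimal sketch on a made-up `color` series; the category-to-integer assignment (order of first appearance) and the column names are assumptions for illustration. The digits are written least significant first, so that the codes match those quoted above (1 becomes 10, 2 becomes 01):

```python
import pandas as pd

# Toy variable (made up) with four distinct categories.
colors = pd.Series(["blue", "red", "yellow", "green", "red"], name="color")

# Step 1: arbitrarily replace each category with an integer
# (here, in order of first appearance: blue=0, red=1, yellow=2, green=3).
categories = list(dict.fromkeys(colors))
to_int = {cat: code for code, cat in enumerate(categories)}
codes = colors.map(to_int)

# Number of binary digits needed to represent the largest integer.
n_bits = max(1, (len(categories) - 1).bit_length())

# Step 2: convert the integers into binary digits, one column per digit
# (least significant digit first).
encoded = pd.DataFrame(
    {f"color_{pos}": (codes // 2**pos) % 2 for pos in range(n_bits)}
)

print(pd.concat([colors, codes.rename("code"), encoded], axis=1))
```

Each distinct category ends up with its own combination of zeroes and ones across the two columns, which is what allows a model to tell the categories apart.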

Binary encoding encodes the data in fewer dimensions than one-hot encoding. In our example, the **Color** variable would be encoded into *k-1* binary variables by one-hot encoding – that is, three variables – but with binary encoding, we can represent the variable with only two features. More generally, we determine the number of binary features needed to encode a variable as *log2(number of distinct categories)*, rounded up to the nearest integer; in our example, *log2(4) = 2* binary features.

Binary encoding is an alternative method to one-hot encoding where we do not lose information about the variable, yet we obtain fewer features after the encoding. This is particularly useful when we have high-cardinality variables. For example, if a variable contains 128 unique categories, with one-hot encoding, we would need 127 features to encode the variable, whereas with binary encoding, we would only need *7 (log2(128)=7)*. Thus, this encoding prevents the feature space from exploding. Binary-encoded features are also suitable for linear models. On the downside, the derived binary features **lack human interpretability**, so if we need to interpret the decisions made by our models, this encoding method may not be a suitable option.
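The difference in dimensionality is easy to check with a few lines of Python. This is a small sketch of the arithmetic only; the helper names are hypothetical, and a given library may return a slightly different number of columns (for example, if it reserves a code for unseen values):

```python
def n_one_hot_features(k):
    """Number of features from one-hot encoding into k-1 binary variables."""
    return k - 1


def n_binary_features(k):
    """log2(k) rounded up, computed with integers to avoid rounding issues."""
    return max(1, (k - 1).bit_length())


for k in (4, 10, 128):
    print(f"{k} categories: one-hot={n_one_hot_features(k)}, "
          f"binary={n_binary_features(k)}")
```

The gap widens quickly as the cardinality grows: the one-hot count grows linearly with the number of categories, whereas the binary count grows logarithmically.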

In this recipe, we will learn how to perform binary encoding using Category Encoders.

## How to do it...

First, let’s import the necessary Python libraries and get the dataset ready:

- Import the required Python library, function, and class:
import pandas as pd
from sklearn.model_selection import train_test_split
from category_encoders.binary import BinaryEncoder

- Let’s load the dataset and divide it into train and test sets:
- Let’s inspect the unique categories in `A7`:
X_train["A7"].unique()

In the following output, we can see that `A7` has 10 different categories:

array(['v', 'ff', 'h', 'dd', 'z', 'bb', 'j', 'Missing', 'n', 'o'], dtype=object)

- Let’s create a binary encoder to encode `A7`:
encoder = BinaryEncoder(cols=["A7"], drop_invariant=True)

Tip

`BinaryEncoder()`, like the other encoders from the Category Encoders package, allows us to select the variables to encode. We simply pass the column names in a list to the `cols` argument.

- Let’s fit the transformer to the train set so that it calculates how many binary variables it needs and creates the variable-to-binary code representations:
encoder.fit(X_train)

- Finally, let’s encode `A7` in the train and test sets:
X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)

We can display the top rows of the transformed train set by executing `print(X_train_enc.head())`, which returns the following output:

Figure 2.10 – DataFrame with the variables after binary encoding

Binary encoding returned four binary variables for `A7`, which are `A7_0`, `A7_1`, `A7_2`, and `A7_3`, instead of the nine that would have been returned by one-hot encoding.

## How it works...

In this recipe, we performed binary encoding using the Category Encoders package. First, we loaded the dataset and divided it into train and test sets using `train_test_split()` from scikit-learn. Next, we used `BinaryEncoder()` to encode the `A7` variable. With the `fit()` method, `BinaryEncoder()` created a mapping from each category to a set of binary columns, and with the `transform()` method, the encoder encoded the `A7` variable in both the train and test sets.
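To illustrate the idea of learning the mapping on the train set and reusing it on the test set, here is a hand-rolled sketch. The `SimpleBinaryEncoder` class is hypothetical and is not how Category Encoders is implemented; in particular, reserving integer 0 for categories not seen during fitting and writing the least significant digit first are assumptions made for this example, as is the toy data:

```python
import pandas as pd


class SimpleBinaryEncoder:
    """Hypothetical, minimal binary encoder for one column (illustration only)."""

    def __init__(self, col):
        self.col = col

    def fit(self, X):
        # Integers start at 1; 0 is reserved for categories not seen during fit.
        cats = list(dict.fromkeys(X[self.col]))
        self.mapping_ = {cat: code for code, cat in enumerate(cats, start=1)}
        # Digits needed to represent the largest integer code.
        self.n_bits_ = len(cats).bit_length()
        return self

    def transform(self, X):
        codes = X[self.col].map(self.mapping_).fillna(0).astype(int)
        out = X.drop(columns=[self.col])
        for pos in range(self.n_bits_):
            # One column per binary digit, least significant digit first.
            out[f"{self.col}_{pos}"] = (codes // 2**pos) % 2
        return out


# Toy data (made up) reusing the ten category names of A7.
train = pd.DataFrame({
    "A2": [float(i) for i in range(10)],
    "A7": ["v", "ff", "h", "dd", "z", "bb", "j", "Missing", "n", "o"],
})
test = pd.DataFrame({"A2": [1.0, 2.0, 3.0], "A7": ["h", "v", "new"]})

enc = SimpleBinaryEncoder(col="A7").fit(train)
train_enc = enc.transform(train)
test_enc = enc.transform(test)

print(train_enc)
print(test_enc)
```

With ten categories learned from the train set, this sketch also needs four binary columns, and the unseen `new` category in the toy test set is encoded as all zeroes.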

Tip

With one-hot encoding, we would have created nine binary variables (`k-1 = 10 unique categories - 1 = 9`) to encode all of the information in `A7`. With binary encoding, we can represent the variable in fewer dimensions: `log2(10)=3.3`, which we round up; that is, we only need four binary variables.

## See also

For more information about `BinaryEncoder()`, visit https://contrib.scikit-learn.org/category_encoders/binary.html.

For a nice example of the output of binary encoding, check out the following resource: https://stats.stackexchange.com/questions/325263/binary-encoding-vs-one-hot-encoding.

For a comparative study of categorical encoding techniques for neural network classifiers, visit https://www.researchgate.net/publication/320465713_A_Comparative_Study_of_Categorical_Variable_Encoding_Techniques_for_Neural_Network_Classifiers.