Encoding Categorical Variables
Categorical variables are those whose values are selected from a group of categories or labels. For example, the Gender variable with the values of Male and Female is categorical, and so is the marital status variable with the values of never married, married, divorced, and widowed. In some categorical variables, the labels have an intrinsic order; for example, in the Student’s grade variable, the values of A, B, C, and Fail are ordered, with A being the highest grade and Fail being the lowest. These are called ordinal categorical variables. Variables in which the categories do not have an intrinsic order are called nominal categorical variables, such as the City variable, with the values of London, Manchester, Bristol, and so on.
The values of categorical variables are often encoded as strings. To train mathematical or machine learning models, we need to transform those strings into numbers. The act of replacing strings with numbers is called categorical encoding. In this chapter, we will discuss multiple categorical encoding methods.
This chapter will cover the following recipes:
- Creating binary variables through one-hot encoding
- Performing one-hot encoding of frequent categories
- Replacing categories with counts or the frequency of observations
- Replacing categories with ordinal numbers
- Performing ordinal encoding based on the target value
- Implementing target mean encoding
- Encoding with the Weight of Evidence
- Grouping rare or infrequent categories
- Performing binary encoding
Technical requirements
In this chapter, we will use the pandas, NumPy, and Matplotlib Python libraries, as well as scikit-learn and Feature-engine. For guidelines on how to obtain these libraries, visit the Technical requirements section of Chapter 1, Imputing Missing Data.
We will also use the open-source Category Encoders Python library, which can be installed using pip:
pip install category_encoders
To learn more about Category Encoders, visit the following link: https://contrib.scikit-learn.org/category_encoders/.
We will also use the Credit Approval dataset, which is available in the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/credit+approval.
To prepare the dataset, follow these steps:
- Visit http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/ and click on crx.data to download the data:

Figure 2.1 – The index directory for the Credit Approval dataset
- Save crx.data to the folder where you will run the following commands.
After downloading the data, open up a Jupyter Notebook and run the following commands.
- Import the required libraries:
import random
import numpy as np
import pandas as pd
- Load the data:
data = pd.read_csv("crx.data", header=None)
- Create a list containing the variable names:
varnames = [f"A{s}" for s in range(1, 17)]
- Add the variable names to the DataFrame:
data.columns = varnames
- Replace the question marks in the dataset with NumPy NaN values:
data = data.replace("?", np.nan)
- Cast some numerical variables as float data types:
data["A2"] = data["A2"].astype("float")
data["A14"] = data["A14"].astype("float")
- Encode the target variable as binary:
data["A16"] = data["A16"].map({"+": 1, "-": 0})
- Rename the target variable:
data.rename(columns={"A16": "target"}, inplace=True)
- Make lists that contain categorical and numerical variables:
cat_cols = [c for c in data.columns if data[c].dtypes == "O"]
num_cols = [c for c in data.columns if data[c].dtypes != "O"]
- Fill in the missing data:
data[num_cols] = data[num_cols].fillna(0)
data[cat_cols] = data[cat_cols].fillna("Missing")
- Save the prepared data:
data.to_csv("credit_approval_uci.csv", index=False)
You can find a Jupyter Notebook that contains these commands in this book’s GitHub repository at https://github.com/PacktPublishing/Python-Feature-Engineering-Cookbook-Second-Edition/blob/main/ch02-categorical-encoding/donwload-prepare-store-credit-approval-dataset.ipynb.
Note
Some libraries require that you have already imputed missing data, for which you can use any of the recipes from Chapter 1, Imputing Missing Data.
Creating binary variables through one-hot encoding
In one-hot encoding, we represent a categorical variable as a group of binary variables, where each binary variable represents one category. The binary variable takes a value of 1 if the category is present in an observation, or 0 otherwise.
The following table shows the one-hot encoded representation of the Gender variable with the categories of Male and Female:

Figure 2.2 – One-hot encoded representation of the Gender variable
As shown in Figure 2.2, from the Gender variable, we can derive the binary variable of Female, which shows the value of 1 for females, or the binary variable of Male, which takes the value of 1 for the males in the dataset.
For the categorical variable of Color with the values of red, blue, and green, we can create three variables called red, blue, and green. These variables will take the value of 1 if the observation is red, blue, or green, respectively, or 0 otherwise.
A categorical variable with k unique categories can be encoded using k-1 binary variables. For Gender, k is 2 as it contains two labels (male and female), so we only need to create one binary variable (k - 1 = 1) to capture all of the information. For the Color variable, which has three categories (k=3; red, blue, and green), we need to create two (k - 1 = 2) binary variables to capture all the information so that the following occurs:
- If the observation is red, it will be captured by the red variable (red = 1, blue = 0).
- If the observation is blue, it will be captured by the blue variable (red = 0, blue = 1)
- If the observation is green, it will be captured by the combination of red and blue (red = 0, blue = 0)
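As a quick illustration, here is a minimal sketch of the idea using a made-up Color column (the data is hypothetical and not part of the Credit Approval dataset used later in this chapter):
import pandas as pd

df = pd.DataFrame({"Color": ["red", "blue", "green", "blue"]})
# drop_first=True drops the first category alphabetically (blue);
# blue is implied when the remaining dummies, green and red, are both 0
pd.get_dummies(df["Color"], drop_first=True)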
Encoding into k-1 binary variables is well-suited for linear models. There are a few occasions in which we may prefer to encode the categorical variables with k binary variables:
- When training decision trees since they do not evaluate the entire feature space at the same time
- When selecting features recursively
- When determining the importance of each category within a variable
In this recipe, we will compare the one-hot encoding implementations of pandas, scikit-learn, Feature-engine, and Category Encoders.
How to do it...
First, let’s make a few imports and get the data ready:
- Import pandas and the train_test_split function from scikit-learn:
import pandas as pd
from sklearn.model_selection import train_test_split
- Let’s load the Credit Approval dataset:
data = pd.read_csv("credit_approval_uci.csv")
- Let’s separate the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=["target"], axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
- Let’s inspect the unique categories of the A4 variable:
X_train["A4"].unique()
We can see the unique values of A4
in the following output:
array(['u', 'y', 'Missing', 'l'], dtype=object)
- Let’s encode A4 into k-1 binary variables using pandas and then inspect the first five rows of the resulting DataFrame:
dummies = pd.get_dummies(X_train["A4"], drop_first=True)
dummies.head()
Note
With pandas get_dummies(), we can either ignore or encode missing data through the dummy_na parameter. By setting dummy_na=True, missing data will be encoded in a new binary variable. To encode the variable into k dummies, use drop_first=False instead.
Here, we can see the output of step 5, where each label is now a binary variable:
     l  u  y
596  0  1  0
303  0  1  0
204  0  0  1
351  0  0  1
118  0  1  0
- Now, let’s encode all of the categorical variables into k-1 binaries, capturing the result in a new DataFrame:
X_train_enc = pd.get_dummies(X_train, drop_first=True)
X_test_enc = pd.get_dummies(X_test, drop_first=True)
Note
The get_dummies() method from pandas will automatically encode all variables of the object or categorical type. We can encode a subset of the variables by passing the variable names in a list to the columns parameter.
Note
When encoding more than one variable, get_dummies() captures the variable name – say, A1 – and places an underscore followed by the category name to identify the resulting binary variables.
We can see the binary variables in the following output:

Figure 2.3 – Transformed DataFrame showing the dummy variables on the right
Note
The get_dummies()
method will create one binary variable per seen category. Hence, if there are more categories in the train set than in the test set, get_dummies()
will return more columns in the transformed train set than in the transformed test set, and vice versa. To avoid this, it is better to carry out one-hot encoding with scikit-learn or Feature-engine, as we will discuss later in this recipe.
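If you do stick with pandas, one common workaround (our own suggestion, not a step of this recipe) is to align the test set columns to the train set columns after encoding:
# Add any dummy columns missing from the test set (filled with 0) and drop extra ones
X_test_enc = X_test_enc.reindex(columns=X_train_enc.columns, fill_value=0)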
- Let’s concatenate the binary variables to the original dataset:
X_test_enc = pd.concat([X_test, X_test_enc], axis=1)
- Now, let’s drop the categorical variables from the data:
X_test_enc.drop( labels=X_test_enc.select_dtypes( include="O").columns, axis=1, inplace=True, )
And that’s it! Now, we can use our categorical variables to train mathematical models. To inspect the result, use X_test_enc.head().
Now, let’s do one-hot encoding using scikit-learn.
- Import the encoder from scikit-learn:
from sklearn.preprocessing import OneHotEncoder
- Let’s set up the transformer. By setting drop to "first", we encode into k-1 binary variables, and by setting sparse to False, the transformer will return a NumPy array (instead of a sparse matrix):
encoder = OneHotEncoder(drop="first", sparse=False)
Tip
We can encode variables into k dummies by setting the drop parameter to None. We can also encode into k-1 if a variable contains two categories and into k if it contains more than two categories by setting the drop parameter to "if_binary". The latter is useful because encoding binary variables into k dummies is redundant.
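For instance, a minimal sketch of this option, keeping the same sparse setting used in this recipe:
# k-1 dummies for variables with two categories, k dummies for the rest
encoder_if_binary = OneHotEncoder(drop="if_binary", sparse=False)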
- First, let’s create a list containing the variable names:
vars_categorical = X_train.select_dtypes( include="O").columns.to_list()
- Let’s fit the encoder to a slice of the train set with the categorical variables:
encoder.fit(X_train[vars_categorical])
- Let’s inspect the categories for which dummy variables will be created:
encoder.categories_
We can see the result of the preceding command here:

Figure 2.4 – Arrays with the categories that will be encoded into binary variables, one array per variable
Note
Scikit-learn’s OneHotEncoder() will only encode the categories learned from the train set. If there are new categories in the test set, we can instruct the encoder to ignore them or to return an error by setting the handle_unknown parameter to 'ignore' or 'error', respectively.
- Let’s create the NumPy arrays with the binary variables for the train and test sets:
X_train_enc = encoder.transform(X_train[vars_categorical])
X_test_enc = encoder.transform(X_test[vars_categorical])
- Let’s extract the names of the binary variables:
encoder.get_feature_names_out()
We can see the binary variable names that were returned in the following output:

Figure 2.5 – Arrays with the names of the one-hot encoded variables
- Let’s convert the array into a pandas DataFrame and add the variable names:
X_test_enc = pd.DataFrame(X_test_enc)
X_test_enc.columns = encoder.get_feature_names_out()
- To concatenate the one-hot encoded data to the original dataset, we need to make their indexes match:
X_test_enc.index = X_test.index
Now, we are ready to concatenate the one-hot encoded variables to the original data and then remove the categorical variables using steps 8 and 9 from this recipe.
To follow up, let’s perform one-hot encoding with Feature-engine.
- Let’s import the encoder from Feature-engine:
from feature_engine.encoding import OneHotEncoder
- Next, let’s set up the encoder so that it returns k-1 binary variables:
ohe_enc = OneHotEncoder(drop_last=True)
Tip
Feature-engine automatically finds the categorical variables. To encode only a subset of the variables, we can pass the variable names in a list: OneHotEncoder(variables=["A1", "A4"]). To encode numerical variables, we can set the ignore_format parameter to True or cast the variables as the object type. This is useful because sometimes, numerical variables are used to represent categories, such as postcodes.
- Let’s fit the encoder to the train set so that it learns the categories and variables to encode:
ohe_enc.fit(X_train)
- Let’s explore the variables that will be encoded:
ohe_enc.variables_
The transformer found and stored the variables of the object or categorical type, as shown in the following output:
['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']
Note
Feature-engine’s OneHotEncoder() has the option to encode most variables into k dummies, while only returning k-1 dummies for binary variables. For this behavior, set the drop_last_binary parameter to True.
The following dictionary contains the categories that will be encoded in each variable:
{'A1': ['a', 'b'], 'A4': ['u', 'y', 'Missing'], 'A5': ['g', 'p', 'Missing'], 'A6': ['c', 'q', 'w', 'ff', 'm', 'i', 'e', 'cc', 'x', 'd', 'k', 'j', 'Missing', 'aa'], 'A7': ['v', 'ff', 'h', 'dd', 'z', 'bb', 'j', 'Missing', 'n'], 'A9': ['t'], 'A10': ['t'], 'A12': ['t'], 'A13': ['g', 's']}
- Let’s encode the categorical variables in train and test sets:
X_train_enc = ohe_enc.transform(X_train)
X_test_enc = ohe_enc.transform(X_test)
Tip
Feature-engine’s OneHotEncoder()
returns a copy of the original dataset plus the binary variables and without the original categorical variables. Thus, this data is ready to train machine learning models.
If we execute X_train_enc.head(), we will see the following DataFrame:

Figure 2.6 – Transformed DataFrame with the one-hot encoded variables on the right
Note how the A4 categorical variable was replaced with A4_u, A4_y, and so on.
Note
We can get the names of all the variables in the transformed dataset by executing ohe_enc.get_feature_names_out().
How it works...
In this recipe, we performed a one-hot encoding of categorical variables using pandas, scikit-learn, Feature-engine, and Category Encoders.
With get_dummies()
from pandas, we automatically created binary variables for each of the categories in the categorical variables.
The OneHotEncoder transformers from the scikit-learn and Feature-engine libraries share the fit() and transform() methods. With fit(), the encoders learned the categories for which the dummy variables should be created. With transform(), they returned the binary variables either in a NumPy array or added them to the original DataFrame.
Tip
One-hot encoding expands the feature space. From nine original categorical variables, we created 36 binary ones. If our datasets contain many categorical variables or highly cardinal variables, we will easily increase the feature space dramatically, which increases the computational cost of training machine learning models or obtaining their predictions and may also deteriorate their performance.
There’s more...
We can also perform one-hot encoding using OneHotEncoder() from the Category Encoders library.
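A minimal sketch of that alternative might look as follows (relying on the library’s default of encoding all object-type columns; the use_cat_names setting simply appends the category names to the new columns):
from category_encoders import OneHotEncoder as CEOneHotEncoder

ce_ohe = CEOneHotEncoder(use_cat_names=True)
X_train_enc = ce_ohe.fit_transform(X_train)
X_test_enc = ce_ohe.transform(X_test)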
OneHotEncoder() from Feature-engine and Category Encoders can automatically identify and encode categorical variables – that is, those of the object or categorical type. So does pandas get_dummies(). Scikit-learn’s OneHotEncoder(), on the other hand, will encode all variables in the dataset.
With pandas, Feature-engine, and Category Encoders, we can encode only a subset of the variables by indicating their names in a list. With scikit-learn, we need to use an additional class, ColumnTransformer(), to slice the data before the transformation.
With Feature-engine and Category Encoders, the dummy variables are added to the original dataset and the categorical variables are removed after the encoding. With scikit-learn and pandas, we need to manually perform these procedures.
Finally, using OneHotEncoder() from scikit-learn, Feature-engine, and Category Encoders, we can perform the encoding step within a scikit-learn pipeline, which is more convenient if we have various feature engineering steps or want to put the pipelines into production. pandas get_dummies(), on the other hand, is well suited for data analysis and visualization.
Performing one-hot encoding of frequent categories
One-hot encoding represents each variable’s category with a binary variable. Hence, one-hot encoding of highly cardinal variables or datasets with multiple categorical features can expand the feature space dramatically. This, in turn, may increase the computational cost of using machine learning models or deteriorate their performance. To reduce the number of binary variables, we can perform one-hot encoding of the most frequent categories. One-hot encoding the top categories is equivalent to treating the remaining, less frequent categories as a single, unique category.
In this recipe, we will implement one-hot encoding of the most popular categories using pandas and Feature-engine.
How to do it...
First, let’s import the necessary Python libraries and get the dataset ready:
- Import the required Python libraries, functions, and classes:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from feature_engine.encoding import OneHotEncoder
- Let’s load the dataset and divide it into train and test sets:
data = pd.read_csv("credit_approval_uci.csv")
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=["target"], axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
Tip
The most frequent categories need to be determined in the train set. This is to avoid data leakage.
- Let’s inspect the unique categories of the A6 variable:
X_train["A6"].unique()
The unique values of A6
are displayed in the following output:
array(['c', 'q', 'w', 'ff', 'm', 'i', 'e', 'cc', 'x', 'd', 'k', 'j', 'Missing', 'aa', 'r'], dtype=object)
- Let’s count the number of observations per category of A6, sort them in decreasing order, and then display the five most frequent categories:
X_train["A6"].value_counts().sort_values(ascending=False).head(5)
We can see the five most frequent categories and the number of observations per category in the following output:
c     93
q     56
w     48
i     41
ff    38
Name: A6, dtype: int64
- Now, let’s capture the most frequent categories of A6 in a list by using the code in step 4 inside a list comprehension:
top_5 = [
    x for x in X_train["A6"].value_counts().sort_values(
        ascending=False).head(5).index
]
- Now, let’s add a binary variable per top category to the train and test sets:
for label in top_5:
    X_train[f"A6_{label}"] = np.where(X_train["A6"] == label, 1, 0)
    X_test[f"A6_{label}"] = np.where(X_test["A6"] == label, 1, 0)
- Let’s display the top 10 rows of the original and encoded variable, A6, in the train set:
X_train[["A6"] + [f"A6_{label}" for label in top_5]].head(10)
In the output of step 7, we can see the A6
variable, followed by the binary variables:
    A6  A6_c  A6_q  A6_w  A6_i  A6_ff
596  c     1     0     0     0      0
303  q     0     1     0     0      0
204  w     0     0     1     0      0
351 ff     0     0     0     0      1
118  m     0     0     0     0      0
247  q     0     1     0     0      0
652  i     0     0     0     1      0
513  e     0     0     0     0      0
230 cc     0     0     0     0      0
250  e     0     0     0     0      0
We can automate one-hot encoding of frequent categories with Feature-engine. First, let’s load and divide the dataset, as we did in step 2.
- Let’s set up the one-hot encoder to encode the five most frequent categories of the A6 and A7 variables:
ohe_enc = OneHotEncoder(
    top_categories=5,
    variables=["A6", "A7"],
)
Tip
Feature-engine’s OneHotEncoder()
will encode all categorical variables in the dataset by default unless we specify the variables to encode, as we did in step 8.
- Let’s fit the encoder to the train set so that it learns and stores the most frequent categories of A6 and A7:
ohe_enc.fit(X_train)
Note
The number of frequent categories to encode is arbitrarily determined by the user.
- Finally, let’s encode A6 and A7 in the train and test sets:
X_train_enc = ohe_enc.transform(X_train)
X_test_enc = ohe_enc.transform(X_test)
You can view the new binary variables in the DataFrame by executing X_train_enc.head(). You can also find the top five categories learned by the encoder by executing ohe_enc.encoder_dict_.
Note
Feature-engine replaces the original variable with the binary ones returned by one-hot encoding, leaving the dataset ready to use in machine learning.
How it works...
In this recipe, we performed one-hot encoding of the five most popular categories using pandas, NumPy, and Feature-engine.
In the first part of this recipe, we worked with the A6 categorical variable. We inspected its unique categories with pandas unique(). Next, we counted the number of observations per category using pandas value_counts(), which returned a pandas Series with the categories as the index and the number of observations as values. We then sorted the categories from the one with the most to the one with the least observations using pandas sort_values() and reduced the Series to the five most popular categories with pandas head(). Then, we used this Series in a list comprehension to capture the names of the most frequent categories. After that, we looped over each category, and with NumPy’s where() function, we created binary variables by placing a value of 1 if the observation showed the category, or 0 otherwise.
To perform a one-hot encoding of the five most popular categories of the A6 and A7 variables with Feature-engine, we used OneHotEncoder(), indicating 5 in the top_categories argument and passing the variable names in a list to the variables argument. With fit(), the encoder learned the top categories from the train set and stored them in its encoder_dict_ attribute. Then, with transform(), OneHotEncoder() replaced the original variables with the set of binary ones.
There’s more...
This recipe is based on the winning solution of the KDD 2009 cup, Winning the KDD Cup Orange Challenge with Ensemble Selection (http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf), where the authors limited one-hot encoding to the 10 most frequent categories of each variable.
Replacing categories with counts or the frequency of observations
In count or frequency encoding, we replace the categories with the count or the fraction of observations showing that category. That is, if 10 out of 100 observations show the category blue for the Color variable, we would replace blue with 10 when doing count encoding, or with 0.1 if performing frequency encoding. These encoding methods, which capture the representation of each label in a dataset, are very popular in data science competitions. The assumption is that the number of observations per category is somewhat predictive of the target.
Tip
Note that if two different categories are present in the same number of observations, they will be replaced by the same value, which leads to information loss.
In this recipe, we will perform count and frequency encoding using pandas, Feature-engine, and Category Encoders.
How to do it...
Let’s begin by making some imports and preparing the data:
- Import pandas and the required function:
import pandas as pd
from sklearn.model_selection import train_test_split
- Let’s load the dataset and divide it into train and test sets:
data = pd.read_csv("credit_approval_uci.csv")
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=["target"], axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
- Let’s count the number of observations per category of the A7 variable and capture it in a dictionary:
counts = X_train["A7"].value_counts().to_dict()
Tip
To encode categories with their frequency, execute X_train["A7"].value_counts(normalize=True).to_dict().
If we execute print(counts), we can observe the count of observations per category:
{'v': 277, 'h': 101, 'ff': 41, 'bb': 39, 'z': 7, 'dd': 5, 'j': 5, 'Missing': 4, 'n': 3, 'o': 1}
- Let’s replace the categories in A7 with the counts:
X_train["A7"] = X_train["A7"].map(counts)
X_test["A7"] = X_test["A7"].map(counts)
Go ahead and inspect the data by executing X_train.head()
to corroborate that the categories have been replaced by the counts.
Now, let’s carry out count encoding using Feature-engine. First, let’s load and divide the dataset, as we did in step 2.
- Let’s import the count encoder from Feature-engine:
from feature_engine.encoding import CountFrequencyEncoder
- Let’s set up the encoder so that it encodes all categorical variables with the count of observations:
count_enc = CountFrequencyEncoder( encoding_method="count", variables=None, )
Tip
CountFrequencyEncoder()
will automatically find and encode all categorical variables in the train set. To encode only a subset of the variables, we can pass the variable names in a list to the variables
argument.
- Let’s fit the encoder to the train set so that it stores the number of observations per category per variable:
count_enc.fit(X_train)
Tip
The dictionaries with the category-to-counts pairs are stored in the encoder_dict_ attribute and can be displayed by executing count_enc.encoder_dict_.
- Finally, let’s replace the categories with counts in the train and test sets:
X_train_enc = count_enc.transform(X_train)
X_test_enc = count_enc.transform(X_test)
Tip
If there are categories in the test set that were not present in the train set, the transformer will replace those with np.nan
and return a warning to make you aware of this. A good idea to prevent this behavior is to group infrequent labels, as described in the Grouping rare or infrequent categories recipe.
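A simple safeguard on top of the transformation (our own workaround, not a Feature-engine option) is to fill any resulting NaN values with 0 after transforming the test set:
X_test_enc = count_enc.transform(X_test).fillna(0)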
The encoder returns pandas DataFrames with the strings of the categorical variables replaced with the counts of observations, leaving the variables ready to use in machine learning models.
To wrap up this recipe, let’s encode the variables using Category Encoders.
- Let’s import the encoder from Category Encoders:
from category_encoders.count import CountEncoder
- Let’s set up the encoder so that it encodes all categorical variables with the count of observations:
count_enc = CountEncoder(cols=None)
Note
CountEncoder() automatically finds and encodes all categorical variables in the train set. To encode only a subset of the categorical variables, we can pass the variable names in a list to the cols argument. To replace the categories with their frequency instead, we need to set the normalize parameter to True.
- Let’s fit the encoder to the train set so that it counts and stores the number of observations per category per variable:
count_enc.fit(X_train)
Tip
The values used to replace the categories are stored in the mapping attribute and can be displayed by executing count_enc.mapping.
- Finally, let’s replace the categories with counts in the train and test sets:
X_train_enc = count_enc.transform(X_train)
X_test_enc = count_enc.transform(X_test)
Note
Categories present in the test set that were not seen in the train set are referred to as unknown categories. CountEncoder()
has different options to handle unknown categories, including returning an error, treating them as missing data, or replacing them with an indicated integer. CountEncoder()
can also automatically group categories with few observations.
The encoder returns pandas DataFrames with the strings of the categorical variables replaced with the counts of observations, leaving the variables ready to use in machine learning models.
How it works...
In this recipe, we replaced categories by the count of observations using pandas, Feature-engine, and Category Encoders.
Using pandas value_counts(), we determined the number of observations per category of the A7 variable, and with pandas to_dict(), we captured these values in a dictionary, where each key was a unique category, and each value the number of observations for that category. With pandas map() and using this dictionary, we replaced the categories with the observation counts in both the train and test sets.
To perform count encoding with Feature-engine, we used CountFrequencyEncoder() and set encoding_method to 'count'. We left the variables argument set to None so that the encoder automatically finds all of the categorical variables in the dataset. With the fit() method, the transformer found the categorical variables and stored the observation counts per category in the encoder_dict_ attribute. With the transform() method, the transformer replaced the categories with the counts, returning a pandas DataFrame.
Finally, we performed count encoding with CountEncoder() by setting normalize to False. We left the cols argument set to None so that the encoder automatically finds the categorical variables in the dataset. With the fit() method, the transformer found the categorical variables and stored the category-to-count mappings in the mapping attribute. With the transform() method, the transformer replaced the categories with the counts, returning a pandas DataFrame.
Replacing categories with ordinal numbers
Ordinal encoding consists of replacing the categories with digits from 1 to k (or 0 to k-1, depending on the implementation), where k is the number of distinct categories of the variable. The numbers are assigned arbitrarily. Ordinal encoding is better suited for non-linear machine learning models, which can navigate through the arbitrarily assigned digits to find patterns that relate to the target.
In this recipe, we will perform ordinal encoding using pandas, scikit-learn, and Feature-engine.
How to do it...
First, let’s import the necessary Python libraries and prepare the dataset:
- Import pandas and the data split function:
import pandas as pd
from sklearn.model_selection import train_test_split
- Let’s load the dataset and divide it into train and test sets:
data = pd.read_csv("credit_approval_uci.csv")
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=["target"], axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
- To encode the A7 variable, let’s make a dictionary of category-to-integer pairs:
ordinal_mapping = {
    k: i for i, k in enumerate(X_train["A7"].unique(), 0)
}
If we execute print(ordinal_mapping), we will see the digits that will replace each category:
{'v': 0, 'ff': 1, 'h': 2, 'dd': 3, 'z': 4, 'bb': 5, 'j': 6, 'Missing': 7, 'n': 8, 'o': 9}
- Now, let’s replace the categories with numbers in the original variables:
X_train["A7"] = X_train["A7"].map(ordinal_mapping)
X_test["A7"] = X_test["A7"].map(ordinal_mapping)
With print(X_train["A7"].head(10)), we can see the result of the preceding operation, where the original categories were replaced by numbers:
596    0
303    0
204    0
351    1
118    0
247    2
652    0
513    3
230    0
250    4
Name: A7, dtype: int64
Next, let’s carry out ordinal encoding using scikit-learn. First, we need to divide the data into train and test sets, as we did in step 2.
- Let’s import the required classes:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
Tip
Do not confuse OrdinalEncoder()
with LabelEncoder()
from scikit-learn. The former is intended to encode predictive features, whereas the latter is intended to modify the target variable.
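As a quick aside, here is a minimal sketch of LabelEncoder() applied to a target (using a hypothetical string target, since the target in this dataset is already numeric):
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_example = pd.Series(["approved", "rejected", "approved"])
y_example_enc = le.fit_transform(y_example)  # array([0, 1, 0])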
- Let’s set up the encoder:
enc = OrdinalEncoder()
Note
Scikit-learn’s OrdinalEncoder() will encode the entire dataset. To encode only a selection of variables, we need to use scikit-learn’s ColumnTransformer().
- Let’s make a list containing the categorical variables to encode:
vars_categorical = X_train.select_dtypes( include="O").columns.to_list()
- Let’s make a list containing the remaining variables:
vars_remainder = X_train.select_dtypes( exclude="O").columns.to_list()
- Now, let’s set up ColumnTransformer() to encode the categorical variables. By setting the remainder parameter to "passthrough", we make ColumnTransformer() concatenate the variables that are not encoded at the back of the encoded features:
ct = ColumnTransformer(
    [("encoder", enc, vars_categorical)],
    remainder="passthrough",
)
- Let’s fit the encoder to the train set so that it creates and stores representations of categories to digits:
ct.fit(X_train)
By executing ct.named_transformers_["encoder"].categories_, you can visualize the unique categories per variable.
- Now, let’s encode the categorical variables in the train and test sets:
X_train_enc = ct.transform(X_train)
X_test_enc = ct.transform(X_test)
Remember that scikit-learn returns a NumPy array.
- Let’s transform the arrays into pandas DataFrames by adding the columns:
X_train_enc = pd.DataFrame(
    X_train_enc, columns=vars_categorical + vars_remainder)
X_test_enc = pd.DataFrame(
    X_test_enc, columns=vars_categorical + vars_remainder)
Note
Note that, with ColumnTransformer(), the variables that were not encoded will be returned to the right of the DataFrame, following the encoded variables. You can visualize the output of step 12 with X_train_enc.head().
Now, let’s do ordinal encoding with Feature-engine. First, we must divide the dataset, as we did in step 2.
- Let’s import the encoder:
from feature_engine.encoding import OrdinalEncoder
- Let’s set up the encoder so that it replaces categories with arbitrary integers in the categorical variables specified in step 7:
enc = OrdinalEncoder(encoding_method="arbitrary", variables=vars_categorical)
Note
Feature-engine’s OrdinalEncoder() automatically finds and encodes all categorical variables if the variables parameter is left set to None. Alternatively, it will encode the variables indicated in the list. In addition, Feature-engine’s OrdinalEncoder() can assign the integers according to the target mean value (see the Performing ordinal encoding based on the target value recipe).
- Let’s fit the encoder to the train set so that it learns and stores the category-to-integer mappings:
enc.fit(X_train)
Tip
The category-to-integer mappings are stored in the encoder_dict_ attribute and can be accessed by executing enc.encoder_dict_.
- Finally, let’s encode the categorical variables in the train and test sets:
X_train_enc = enc.transform(X_train)
X_test_enc = enc.transform(X_test)
Feature-engine returns pandas DataFrames where the values of the original variables are replaced with numbers, leaving the DataFrame ready to use in machine learning models.
How it works...
In this recipe, we replaced categories with integers assigned arbitrarily.
With pandas unique(), we returned the unique values of the A7 variable, and using Python’s dictionary comprehension syntax, we created a dictionary of key-value pairs, where each key was one of the A7 variable’s unique categories, and each value was the digit that would replace the category. Finally, we used pandas map() to replace the strings in A7 with the integers.
Next, we carried out ordinal encoding using scikit-learn’s OrdinalEncoder() and used ColumnTransformer() to select the columns to encode. With the fit() method, the transformer created the category-to-integer mappings based on the categories in the train set. With the transform() method, the categories were replaced with integers, returning a NumPy array. ColumnTransformer() sliced the DataFrame into the categorical variables to encode, and then concatenated the remaining variables at the right of the encoded features.
To perform ordinal encoding with Feature-engine, we used OrdinalEncoder(), indicating that the integers should be assigned arbitrarily in encoding_method and passing a list with the variables to encode in the variables argument. With the fit() method, the encoder assigned integers to each variable’s categories, which were stored in the encoder_dict_ attribute. These mappings were then used by the transform() method to replace the categories in the train and test sets, returning DataFrames.
There’s more...
You can also carry out ordinal encoding with OrdinalEncoder() from Category Encoders.
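A minimal sketch with Category Encoders might look like this (relying on the library’s default of encoding all object-type columns):
from category_encoders import OrdinalEncoder as CEOrdinalEncoder

ord_enc = CEOrdinalEncoder(cols=None)
X_train_enc = ord_enc.fit_transform(X_train)
X_test_enc = ord_enc.transform(X_test)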
The transformers from Feature-engine and Category Encoders can automatically identify and encode categorical variables – that is, those of the object or categorical type. They also allow us to encode only a subset of the variables.
Scikit-learn’s transformer, on the other hand, will encode all variables in the dataset. To encode just a subset, we need to use an additional class, ColumnTransformer(), to slice the data before the transformation.
Feature-engine and Category Encoders return pandas DataFrames, whereas scikit-learn returns NumPy arrays.
Finally, each class has additional functionality. For example, with scikit-learn, we can encode only a subset of the categories, whereas Feature-engine allows us to replace categories with integers that are assigned based on the target mean value. On the other hand, Category Encoders can automatically handle missing data and offers alternative options to work with unseen categories.
Performing ordinal encoding based on the target value
In the previous recipe, we replaced categories with integers, which were assigned arbitrarily. We can also assign integers to the categories given the target values. To do this, first, we must calculate the mean value of the target per category. Next, we must order the categories from the one with the lowest to the one with the highest target mean value. Finally, we must assign digits to the ordered categories, starting with 0 to the first category up to k-1 to the last category, where k is the number of distinct categories.
This encoding method creates a monotonic relationship between the categorical variable and the response and therefore makes the variables more adequate for use in linear models.
In this recipe, we will encode categories while following the target value using pandas and Feature-engine.
How to do it...
First, let’s import the necessary Python libraries and get the dataset ready:
- Import the required Python libraries, functions, and classes:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
- Let’s load the dataset and divide it into train and test sets:
data = pd.read_csv("credit_approval_uci.csv")
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=["target"], axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
- Let’s determine the mean target value per category in A7, then sort the categories from that with the lowest to that with the highest target value:
y_train.groupby(X_train["A7"]).mean().sort_values()
The following is the output of the preceding command:
A7
o          0.000000
ff         0.146341
j          0.200000
dd         0.400000
v          0.418773
bb         0.512821
h          0.603960
n          0.666667
z          0.714286
Missing    1.000000
Name: target, dtype: float64
- Now, let’s repeat the computation in step 3, but this time, let’s retain the ordered category names:
ordered_labels = y_train.groupby( X_train["A7"]).mean().sort_values().index
To display the output of the preceding command, we can execute print(ordered_labels):
Index(['o', 'ff', 'j', 'dd', 'v', 'bb', 'h', 'n', 'z', 'Missing'], dtype='object', name='A7')
- Let’s create a dictionary of category-to-integer pairs, using the ordered list we created in step 4:
ordinal_mapping = { k: i for i, k in enumerate( ordered_labels, 0) }
We can visualize the result of the preceding code by executing print(ordinal_mapping):
{'o': 0, 'ff': 1, 'j': 2, 'dd': 3, 'v': 4, 'bb': 5, 'h': 6, 'n': 7, 'z': 8, 'Missing': 9}
- Let’s use the dictionary we created in step 5 to replace the categories in A7 in the train and test sets, returning the encoded features as new columns:
X_train["A7_enc"] = X_train["A7"].map(ordinal_mapping)
X_test["A7_enc"] = X_test["A7"].map(ordinal_mapping)
Tip
Note that if the test set contains a category not present in the train set, the preceding code will introduce np.nan.
To better understand the monotonic relationship concept, let’s plot the relationship of the categories of the A7
variable with the target before and after the encoding.
- Let’s plot the mean target response per category of the A7 variable:
y_train.groupby(X_train["A7"]).mean().plot()
plt.title("Relationship between A7 and the target")
plt.ylabel("Mean of target")
plt.show()
We can see the non-monotonic relationship between categories of A7
and the target in the following plot:

Figure 2.7 – Relationship between the categories of A7 and the target
- Let’s plot the mean target value per category in the encoded variable:
y_train.groupby(X_train["A7_enc"]).mean().plot()
plt.title("Relationship between A7 and the target")
plt.ylabel("Mean of target")
plt.show()
The encoded variable shows a monotonic relationship with the target – the higher the mean target value, the higher the digit assigned to the category:

Figure 2.8 – Relationship between A7 and the target after the encoding
Now, let’s perform ordered ordinal encoding using Feature-engine. First, we must divide the dataset into train and test sets, as we did in step 2.
- Let’s import the encoder:
from feature_engine.encoding import OrdinalEncoder
- Next, let’s set up the encoder so that it assigns integers by following the target value to all categorical variables in the dataset:
ordinal_enc = OrdinalEncoder( encoding_method="ordered", variables=None)
Tip
OrdinalEncoder()
will find and encode all categorical variables automatically. Alternatively, we can indicate which variables to encode by passing their names in a list to the variables argument.
- Let’s fit the encoder to the train set so that it finds the categorical variables, and then stores the category and integer mappings:
ordinal_enc.fit(X_train, y_train)
Tip
When fitting the encoder, we need to pass the train set and the target, like with many scikit-learn predictor classes.
- Finally, let’s replace the categories with numbers in the train and test sets:
X_train_enc = ordinal_enc.transform(X_train)
X_test_enc = ordinal_enc.transform(X_test)
Tip
A list of the categorical variables is stored in the variables_ attribute of OrdinalEncoder(), and the dictionaries with the category-to-integer mappings in the encoder_dict_ attribute.
Go ahead and check the monotonic relationship between other encoded categorical variables and the target by using the code in step 7 and changing the variable name in the groupby()
method.
How it works...
In this recipe, we replaced the categories with integers according to the target mean.
In the first part of this recipe, we worked with the A7 categorical variable. With pandas groupby(), we grouped the data based on the categories of A7, and with pandas mean(), we determined the mean value of the target for each of the categories of A7. Next, we ordered the categories with pandas sort_values() from the ones with the lowest to the ones with the highest target mean response. The output of this operation was a pandas Series, with the categories as indices and the target mean as values. With pandas index, we captured the ordered categories in an array; then, with Python dictionary comprehension, we created a dictionary of category-to-integer pairs. Finally, we used this dictionary to replace the categories with integers using pandas map() in the train and test sets.
Then, we plotted the relationship of the original and encoded variables with the target to visualize the monotonic relationship after the transformation. We determined the mean target value per category of A7 using pandas groupby(), followed by pandas mean(), as described in the preceding paragraph. We followed up with pandas plot() to create a plot of category versus target mean value. We added a title and a y-axis label with Matplotlib’s title() and ylabel() functions.
To perform the encoding with Feature-engine, we used OrdinalEncoder() and indicated "ordered" in the encoding_method argument. We left the variables argument set to None so that the encoder automatically detects all categorical variables in the dataset. With the fit() method, the encoder found the categorical variables to encode and assigned digits to their categories, according to the target mean value. The variables to encode and the dictionaries with category-to-digit pairs were stored in the variables_ and encoder_dict_ attributes, respectively. Finally, using the transform() method, the transformer replaced the categories with digits in the train and test sets, returning pandas DataFrames.
See also
For an implementation of this recipe with Category Encoders, visit this book’s GitHub repository.
Implementing target mean encoding
Mean encoding or target encoding maps each category to the probability estimate of the target attribute. If the target is binary, the numerical mapping is the posterior probability of the target conditioned to the value of the category. If the target is continuous, the numerical representation is given by the expected value of the target given the value of the category.
In its simplest form, the numerical representation for each category is given by the mean value of the target variable for a particular category group. For example, if we have a City variable, with the categories of London, Manchester, and Bristol, and we want to predict the default rate (the target takes values 0 and 1); if the default rate for London is 30%, we replace London with 0.3; if the default rate for Manchester is 20%, we replace Manchester with 0.2; and so on. If the target is continuous – say we want to predict income – then we would replace London, Manchester, and Bristol with the mean income earned in each city.
In mathematical terms, if the target is binary, the replacement value, S, is determined like so:
S_i = \frac{n_{i(Y=1)}}{n_i}
Here, the numerator is the number of observations with a target value of 1 for category i and the denominator is the number of observations with a category value of i.
If the target is continuous, S, this is determined by the following formula:
S_i = \frac{\sum_{j \in i} y_j}{n_i}
Here, the numerator is the sum of the target across observations in category i and ni is the total number of observations in category i.
These formulas provide a good approximation of the target estimate if there is a sufficiently large number of observations with each category value – in other words, if ni is large. However, in most datasets, categorical variables will only have categorical values present in a few observations. In these cases, target estimates derived from the precedent formulas can be unreliable.
To mitigate poor estimates returned for rare categories, the target estimates can be determined as a mixture of two probabilities: those returned by the preceding formulas and the prior probability of the target based on the entire training set. The two probabilities are blended using a weighting factor, which is a function of the category group size:
S_i = \lambda(n_i)\,\frac{n_{i(Y=1)}}{n_i} + \bigl(1 - \lambda(n_i)\bigr)\,\frac{n_Y}{N}
In this formula, ny is the total number of cases where the target takes a value of 1, N is the size of the train set, and 𝛌 is the weighting factor.
When the category group is large, 𝛌 approximates 1, so more weight is given to the first term of the equation. When the category group size is small, then 𝛌 tends to 0, so the estimate is mostly driven by the second term of the equation – that is, the target’s prior probability. In other words, if the group size is small, knowing the value of the category does not tell us anything about the value of the target.
The weighting factor, 𝛌, is a function of the group size, k, and a smoothing parameter, f, controls the rate of transition between the first and second term of the preceding equation:
\lambda(n_i) = \frac{1}{1 + e^{-\frac{n_i - k}{f}}}
Here, k is half of the minimal size for which we fully trust the first term of the equation. The f parameter is selected by the user either arbitrarily or with optimization.
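To make these formulas concrete, here is a minimal pandas sketch of the blended estimate (the A7 variable and the values of k and f are arbitrary choices for illustration, and X_train and y_train are train set objects like those created in the How to do it... section):
import numpy as np

k, f = 25, 1.0                          # half-trust group size and smoothing rate
prior = y_train.mean()                  # prior probability of the target
stats = y_train.groupby(X_train["A7"]).agg(["mean", "count"])
lam = 1 / (1 + np.exp(-(stats["count"] - k) / f))
smoothed = lam * stats["mean"] + (1 - lam) * prior
X_train["A7_enc"] = X_train["A7"].map(smoothed)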
Tip
Mean encoding was designed to encode highly cardinal categorical variables without expanding the feature space. For more details, check out the following article: Micci-Barreca D. A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. ACM SIGKDD Explorations Newsletter, 2001.
In this recipe, we will perform mean encoding using pandas, Feature-engine, and Category Encoders.
How to do it...
In the first part of this recipe, we will replace categories with the target mean value, regardless of the number of observations per category. We will use pandas and Feature-engine to do this. In the second part of this recipe, we will introduce the weighting factor using Category Encoders. Let’s begin with this recipe:
- Import pandas and the data split function:
import pandas as pd
from sklearn.model_selection import train_test_split
- Let’s load the dataset and divide it into train and test sets:
data = pd.read_csv("credit_approval_uci.csv")
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=["target"], axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
- Let’s determine the mean target value per category of the A7 variable and then store them in a dictionary:
mapping = y_train.groupby(X_train["A7"]).mean().to_dict()
We can display the content of the dictionary by executing print(mapping):
{'Missing': 1.0, 'bb': 0.5128205128205128, 'dd': 0.4, 'ff': 0.14634146341463414, 'h': 0.6039603960396039, 'j': 0.2, 'n': 0.6666666666666666, 'o': 0.0, 'v': 0.4187725631768953, 'z': 0.7142857142857143}
- Let’s replace the categories with the mean target value using the dictionary we created in step 3 in the train and test sets:
X_train["A7"] = X_train["A7"].map(mapping)
X_test["A7"] = X_test["A7"].map(mapping)
You can inspect the encoded A7 variable by executing X_train["A7"].head().
Now, let’s perform target encoding with Feature-engine. First, we must split the data, as we did in step 2.
- Let’s import the encoder:
from feature_engine.encoding import MeanEncoder
- Let’s set up the target mean encoder to encode all categorical variables:
mean_enc = MeanEncoder(variables=None)
Tip
MeanEncoder()
will find and encode all categorical variables by default. Alternatively, we can indicate the variables to encode by passing their names in a list to the variables argument.
- Let’s fit the transformer to the train set so that it learns and stores the mean target value per category per variable. Note that we need to pass both the train set and target to fit the encoder:
mean_enc.fit(X_train, y_train)
- Finally, let’s encode the train and test sets:
X_train_enc = mean_enc.transform(X_train)
X_test_enc = mean_enc.transform(X_test)
Tip
The category-to-number pairs are stored as a dictionary of dictionaries in the encoder_dict_ attribute. To display the stored parameters, execute mean_enc.encoder_dict_.
Feature-engine returns pandas DataFrames containing the categorical variables, ready to use in machine learning models.
To wrap up, let’s implement mean encoding with Category Encoders blending the probabilities.
- Let’s import the encoder:
from category_encoders.target_encoder import TargetEncoder
- Let’s set up the encoder so that it encodes all categorical variables using blended probabilities when there are less than 25 observations in the category group:
mean_enc = TargetEncoder( cols=None, min_samples_leaf=25, smoothing=1.0 )
Tip
TargetEncoder()
finds categorical variables automatically by default. Alternatively, we can indicate the variables to encode by passing their names in a list to the cols
argument. The smoothing
parameter controls the blend of the prior and posterior probability. Higher values decrease the contribution of the posterior probability to the encoding.
- Let’s fit the transformer to the train set so that it learns and stores the numerical representations for each category:
mean_enc.fit(X_train, y_train)
Note
The min_samples_leaf parameter refers to the minimum number of observations per category that a group should have to solely use the posterior probability. It is the equivalent of k in our weighting factor formula. In the original article, k was set to ½ of min_samples_leaf. Category Encoders exposes this value and thus, we can optimize it with cross-validation.
- Finally, let’s encode the train and test sets:
X_train_enc = mean_enc.transform(X_train)
X_test_enc = mean_enc.transform(X_test)
Category Encoders returns pandas DataFrames by default, where the original categorical variable values are replaced by their numerical representation. You can inspect the results by executing X_train_enc.head().
How it works…
In this recipe, we replaced the categories with the mean target value using pandas, Feature-engine, and Category Encoders.
With pandas groupby(), using the A7 categorical variable, followed by pandas mean() over the target variable, we created a pandas Series with the categories as indices and the target mean as values. With pandas to_dict(), we converted this Series into a dictionary. Finally, we used this dictionary to replace the categories in the train and test sets using pandas map().
To perform the encoding with Feature-engine, we used MeanEncoder(). With fit(), the transformer found and stored the categorical variables and the mean target value per category. With transform(), categories were replaced with numbers in the train and test sets, returning pandas DataFrames.
Finally, we used TargetEncoder() from Category Encoders to replace categories with a blend of prior and posterior probability estimates of the target. We set min_samples_leaf to 25, which meant that if a category group had 25 observations or more, then the posterior probability was used for the encoding; otherwise, a blend of probabilities was used. With fit(), the transformer found the categorical variables and the numerical representation of the categories, while with transform(), the categories were replaced with numbers, returning pandas DataFrames with their encoded values.
There’s more…
There is an alternative way to return better target estimates when the category groups are small. The replacement value for each category is determined as follows:
S_i = \frac{n_{i(Y=1)} + p_Y \, m}{n_i + m}
Here, ni(Y=1) is the number of observations with a target value of 1 for category i and ni is the number of observations with category i. The target prior is given by pY and m is the weighting factor. With this adjustment, the only parameter that we have to set is the weight, m. If m is large, then more importance is given to the target’s prior probability. This adjustment affects target estimates for all categories but mostly for those with fewer observations because, in such cases, m could be much larger than ni in the formula’s denominator.
For an implementation of this encoding using MEstimateEncoder(), visit this book’s GitHub repository.
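For reference, a minimal sketch with MEstimateEncoder() could look like this (the value of m is an arbitrary choice for illustration):
from category_encoders import MEstimateEncoder

m_enc = MEstimateEncoder(cols=None, m=10.0)  # larger m gives more weight to the prior
X_train_enc = m_enc.fit_transform(X_train, y_train)
X_test_enc = m_enc.transform(X_test)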
Encoding with the Weight of Evidence
The Weight of Evidence (WoE) was developed primarily for credit and financial industries to facilitate variable screening and exploratory analysis and to build more predictive linear models to evaluate the risk of loan defaults.
The WoE is computed from the basic odds ratio:
WoE = \log\left(\frac{p(\text{positive})}{p(\text{negative})}\right)
Here, positive and negative refer to the values of the target being 1 or 0, respectively. The proportion of positive cases per category is determined as the sum of positive cases per category group divided by the total positive cases in the training set, and the proportion of negative cases per category is determined as the sum of negative cases per category group divided by the total number of negative observations in the training set.
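For example, with made-up proportions, if a category accounts for 20% of all positive cases but only 10% of all negative cases, then:
WoE = \log\left(\frac{0.20}{0.10}\right) = \log 2 \approx 0.69
The positive value indicates that observations in this category are more likely than average to show the positive outcome.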
The WoE has the following characteristics:
- WoE = 0 if p(positive) / p(negative) = 1; that is, if the outcome is random
- WoE > 0 if p(positive) > p(negative)
- WoE < 0 if p(negative) > p(positive)
This allows us to directly visualize the predictive power of the category in the variable: the higher the WoE, the more likely the event will occur. If the WoE is positive, the event is likely to occur.
Logistic regression models a binary response, Y, based on X predictor variables, assuming that there is a linear relationship between X and the log of odds of Y:
\log\left(\frac{p(Y=1)}{p(Y=0)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n
Here, log(p(Y=1)/p(Y=0)) is the log of odds. As you can see, the WoE encodes the categories in the same scale – that is, the log of odds – as the outcome of the logistic regression.
Therefore, by using WoE, the predictors are prepared and coded on the same scale, and the parameters in the logistic regression model – that is, the coefficients – can be directly compared.
In this recipe, we will perform WoE encoding using pandas and Feature-engine.
How to do it...
Let’s begin by making some imports and preparing the data:
- Import the required libraries and functions:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
- Let’s load the dataset and divide it into train and test sets:
data = pd.read_csv("credit_approval_uci.csv")
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=["target"], axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
- Let’s get the inverse of the target values to be able to calculate the negative cases:
neg_y_train = pd.Series( np.where(y_train == 1, 0, 1), index=y_train.index )
- Let’s determine the number of observations where the target variable takes a value of 1 or 0:
total_pos = y_train.sum()
total_neg = neg_y_train.sum()
- Now, let’s calculate the numerator and denominator of the WoE’s formula, which we discussed earlier in this recipe:
pos = y_train.groupby( X_train["A1"]).sum() / total_pos neg = neg_y_train.groupby( X_train["A1"]).sum() / total_neg
- Now, let’s calculate the WoE per category:
woe = np.log(pos/neg)
We can display the series with the category to WoE pairs by executing print(woe)
:
A1 Missing 0.203599 a 0.092373 b -0.042410 dtype: float64
- Finally, let’s replace the categories of
A1
with the WoE:X_train["A1"] = X_train["A1"].map(woe) X_test["A1"] = X_test["A1"].map(woe)
You can inspect the encoded variable by executing X_train["A1"].head()
.
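If you want to apply the same pandas logic to several variables at once, a possible helper is sketched below; the column list is illustrative, it reuses the splits from step 2, and it assumes that every category has at least one positive and one negative case in the train set (otherwise the division or the logarithm is undefined):

```python
import numpy as np
import pandas as pd


def woe_encode(X_train, X_test, y_train, columns):
    """Replace categories with their WoE, learned from the train set."""
    X_train, X_test = X_train.copy(), X_test.copy()
    neg_y = pd.Series(np.where(y_train == 1, 0, 1), index=y_train.index)
    total_pos, total_neg = y_train.sum(), neg_y.sum()
    for col in columns:
        # Proportion of positive and negative cases per category.
        pos = y_train.groupby(X_train[col]).sum() / total_pos
        neg = neg_y.groupby(X_train[col]).sum() / total_neg
        woe = np.log(pos / neg)
        X_train[col] = X_train[col].map(woe)
        X_test[col] = X_test[col].map(woe)
    return X_train, X_test


# Illustrative usage with the variables we encode later in this recipe.
X_train_enc, X_test_enc = woe_encode(
    X_train, X_test, y_train, ["A1", "A9", "A12"]
)
```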
Now, let’s perform WoE encoding using Feature-engine. First, we need to separate the data into train and test sets, as we did in step 2.
- Let’s import the encoder:
from feature_engine.encoding import WoEEncoder
- Next, let’s set up the encoder so that we can encode three categorical variables:
woe_enc = WoEEncoder(variables = ["A1", "A9", "A12"])
Tip
Feature-engine’s WoEEncoder()
will return an error if p(0)=0
for any category because the division by 0
is not defined. To avoid this error, we can group infrequent categories, as we will discuss in the next recipe, Grouping rare or infrequent categories; a sketch that combines both steps in a pipeline is shown right after the last step of this section.
- Let’s fit the transformer to the train set so that it learns and stores the WoE of the different categories:
woe_enc.fit(X_train, y_train)
Tip
We can display the dictionaries with the categories to WoE pairs by executing woe_enc.encoder_dict_
.
- Finally, let’s encode the three categorical variables in the train and test sets:
X_train_enc = woe_enc.transform(X_train) X_test_enc = woe_enc.transform(X_test)
Feature-engine returns pandas DataFrames containing the encoded categorical variables ready to use in machine learning models.
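As mentioned in the earlier tip, if some variables contain categories without positive or negative cases, one option is to group rare labels before computing the WoE. A minimal sketch that chains both Feature-engine transformers in a scikit-learn pipeline could look like this; the variable list and the parameter values are illustrative:

```python
from sklearn.pipeline import Pipeline
from feature_engine.encoding import RareLabelEncoder, WoEEncoder

vars_to_encode = ["A1", "A9", "A12"]  # illustrative selection

encoding_pipe = Pipeline([
    # Group categories present in less than 5% of the observations first...
    ("rare", RareLabelEncoder(tol=0.05, n_categories=2, variables=vars_to_encode)),
    # ...so that the WoE encoder is less likely to find empty category groups.
    ("woe", WoEEncoder(variables=vars_to_encode)),
])

encoding_pipe.fit(X_train, y_train)
X_train_enc = encoding_pipe.transform(X_train)
X_test_enc = encoding_pipe.transform(X_test)
```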
How it works...
First, with pandas sum()
, we determined the total number of positive and negative cases. Next, using pandas groupby()
, we determined the fraction of positive and negative cases per category. And with that, we calculated the WoE per category.
Finally, we automated the procedure with Feature-engine. We used WoEEncoder()
, which learned the WoE per category with the fit()
method, and then used transform()
, which replaced the categories with the corresponding numbers.
See also
For an implementation of WoE with Category Encoders, visit this book’s GitHub repository.
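As a starting point, a minimal sketch with Category Encoders’ WOEEncoder might look like the following; the column list is illustrative, and note that this encoder applies some regularization by default, so its output can differ slightly from the values we computed by hand earlier in this recipe:

```python
from category_encoders import WOEEncoder

# Illustrative column selection; fit on the train set only.
woe_ce = WOEEncoder(cols=["A1", "A9", "A12"])

woe_ce.fit(X_train, y_train)
X_train_enc = woe_ce.transform(X_train)
X_test_enc = woe_ce.transform(X_test)
```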
Grouping rare or infrequent categories
Rare categories are those present in only a small fraction of the observations. There is no strict rule to determine how small that fraction should be, but typically, any category present in less than 5% of the observations can be considered rare.
Infrequent labels often appear only in the train set or only in the test set, making models prone to overfitting or unable to score some observations. In addition, when encoding categories to numbers, we only create mappings for the categories observed in the train set, so we won’t know how to encode new labels. To avoid these complications, we can group infrequent categories into a single category called Rare or Other.
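To see why unseen labels are a problem when mappings are learned from the train set only, consider the following small, made-up example, in which a label that appears only in the test set ends up unencoded (NaN):

```python
import pandas as pd

# Hypothetical train and test values of a categorical variable.
train_col = pd.Series(["v", "h", "v", "bb"])
test_col = pd.Series(["v", "o"])  # "o" was never seen during training

# Mapping learned from the train set only (here, category counts).
mapping = train_col.value_counts().to_dict()

print(test_col.map(mapping))
# 0    2.0
# 1    NaN   <- no mapping exists for the unseen label "o"
# dtype: float64
```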
In this recipe, we will group infrequent categories using pandas and Feature-engine.
How to do it...
First, let’s import the necessary Python libraries and get the dataset ready:
- Import the necessary Python libraries, functions, and classes:
import numpy as np import pandas as pd from sklearn.model_selection import train_test_split from feature_engine.encoding import RareLabelEncoder
- Let’s load the dataset and divide it into train and test sets:
data = pd.read_csv("credit_approval_uci.csv") X_train, X_test, y_train, y_test = train_test_split( data.drop(labels=["target"], axis=1), data["target"], test_size=0.3, random_state=0, )
- Let’s capture the fraction of observations per category in
A7
in a variable:freqs = X_train["A7"].value_counts(normalize=True)
We can see the percentage of observations per category of A7
, expressed as decimals, in the following output after executing print(freqs)
:
v 0.573499 h 0.209110 ff 0.084886 bb 0.080745 z 0.014493 dd 0.010352 j 0.010352 Missing 0.008282 n 0.006211 o 0.002070 Name: A7, dtype: float64
If we consider those labels present in less than 5% of the observations as rare, then z
, dd
, j
, Missing
, n
, and o
are rare categories.
- Let’s create a list containing the names of the categories present in more than 5% of the observations:
frequent_cat = [ x for x in freqs.loc[freqs > 0.05].index.values]
If we execute print(frequent_cat)
, we will see the frequent categories of A7
:
['v', 'h', 'ff', 'bb'].
- Let’s replace rare labels – that is, those present in <= 5% of the observations – with the
"
Rare"
string:X_train["A7"] = np.where( X_train["A7"].isin(frequent_cat), X_train["A7"], "Rare" ) X_test["A7"] = np.where( X_test["A7"].isin(frequent_cat), X_test["A7"], "Rare" )
- Let’s determine the percentage of observations in the encoded variable:
X_train["A7"].value_counts(normalize=True)
We can see that the infrequent labels have now been re-grouped into the Rare
category:
v 0.573499 h 0.209110 ff 0.084886 bb 0.080745 Rare 0.051760 Name: A7, dtype: float64
Now, let’s group rare labels using Feature-engine. First, we must divide the dataset into train and test sets, as we did in step 2.
- Let’s create a rare label encoder that groups categories present in less than 5% of the observations, provided that the categorical variable has more than four distinct values:
rare_encoder = RareLabelEncoder(tol=0.05, n_categories=4)
- Let’s fit the encoder so that it finds the categorical variables and then learns their most frequent categories:
rare_encoder.fit(X_train)
Tip
Upon fitting, the transformer will raise warnings indicating that many categorical variables have four or fewer categories, so their values will not be grouped. The transformer is just letting you know that this is happening.
We can display the frequent categories per variable by executing rare_encoder.encoder_dict_
, as well as the variables that will be encoded by executing rare_encoder.variables_
.
- Finally, let’s group rare labels in the train and test sets:
X_train_enc = rare_encoder.transform(X_train) X_test_enc = rare_encoder.transform(X_test)
Now that we have grouped rare labels, we are ready to encode the categorical variables, as we’ve done in other recipes in this chapter.
How it works...
In this recipe, we grouped infrequent categories using pandas and Feature-engine.
We determined the fraction of observations per category of the A7
variable using pandas value_counts()
by setting the normalize
parameter to True
. Using a list comprehension, we captured the names of the categories present in more than 5% of the observations. Finally, using NumPy’s where()
, we searched each row of A7
, and if the observation was one of the frequent categories in the list, which we checked using the pandas isin()
method, its value was kept; otherwise, its original value was replaced with "Rare"
.
We automated the preceding steps for multiple categorical variables using Feature-engine. For this, we used Feature-engine’s RareLabelEncoder()
. By setting tol
to 0.05
, we retained categories present in more than 5% of the observations. By setting n_categories
to 4
, we only group rare categories in variables with more than four unique values. With the fit()
method, the transformer identified the categorical variables and then learned and stored their frequent categories. With the transform()
method, the transformer replaced infrequent categories with the "
Rare"
string.
Performing binary encoding
Binary encoding is a categorical encoding technique that uses binary code, that is, a sequence of zeros and ones, to represent the different categories of the variable. How does it work? First, the categories are arbitrarily replaced with ordinal numbers, as shown in the intermediate step of the following table. Then, those numbers are converted into binary code. For example, integer 0 can be represented as the sequence 00, integer 1 as 01, integer 2 as 10, and integer 3 as 11. The digits in the two positions of the binary string become the columns, which are the encoded representations of the original variable:

Figure 2.9 – Table showing the steps required for binary encoding of the color variable
Binary encoding encodes the data in fewer dimensions than one-hot encoding. In our example, one-hot encoding would represent the Color variable with k-1 binary variables, that is, three features, whereas binary encoding can represent it with only two. More generally, the number of binary features needed to encode a variable is the base-2 logarithm of the number of distinct categories, rounded up to the nearest integer; in our example, log2(4) = 2 binary features.
Binary encoding is an alternative method to one-hot encoding where we do not lose information about the variable, yet we obtain fewer features after the encoding. This is particularly useful when we have highly cardinal variables. For example, if a variable contains 128 unique categories, with one-hot encoding, we would need 127 features to encode the variable, whereas with binary encoding, we would only need 7 (log2(128)=7). Thus, this encoding prevents the feature space from exploding. In addition, binary-encoded features are also suitable for linear models. On the downside, the derived binary features lack human interpretability, so if we need to interpret the decisions made by our models, this encoding method may not be a suitable option.
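To make the mechanics concrete, here is a rough sketch of the two steps, ordinal encoding followed by conversion to bits, using pandas and NumPy; the color values and column names are made up for illustration, and Category Encoders’ actual implementation may assign the integer codes and columns differently:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical variable with four colors.
colors = pd.Series(["red", "blue", "green", "yellow", "blue", "red"])

# Step 1: arbitrarily replace the categories with ordinal integers.
codes, categories = pd.factorize(colors)

# Step 2: express each integer with ceil(log2(number of categories)) bits.
n_bits = int(np.ceil(np.log2(len(categories))))
bits = (codes[:, None] >> np.arange(n_bits)[::-1]) & 1

encoded = pd.DataFrame(bits, columns=[f"color_{i}" for i in range(n_bits)])
print(encoded)  # two binary columns representing the four colors
```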
In this recipe, we will learn how to perform binary encoding using Category Encoders.
How to do it...
First, let’s import the necessary Python libraries and get the dataset ready:
- Import the required Python library, function, and class:
import pandas as pd from sklearn.model_selection import train_test_split from category_encoders.binary import BinaryEncoder
- Let’s load the dataset and divide it into train and test sets:
data = pd.read_csv("credit_approval_uci.csv") X_train, X_test, y_train, y_test = train_test_split( data.drop(labels=["target"], axis=1), data["target"], test_size=0.3, random_state=0, )
- Let’s inspect the unique categories in
A7
:X_train["A7"].unique()
In the following output, we can see that A7
has 10 different categories:
array(['v', 'ff', 'h', 'dd', 'z', 'bb', 'j', 'Missing', 'n', 'o'], dtype=object)
- Let’s create a binary encoder to encode
A7
:encoder = BinaryEncoder(cols=["A7"], drop_invariant=True)
Tip
BinaryEncoder()
, as well as other encoders from the Category Encoders package, allow us to select the variables to encode. We simply pass the column names in a list to the cols
argument.
- Let’s fit the transformer to the train set so that it calculates how many binary variables it needs and creates the variable-to-binary code representations:
encoder.fit(X_train)
- Finally, let’s encode
A7
in the train and test sets:X_train_enc = encoder.transform(X_train) X_test_enc = encoder.transform(X_test)
We can display the top rows of the transformed train set by executing print(X_train_enc.head())
, which returns the following output:

Figure 2.10 – DataFrame with the variables after binary encoding
Binary encoding returned four binary variables for A7
, which are A7_0
, A7_1
, A7_2
, and A7_3
, instead of the nine that would have been returned by one-hot encoding.
How it works...
In this recipe, we performed binary encoding using the Category Encoders package. First, we loaded the dataset and divided it into train and test sets using train_test_split()
from scikit-learn. Next, we used BinaryEncoder()
to encode the A7
variable. With the fit()
method, BinaryEncoder()
created a mapping from category to set of binary columns, and with the transform()
method, the encoder encoded the A7
variable in both the train and test sets.
Tip
With one-hot encoding, we would have created nine binary variables (k-1 = 10 unique categories - 1 = 9
) to encode all of the information in A7
. With binary encoding, we can represent the variable in fewer dimensions: log2(10) ≈ 3.3, which rounds up to 4; that is, we only need four binary variables.
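If you want to check how many binary features a variable of a given cardinality needs, a quick calculation is shown below; the cardinality of 10 corresponds to A7 in this recipe:

```python
import math

n_categories = 10  # number of unique categories in A7
print(math.ceil(math.log2(n_categories)))  # 4
```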
See also
For more information about BinaryEncoder()
, visit https://contrib.scikit-learn.org/category_encoders/binary.html.
For a nice example of the output of binary encoding, check out the following resource: https://stats.stackexchange.com/questions/325263/binary-encoding-vs-one-hot-encoding.
For a comparative study of categorical encoding techniques for neural network classifiers, visit https://www.researchgate.net/publication/320465713_A_Comparative_Study_of_Categorical_Variable_Encoding_Techniques_for_Neural_Network_Classifiers.