Encoding Categorical Variables
Categorical variables are a common feature in many datasets, representing discrete values such as categories, labels, or groups. However, most ML algorithms operate on numbers, so categorical data must be converted into a numerical format before it can be used.
Categorical variables can be divided into two main types:
- Nominal Variables: These represent categories without any intrinsic ordering (e.g., color, brand).
- Ordinal Variables: These have a clear ordering among categories (e.g., ratings from 1 to 5).
Choosing the right encoding method depends on the type of categorical variable and the specific requirements of the ML algorithm being used.
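To make the distinction concrete, here is a minimal sketch of how the two types might be encoded using only NumPy. The `colors` and `ratings` arrays, and the `rating_order` mapping, are hypothetical values chosen for illustration and are not part of the dataset built later in this section. One-hot encoding is shown for the nominal feature and integer (ordinal) encoding for the ordered one; other encodings may suit a given algorithm better.

```python
import numpy as np

# Hypothetical example values (assumed for illustration only)
colors = np.array(["red", "green", "blue", "green"])   # nominal: no ordering
ratings = np.array(["low", "high", "medium", "low"])   # ordinal: low < medium < high

# One-hot encoding for the nominal feature: one column per category.
# np.unique returns the sorted categories and, for each record,
# the index of its category in that sorted array.
color_categories, color_codes = np.unique(colors, return_inverse=True)
one_hot = np.eye(len(color_categories), dtype=int)[color_codes]

# Ordinal encoding: map each level to an integer that preserves the order.
# The mapping is written by hand because alphabetical sorting would not
# reflect the intended order of the levels.
rating_order = {"low": 0, "medium": 1, "high": 2}
ordinal = np.array([rating_order[r] for r in ratings])

print(color_categories)
print(one_hot)
print(ordinal)
```

Applying the integer mapping to a nominal feature such as color would imply an ordering that does not exist, which is why the two types are usually handled differently.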
Getting ready
To begin, as we did earlier, we will create a toy dataset, only this time our features will consist of qualitative data.
- Load libraries
import numpy as np
- Create sample categorical data with 20 records
np.random.seed(2024)  # for reproducibility...