Introduction to Dimensionality Reduction
Dimensionality reduction is fundamental to building robust ML pipelines, on par with the data preprocessing techniques explored in Chapter 2. It involves reducing the number of features (or dimensions) in a dataset while retaining as much relevant information as possible. There are several reasons to perform this task, including simplifying data, reducing computational cost, and improving model performance. Early in their careers, many data scientists are keen to throw every available feature into model training, but even though more data is better in some regards, features that aren’t related to our problem domain or don’t improve our model simply add garbage (remember: garbage in, garbage out)! So, why is dimensionality reduction one of the most important tasks in data preprocessing? Let’s look more closely.
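As a concrete illustration of "fewer features, most of the information kept," here is a minimal sketch of principal component analysis built from NumPy's SVD (the synthetic data and the choice of PCA are assumptions for illustration, not necessarily the techniques this chapter covers):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic dataset: 200 samples with 10 features, but only 2 underlying
# factors actually drive the data (plus a little noise).
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

# PCA via SVD: center the data, decompose, and keep the top-k components.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
k = 2
X_reduced = X_centered @ Vt[:k].T

# Fraction of total variance retained by the k kept components.
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
print(X.shape, X_reduced.shape)  # (200, 10) (200, 2)
print(f"variance retained: {explained:.3f}")
```

Here the 10 observed features are compressed to 2 dimensions while nearly all of the variance survives, which is exactly the trade-off dimensionality reduction aims for.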
Why Dimensionality Reduction is Essential
We often think that more data is always a good thing in ML...