Preprocessing data
Data preprocessing transforms raw data into a usable and efficient format. It is often one of the most important steps in the data mining and machine learning (ML) process, because the quality of the input data limits the quality of anything built on top of it.
When we preprocess data, we are cleaning it, transforming it, or reducing it. In this section, we will take a look at what each of these means.
Data cleaning
Data cleaning refers to the process of detecting and fixing problems in a dataset, such as missing, duplicated, mistyped, or noisy values. Cleaning a dataset, especially a very large one, can speed up the algorithms that run on it, help avoid errors, and lead to better results. There are a few things we deal with when cleaning data:
- Missing data: Address this by removing, imputing, or using domain-specific methods to handle missing values
- Duplicate data: Detect and remove duplicates to ensure each observation is unique
- Data types: Use appropriate functions to convert data types as needed
- Noisy data: This can be reduced by using binning, regression...
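The four items above can be sketched with pandas. This is a minimal illustration, not a recommended pipeline: the toy dataset, the column names, the clean() helper, the choice of median imputation, and the use of three equal-width bins are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy dataset, invented purely for illustration.
raw = pd.DataFrame({
    "age": ["25", "32", None, "32", "47", "51"],        # numbers stored as text
    "city": ["Oslo", "Lima", "Pune", "Lima", None, "Oslo"],
    "income": [30500.0, 42000.0, 38000.0, 42000.0, 55000.0, 61000.0],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply one possible version of each cleaning step to a copy of df."""
    # Duplicate data: keep only the first copy of each identical row.
    out = df.drop_duplicates().copy()

    # Missing data (removal): drop rows that have no city at all.
    out = out.dropna(subset=["city"])

    # Data types: convert the text column to numbers; bad values become NaN.
    out["age"] = pd.to_numeric(out["age"], errors="coerce")

    # Missing data (imputation): fill missing ages with the median age.
    out["age"] = out["age"].fillna(out["age"].median())

    # Noisy data (binning): assign each income to one of three equal-width
    # bins, then replace it with the mean of its bin (smoothing by bin means).
    bin_ids = pd.cut(out["income"], bins=3, labels=False)
    out["income_smoothed"] = out.groupby(bin_ids)["income"].transform("mean")

    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Whether to drop or impute a missing value, and how many bins to use, depends on the dataset and the downstream model, so the choices here should be read as placeholders.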