Handling Missing Data
Missing data can arise from many sources, such as human error, technical failures, or data corruption. Missing values should be addressed before training ML models: most algorithms cannot handle them directly, and most scikit-learn estimators raise an error when they encounter missing values in the training data. With a large enough dataset, we can sometimes drop the records that contain missing values with little impact on the resulting model, but this isn't always viable. Thankfully, scikit-learn provides several strategies for imputing missing values, allowing practitioners to fill in gaps with estimates based on the available data.
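As a minimal sketch of the two options just described, the snippet below drops incomplete records with pandas and, alternatively, fills the gaps with scikit-learn's SimpleImputer using the column mean. The tiny DataFrame and its column names ("a" and "b") are hypothetical and exist only for illustration; mean imputation is just one of the available strategies.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical four-row dataset with one missing value per column.
df = pd.DataFrame(
    {
        "a": [1.0, 2.0, np.nan, 4.0],
        "b": [10.0, np.nan, 30.0, 40.0],
    }
)

# Option 1: drop every record that contains at least one missing value.
dropped = df.dropna()

# Option 2: replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(dropped.shape)  # number of rows and columns left after dropping
print(imputed)
```

Dropping discards half of this toy dataset, which illustrates why it is only attractive when missing values are rare relative to the number of records.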
Getting ready
To begin, we will create a toy dataset of random quantitative data with ten features and several missing values scattered randomly throughout. We will then store the dataset in a pandas DataFrame object for better readability.
Load libraries
import numpy as np
import pandas as...