Practical Exercises on Data Preprocessing
In this chapter, we’ve covered several methods commonly applied to data preprocessing. Now it’s time to put it all together! Can you guess what tool might be helpful for this exercise? You got it: the Pipeline() class!
How to do it…
For these exercises, we will use a publicly available dataset, California Housing, which is available through the scikit-learn library. The dataset contains 20,640 records and 8 features; the target value (what we are trying to predict with our model) is the median house value for each California district, expressed in units of $100,000.
You are tasked with building a comprehensive data pipeline composed of steps you learned in this chapter. In the Jupyter Notebook for Chapter 2, you will find an incomplete code block at the end called “Comprehensive Pipeline” where you should add your code to complete the following:
- Load the California Housing dataset
- Split the data
- Create a comprehensive pipeline with at least three steps...
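If you get stuck, the steps above can be sketched roughly as follows. This is only one possible arrangement, not the notebook's reference solution: the choice of SimpleImputer, StandardScaler, and LinearRegression as the three steps, the helper names build_pipeline(), fit_and_score(), and main(), and the 80/20 split with random_state=42 are all assumptions made for illustration.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline():
    """Return a three-step pipeline: impute, scale, then fit a regressor.

    The step choices are illustrative assumptions; swap in any
    transformers covered in this chapter.
    """
    return Pipeline(steps=[
        # Fills missing values with the column median. The scikit-learn
        # copy of California Housing has no missing values, so this step
        # is kept only to show where imputation would go.
        ("imputer", SimpleImputer(strategy="median")),
        # Standardizes each feature to zero mean and unit variance.
        ("scaler", StandardScaler()),
        # Final estimator; any regressor could take this slot.
        ("model", LinearRegression()),
    ])


def fit_and_score(X, y, test_size=0.2, random_state=42):
    """Split the data, fit the pipeline on the training part, and
    return the fitted pipeline with its R^2 score on the test part."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    pipe = build_pipeline()
    # Fitting on the training split only keeps test-set statistics
    # out of the imputer and the scaler.
    pipe.fit(X_train, y_train)
    return pipe, pipe.score(X_test, y_test)


def main():
    # Downloads the dataset on first use, then reads it from a local cache.
    X, y = fetch_california_housing(return_X_y=True)
    pipe, r2 = fit_and_score(X, y)
    print(f"Test R^2: {r2:.3f}")
```

Calling main() in a notebook cell runs the whole flow end to end. Wrapping the preprocessing steps and the model in a single Pipeline means that fit() learns the imputation and scaling parameters from the training split only, and score() applies those same parameters to the test split.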