
You're reading from Hands-On Data Preprocessing in Python

Product type: Book
Published in: Jan 2022
Publisher: Packt
ISBN-13: 9781801072137
Edition: 1st Edition
Author (1)
Roy Jafari

Roy Jafari, Ph.D. is an assistant professor of business analytics at the University of Redlands. Roy has taught and developed college-level courses that cover data cleaning, decision making, data science, machine learning, and optimization. Roy's style of teaching is hands-on and he believes the best way to learn is to learn by doing. He uses active learning teaching philosophy and readers will get to experience active learning in this book. Roy believes that successful data preprocessing only happens when you are equipped with the most efficient tools, have an appropriate understanding of data analytic goals, are aware of data preprocessing steps, and can compare a variety of methods. This belief has shaped the structure of this book.

Chapter 11: Data Cleaning Level III – Missing Values, Outliers, and Errors

In level I, we cleaned up the table without paying attention to the data structure or the recorded values. In level II, our focus was on having a data structure that would support our analytic goal, but we still paid little attention to the correctness or appropriateness of the recorded values. That is the objective of data cleaning level III. In level III, we focus on the recorded values and take measures to address three matters. First, we will make sure that missing values in the data have been detected, that we know why they occurred, and that appropriate measures have been taken to address them. Second, we will ensure that appropriate measures have been taken so that the recorded values are correct. Third, we will ascertain that the extreme points in the data have been detected and appropriate measures...

Technical requirements

You can find all of the code and the datasets used in this book in a GitHub repository created exclusively for this book: https://github.com/PacktPublishing/Hands-On-Data-Preprocessing-in-Python. In this repository, you will find a folder titled Chapter11, from which you can download the code and the data.

Missing values

Missing values, as the name suggests, are values we expect to have but do not. In the simplest terms, missing values are empty cells in a dataset that we want to use for analytic goals. For example, the following screenshot shows a dataset with missing values: the first and third students' grade point averages (GPAs) are missing, the fifth student's height is missing, and the sixth student's personality type is missing:

Figure 11.1 – A dataset example with missing values

In Python, missing values are not shown as empty cells. In pandas and NumPy, they are most commonly represented by NaN, which is short for Not a Number. While the literal meaning of Not a Number does not capture every situation in which a value can be missing, NaN is the marker you will most often see in a pandas DataFrame where a value is missing.

The following screenshot shows a pandas DataFrame that has read and presented the table represented in Figure 11.1...
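
As a minimal sketch of how such a table looks in pandas, the snippet below builds a small DataFrame by hand and counts its missing cells. The column names and all of the values are hypothetical stand-ins, not the data in Figure 11.1; only the positions of the missing cells follow the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the positions of the missing cells follow the text.
student_df = pd.DataFrame(
    {
        "GPA": [np.nan, 3.4, np.nan, 3.1, 3.8, 2.9],
        "Height": [170.0, 165.0, 180.0, 175.0, np.nan, 160.0],
        "Personality": ["INTJ", "ENFP", "ISTP", "ESFJ", "INFP", None],
    },
    index=[f"Student {i}" for i in range(1, 7)],
)

# isna() marks every missing cell as True, whether it was stored as NaN or None.
missing_mask = student_df.isna()

# Summing the Boolean mask column by column counts the missing values per attribute.
missing_per_column = missing_mask.sum()

print(student_df)
print(missing_per_column)
```

Counting missing values per attribute in this way is usually a reasonable first step before asking why the values are missing.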

Outliers

Outliers, also known as extreme points, are data objects whose values are very different from those of the rest of the population. Being able to recognize and deal with them is important from the following three perspectives:

  • Outliers may be errors in the data, in which case they should be detected and removed.
  • Outliers that are not errors can skew the results of analytic tools that are sensitive to the existence of outliers.
  • Outliers may be fraudulent entries.

We will first go over the tools we can use to detect outliers, and then we will cover dealing with them based on the analytic situation.
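
To illustrate the second perspective, the short sketch below compares a statistic that is sensitive to outliers (the mean) with one that is much less sensitive (the median). The income figures are hypothetical and chosen only so that one value is clearly extreme.

```python
import statistics

# Hypothetical incomes (in thousands); the last value is an extreme point.
incomes = [42, 45, 47, 50, 52, 55, 400]

mean_with = statistics.mean(incomes)
median_with = statistics.median(incomes)

# The same summary statistics after setting the extreme point aside.
incomes_without = incomes[:-1]
mean_without = statistics.mean(incomes_without)
median_without = statistics.median(incomes_without)

print(f"mean:   {mean_with:.1f} with the outlier, {mean_without:.1f} without")
print(f"median: {median_with:.1f} with the outlier, {median_without:.1f} without")
```

In this made-up sample, a single extreme value roughly doubles the mean while the median barely moves, which is why the choice of analytic tool matters when outliers are present.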

Detecting outliers

The tools we use for detecting outliers depend on the number of attributes involved. If we are interested in detecting outliers only based on one attribute, we call that univariate outlier detection; if we want to detect them based on two attributes, we call that bivariate outlier detection; and finally, if we want to detect outliers based on more than two attributes...
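
As a sketch of univariate outlier detection, the snippet below applies the 1.5 × IQR rule, the same rule a standard boxplot uses to place its whiskers. Choosing this rule is an assumption made for illustration, and the helper name iqr_outliers, the multiplier k, and the height values are all hypothetical.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries of `values` that fall outside the k * IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Comparisons with NaN are False, so missing values are never flagged here.
    return values[(values < lower_fence) | (values > upper_fence)]


# Hypothetical attribute with one extreme value.
heights = pd.Series([160, 162, 165, 167, 170, 172, 175, 178, 250], name="Height")
flagged = iqr_outliers(heights)
print(flagged)
```

A rule like this only considers one attribute at a time; it says nothing about objects that look ordinary on each attribute separately but unusual in combination.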

Errors

Errors are an inevitable part of any data collection and measurement. The following formula best captures this fact:

Data = True Signal + Error

The True Signal is the reality we are trying to measure and present in the form of Data, but because of the limitations of our measurement system or data presentation, we cannot capture the True Signal exactly. Therefore, Error is the difference between the True Signal and the recorded Data.

For instance, let's say we have purchased seven thermometers and we would like to accurately calculate the room temperature using these seven thermometers. At a given point in time, we take the following readings from them:

Figure 11.37 – Seven thermometers' readings

Looking at the preceding screenshot, what would you say the temperature of the room—the True Signal—is? The answer is that we cannot measure or capture the True Signal—in this case, the exact temperature of the room. With seven thermometers, we may...
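
The relationship between Data, True Signal, and Error can be sketched with a small simulation. Unlike with real thermometers, a simulation lets us choose the True Signal ourselves; the value 21.0, the noise level, and the random seed are all assumptions made for illustration and are not the readings in Figure 11.37.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

true_signal = 21.0   # assumed room temperature in degrees Celsius
n_thermometers = 7

# Each simulated thermometer adds its own random error to the True Signal.
errors = rng.normal(loc=0.0, scale=0.5, size=n_thermometers)
data = true_signal + errors

# Error is the difference between the recorded Data and the True Signal.
recovered_errors = data - true_signal

print("readings:", np.round(data, 2))
print("errors:  ", np.round(recovered_errors, 2))
```

With real measurements the True Signal is unknown, so the errors cannot be computed this way; the sketch only shows how the three quantities relate.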

Summary

Congratulations on completing this chapter, which covered data cleaning level III. Together, we learned how to detect and deal with missing values, outliers, and errors. This may sound like too short a summary for such a long chapter, but as we saw, detecting, diagnosing, and dealing with each of the three issues (missing values, outliers, and errors) involves many details and subtleties. Finishing this chapter was a significant achievement, and you now know how to detect, diagnose, and deal with all three of these issues when working with a dataset.

This chapter concludes our three-chapter data cleaning journey. In the next chapter, we move to another important data preprocessing area: data fusion and integration. Before moving on, spend some time working on the following exercises to solidify what you have learned.

Exercises

  1. In this exercise, we will be using Temperature_data.csv. This dataset has some missing values. Do the following:

    a) After reading the file into a pandas DataFrame, check whether the dataset is level I clean, and if not, clean it. Also, describe the cleanings (if any).

    b) Check whether the dataset is level II clean, and if not, clean it. Also, describe the cleanings (if any).

    c) The dataset has missing values. See how many, and run a diagnosis to see which types of missing values they are.

    d) Are there any outliers in the dataset?

    e) How should we best deal with missing values if our goal is to draw multiple boxplots that show the central tendency and variation of temperature across the months? Draw the described visualization after dealing with the missing values.

  2. In this exercise, we are going to use the Iris_wMV.csv file. The Iris dataset includes 50 samples of 3 types of iris flowers, totaling 150 rows of data. Each flower is described by its sepal and petal length...