You're reading from Hands-On Data Preprocessing in Python

Product typeBook

Published inJan 2022

PublisherPackt

ISBN-139781801072137

Edition1st Edition

Tools

PyTorch Azure Functions

Concepts

Big Data

Author (1)

Roy Jafari

Chapter 11: Data Cleaning Level III – Missing Values, Outliers, and Errors

In level I, we cleaned up the table without paying attention to the data structure or the recorded values. In level II, our attention was to have a data structure that would support our analytic goal, but we still didn't pay much attention to the correctness or appropriateness of the recorded values. That is the objective of data cleaning level III. In data cleaning level III, we will focus on the recorded values and will take measures to make sure that three matters regarding the values recorded in the data are addressed. First, we will make sure missing values in the data have been detected, that we know why this has happened, and that appropriate measures have been taken to address them. Second, we will ensure that we have taken appropriate measures so that the recorded values are correct. Third, we will ascertain that the extreme points in the data have been detected and appropriate measures...

Technical requirements

You will be able to find all of the code and the datasets that are used in this book in a GitHub repository exclusively created for this book. To find the repository, click on this link: https://github.com/PacktPublishing/Hands-On-Data-Preprocessing-in-Python. In this repository, you will find a folder titled Chapter11, from where you can download the code and the data for better learning.

Missing values

Missing values, as the name suggests, are values we expect to have but we don't. In the simplest terms, missing values are empty cells in a dataset that we want to use for analytic goals. For example, the following screenshot shows an example of a dataset with missing values—the first and third students' grade point average (GPA) is missing, the fifth student's height is missing, and the sixth student's personality type is missing:

Figure 11.1 – A dataset example with missing values

In Python, missing values are not presented with emptiness—they are presented via NaN, which is short for Not a Number. While the literal meaning of Not a Number does not completely capture all the possible situations for which we have missing values, NaN is used in Python whenever we have missing values.

The following screenshot shows a pandas DataFrame that has read and presented the table represented in Figure 11.1...

Outliers

Outliers, a.k.a. extreme points, are data objects whose values are too different than the rest of the population. Being able to recognize and deal with them is important from the following three perspectives:

Outliers may be data errors in data and should be detected and removed.
Outliers that are not errors can skew the results of analytic tools that are sensitive to the existence of outliers.
Outliers may be fraudulent entries.

We will first go over the tools we can use to detect outliers, and then we will cover dealing with them based on the analytic situation.

Detecting outliers

The tools we use for detecting outliers depend on the number of attributes involved. If we are interested in detecting outliers only based on one attribute, we call that univariate outlier detection; if we want to detect them based on two attributes, we call that bivariate outlier detection; and finally, if we want to detect outliers based on more than two attributes...

Errors

Errors are an inevitable part of any data collection and measurement. The following formula best captures this fact:

The True Signal is the reality we are trying to measure and present in the form of Data, but due to the incapability of our measurement system or data presentation, we cannot capture the True Signal. Therefore, Error is the difference between the True Signal and the recorded Data.

For instance, let's say we have purchased seven thermometers and we would like to accurately calculate the room temperature using these seven thermometers. At a given point in time, we take the following readings from them:

Figure 11.37 – Seven thermometers' readings

Looking at the preceding screenshot, what would you say the temperature of the room—the True Signal—is? The answer is that we cannot measure or capture the True Signal—in this case, the exact temperature of the room. With seven thermometers, we may...

Summary

Congratulations on your learning in this chapter. This chapter covered data cleaning level III. Together, we learned how to detect and deal with missing values, outliers, and errors. This may sound like too short of a summary for such a long chapter, but as we saw, detection, diagnosis, and dealing with each of the three issues (missing values, outliers, and errors) can have many details and delicacies. Finishing this chapter was a significant achievement, and now you know how to detect, diagnose, and deal with all of these three possible issues you may encounter when working with a dataset.

This chapter concludes our three-chapter-long data cleaning journey. In the next chapter, we move to another important data preprocessing area, and that is data fusion and integration. Before moving on to the next chapter, spend some time working on the following exercises to solidify your learnings.

Exercises

In this exercise, we will be using Temperature_data.csv. This dataset has some missing values. Do the following:
a) After reading the file into a pandas DataFrame, check whether the dataset is level I clean, and if not, clean it. Also, describe the cleanings (if any).
b) Check whether the dataset is level II clean, and if not, clean it. Also, describe the cleanings (if any).
c) The dataset has missing values. See how many, and run a diagnosis to see which types of missing values they are.
d) Are there any outliers in the dataset?
e) How should we best deal with missing values if our goal is to draw multiple boxplots that show the central tendency and variation of temperature across the months? Draw the described visualization after dealing with the missing values.
In this exercise, we are going to use the Iris_wMV.csv file. The Iris dataset includes 50 samples of 3 types of iris flowers, totaling 150 rows of data. Each flower is described by its sepal and petal length...

The rest of the chapter is locked

You have been reading a chapter from

Hands-On Data Preprocessing in Python

Published in: Jan 2022Publisher: PacktISBN-13: 9781801072137

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Author (1)

Roy Jafari

Roy Jafari, Ph.D. is an assistant professor of business analytics at the University of Redlands. Roy has taught and developed college-level courses that cover data cleaning, decision making, data science, machine learning, and optimization. Roy's style of teaching is hands-on and he believes the best way to learn is to learn by doing. He uses active learning teaching philosophy and readers will get to experience active learning in this book. Roy believes that successful data preprocessing only happens when you are equipped with the most efficient tools, have an appropriate understanding of data analytic goals, are aware of data preprocessing steps, and can compare a variety of methods. This belief has shaped the structure of this book.
Read more about Roy Jafari

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages