Chapter 8: Dealing with Common Data Problems

The ability to quickly assess the shortcomings of data and correct them can be the difference between being able to accomplish what you need to on time or falling behind. In this chapter, we're going to give you the tools to identify some of these problems, which you'll find are present in much of the data found in the industry.

We'll first look at when there can be too much data. This can be an issue where features can have an extremely high correlation with one another and in turn complicate a model. You'll see how to find this information and then remove the offending entries.

After that, we'll look at ways to get rid of blank, empty, or Not a Number (NaN) data that muddy the waters. These entries take up space in a dataset without adding any value.

We'll also look at what to do when you have categorical values. There are times when you'll need to maintain the relationship between categories, and times...

Technical requirements

There are a few things that you will need to get the most out of this chapter. They are as follows:

  • Anaconda Distribution. This includes conda and Navigator. You can download it from the following URL: https://www.anaconda.com/products/distribution
  • A conda environment with scikit-learn, pandas, and matplotlib (see the sketch after this list for one way to create it).
  • A Jupyter notebook to perform all the coding segments. You can also use any IDE of your choice, or even the command line, but the assumption is that you will be working in a notebook.
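
If you need to create that environment from scratch, a minimal command-line sketch could look like the following (the environment name ch8-data is just a placeholder):

conda create -n ch8-data scikit-learn pandas matplotlib notebook
conda activate ch8-data
jupyter notebook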

After you have that set up, we can look at our first topic – how to deal with having extra data.

Dealing with too much data

It's true that more data is usually better, but this isn't always the case. There are many times when having extra data has a negative impact on an outcome. Such a case was covered in Chapter 1, Understanding the AI/ML Landscape, where a father gave his child an extra example of what a tiger was, but that extra example was actually of a panther. That extra bit of information becomes a harmful addition to the training set and leads to a worse learning outcome for your model.

How are you supposed to know this? Understand the data. This will be a common theme in this chapter, throughout the book, and in the real world. If you don't start there, then everything else is more challenging. It's similar to being able to understand bias, as discussed in Chapter 6, Overcoming Bias in AI/ML.

Sometimes, though, you won't or can't have a full grasp of the data, but you can use tools to help you out. The first clue that you can...
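
One such tool, sketched here with a small made-up DataFrame (the column names and values are purely illustrative), is pandas' corr() method, which surfaces features that track each other too closely:

import pandas as pd

# A toy dataset; in practice this would be your own DataFrame
df = pd.DataFrame({
    "recruiting_points": [280, 150, 90, 60],
    "recruiting_rank":   [1, 2, 3, 4],
    "wins":              [12, 9, 7, 5],
})

# Absolute pairwise correlations between features
corr = df.corr().abs()

# Flag pairs of distinct columns with a correlation above 0.9
high = corr.where(corr < 1.0).stack()
print(high[high > 0.9])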

Finding and correcting data entries

In the age of computers, human error will always come into play. Unfortunately, those mistaken keystrokes will manifest themselves in the datasets that we are tasked to work with. This will be present in everything from medical information to a car's service record.

You can check for anomalies in a few ways; one is to simply group items together and see which stand out among the other items in that group. Looking back at our college football dataset, we want to confirm that the schools' conferences are all correct.

We can simply call on the Conference column, which is returned as a pandas Series object. This object has many methods you can access, but the one we are interested in is pandas' Series.value_counts() method.

Let's use that to check whether there are lone conferences:

df_ncaa_error.Conference.value_counts()

This will show the following:

Figure 8.6 – A count by conference
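
Once a stray value shows up in the counts, one way to correct it is pandas' replace() method. The exact misspelling below is hypothetical; use whatever value value_counts() actually reveals in your data:

# Map the mistaken entry (hypothetical here) to the correct conference name
df_ncaa_error["Conference"] = df_ncaa_error["Conference"].replace(
    {"Big 10": "Big Ten"}
)

# Re-run the count to confirm the lone entry has been folded back in
df_ncaa_error.Conference.value_counts()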

...

Working with categorical values with one-hot encoding

Machine learning and statistics can be quite good at determining relationships between numbers. But what if you have a feature that is categorical and doesn't have a relationship? A categorical feature is a variable that is a label or category with discrete possibilities, such as colors, the animal kingdom, or cities.

One option when you have this type of data is to use one-hot encoding. This is the process of converting a categorical value into a set of ones and zeroes so that the model can treat the categories as independent without inferring a relationship between them. This also prevents the inference that some categories are superior or inferior to others.

You can see an example of what this looks like in the following figure. Say you are looking at sales data for bouncy balls and one of the features is the color. There are three colors – red, blue, and green. This is represented as data...
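
A minimal sketch of this with pandas' get_dummies() function, using the bouncy ball colors described above (the DataFrame itself is made up for illustration):

import pandas as pd

# Toy sales data for bouncy balls; only the color column is categorical
df_balls = pd.DataFrame({
    "color": ["red", "blue", "green", "red"],
    "units_sold": [10, 4, 7, 12],
})

# One-hot encode the color column into independent 0/1 columns
df_encoded = pd.get_dummies(df_balls, columns=["color"], dtype=int)
print(df_encoded)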

Feature scaling

When you are working with features that have a large spread of values, the higher the deviation, the harder it will be to train a good model on them. There are a number of reasons for this that we won't cover now; we'll cover scaling techniques in more depth in the Scaling the data section of Chapter 9, Building a Regression Model with scikit-learn. But you should know that you will sometimes come across datasets where someone has already scaled the data.

You can't always know where a dataset has come from, so you may not have the benefit of understanding why a particular decision was made.

This data could come from a colleague, a Kaggle competition, or simply an example dataset included with scikit-learn, like the one we are using now. This is the same California training dataset that was used in Chapter 2, Analyzing Open Source Software, and we'll assume that you already have y_test and y_predict set up. If not, refer back to Chapter 2,...
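
As a quick, hedged check (the DataFrame here is hypothetical), summary statistics can hint at whether someone has already scaled a column: standardized features hover around a mean of 0 and a standard deviation of 1, while min-max scaled features sit between 0 and 1:

import pandas as pd

# Hypothetical example: one raw feature and one already-standardized feature
df_features = pd.DataFrame({
    "median_income_raw": [2.5, 8.3, 4.1, 6.7],
    "median_income_scaled": [-1.2, 1.3, -0.5, 0.4],
})

# Standardized columns will show mean near 0 and std near 1; raw columns won't
print(df_features.describe().loc[["mean", "std", "min", "max"]])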

Working with date formats

Dates and times are often found in datasets and can present a few unique problems, becoming a huge thorn in a data scientist's side. There are many formats across the world, which differ between countries and systems. For example, the United States commonly uses the month/day/year format (mm/dd/yyyy), but in Europe, you are more likely to see day/month/year (dd/mm/yyyy).

Python has a built-in datetime object, but we'll make use of pandas' own datetime type as well. This will allow us to easily perform a number of operations, such as grabbing just the month value or parsing dates in a specific format.
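
As a small illustration (the column name and date strings are made up), pandas' to_datetime() function lets you specify the expected format and then pull pieces such as the month out through the .dt accessor:

import pandas as pd

# Dates stored as US-style mm/dd/yyyy strings (hypothetical values)
df_dates = pd.DataFrame({"service_date": ["03/14/2021", "11/02/2021", "07/30/2022"]})

# Parse with an explicit format so pandas doesn't have to guess
df_dates["service_date"] = pd.to_datetime(df_dates["service_date"], format="%m/%d/%Y")

# Grab just the month value
df_dates["service_month"] = df_dates["service_date"].dt.month
print(df_dates)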

Time zones also come into play. Different parts of the world follow different rules about offsets and daylight saving time, which is one reason UTC has become more common. UTC is a fixed standard that can be used no matter what your specific time zone is.
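
In pandas, that typically means localizing naive timestamps to their zone and then converting them to UTC; the zone and timestamps below are arbitrary examples:

import pandas as pd

# Naive timestamps recorded in a local time zone (hypothetical values)
stamps = pd.to_datetime(pd.Series(["2022-03-01 09:00", "2022-03-01 17:30"]))

# Attach the local zone, then convert everything to the UTC standard
utc_stamps = stamps.dt.tz_localize("US/Eastern").dt.tz_convert("UTC")
print(utc_stamps)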

Specifying a date field in pandas

The easiest way to call out...

Summary

Every situation and dataset you see will be unique; however, the problems you encounter with them won't be. In this chapter, you saw issues that will come up repeatedly with the datasets you'll be working with.

We saw how having too much data can be a problem in the form of highly correlated features, and how you can find that correlation and remove it. We used the example of college recruiting points and rank, but you can easily find others in the real world, such as housing prices – you might have the price per square foot while also having price and square footage as separate features.

Working with categorical data is common, but at the end of the day, machine learning models need numbers to be able to work. We saw that there are times when we want to keep relationships between categorical values, such as a rating system, and other times when we don't. We saw how we can use one-hot encoding to encode these categories when we don't want to keep the relationships.

...