Chapter 8: Dealing with Common Data Problems

The ability to quickly assess the shortcomings of data and correct them can be the difference between being able to accomplish what you need to on time or falling behind. In this chapter, we're going to give you the tools to identify some of these problems, which you'll find are present in much of the data found in the industry.

We'll first look at when there can be too much data. This can be an issue where features can have an extremely high correlation with one another and in turn complicate a model. You'll see how to find this information and then remove the offending entries.

After that, we'll look at ways to get rid of blank, empty, or Not a Number (NaN) data that muddy the waters. These entries take up space in a dataset without adding any value.

We'll also look at what to do when you have categorical values. There are times when you'll need to maintain the relationship between categories, and times...

Technical requirements

There are a few things that you will need to get the most out of this chapter. They are as follows:

  • Anaconda Distribution. This includes conda and Navigator. You can download it from the following URL: https://www.anaconda.com/products/distribution
  • A conda environment with scikit-learn, pandas, and matplotlib (see the sketch after this list for one way to create it).
  • A Jupyter notebook to perform all the coding segments. You can also use any IDE of your choice, or even the command line, but the assumption is that you will be working in a notebook.
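
If you need to create that environment from scratch, a minimal command-line sketch could look like the following (the environment name ch8-data is just a placeholder):

conda create -n ch8-data scikit-learn pandas matplotlib notebook
conda activate ch8-data
jupyter notebook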

After you have that set up, we can look at our first topic – how to deal with having extra data.

Dealing with too much data

It's true that more data is usually better, but this isn't always the case. There are many times when having extra data has a negative impact on an outcome. Such a case was covered in Chapter 1, Understanding the AI/ML Landscape, where a father gave his child an extra example of what a tiger was, but that extra example was actually of a panther. That extra bit of information becomes a harmful addition to the training set and leads to a worse learning outcome for your model.

How are you supposed to know this? Understand the data. This will be a common theme in this chapter, throughout the book, and in the real world. If you don't start there, then everything else is more challenging. It's similar to being able to understand bias, as discussed in Chapter 6, Overcoming Bias in AI/ML.

Sometimes, though, you won't or can't have a full grasp of the data, but you can use tools to help you out. The first clue that you can...
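
One such tool, sketched here with a small made-up DataFrame (the column names and values are purely illustrative), is pandas' corr() method, which surfaces features that track each other too closely:

import pandas as pd

# A toy dataset; in practice this would be your own DataFrame
df = pd.DataFrame({
    "recruiting_points": [280, 150, 90, 60],
    "recruiting_rank":   [1, 2, 3, 4],
    "wins":              [12, 9, 7, 5],
})

# Absolute pairwise correlations between features
corr = df.corr().abs()

# Flag pairs of distinct columns with a correlation above 0.9
high = corr.where(corr < 1.0).stack()
print(high[high > 0.9])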

Finding and correcting data entries

In the age of computers, human error will always come into play. Unfortunately, those mistaken keystrokes will manifest themselves in the datasets that we are tasked to work with. This will be present in everything from medical information to a car's service record.

You can check for anomalies in a few ways; one is to simply group items together and see which stand out among the other items in that group. Looking back at our college football dataset, we want to confirm that the schools' conferences are all correct.

We can simply call on the Conference column, which is returned as a pandas Series object. This object has many methods you can access, but the one we are interested in is pandas' Series.value_counts() method.

Let's use that to check whether there are lone conferences:

df_ncaa_error.Conference.value_counts()

This will show the following:

Figure 8.6 – A count by conference
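
Once a stray value shows up in the counts, one way to correct it is pandas' replace() method. The exact misspelling below is hypothetical; use whatever value value_counts() actually reveals in your data:

# Map the mistaken entry (hypothetical here) to the correct conference name
df_ncaa_error["Conference"] = df_ncaa_error["Conference"].replace(
    {"Big 10": "Big Ten"}
)

# Re-run the count to confirm the lone entry has been folded back in
df_ncaa_error.Conference.value_counts()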

...

Working with categorical values with one-hot encoding

Machine learning and statistics can be quite good at determining relationships between numbers. But what if you have a feature that is categorical and doesn't have a relationship? A categorical feature is a variable that is a label or category with discrete possibilities, such as colors, the animal kingdom, or cities.

One option when you have this type of data is to use one-hot encoding. This is the process of converting a categorical value into a set of ones and zeroes so that the model can treat the categories as independent without inferring a relationship between them. This also prevents the inference that some categories are superior or inferior to others.

You can see an example of what this looks like in the following figure. Say you are looking at sales data for bouncy balls and one of the features is the color. There are three colors – red, blue, and green. This is represented as data...
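
A minimal sketch of this with pandas' get_dummies() function, using the bouncy ball colors described above (the DataFrame itself is made up for illustration):

import pandas as pd

# Toy sales data for bouncy balls; only the color column is categorical
df_balls = pd.DataFrame({
    "color": ["red", "blue", "green", "red"],
    "units_sold": [10, 4, 7, 12],
})

# One-hot encode the color column into independent 0/1 columns
df_encoded = pd.get_dummies(df_balls, columns=["color"], dtype=int)
print(df_encoded)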

Feature scaling

When you are working with features that have a large spread of values, the higher the deviation, the harder it will be to train a good model on them. There are a number of reasons for this that we won't cover now; we'll cover scaling techniques in more depth in the Scaling the data section of Chapter 9, Building a Regression Model with scikit-learn. But you should know that you will sometimes come across datasets where someone has already scaled the data.

You can't always know where a dataset has come from, so you may not have the benefit of understanding why a particular decision was made.

This data could come from a colleague, a Kaggle competition, or simply an example dataset included with scikit-learn, like the one we are using now. This is the same California training dataset that was used in Chapter 2, Analyzing Open Source Software, and we'll assume that you already have y_test and y_predict set up. If not, refer back to Chapter 2,...
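
As a quick, hedged check (the DataFrame here is hypothetical), summary statistics can hint at whether someone has already scaled a column: standardized features hover around a mean of 0 and a standard deviation of 1, while min-max scaled features sit between 0 and 1:

import pandas as pd

# Hypothetical example: one raw feature and one already-standardized feature
df_features = pd.DataFrame({
    "median_income_raw": [2.5, 8.3, 4.1, 6.7],
    "median_income_scaled": [-1.2, 1.3, -0.5, 0.4],
})

# Standardized columns will show mean near 0 and std near 1; raw columns won't
print(df_features.describe().loc[["mean", "std", "min", "max"]])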

Working with date formats

Dates and times are often found in datasets and can present a few unique problems, becoming a huge thorn in a data scientist's side. There are many formats across the world, which differ between countries and systems. For example, the United States commonly uses the month/day/year format (mm/dd/yyyy), but in Europe, you are more likely to see day/month/year (dd/mm/yyyy).

Python has a built-in datetime object, but we'll make use of pandas' own datetime type as well. This will allow us to easily perform a number of operations, such as grabbing just the month value or parsing dates in a specific format.
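
As a small illustration (the column name and date strings are made up), pandas' to_datetime() function lets you specify the expected format and then pull pieces such as the month out through the .dt accessor:

import pandas as pd

# Dates stored as US-style mm/dd/yyyy strings (hypothetical values)
df_dates = pd.DataFrame({"service_date": ["03/14/2021", "11/02/2021", "07/30/2022"]})

# Parse with an explicit format so pandas doesn't have to guess
df_dates["service_date"] = pd.to_datetime(df_dates["service_date"], format="%m/%d/%Y")

# Grab just the month value
df_dates["service_month"] = df_dates["service_date"].dt.month
print(df_dates)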

Time zones also come into play. Different parts of the world follow different rules about offsets and daylight saving time, which is one reason UTC has become more common. UTC is a fixed standard that can be used no matter what your specific time zone is.
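
In pandas, that typically means localizing naive timestamps to their zone and then converting them to UTC; the zone and timestamps below are arbitrary examples:

import pandas as pd

# Naive timestamps recorded in a local time zone (hypothetical values)
stamps = pd.to_datetime(pd.Series(["2022-03-01 09:00", "2022-03-01 17:30"]))

# Attach the local zone, then convert everything to the UTC standard
utc_stamps = stamps.dt.tz_localize("US/Eastern").dt.tz_convert("UTC")
print(utc_stamps)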

Specifying a date field in pandas

The easiest way to call out...

Summary

Every situation and dataset you see will be unique; however, the problems you encounter with them won't be. In this chapter, you saw issues that will come up repeatedly with the datasets you'll be working with.

We saw how having too much data can be a problem in the form of highly correlated features, and how you can find that correlation and remove it. We used the example of college recruiting points and rank, but you can easily find others in the real world, such as housing prices – you might have the price per square foot while also having price and square footage as separate features.

Working with categorical data is common, but at the end of the day, machine learning models need numbers to be able to work. We saw that there are times when we want to keep relationships between categorical values, such as a rating system, and other times when we don't. We saw how we can use one-hot encoding to encode these categories when we don't want to keep the relationships.

...