You're reading from  Building Data Science Solutions with Anaconda

Product type: Book
Published in: May 2022
Publisher: Packt
ISBN-13: 9781800568785
Edition: 1st Edition
Author (1)
Dan Meador

Dan Meador is an Engineering Manager at Anaconda and is the creator of Conda as well as a champion of open source at Anaconda. With a history of engineering and client facing roles, he has the ability to jump into any position. He has a track record of delivering as a leader and a follower in companies from the Fortune 10 to startups.

Chapter 5: Cleaning and Visualizing Data

According to Anaconda's latest State of Data Science Report (https://bit.ly/3F2D8YM), 39% of your time as a data scientist will be spent on either data preparation or cleaning. This might come as no surprise: being able to set up a problem correctly is vital to getting good answers from your data.

Rarely will data come to you in a perfect form, and even when it does, you might want to manipulate it to answer different questions. You will need to be able to quickly find general statistics, discover and remove bad columns, and alter fields in place.

Once the data is in the right form, visualization is a key tool, not only for presenting your findings to those who care about them but also as a guide for yourself at this data exploration stage. Cleaning and visualization go hand in hand, and you'll often find that certain aspects of the data need to be adjusted after you see them plotted. This chapter...

Technical requirements

To follow along with this chapter, you'll need the same basic setup as the previous chapters, which includes the following:

  • The Anaconda distribution installed. This includes conda, Navigator, and the Jupyter Notebook by default.

Go to https://github.com/fivethirtyeight/data/tree/master/college-majors to download the dataset. Alternatively, if you don't want to download the whole repository, go to https://data.fivethirtyeight.com/ and search for The Economic Guide To Picking A College Major. On the right side, you will see a button with an arrow on it to download the datasets.

Cleaning data with pandas

One of the most important parts of working with data is ensuring that it's in the format you need. Along with getting enough data, this might be the most vital component of training an accurate model. In this section, we're going to walk through the steps of importing a CSV file and then analyzing and cleaning it so that it's ready for use.
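
As a rough sketch of that import, analyze, and clean workflow, the snippet below reads a tiny in-memory CSV rather than the real download. The rows are made up, and the column names (Major, Major_category, Median) are assumptions for illustration that may not match the actual file.

```python
import io

import pandas as pd

# Made-up rows standing in for the real file; the column names are
# assumptions for illustration and may differ in the downloaded CSV.
csv_text = """Major,Major_category,Median
PETROLEUM ENGINEERING,Engineering,110000
NURSING,Health,48000
ZOOLOGY,Biology & Life Science,
ZOOLOGY,Biology & Life Science,
"""

# With the real dataset, pass the path to the CSV file instead of a buffer.
majors = pd.read_csv(io.StringIO(csv_text))

# Analyze: summary statistics and a count of missing values per column.
print(majors.describe())
print(majors.isna().sum())

# Clean: drop exact duplicate rows, then rows with no median salary.
cleaned = majors.drop_duplicates().dropna(subset=["Median"])
print(cleaned)
```

Dropping rows is only one possible choice here; depending on the question being asked, filling in missing values may be more appropriate than discarding them.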

The example that we are going to look at is data on various US university majors and how they relate to pay. Having a general sense of the domain we are looking into is critical, and this is an area that you might already have a grasp of. This dataset is provided by the excellent FiveThirtyEight site, and more information can be found here: https://github.com/fivethirtyeight/data/tree/master/college-majors.

Our goal is to use this data to figure out whether we should have chosen another major. We might even find out that...

Visualization with Matplotlib

As with many other topics discussed in this book, there are many packages that can tackle any particular area. For visualization, Matplotlib is easily one of the most widely used. It makes simple graphs easy to show, and it also offers many advanced options. It works very well with pandas DataFrames and has carved out its place as one of the most widely used packages in data science.

Let's start with a straightforward example of how to display a plot.

There are a few basic steps that you should take almost every time you want to show a plot:

  1. Preparing the data
  2. Plotting the data
  3. Customizing the plot
  4. Showing the plot
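
The four steps above can be sketched roughly as follows. The category names and pay values are made up for illustration and are not taken from the dataset. The backend line is an assumption so that the sketch also runs as a plain script on a machine without a display.

```python
import matplotlib

# Assumption: Agg is a non-interactive backend, so plt.show() will not open
# a window here. Inside a Jupyter notebook, drop this line.
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# 1. Prepare the data (made-up values, not taken from the dataset).
categories = ["Engineering", "Health", "Arts"]
median_pay = [57000, 36000, 33000]

# 2. Plot the data.
fig, ax = plt.subplots()
bars = ax.bar(categories, median_pay)

# 3. Customize the plot.
ax.set_title("Median pay by major category (illustrative)")
ax.set_xlabel("Major category")
ax.set_ylabel("Median pay (USD)")

# 4. Show the plot.
plt.show()
```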

We'll walk through all of these, but we've already done much of step one in the Cleaning data with pandas section. Let's take that data and group it to focus on the categories of majors.
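
One possible way to do that grouping is with the pandas groupby method. The sketch below uses made-up rows, and the column names (Major, Major_category, Median) are assumptions for illustration rather than a confirmed match for the dataset.

```python
import pandas as pd

# Made-up rows; the column names are assumptions for illustration.
majors = pd.DataFrame(
    {
        "Major": [
            "Mechanical Engineering",
            "Civil Engineering",
            "Nursing",
            "Fine Arts",
        ],
        "Major_category": ["Engineering", "Engineering", "Health", "Arts"],
        "Median": [60000, 50000, 48000, 30000],
    }
)

# One row per category: the mean of the per-major median salaries,
# sorted from highest to lowest.
by_category = (
    majors.groupby("Major_category")["Median"].mean().sort_values(ascending=False)
)
print(by_category)
```

The result is a Series indexed by category, which can be handed to a plotting call directly.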

Preparing data for plotting

Let's take the existing...

Summary

In this chapter, we saw how we can take a dataset and analyze what it holds before moving on to cleaning. We looked at how pandas gives us a lot of powerful tools that allow us to quickly pull in CSV data, calculate basic statistics, and clean up issues such as missing values using functions such as forward fill and backfill.
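
As a brief reminder of how forward fill and backfill behave, here is a minimal sketch on a made-up Series (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# A made-up series with gaps, for illustration only.
values = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: each gap takes the last valid value before it.
forward = values.ffill()

# Backfill: each gap takes the next valid value after it. The trailing
# gap stays missing because nothing follows it.
backward = values.bfill()

print(forward.tolist())
print(backward.tolist())
```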

We then looked at how we can bring some visual flair to the underlying data with Matplotlib to create bar charts and scatterplots. This tool is vital for getting a better sense of the data that you have and for conveying your analysis to colleagues.

These two tools, pandas and Matplotlib, are ones you will come back to repeatedly. We are now equipped with Conda, Jupyter notebooks, NumPy, pandas, and Matplotlib. Using just these tools, you will already be able to answer many real-world questions, such as whether more students pick a major with higher pay. Even though we can get these simple answers...
