You're reading from The Pandas Workshop

Product type: Book
Published in: Jun 2022
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800208933
Edition: 1st Edition
Authors (4):
Blaine Bateman

Blaine Bateman has more than 35 years of experience working with various industries, from government R&D to startups to $1B public companies. His experience focuses on analytics, including machine learning and forecasting. His hands-on abilities include Python and R coding, Keras/TensorFlow, and AWS and Azure machine learning services. As a machine learning consultant, he has developed and deployed ML models in industry.

Saikat Basak

Saikat Basak is a data scientist and a passionate programmer. Having worked with multiple industry leaders, he has a good understanding of problem areas that can potentially be solved using data. Apart from being a data guy, he is also a science geek and loves to explore new ideas in the frontiers of science and technology.

Thomas V. Joseph

Thomas V. Joseph is a data science practitioner, researcher, trainer, mentor, and writer with more than 19 years of experience. He has extensive experience in solving business problems using machine learning toolsets across multiple industry segments.

William So

William So is a Data Scientist with both a strong academic background and extensive professional experience. He is currently the Head of Data Science at Douugh and also a Lecturer for the Master of Data Science and Innovation at the University of Technology Sydney. During his career, he has covered the end-to-end spectrum of data analytics, from ML to Business Intelligence, helping stakeholders derive valuable insights and achieve results that benefit the business. William is a co-author of "The Applied Artificial Intelligence Workshop" published by Packt.


Chapter 14: Applying pandas Data Processing for Case Studies

So far in this book, you have progressively learned different data processing techniques using pandas: working with different types of data structures, accessing data from multiple sources, data cleaning, data transformation, visualization, code optimization, and finally, data modeling. This chapter brings these techniques together by applying them to four case studies. Working through them will expose you to the different ways data needs to be preprocessed before it is workable and help you see how good preparation is the key to good analysis. By the end of this chapter, you will have reinforced your understanding of the data processing techniques covered in this book.

This chapter covers the following topics:

  • Introduction to the case studies and datasets
  • Recap of the preprocessing steps...

Introduction to the case studies and datasets

Data cleaning and preparation can take up to 80% of the time in a data analytics life cycle. Transactional datasets can fail in multiple ways; prominent examples include missing data points, incompatible formats, inconsistent data types, misspellings, and unwanted characters and white space.

These are just some examples of how data can be messy. A data analyst's success depends on how well they can work through messy data and transform it into the required format. A reliable way to become adept at this all-important process is to get hands-on experience with multiple real-world datasets. In this chapter, you will analyze four different datasets, with each analysis focusing on different facets of data wrangling. The following list offers a snapshot of the datasets we will be dealing with in this chapter and the different techniques we will be applying to...

Recap of the preprocessing steps

Unlike the previous chapters, this chapter introduces no new skills; it only reinforces those taught earlier, through various exercises and an activity.

This section will help you recap some of the important preprocessing steps covered in this book so far and also go through some techniques that will be used in the exercises:

  1. Reading CSV files
    pd.read_csv('file path', delimiter=';')

As you may recall, the pd.read_csv function reads the data from a CSV file at the specified path. The delimiter argument names the character that separates the fields, here a semicolon.
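
As a minimal sketch of the call above, the following reads semicolon-delimited text from an in-memory buffer instead of a file, so it runs without anything on disk. The column names and values are made up for illustration and are not taken from the book's datasets.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited content standing in for a CSV file on disk
raw = io.StringIO("city;month;sales\nPune;Jan;120\nPune;Feb;135\n")

# delimiter=';' tells pandas to split fields on semicolons instead of commas
df = pd.read_csv(raw, delimiter=';')

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column names taken from the header row
```

Passing a file path in place of the buffer works the same way.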

  2. Recasting data

One of the most frequent transformation steps is changing data from wide format to long format. For example, the following figure shows some data in wide format, where the data for each month is spread across the columns:

Figure 14.1 – Wide format data

Often, when we have to preprocess data, we need data...
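
One common way to recast wide data into long format in pandas is DataFrame.melt. The sketch below assumes a small, made-up table with one column per month; the column names are hypothetical, and the book's own exercise may use a different method.

```python
import pandas as pd

# Hypothetical wide-format data: one row per product, one column per month
wide = pd.DataFrame({
    "product": ["A", "B"],
    "Jan": [10, 20],
    "Feb": [11, 21],
    "Mar": [12, 22],
})

# melt keeps 'product' as an identifier and stacks the month columns into rows
long = wide.melt(id_vars="product", var_name="month", value_name="sales")

print(long)
```

Each (product, month) pair now occupies its own row, which is the shape that grouping and plotting functions usually expect.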

Activity 14.01 – analyzing air quality data

Suppose you are working as a data analyst for your city's municipality. The Department for the Environment needs your help in answering some questions related to emissions. The department wants answers to the following questions:

  • Which day of the week has the highest NO2(GT) emissions?
  • At what time of the day are NMHC(GT) emissions highest?
  • Which month has the lowest CO(GT) emissions?

    Note

    The emissions dataset has been sourced from the following link:

    https://archive.ics.uci.edu/ml/machine-learning-databases/00360/

    You can find the dataset in the GitHub repository for this book. Download the data, unzip it, and then place the CSV file in a data folder on your local machine. The department needs the answers presented as clear visualizations.

The following steps will help you complete this activity:

  1. Open a new Jupyter notebook.
  2. Download the data and then read the data using...
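
Questions of this kind usually reduce to grouping by a part of the timestamp and aggregating. The following is a sketch under stated assumptions, not the activity's solution: the readings are made up, the timestamp column name is hypothetical, and only the NO2(GT) and CO(GT) column names come from the questions above.

```python
import pandas as pd

# Hypothetical hourly readings; the values are invented for illustration
readings = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2004-03-10 18:00", "2004-03-10 19:00",
        "2004-03-11 18:00", "2004-04-05 09:00",
    ]),
    "NO2(GT)": [113.0, 92.0, 140.0, 60.0],
    "CO(GT)": [2.6, 2.0, 3.1, 1.2],
})

# Derive the grouping keys from the timestamp
readings["day_name"] = readings["timestamp"].dt.day_name()
readings["month"] = readings["timestamp"].dt.month

# Mean NO2(GT) per day of the week, and the day with the highest mean
no2_by_day = readings.groupby("day_name")["NO2(GT)"].mean()
print(no2_by_day.idxmax())

# Mean CO(GT) per month, and the month with the lowest mean
co_by_month = readings.groupby("month")["CO(GT)"].mean()
print(co_by_month.idxmin())
```

Grouping on dt.hour follows the same pattern for the time-of-day question, and calling .plot(kind="bar") on either result gives a bar chart if matplotlib is installed.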

Summary

In this final chapter, you got hands-on practice with different data processing tasks on real-world datasets. With the first dataset, you explored different methods of data processing, including converting from wide format to long format, merging two DataFrames, and imputing missing data using the interpolate method.
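
As a quick reminder of what imputing with interpolate looks like, here is a minimal sketch on a made-up series. It uses the default linear method; the chapter's exercise may use different options.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing readings
s = pd.Series([1.0, np.nan, 3.0, np.nan, 7.0])

# Default linear interpolation fills each gap from its neighbours
filled = s.interpolate()

print(filled.tolist())
```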

With the second dataset, you practiced preprocessing tasks before plotting, such as grouping and aggregation, and converting continuous data into categorical data using binning. You also answered questions about the data using line plots and bar charts.
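
Binning followed by grouping can be sketched as follows, assuming pd.cut as the binning function. The data, bin edges, and labels are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical continuous values to be bucketed
df = pd.DataFrame({"age": [5, 17, 30, 45, 70], "spend": [10, 20, 30, 40, 50]})

# pd.cut assigns each value to a labelled interval; bins are right-inclusive by default
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 60, 100],
                         labels=["child", "adult", "senior"])

# Aggregate the continuous column within each category
mean_spend = df.groupby("age_group", observed=True)["spend"].mean()

print(mean_spend)
```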

Using the third dataset, you extracted geolocations from latitude and longitude information. After extracting geolocation information, you also answered some questions on the service level of bus routes.

Finally, with the fourth dataset, you used different methods to preprocess data to build a classification model. You should now be able to confidently tackle most data...
