You're reading from The Pandas Workshop

Product type: Book

Published in: Jun 2022

Reading Level: Beginner

Publisher: Packt

ISBN-13: 9781800208933

Edition: 1st Edition

Languages

Python

Tools

NumPy Pandas

Concepts

Data Science

Authors (4):

Blaine Bateman

Saikat Basak

Thomas V. Joseph

William So

Chapter 14: Applying pandas Data Processing for Case Studies

So far in this book, we have progressively learned different data processing techniques using pandas: working with different types of data structures, accessing data from multiple sources, data cleaning, data transformation, visualization, code optimization, and finally, data modeling. This chapter brings all of these techniques together by applying them to four case studies. The case studies will expose you to the different ways data needs to be preprocessed before it is workable and will help you see how good preparation is the key to good analysis. By the end of this chapter, you will have reinforced your understanding of the data processing techniques covered in this book.

This chapter covers the following topics:

Introduction to the case studies and datasets
Recap of the preprocessing steps...

Introduction to the case studies and datasets

Data cleaning and preparation are often said to take up to 80% of the time in a data analytics life cycle. Transactional datasets can fail in several ways; some of the most prominent are missing data points, incompatible formats, inconsistent data types, misspelled values, and unwanted characters and white space.
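As a rough illustration of a few of these failure modes, the following sketch cleans a small, made-up DataFrame. The column names (city, amount) and all values are hypothetical and are not taken from the chapter's datasets; the calls shown are standard pandas string and conversion methods.

```python
import pandas as pd

# Hypothetical messy data: stray white space, inconsistent case,
# thousands separators stored as text, a placeholder string, and a missing value.
messy = pd.DataFrame({
    "city": [" Oslo", "oslo ", "Bergen", None],
    "amount": ["1,200", "950", "n/a", "1,050"],
})

clean = messy.copy()

# Remove surrounding white space and normalize the capitalization.
clean["city"] = clean["city"].str.strip().str.title()

# Drop the unwanted comma, then convert to numbers;
# values that cannot be parsed (such as "n/a") become NaN.
clean["amount"] = pd.to_numeric(
    clean["amount"].str.replace(",", "", regex=False),
    errors="coerce",
)

print(clean)
```

After cleaning, the missing city and the unparseable amount remain as missing values, which makes them visible to later imputation steps instead of hiding them as text.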

These are just some examples of how data can be messy. The success of a data analyst depends on how well they can work through messy data and transform it into the required format. A reliable way to become adept at this all-important process is to get hands-on experience with multiple real-world datasets. In this chapter, you will analyze four different datasets, with each analysis focusing on different facets of data wrangling. The following list offers a snapshot of the datasets we will be dealing with in this chapter and the different techniques we will be applying to...

Recap of the preprocessing steps

Unlike the previous chapters, this chapter does not introduce new skills; it reinforces the skills taught earlier through various exercises and an activity.

This section will help you recap some of the important preprocessing steps covered in this book so far and also go through some techniques that will be used in the exercises:

Reading CSV files

pd.read_csv('file path', delimiter=';')

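As you may recall, the pd.read_csv function reads the data from a CSV file at the specified path and returns it as a DataFrame. The delimiter argument tells pandas which character separates the fields; here it is a semicolon rather than the default comma.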
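A minimal, self-contained sketch of this call follows. To keep it runnable without a file on disk, it reads from an in-memory string through io.StringIO; the column names and values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited content standing in for a file on disk.
raw = "city;month;sales\nOslo;Jan;120\nBergen;Jan;95\n"

# delimiter=';' tells pandas that the fields are separated by semicolons.
df = pd.read_csv(io.StringIO(raw), delimiter=";")

print(df.shape)
print(list(df.columns))
```

With a real file, the first argument would simply be the path to that file instead of the StringIO object.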

Recasting data

One of the most frequent transformation steps is converting data from wide format to long format. For example, the following figure shows some data in wide format. You can see that the data for each month is spread across the columns:

Figure 14.1 – Wide format data

Often, when we have to preprocess data, we need data...
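One common way to go from wide to long format in pandas is DataFrame.melt. The sketch below assumes a small wide table with one column per month; the table, the column names (product, month, sales), and the values are hypothetical and do not reproduce Figure 14.1.

```python
import pandas as pd

# Hypothetical wide-format data: one row per product, one column per month.
wide = pd.DataFrame({
    "product": ["A", "B"],
    "Jan": [10, 20],
    "Feb": [12, 18],
    "Mar": [15, 25],
})

# melt keeps the id_vars columns and stacks the remaining columns
# into two new columns: one for the old column name, one for its value.
long = wide.melt(id_vars="product", var_name="month", value_name="sales")

print(long)
```

Each product now has one row per month, which is usually the shape that grouping, aggregation, and plotting functions expect.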

Activity 14.01 – analyzing air quality data

Imagine that you're working as a data analyst for your city's municipality. The Department for the Environment needs your help answering some questions related to emissions. The department wants answers to the following questions:

Which day of the week has the highest NO2(GT) emissions?
At what time of the day are NMHC(GT) emissions highest?
Which month has the lowest CO(GT) emissions?
Note
The emissions dataset has been sourced from the following link:
https://archive.ics.uci.edu/ml/machine-learning-databases/00360/
You can find the dataset in the GitHub repository for this book. Download the data, unzip it, and then place the CSV file in a data folder on your local machine. The department wants the answers presented as clear visualizations.
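Questions of this kind can typically be approached by deriving a calendar field from a timestamp and then grouping on it. The sketch below shows the pattern for the day-of-week question on a tiny, made-up sample. The timestamp column name and all readings are hypothetical; only the NO2(GT) column name comes from the activity, and the real file will need its own loading and cleaning steps first.

```python
import pandas as pd

# Hypothetical hourly readings; the real dataset has many more rows and columns.
readings = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2004-03-10 18:00", "2004-03-10 19:00",
        "2004-03-11 18:00", "2004-03-11 19:00",
    ]),
    "NO2(GT)": [113.0, 92.0, 151.0, 172.0],
})

# Derive the grouping keys from the timestamp.
readings["day_of_week"] = readings["timestamp"].dt.day_name()
readings["hour"] = readings["timestamp"].dt.hour

# Average the emissions per day of the week and pick the highest.
mean_by_day = readings.groupby("day_of_week")["NO2(GT)"].mean()
peak_day = mean_by_day.idxmax()

print(mean_by_day)
print(peak_day)
```

The same pattern applies to the other two questions by grouping on the hour or on readings["timestamp"].dt.month and using idxmin where the lowest value is wanted. With matplotlib installed, mean_by_day.plot(kind="bar") would turn the aggregated result into a bar chart.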

The following steps will help you complete this activity:

Open a new Jupyter notebook.
Download the data and then read the data using...

Summary

In this final chapter, you got hands-on practice with different data processing tasks done on real-world datasets. In the first dataset, you explored different methods of data processing. Some of the key methods implemented were for converting from wide format to long format, merging two DataFrames, and imputing missing data using the interpolate method.
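As a brief reminder of two of these methods, the following sketch merges two small DataFrames and fills a gap with interpolate. The frames, column names, and values are hypothetical and are not the first case study's data; linear interpolation is simply the pandas default and is assumed here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly sales with one missing value.
sales = pd.DataFrame({
    "month": [1, 2, 3, 4],
    "sales": [100.0, np.nan, 140.0, 160.0],
})

# Hypothetical lookup table sharing the "month" key.
regions = pd.DataFrame({
    "month": [1, 2, 3, 4],
    "region": ["N", "N", "S", "S"],
})

# Merge the two DataFrames on their common key.
merged = sales.merge(regions, on="month", how="left")

# Fill the gap using the default linear interpolation between neighbors.
merged["sales"] = merged["sales"].interpolate()

print(merged)
```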

With the second dataset, you practiced preprocessing tasks before plotting, such as grouping and aggregation, and converting continuous data into categorical data using binning. You also answered questions about the data using line plots and bar charts.
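Binning followed by grouping can be sketched as follows. The data, the bin edges, and the labels are hypothetical choices made for illustration; pd.cut is one standard way to convert a continuous column into categories.

```python
import pandas as pd

# Hypothetical continuous data.
trips = pd.DataFrame({
    "age": [22, 35, 47, 58, 63, 29],
    "spend": [20.0, 35.0, 50.0, 40.0, 30.0, 25.0],
})

# Convert the continuous "age" column into three categories.
# Bins are right-inclusive by default: (0, 30], (30, 50], (50, 100].
trips["age_group"] = pd.cut(
    trips["age"],
    bins=[0, 30, 50, 100],
    labels=["young", "middle", "senior"],
)

# Group by the new category and aggregate.
avg_spend = trips.groupby("age_group", observed=True)["spend"].mean()

print(avg_spend)
```

The aggregated Series is small enough to plot directly, for example as a bar chart.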

Using the third dataset, you extracted geolocations from latitude and longitude information. After extracting geolocation information, you also answered some questions on the service level of bus routes.

Finally, with the fourth dataset, you used different methods to preprocess data to build a classification model. You should now be able to confidently tackle most data...

Authors (4)

Blaine Bateman

Blaine Bateman has more than 35 years of experience working with various industries from government R&D to startups to $1B public companies. His experience focuses on analytics including machine learning and forecasting. His hands-on abilities include Python and R coding, Keras/Tensorflow, and AWS & Azure machine learning services. As a machine learning consultant, he has developed and deployed actual ML models in industry.

Saikat Basak

Saikat Basak is a data scientist and a passionate programmer. Having worked with multiple industry leaders, he has a good understanding of problem areas that can potentially be solved using data. Apart from being a data guy, he is also a science geek and loves to explore new ideas in the frontiers of science and technology.

Thomas V. Joseph

Thomas V. Joseph is a data science practitioner, researcher, trainer, mentor, and writer with more than 19 years of experience. He has extensive experience in solving business problems using machine learning toolsets across multiple industry segments.

William So

William So is a Data Scientist with both a strong academic background and extensive professional experience. He is currently the Head of Data Science at Douugh and also a Lecturer for the Master of Data Science and Innovation at the University of Technology Sydney. During his career, he has covered the end-to-end spectrum of data analytics, from ML to Business Intelligence, helping stakeholders derive valuable insights and achieve results that benefit the business. William is a co-author of "The Applied Artificial Intelligence Workshop", published by Packt.