You're reading from  Data Engineering with Python

Product type: Book
Published in: Oct 2020
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781839214189
Edition: 1st Edition
Author: Paul Crickard

Paul Crickard authored a book on the Leaflet JavaScript module. He has been programming for over 15 years and has focused on GIS and geospatial programming for 7 years. He spent 3 years working as a planner at an architecture firm, where he combined GIS with Building Information Modeling (BIM) and CAD. Currently, he is the CIO at the 2nd Judicial District Attorney's Office in New Mexico.

Chapter 5: Cleaning, Transforming, and Enriching Data

In the previous two chapters, you learned how to build data pipelines that read from and write to files and databases. In many instances, these skills alone will enable you to build production data pipelines. For example, you can now read files from a data lake and insert them into a database. Sometimes, however, you will need to do something with the data after extraction but prior to loading: you will need to clean it. Cleaning is a vague term. More specifically, you will need to check the validity of the data and answer questions such as the following: Is it complete? Are the values within the proper ranges? Are the columns the proper type? Are all the columns useful?

In this chapter, you will learn the basic skills needed to perform exploratory data analysis. Once you have an understanding of the data, you will use that knowledge to fix common data problems that...

Performing exploratory data analysis in Python

Before you can clean your data, you need to know what your data looks like. As a data engineer, you are not the domain expert and are not the end user of the data, but you should know what the data will be used for and what valid data would look like. For example, you do not need to be a demographer to know that an age field should not be negative, and the frequency of values over 100 should be low.
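
The age example above can be turned into a few quick checks. The following is a minimal sketch using pandas; the sample values and the column name age are invented for illustration and do not come from the e-scooter data.

```python
import pandas as pd

# Hypothetical sample data; the "age" column and its values are assumptions.
people = pd.DataFrame({"age": [34, -2, 51, 107, 28, None]})

# Is it complete? Count the missing values in each column.
missing_counts = people.isnull().sum()

# Are the values within the proper range? A negative age is invalid.
negative_ages = people[people["age"] < 0]

# Ages over 100 are possible, but they should make up a small share of rows.
share_over_100 = (people["age"] > 100).mean()

# Is the column the proper type? A numeric dtype is expected here.
age_dtype = people["age"].dtype

print(missing_counts)
print(negative_ages)
print(f"Share of ages over 100: {share_over_100:.2%}")
print(age_dtype)
```

None of these checks requires domain expertise; they only encode what valid data should look like.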

Downloading the data

In this chapter, you will use real e-scooter data from the City of Albuquerque. The data contains trips taken using e-scooters from May to July 22, 2019. You will need to download the e-scooter data from https://github.com/PaulCrickard/escooter/blob/master/scooter.csv. The repository also contains the original Excel file as well as some other summary files provided by the City of Albuquerque.
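
Once the file is downloaded, loading it might look like the sketch below. It assumes the CSV has been saved locally; to keep the example runnable without the file, it reads a small in-memory stand-in instead. The helper name load_scooter_data, the sample rows, and every column name other than region_id are assumptions made for illustration.

```python
import io

import pandas as pd


def load_scooter_data(source):
    """Read the e-scooter CSV from a local path or a file-like object."""
    return pd.read_csv(source)


# In-memory stand-in for the downloaded file (invented rows and columns).
sample_csv = io.StringIO(
    "trip_id,region_id,duration,end_location_name\n"
    "1,202,0:07:03,Central Ave\n"
    "2,202,0:12:40,\n"
)

df = load_scooter_data(sample_csv)
print(df.shape)
print(df.columns.tolist())
```

With the real file in the working directory, the call would presumably be load_scooter_data('scooter.csv'). Note that the link above points to the GitHub page for the file, so the CSV itself has to be saved from there before pandas can read it.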

Basic data exploration

Before you can clean your data, you have to know what your data looks like. The process of understanding...
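
A first look at a DataFrame usually involves only a handful of pandas calls. The sketch below runs them on a small invented DataFrame; only the region_id column name is taken from this chapter, and the rest of the columns and values are assumptions.

```python
import pandas as pd

# Invented sample rows standing in for the e-scooter data.
df = pd.DataFrame({
    "trip_id": [1, 2, 3],
    "region_id": [202, 202, 202],
    "duration_minutes": [7.0, 12.5, 3.2],
    "end_location_name": ["Central Ave", None, "Lomas Blvd"],
})

print(df.head())      # the first rows, to see what a record looks like
print(df.dtypes)      # the type pandas inferred for each column
print(df.describe())  # summary statistics for the numeric columns

# How often does each value occur in a single column?
print(df["region_id"].value_counts())

# A column with one distinct value carries no information for analysis.
unique_regions = df["region_id"].nunique()
print(unique_regions)
```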

Handling common data issues using pandas

Your data may feel special: it is unique, you have created the world's best systems for collecting it, and you have done everything you can to ensure it is clean and accurate. Congratulations! But your data will almost certainly have some problems, and these problems are neither special nor unique; they are probably a result of your systems or data entry. The e-scooter dataset is collected using GPS with little to no human input, yet there are end locations that are missing. How is it possible that a scooter was rented, ridden, and stopped, yet the data doesn't know where it stopped? It seems strange, yet here we are. In this section, you will learn how to deal with common data problems using the e-scooter dataset.
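
Missing end locations are a good first case. The sketch below shows one way to find and handle missing values with pandas, using invented rows; the column names and the 'Unknown' placeholder are assumptions, and whether to drop or fill depends on how the data will be used.

```python
import pandas as pd

# Invented rows; two trips have no recorded end location.
df = pd.DataFrame({
    "trip_id": [1, 2, 3, 4],
    "start_location_name": ["A St", "B St", "C St", "D St"],
    "end_location_name": ["Central Ave", None, "Lomas Blvd", None],
})

# Count the missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Option 1: drop the rows where the end location is missing.
dropped = df.dropna(subset=["end_location_name"])

# Option 2: keep the rows and fill the gap with a placeholder value.
filled = df.fillna({"end_location_name": "Unknown"})

print(len(dropped), len(filled))
```

Both calls return a new DataFrame and leave df unchanged, so the two options can be compared before committing to one.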

Drop rows and columns

Before you modify any fields in your data, you should first decide whether you are going to use all the fields. Looking at the e-scooter data, there is a field named region_id. This field is a code used...
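
If you decide that a field such as region_id is not needed, removing it is a single call. The following is a minimal sketch on invented rows; the other column names are assumptions, and the decision to drop a field is yours to make for your own data.

```python
import pandas as pd

# Invented rows; region_id holds the same code on every row.
df = pd.DataFrame({
    "trip_id": [1, 2, 3],
    "region_id": [202, 202, 202],
    "duration_minutes": [7.0, 12.5, 3.2],
})

# Drop a column by name. The result is a new DataFrame.
trimmed = df.drop(columns=["region_id"])

# Drop a row by its index label.
without_first = df.drop(index=[0])

print(trimmed.columns.tolist())
print(len(without_first))
```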

Cleaning data using Airflow

Now that you can clean your data in Python, you can create functions to perform different tasks. By combining the functions, you can create a data pipeline in Airflow. The following example will clean data, and then filter it and write it out to disk.

Starting with the same Airflow code you have used in the previous examples, set up the imports and the default arguments, as shown:

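# Standard library
import datetime as dt
from datetime import timedelta

# Airflow, using the Airflow 1.x operator import paths
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

import pandas as pd

# Arguments applied to every task in the DAG
default_args = {
    'owner': 'paulcrickard',
    'start_date': dt.datetime(2020, 4, 13),
    'retries': 1,                         # retry a failed task once
    'retry_delay': timedelta(minutes=5),  # wait 5 minutes before retrying
}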

Now you can write the functions that will perform the cleaning tasks. First...
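
As a rough sketch of what cleaning and filtering functions of this kind might look like, the example below uses plain pandas and does not import Airflow. It is not the book's code: the function names clean_scooter and filter_data, the file names, the date range, and all column names other than region_id are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd


def clean_scooter(in_path, out_path):
    """Drop an unneeded column, normalize column names, and write the result."""
    df = pd.read_csv(in_path)
    df = df.drop(columns=["region_id"])
    df.columns = [name.lower() for name in df.columns]
    df["started_at"] = pd.to_datetime(df["started_at"])
    df.to_csv(out_path, index=False)


def filter_data(in_path, out_path, from_date, to_date):
    """Keep trips that started in [from_date, to_date) and write them out."""
    df = pd.read_csv(in_path, parse_dates=["started_at"])
    start, end = pd.Timestamp(from_date), pd.Timestamp(to_date)
    in_range = df[(df["started_at"] >= start) & (df["started_at"] < end)]
    in_range.to_csv(out_path, index=False)
    return len(in_range)


# Run both steps on invented data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "scooter.csv")
    cleaned = os.path.join(tmp, "cleanscooter.csv")
    filtered = os.path.join(tmp, "may23-june3.csv")

    pd.DataFrame({
        "trip_id": [1, 2, 3],
        "region_id": [202, 202, 202],
        "DURATION": ["0:07:03", "0:12:40", "0:03:15"],
        "started_at": [
            "2019-05-21 18:28:00",
            "2019-05-25 09:10:00",
            "2019-06-10 12:00:00",
        ],
    }).to_csv(raw, index=False)

    clean_scooter(raw, cleaned)
    kept = filter_data(cleaned, filtered, "2019-05-23", "2019-06-03")
    result = pd.read_csv(filtered)

print(kept)
print(result)
```

Each function reads a file and writes a file, so the steps share no state in memory. In a DAG, functions like these could be passed to a PythonOperator as its python_callable, with the paths supplied through op_kwargs.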

Summary

In this chapter, you learned how to perform basic EDA with an eye toward finding errors or problems within your data. You then learned how to clean your data and fix common data issues. With this set of skills, you built a data pipeline in Apache Airflow.

In the next chapter, you will walk through a project, building a 311 data pipeline and dashboard in Kibana. This project will utilize all of the skills you have acquired up to this point and will introduce a number of new skills – such as building dashboards and making API calls.
