
You're reading from The Data Analysis Workshop

Product type: Book
Published in: Jul 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781839211386
Edition: 1st
Authors (3):
Gururajan Govindan

Gururajan Govindan is a data scientist, intrapreneur, and trainer with more than seven years of experience working across domains such as finance and insurance. He is also an author of The Data Analysis Workshop, a book focusing on data analytics. He is well known for his expertise in data-driven decision-making and machine learning with Python.
Read more about Gururajan Govindan

Shubhangi Hora

Shubhangi Hora is a data scientist, Python developer, and published writer. With a background in computer science and psychology, she is particularly passionate about healthcare-related AI, including mental health. Shubhangi is also a trained musician.
Read more about Shubhangi Hora

Konstantin Palagachev

Konstantin Palagachev holds a Ph.D. in applied mathematics and optimization, with an interest in operations research and data analysis. He is recognized for his passion for delivering data-driven solutions and expertise in the area of urban mobility, autonomous driving, insurance, and finance. He is also a devoted coach and mentor, dedicated to sharing his knowledge and passion for data science.
Read more about Konstantin Palagachev


10. Analyzing Air Quality

Overview

In this chapter, you will search for and handle missing values. You will also carry out feature engineering and exploratory data analysis, design visualizations, and then summarize the insights provided by your data. By the end of this chapter, you will have a firm grasp of various data analysis techniques as applied to a specific dataset: the Beijing Multi-Site Air Quality dataset.

Introduction

In the previous chapter, we performed data analysis techniques on a dataset that described the relationship between temperature, humidity, and the energy consumed by household appliances; that is, how much the appliances were used depending on the weather.

In this final chapter, we will continue the same journey of exploratory data analysis, this time on a dataset describing the air quality in multiple localities of Beijing, China. We will use several data analysis techniques (many of which you may have encountered in previous chapters) to clean the dataset and observe trends, such as which time of year and which year had the highest concentrations of pollutants.

About the Dataset

The dataset we are using in this chapter has been obtained from the UCI repository of datasets. There are 12 separate CSV files consisting of approximately 35,000 entries each. Each file contains data specific to one locality. In total, across all 12 files, there are around 420,000 instances in the dataset.

The attributes include the concentrations of a variety of pollutants found in the air, such as sulphur dioxide and ozone, as well as the temperature and pressure. The data was collected over four years, from March 1, 2013 to February 28, 2017.

Let's begin our data analysis process by taking a closer look at the data.
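Before any analysis, the 12 per-station files need to be read and combined into a single DataFrame. The sketch below first writes two tiny stand-in CSV files (the filenames and columns here are illustrative, not the dataset's real schema) and then combines them with `glob` and `pd.concat`, exactly as you would with the 12 real files:

```python
import glob
import os
import tempfile

import pandas as pd

# Create two tiny CSV files that stand in for the 12 per-station
# files of the real dataset (columns here are hypothetical).
data_dir = tempfile.mkdtemp()
for station in ["Aotizhongxin", "Changping"]:
    stand_in = pd.DataFrame({"year": [2013, 2013],
                             "SO2": [4.0, 5.0],
                             "station": [station, station]})
    stand_in.to_csv(os.path.join(data_dir, f"{station}.csv"), index=False)

# Read every CSV in the directory and stack the frames vertically,
# resetting the index so rows are numbered continuously.
paths = sorted(glob.glob(os.path.join(data_dir, "*.csv")))
air = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
print(air.shape)  # (4, 3): one row per entry across both files
```

With the real files, the same pattern yields the roughly 420,000 rows mentioned above.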

Note

To find out more about the dataset, visit https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data#.

For further information on this topic, refer to the following: Zhang, S., Guo, B., Dong, A., He, J., Xu, Z., and Chen, S.X. (2017) Cautionary Tales on Air Quality Improvement in Beijing. Proceedings...

Outliers

Recall that an outlier is a data point that differs markedly from the majority of data points. When visualized, such a point lies far away from the rest, hence the name outlier. For example, if you have a set of 12 numbers, 11 of which lie between 1 and 6 while one has the value 37, that last point is an outlier because it lies far from all the others.
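The 1.5 × IQR rule, which is the same rule a boxplot uses to place its whiskers, can flag such a point numerically. A minimal sketch using 12 numbers like those in the example (the exact values are made up):

```python
import pandas as pd

# 12 values: 11 between 1 and 6, plus one extreme value, 37.
values = pd.Series([1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 37])

# Quartiles and interquartile range (IQR).
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# A boxplot's whiskers typically extend 1.5 * IQR beyond the
# quartiles; anything outside that range is drawn as an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())  # [37]
```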

Boxplots are a type of visualization that is particularly well suited to spotting outliers. They summarize a lot of information about the data: the median, the first and third quartiles, the minimum and maximum values, and the presence of any outliers.

Let's do a quick exercise based on the example of 12 numbers to understand how to spot an outlier from a boxplot.

Exercise 10.02: Identifying Outliers

In this exercise, you will create a small DataFrame with only 12 rows, each consisting of a random number. You will then plot this column...

Missing Values

Most real-world datasets have instances with values that are NaN or blank. These are missing values. The significance of missing values depends on multiple factors: the number of missing values, the number of features that have missing values, the tasks that are going to be carried out on the data, and so on.

If the data is going to be fed into a machine learning model, then missing values should be dealt with. While some algorithms can learn and predict from data containing missing values, it is generally better to train a model on complete data, which helps the model learn relationships and patterns accurately.

Additionally, if there are many missing values or missing values in significant features of a dataset, they should also be dealt with.

There are two main ways to deal with missing values: deleting the instances or columns that have them (if they aren't significant), or imputing them with other values.
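In pandas, both strategies are short one-liners. A minimal sketch on a small hypothetical DataFrame (the column names are illustrative only):

```python
import numpy as np
import pandas as pd

# A hypothetical frame with a few missing readings.
air = pd.DataFrame({"SO2": [4.0, np.nan, 6.0, 8.0],
                    "TEMP": [1.2, 2.4, np.nan, 3.1]})

# Count the missing values in each column.
print(air.isna().sum())

# Strategy 1: delete the instances (rows) that contain missing values.
dropped = air.dropna()

# Strategy 2: impute the missing values, here with each column's mean.
imputed = air.fillna(air.mean())
print(imputed["SO2"].tolist())  # [4.0, 6.0, 6.0, 8.0]
```

Dropping rows loses data, so imputation is usually preferred when the affected feature is significant.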

...

Heatmaps

Heatmaps are a type of visualization that display correlations between different features of a dataset. Correlations can be positive or negative, and strong or weak.

The features are set as both the rows and the columns, and each cell is color-coded based on the correlation value of that pair of features. Values close to +1 indicate a strong positive correlation, while values close to -1 indicate a strong negative correlation.

Exercise 10.05: Checking for Correlations between Features

In this exercise, you will plot a heatmap to observe whether there are any correlations between features of the new_air DataFrame:

  1. Import numpy as np:
    import numpy as np
  2. Create a variable called corr that will store the correlations between the features of new_air. Calculate these correlations by applying the .corr() function on new_air:
    corr = new_air.corr()
  3. Create a mask for the upper triangle of the heatmap (a correlation matrix is symmetric, so the upper triangle is redundant). Build an all-False array with the zeros_like() function, with corr as the template, and set the upper-triangle entries to True. Note that np.bool has been removed from recent versions of NumPy, so use the built-in bool instead:
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
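Put together on a small stand-in for new_air (the columns are illustrative), the steps above produce a correlation matrix and a mask for its redundant upper triangle. The pair would then typically be passed to seaborn's heatmap(corr, mask=mask) to draw the color-coded grid:

```python
import numpy as np
import pandas as pd

# A tiny stand-in for new_air with three numeric features.
new_air = pd.DataFrame({"SO2": [4.0, 5.0, 6.0, 8.0],
                        "NO2": [7.0, 9.0, 11.0, 15.0],
                        "TEMP": [3.0, 1.0, 2.0, 0.0]})

# Pairwise correlations between all numeric columns.
corr = new_air.corr()

# All-False boolean array, then flag the upper triangle (diagonal
# included) so the symmetric half of the matrix can be hidden.
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# NO2 is a linear function of SO2 here, so their correlation is 1.
print(corr.round(2))
```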

Summary

In this chapter, we played around with data pertaining to the quality of air in multiple localities of Beijing, China. We observed trends over different measures of time to see how the concentration of various pollutants differed.

In this book, we looked at several data cleaning, preparation, analysis, and visualization techniques and applied them to a diverse range of datasets from a variety of domains. We made informed decisions to delete or impute instances based on the data available, and tweaked existing features to create new ones by converting them into different formats and breaking them down into several features.

These processes helped us to derive additional insights from our data. Additionally, we learned to ensure that we ask our data the right questions and understand what information it can and cannot provide us with. It is important not to have unreasonable expectations of your data.

You are now equipped with the tools and knowledge required to...
