
You're reading from The Data Wrangling Workshop - Second Edition

Product type: Book
Published in: Jul 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781839215001
Edition: 2nd Edition
Authors (3):
Brian Lipp

Brian Lipp is a Technology Polyglot, Engineer, and Solution Architect with a wide skillset in many technology domains. His programming background has ranged from R, Python, and Scala, to Go and Rust development. He has worked on Big Data systems, Data Lakes, data warehouses, and backend software engineering. Brian earned a Master of Science, CSIS from Pace University in 2009. He is currently a Sr. Data Engineer working with large Tech firms to build Data Ecosystems.

Shubhadeep Roychowdhury

Shubhadeep Roychowdhury holds a master's degree in computer science from West Bengal University of Technology and certifications in machine learning from Stanford. He works as a senior software engineer at a Paris-based cybersecurity startup, where he is applying state-of-the-art computer vision and data engineering algorithms and tools to develop cutting-edge products. He often writes about algorithm implementation in Python and similar topics.

Dr. Tirthajyoti Sarkar

Dr. Tirthajyoti Sarkar works as a senior principal engineer in the semiconductor technology domain, where he applies cutting-edge data science/machine learning techniques for design automation and predictive analytics. He writes regularly about Python programming and data science topics. He holds a Ph.D. from the University of Illinois and certifications in artificial intelligence and machine learning from Stanford and MIT.


4. A Deep Dive into Data Wrangling with Python

Overview

This chapter covers pandas DataFrames in depth and teaches you how to perform subsetting, filtering, and grouping on them. You will be able to apply Boolean filtering and indexing to a DataFrame to choose specific elements from it. Later in the chapter, you will learn how to perform JOIN operations in pandas that are analogous to the SQL JOIN command. By the end of this chapter, you will be able to identify missing or corrupted data and choose whether to drop it or to fill it in using imputation techniques.

Introduction

In the previous chapter, we learned how to use the pandas, numpy, and matplotlib libraries while handling various data types. In this chapter, we will learn about several advanced operations involving pandas DataFrames and numpy arrays. We will be working with several powerful DataFrame operations, including subsetting, filtering, grouping, checking uniqueness, and dealing with missing data, among others. These techniques are useful in almost any work with data. When we want to look at a portion of the data, we must subset, filter, or group it. Pandas can also compute descriptive statistics for a dataset, which lets us start shaping our perception of the data. Ideally, when we have a dataset, we want it to be complete, but in reality, there is often missing or corrupt data. This can happen for a variety of reasons that we can't control, such as user error and sensor malfunction. Pandas has built-in functionalities...

Subsetting, Filtering, and Grouping

One of the most important aspects of data wrangling is to curate the data carefully from the deluge of streaming data that pours into an organization or business entity from various sources. Lots of data is not always a good thing; rather, data needs to be useful and of high quality to be effectively used in downstream activities of a data science pipeline, such as machine learning and predictive model building. Moreover, one data source can be used for multiple purposes, and this often requires different subsets of data to be processed by a data wrangling module. This is then passed on to separate analytics modules.

For example, let's say you are doing data wrangling on US state-level economic output. It is a fairly common scenario that one machine learning model may require data for large and populous states (such as California and Texas), while another model demands processed data for small and sparsely populated states (such as Montana...
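As a rough illustration of the state-level scenario, the sketch below subsets, filters, and groups a tiny DataFrame. The column names, the population threshold, and all figures are made-up placeholders chosen for this example; they are not real economic data and do not come from the book's exercises.

```python
import pandas as pd

# Hypothetical, made-up figures used only to show the mechanics.
states = pd.DataFrame({
    "state": ["California", "Texas", "Montana", "Wyoming"],
    "region": ["West", "South", "West", "West"],
    "population_m": [39.0, 29.0, 1.0, 0.5],
    "output_bn": [3000.0, 1800.0, 50.0, 40.0],
})

# Subsetting: keep only the columns a downstream module needs.
subset = states[["state", "population_m"]]

# Boolean filtering: a True/False mask selects the matching rows.
large = states[states["population_m"] > 10]
small = states[states["population_m"] <= 10]

# Grouping: total output per region.
output_by_region = states.groupby("region")["output_bn"].sum()

print(large["state"].tolist())
print(small["state"].tolist())
print(output_by_region)
```

Each model can then be handed its own subset (`large` or `small`) while both are derived from the same source DataFrame.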

Detecting Outliers and Handling Missing Values

Outlier detection and handling missing values fall under the subtle art of data quality checking. A modeling or data mining process is fundamentally a complex series of computations whose output quality largely depends on the quality and consistency of the input data fed into it. The responsibility of maintaining and gatekeeping that quality often falls on the shoulders of a data wrangling team.

Apart from the obvious issue of poor-quality data, missing data can sometimes wreak havoc on the machine learning (ML) model downstream. A few ML models, such as Bayesian learning, are inherently robust to outliers and missing data, but common techniques such as decision trees and random forests struggle with missing data because the fundamental splitting strategy employed by these techniques depends on an individual piece of data and not a cluster. Therefore, it is almost always imperative to impute missing data before handing it over to...
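A minimal sketch of detecting, dropping, and imputing missing values with pandas follows. The `sensor` and `reading` columns are hypothetical, and mean imputation is only one of several possible strategies; the right choice depends on the data and the downstream model.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with two missing values.
df = pd.DataFrame({
    "sensor": ["a", "b", "c", "d"],
    "reading": [1.0, np.nan, 3.0, np.nan],
})

# Detect: count missing values per column.
missing_per_column = df.isnull().sum()

# Drop: remove rows where 'reading' is missing.
dropped = df.dropna(subset=["reading"])

# Impute: fill missing readings with the column mean
# (the mean ignores NaN values by default).
imputed = df.fillna({"reading": df["reading"].mean()})

print(missing_per_column)
print(dropped)
print(imputed)
```

Neither `dropna` nor `fillna` modifies `df` here; both return new DataFrames, so the original data stays available for comparison.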

Concatenating, Merging, and Joining

Merging and joining tables or datasets are highly common operations in the day-to-day job of a data wrangling professional. These operations are akin to the JOIN query in SQL for relational database tables. Often, the key data is present in multiple tables, and those records need to be brought into one combined table that matches on that common key. This is an extremely common operation in any type of sales or transactional data, and therefore must be mastered by a data wrangler. The pandas library offers nice and intuitive built-in methods to perform various types of JOIN queries involving multiple DataFrame objects.
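To make the SQL analogy concrete, here is a small sketch using `pd.merge` on a shared key. The `customers` and `orders` tables and their columns are invented for illustration; the `how` argument selects the join type, much like INNER, LEFT, and FULL OUTER JOIN in SQL.

```python
import pandas as pd

# Hypothetical tables that share the key column 'customer_id'.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cy"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 3, 4],
    "amount": [10, 20, 30, 40],
})

# Inner join: only keys present in both tables.
inner = pd.merge(customers, orders, on="customer_id", how="inner")

# Left join: every customer, with NaN where no order exists.
left = pd.merge(customers, orders, on="customer_id", how="left")

# Outer join: every key from either table.
outer = pd.merge(customers, orders, on="customer_id", how="outer")

print(len(inner), len(left), len(outer))
```

Customer 2 has no orders and order key 4 has no customer record, so the three joins return different row counts.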

Exercise 4.07: Concatenation in Datasets

In this exercise, we will concatenate DataFrames along various axes (rows or columns).

Note

The superstore dataset file can be found here: https://packt.live/3dcVnMs.

This is a very useful operation as it allows you to grow a DataFrame as the new data comes in or new feature...
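The sketch below shows concatenation along both axes with `pd.concat`. It uses small hypothetical DataFrames rather than the superstore dataset, so the column names and values are assumptions made for this example.

```python
import pandas as pd

# Hypothetical batches of records with identical columns.
first = pd.DataFrame({"id": [1, 2], "sales": [100, 200]})
second = pd.DataFrame({"id": [3, 4], "sales": [300, 400]})

# axis=0 stacks rows, which suits new records arriving over time.
# ignore_index=True rebuilds a clean 0..n-1 index.
rows = pd.concat([first, second], axis=0, ignore_index=True)

# axis=1 places DataFrames side by side, aligned on the index,
# which suits adding new feature columns.
extra = pd.DataFrame({"profit": [10, 20]})
cols = pd.concat([first, extra], axis=1)

print(rows.shape)
print(cols.shape)
```

Because column-wise concatenation aligns on the index, DataFrames with mismatched indexes would produce NaN values in the combined result.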

Useful Methods of Pandas

In this section, we will discuss some small utility functions that are offered by pandas so that we can work efficiently with DataFrames. They don't fall under any particular group of functions, so they are mentioned here under the Miscellaneous category. Let's discuss these miscellaneous methods in detail.

Randomized Sampling

In this section, we will discuss randomly sampling data from our DataFrames. This is a very common task in a variety of pipelines, one of which is machine learning. Sampling is often used in machine learning data-wrangling pipelines when choosing which data to train on and which data to test against. Sampling a random fraction of a big DataFrame is also useful because it lets us practice other methods on it and test our ideas. If you have a database table of 1 million records, then it is not computationally efficient to run your test scripts on the full table.

However, you may also not want to extract only the first...
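A short sketch of random sampling with `DataFrame.sample` is shown below. The 1,000-row DataFrame, the 80/20 split, and the seed value are arbitrary choices made for illustration, not recommendations from the book.

```python
import pandas as pd

# Hypothetical table of 1,000 records.
df = pd.DataFrame({"record_id": range(1000)})

# Draw a fixed number of rows; random_state makes the draw reproducible.
sample_n = df.sample(n=5, random_state=42)

# Draw a fraction of the rows instead of a fixed count.
sample_frac = df.sample(frac=0.1, random_state=42)

# A simple train/test split: sample 80% for training,
# then keep the remaining rows for testing.
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

print(len(sample_n), len(sample_frac), len(train), len(test))
```

Unlike taking the first few rows, a random sample is not biased by the order in which the records happen to be stored.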

Summary

In this chapter, we deep-dived into the pandas library to learn advanced data wrangling techniques. We started with some advanced subsetting and filtering on DataFrames and rounded this off by learning about Boolean indexing and conditionally selecting a subset of data. We also covered how to set and reset the index of a DataFrame, especially while initializing.

Next, we learned about a particular topic that has a deep connection with traditional relational database systems – the groupby method. Then, we deep-dived into an important skill for data wrangling – checking for and handling missing data. We showed you how pandas helps in handling missing data using various imputation techniques. We also discussed methods for dropping missing values. Furthermore, methods and usage examples of concatenation and merging DataFrame objects were shown. We saw the join method and how it compares to a similar operation in SQL.

Lastly, miscellaneous useful methods on DataFrames...

