You're reading from Mastering pandas - Second Edition

Product type: Book
Published in: Oct 2019
ISBN-13: 9781789343236
Pages: 674
Edition: 2nd Edition
Languages: Python
Concepts: Scientific Computing
Author (1): Ashish Kumar

Special Data Operations in pandas

pandas provides an array of special operators for generating, aggregating, transforming, reading, and writing data of various types, such as numbers, strings, dates, timestamps, and time series. The basic operators in pandas were introduced in the previous chapter. In this chapter, we continue that discussion and elaborate on the methods, syntax, and usage of some of these operators.

Reading this chapter will allow you to perform the following tasks with confidence:

Writing custom functions and applying them on a column or an entire DataFrame
Understanding the nature of missing values and handling them
Transforming and performing calculations on series using functions
Miscellaneous numeric operations on data

Let's delve into it right away. For the most part, we will generate our own data to demonstrate the methods.

The following...

Writing and applying one-liner custom functions

Python provides lambda functions, which let us write one-line custom functions to perform tasks on one or more columns of a DataFrame or on the entire DataFrame. Lambda functions behave like traditional functions defined with the def keyword, but they are anonymous and limited to a single expression, which makes them convenient to pass to apply on DataFrame columns. Their syntax is compact, much as a list comprehension is a compact alternative to a for loop over a list. Let's look at how lambda functions are defined and applied.
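As a minimal sketch of the idea, the following compares a def-based function with an equivalent lambda and applies both to a column, then applies a lambda row by row. The DataFrame, its column names, and the add_tax helper are invented for this illustration and do not come from the chapter's data.

```python
import pandas as pd

# Hypothetical data, invented for this illustration
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})


# A traditional named function defined with def (hypothetical helper)
def add_tax(price):
    return price * 1.1


# The same logic written inline as a lambda
with_tax_def = df["price"].apply(add_tax)
with_tax_lambda = df["price"].apply(lambda price: price * 1.1)

# With axis=1, the lambda receives one row at a time
df["total"] = df.apply(lambda row: row["price"] * row["quantity"], axis=1)
```

Both apply calls on the price column give the same result; the lambda simply avoids naming a function that is used only once.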

lambda and apply

In order to see how the lambda keyword can be used, we need to create some data. We'll create data containing date columns. Handling date columns is...

Handling missing values

Missing values and NaNs are commonplace in datasets and need to be taken care of before the data can be put to any use. In the upcoming sections, we will look into the various sources and types of missing values, as well as how to handle them.
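To make the idea of handling concrete, here is a small sketch that counts, drops, and fills missing entries using standard pandas methods (isna, dropna, fillna). The data is hypothetical, and these are not necessarily the techniques the following sections go on to use.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing string and one missing number
df = pd.DataFrame(
    {"city": ["Delhi", None, "Pune"], "temp": [31.5, np.nan, 28.0]}
)

# Count the missing entries in each column
missing_per_column = df.isna().sum()

# Option 1: drop every row that contains at least one missing value
dropped = df.dropna()

# Option 2: fill each column with a replacement chosen for that column
filled = df.fillna({"city": "unknown", "temp": df["temp"].mean()})
```

Dropping is the simplest choice but discards the whole row; filling keeps the row at the cost of introducing an assumed value.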

Sources of missing values

A missing value can enter a dataset because of or during the following processes.

Data extraction

This covers data that is available at the source but that we missed while extracting it. It involves engineering tasks such as the following:

Scraping from...

A survey of methods on series

Let's use the following DataFrame to understand some methods and functions that can be used with a series:

import pandas as pd

sample_df = pd.DataFrame([["Pulp Fiction", 62, 46], ["Forrest Gump", 38, 46], ["Matrix", 26, 39], ["It's a Wonderful Life", 6, 0], ["Casablanca", 5, 6]], columns=["Movie", "Wins", "Nominations"])
sample_df

The following is the output:

Sample DataFrame—IMDB database

The items() method

The items() method provides a means of iterating over a series or DataFrame. For a series, it lazily yields each value together with its index label as a tuple; for a DataFrame, it yields each column name together with that column as a series. The results...
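The behaviour of items() can be sketched on the sample DataFrame defined earlier. The variable names below are chosen for this illustration only.

```python
import pandas as pd

sample_df = pd.DataFrame(
    [
        ["Pulp Fiction", 62, 46],
        ["Forrest Gump", 38, 46],
        ["Matrix", 26, 39],
        ["It's a Wonderful Life", 6, 0],
        ["Casablanca", 5, 6],
    ],
    columns=["Movie", "Wins", "Nominations"],
)

# On a series, items() returns a lazy iterator of (index, value) tuples
wins_items = sample_df["Wins"].items()
first_pair = next(wins_items)   # consumes only the first tuple
remaining = list(wins_items)    # materializes the rest

# On a DataFrame, items() yields (column name, column as a series) pairs
column_names = [name for name, column in sample_df.items()]
```

Because the iterator is lazy, nothing is computed until a value is requested, which is why next() can pull a single pair without building the full list.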

pandas string methods

This section covers the pandas string methods, which are useful when dealing with messy text data. They help to clean text, structure it, segment it, and search it for important chunks. Let's look into these methods and find out what each of them does.

upper(), lower(), capitalize(), title(), and swapcase()

String methods such as upper(), lower(), capitalize(), title(), and swapcase() help when we wish to change the case of all the string elements in an entire series. The upper and lower methods convert each string into uppercase or lowercase. The following command converts a series into uppercase:

sample_df["Movie"].str.upper()

The following is the output...
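For comparison, here is a self-contained sketch that runs all five case methods on three of the movie titles from the sample data. The variable names are chosen for this illustration.

```python
import pandas as pd

movies = pd.Series(["Pulp Fiction", "Forrest Gump", "It's a Wonderful Life"])

upper = movies.str.upper()              # every letter uppercase
lower = movies.str.lower()              # every letter lowercase
capitalized = movies.str.capitalize()   # first letter upper, rest lower
titled = movies.str.title()             # first letter of each word upper
swapped = movies.str.swapcase()         # upper becomes lower and vice versa
```

One quirk worth knowing: title() treats an apostrophe as a word boundary, just like Python's built-in str.title(), so "It's a Wonderful Life" becomes "It'S A Wonderful Life".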

Binary operations on DataFrames and series

Binary functions such as add, sub, mul, div, mod, and pow perform common arithmetic operations involving two DataFrames or series.

The following example shows the addition of two DataFrames. One of the DataFrames has the shape (2,3) while the other has the shape (1,3). The add function performs an elementwise addition. When a corresponding element is missing in any of the DataFrames, the missing values are filled with NaNs:

df_1 = pd.DataFrame([[1,2,3],[4,5,6]])
df_2 = pd.DataFrame([[6,7,8]])
df_1.add(df_2)

The following is the output:

Adding two DataFrames elementwise
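The alignment behaviour described above can be checked with a short sketch that repeats the two DataFrames and inspects the result row by row.

```python
import pandas as pd

df_1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
df_2 = pd.DataFrame([[6, 7, 8]])

total = df_1.add(df_2)

# Row 0 exists in both frames, so its cells are 1+6, 2+7, and 3+8
row_0 = total.loc[0].tolist()

# Row 1 exists only in df_1, so every cell in that row is NaN
row_1_all_missing = bool(total.loc[1].isna().all())

# The + operator performs the same aligned addition
same_as_operator = total.equals(df_1 + df_2)
```

The result is stored as floating point because NaN cannot be represented in an integer column.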

Instead of ending up with NaNs, we can substitute a value of our choice for the missing entries before the operation is performed by using the fill_value argument. Let's explore this through the mul function for multiplication:

df_1.mul(df_2, fill_value = 0)

The following is the output:

The fill_value parameter in binary operators...
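The effect of fill_value depends on the operation, which a brief sketch makes visible. It reuses the two DataFrames from the example and adds a second call with fill_value = 1 purely for comparison.

```python
import pandas as pd

df_1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
df_2 = pd.DataFrame([[6, 7, 8]])

# Missing row 1 of df_2 is treated as 0, so row 1 of the product is all zeros
product_zero = df_1.mul(df_2, fill_value=0)

# Treating the missing row as 1 leaves row 1 of df_1 unchanged instead
product_one = df_1.mul(df_2, fill_value=1)
```

For multiplication, 1 is the neutral fill value, whereas 0 plays that role for addition and subtraction.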

Binning values

The pandas cut() function bins values in a 1-dimensional array. Consider the following 1-dimensional array with 10 values. Let's group it into three bins:

import numpy as np
import pandas as pd

bin_data = np.array([1, 5, 2, 12, 3, 25, 9, 10, 11, 4])
pd.cut(bin_data, bins = 3)

The following is the output:

pandas cut function with three bins
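To see which bin each value lands in without reading interval labels, one option is to ask cut for integer bin codes with labels = False. The sketch below does this and counts the members of each bin; the variable names are chosen for this illustration.

```python
import numpy as np
import pandas as pd

bin_data = np.array([1, 5, 2, 12, 3, 25, 9, 10, 11, 4])

# Three equal-width bins spanning the data range of 1 to 25
binned = pd.cut(bin_data, bins=3)
n_bins = len(binned.categories)

# labels=False returns the position of each value's bin instead of intervals
codes = pd.cut(bin_data, bins=3, labels=False)

# Number of values that fall into bin 0, bin 1, and bin 2
counts = np.bincount(codes)
```

The bins are equal in width, not in population: six values fall into the first bin, three into the second, and only 25 into the third. The value 9 belongs to the first bin because the intervals are closed on the right by default.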

Each of the 10 elements is mapped to one of the three bins. The cut function maps the items to a bin and provides information about each bin. Instead of specifying the number of bins, the boundaries of the bins could also be provided in a sequence:

pd.cut(bin_data, bins = [0.5, 7, 10, 20, 30])

The following is the output:

pandas cut function with bin values
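A short sketch of the explicit-boundary form follows, again using integer codes for readability. The out_of_range array is a hypothetical addition that shows what happens to values outside the outermost edges.

```python
import numpy as np
import pandas as pd

bin_data = np.array([1, 5, 2, 12, 3, 25, 9, 10, 11, 4])
edges = [0.5, 7, 10, 20, 30]

# Four bins: (0.5, 7], (7, 10], (10, 20], and (20, 30]
codes = pd.cut(bin_data, bins=edges, labels=False)

# Hypothetical values below the first edge and above the last edge
out_of_range = pd.cut(np.array([0.2, 40.0]), bins=edges, labels=False)
```

Values that fall outside the supplied boundaries are not assigned to any bin and come back as NaN, so the edges should cover the full range of the data.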

The intervals for binning can be directly defined using the pandas interval_range function. Consider the following example, demonstrating the creation of a pandas IntervalIndex object:

interval = pd.interval_range...
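As an independent sketch of the same idea, the following builds an IntervalIndex and passes it to cut. The start, end, and freq values are assumptions made for this illustration, since the arguments used in the original example are not shown.

```python
import numpy as np
import pandas as pd

bin_data = np.array([1, 5, 2, 12, 3, 25, 9, 10, 11, 4])

# Assumed parameters: three intervals of width 10, closed on the right
intervals = pd.interval_range(start=0, end=30, freq=10)

# An IntervalIndex can be passed directly as the bins argument
binned = pd.cut(bin_data, bins=intervals)

# Position of the interval that each value was assigned to
codes = binned.codes.tolist()
```

Here the intervals are (0, 10], (10, 20], and (20, 30], so 10 falls into the first interval and 11 into the second.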

Using mathematical methods on DataFrames

Computations such as sum, mean, and median can be performed with ease on pandas DataFrames using the built-in mathematical methods in the pandas library. Let's make use of a subset of the sales data to explore the mathematical functions and methods in pandas. While applying these mathematical functions, it should be ensured that the selected columns are numeric. The following screenshot shows the data with five rows and three columns, all of which will be used in this section:

Sample sales data
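Since the sales data itself is not reproduced here, the sketch below uses a hypothetical stand-in with invented column names and values. It shows one way to restrict the calculation to numeric columns before calling sum, mean, and median.

```python
import pandas as pd

# Hypothetical stand-in for the sales data (values invented for illustration)
sales = pd.DataFrame(
    {
        "region": ["N", "S", "E", "W", "N"],
        "units": [10, 20, 30, 40, 100],
        "unit_price": [2.5, 3.0, 1.5, 4.0, 2.0],
    }
)

# Keep only the numeric columns so the methods below are well defined
numeric = sales.select_dtypes(include="number")

column_sums = numeric.sum()        # one total per column
column_means = numeric.mean()      # one average per column
column_medians = numeric.median()  # one median per column
row_sums = numeric.sum(axis=1)     # axis=1 aggregates across each row
```

In the units column, the mean (40) sits above the median (30) because the single large value of 100 pulls the average upward.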

The abs() function

The abs() function returns the absolute values of records in the DataFrame. For columns with complex values in the form x+yj, the absolute value is computed as the modulus, sqrt(x^2 + y^2):

abs_df...
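The following sketch applies abs() to a small hypothetical DataFrame with one real and one complex column. The data and names are invented for this illustration and are not the chapter's abs_df example.

```python
import pandas as pd

# Hypothetical data: one real-valued column and one complex-valued column
abs_demo = pd.DataFrame(
    {
        "real": [-3.0, 4.5, -1.0],
        "complex": [3 + 4j, -5 - 12j, 1j],
    }
)

abs_values = abs_demo.abs()

# Real values lose their sign; complex values become their modulus
real_result = abs_values["real"].tolist()
complex_result = abs_values["complex"].tolist()
```

For 3+4j, the modulus is sqrt(3^2 + 4^2) = 5, and for -5-12j it is sqrt(5^2 + 12^2) = 13.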

Summary

This chapter provided a collection of special methods that show the flexibility and usefulness of pandas. It has been like an illustrated glossary in which each function serves a distinct purpose. You should now have an idea of how to create and apply one-liner functions in pandas, and you should understand the concept of missing values and the methods that take care of them. The chapter is also a compendium of the miscellaneous methods that can be applied to a series and the numeric methods that can be applied to DataFrames.

In the next chapter, we will look at how to handle time series data and plot it using matplotlib. We will also look at manipulating time series data through rolling, resampling, shifting, lagging, and time element separation.