Hands-On Data Preprocessing in Python

By Roy Jafari

Early Access

This is an Early Access product. Early Access chapters haven’t received a final polish from our editors yet. Every effort has been made in the preparation of these chapters to ensure the accuracy of the information presented. However, the content in this book will evolve and be updated during the development process.

Learn more
    Advance your knowledge in tech with a Packt subscription

  • Instant online access to over 7,500+ books and videos
  • Constantly updated with 100+ new titles each month
  • Breadth and depth in over 1,000+ technologies
  1. 1Review of the core modules of NumPy and Pandas

About this book

Data preprocessing is the first step in data visualization, data analytics, and machine learning, where data is prepared for analytics functions to get the best possible insights. Around 90% of the time spent on data analytics, data visualization, and machine learning projects is dedicated to performing data preprocessing.

This book will equip you with optimum data preprocessing techniques from multiple perspectives. You'll learn different technical and analytical aspects of data preprocessing - data collection, data cleaning, data integration, data reduction, and data transformation - and get to grips with implementing them using the open-source Python programming environment. The book will provide a comprehensive articulation of data preprocessing, its whys and hows, and help you identify analytics opportunities where data analytics could lead to more effective decision making. It also demonstrates the role of data management systems and technologies for effective analytics and how to create queries to pull data from relational databases.

By the end of this Python data preprocessing book, you'll be able to use Python to read, manipulate, and analyze data, perform data cleaning, integration, reduction techniques, and handle outliers or missing values to implement the appropriate data transformation method.

Publication date:
January 2022


1Review of the core modules of NumPy and Pandas

NumPy and pandas modules are capable of meeting your needs for the majority of data analytics and data preprocessing tasks. Before we start reviewing these two valuable modules, I would like to let you know that this chapter is not meant to be a comprehensive teaching guide to these modules, but rather a collection of concepts, functions, and examples that will be invaluable, as we will cover data analytics and data preprocessing in proceeding chapters.

In this chapter, we will first review Jupyter notebooks and their capability as an excellent coding User Interface (UI). Next, we will review the most relevant data analytic resources of the NumPy and pandas Python modules.

The following topics will be covered in this chapter:

  • Overview of Jupyter Notebook
  • Are we analyzing data via computer programming?
  • Overview of the basic functions of NumPy
  • Overview of pandas

Technical requirements

The easiest way to get started with Python programming is by installing Anaconda Navigator. It is open source software that brings together many useful open source tools for developers. You can download Anaconda Navigator by following this link: https://www.anaconda.com/products/individual.

We will be using Jupyter Notebook throughout this book. Jupyter Notebook is one of the open source tools that Anaconda Navigator provides. Anaconda Navigator also installs a Python version on your computer. So, following Anaconda Navigator's easy installation, all you need to do is open Anaconda Navigator and then select Jupyter Notebook.

You will be able to find all of the codes and the dataset that is used in this book in a GitHub repository exclusively created for this book. To find the repository, click on the following link: https://github.com/PacktPublishing/Hands-On-Data-Preprocessing-in-Python. Each chapter in this book will have a folder that contains all of the code and datasets that were used in the chapter.


Overview of Jupyter Notebook

Jupyter Notebook is becoming increasingly popular as a successful User Interface (UI) for Python programing. As a UI, Jupyter Notebook provides an interactive environment where you can run your Python codes, see immediate outputs, and take notes.

Fernando Pérezthe, the architect of Jupyter Notebook, outlines the following reasons in terms of what he was looking for in an innovative programming UI:

  • Space for individual exploratory work
  • Space for collaboration
  • Space for learning and education

If you have used Jupyter Notebook already, you can attest that it delivers all these promises, and if you have not yet used it, I have good news for you. We will be using Jupyter Notebook in the entirety of this book. Most of the code that I will be sharing will be in the form of screenshots from the Jupyter Notebook UI.

The UI design of Jupyter Notebook is very simple. You can think of it as one column of material. These materials could be under code chunks or markdown chunks. The solution development and the actual coding happens under the code chunks, whereas notes for yourself or other developers are presented under markdown chunks. The following screenshot shows both an example of a markdown chunk and a code chunk. You can see that the code chunk has been executed and the requested print has taken place and the output is shown immediately after the code chunk:

Figure 1.1 – Code for printing Hello World in a Jupyter notebook

To create a new chunk, you can click on the + sign on the top ribbon of the UI. The newly added chunk will be a code chunk by default. You can switch the code chunk to a markdown chunk by using the drop-down list on the top ribbon. Moreover, you can move the chunks up or down by using the correct arrows on the ribbon. You can see these three buttons in the following screenshot:

Figure 1.2 – Jupyter Notebook control ribbon

You can see the following in the preceding screenshot:

  • The ribbon as shown in the screenshot also allows you to Cut, Copy, and Paste the chunks.
  • The Run button on the ribbon is to execute the code of a chunk.
  • The Stop button is to stop a running code. You normally use this button if your code has been running for a while with no output.
  • The Restart button wipes the slate clean; it removes all of the variables you have defined so you can start over.
  • Finally, the Restart & Run button restarts the kernel and runs all of the chunks of code in the Jupyter Notebook files.

There is more to Jupyter Notebook, such as useful short keys to speed up development and specific markdown syntax to format the text under markdown chunks. However, the introduction here is just enough for you to start meaningfully analyzing data using Python through the Jupyter Notebook UI.


Are we analyzing data via computer programming?

To benefit most from the two modules that we will cover in this chapter, we need to understand what they really are and what we are really doing when we use them. I am sure whoever is in the business of content development for data analytics using Python, including me (guilty as charged), would tell you that when you use these modules to manipulate your data, you are analyzing your data using computer programming. However, what you are actually doing is not computer programming. The computer programming part has already been done for the most part. In fact, this has been done by the top-notch programmers who put together these invaluable packages. What you do is use their code made available to you as programming objects and functions under these modules. Well, if I am being completely honest, you are doing a tad bit of computer programming, but just enough to access the good stuff (these modules). Thanks to these modules, you will not experience any difficulty in analyzing data using computer programming.

So, before embarking on your journey on this chapter and this book, remember this. For the most part, our job as data analysts is to connect three things: our business problem, our data, and technology. The technology could be commercial software such as Excel or Tableau, or, in the case of this book, these modules.


Overview of the basic functions of NumPy

In short, as the name suggests, NumPy is a Python module brimming with useful functions for dealing with numbers. The Num in the first part of the name NumPy stands for numbers, and Py stands for Python. There you have it. If you have numbers and you are in Python, you know what you need to import. That is correct; you need to import NumPy, simple as that. See the following screenshot:

Figure 1.3 – Code for importing the NumPy module

As you can see, we have given the alias np to the module after importing it. You can actually assign any alias that you wish and your code would function; however, I suggest sticking with np. I have two compelling reasons for doing so:

  • First, everyone else uses this alias, so if you share your code with others, they know what you are doing throughout your project.
  • Second, a lot of the time, you end up using code written by others in your projects, so consistency will make your job easier. You will see that most of the famous modules also have a famous alias, for example, pd for pandas, and plt for matplotlib.pyplot.

Good practice advice

NumPy can handle all types of mathematical and statistical calculations for a collection of numbers, such as mean, median, standard deviation (std), and variance (var). If you have something else in mind and are not sure whether NumPy has it, I suggest googling it before trying to write your own. If it involves numbers, chances are NumPy has it.

The following screenshot shows the mean, for example, applied to a collection of numbers:

Figure 1.4 – Example of using the np.mean() NumPy function and the .mean() NumPy array function

As shown in Figure 1.4, there are two ways to do this. The first one, portrayed on the top chunk, uses np.mean(). This function is one of the properties of the NumPy module and can be accessed directly. The great aspect of using this approach is that you do not need to change your data type most of the time before NumPy honors your request. You can input lists, pandas series, or DataFrames. You can see on the top chunk that np.mean() easily calculated the mean of lst_nums, which is of the list type. The second way, as shown in the bottom chunk, is to first use np.array() to transform the list into a NumPy array and then use the .mean() function, which is a property of any NumPy array. Before continuing to progress with this chapter, take a moment and use the Python type() function to see the different types of lst_numbs and ary_nums, as shown in the following screenshot:

Figure 1.5 – The application of the type() function

Next we will learn four NumPy functions: np.arramge(), np.zeros(), np.ones(), and np.linspace().

The np.arange() function

This function as shown in the following screenshot produces a sequence of numbers with equal increments. You can see in the figures that by changing the two inputs, you can get the function to output many different sequences of numbers that are required for your analytic purposes:

Figure 1.6 – Examples of using the np.arange() function

Pay attention to the three chunks of code in the preceding figure to see the default behavior of np.arange() when only one or two inputs are passed.

  • When only one input is passed, as in the first chunk of code, the default of np.arange() is that you want a sequence of numbers from zero to the inputted number with increments of one.
  • When two inputs are passed, as in the second chunk of code, the default of the function is that you want a sequence of numbers from the first input to the second input with increments of one.

The np.zeros() and np.ones() functions

np.ones() creates a NumPy array filled with ones, and np.zeros() does the same thing with zeros. Unlike np.arange(), which takes the input to calculate what needs to be included in the output array, np.zeros() and np.ones() take the input to structure the output array. For instance, the top chunk of the following screenshot specifies the request for an array with four rows and five columns filled with zeros. As you can see in the bottom chunk, if you only pass in one number, the output array will have only one dimension:

Figure 1.7 – Examples of np.zeros() and np.ones()

These two functions are excellent resources for creating a placeholder to keep the results of calculations in a loop. For instance, review the following example and observe how this function facilitated the coding.

Example – Using a placeholder to accommodate analytics

Given the grade data of 10 students, create a code using NumPy that calculates and reports their grade average.

The data of the 10 students and the solution to this example are provided in the following screenshots. Please review and try this code before progressing:

Figure 1.8 – Grade data for the example

Now that you’ve had a chance to engage with this example, allow me to highlight a few matters about the provided solution presented in Figure 1.9:

  • Notice how np.zeros() facilitated the solution by streamlining it significantly. After the code is done, all of the average grades are calculated and saved already. Compare the printed values before and after the for loop.
  • The enumerate() function in the for loop might sound strange to you. What that does is help the code to have both an index (i) and the item (name) from the collection (Names).
  • The .formate() function is an invaluable property of any string variable. If there are any symbols such as {} in the string, this function will replace them with what has been inputted sequentially.
  • # better-looking report is a comment in the second chunk of the code. Comments are not compiled and their only purpose is to communicate something with whoever reads the source code.
Figure 1.9 – Solution to the preceding example

The np.linspace() function

This function returns evenly spaced numbers over a specified interval. The function takes three inputs. The first two inputs specify the interval, and the third shows the number of elements that the output will have. For example, refer to the following screenshot:

Figure 1.10 – Solution to the preceding example

In the first code block, 19 numbers are evenly spaced between 0 and 1, altogether creating an array with 21 numbers. The second gives another example. After trying out the two examples in the code, try np.linspace(0,1,20) and after investigating the results, think about why I chose 21 over 20 in my example.

np.linspace() is a very handy function for situations where you need to try out different values to find the one that best fits your needs. The following example showcases a simple situation like that.

Example – np.linspace() to create solution candidates

We are interested in finding the value(s) that holds the following mathematical statement: .

Formula_B17397_01_001And imagine that we don’t know that the statement can be simplified easily to ascertain that either 2 or 3 will hold the statement.

so we would like to use NumPy to try out any whole numbers between -1000 and 1000 and find the answer.

The following screenshot shows a Python code that provides a solution to this problem:

Figure 1.11 – Solution to the preceding example

Please review and try this code before moving on.

Now that you’ve had a chance to engage with this example, allow me to highlight a few things.

  • Notice how smart use of np.linspace() leads to an array with all of the numbers that we were interested in trying out.
  • Uncomment #print(Candidates) and review all of the numbers that were tried out to establish the desired answers.

This concludes our review of the NumPy module. Next, we will review another very useful Python module, pandas.


Overview of pandas

In short, Pandas is our main module for working with data. The module is brimming with useful functions and tools, but let’s get down to the basics first. The greatest tool of pandas is its data structure, which is known as a DataFrame. In short, DataFrame is a two-dimensional data structure with a good interface and great codability.

DataFrame makes itself useful to you right off the bat. The moment you read a data source using pandas, the data is restructured and shown to you as a DataFrame. Let’s give it a try.

We will use the famous adult dataset (adult.csv) to practice and learn the different functionalities of pandas. Refer to the following screenshot, which shows the importing of pandas and then reading and showing the dataset. In this code, .head() requests that only the top five rows of data are outputted. The .tail() code could do the same for the bottom five rows of the data.

Figure 1.12 – Reading the adult.csv file using pd.read_csv() and showing its five first rows

The adult dataset has six continuous and eight categorical attributes. Due to print limitations, I have only been able to include some part of the data; however, if you pay attention to Figure 1.12, the output comes with a scroll bar at the bottom that you can scroll to see the rest of the attributes. Give this code a try and study its attributes. As you will see, all of the attributes in this dataset are self-explanatory, apart from fnlwgt. The title is short for the final weight and it is calculated by Census Bureau to represent the ratio of the population that each row represents.

Good practice advice

It is good practice to always get to know the dataset you are about to work on. This process always starts with making sure you understand each attribute, the way I just did now. If you have just received a dataset and you don’t know what each attribute is, ask. Trust me, you will look more like a pro than not.

There are other steps to get to know a dataset. I will mention them all here and you will learn how to do them by the end of this chapter.

Step one: Understand each attribute as I just explained.

Step two: Check the shape of the dataset. How many rows and columns does the dataset have? This one is easy. For instance, just try adult_df.shape and review the result.

Step three: Check whether the data has any missing values.

Step four: Calculate summarizing values for numerical attributes such as mean, median, and standard deviation, and compute all the possible values for categorical attributes.

Step five: Visualize the attributes. For numerical attributes, use a histogram or a boxplot, and for categorical ones, use a bar chart.

As you just saw, before you know it, you are enjoying the benefits of a pandas DataFrame. So it is important to better understand the structure of a DataFrame. Simply put, a DataFrame is a collection of series. Series is another pandas data structure that does not get as much credit, but is useful all the same, if not more.

To understand this better, try to call some of the columns of the adult dataset. Each column is a property of a DataFrame, so to access it, all you need to do is to use .ColumnName after the DataFrame. For instance, try running adult_df.age to see the column age. Try running all of the columns and study them, and if you come across errors for some of them, do not worry about it, we will address them soon if you continue reading. The following screenshot shows how you can confirm what was just described for the adult dataset:

Figure 1.13 – Checking the type of adult_df and adult_df.age

It gets more exciting. Not only is each attribute a series, but each row is also a series. To access each row of a DataFrame, you need to use .loc[] after the DataFrame. What comes between the brackets is the index of each row. Go back and study the output of df_adult.head() in Figure 1.12 and you will see that each row is represented by an index. The indices do not have to be numerical and we will see how indices of a pandas DataFrame can be adjusted, but when reading data using pd.read_csv() with default properties, numerical indices will be assigned. So give it a try and access some of the rows and study them. For instance, you can access the second row by running adult_df.loc[1]. After running a few of them, run type(adult_df.loc[1]) to confirm that each row is a series.

When accessed separately, each column or row of a DataFrame is a series. The only difference between a column series and a row series is that the index of a column series is the index of the DataFrame, and the index of a row series is the column names. Study the following screenshot, which compares the index of the first row of adult_df and the index of the first column of adult_df:

Figure 1.14 – Investigating the index for a column series and a row series

Now that we have been introduced to pandas data structures, next we will cover how we can access the values that are presented in them.

Pandas data access

One of the greatest advantages of both pandas series and DataFrames is the excellent access they afford us. Let’s start with DataFrames, and then we will move on to series as there are lots of commonalities between the two.

Pandas DataFrame access

As DataFrames are two-dimensional, this section first addresses how to access rows, and then columns. The end part of the section will address how to access each value.

DataFrame access rows

The only two keywords you will ever need to access the rows of a DataFrame are .loc[] and .iloc[]. To understand the difference between them, you need to know that each pandas series or DataFrame carries two types of indices: default indices or assigned indices. The default indices are the integer numbers that are automatically assigned to your dataset upon reading. However, pandas allows you to update them. The function that you can use to do so is .set_index(). For instance, we would like to make sure all of the indices in adult_df have five digits, so instead of indices between 0 and 32651 (run len(adult_df) to see that this is the number of rows adult_df has), we want indices to be from 10000 to 42651. The following screenshot uses np.arange() and .set_index() to do this. In this code, inplace=True indicates to the .set_index() function that you want the change to be applied to the DataFrame itself.

Why is it that when inplace=True is incorporated, there is no output, and when it is included, Jupyter Notebook shows the updated DataFrame?

The answer is that the .set_index() function, by default, outputs a new DataFrame that has the requested index unless inplace=True is specified, which requests the change to be applied to the original DataFrame.

Figure 1.15 – Updating the index of adult_df as described

Now, each row of the DataFrame can be accessed by specifying the index in between the brackets of .loc[]. For instance, running adult_df.loc[10001] will give you the second row. This is how you would always access the DataFrame using the assigned indices. If you started missing the default indices, as you often do when you go about preprocessing your data, pandas has you covered.

You can use .iloc[] to access the data using the default integer indices. For instance, running adult_df.iloc[1] will also return the second row. In other words, pandas will change the index to your liking, but behind the scenes, it will also keep its integer default index and also lets you use it if you so wish.

DataFrame access columns

As there are two ways to access each row, there are also two ways to access each column. The easier and better way to access your columns is to know that each column is coded to be a property of a DataFrame. So, you can access each column by using .ColumnName. For instance, run adult_df.age, adult_df.occupation, and so on to see how easy it is to access the columns using this way.

If you happened to run adult_df.education-number, you have already seen that this gives you an error. If you haven’t, go ahead and do so to study the error. Why does this error happen?

Figure 1.16 – Running adult_df.education-number and its error

If you study the error message, it is prompting that ‘num’ is not defined. That is true; we do not have anything named ‘num’. That is the key to use this error to answer my question.

Python deciphers dashes as subtraction operators unless presented inside a quotation. So it all comes down to this. Because of the way this variable is named, you cannot use the .ColumnName method to access the variable. You either need to change the name of the variable or use the second method to access the columns.

The second method passes the name as a string, or, in other words, inside a quotation. Try running adult_df[‘education-num’] and this time you will not get an error.

Good practice advice

If you are new to programming, one of the pieces of advice that I have for you is not to be intimidated by errors, and not only that, welcome errors with open arms because they are an excellent opportunity to learn. I just used an error to teach you something.

DataFrame access values

Imagine you want to access the education value for the third row of adult_df. There are so many ways you can go about this. You can start from the column and once you get a column series, access the value, or you can go from the row, and once you get a row series, access the value. Study the following screenshot; the three first chunks of code show different possibilities of doing that. However, my favorite way to access the values is to use .at[], shown in the last chunk.

Figure 1.17 – Four different methods of accessing the records of a pandas DataFrame

Accessing values with .at[] is my favorite for two reasons. First, it is much neater and more straightforward. Second, you can treat the DataFrame like a matrix as it is one, at least visually.

Pandas series access

Access to the values of series is very similar to that of DataFrames, just simpler. You can access the values of a series using all of the methods mentioned for DataFrame, except for .at[]. You can see all of the possibilities in the following screenshot. If you were to try the last line of the second chunk of code, Python would generate a syntax error as numbers cannot be the name of programming objects. To use this method, you have to make sure that the series indices are of the string type.

Figure 1.18 – Different methods of accessing the values of a pandas series


Slicing applies to both NumPy and pandas; however, since this is a book about data preprocessing, we will use it more often with a pandas DataFrame. Let’s begin by slicing NumPy arrays to understand slicing and then apply it to a pandas DataFrame.

Slice a NumPy array

We slice a NumPy array when we need access to more than one value of the data. For instance, consider the code in the following screenshot:

Figure 1.19 – Examples of slicing NumPy arrays

Here, my_array, which is a 4 x 4 matrix, has been sliced in different ways. The second chunk of code is not slicing; as you can see, only one value is accessed. What separates normal access from slicing access is the presence of a colon (:) in any of the index inputs. For instance, a colon in the third chunk of code means you are requesting all of the columns, and the output includes all of the columns, but since only the second row (index 1) is specified, the entirety of the second row is outputted. The fourth chunk of code is the opposite; one column is specified and the whole rows are requested, so the entirety of the second column is outputted.

You can also use a colon (:) to only specify access from a certain index to another one. For instance, in the second chunk of the following code, while all the columns are requested, only the second to fourth rows (1:3) are requested. The third chunk of code shows that both columns and rows can be sliced at the same time. Finally, the last chunk of code shows that you can pass a list of indices that you want to include in your slice.

Figure 1.20 – More complex examples of slicing

Slice a pandas DataFrame

Just like NumPy arrays, pandas DataFrames can also be sliced both on the columns and rows. However, the slicing function can only be done inside either .loc[] or .iloc[]. The access method, .at[], and the other ways of accessing data do not support slicing. For instance, the following code slices adult_df to show all of the rows, but only the columns from education to occupation. Running adult_df.iloc[:,3:6] will result in the same output.

Figure 1.21 – Example of slicing a pandas DataFrame

You want to become comfortable with slicing a pandas DataFrame. It is a very useful way to access your data. See the following example, which showcases one practical way in which you could use slicing.

Practical example of slicing

Run adult_df.sort_values('education-num'). You will see this code sort the DataFrame based on the education-num column. In Jupyter Notebook output, you only see the first five and the last five rows of this sorting. Slice the output of the rows from across the DataFrame instead of just from the beginning and the end.

The following screenshot shows how slicing the DataFrame can make this happen:

Figure 1.22 – Solution to the practical example of slicing a pandas DataFrame

Let’s go over this code step by step:

  • The first part, .sort_values('education-num'), as mentioned, sorts the DataFrame by education-num. I hope you have given this a try before reading on. Pay attention to the indices of the sorted adult_df. They look jumbled up, as they should. The reason is that the DataFrame is now sorted by another column.
  • If we want to have a new index that matches this new order, we can use .reset_index(), as it has been used in the preceding screenshot. Go ahead and give this a try as well. Run adult_df.sort_values('education-num').reset_index(). You will see that the old index is presented as a new column and that the new index looks as ordered as any newly read dataset.
  • Adding .iloc[1:32561:3617] achieves what this example is asking. This specific slice requests the first row and every 3617th row after that until the end of the DataFrame. The number 32561 is the number of rows in adult_df (run len(adult_df)), and 3617 is the quotient of the division of 32561 by 9. This division calculates the equal jumps that take us from row one to nearly the end of adult_df. Pay attention if the division of 32561 by 9 didn’t have a remainder; the code would take you all the way to the end of the DataFrame.

Good practice advice

Being able to slice the DataFrames this way is advantageous in the initial stages of getting to know a dataset. One of the disadvantages of data manipulations using programming instead of spreadsheet software such as Excel is that you cannot scroll through the data as you would in Excel. However, slicing the data this way can allow you to somehow mitigate this shortcoming.

Now that we have learned how to access and slice a dataset, we need to learn how to filter the data based on our needs. To do that, next we will learn Boolean masking, which is a powerful filtering technique.

Boolean mask for filtering a DataFrame

One of the simplest and yet most powerful tools of working with data is Boolean Masking. When you want to filter a DataFrame using a Boolean mask, you need a one-dimensional collection of Boolean values (True or False) that has as many Boolean values as the number of rows of DataFrames you want to filter.

The following screenshot shows an example of Boolean masking:

Figure 1.23 – Example of Boolean masking

The code portrays Boolean masking in three steps:

  1. The code first creates the pandas series twopowers_sr, which contains the values of 2 to the power of zero through 10 (20, 21, 22, …, 210).
  2. Then, a Boolean mask is set up. Pay attention as twopowers_sr has 11 numerical values, while BM also has 11 Boolean values. From now on in this book, every time you see BM, you can safely assume it stands for Boolean Mask.
  3. The last line of code filters the series using the mask.

The way a Boolean mask works is straightforward. If the counterpart of the numerical value from twopowers_sr in the Boolean mask (BM) is False, the mask blocks the number, and if it is True, the mask lets it through. Check whether that has been the case regarding the output of the preceding code. This is shown in the following figure:

Figure 1.24 – Depiction of Boolean masking

What is great about pandas is that you can use the DataFrame or series themselves to create useful Boolean masks. You can use any of the mathematical comparison operators to do this. For instance, the following screenshot first creates a Boolean mask that would only include True for numbers greater than or equal to 500. Then, the Boolean mask is applied to twopowers_sr to filter out the numbers in two ways.

Both of these ways are legitimate, correct, and they work. On the first one, you still give the Boolean mask a name. We use the name BM to do this as mentioned earlier. Then, we use BM to apply the Boolean mask. On the second one, you create and use the Boolean mask on the fly, as programmers say. That means you do everything in one line of code. I use the first one more often than not as I believe it makes the code more readable.

Figure 1.25 – Example of Boolean masking to filter data

You might be asking from the preceding code, so what if we can filter the data using Boolean masking? That is a legitimate question. Boolean masks come into their own when you use them on DataFrames for analytics. The following two examples will clarify this for you.

Analytic example 1 that uses Boolean masking

We are interested in calculating the mean and median age of people with preschool education in adult_df.

This can be easily done using Boolean masking. The following screenshot first creates BM using the series adult_df.education.

Figure 1.26 – Solution to the preceding example

Since the BM series has as many elements as the adult_df DataFrame (why?), BM can be applied to filter it. Once the DataFrame is filtered using adult_df[BM], it only contains rows that their education is ‘Preschool’. So now you can easily use np.mean() and np.median() to calculate the mean and median of age for these filtered rows.

Analytic example 2 that uses Boolean masking

We are interested in comparing the CapitalGain of individuals with f than 10 years’ education with individuals with more than 10 years’ education.

Figure 1.27 – Solution to the preceding example

Again, Boolean masks can help us immensely here. Two of them, BM1 and BM2, are first created based on what we are interested in calculating. Then, two calculations and reports show the mean of the capital gain for people with more than, and less than, 10 years of education.

Pandas functions for exploring a DataFrame

When you compare spreadsheet software such as Excel with coding, one of the stark disadvantages of coding is that you cannot create as tangible of a relationship with your data as you would with Excel. That is a fair comparison as Excel lets you scroll up and down on your data and so allows you to get to know it. While coding does not grant you this privilege, pandas has a handful of useful functions that helps you to familiarize yourself with the data.

Getting to know a dataset has two aspects. The first is to get to know the structure of the data, such as the number of rows, columns, and the name of columns. The second one is to get to know the values under each column. So we first cover getting to know the structure of the dataset and then we will focus on the values under each column.

Getting to know the structure of a dataset

You can use three useful properties of a pandas Dataframe to study the structure of a dataset. These are .shape, .columns, and .info(). In the following section, we will go over them one by one.

The .shape property

.shape is the property of any pandas DataFrame. It tells you how many rows and columns the DataFrame has. So, once you apply this to adult_df, as executed by the code in the following screenshot, you can see that the DataFrame has 32,561 rows and 15 columns:

Figure 1.28 – Example of using the .shape property of a DataFrame to get to know the dataset

The .columns property

.columns allows you to see and edit the column names in your DataFrame. In the following code, you can see that adult_df.columns resulted in the output of all the column names of adult_df. Of course, you could have scrolled to see all of the columns when you read the dataset; however, this is not possible when the data has more than 20 columns.

Figure 1.29 – Example of using the .columns property of a DataFrame to get to know the dataset

Furthermore, .columns can be used to update the columns’ names. This has been shown in the following screenshot. After running the following code, now you can safely use adult_df.education_num to access the relevant attribute. We just change the attributed name from ‘education-num’ to ‘education_num’ and now the attribute can be accessed using the .columnName method. Refer to Figure 1.16, which showed the code you’d get if you were to run adult_df.education-num.

Figure 1.30 – Example of updating the column titles of a DataFrame

The .info() function

This function provides information about both the shape and the columns of the DataFrame. If you run adult_df.Info(), you will see other information, such as the number of non-null values and also the type of data under each column that will be reported.

Getting to know the values of a dataset

The functions that pandas has to get to know the numerical columns are different than those of categorical columns. The difference between numerical and categorical columns is that categorical columns are not represented by numbers or, more accurately, do not carry numerical information.

To get to know numerical columns, the .describe(), .plot.hist(), and .plot.box() functions are very useful. On the other hand, the .unique() and .value_counts() functions are instrumental for categorical columns. We will cover these one by one.

The .describe() function

This function outputs many useful statistical metrics that are meant to summarize data for each column. These metrics include Count, Mean, Standard Deviation (std), Minimum (min), first quartile (25%), second quartile (50%) or median, third quartile (75%), and Maximum (max). The following screenshot shows the execution of the function for adult_df and its output:

Figure 1.31 – Example of using the .describe() function to get to know a dataset

The metrics that the .describe() function outputs are very valuable summarizing tools, especially if these metrics are meant to be used for algorithmic analytics. However, studying them all at once still overwhelms our human comprehension. To summarize data for human comprehension, there are more effective tools, such as visualizing the data using histograms and boxplots.

Histograms and boxplots to visualize numerical columns

Pandas makes drawing these visuals very easy. Each pandas series has a very useful collection of plot functions. For instance, the following screenshot shows how easy it is to draw the histogram for the age column. To create the boxplot for the age column, all you need to change is the last part of the code: adult_df.age.box(). Give it a try. Also, draw the boxplot and histogram for all of the other numerical attributes and see for yourself how easier it is to understand each column using visualization.

Figure 1.32 – Drawing the histogram of the adult_df.age column

Let’s move on to the functions that we will use for categorical attributes. We will start with .unique().

The .unique() function

If the column is categorical, our approach to get to know it would be completely different. First, we need to see what are all the possibilities for the column. The .unique() function does just that. It simply returns all the possible values of the columns. See the following screenshot, which is an example of all the possible values of the relationship column in adult_df:

Figure 1.33 – Example of using the .unique() function to get to know a dataset

Now that we have covered the .unique() function, we will cover the .value_counts() function next.

The .value_counts() function

The next step in getting to know a categorical column is realizing how often each possibility happens. The .value_counts() function does exactly that. The following screenshot shows the outcome of this function on the column’s relationship:

Figure 1.34 – Example of using the .value_counts() function to get to know a dataset

The output of the .value_counts() function is also known as the frequency table. There is also the relative frequency table, which shows the ratio of occurrences instead of the number of occurrences for each possibility. To get the relative frequency table, all you need to do is to specify that you want the table to be normalized: .value_counts(Normalize = True). Give it a try!

Barcharts for visualizing numerical columns

To draw the bar chart of a categorical attribute, even though you might be tempted to try out something like adult_df.relationship.plot.bar(), it won’t work. Give it a try and study the error.

To create the bar chart, you would have to first create the frequency table. As the frequency table is a pandas series itself, you can then draw the bar chart using that. The following screenshot shows how we can draw the bar chart for the relationship column using the functions .value_counts() and .plot.bar():

Figure 1.35 – Drawing the bar chart of the adult_df.relationship column

In this part, we learned how we can take advantage of pandas resources to get to know new datasets. Next, we will learn a pandas function that is a game-changer in analyzing and preprocessing data using programming.

Pandas applying a function

There are a lot of instances where we will want to do the same calculations for each row in a dataset. The traditional approach to going about such calculations is to loop through the data and, on every iteration of the loop, perform and save the calculations. Python and pandas have changed this paradigm by introducing the concept of applying a function. When you apply a function to a DataFrame, you request pandas to run it for every row.

You can apply a function to a series or a DataFrame. Since applying a function to a series is somewhat easier, we will learn that first and then we will move to apply a function to a DataFrame.

Applying a function to a series

Let’s say we want to multiply the series adult_df.age by 2. First, you need to write a function that assumes one input as a number, multiply the input by 2, and then output the result. The following screenshot shows this. First, the MutiplyBy2() function is defined, and then, using adult_df.age.apply(MutiplyBy2), is applied to the series.

Figure 1.36 – Example of using the .apply() function

Now, let’s see an analytic example where the .apply() function can be instrumental.

Applying a function – Analytic example 1

Not only does the series adult_df.fnlwgt not have an intuitive name, but also its values are not easily relatable. As mentioned earlier, the values are meant to be the ratio of the population that each row represents. As the numbers are neither percentages nor the actual number of people that each row represents, these values are neither intuitive nor relatable.

Now that we know how to do a calculation for each value in a series, let’s fix this with a simple calculation. How about we divide every value by the sum of all the values in the series?

The following screenshot shows the steps for going about this:

  1. First, total_fnlwgt, which is the sum of all the values in the fnlwgt column, is calculated.
  2. Second, the CalculatePercentage function is defined. This function outputs the inputted values divided by total_fnlwgt and multiplied by 100 (to develop a percentage).
  3. Third, the CalculatePercentage function is applied to the series adult_df.fnlwgt.

Now, pay attention! Instead of just seeing the results of the calculations, the following code has assigned the result to adult_df.fnlwgt itself, which substitutes the original values with the newly calculated percentages. The following code does not show the output of the code, but give it a try on your Jupyter notebook and study the output on your own:

total_fnlwgt = adult_df.fnlwgt.sum()
def CalculatePercentage(v):
    return v/total_fnlwgt*100
adult_df.fnlwgt = adult_df.fnlwgt.apply(                  CalculatePercentage)

Applying a Lambda function

A lambda function is a function that is expressed in one line. So, a lot of the time, applying a lambda function may make coding easier and perhaps help our codes become a bit more readable at times. For instance, if you wanted to answer the preceding calculations “on the fly,” you could simply apply a lambda function instead of an explicit function. See the following code and compare the simplicity and conciseness of using a lambda function instead of an explicit function:

total_fnlwgt = adult_df.fnlwgt.sum()
adult_df.fnlwgt = adult_df.fnlwgt.apply(lambda v: v/total_fnlwgt*100)

It is important to understand that the right choice between a lambda function or an explicit function depends on the situation. Sometimes, having to jam a perhaps complicated function into a line causes coding to become more difficult and renders the code less readable. This will be the case if the function has more than one conditional statement.

Applying a function to a DataFrame

The major difference between applying a function to a DataFrame and a series is when you are defining the function. While, for a series, we had to assume that one value would be inputted in the function, for a DataFrame, we have to assume that a row series will be inputted. So, when you are defining a function to apply to a DataFrame, you can engage any column that you need.

For instance, the following code has defined and applied a function that subtracts education_num from age for every column. Pay attention to three aspects:

  1. First, when defining the CalcLifeNoEd() function, the input row was assumed to be a row series of adult_df. In other words, the CalcLifeNoEd() function is tailored just for application to adult_df or any DataFrame that has age and eduction_num as columns.
  2. Second, the .apply() function comes right after the DataFrame itself instead of after any columns. Compare the code for applying a function to a DataFrame to that of a series. Compare the last two code snippets with the following code snippet.
  3. Third, the inclusion of axis=1 is necessary, and what this means is that you want to apply the function to every row and not every column. You could also apply a function to every column. That almost never happens for analytics, but if you ever needed to, you would have to change it to axis=0.

I have not included the output of this executed code. Give the code a try and study its output:

def CalcLifeNoEd(row):
    return row.age - row.education_num

This could have easily been done using the lambda function as well. The code that you will need to run is the following. Give it a try:

adult_df.apply(lambda r: r.age-r.education_num,axis=1)

Applying a function – Analytic example 2

Which one is more important in terms of your financial success: education or life experience?

To answer this question, we could use adult_df as a sample dataset and extract some insight from the population of people in 1966. The code in the following screenshot first creates two new columns in the data:

  • lifeNoEd: The number of years for which you have lived without formal education
  • capitalNet: The subtraction of capitalLoss from capitalGain

To answer this question, we can check which one of education_num or lifeNoEd has a higher correlation with capitalNet. Doing this is very easy using pandas, as each pandas DataFrame comes with a function, .corr(), which calculates the Pearson Correlation coefficient for all the combinations of the numerical attributes in the DataFrame. As we are only interested in the correlations between education_num, lifeNoEd, and CapitalNet, the last line of the code has removed other columns before running the .corr() function.

Figure 1.37 – Solution to the preceding example

From the output, you can see that while the correlation between lifeNoEd and capitalNet is 0.051490, the correlation between education_num and CapitalNet is higher, at 0.117891. So we have some evidence that education has a more effective role in financial success than just life experience.

Now that you’ve learned how to effectively apply a function for analytics purposes, we can move on to learn about another very powerful and useful function in pandas that is invaluable for data analytics and preprocessing.

Pandas Groupby function

This is one of the most useful analytics and preprocessing tools of pandas. As the name Groupby suggests, it groups your data by something. Normally, you would want to group your data by categorical attributes.

If you are familiar with SQL queries, pandas Groupby is almost identical to SQL Groupby. For both SQL queries and pandas, grouping your data by itself will not have any added value or any output, unless it is accompanied by an aggregate function.

For instance, if you want to count the number of rows per marital_status category, you can use the Groupby function. See and try the following code:


You can group the DataFrame by more than one column as needed. To do so, you will have to introduce the columns you are grouping the DataFrame by in the form of a list of column names. For instance, the following code groups the data based on both the marital_status and sex columns:


Pay attention that the two columns are introduced to the function as a list of string values.

The only aggregate function that works without having to specify a column of interest is .size(), as seen above. However, once you specify the column of interest that you want to aggregate its data, you could use any aggregate function that you can use on a pandas series or DataFrame. The following table shows a list of all the aggregate functions that you can use:

Figure 1.38 – List of pandas aggregate functions

For instance, the following shows the code to group adult_df by martial_status and sex, and calculates the median of each group:


As you study the code and its output, you can start appreciating the analytic value of the .groupby() function. Next, we will look at an example that will help you appreciate this value function even further.

Analytic example using Groupby

Were the race and gender of individuals in 1966 influential in their financial success?

Incidentally, adult_df was collected in 1966, so we can use it to provide some insight into this question. You may take different approaches in going about this. One approach, as depicted in the following screenshot, is to group the data by race and sex and then calculate the mean of capitalNet for the groups and study the differences.

Figure 1.39 – Solution to the preceding example

Another approach would be to group the data based on race, sex, and income and then calculate the mean of fnlwgt. Give this one a try and see whether you would come to a different conclusion.

Pandas multi-level indexing

Let’s first understand what is multi-level indexing. If you look at the output of grouping a DataFrame by more than one column, the indexing of the output looks different than normal. Although the output is a pandas series, it looks different. The reason for this dissimilarity is multi-level indexing. The following screenshot shows you the index of the .groupby() output for the previous screenshot. You can see that the index of the series has two levels, specifically, race and sex:

Figure 1.40 – An example of multi-level indexing

Now, let’s learn a few useful and relevant functions that can help us with data analytics and preprocessing. These functions are .stack() and .unstack().

The .unstack() function

This function pushes the outer level of the multi-level index to the columns. If the multi-level index only has two levels, after running .unstack(), it will become single-level. Likewise, if the .unstack() function is run for a series with a multi-level index, the output will be a DataFrame whose columns are the outer level index that was pushed. For instance, the following screenshot demonstrates the change in appearance and structure of the output when the .unstack() function is executed:

Figure 1.41 – Example of the .unstack() function

If there are more than two levels, executing .unstack() more than once will, one by one, push the outer level of the index to the columns. For instance, you can see in the following screenshot that the code in the first chunk results in grb_result, which is a series with a three-level index. The second chunk of code executes .unstack() once and the outer level of the index in grb_result, which is income is pushed to the columns. The third chunk of code, however, executes .unstack() twice, and the second outer level of the index in grb_result, which is sex, joins income in the columns.

Figure 1.42 – Another example of the .unstack() function with two levels of indexing

As an index can be multi-level in pandas, columns can also have multiple levels. For instance, in the first chunk of the following screenshot, you can see that the output DataFrame has two levels. The second chunk of code outputs the columns of the DataFrame. You can see that the columns have the two levels that were pushed from the index using .unstack():

Figure 1.43 – An example of multi-level columns

The .stack() function

The opposite of .unstack() is .stack(), where the outer level of the columns is pushed to be added as the outer level of the index. For example, in the following screenshot, you can see that mlt_df, which we saw has two-level columns, has undergone .stack() twice. The first .stack() function pushed the income level to the index, and the second .stack()function pushed the sex level to the index. This made the data be presented as a series as there is only one column of data.

Figure 1.44 – Example of the .stack() function

Multi-level access

The value access in series or DataFrames with multi-level indexes, or DataFrames with multi-level columns, is slightly different. Exercise 2 at the end of this chapter is designed to help you learn that.

In this subchapter, we gathered sizable exposure to multi-level indexing and columns. Now we are moving on to another set of functions that are somewhat similar to the .stack() and .unstack() functions, but different at the same time. These functions are .pivot() and .melt().

Pandas pivot and melt functions

In a nutshell, .pivot() and .melt() help you to switch between two forms of 2-dimensional data structures: wide form and long form. The following figure depicts the difference between the two forms. The wide form is what you are typically used to if you are a spreadsheet user. The wide form uses many columns to introduce new dimensions in the dataset. The long form, however, uses a different logic of data structure and uses one index column to include all the relevant dimensions. The .melt() function, as you may picture it in your mind based on the meaning of the word melt, can easily reshape a dataset from the long form to the wide form. The .pivot() function can do the opposite.

To practice and learn these two functions, we will read wide.csv using pandas into wide_df, and read long.csv using pandas into long_df.

Figure 1.45 – Comparison of the long and wide forms

To switch between the long and the wide format, all you need to do is to provide the right input to these functions. The following screenshot shows the application of .melt() on wide_df, reshaping it into a long format. In the second chunk of code, you can see that .melt() requires four inputs.

  • id_vars: This input takes the identifying columns.
  • value_vars: This input takes the columns that hold the values.
  • var_name: This input takes the name you would like to give to the identifying column that will be added to the long format.
  • value_name: This input takes the name you would like to give to the new value column that will be added to the long format.

The following screenshot shows an example of using the .melt() function to switch the data from wide format to long format:

Figure 1.46 – Example of using the .melt() function to switch the data from wide format to long format

The .pivot() function reshapes a DataFrame from the long form to the wide form. For instance, the following screenshot shows the application of the function on long_df. Unlike, .melt(), which requires four inputs, .pivot() needs three:

  • index: This input takes what will be the index of the wide form.
  • columns: This input takes the columns of the long form that will be expanded to create the columns for the wide form.
  • values: This input takes the column in which the long form keeps the values.
Figure 1.47 – Example of using the .pivot() function to switch the data from the long format to the wide format


Congratulations on your excellent progress so far! In this chapter, you first learned about Jupyter Notebook, which is the UI we will be using throughout this book. Then, you learned the most important functions of the two Python core modules for data analytics and data preprocessing. In the next chapter, you will learn the functions of another core module: Matplotlib. This module will be our core module for visualization needs.

Before moving on to the next chapter, I highly encourage you to spend some time and meaningfully engage with the following exercises.



  1. Use the adult.csv dataset and run the codes shown in the following screenshots. Then, answer the questions that follow:

    Figure 1.48 – Exercise 1

    Use the output to answer what is the difference in behavior of .loc and .iloc when it comes to slicing?Without running, but just by looking at the data, what will be the output of adult_df.loc[‘10000’:’10003’, ’relationship’:’sex’]?Without running, but just by looking at the data, what will be the output of adult_df.iloc[0:3, 7:9]?

  2. Use pandas to read adult.csv into adult_df and then use the .groupby() function to run the following code and create the multi-index series mlt_sr:

    import pandas as pd
    adult_df = pd.read_csv('adult.csv')
    mlt_seris =adult_df.groupby(['race','sex','income']).fnlwgt.mean()

    Now that you have created a multi-index series, run the following codes, study the outputs, and answer the following questions:Run the following code first and then answer the question. When we use .iloc[] for a multi-index series or DataFrame, what should we expect?


    Run the following code first and then answer the question. When we use .loc[] to access the data of one of the innermost index levels of the multi-index series, what should we expect?


    Run the following code first and then answer the question. When we use .loc[] to access the data of one of the non-innermost index levels of a multi-index series, what should we expect?When you run either line of the following code, you will get an error, and that is the point of this question. Study the error and try to answer the question:


    Run the following code first and then answer the question. How does the use of .loc[] or .iloc[] differ when working with a multi-index series or a DataFrame?

  3. For this exercise, you need to use a new dataset: billboard.csv. Visit https://www.billboard.com/charts/hot-100 and see the latest song rankings of the day. This dataset presents information and rankings for 317 song tracks in 80 columns. The first four columns are artist, track, time, and date_e. The first columns are intuitive descriptions of song tracks. The date_e column shows the date that the songs entered the hot 100 list. The rest of the 76 columns are song rankings at the end of each week from 'w1' to 'w76'. Download and read this dataset using pandas and answer the following questions:Write one line of code that gives you a great idea of how many null values each column has. If any columns have no non-null values, drop them. With a for loop, draw and study the values in each of the remaining W columns.The dataset is in wide format. Use an appropriate function to switch to a long format and name the transformed DataFrame mlt_df.Write a code that shows mlt_df every 1,200 rows.Run the following code first and answer the question. Could this also have been done by using Boolean masking?

    mlt_df.query('artist == "Spears, Britney"')

    Use either the approach in e or the Boolean mask to extract all the unique songs that Britney Spears has in this dataset.In mlt_df, show all of the weeks when the song “Oops!.. I Did It Again” was in the top 100.

  4. We will use LaqnData.csv for this exercise. Each row of this dataset shows an hourly measurement recording of one of the following five air pollutants: NO, NO2, NOX, PM10, and PM2.5. The data was collected in a location in London for the entirety of the year 2017. Read the data using pandas and perform the following tasks:The dataset has six columns. Three of them, named ‘Site’, ‘Units’, and ‘Provisional or Ratified’ are not adding any informational values as they are the same across the whole dataset. Use the following code to drop them:

    air_df.drop(columns=['Site','Units','Provisional or Ratified'], inplace=True)

    The dataset is in a long format. Apply the appropriate function to switch it to the wide format. Name the transformed Dataframe pvt_df.Draw and study the histogram and boxplots for columns of pvt_df.

  5. We will continue working with LaqnData.csv:Run the following code, see its output, and then study the code to answer what each line of this code does:

    air_df = pd.read_csv('LaqnData.csv')
    air_df.drop(columns=['Site','Units','Provisional or Ratified'], inplace=True)
    datetime_df = air_df.ReadingDateTime.str.split(' ',expand=True)
    datetime_df.columns = ['Date','Time']
    date_df = datetime_df.Date.str.split('/',expand=True)
    date_df.columns = ['Day','Month','Year']
    air_df = air_df.join(date_df).join(datetime_df.Time).drop(columns=['ReadingDateTime','Year'])

    Run the following code, see its output, and then study the code to answer what this line of code does:

    air_df = air_df.set_index(['Month','Day','Time','Species'])

    Run the following code, see its output, and then study the code to answer what this line of code does:


    Compare the output of the preceding code with pvt_df from Exercise 4. Are they the same?Explain what the differences and similarities are between the pair .melt()/.pivot() and the pair .stack()/.unstack()?If you were to choose one counterpart for .melt() between .stack()/.unstack(), which one would you choose?

About the Author

  • Roy Jafari

    Roy Jafari, Ph.D. is an Assistant Professor of Business Analytics at the University of Redlands.

    Roy has taught and developed advanced college-level courses that cover data cleaning, decision making, data science, data mining, machine learning, and optimization.

    Roy’s style of teaching is hands-on and he believes the best way to learn is to learn by doing. He likes to use the active learning teaching philosophy and Readers will get to experience active learning in this book.

    Roy, who is known as Ruholla Jafari Marandi in the research community, has authored scientific research publications. In his research, Roy has contributed new data mining and decision-making algorithms. He has also shown various applications of data mining and machine learning and provided steps for more fair and equitable applications of machine learning.

    Roy believes successful data preprocessing happens only when one is equipped with the most efficient tools, has the appropriate understanding of data analytic goals, and knows the steps of data preprocessing, and can compare a variety of methods. This belief has shaped the structure his upcoming book Hands-on Data Preprocessing in Python.

    Browse publications by this author
Hands-On Data Preprocessing in Python
Unlock this book and the full library for $5 a month*
Start now