You're reading from  Data Wrangling with R

Product type: Book
Published in: Feb 2023
Publisher: Packt
ISBN-13: 9781803235400
Edition: 1st Edition
Author: Gustavo R Santos

Gustavo R Santos has worked in the technology industry for 13 years, improving processes, analyzing datasets, and creating dashboards. Since 2020, he has been working as a data scientist in the retail industry, wrangling, analyzing, visualizing, and modeling data with the most modern tools, such as R, Python, and Databricks. Gustavo also gives lectures from time to time at an online school about data science concepts. He has a background in marketing, is certified as a data scientist by the Data Science Academy Brazil, and is pursuing his specialist MBA in Data Science at the University of São Paulo.

Transformations with Tidyverse Libraries

Our journey through data wrangling is now at its core. We have just finished studying the major transformations from the perspective of base R's built-in functions, with support from the data.table library.

We saw how easy it was to reach a solution for some of those transformations without needing to load extra libraries. However, as the problems get more complicated, the basic functions will not be able to provide a sufficiently clean and fast solution. The code will get busier and will probably underperform as the size of the dataset increases. For complex cases, there are several libraries built for the R language that can help us get through most problems with better performance and cleaner code. If you are interested in performance comparisons between base R, data.table, and tidyverse, refer to this page (https://tinyurl.com/2udfcvx2), where the author compares the most common tasks using the three libraries.

Speaking of...

Technical requirements

Dataset: We will use the Census Income dataset (https://archive.ics.uci.edu/ml/datasets/Adult) for this chapter.

All the code can be found in the book’s GitHub repository: https://github.com/PacktPublishing/Data-Wrangling-with-R/tree/main/Part2/Chapter8. The package to be used can be installed and loaded using the following single commands:

install.packages('tidyverse')
library(tidyverse)

Figure 8.1 shows the message displayed once the package is loaded. Be aware that some red-colored warning messages are displayed when you load packages in R, but they are not errors. They simply tell you which version of the R language the packages' current versions were built under.

Figure 8.1 – The tidyverse package will load eight libraries. The warning messages are not errors

In the next section, let’s remind ourselves about tidy data and why it is important for data wrangling with tidyverse.

What is tidy data?

To tidy something means to arrange it, to put it in order. Consequently, tidy data means that our data has a specific order and follows a set of rules that make it ready to be worked on.

A dataset can be arranged in different ways. For those who, like me, worked for many years with Microsoft Excel, a tidy dataset may seem odd at first sight, as there will be plenty of repeated cells. Many datasets I worked with in MS Excel had the same measurement split among many columns. A classic example is a monthly report whose first columns hold the descriptive part of the data (for example, product, profit, and loss), while the values referring to them are spread over one column per month.

Figure 8.2 – Example of dataset not in Tidy format

The table from Figure 8.2 is comfortable to look at but not useful for an algorithm or a programming language. If you try to determine the best month for sales, it will require...
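To make the idea concrete, here is a sketch in R of such a wide monthly report; the column names and values are made up for illustration, mimicking the layout Figure 8.2 describes:

# A hypothetical wide-format table: descriptive columns first,
# then one value column per month (not tidy)
sales_wide <- data.frame(
  product = c("A", "B"),
  Jan = c(100, 50),
  Feb = c(120, 65),
  Mar = c(90, 70)
)

In a tidy version of this table, the months would live in a single month column and the values in a single sales column, one observation per row.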

Slicing and filtering

Slicing and filtering a dataset are two similar ways to zoom in on a desired part of the data. In tidyverse, the dplyr library handles the most common data wrangling tasks. Slicing and filtering are among those tasks, as is the select() function, as we will see.

Slicing

As discussed in Chapter 7, slicing cuts out unwanted parts of the dataset, returning just a subset of the rows and/or columns. There is more than one way to slice a dataset, and we will learn the most interesting functions to do that, starting with the most basic one, as follows. In the code, when we use ., it means we are taking everything from the object that precedes it, which is df, and following it with a slicing notation. Rows 1 to 5 and columns 1 to 4:

# Slicing rows 1 to 5, columns 1 to 4.
df %>% .[1:5, c(1:4)]

The result is shown in Figure 8.5.

Figure 8.5 – Sliced dataset

To provide summary notes about the previous code: the dataset object...
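Filtering, by contrast, selects rows by a logical condition rather than by position. A minimal dplyr sketch, assuming the Census Income data is loaded as df (it has a numeric age column):

# Keep only rows where age is greater than 30,
# then show the first four columns
df %>%
  filter(age > 30) %>%
  select(1:4)

The filter() call keeps every row satisfying the condition, while select() narrows the columns, here by position, just like the slicing notation above.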

Grouping and summarizing data

Grouping and summarizing are two complementary functions. Generally, they will be used together, as there is not much use in grouping a dataset and not calculating anything or using the groups for a purpose. That is when summarizing plays the important role of transforming the data from each group into a summary or a number that we can understand.

In the business world, requests such as the average number of sales by store, the median number of customers by day, the standard deviation of a distribution, and many other examples are part of the routine of a data scientist. These tasks can be performed using the group_by() and summarise() functions from dplyr.

Starting with the group_by() function, observe that it alone cannot bring much value:

# group by not summarized
df %>% group_by(workclass)

Here is the result.

Figure 8.9 – Dataset grouped but not summarized

We can see in Figure 8.9 that it worked because...
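The grouping becomes useful once summarise() is added. Here is a sketch, assuming df holds the Census Income data with its numeric age column:

# Mean age for each workclass group
df %>%
  group_by(workclass) %>%
  summarise(mean_age = mean(age, na.rm = TRUE))

Each group defined by group_by() collapses to a single row, with the summary statistic computed within that group.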

Replacing and filling data

A dataset can and certainly will be acquired with imperfections. An example of an imperfection is the use of the ? sign instead of the default NA for missing values in the Census Income dataset. This problem requires the question mark to be replaced with NA first; the missing values can then be filled with another value, such as the mean or the most frequent observation, or with more complex methods, even machine learning.

This case clearly illustrates the necessity of replacing and filling data points from a dataset. Using tidyr, there are specific functions to replace and fill in missing data.

First, the ? sign needs to be replaced with NA before we can think of filling the missing values. As seen in Chapter 7, there are missing values only in the workclass (1836), occupation (1843), and native_country (583) columns. To confirm that, the quickest resource is a loop through the variables searching for ?:

# Loop through variables looking for cells == "...
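The idea behind that loop, and the replacement step that follows it, could be sketched like this (a minimal version, assuming df holds the Census Income data; note that in the raw file the marker may carry a leading space, as " ?", depending on how the file was read):

# Count cells equal to "?" in each column
sapply(df, function(col) sum(col == "?"))

# Replace "?" with NA in every character column
df_na <- df %>%
  mutate(across(where(is.character), ~ na_if(., "?")))

Restricting across() to character columns avoids type errors, since na_if() expects the value and the replacement target to be of compatible types.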

Arranging data

Arranging data is useful to create a rank, putting the dataset in order. The order can be from low to high values, also known as increasing order, or from high to low, known as decreasing order. In RStudio, visualizing a dataset in the software's viewer pane already allows the analyst to arrange the data with the click of a button: just like in many dynamic tables, if you click on a column name, the data becomes ordered by that variable. For simply eyeballing it, the feature is terrific, but for programming purposes, it has no effect. You will have to take advantage of the arrange() function from dplyr.

The most basic ways to arrange a dataset are by running the succeeding pieces of code. First, let's try arranging by increasing order:

# Arrange data in increasing order
df_no_na %>% arrange(native_country)

Next, arranging in decreasing order:

# Arrange data in decreasing order
df_no_na %>% arrange( desc(native_country) )

Notice that adding...
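arrange() also accepts multiple columns, ordering by the first and breaking ties with the next, and desc() can wrap any of them. A quick sketch using columns from the Census Income data:

# Order by country, then by decreasing age within each country
df_no_na %>%
  arrange(native_country, desc(age))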

Creating new variables

Creating new variables can be useful for data scientists when they need to analyze something that is not present in the data as it was acquired. Common tasks to create new data are splitting a column, creating a calculation, encoding text, and applying a custom function over a variable.

We went over some good examples of column splitting in this book, such as a datetime split. Now, to illustrate the separate() function from tidyr, the example is based on the Census Income dataset. Look at the target column: it has values such as <=50k and >50k. Let's say we want to separate only the > or <= signs and put them in a separate column; here is how to do that:

# Split variable target into sign and amount
df_no_na %>% separate(target, into=c("sign", "amt"), sep="\\b")

We took the dataset clean of NAs and separated the target column into two new variables: sign and amt. To accomplish that, the...

Joining datasets

Datasets can come from different sources or from different tables within the same database or data lake. Many times, those tables are related to each other by key columns, which means you will be able to find a column A in table 1 and a column A in table 2 that hold the same kind of information, so the tables can be related to each other through that common key element.

To better explain the join concept, imagine we are engineers at a retail company. Our goal is to store data about transactions from each store, including date, product, descriptions, quantity, and amount. Well, we could put everything in the same table, resulting in a big, heavy file that the database would have to deal with every time we wanted to query some information. Think about that for a moment: we won't need to pull the product description or the store address every single time. Consequently, the optimal solution to that problem is splitting that information into smaller tables...
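A minimal sketch of the idea with dplyr; the tables and columns below are made up for illustration:

# Two small related tables sharing the key column product_id
transactions <- data.frame(
  product_id = c(1, 2, 1, 3),
  quantity   = c(5, 2, 1, 7)
)
products <- data.frame(
  product_id  = c(1, 2, 3),
  description = c("Shirt", "Mug", "Lamp")
)

# Bring each product's description into the transactions table
transactions %>%
  left_join(products, by = "product_id")

The left join keeps every transaction row and attaches the matching description from the smaller lookup table, which only needs to store each description once.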

Reshaping a table

Data frames are composed of rows and columns, and we already know that data considered tidy has one observation per row and one variable per column. However, data is everywhere, being generated by different kinds of sensors, machines, devices, and people. Naturally, not all of that data will come to you in a perfectly beautiful, tidy format. More often than not, you will face messy datasets.

One of the things that can happen is receiving datasets in a wide format. As the name suggests, this data comes with values of the same variable spread across several columns rather than stacked in a single one, meaning that measurements of one variable can appear in different columns. A common example of a wide-formatted dataset is a monthly report with a column for the description of a project, followed by one column with the expense amount for each month, as shown in Figure 8.29.

Figure 8.29 – Projects dataset in wide format

A data frame in a wide shape...
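Reshaping such a wide table into a tidy, long shape can be sketched with tidyr's pivot_longer(); the projects data below is made up to mimic the layout described for Figure 8.29:

# Hypothetical wide-format projects report
projects <- data.frame(
  project = c("Alpha", "Beta"),
  Jan = c(1000, 2000),
  Feb = c(1500, 1800)
)

# Pivot the month columns into two tidy columns: month and expense
projects %>%
  pivot_longer(cols = Jan:Feb,
               names_to = "month",
               values_to = "expense")

After pivoting, each row holds one observation: a project, a month, and the expense for that combination.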

Do more with tidyverse

Closing this chapter, we will quickly study a few functions from tidyverse that were not mentioned in any of the previous sections but that can be very helpful when solving data wrangling problems.

Consider again the mtcars dataset (load it with data('mtcars')), which has information about 32 cars from the 1974 Motor Trend US magazine. We are already familiar with that data, and we can use it as a reference to learn about the next few transformations.

Let's dive right in with a couple of functions from the purrr library. This library brings functions similar to those of the apply family, studied in Chapter 5. The most interesting one to look at is the map() function. It applies the same function to every element of a vector or list. If we want to map the mean over the horsepower (hp) and weight (wt) variables, this is how to do it:

# Map
mtcars %>%
  select(hp, wt) %>%
  map(mean)

$hp
[1] 146.6875

$wt
[1] 3.21725...
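Note that map() always returns a list. When a plain numeric vector is preferred, purrr's map_dbl() variant can be used with the same pipeline:

# Same computation, returned as a named numeric vector
mtcars %>%
  select(hp, wt) %>%
  map_dbl(mean)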

Summary

We have been through a lot in this chapter. After all, the core of data wrangling is about the transformations that we covered here. Most of the work of analysts, data scientists, and developers involves filtering, reshaping, joining, and doing all kinds of data wrangling to get the data into the shape needed for the project.

We started by studying slicing and filtering, which allow us to zoom in on parts of a dataset, and revisited the difference between the two. Then, we moved on to grouping and summarizing data, which turns out to be an important task, possibly one of the most used operations when munging data.

Replacing and filling were the next two subjects. That section covered how to replace values and how to fill in missing data in our dataset. Once the data was cleaned of NAs, the subsequent step was ordering the data, making it easier to read and understand the information.

It is worth repeating that datasets have more information than what it may initially...

Exercises

  1. What function is used to filter data?
  2. What happens when group_by() is used without the summarizing function?
  3. Write the code to return the capital_gain mean grouped by native_country.
  4. What function is used to drop all NAs from a dataset?
  5. Write the code to replace NAs with another value.
  6. Write the code to return the average net gain (capital_gain – capital_loss) by education level, arranged in decreasing order.
  7. What is the main function to be used to create new variables?
  8. List all the types of joins.
  9. What is the difference between left join and inner join?
  10. What are the two functions used to reshape data?

Further reading

https://tinyurl.com/yexdrev5

https://en.wikipedia.org/wiki/Join_(SQL)

