Transformations with Base R

In the previous three chapters, our intent was to lay the foundations of the main data types you will find when working on a real-life project. Once a dataset is opened, you are likely to find strings, numbers, and dates and times as variables. Knowing what they are, how they can be created, and some popular functions to manipulate them will keep us moving during exploratory data analysis.

The next two chapters focus on data transformations, which I consider the core of data wrangling: most of what needs to be done during the wrangling phase of a project is related to transforming the data. Another good chunk of the work goes into data visualization, and the final piece is modeling and evaluating the results.

In this chapter, we will study the most common transformations that can be done in a dataset:

  • Slicing and filtering: These tasks allow us to focus on a specific part of...

Technical requirements

We will use the Census Income dataset (https://archive.ics.uci.edu/ml/datasets/Adult) for this chapter.

All the code can be found in the book’s GitHub repository: https://github.com/PacktPublishing/Data-Wrangling-with-R/tree/main/Part2/Chapter7.

Before moving forward, make sure to run the following installation requirements if you want to code along with the book’s examples:

# Install packages
install.packages('data.table')
install.packages('stringr')
# Load libraries
library(data.table)
library(stringr)

The dataset

The dataset to be used in the next exercises can be found in the UCI Machine Learning Repository. It was pulled from the popular datasets tab in the repository; it is named Adult, but it is also known as the Census Income dataset (https://archive.ics.uci.edu/ml/datasets/Adult).

The variables we will be dealing with are listed next; a short loading sketch follows the list. I also invite you to read the adult.names file provided with the dataset in the UCI repository or on the GitHub page for this chapter's code (https://tinyurl.com/ywpjj329):

  • Demographics: age, sex, race, marital-status, relationship status, native country.
  • Education: education level and education-num (years of study).
  • Work related: work class, occupation, hours per week.
  • Financial: capital gain and capital loss.
  • fnlwgt: This means final weight, which is a scoring calculation from the Census Bureau based on socio-economic and demographic data. People with similar demographic information should have similar weights...
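
If you want to code along, here is one way to load the data directly from the UCI repository with fread() from data.table. The raw-file URL and the snake_case column names (for example, native_country) are my assumptions, chosen to match the variable names used later in this chapter; adult.names has the authoritative list:

# Load the Census Income (Adult) data from the UCI repository
# Assumed raw-file URL; column names adapted from adult.names
url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
df <- fread(url,
            col.names = c('age', 'workclass', 'fnlwgt', 'education',
                          'education_num', 'marital_status', 'occupation',
                          'relationship', 'race', 'sex', 'capital_gain',
                          'capital_loss', 'hours_per_week', 'native_country',
                          'income'))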

Slicing and filtering

When you have a table as large as the dataset we are working with, it is very hard to look at all the observations one by one. Look how many rows and columns this dataset has:

# Dataset dimensions
dim(df)
[1] 32561    15

The dim() function shows the number of rows first, then the number of columns, or variables. It is easy to see that looking at 32,561 observations one by one would take too much time, not to mention be unproductive. Therefore, the tasks of slicing and filtering play a major role, acting like a magnifying glass that lets us zoom in on specific parts of the data.

These tasks can sound like they’re the same, but there is a slight difference between them.

Slicing

Slicing means cutting and displaying a slice, a piece, of the dataset. A good application of this task is when we need to look at the errors of a model. In this case, it is possible to take only the observations where...
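
To make the idea concrete, here is a minimal sketch of slicing with bracket notation, assuming the df object loaded earlier; the particular rows and columns are arbitrary choices for illustration:

# Slice the first five rows and two columns by position
df[1:5, c(1, 2)]
# The same slice, selecting the columns by name
df[1:5, c('age', 'workclass')]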

Grouping and summarizing

The same logic used to present the slicing and filtering concepts can be applied here too: we will never go row by row, analyzing one observation at a time.

We need a better way to look at the data, one that makes it smaller and easier to understand. To do that, we can aggregate data, creating groups of observations and putting each one of them in a separate and labeled box. This is grouping.

After that, we have groups, but n labeled boxes whose contents we do not know are still not very useful. Summarization does that job: it takes the observations in each box and wraps them up in a single number, which could be the mean, the median, or the total. Summarization is, therefore, the reduction of many observations to one number.

Given these definitions, it is reasonable to say that summarization is complementary to the grouping function since we first aggregate the data in groups and then...
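
As a minimal sketch of this group-then-summarize pattern in base R, assuming the df object loaded earlier, we can use table() for counts and aggregate() for a statistic per group:

# Count observations per group (one labeled box per workclass)
table(df$workclass)
# Summarize each box with a single number: mean hours per week by sex
aggregate(hours_per_week ~ sex, data = df, FUN = mean)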

Replacing and filling

Replacing values is straightforward: you have a value that does not fit the data, and you need it replaced with another value. There is a good example in the dataset we are using in this chapter. The documentation states that the author converted unknown values to "?", meaning that you will not find any standard NA values in this dataset. Therefore, it is our job as data scientists to wrangle this and replace all the ? values with NA.

Note

It’s worth making a note of this, as a lesson learned from this exercise: always look at the data documentation, if and when it is available. Many explanations about the way the data was collected and the meaning of each variable are contained in these documents.

Replacing the values is possible using slicing notation or the gsub() function. In the dataset, there are three variables with ? values: workclass, occupation, and native_country.

We will replace...
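
A minimal sketch of the slicing-notation approach, assuming the df object loaded earlier and that the unknown marker appears exactly as '?' (fread trims the leading whitespace the raw file puts before it):

# Replace every '?' with NA in the three affected variables
df$workclass[df$workclass == '?'] <- NA
df$occupation[df$occupation == '?'] <- NA
df$native_country[df$native_country == '?'] <- NA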

Arranging

Data can be arranged in two common forms: low to high or high to low, also known as ascending and descending order. Arranging data is useful for ranking observations or groups in an order that makes it easier for us to understand. When I look at the top 5 most sold items, I know that they are what brings traffic to a store. Then, imagine that the third best-selling item in terms of count is, in fact, the product that makes the most revenue. That could change our strategy, couldn’t it?

When looking at the other side of the rank, the tail, the bottom 5 items in terms of the number of items sold could be potential candidates to remove from the shelves, as they probably won’t bring much revenue to the business.

This simple example explains why arranging data is important when exploring data. But arranging is also an important part of data wrangling when visualizing data because it quickly pulls our eyes to the maximum point, from where we can read the rest...
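
As a minimal sketch of arranging with base R's order() function, assuming the df object loaded earlier:

# Ascending (low to high) by age
head(df[order(df$age), ])
# Descending (high to low): set decreasing = TRUE
head(df[order(df$age, decreasing = TRUE), ])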

Creating new variables

A dataset is not only the data you see; there is a lot of information latent in it. For example, remember that in Chapter 6, when we worked with datetime objects during our data exploration exercise, we took the TIME variable and extracted the year, month, day, and hour from it. That is one of the many ways to create new variables.

Here are some examples of new variables created out of our working dataset; a short code sketch follows the list:

  • Arithmetical operators: Adding two or more variables to create a total variable.
  • Text extraction: Extracting a meaningful part of a text, for instance, 1234 from ORDER-1234.
  • Custom calculations: Calculating a discount rate based on a business rule.
  • Binarization: Transforming a variable from on and off to 1 and 0. Binary means two options and is commonly associated with 0 and 1 in computer language.
  • Encoding: Transforming a qualitative ordinal variable, such as basic, intermediate, and advanced to 1, 2, and 3.
  • One Hot Encoding: A very common...
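
Here is a short sketch of two of these transformations, binarization and text extraction; the over_40h variable name is a hypothetical example of my own:

# Binarization: 1 if the person works more than 40 hours per week, else 0
df$over_40h <- ifelse(df$hours_per_week > 40, 1, 0)
# Text extraction with stringr: pull '1234' out of 'ORDER-1234'
str_extract('ORDER-1234', '[0-9]+')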

Binding

Binding data is the last of the main transformations listed at the beginning of this chapter. It is common to find yourself with two or more datasets that you need to put together for analysis. There are a couple of ways to do that, as follows:

Figure 7.16 – Types of data binding

Assume that our Census Income dataset has only 10 rows. After some research, the internal team found another 10 observations and gave them to the data science team. The ten new observations have to be appended to the original dataset since they have the same variables. Let’s see that in action:

# Creating datasets A and B
A <- df[1:10, ]
B <- df[11:20, ]
# Append / bind rows
AB <- rbind(A, B)

To illustrate the other scenario, that is, binding columns, imagine that the original data has only three variables, age, workclass, and fnlwgt. Then, the team was able to collect more information about the taxpayers, adding education grade and occupation....
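
A minimal sketch of that column-binding scenario; the split below is artificial, just to produce two objects describing the same ten observations:

# Two pieces of information about the same taxpayers
df_left  <- df[1:10, c('age', 'workclass', 'fnlwgt')]
df_right <- df[1:10, c('education', 'occupation')]
# Append / bind columns
full <- cbind(df_left, df_right)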

Using data.table

The data.table library describes itself as an enhanced version of R's data.frames. Using only base R, it is not easy to group data, for example. There are other small enhancements too, such as not converting strings to factors during data import and a cleaner display when printing datasets to R's console.

The syntax for this library is very similar to data.frames, as you may have already seen during this chapter, but it is formally presented here:

Basic syntax
DT[i, j, by]
  • i is for the row selection or a condition for the rows to be displayed
  • j is for selecting variables or calculating a statistic based on them
  • by is used for grouping variables

Before using the data.table syntax, it is necessary to make sure that the object is of the correct type. That can be checked with class(object), and conversion to a data.table object can be done using as.data.table(object).

Consider the following code snippet:

# Syntax
dt[dt$age > 50...
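
To round off the DT[i, j, by] pattern, here is a complete example with all three parts together; the filter, statistic, and grouping variable are arbitrary choices for illustration:

# Convert the object, then filter rows (i), compute a statistic (j),
# and group the result (by)
dt <- as.data.table(df)
dt[age > 50, .(mean_hours = mean(hours_per_week)), by = sex]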

Summary

Transformations are the core of data wrangling. Datasets are almost like living organisms that change and evolve during the wrangling process, being shaped by the transformations, which, by the way, are driven by the analysis requirements.

In this chapter, we learned about the main transformations for data wrangling in R. We started with slicing and filtering, two great techniques for zooming in on a piece of the dataset for deeper analysis. Then we moved on to grouping and summarizing, the dynamic duo of the transformations, where one gathers the data into groups and the other captures the essence of each group in a single number or statistic. Replacing and filling came next, where we learned how to replace values such as ? with NA, followed by functions to fill NA values with the mean for numeric variables and with the most frequent value for categorical variables.

The section about arranging data covered the use of the order() function to order...

Exercises

  1. What is the difference between slicing and filtering?
  2. Describe grouping and summarizing.
  3. What function is used to replace all the patterns in a variable?
  4. What function drops the missing values from the entire dataset, and when should we use it?
  5. What is the percentage of NA values that is OK to drop from a dataset?
  6. Describe the main benefit of arranging data.
  7. Write a group by command with data.table.
