You're reading from  Data Wrangling with R

Product type: Book
Published in: Feb 2023
Publisher: Packt
ISBN-13: 9781803235400
Edition: 1st Edition
Author: Gustavo R Santos

Gustavo R Santos has worked in the technology industry for 13 years, improving processes, analyzing datasets, and creating dashboards. Since 2020, he has been working as a data scientist in the retail industry, wrangling, analyzing, visualizing, and modeling data with the most modern tools, such as R, Python, and Databricks. Gustavo also gives lectures from time to time at an online school about data science concepts. He has a background in marketing, is certified as a data scientist by the Data Science Academy Brazil, and is pursuing his specialist MBA in Data Science at the University of São Paulo.

Transformations with Tidyverse Libraries

Our journey through data wrangling is now at its core. We have just finished studying the major transformations from the perspective of base R's built-in functions, with support from the data.table library.

We saw how easy it was to reach a solution for some of those transformations without needing to load extra libraries. However, as the problems get more complicated, the basic functions will not be able to provide a sufficiently clean and fast solution. The code will get busier and will probably underperform as the size of the dataset increases. For complex cases, there are several libraries built for the R language that can help us get through most problems with better performance and cleaner code. If you are interested in performance comparisons between base R, data.table, and tidyverse, refer to this page (https://tinyurl.com/2udfcvx2), where the author compares the most common tasks using the three libraries.

Speaking of...

Technical requirements

Dataset: We will use the Census Income dataset (https://archive.ics.uci.edu/ml/datasets/Adult) for this chapter.

All the code can be found in the book’s GitHub repository: https://github.com/PacktPublishing/Data-Wrangling-with-R/tree/main/Part2/Chapter8. The package to be used can be installed and loaded using the following single commands:

install.packages('tidyverse')
library(tidyverse)

Figure 8.1 shows the message displayed once the package is loaded. Be aware that some red-colored warning messages are displayed when you load packages in R, but they are not errors. They simply tell you which version of the R language the packages' current versions were built under.

Figure 8.1 – The tidyverse package will load eight libraries. The warning messages are not errors

In the next section, let’s remind ourselves about tidy data and why it is important for data wrangling with tidyverse.

What is tidy data?

To tidy something means to arrange it, to put it in order. Consequently, tidy data means that our data has a specific order and follows a set of rules that make it ready to be worked on.

A dataset can be arranged in different ways. For those who, like me, worked for many years with Microsoft Excel, a tidy dataset may seem odd at first sight, as there will be plenty of repeated cells. Many datasets I worked with in MS Excel had the same measurement split among many columns. A classic example is a monthly report whose first columns hold the descriptive part of the data (for example, product, profit, and loss), while the values referring to them are spread over one column per month.

Figure 8.2 – Example of dataset not in Tidy format

The table from Figure 8.2 is comfortable to look at but not useful for an algorithm or a programming language. If you try to determine the best month for sales, it will require...
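To make the idea concrete, here is a sketch in R of such a wide monthly report; the column names and values are made up for illustration, mimicking the layout Figure 8.2 describes:

# A hypothetical wide-format table: descriptive columns first,
# then one value column per month (not tidy)
sales_wide <- data.frame(
  product = c("A", "B"),
  Jan = c(100, 50),
  Feb = c(120, 65),
  Mar = c(90, 70)
)

In a tidy version of this table, the months would live in a single month column and the values in a single sales column, one observation per row.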

Slicing and filtering

Slicing and filtering a dataset are two similar ways to zoom in on a desired part of the data. In tidyverse, the dplyr library handles the most common data wrangling tasks. Slicing and filtering are among those tasks, as is the select() function, as we will see.

Slicing

As discussed in Chapter 7, slicing cuts out unwanted parts of the dataset, returning just a subset of the rows and/or columns. There is more than one way to slice a dataset, and we will learn the most interesting functions to do that, starting with the most basic one, as follows. In the code, when we use ., it means we are taking everything from the object that precedes it, which is df, and following it with a slicing notation. Rows 1 to 5 and columns 1 to 4:

# Slicing rows 1 to 5, columns 1 to 4.
df %>% .[1:5, c(1:4)]

The result is shown in Figure 8.5.

Figure 8.5 – Sliced dataset

To provide summary notes about the previous code: the dataset object...
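Filtering, by contrast, selects rows by a logical condition rather than by position. A minimal dplyr sketch, assuming the Census Income data is loaded as df (it has a numeric age column):

# Keep only rows where age is greater than 30,
# then show the first four columns
df %>%
  filter(age > 30) %>%
  select(1:4)

The filter() call keeps every row satisfying the condition, while select() narrows the columns, here by position, just like the slicing notation above.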

Grouping and summarizing data

Grouping and summarizing are two complementary functions. Generally, they will be used together, as there is not much use in grouping a dataset and not calculating anything or using the groups for a purpose. That is when summarizing plays the important role of transforming the data from each group into a summary or a number that we can understand.

In the business world, requests such as the average number of sales by store, the median number of customers by day, the standard deviation of a distribution, and many other examples are part of the routine of a data scientist. These tasks can be performed using the group_by() and summarise() functions from dplyr.

Starting with the group_by() function, observe that it alone cannot bring much value:

# group by not summarized
df %>% group_by(workclass)

Here is the result.

Figure 8.9 – Dataset grouped but not summarized

We can see in Figure 8.9 that it worked because...
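The grouping becomes useful once summarise() is added. Here is a sketch, assuming df holds the Census Income data with its numeric age column:

# Mean age for each workclass group
df %>%
  group_by(workclass) %>%
  summarise(mean_age = mean(age, na.rm = TRUE))

Each group defined by group_by() collapses to a single row, with the summary statistic computed within that group.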

Replacing and filling data

A dataset can and certainly will be acquired with imperfections. An example of an imperfection is the use of the ? sign instead of the default NA for missing values in the Census Income dataset. This problem requires the question mark to be replaced with NA first; the missing values can then be filled with another value, such as the mean or the most frequent observation, or with more complex methods, even machine learning.

This case clearly illustrates the necessity of replacing and filling data points from a dataset. Using tidyr, there are specific functions to replace and fill in missing data.

First, the ? sign needs to be replaced with NA before we can think of filling the missing values. As seen in Chapter 7, there are missing values only in the workclass (1836), occupation (1843), and native_country (583) columns. To confirm that, the quickest resource is a loop through the variables searching for ?:

# Loop through variables looking for cells == "...
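The idea behind that loop, and the replacement step that follows it, could be sketched like this (a minimal version, assuming df holds the Census Income data; note that in the raw file the marker may carry a leading space, as " ?", depending on how the file was read):

# Count cells equal to "?" in each column
sapply(df, function(col) sum(col == "?"))

# Replace "?" with NA in every character column
df_na <- df %>%
  mutate(across(where(is.character), ~ na_if(., "?")))

Restricting across() to character columns avoids type errors, since na_if() expects the value and the replacement target to be of compatible types.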

Arranging data

Arranging data is useful to create a rank, putting the dataset in order. The order can be from low to high values, also known as increasing order, or from high to low, known as decreasing order. In RStudio, visualizing a dataset in the software's viewer pane already allows the analyst to arrange the data with the click of a button: just like in many dynamic tables, if you click on a column name, the data becomes ordered by that variable. For simply eyeballing it, the feature is terrific, but for programming purposes, it has no effect. You will have to take advantage of the arrange() function from dplyr.

The most basic ways to arrange a dataset are by running the succeeding pieces of code. First, let's try arranging by increasing order:

# Arrange data in increasing order
df_no_na %>% arrange(native_country)

Next, arranging in decreasing order:

# Arrange data in decreasing order
df_no_na %>% arrange( desc(native_country) )

Notice that adding...
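arrange() also accepts multiple columns, ordering by the first and breaking ties with the next, and desc() can wrap any of them. A quick sketch using columns from the Census Income data:

# Order by country, then by decreasing age within each country
df_no_na %>%
  arrange(native_country, desc(age))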

Creating new variables

Creating new variables can be useful for data scientists when they need to analyze something that is not present in the data as it was acquired. Common tasks to create new data are splitting a column, creating a calculation, encoding text, and applying a custom function over a variable.

We went over some good examples of column splitting in this book, such as a datetime split. Now, to illustrate the separate() function from tidyr, the example is based on the Census Income dataset. Look at the target column: it has values such as <=50k and >50k. Let's say we want to separate only the > or <= signs and put them in a separate column; here is how to do that:

# Split variable target into sign and amount
df_no_na %>% separate(target, into=c("sign", "amt"), sep="\\b")

We took the dataset clean of NAs and separated the target column into two new variables: sign and amt. To accomplish that, the...

Joining datasets

Datasets can come from different sources or from different tables within the same database or data lake. Many times, those tables are related to each other by key columns, which means you will be able to find a column A in table 1 and a column A in table 2 that hold the same kind of information, so the tables can be related to each other through that common key element.

To better explain the join concept, imagine we are engineers at a retail company. Our goal is to store data about transactions from each store, including date, product, descriptions, quantity, and amount. Well, we could put everything in the same table, resulting in a big, heavy file that the database would have to deal with every time we wanted to query some information. Think about that for a moment: we won't need to pull the product description or the store address every single time. Consequently, the optimal solution to that problem is splitting that information into smaller tables...
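A minimal sketch of the idea with dplyr; the tables and columns below are made up for illustration:

# Two small related tables sharing the key column product_id
transactions <- data.frame(
  product_id = c(1, 2, 1, 3),
  quantity   = c(5, 2, 1, 7)
)
products <- data.frame(
  product_id  = c(1, 2, 3),
  description = c("Shirt", "Mug", "Lamp")
)

# Bring each product's description into the transactions table
transactions %>%
  left_join(products, by = "product_id")

The left join keeps every transaction row and attaches the matching description from the smaller lookup table, which only needs to store each description once.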

Reshaping a table

Data frames are composed of rows and columns, and we already know that data considered tidy has one observation per row and one variable per column. However, data is everywhere, being generated by different kinds of sensors, machines, devices, and people. Naturally, not all of that data will come to you in a perfectly beautiful, tidy format. More often than not, you will face messy datasets.

One of the things that can happen is receiving datasets in a wide format. As the name suggests, this data comes with values of the same variable spread across several columns rather than stacked in a single one, meaning that measurements of one variable can appear in different columns. A common example of a wide-formatted dataset is a monthly report with a column for the description of a project, followed by one column with the expense amount for each month, as shown in Figure 8.29.

Figure 8.29 – Projects dataset in wide format

A data frame in a wide shape...
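Reshaping such a wide table into a tidy, long shape can be sketched with tidyr's pivot_longer(); the projects data below is made up to mimic the layout described for Figure 8.29:

# Hypothetical wide-format projects report
projects <- data.frame(
  project = c("Alpha", "Beta"),
  Jan = c(1000, 2000),
  Feb = c(1500, 1800)
)

# Pivot the month columns into two tidy columns: month and expense
projects %>%
  pivot_longer(cols = Jan:Feb,
               names_to = "month",
               values_to = "expense")

After pivoting, each row holds one observation: a project, a month, and the expense for that combination.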

Do more with tidyverse

Closing this chapter, we will quickly study a few functions from tidyverse that were not mentioned in any of the previous sections but that can be very helpful when solving data wrangling problems.

Consider again the mtcars dataset (load it with data('mtcars')), which has information about 32 cars from the 1974 Motor Trend US magazine. We are already familiar with that data, and we can use it as a reference to learn about the next few transformations.

Let's dive right in with a couple of functions from the purrr library. This library brings functions similar to those of the apply family, studied in Chapter 5. The most interesting one to look at is the map() function. It applies the same function to every element of a vector or list. If we want to map the mean over the horsepower (hp) and weight (wt) variables, this is how to do it:

# Map
mtcars %>%
  select(hp, wt) %>%
  map(mean)

$hp
[1] 146.6875

$wt
[1] 3.21725...
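Note that map() always returns a list. When a plain numeric vector is preferred, purrr's map_dbl() variant can be used with the same pipeline:

# Same computation, returned as a named numeric vector
mtcars %>%
  select(hp, wt) %>%
  map_dbl(mean)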

Summary

We have been through a lot in this chapter. After all, the core of data wrangling is about the transformations that we covered here. Most of the work of analysts, data scientists, and developers involves filtering, reshaping, joining, and doing all kinds of data wrangling to get the data into the shape needed for the project.

We started by studying slicing and filtering, which allow us to zoom in on parts of a dataset, and revisited the difference between the two. Then, we moved on to grouping and summarizing data, which turns out to be an important task, possibly one of the most used operations when munging data.

Replacing and filling were the next two subjects. That section covered how to replace values and how to fill in missing data in our dataset. Once the data was cleaned of NAs, the subsequent step was ordering the data, making it easier to read and understand the information.

It is worth repeating that datasets have more information than what it may initially...

Exercises

  1. What function is used to filter data?
  2. What happens when group_by() is used without the summarizing function?
  3. Write the code to return the capital_gain mean grouped by native_country.
  4. What function is used to drop all NAs from a dataset?
  5. Write the code to replace NAs with another value.
  6. Write the code to return the average net gain (capital_gain – capital_loss) by education level, arranged in decreasing order.
  7. What is the main function to be used to create new variables?
  8. List all the types of joins.
  9. What is the difference between left join and inner join?
  10. What are the two functions used to reshape data?

Further reading

https://tinyurl.com/yexdrev5

https://en.wikipedia.org/wiki/Join_(SQL)

