Transformations with Base R

In the previous three chapters, our intent was to lay the foundations of the main data types you will find when working on a real-life project. Once a dataset is opened, you are likely to find strings, numbers, and dates and times as variables. Knowing what they are, how they can be created, and some popular functions to manipulate them will keep us moving during exploratory data analysis.

The next two chapters focus on data transformations, which I consider the core of data wrangling: most of what needs to be done during the wrangling phase of a project is related to transforming the data. Another good chunk of the work goes into data visualization, and the final piece is modeling and evaluating the results.

In this chapter, we will study the most common transformations that can be done in a dataset:

  • Slicing and filtering: These tasks allow us to focus on a specific part of...

Technical requirements

We will use the Census Income dataset (https://archive.ics.uci.edu/ml/datasets/Adult) for this chapter.

All the code can be found in the book’s GitHub repository: https://github.com/PacktPublishing/Data-Wrangling-with-R/tree/main/Part2/Chapter7.

Before moving forward, make sure to run the following installation requirements if you want to code along with the book’s examples:

# Install packages
install.packages('data.table')
install.packages('stringr')
# Load libraries
library(data.table)
library(stringr)

The dataset

The dataset to be used in the next exercises can be found in the UCI Machine Learning Repository. It was pulled from the popular datasets tab in the repository; it is named Adult, but it is also known as the Census Income dataset (https://archive.ics.uci.edu/ml/datasets/Adult).

The variables we will be dealing with are listed next; a short loading sketch follows the list. I also invite you to read the adult.names file provided with the dataset in the UCI repository or on the GitHub page for this chapter's code (https://tinyurl.com/ywpjj329):

  • Demographics: age, sex, race, marital-status, relationship status, native country.
  • Education: education level and education-num (years of study).
  • Work related: work class, occupation, hours per week.
  • Financial: capital gain and capital loss.
  • fnlwgt: This means final weight, which is a scoring calculation from the Census Bureau based on socio-economic and demographic data. People with similar demographic information should have similar weights...
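
If you want to code along, here is one way to load the data directly from the UCI repository with fread() from data.table. The raw-file URL and the snake_case column names (for example, native_country) are my assumptions, chosen to match the variable names used later in this chapter; adult.names has the authoritative list:

# Load the Census Income (Adult) data from the UCI repository
# Assumed raw-file URL; column names adapted from adult.names
url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
df <- fread(url,
            col.names = c('age', 'workclass', 'fnlwgt', 'education',
                          'education_num', 'marital_status', 'occupation',
                          'relationship', 'race', 'sex', 'capital_gain',
                          'capital_loss', 'hours_per_week', 'native_country',
                          'income'))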

Slicing and filtering

When you have a table as large as the dataset we are working with, it is very hard to look at all the observations one by one. Look how many rows and columns this dataset has:

# Dataset dimensions
dim(df)
[1] 32561    15

The dim() function shows the number of rows first, then the number of columns, or variables. It is easy to see that looking at 32,561 observations one by one would take too much time, not to mention be unproductive. Therefore, the tasks of slicing and filtering play a major role, acting like a magnifying glass that lets us zoom in on specific parts of the data.

These tasks can sound like they’re the same, but there is a slight difference between them.

Slicing

Slicing means cutting and displaying a slice, a piece, of the dataset. A good application of this task is when we need to look at the errors of a model. In this case, it is possible to take only the observations where...
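
To make the idea concrete, here is a minimal sketch of slicing with bracket notation, assuming the df object loaded earlier; the particular rows and columns are arbitrary choices for illustration:

# Slice the first five rows and two columns by position
df[1:5, c(1, 2)]
# The same slice, selecting the columns by name
df[1:5, c('age', 'workclass')]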

Grouping and summarizing

The same logic used to present the slicing and filtering concepts can be applied here too: we will never go row by row, analyzing one observation at a time.

We need a better way to look at the data, one that makes it smaller and easier to understand. To do that, we can aggregate data, creating groups of observations and putting each one of them in a separate and labeled box. This is grouping.

After that, we have groups, but n labeled boxes whose contents we do not know are still not very useful. Summarization does that job: it takes the observations in each box and wraps them up in a single number, which could be the mean, the median, or the total. Summarization is, therefore, the reduction of many observations to one number.

Given these definitions, it is reasonable to say that summarization is complementary to the grouping function since we first aggregate the data in groups and then...
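
As a minimal sketch of this group-then-summarize pattern in base R, assuming the df object loaded earlier, we can use table() for counts and aggregate() for a statistic per group:

# Count observations per group (one labeled box per workclass)
table(df$workclass)
# Summarize each box with a single number: mean hours per week by sex
aggregate(hours_per_week ~ sex, data = df, FUN = mean)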

Replacing and filling

Replacing values is straightforward: you have a value that does not fit the data, and you need it replaced with another value. There is a good example in the dataset we are using in this chapter. The documentation states that the author converted unknown values to "?", meaning that you will not find any standard NA values in this dataset. Therefore, it is our job as data scientists to wrangle this and replace all the ? values with NA.

Note

It’s worth making a note of this, as a lesson learned from this exercise: always look at the data documentation, if and when it is available. Many explanations about the way the data was collected and the meaning of each variable are contained in these documents.

Replacing the values is possible using slicing notation or the gsub() function. In the dataset, there are three variables with ? values: workclass, occupation, and native_country.

We will replace...
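
A minimal sketch of the slicing-notation approach, assuming the df object loaded earlier and that the unknown marker appears exactly as '?' (fread trims the leading whitespace the raw file puts before it):

# Replace every '?' with NA in the three affected variables
df$workclass[df$workclass == '?'] <- NA
df$occupation[df$occupation == '?'] <- NA
df$native_country[df$native_country == '?'] <- NA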

Arranging

Data can be arranged in two common forms: low to high or high to low, also known as ascending and descending order. Arranging data is useful for ranking observations or groups in an order that makes it easier for us to understand. When I look at the top 5 most sold items, I know that they are what brings traffic to a store. Then, imagine that the third best-selling item in terms of count is, in fact, the product that makes the most revenue. That could change our strategy, couldn’t it?

When looking at the other side of the rank, the tail, the bottom 5 items in terms of the number of items sold could be potential candidates to remove from the shelves, as they probably won’t bring much revenue to the business.

This simple example explains why arranging data is important when exploring data. But arranging is also an important part of data wrangling when visualizing data because it quickly pulls our eyes to the maximum point, from where we can read the rest...
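
As a minimal sketch of arranging with base R's order() function, assuming the df object loaded earlier:

# Ascending (low to high) by age
head(df[order(df$age), ])
# Descending (high to low): set decreasing = TRUE
head(df[order(df$age, decreasing = TRUE), ])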

Creating new variables

A dataset is not only the data you see; there is a lot of information latent in it. For example, remember that in Chapter 6, when we worked with datetime objects during our data exploration exercise, we took the TIME variable and extracted the year, month, day, and hour from it. That is one of the many ways to create new variables.

Here are some examples of new variables created out of our working dataset; a short code sketch follows the list:

  • Arithmetical operators: Adding two or more variables to create a total variable.
  • Text extraction: Extracting a meaningful part of a text, for instance, 1234 from ORDER-1234.
  • Custom calculations: Calculating a discount rate based on a business rule.
  • Binarization: Transforming a variable from on and off to 1 and 0. Binary means two options and is commonly associated with 0 and 1 in computer language.
  • Encoding: Transforming a qualitative ordinal variable, such as basic, intermediate, and advanced to 1, 2, and 3.
  • One Hot Encoding: A very common...
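
Here is a short sketch of two of these transformations, binarization and text extraction; the over_40h variable name is a hypothetical example of my own:

# Binarization: 1 if the person works more than 40 hours per week, else 0
df$over_40h <- ifelse(df$hours_per_week > 40, 1, 0)
# Text extraction with stringr: pull '1234' out of 'ORDER-1234'
str_extract('ORDER-1234', '[0-9]+')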

Binding

Binding data is the last of the main transformations listed at the beginning of this chapter. It is common to find yourself with two or more datasets that you need to put together for analysis. There are a couple of ways to do that, as follows:

Figure 7.16 – Types of data binding

Assume that our Census Income dataset has only 10 rows. After some research, the internal team found another 10 observations and gave them to the data science team. The ten new observations have to be appended to the original dataset since they have the same variables. Let’s see that in action:

# Creating datasets A and B
A <- df[1:10, ]
B <- df[11:20, ]
# Append / bind rows
AB <- rbind(A, B)

To illustrate the other scenario, that is, binding columns, imagine that the original data has only three variables, age, workclass, and fnlwgt. Then, the team was able to collect more information about the taxpayers, adding education grade and occupation....
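
A minimal sketch of that column-binding scenario; the split below is artificial, just to produce two objects describing the same ten observations:

# Two pieces of information about the same taxpayers
df_left  <- df[1:10, c('age', 'workclass', 'fnlwgt')]
df_right <- df[1:10, c('education', 'occupation')]
# Append / bind columns
full <- cbind(df_left, df_right)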

Using data.table

The data.table library describes itself as an enhanced version of R's data.frames. Using only base R, it is not easy to group data, for example. There are other small enhancements too, such as not converting strings to factors during data import and a cleaner display when printing datasets to R's console.

The syntax for this library is very similar to data.frames, as you may have already seen during this chapter, but it is formally presented here:

Basic syntax
DT[i, j, by]
  • i is for the row selection or a condition for the rows to be displayed
  • j is for selecting variables or calculating a statistic based on them
  • by is used for grouping variables

Before using the data.table syntax, it is necessary to make sure that the object is of the correct type. That can be checked with class(object), and conversion to a data.table object can be done using as.data.table(object).

Consider the following code snippet:

# Syntax
dt[dt$age > 50...
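
To round off the DT[i, j, by] pattern, here is a complete example with all three parts together; the filter, statistic, and grouping variable are arbitrary choices for illustration:

# Convert the object, then filter rows (i), compute a statistic (j),
# and group the result (by)
dt <- as.data.table(df)
dt[age > 50, .(mean_hours = mean(hours_per_week)), by = sex]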

Summary

Transformations are the core of data wrangling. Datasets are almost like living organisms that change and evolve during the wrangling process, being shaped by the transformations, which, by the way, are driven by the analysis requirements.

In this chapter, we learned about the main transformations for data wrangling in R. We started with slicing and filtering, two great techniques for zooming in on a piece of the dataset for deeper analysis. Then we moved on to grouping and summarizing, the dynamic duo of the transformations, where one gathers the data into groups and the other captures the essence of each group in a single number or statistic. Replacing and filling came next, where we learned how to replace values such as ? with NA, followed by functions to fill NA values with the mean for numeric variables and with the most frequent value for categorical variables.

The section about arranging data covered the use of the order() function to order...

Exercises

  1. What is the difference between slicing and filtering?
  2. Describe grouping and summarizing.
  3. What function is used to replace all the patterns in a variable?
  4. What function drops the missing values from the entire dataset, and when should we use it?
  5. What is the percentage of NA values that is OK to drop from a dataset?
  6. Describe the main benefit of arranging data.
  7. Write a group by command with data.table.
