You're reading from The Statistics and Machine Learning with R Workshop

Product type: Book
Published in: Oct 2023
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781803240305
Edition: 1st Edition

Author: Liu Peng

Peng Liu is an Assistant Professor of Quantitative Finance (Practice) at Singapore Management University and an adjunct researcher at the National University of Singapore. He holds a Ph.D. in statistics from the National University of Singapore and has ten years of working experience as a data scientist across the banking, technology, and hospitality industries.
Data Processing with dplyr

In the previous chapter, we covered the basics of the R language itself. Grasping these fundamentals will help us tackle the most common task in data science projects: data processing. Data processing refers to the series of data wrangling and massaging steps that transform raw data into the format required for downstream analysis and modeling. We can think of it as a function that accepts the raw data and outputs the desired data; however, we must explicitly specify the recipe that the function follows to process the data.

By the end of this chapter, you will be able to perform common data wrangling steps such as filtering, selection, grouping, and aggregation using dplyr, one of the most widely used data processing libraries in R.

In this chapter, we will cover the following topics:

  • Introducing tidyverse and dplyr
  • Data transformation with dplyr
  • Data aggregation with dplyr
  • Data merging with dplyr...

Technical requirements

To complete the exercises in this chapter, you will need the following:

  • The latest version of the tidyverse package, which is 1.3.1 at the time of writing

All the code and data for this chapter are available at https://github.com/PacktPublishing/The-Statistics-and-Machine-Learning-with-R-Workshop/tree/main/Chapter_2.

Introducing tidyverse and dplyr

One of the most widely used R libraries is tidyverse, a collection of individual packages that includes dplyr and ggplot2 (to be covered in Chapter 4). It supports most data processing and visualization needs and offers easier, faster implementations than base R commands. It is therefore recommended to delegate a specific data processing or visualization task to tidyverse instead of implementing it ourselves.

Before we dive into the world of data processing, there is one more data structure that’s used in the ecosystem of tidyverse: tibble. A tibble is an advanced version of a DataFrame and offers much better format control, leading to clean expressions in code. It is the central data structure in tidyverse. A DataFrame can be converted into a tibble object and vice versa. Let’s go through an exercise on this.
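Before the full exercise, here is a minimal sketch of the round-trip conversion described above, using the built-in iris data frame (this is an illustration, not the exercise solution):

```r
library(tibble)

# Convert the built-in iris data frame into a tibble
iris_tbl <- as_tibble(iris)
class(iris_tbl)   # "tbl_df" "tbl" "data.frame"

# Convert the tibble back into a plain data frame
iris_df <- as.data.frame(iris_tbl)
class(iris_df)    # "data.frame"
```

Note that a tibble still inherits from the data.frame class, so most base R code that expects a data frame continues to work on it.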

Exercise 2.01 – converting between tibble and a DataFrame

First, we will explore the tidyverse...

Data transformation with dplyr

Data transformation refers to a collection of techniques for performing row-level treatment on the raw data using dplyr functions. In this section, we will cover five fundamental functions for data transformation: filter(), arrange(), mutate(), select(), and top_n().
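As a preview, the five verbs can be chained together in a single pipeline; the following sketch applies all of them to the built-in iris dataset (the derived Petal.Ratio column is an illustrative choice, not part of the book's exercises):

```r
library(dplyr)

iris %>%
  filter(Species == "virginica") %>%                    # keep one species
  mutate(Petal.Ratio = Petal.Length / Petal.Width) %>%  # derive a new column
  arrange(desc(Petal.Ratio)) %>%                        # sort descending
  select(Species, Petal.Ratio) %>%                      # keep two columns
  top_n(3, Petal.Ratio)                                 # rows with the three largest ratios (ties kept)
```

Each verb accepts the data as its first argument and returns a transformed copy, which is what makes this kind of chaining possible.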

Slicing the dataset using the filter() function

One of the biggest highlights of the tidyverse ecosystem is the pipe operator, %>%, which provides the statement before it as the contextual input for the statement after it. Using the pipe operator gives us better clarity in terms of code structuring, besides saving the need to type multiple repeated contextual statements. Let’s go through an exercise on how to use the pipe operator to slice the iris dataset using the filter() function.
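The readability gain is easiest to see side by side; the following sketch contrasts a nested base-style call with its piped equivalent (the Sepal.Length threshold is an arbitrary illustrative choice):

```r
library(dplyr)

# Without the pipe: nested calls must be read inside-out
head(filter(iris, Sepal.Length > 7), 3)

# With the pipe: each step feeds the next, reading top to bottom
iris %>%
  filter(Sepal.Length > 7) %>%
  head(3)
```

Both versions return the same three rows; the piped form simply states the steps in the order they are performed.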

Exercise 2.02 – filtering using the pipe operator

For this exercise, we have been asked to keep only the setosa species in the iris dataset using the pipe operator and the filter...

Data aggregation with dplyr

Data aggregation refers to a set of techniques that summarizes the dataset at an aggregate level and characterizes the original dataset at a higher level. Unlike data transformation, which operates at the row level for both input and output, data aggregation takes row-level input and produces group-level output.

We have already encountered a few aggregation functions, such as calculating the mean of a column. This section will cover some of the most widely used aggregation functions provided by dplyr. We will start with the count() function, which returns the number of observations/rows for each category of the specified input column.
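For instance, a per-group mean can be computed by combining group_by() with summarize(); the sketch below uses the iris dataset and an illustrative summary column name, mean_sepal_length:

```r
library(dplyr)

# Summarize each species by the mean of one column
iris %>%
  group_by(Species) %>%
  summarize(mean_sepal_length = mean(Sepal.Length))
# One row per species: a 3 x 2 summary of the original 150 rows
```

Note how the 150-row input collapses to one output row per group, which is the defining behavior of aggregation.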

Counting observations using the count() function

The count() function automatically groups the dataset into different categories according to the input argument and returns the number of observations for each category. The input argument could include one or more columns of the dataset. Let’s go through an exercise and apply it to the iris dataset.
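As a quick sketch of the behavior described above (shown here on the built-in iris and mtcars datasets for illustration):

```r
library(dplyr)

# count() groups by the given column(s) and tallies rows per group
iris %>% count(Species)
# Three rows, one per species, each with n = 50

# Counting on multiple columns: one row per observed combination
mtcars %>% count(cyl, gear)
```

Passing more than one column yields one count per distinct combination of values, rather than separate counts per column.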

Exercise 2.08 –...

Data merging with dplyr

In practical data analysis, the information we need is not necessarily confined to one table but is spread across multiple tables. Storing data in separate tables is memory-efficient but not analysis-friendly. Data merging is the process of merging different datasets into one table to facilitate data analysis. When joining two tables, there need to be one or more columns, or keys, that exist in both tables and serve as the common ground for joining.

This section will cover different ways to join tables and analyze them in combination, including inner join, left join, right join, and full join. The following list shows the verbs and their definitions for these four types of joining:

  • inner_join(): Returns common observations in both tables according to the matching key.
  • left_join(): Returns all observations from the left table and matched observations from the right table. Note that in the case of a duplicate key value in the right table, an additional...
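The four verbs can be contrasted on a pair of tiny tables sharing a key column; the tables below are hypothetical and not part of the book's exercises:

```r
library(dplyr)

# Two small illustrative tables sharing the key column "id"
band <- tibble::tibble(id = c(1, 2, 3), name = c("Ann", "Bo", "Cy"))
role <- tibble::tibble(id = c(2, 3, 4), role = c("bass", "drums", "keys"))

inner_join(band, role, by = "id")  # ids 2 and 3 only
left_join(band, role, by = "id")   # all of band; NA role for id 1
right_join(band, role, by = "id")  # all of role; NA name for id 4
full_join(band, role, by = "id")   # ids 1 through 4
```

Whenever a row has no match in the other table, the join fills the missing columns with NA, which is why the left, right, and full joins can return more rows than the inner join.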

Case study – working with the Stack Overflow dataset

This section will cover an exercise to help you practice different data transformation, aggregation, and merging techniques based on the public Stack Overflow dataset, which contains a set of tables related to technical questions and answers posted on the Stack Overflow platform. The supporting raw data has been uploaded to the accompanying GitHub repository of this book. We will download it directly from the source GitHub link using the readr package, another tidyverse offering that provides an easy, fast, and friendly way to read a wide range of data sources, including those from the web.

Exercise 2.11 – working with the Stack Overflow dataset

Let’s begin this exercise:

  1. Download three data sources on questions, tags, and their mapping table from GitHub:
    library(readr)
    df_questions = read_csv("https://raw.githubusercontent.com/PacktPublishing/The-Statistics-and-Machine-Learning-with-R-Workshop...

Summary

In this chapter, we covered essential functions and techniques for data transformation, aggregation, and merging. For data transformation at the row level, we learned about common utility functions such as filter(), mutate(), select(), arrange(), top_n(), and transmute(). For data aggregation, which summarizes the raw dataset into a smaller and more concise summary view, we introduced functions such as count(), group_by(), and summarize(). For data merging, which combines multiple datasets into one, we learned about different joining methods, including inner_join(), left_join(), right_join(), and full_join(). Although more advanced joining functions exist, the essential tools covered here are sufficient for most merging tasks. Finally, we went through a case study based on the Stack Overflow dataset. The skills we learned in this chapter will come in very handy in many data analysis tasks.

In the next chapter, we will cover a more advanced topic...
