You're reading from The Statistics and Machine Learning with R Workshop

Product type: Book
Published in: Oct 2023
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781803240305
Edition: 1st Edition

Author: Liu Peng

Peng Liu is an Assistant Professor of Quantitative Finance (Practice) at Singapore Management University and an adjunct researcher at the National University of Singapore. He holds a Ph.D. in statistics from the National University of Singapore and has ten years of working experience as a data scientist across the banking, technology, and hospitality industries.
Data Processing with dplyr

In the previous chapter, we covered the basics of the R language itself. Grasping these fundamentals will help us tackle the most common task in data science projects: data processing. Data processing refers to the series of data wrangling and massaging steps that transform raw data into the format required for downstream analysis and modeling. We can think of it as a function that accepts the raw data and outputs the desired data; however, we must explicitly specify the recipe that the function follows to process the data.

By the end of this chapter, you will be able to perform common data wrangling steps such as filtering, selection, grouping, and aggregation using dplyr, one of the most widely used data processing libraries in R.

In this chapter, we will cover the following topics:

  • Introducing tidyverse and dplyr
  • Data transformation with dplyr
  • Data aggregation with dplyr
  • Data merging with dplyr...

Technical requirements

To complete the exercises in this chapter, you will need the following:

  • The latest version of the tidyverse package, which is 1.3.1 at the time of writing

All the code and data for this chapter are available at https://github.com/PacktPublishing/The-Statistics-and-Machine-Learning-with-R-Workshop/tree/main/Chapter_2.

Introducing tidyverse and dplyr

One of the most widely used R libraries is tidyverse, a collection of individual packages that includes dplyr and ggplot2 (to be covered in Chapter 4). It supports most data processing and visualization needs and offers easier, faster implementations than base R commands. It is therefore recommended to delegate a specific data processing or visualization task to tidyverse instead of implementing it ourselves.

Before we dive into the world of data processing, there is one more data structure that’s used in the ecosystem of tidyverse: tibble. A tibble is an advanced version of a DataFrame and offers much better format control, leading to clean expressions in code. It is the central data structure in tidyverse. A DataFrame can be converted into a tibble object and vice versa. Let’s go through an exercise on this.
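Before the full exercise, here is a minimal sketch of the round-trip conversion described above, using the built-in iris data frame (this is an illustration, not the exercise solution):

```r
library(tibble)

# Convert the built-in iris data frame into a tibble
iris_tbl <- as_tibble(iris)
class(iris_tbl)   # "tbl_df" "tbl" "data.frame"

# Convert the tibble back into a plain data frame
iris_df <- as.data.frame(iris_tbl)
class(iris_df)    # "data.frame"
```

Note that a tibble still inherits from the data.frame class, so most base R code that expects a data frame continues to work on it.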

Exercise 2.01 – converting between tibble and a DataFrame

First, we will explore the tidyverse...

Data transformation with dplyr

Data transformation refers to a collection of techniques for performing row-level treatment on the raw data using dplyr functions. In this section, we will cover five fundamental functions for data transformation: filter(), arrange(), mutate(), select(), and top_n().
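As a preview, the five verbs can be chained together in a single pipeline; the following sketch applies all of them to the built-in iris dataset (the derived Petal.Ratio column is an illustrative choice, not part of the book's exercises):

```r
library(dplyr)

iris %>%
  filter(Species == "virginica") %>%                    # keep one species
  mutate(Petal.Ratio = Petal.Length / Petal.Width) %>%  # derive a new column
  arrange(desc(Petal.Ratio)) %>%                        # sort descending
  select(Species, Petal.Ratio) %>%                      # keep two columns
  top_n(3, Petal.Ratio)                                 # rows with the three largest ratios (ties kept)
```

Each verb accepts the data as its first argument and returns a transformed copy, which is what makes this kind of chaining possible.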

Slicing the dataset using the filter() function

One of the biggest highlights of the tidyverse ecosystem is the pipe operator, %>%, which provides the statement before it as the contextual input for the statement after it. Using the pipe operator gives us better clarity in terms of code structuring, besides saving the need to type multiple repeated contextual statements. Let’s go through an exercise on how to use the pipe operator to slice the iris dataset using the filter() function.
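The readability gain is easiest to see side by side; the following sketch contrasts a nested base-style call with its piped equivalent (the Sepal.Length threshold is an arbitrary illustrative choice):

```r
library(dplyr)

# Without the pipe: nested calls must be read inside-out
head(filter(iris, Sepal.Length > 7), 3)

# With the pipe: each step feeds the next, reading top to bottom
iris %>%
  filter(Sepal.Length > 7) %>%
  head(3)
```

Both versions return the same three rows; the piped form simply states the steps in the order they are performed.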

Exercise 2.02 – filtering using the pipe operator

For this exercise, we have been asked to keep only the setosa species in the iris dataset using the pipe operator and the filter...

Data aggregation with dplyr

Data aggregation refers to a set of techniques that summarizes the dataset at an aggregate level and characterizes the original dataset at a higher level. Unlike data transformation, which operates at the row level for both input and output, data aggregation takes row-level input and produces group-level output.

We have already encountered a few aggregation functions, such as calculating the mean of a column. This section will cover some of the most widely used aggregation functions provided by dplyr. We will start with the count() function, which returns the number of observations/rows for each category of the specified input column.
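For instance, a per-group mean can be computed by combining group_by() with summarize(); the sketch below uses the iris dataset and an illustrative summary column name, mean_sepal_length:

```r
library(dplyr)

# Summarize each species by the mean of one column
iris %>%
  group_by(Species) %>%
  summarize(mean_sepal_length = mean(Sepal.Length))
# One row per species: a 3 x 2 summary of the original 150 rows
```

Note how the 150-row input collapses to one output row per group, which is the defining behavior of aggregation.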

Counting observations using the count() function

The count() function automatically groups the dataset into different categories according to the input argument and returns the number of observations for each category. The input argument could include one or more columns of the dataset. Let’s go through an exercise and apply it to the iris dataset.
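As a quick sketch of the behavior described above (shown here on the built-in iris and mtcars datasets for illustration):

```r
library(dplyr)

# count() groups by the given column(s) and tallies rows per group
iris %>% count(Species)
# Three rows, one per species, each with n = 50

# Counting on multiple columns: one row per observed combination
mtcars %>% count(cyl, gear)
```

Passing more than one column yields one count per distinct combination of values, rather than separate counts per column.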

Exercise 2.08 –...

Data merging with dplyr

In practical data analysis, the information we need is not necessarily confined to one table but is spread across multiple tables. Storing data in separate tables is memory-efficient but not analysis-friendly. Data merging is the process of merging different datasets into one table to facilitate data analysis. When joining two tables, there need to be one or more columns, or keys, that exist in both tables and serve as the common ground for joining.

This section will cover different ways to join tables and analyze them in combination, including inner join, left join, right join, and full join. The following list shows the verbs and their definitions for these four types of joining:

  • inner_join(): Returns common observations in both tables according to the matching key.
  • left_join(): Returns all observations from the left table and matched observations from the right table. Note that in the case of a duplicate key value in the right table, an additional...
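The four verbs can be contrasted on a pair of tiny tables sharing a key column; the tables below are hypothetical and not part of the book's exercises:

```r
library(dplyr)

# Two small illustrative tables sharing the key column "id"
band <- tibble::tibble(id = c(1, 2, 3), name = c("Ann", "Bo", "Cy"))
role <- tibble::tibble(id = c(2, 3, 4), role = c("bass", "drums", "keys"))

inner_join(band, role, by = "id")  # ids 2 and 3 only
left_join(band, role, by = "id")   # all of band; NA role for id 1
right_join(band, role, by = "id")  # all of role; NA name for id 4
full_join(band, role, by = "id")   # ids 1 through 4
```

Whenever a row has no match in the other table, the join fills the missing columns with NA, which is why the left, right, and full joins can return more rows than the inner join.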

Case study – working with the Stack Overflow dataset

This section will cover an exercise to help you practice different data transformation, aggregation, and merging techniques based on the public Stack Overflow dataset, which contains a set of tables related to technical questions and answers posted on the Stack Overflow platform. The supporting raw data has been uploaded to the accompanying GitHub repository of this book. We will download it directly from the source GitHub link using the readr package, another tidyverse offering that provides an easy, fast, and friendly way to read a wide range of data sources, including those from the web.

Exercise 2.11 – working with the Stack Overflow dataset

Let’s begin this exercise:

  1. Download three data sources on questions, tags, and their mapping table from GitHub:
    library(readr)
    df_questions = read_csv("https://raw.githubusercontent.com/PacktPublishing/The-Statistics-and-Machine-Learning-with-R-Workshop...

Summary

In this chapter, we covered essential functions and techniques for data transformation, aggregation, and merging. For data transformation at the row level, we learned about common utility functions such as filter(), mutate(), select(), arrange(), top_n(), and transmute(). For data aggregation, which summarizes the raw dataset into a smaller and more concise summary view, we introduced functions such as count(), group_by(), and summarize(). For data merging, which combines multiple datasets into one, we learned about different joining methods, including inner_join(), left_join(), right_join(), and full_join(). Although more advanced joining functions exist, the essential tools covered here are sufficient for most merging tasks. Finally, we went through a case study based on the Stack Overflow dataset. The skills we learned in this chapter will come in very handy in many data analysis tasks.

In the next chapter, we will cover a more advanced topic...
