You're reading from The Statistics and Machine Learning with R Workshop

Product type: Book
Published in: Oct 2023
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781803240305
Edition: 1st

Author: Liu Peng

Peng Liu is an Assistant Professor of Quantitative Finance (Practice) at Singapore Management University and an adjunct researcher at the National University of Singapore. He holds a Ph.D. in statistics from the National University of Singapore and has ten years of working experience as a data scientist across the banking, technology, and hospitality industries.

Intermediate Data Processing

The previous chapter covered a suite of commonly used functions offered by dplyr for data processing. For example, when characterizing and extracting the statistics of a dataset, we can follow the split-apply-combine procedure using group_by() and summarize(). This chapter continues from the previous one and focuses on intermediate data processing techniques, including transforming categorical and numeric variables and reshaping DataFrames. We will also introduce string manipulation techniques for working with textual data, whose format is fundamentally different from the neatly shaped tables we have been working with so far.

By the end of this chapter, you will be able to perform more advanced data manipulation and extend your data massaging skills to string-based texts, which are fundamental to the field of natural language processing.

In this chapter, we will cover the following topics:

  • Transforming categorical and numeric...

Technical requirements

To complete the exercises in this chapter, you will need to have the following:

  • The latest version of the rebus package, which is 0.1-3 at the time of writing
  • The latest version of the tidytext package, which is 0.3.2 at the time of writing
  • The latest version of the tm package, which is 0.7-8 at the time of writing

All the code and data for this chapter are available at https://github.com/PacktPublishing/The-Statistics-and-Machine-Learning-with-R-Workshop/tree/main/Chapter_3.

Transforming categorical and numeric variables

As covered in the previous chapter, we can use the mutate() function from dplyr to transform existing variables and create new ones. The specific transformation depends on the type of the variable and the resulting shape we would like it to be. For example, we may want to change the value of a categorical variable according to a mapping dictionary, create a new variable based on a combination of filtering conditions of existing variables, or group a numeric variable into different ranges in a new variable. Let us look at these scenarios in turn.
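Before walking through each scenario in detail, here is a minimal sketch of the latter two transformations, creating a new variable from a combination of conditions with case_when() and binning a numeric variable into ranges with cut(). The data and column names here are illustrative assumptions, not the book's dataset:

```r
library(dplyr)

# Hypothetical data for illustration
df <- tibble(
  name = c("Ann", "Bob", "Cat"),
  age  = c(21, 35, 17)
)

df <- df %>%
  mutate(
    # Create a new variable from filtering conditions on existing ones
    is_adult = case_when(
      age >= 18 ~ "adult",
      TRUE      ~ "minor"
    ),
    # Group a numeric variable into ranges; intervals are (0,18], (18,30], (30,100]
    age_group = cut(age,
                    breaks = c(0, 18, 30, 100),
                    labels = c("young", "mid", "senior"))
  )
```

Both functions are used inside mutate(), so the transformed columns are added alongside the existing ones.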

Recoding categorical variables

There are many cases when you would want to recode the values of a variable, such as mapping countries’ short names to the corresponding full names. Let’s create a dummy tibble dataset to illustrate this.

In the following code, we have created a students variable that stores information on age, country, gender, and height. This is a small dummy...
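A sketch of such a tibble and the recode() mapping might look like the following. The specific values are hypothetical stand-ins; the book's actual students data may differ:

```r
library(dplyr)

# Hypothetical stand-in for the students tibble described above
students <- tibble(
  age     = c(20, 22, 21, 23),
  country = c("SG", "US", "SG", "UK"),
  gender  = c("F", "M", "M", "F"),
  height  = c(165, 178, 172, 160)
)

# Map countries' short names to full names with recode();
# each named argument acts as one entry of the mapping dictionary
students <- students %>%
  mutate(country = recode(country,
                          SG = "Singapore",
                          US = "United States",
                          UK = "United Kingdom"))
```

Values not listed in the mapping are left unchanged by default, which makes recode() safe to apply to partially mapped columns.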

Reshaping the DataFrame

A DataFrame that consists of a combination of categorical and numeric columns can be expressed in both wide and long formats. For example, the students DataFrame is considered a long format since all countries are stored in the country column. Depending on the specific purpose of processing, we may want to create a separate column for each unique country in the dataset, which adds more columns to the DataFrame and converts it into a wide format.

Converting between wide and long formats can be achieved via the spread() and gather() functions, both of which are provided by the tidyr package from the tidyverse ecosystem. Let’s see how it works in practice.

Converting from long format into wide format using spread()

There will be times when we’ll want to turn a long-formatted DataFrame into a wide format. The spread() function can be used to convert a categorical column with multiple categories into multiple columns, as specified by the key...
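The round trip between the two formats can be sketched as follows, using hypothetical data (the key column's unique values become the new column names):

```r
library(dplyr)
library(tidyr)

# Long format: one row per (name, country) observation (hypothetical data)
long_df <- tibble(
  name    = c("Ann", "Ann", "Bob", "Bob"),
  country = c("SG", "US", "SG", "US"),
  score   = c(1, 2, 3, 4)
)

# spread() creates one column per unique value of the key column
wide_df <- long_df %>%
  spread(key = country, value = score)
# wide_df now has the columns: name, SG, US

# gather() reverses the operation, collecting the country columns
# back into key-value pairs (excluding the name column)
back_to_long <- wide_df %>%
  gather(key = "country", value = "score", -name)
```

Note that newer versions of tidyr also offer pivot_wider() and pivot_longer() as successors to these two functions, but spread() and gather() remain available.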

Manipulating string data

Character-typed strings are standard in real-life data, such as names and addresses. Analyzing string data requires properly cleaning the raw characters and converting the information embedded in a blob of textual data into a quantifiable numeric summary. For example, we may want to find the matching names of all students that follow a specific pattern.

This section will cover different ways to define patterns via regular expressions to detect, split, and extract string data. Let’s start with the basics of strings.

Creating strings

A string is a character-typed variable represented by a sequence of characters (including punctuation) wrapped in a pair of double quotes (""). Sometimes, a pair of single quotes (') is also used to denote a string, although it is generally recommended to use double quotes unless the characters themselves include double quotes.
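The quoting conventions described above can be illustrated with a short sketch:

```r
# Double quotes are the recommended convention
s1 <- "statistics workshop"

# Single quotes are best reserved for strings that themselves
# contain double quotes
s2 <- 'He said "hello"'

# Alternatively, escape the inner double quotes with a backslash
s3 <- "He said \"hello\""

# writeLines() prints the string contents without the surrounding quotes
writeLines(s3)
#> He said "hello"
```

Both s2 and s3 store exactly the same character sequence; only the source-code notation differs.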

There are multiple ways to create a string. The following exercise introduces...

Working with stringr

The stringr package provides a cohesive set of functions that all start with str_ and are designed to make working with strings as easy as possible.

Let’s start with the basic functions of stringr by replicating the same results from the previous exercise.

Basics of stringr

The str_c() function from the stringr package can concatenate multiple strings and offers functionality similar to paste(). Let's see it in action.

Exercise 3.9 – combining strings using str_c()

In this exercise, we will reproduce the results of Exercise 3.8, this time using str_c():

  1. Concatenate statistics with workshop with a separating space in between:
    > str_c("statistics", "workshop", sep = " ")
    [1] "statistics workshop"

    We can use the sep argument to specify the separator between strings.

  2. Combine a vector of statistics and workshop with course:
    >>> str_c(c("statistics", "...

Introducing regular expressions

A regular expression is a sequence of characters that bears a special meaning and is used for pattern matching in strings. Since the specific meaning of characters in a regular expression requires some memorization and can easily be forgotten if you do not use them often, we will avoid introducing its underlying syntax and focus on intuitive and more human-friendly programming using the rebus package. It is a good companion to stringr and provides utility functions that facilitate string manipulation and make building regular expressions much easier. Remember to install this package via install.packages("rebus") when you use it for the first time.

The rebus package has a special operator called %R% that’s used to concatenate matching conditions. For example, to detect whether a string starts with a particular character, such as s, we could specify the pattern as START %R% "s" and pass it to the pattern argument of the str_detect...
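The pattern described above can be sketched as follows; the student_names vector is a hypothetical example, not data from the book:

```r
library(stringr)
library(rebus)

# Hypothetical vector of names to match against
student_names <- c("sam", "amy", "steve")

# START %R% "s" builds the regular expression "^s", i.e.
# "the string begins with the character s"
str_detect(student_names, pattern = START %R% "s")
#> [1]  TRUE FALSE  TRUE
```

Because %R% simply concatenates matching conditions, longer patterns can be composed the same way, for example START %R% "s" %R% "t" for strings beginning with "st".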

Working with tidy text mining

The tidytext package handles unstructured text by following the tidy data principle, which mandates that data is represented as a structured, rectangular-shaped, and tibble-like object. In the case of text mining, this requires converting a piece of text in a single cell into one token per row in the DataFrame.
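The one-token-per-row conversion is performed by tidytext's unnest_tokens() function. A minimal sketch, using hypothetical sentences:

```r
library(dplyr)
library(tidytext)

# A tiny corpus with one piece of text per row (hypothetical data)
text_df <- tibble(
  doc  = c(1, 2),
  text = c("Statistics with R", "Machine learning with R")
)

# unnest_tokens() splits each cell of the input column into
# one token (word) per row, lowercasing by default
tidy_tokens <- text_df %>%
  unnest_tokens(output = word, input = text)
```

The two input rows expand into seven rows, one per word, with the doc column repeated so each token stays linked to its source document.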

Another commonly used representation for a collection of texts (called a corpus) is the document-term matrix, where each row represents one document (this could be a short sentence or a lengthy article) and each column represents one term (a unique word in the whole corpus, for example). Each cell in the matrix usually contains a representative statistic, such as frequency of occurrence, to indicate the number of times the term appears in the document.

We will dive into both representations and look at how to convert between a document-term matrix and a tidy data format for text mining in the following sections.
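As a preview, the conversion in both directions can be sketched with tidytext's cast_dtm() and tidy() functions, here applied to hypothetical word counts:

```r
library(dplyr)
library(tidytext)
library(tm)

# Token counts in tidy (one-token-per-row) format (hypothetical data)
word_counts <- tibble(
  doc  = c("d1", "d1", "d2"),
  word = c("statistics", "r", "r"),
  n    = c(2, 1, 3)
)

# Tidy format -> document-term matrix (rows: documents, columns: terms,
# cells: frequency of occurrence)
dtm <- word_counts %>%
  cast_dtm(document = doc, term = word, value = n)

# Document-term matrix -> back to tidy format
tidy(dtm)
```

The tidy() call drops cells with zero counts, so the round trip recovers the original non-zero entries of the matrix.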

Converting text into tidy...

Summary

In this chapter, we touched upon several intermediate data processing techniques, ranging from structured tabular data to unstructured textual data. First, we covered how to transform categorical and numeric variables, including recoding categorical variables using recode(), creating new variables using case_when(), and binning numeric variables using cut(). Next, we looked at reshaping a DataFrame, including converting a long-format DataFrame into a wide format using spread() and back again using gather(). We also delved into working with strings, including how to create, convert, and format string data.

In addition, we covered some essential knowledge regarding the stringr package, which provides many helpful utility functions to ease string processing tasks. Common functions include str_c(), str_sub(), str_subset(), str_detect(), str_split(), str_count(), and str_replace(). These functions can be combined to create a powerful and easy-to-understand string processing pipeline...

