You're reading from The Statistics and Machine Learning with R Workshop

Product type: Book
Published in: Oct 2023
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781803240305
Edition: 1st

Author: Liu Peng

Peng Liu is an Assistant Professor of Quantitative Finance (Practice) at Singapore Management University and an adjunct researcher at the National University of Singapore. He holds a Ph.D. in statistics from the National University of Singapore and has ten years of working experience as a data scientist across the banking, technology, and hospitality industries.

Intermediate Data Processing

The previous chapter covered a suite of commonly used functions offered by dplyr for data processing. For example, when characterizing and extracting the statistics of a dataset, we can follow the split-apply-combine procedure using group_by() and summarize(). This chapter continues from the previous one and focuses on intermediate data processing techniques, including transforming categorical and numeric variables and reshaping DataFrames. We will also introduce string manipulation techniques for working with textual data, whose format is fundamentally different from the neatly shaped tables we have been working with so far.

By the end of this chapter, you will be able to perform more advanced data manipulation and extend your data massaging skills to string-based texts, which are fundamental to the field of natural language processing.

In this chapter, we will cover the following topics:

  • Transforming categorical and numeric...

Technical requirements

To complete the exercises in this chapter, you will need to have the following:

  • The latest version of the rebus package, which is 0.1-3 at the time of writing
  • The latest version of the tidytext package, which is 0.3.2 at the time of writing
  • The latest version of the tm package, which is 0.7-8 at the time of writing

All the code and data for this chapter are available at https://github.com/PacktPublishing/The-Statistics-and-Machine-Learning-with-R-Workshop/tree/main/Chapter_3.

Transforming categorical and numeric variables

As covered in the previous chapter, we can use the mutate() function from dplyr to transform existing variables and create new ones. The specific transformation depends on the type of the variable and the resulting shape we would like it to be. For example, we may want to change the value of a categorical variable according to a mapping dictionary, create a new variable based on a combination of filtering conditions of existing variables, or group a numeric variable into different ranges in a new variable. Let us look at these scenarios in turn.
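Before walking through each scenario in detail, here is a minimal sketch of the latter two transformations, creating a new variable from a combination of conditions with case_when() and binning a numeric variable into ranges with cut(). The data and column names here are illustrative assumptions, not the book's dataset:

```r
library(dplyr)

# Hypothetical data for illustration
df <- tibble(
  name = c("Ann", "Bob", "Cat"),
  age  = c(21, 35, 17)
)

df <- df %>%
  mutate(
    # Create a new variable from filtering conditions on existing ones
    is_adult = case_when(
      age >= 18 ~ "adult",
      TRUE      ~ "minor"
    ),
    # Group a numeric variable into ranges; intervals are (0,18], (18,30], (30,100]
    age_group = cut(age,
                    breaks = c(0, 18, 30, 100),
                    labels = c("young", "mid", "senior"))
  )
```

Both functions are used inside mutate(), so the transformed columns are added alongside the existing ones.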

Recoding categorical variables

There are many cases when you would want to recode the values of a variable, such as mapping countries’ short names to the corresponding full names. Let’s create a dummy tibble dataset to illustrate this.

In the following code, we have created a students variable that stores information on age, country, gender, and height. This is a small dummy...
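A sketch of such a tibble and the recode() mapping might look like the following. The specific values are hypothetical stand-ins; the book's actual students data may differ:

```r
library(dplyr)

# Hypothetical stand-in for the students tibble described above
students <- tibble(
  age     = c(20, 22, 21, 23),
  country = c("SG", "US", "SG", "UK"),
  gender  = c("F", "M", "M", "F"),
  height  = c(165, 178, 172, 160)
)

# Map countries' short names to full names with recode();
# each named argument acts as one entry of the mapping dictionary
students <- students %>%
  mutate(country = recode(country,
                          SG = "Singapore",
                          US = "United States",
                          UK = "United Kingdom"))
```

Values not listed in the mapping are left unchanged by default, which makes recode() safe to apply to partially mapped columns.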

Reshaping the DataFrame

A DataFrame that consists of a combination of categorical and numeric columns can be expressed in both wide and long formats. For example, the students DataFrame is considered a long format since all countries are stored in the country column. Depending on the specific purpose of processing, we may want to create a separate column for each unique country in the dataset, which adds more columns to the DataFrame and converts it into a wide format.

Converting between wide and long formats can be achieved via the spread() and gather() functions, both of which are provided by the tidyr package from the tidyverse ecosystem. Let’s see how it works in practice.

Converting from long format into wide format using spread()

There will be times when we’ll want to turn a long-formatted DataFrame into a wide format. The spread() function can be used to convert a categorical column with multiple categories into multiple columns, as specified by the key...
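The round trip between the two formats can be sketched as follows, using hypothetical data (the key column's unique values become the new column names):

```r
library(dplyr)
library(tidyr)

# Long format: one row per (name, country) observation (hypothetical data)
long_df <- tibble(
  name    = c("Ann", "Ann", "Bob", "Bob"),
  country = c("SG", "US", "SG", "US"),
  score   = c(1, 2, 3, 4)
)

# spread() creates one column per unique value of the key column
wide_df <- long_df %>%
  spread(key = country, value = score)
# wide_df now has the columns: name, SG, US

# gather() reverses the operation, collecting the country columns
# back into key-value pairs (excluding the name column)
back_to_long <- wide_df %>%
  gather(key = "country", value = "score", -name)
```

Note that newer versions of tidyr also offer pivot_wider() and pivot_longer() as successors to these two functions, but spread() and gather() remain available.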

Manipulating string data

Character-typed strings are standard in real-life data, such as names and addresses. Analyzing string data requires properly cleaning the raw characters and converting the information embedded in a blob of textual data into a quantifiable numeric summary. For example, we may want to find the matching names of all students that follow a specific pattern.

This section will cover different ways to define patterns via regular expressions to detect, split, and extract string data. Let’s start with the basics of strings.

Creating strings

A string is a character-typed variable represented by a sequence of characters (including punctuation) wrapped in a pair of double quotes (""). Sometimes, a pair of single quotes (') is also used to denote a string, although it is generally recommended to use double quotes unless the characters themselves include double quotes.
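The quoting conventions described above can be illustrated with a short sketch:

```r
# Double quotes are the recommended convention
s1 <- "statistics workshop"

# Single quotes are best reserved for strings that themselves
# contain double quotes
s2 <- 'He said "hello"'

# Alternatively, escape the inner double quotes with a backslash
s3 <- "He said \"hello\""

# writeLines() prints the string contents without the surrounding quotes
writeLines(s3)
#> He said "hello"
```

Both s2 and s3 store exactly the same character sequence; only the source-code notation differs.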

There are multiple ways to create a string. The following exercise introduces...

Working with stringr

The stringr package provides a cohesive set of functions that all start with str_ and are designed to make working with strings as easy as possible.

Let’s start with the basic functions of stringr by replicating the same results from the previous exercise.

Basics of stringr

The str_c() function from the stringr package can concatenate multiple strings and offers functionality similar to paste(). Let's see it in action.

Exercise 3.9 – combining strings using str_c()

In this exercise, we will reproduce the results of Exercise 3.8, this time using str_c():

  1. Concatenate statistics with workshop with a separating space in between:
    > str_c("statistics", "workshop", sep = " ")
    [1] "statistics workshop"

    We can use the sep argument to specify the separator between strings.

  2. Combine a vector of statistics and workshop with course:
    >>> str_c(c("statistics", "...

Introducing regular expressions

A regular expression is a sequence of characters that bears a special meaning and is used for pattern matching in strings. Since the specific meaning of characters in a regular expression requires some memorization and can easily be forgotten if you do not use them often, we will avoid introducing its underlying syntax and focus on intuitive and more human-friendly programming using the rebus package. It is a good companion to stringr and provides utility functions that facilitate string manipulation and make building regular expressions much easier. Remember to install this package via install.packages("rebus") when you use it for the first time.

The rebus package has a special operator called %R% that’s used to concatenate matching conditions. For example, to detect whether a string starts with a particular character, such as s, we could specify the pattern as START %R% "s" and pass it to the pattern argument of the str_detect...
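The pattern described above can be sketched as follows; the student_names vector is a hypothetical example, not data from the book:

```r
library(stringr)
library(rebus)

# Hypothetical vector of names to match against
student_names <- c("sam", "amy", "steve")

# START %R% "s" builds the regular expression "^s", i.e.
# "the string begins with the character s"
str_detect(student_names, pattern = START %R% "s")
#> [1]  TRUE FALSE  TRUE
```

Because %R% simply concatenates matching conditions, longer patterns can be composed the same way, for example START %R% "s" %R% "t" for strings beginning with "st".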

Working with tidy text mining

The tidytext package handles unstructured text by following the tidy data principle, which mandates that data is represented as a structured, rectangular-shaped, and tibble-like object. In the case of text mining, this requires converting a piece of text in a single cell into one token per row in the DataFrame.
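The one-token-per-row conversion is performed by tidytext's unnest_tokens() function. A minimal sketch, using hypothetical sentences:

```r
library(dplyr)
library(tidytext)

# A tiny corpus with one piece of text per row (hypothetical data)
text_df <- tibble(
  doc  = c(1, 2),
  text = c("Statistics with R", "Machine learning with R")
)

# unnest_tokens() splits each cell of the input column into
# one token (word) per row, lowercasing by default
tidy_tokens <- text_df %>%
  unnest_tokens(output = word, input = text)
```

The two input rows expand into seven rows, one per word, with the doc column repeated so each token stays linked to its source document.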

Another commonly used representation for a collection of texts (called a corpus) is the document-term matrix, where each row represents one document (this could be a short sentence or a lengthy article) and each column represents one term (a unique word in the whole corpus, for example). Each cell in the matrix usually contains a representative statistic, such as frequency of occurrence, to indicate the number of times the term appears in the document.

We will dive into both representations and look at how to convert between a document-term matrix and a tidy data format for text mining in the following sections.
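As a preview, the conversion in both directions can be sketched with tidytext's cast_dtm() and tidy() functions, here applied to hypothetical word counts:

```r
library(dplyr)
library(tidytext)
library(tm)

# Token counts in tidy (one-token-per-row) format (hypothetical data)
word_counts <- tibble(
  doc  = c("d1", "d1", "d2"),
  word = c("statistics", "r", "r"),
  n    = c(2, 1, 3)
)

# Tidy format -> document-term matrix (rows: documents, columns: terms,
# cells: frequency of occurrence)
dtm <- word_counts %>%
  cast_dtm(document = doc, term = word, value = n)

# Document-term matrix -> back to tidy format
tidy(dtm)
```

The tidy() call drops cells with zero counts, so the round trip recovers the original non-zero entries of the matrix.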

Converting text into tidy...

Summary

In this chapter, we touched upon several intermediate data processing techniques, ranging from structured tabular data to unstructured textual data. First, we covered how to transform categorical and numeric variables, including recoding categorical variables using recode(), creating new variables using case_when(), and binning numeric variables using cut(). Next, we looked at reshaping a DataFrame, including converting a long-format DataFrame into a wide format using spread() and back again using gather(). We also delved into working with strings, including how to create, convert, and format string data.

In addition, we covered some essential knowledge regarding the stringr package, which provides many helpful utility functions to ease string processing tasks. Common functions include str_c(), str_sub(), str_subset(), str_detect(), str_split(), str_count(), and str_replace(). These functions can be combined to create a powerful and easy-to-understand string processing pipeline...

