Building a Model with R

The last part of this book is about modeling data. It has been a long learning journey so far. We started with the fundamentals of data wrangling, covering the concepts that surround the subject and going through techniques to munge each type of data. In the practical projects, we had the opportunity to wrangle entire datasets and demonstrate a range of transformations. In the previous part, we covered data visualization in depth while going over one of the most complete visualization libraries available.

Now, it is time to put all our knowledge to the test and work on a final project. This will involve end-to-end work, from loading the dataset into RStudio to deploying the final model in a production environment using Shiny, where anyone can interact with the application.

This project will be built during these last two chapters in Part 4. Let’s get to work.

We will be covering the following main topics:

  • Machine learning concepts
  • Understanding...

Technical requirements

The dataset to be used in this project is called Spambase and is from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/spambase).

The code for this chapter can be found in this book’s GitHub repository: https://github.com/PacktPublishing/Data-Wrangling-with-R/tree/main/Part4/Chapter13.

The following libraries will be used in the code:

library(tidyverse)      # data wrangling and visualization
library(patchwork)      # compose graphics side by side
library(skimr)          # descriptive statistics summaries
library(randomForest)   # random forest modeling
library(caret)          # confusion matrix
library(ROCR)           # ROC curve for model performance

Machine learning concepts

Before we move on to the project itself, let's build some background on machine learning concepts. This content is not the main scope of this book; therefore, we will quickly go over a couple of definitions to get us on the same page for the remainder of the book.

A model is a representation of a theory (HAIR Jr. et al., 2019), but it is also defined as a simplification or approximation of reality (Burnham & Anderson, 2002). In other words, modeling data involves finding patterns that can help us explain a response, that is, the most probable outcome for a given observation.

With that said, the model will only reflect the data that it received. For that reason, it is crucial that the input data is clean and representative of the reality we are trying to model. To exemplify this, think about when we see a dataset with too many missing values that are going to be either removed or imputed. Both approaches will certainly have an impact on the...
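As a toy illustration (not from the book), consider a hypothetical numeric feature with missing values. Dropping the NAs and imputing them produce two different pictures of the same variable, and the model will learn whichever one we feed it:

# `x` is a hypothetical feature with two missing values
x <- c(10, 12, NA, 11, 95, NA, 13)

mean(x, na.rm = TRUE)   # the "reality" after removing the NAs: 28.2

# Median imputation fills the gaps but shifts the distribution
x_imp <- ifelse(is.na(x), median(x, na.rm = TRUE), x)
mean(x_imp)             # a different "reality": roughly 23.6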

Understanding the project

When starting a project, we need a purpose, that is, a goal we want to reach at the end. After all, knowing the problem is part of the solution. As Lewis Carroll wrote in his book Alice's Adventures in Wonderland, the Cheshire Cat tells Alice that if she does not know where she wants to go, any path will lead her there.

So, let’s begin by understanding the project, or where we want to go.

The dataset

The input data for this project is the Spambase Data Set (https://tinyurl.com/23xwdcah), which can be found in the UCI datasets repository. See the citation information in the Further reading section at the end of this chapter for more.

It contains 4,601 observations and 57 explanatory variables. Of those, 48 features are floating-point numbers representing the percentage, from 0 to 100, of words in the message that match a given spam-associated word. There are six other variables with special characters such...

Preparing data for modeling in R

We must wrangle the data to prepare it for modeling. Since we know where we want to go at the end of this project, the next step is a matter of finding a way to get there.

The first thing we must do is load the libraries to be used for wrangling and modeling the data. We will use tidyverse to perform data wrangling and visualization, skimr to create a descriptive statistics summary, patchwork, a great library to put graphics side by side, randomForest to create the model, caret to create the confusion matrix, and ROCR to plot the ROC curve of model performance.

To load the dataset, the best option is to pull it directly from the internet, without the need to save it locally on our machine. Just use the read_csv() function and point to the web address where the raw dataset is located, as we've done previously in this book. Here, we are using the trim_ws=TRUE argument to trim any unwanted whitespace and the col_names=headers argument, where...
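A minimal sketch of that load step follows. The raw file URL and the construction of headers are assumptions here: headers should be a character vector holding the 58 column names from the spambase.names documentation file, shown below only as a placeholder.

library(tidyverse)

# Assumed location of the raw file on the UCI repository
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"

# Placeholder for `headers`; the real names come from spambase.names
headers <- c(paste0("word_freq_", 1:48),
             paste0("char_freq_", 1:6),
             "capital_run_length_average",
             "capital_run_length_longest",
             "capital_run_length_total",
             "spam")

spam <- read_csv(url, trim_ws = TRUE, col_names = headers)

After loading, skimr::skim(spam) would produce the descriptive statistics summary mentioned earlier.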

Exploring the data with a few visualizations

We should start the data visualization portion of a project with univariate graphics, such as histograms and boxplots. This is because the former show us the data distribution, indicating the possible statistical tests to be used, while the latter reveal the presence of outliers in the data.

Since there are more than 50 variables in this dataset, we will create a for loop to plot the histograms for all of them. The following code uses the hist() function from base R:

# Histograms
for (var in colnames(spam)[1:57]) {
  hist(unlist(spam[, var]), col = "royalblue",
       main = paste("Histogram of", var),
       xlab = var)
}

Notice that we only looped over columns [1:57] since we know that the last one is the target variable. Next, we will see four graphics, as shown in Figure 13.6:

...
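The boxplots mentioned at the start of this section can be produced with the same looping idea. This is one way to sketch them (not necessarily the book's exact code), splitting each feature by the target column to surface outliers and group differences:

# Boxplots of each feature, split by the spam target
for (var in colnames(spam)[1:57]) {
  boxplot(spam[[var]] ~ spam$spam,
          col = "royalblue",
          main = paste("Boxplot of", var),
          xlab = "spam (0 = no, 1 = yes)", ylab = var)
}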

Selecting the best variables

At this point, selecting the best variables should be smooth since exploring the data has given us the answer we're looking for. When we checked the boxplots and tested which words and characters impact the classification the most, as well as the impact of the uppercase letters, we were already making a variable selection. We should use the variables with the highest difference between the two groups so that it is easier for the algorithm to find a clear separation between them. As we have seen, 23 words maximize that difference, as do the number of uppercase letters and the presence of too many special characters.

In this section, we will take the top_words vector, which gathers the top 23 words with the most impact on the spam classification, as well as the exclamation mark, parenthesis, dollar sign, and hashtag characters and the uppercase variables, and transform the dataset into a seven-variable tibble, with six explanatory variables and...
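One plausible version of that reduction is sketched below. The aggregation of the 23 word frequencies into a single feature and every column name used here are assumptions that depend on how headers was built; this is not necessarily the book's exact construction.

# Speculative sketch: reduce the dataset to six explanatory variables plus the target
spam_for_model <- spam %>%
  transmute(
    top_words_pct = rowSums(across(all_of(top_words))),  # aggregate of the 23 words
    exclamation   = char_freq_exclamation,               # hypothetical column names
    parenthesis   = char_freq_parenthesis,
    dollar_sign   = char_freq_dollar,
    hashtag       = char_freq_hash,
    uppercase     = capital_run_length_total,
    spam          = spam
  )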

Modeling

Training

Now that the new dataset has been created, the next step is to replace 1 with is_spam and 0 with not_spam so that the random forest algorithm treats the target variable as categorical and fits a classification model. We can do this by using the recode() function inside a mutate() call:

# Replace the binary 1 (is_spam) and 0 (not_spam)
spam_for_model <- spam_for_model %>%
  mutate(spam = recode(spam, '1' = 'is_spam', '0' = 'not_spam'))
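A quick frequency count is a simple way to confirm the recoding took effect (an assumed sanity check, not from the book):

# Count the observations in each class after recoding
table(spam_for_model$spam)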

Now, it is time to separate the data into train and test subsets. The train subset is used to present the model with the patterns and the labels associated with them so that it can learn how to classify each observation according to the patterns that occur. The test set is like a school exam, where new data is presented to the trained model so that we can measure how accurate it is, or how much it has learned.
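A minimal sketch of that split, followed by a first fit and evaluation with the libraries loaded earlier, might look like the following. The 75/25 proportion and the seed are assumptions, not necessarily the book's choices:

set.seed(42)   # hypothetical seed, for reproducibility

# The target must be a factor for randomForest to fit a classification model
spam_for_model$spam <- as.factor(spam_for_model$spam)

# Random 75/25 train/test split
idx   <- sample(nrow(spam_for_model), size = floor(0.75 * nrow(spam_for_model)))
train <- spam_for_model[idx, ]
test  <- spam_for_model[-idx, ]

# Fit a random forest, then measure how much it learned on unseen data
rf    <- randomForest(spam ~ ., data = train)
preds <- predict(rf, newdata = test)
caret::confusionMatrix(preds, test$spam)   # accuracy, sensitivity, and more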

As we learned during the...

Summary

In this chapter, we created an end-to-end machine learning project. We started by studying some basic machine learning concepts to get us in sync. Then, we established the main goal of the project: first, we must understand the problem and know where we want to go so that the solution becomes clearer. In this case, our client was a digital marketing company that wanted to reduce the risk of its messages ending up in spam filters, so we had to create a classification model to predict the probability of a message being marked as spam or not spam.

We loaded a dataset from UCI that records words and characters associated with spam messages and their percentage in each email. Then, we studied the data and created some visualizations to learn which elements were more likely to be classified as spam. Out of those, we created a new dataset with just six explanatory variables, reduced from the original 57 columns.

Next, we trained and tested...

Exercises

  1. What are the two types of models in machine learning?
  2. What are the three learning methods of machine learning algorithms?
  3. What is the importance of data wrangling for modeling data?
  4. Before creating a model, what is the split we must do with the dataset?
  5. What are some of the metrics we can use to evaluate a classification model?
  6. Explore the UCI database and choose a dataset to create another model.

Further reading
