Building a Model with R

The last part of this book is about modeling data. It has been a long learning journey so far. We started with the fundamentals of data wrangling, covering the concepts that surround the subject and going through techniques to munge each type of data. In the practical projects, we had the opportunity to wrangle entire datasets and demonstrate a range of transformations. In the previous part, we covered data visualization in depth while going over one of the most complete visualization libraries available.

Now, it is time to put all our knowledge to the test and work on a final project. This will involve end-to-end work, from loading the dataset into RStudio to deploying the final model in a production environment using Shiny, where anyone can interact with the application.

This project will be built during these last two chapters in Part 4. Let’s get to work.

We will be covering the following main topics:

  • Machine learning concepts
  • Understanding...

Technical requirements

The dataset to be used in this project is called Spambase and is from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/spambase).

The code for this chapter can be found in this book’s GitHub repository: https://github.com/PacktPublishing/Data-Wrangling-with-R/tree/main/Part4/Chapter13.

The following libraries will be used in the code:

library(tidyverse)      # data wrangling and visualization
library(patchwork)      # compose graphics side by side
library(skimr)          # descriptive statistics summaries
library(randomForest)   # random forest modeling
library(caret)          # confusion matrix
library(ROCR)           # ROC curve for model performance

Machine learning concepts

Before we move on to the project itself, let's build some background on machine learning concepts. This content is not the main scope of this book; therefore, we will quickly go over a couple of definitions to get us on the same page for the remainder of the book.

A model is a representation of a theory (HAIR Jr. et al., 2019), but it is also defined as a simplification or approximation of reality (Burnham & Anderson, 2002). In other words, modeling data involves finding patterns that can help us explain a response, that is, the most probable outcome for a given observation.

With that said, the model will only reflect the data that it received. For that reason, it is crucial that the input data is clean and representative of the reality we are trying to model. To exemplify this, think about when we see a dataset with too many missing values that are going to be either removed or imputed. Both approaches will certainly have an impact on the...
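As a toy illustration (not from the book), consider a hypothetical numeric feature with missing values. Dropping the NAs and imputing them produce two different pictures of the same variable, and the model will learn whichever one we feed it:

# `x` is a hypothetical feature with two missing values
x <- c(10, 12, NA, 11, 95, NA, 13)

mean(x, na.rm = TRUE)   # the "reality" after removing the NAs: 28.2

# Median imputation fills the gaps but shifts the distribution
x_imp <- ifelse(is.na(x), median(x, na.rm = TRUE), x)
mean(x_imp)             # a different "reality": roughly 23.6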

Understanding the project

When starting a project, we need a purpose, that is, a goal we want to reach at the end. After all, knowing the problem is part of the solution. As Lewis Carroll wrote in his book Alice's Adventures in Wonderland, the Cheshire Cat tells Alice that if she does not know where she wants to go, any path will lead her there.

So, let’s begin by understanding the project, or where we want to go.

The dataset

The input data for this project is the Spambase Data Set (https://tinyurl.com/23xwdcah), which can be found in the UCI datasets repository. See the citation information in the Further reading section at the end of this chapter for more.

It contains 4,601 observations and 57 explanatory variables. Of those, 48 features are floating-point numbers representing the percentage, from 0 to 100, of words in the message that match a given spam-associated word. There are six other variables with special characters such...

Preparing data for modeling in R

We must wrangle the data to prepare it for modeling. Since we know where we want to go at the end of this project, the next step is a matter of finding a way to get there.

The first thing we must do is load the libraries to be used for wrangling and modeling the data. We will use tidyverse to perform data wrangling and visualization, skimr to create a descriptive statistics summary, patchwork, a great library to put graphics side by side, randomForest to create the model, caret to create the confusion matrix, and ROCR to plot the ROC curve of model performance.

To load the dataset, the best option is to pull it directly from the internet, without the need to save it locally on our machine. Just use the read_csv() function and point to the web address where the raw dataset is located, as we've done previously in this book. Here, we are using the trim_ws=TRUE argument to trim any unwanted whitespace and the col_names=headers argument, where...
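A minimal sketch of that load step follows. The raw file URL and the construction of headers are assumptions here: headers should be a character vector holding the 58 column names from the spambase.names documentation file, shown below only as a placeholder.

library(tidyverse)

# Assumed location of the raw file on the UCI repository
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"

# Placeholder for `headers`; the real names come from spambase.names
headers <- c(paste0("word_freq_", 1:48),
             paste0("char_freq_", 1:6),
             "capital_run_length_average",
             "capital_run_length_longest",
             "capital_run_length_total",
             "spam")

spam <- read_csv(url, trim_ws = TRUE, col_names = headers)

After loading, skimr::skim(spam) would produce the descriptive statistics summary mentioned earlier.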

Exploring the data with a few visualizations

We should start the data visualization portion of a project with univariate graphics, such as histograms and boxplots. This is because the former show us the data distribution, indicating the possible statistical tests to be used, while the latter reveal the presence of outliers in the data.

Since there are more than 50 variables in this dataset, we will create a for loop to plot the histograms for all of them. The following code uses the hist() function from base R:

# Histograms
for (var in colnames(spam)[1:57]) {
  hist(unlist(spam[, var]), col = "royalblue",
       main = paste("Histogram of", var),
       xlab = var)
}

Notice that we only looped over columns [1:57] since we know that the last one is the target variable. Next, we will see four graphics, as shown in Figure 13.6:

...
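The boxplots mentioned at the start of this section can be produced with the same looping idea. This is one way to sketch them (not necessarily the book's exact code), splitting each feature by the target column to surface outliers and group differences:

# Boxplots of each feature, split by the spam target
for (var in colnames(spam)[1:57]) {
  boxplot(spam[[var]] ~ spam$spam,
          col = "royalblue",
          main = paste("Boxplot of", var),
          xlab = "spam (0 = no, 1 = yes)", ylab = var)
}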

Selecting the best variables

At this point, selecting the best variables should be smooth since exploring the data has given us the answer we're looking for. When we checked the boxplots and tested which words and characters impact the classification the most, as well as the impact of the uppercase letters, we were already making a variable selection. We should use the variables with the highest difference between the two groups so that it is easier for the algorithm to find a clear separation between them. As we have seen, 23 words maximize that difference, as do the number of uppercase letters and the presence of too many special characters.

In this section, we will take the top_words vector, which gathers the top 23 words with the most impact on the spam classification, as well as the exclamation mark, parenthesis, dollar sign, and hashtag characters and the uppercase variables, and transform the dataset into a seven-variable tibble, with six explanatory variables and...
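One plausible version of that reduction is sketched below. The aggregation of the 23 word frequencies into a single feature and every column name used here are assumptions that depend on how headers was built; this is not necessarily the book's exact construction.

# Speculative sketch: reduce the dataset to six explanatory variables plus the target
spam_for_model <- spam %>%
  transmute(
    top_words_pct = rowSums(across(all_of(top_words))),  # aggregate of the 23 words
    exclamation   = char_freq_exclamation,               # hypothetical column names
    parenthesis   = char_freq_parenthesis,
    dollar_sign   = char_freq_dollar,
    hashtag       = char_freq_hash,
    uppercase     = capital_run_length_total,
    spam          = spam
  )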

Modeling

Training

Now that the new dataset has been created, the next step is to replace 1 with is_spam and 0 with not_spam so that the random forest algorithm treats the target variable as categorical and fits a classification model. We can do this by using the recode() function inside a mutate() call:

# Replace the binary 1 (is_spam) and 0 (not_spam)
spam_for_model <- spam_for_model %>%
  mutate(spam = recode(spam, '1' = 'is_spam', '0' = 'not_spam'))
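A quick frequency count is a simple way to confirm the recoding took effect (an assumed sanity check, not from the book):

# Count the observations in each class after recoding
table(spam_for_model$spam)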

Now, it is time to separate the data into train and test subsets. The train subset is used to present the model with the patterns and the labels associated with them so that it can learn how to classify each observation according to the patterns that occur. The test set is like a school exam, where new data is presented to the trained model so that we can measure how accurate it is, or how much it has learned.
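A minimal sketch of that split, followed by a first fit and evaluation with the libraries loaded earlier, might look like the following. The 75/25 proportion and the seed are assumptions, not necessarily the book's choices:

set.seed(42)   # hypothetical seed, for reproducibility

# The target must be a factor for randomForest to fit a classification model
spam_for_model$spam <- as.factor(spam_for_model$spam)

# Random 75/25 train/test split
idx   <- sample(nrow(spam_for_model), size = floor(0.75 * nrow(spam_for_model)))
train <- spam_for_model[idx, ]
test  <- spam_for_model[-idx, ]

# Fit a random forest, then measure how much it learned on unseen data
rf    <- randomForest(spam ~ ., data = train)
preds <- predict(rf, newdata = test)
caret::confusionMatrix(preds, test$spam)   # accuracy, sensitivity, and more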

As we learned during the...

Summary

In this chapter, we created an end-to-end machine learning project. We started by studying some basic machine learning concepts to get us in sync. Then, we established the main goal of the project: first, we must understand the problem and know where we want to go so that the solution becomes clearer. In this case, our client was a digital marketing company that wanted to reduce the risk of its messages ending up in spam filters, so we had to create a classification model to predict the probability of a message being marked as spam or not spam.

We loaded a dataset from UCI that records words and characters associated with spam messages and their percentage in each email. Then, we studied the data and created some visualizations to learn which elements were more likely to be classified as spam. Out of those, we created a new dataset with just six explanatory variables, reduced from the original 57 columns.

Next, we trained and tested...

Exercises

  1. What are the two types of models in machine learning?
  2. What are the three learning methods of machine learning algorithms?
  3. What is the importance of data wrangling for modeling data?
  4. Before creating a model, what is the split we must do with the dataset?
  5. What are some of the metrics we can use to evaluate a classification model?
  6. Explore the UCI database and choose a dataset to create another model.

Further reading
