
You're reading from  Advanced Analytics with R and Tableau

Product type: Book
Published in: Aug 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781786460110
Edition: 1st Edition
Authors (3):
Ruben Oliva Ramos

Ruben Oliva Ramos is a computer systems engineer from Tecnologico de Leon Institute, with a master's degree in computer and electronic systems engineering and a specialization in teleinformatics and networking from the University of Salle Bajio in Leon, Guanajuato, Mexico. He has more than 5 years of experience of developing web applications to control and monitor devices connected with Arduino and Raspberry Pi, using web frameworks and cloud services to build the Internet of Things applications. He is a mechatronics teacher at the University of Salle Bajio and teaches students of the master's degree in design and engineering of mechatronics systems. Ruben also works at Centro de Bachillerato Tecnologico Industrial 225 teaching subjects such as electronics, robotics and control, automation, and microcontrollers. He is a consultant and developer for projects in areas such as monitoring systems and datalogger data using technologies (such as Android, iOS, HTML5, and ASP.NET), databases (such as SQlite, MongoDB, and MySQL), web servers, hardware programming, and control and monitor systems for data acquisition and programming.

Jen Stirrup

Jen Stirrup is a data strategist and technologist, a Microsoft Most Valuable Professional (MVP), and a Microsoft Regional Director, a tech community advocate, a public speaker and blogger, a published author, and a keynote speaker. Jen is the founder of a boutique consultancy based in the UK, Data Relish, which focuses on delivering successful business intelligence and artificial intelligence solutions that add real value to customers worldwide. She has featured on the BBC as a guest expert on topics relating to data.

Chapter 5. Classifying Data with Tableau

In this chapter, we will look at ways to perform classification using R and to visualize the results in Tableau. Classification is one of the most important tasks in analytics today. By the end of this chapter, you will have built a decision tree using classification algorithms while keeping a business-oriented focus on the question being asked.

Business understanding


When we are modeling data, it is crucial to keep the original business objectives in mind. These business objectives will direct the subsequent work in the data understanding, preparation and modeling steps, and the final evaluation and selection (after revisiting earlier steps if necessary) of a classification model or models.

At later stages, this will help to streamline the project because we will be able to keep the model's performance in line with the original requirement, while retaining a focus on ensuring a return on investment from the project.

The main business objective is to identify individuals who are higher earners, so that they can be targeted by a marketing campaign. For this purpose, we will investigate the data mining of demographic data in order to create a classification model in R. The model will be able to accurately determine whether individuals earn a salary that is above or below $50K per annum. The datasets used in this chapter were taken from...

Understanding the data


We will use Tableau to look at data preparation and data quality. Though we could also do these activities in R, we will use Tableau since it is a good way of seeing data quality issues and capturing them easily. We can also see problematic issues such as outliers or missing values.

Data preparation

When confronted with many variables, analysts often start by building a decision tree and then feed the variables that the decision tree algorithm has selected into other methods that suffer from the complexity of many variables, such as neural networks. However, because a decision tree splits on one variable at a time, it tends to perform worse when the classes are separated by a boundary that does not line up with the individual variables.

In this section, we will use Tableau as a visual data preparation tool in order to prepare the data for further analysis. Here is a summary of some of the things we will explore:

  • Looking at columns that do not add any value to the model

  • Columns that have so many missing categorical values that they do not predict the outcome reliably

  • Review...
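The first two checks in the list can also be sketched in R. The snippet below is only a minimal illustration on a small made-up data frame: the column names, the values, and the 50% threshold are assumptions for the example and are not taken from the chapter's dataset. It flags columns that hold a single constant value and columns where a large share of the values is missing.

```r
# Toy data frame; the column names and values are hypothetical.
people <- data.frame(
  age = c(39, 50, 38, 53, 28),
  country = rep("United-States", 5),
  occupation = c(NA, "Exec-managerial", NA, NA, "Prof-specialty"),
  stringsAsFactors = FALSE
)

# Share of missing values in each column.
missing_share <- colMeans(is.na(people))

# Columns with at most one distinct non-missing value add nothing to a model.
constant_cols <- names(people)[
  sapply(people, function(x) length(unique(na.omit(x))) <= 1)
]

# Columns where more than half of the values are missing (assumed threshold).
sparse_cols <- names(missing_share)[missing_share > 0.5]

print(missing_share)
print(constant_cols)
print(sparse_cols)
```

Columns flagged this way are candidates for review rather than automatic removal; whether to drop them still depends on the business question.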

Modeling in R


In this example, we will use the rpart package, which is used to build a decision tree. The tree with the minimum prediction error is selected. After that, the tree is applied to make predictions for unlabeled data with the predict function.

One way to call rpart is to give it a formula listing the variables and see what happens. Although we have discussed missing values, rpart has built-in handling for them (it uses surrogate splits), so rows with some missing predictor values do not have to be removed beforehand. So let's dive in, and look at the code.

Firstly, we need to call the libraries that we need:

library(rpart) 
library(rpart.plot)
library(caret)
library(e1071)
library(arules)

Next, let's load in the data, which will be in the AdultUCI variable:

data("AdultUCI")
head(AdultUCI)  # preview the first rows rather than printing the whole dataset

## Use 80% of the rows as the training sample
sample_size <- floor(0.80 * nrow(AdultUCI))

## Set the seed to make your partition reproducible
set.seed(123)

## Randomly pick the row indices that form the training set
training.indicator <- sample(seq_len(nrow(AdultUCI)), size = sample_size)

# Set up the training and test sets...
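The remaining steps described at the start of this section (split the data, fit a tree, keep the tree with the lowest prediction error, and predict unseen rows) might look roughly like the sketch below. It is not the chapter's own code: it uses the built-in iris data as a stand-in for AdultUCI, and the names train, test, fit, and pruned are chosen only for illustration.

```r
library(rpart)

set.seed(123)

# iris is a stand-in dataset (an assumption); the chapter itself uses AdultUCI.
n <- nrow(iris)
train_idx <- sample(seq_len(n), size = floor(0.80 * n))
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

# Fit a classification tree on the training rows.
fit <- rpart(Species ~ ., data = train, method = "class")

# Keep the subtree whose cross-validated error (xerror) is lowest.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)

# Predict the class of the held-out rows and compare with the true labels.
pred <- predict(pruned, newdata = test, type = "class")
confusion <- table(predicted = pred, actual = test$Species)
accuracy <- mean(pred == test$Species)

print(confusion)
print(accuracy)
```

The held-out test rows play the role of the unlabeled data here, which makes it possible to check the predictions against known answers.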

Model deployment


Now that we have created our model, we can reuse it in Tableau. This model will work in Tableau as long as you have Rserve running. You will also need to have the relevant packages installed, as per the script. In particular, the rpart package is the workhorse of this example and must be installed, because the script is self-contained: it loads the library, trains the model, and then uses the model to make predictions within the same calculation.

There are many ways to deploy your model for future use, and this part of the process corresponds to the deployment phase of the CRISP-DM methodology. Here are a few ways:

  • You can go through the model fitting inside R using RStudio or another IDE and save it. Then, you could simply load the model into Tableau or you can save it to a file directly from within Tableau. The advantage of doing it in this way is that you can reuse your R model in other packages as well. The downside is that you will need to switch between R and Tableau, and then back again.

  • If you...
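The first option, fitting the model in R and saving it for later reuse, can be sketched with base R's saveRDS and readRDS. This is an illustration under assumptions rather than the chapter's deployment script: iris stands in for the real data, and the temporary file path is a placeholder for a location that the R session behind Rserve could read.

```r
library(rpart)

# Fit a small tree on a stand-in dataset (iris is an assumption).
model <- rpart(Species ~ ., data = iris, method = "class")

# Save the fitted model to a file. tempfile() is used only for the example;
# in practice this would be a path that the Rserve process can access.
model_path <- tempfile(fileext = ".rds")
saveRDS(model, model_path)

# Later, possibly in a different R session, load the model and reuse it.
restored <- readRDS(model_path)
new_rows <- iris[c(1, 51, 101), ]
restored_pred <- predict(restored, newdata = new_rows, type = "class")
original_pred <- predict(model, newdata = new_rows, type = "class")

print(restored_pred)
```

Because the model is stored as an ordinary R object, the same file can be loaded by any other R process, which is the reuse advantage mentioned in the first bullet.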

Decision trees in Tableau using R


When the data has a lot of features that interact in complicated non-linear ways, it is hard to find a global regression model, that is, a single predictive formula that holds over the entire dataset. An alternative approach is to partition the space into smaller regions, then into sub-partitions (recursive partitioning) until each chunk can be explained with a simple model.

There are two main types of decision trees:

  • Classification trees: Predicted outcome is the class the data belongs to

  • Regression trees: Predicted outcome is a continuous variable, for example, a real number such as the price of a commodity
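As a brief illustration of the two types, rpart can fit either kind of tree depending on its method argument. The sketch below uses the built-in iris and mtcars datasets purely as stand-ins; they are not part of this chapter's example.

```r
library(rpart)

# Classification tree: the outcome (Species) is a class label.
class_tree <- rpart(Species ~ ., data = iris, method = "class")

# Regression tree: the outcome (mpg) is a continuous number.
reg_tree <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova")

# A classification tree predicts classes; a regression tree predicts numbers.
class_pred <- predict(class_tree, newdata = iris[1:3, ], type = "class")
reg_pred <- predict(reg_tree, newdata = mtcars[1:3, ])

print(class_pred)
print(reg_pred)
```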

There are many ensemble machine learning methods that take advantage of decision trees. Perhaps the best known is the Random Forest classifier that constructs multiple decision trees and outputs the class that corresponds to the mode of the classes output by individual trees.
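The voting idea can be sketched by hand with rpart. Note that this is a simplified illustration and not a full Random Forest: each tree is trained on a bootstrap sample, but the random selection of features at each split is left out. The iris data and the choice of 25 trees are assumptions made for the example.

```r
library(rpart)

set.seed(42)
n_trees <- 25  # assumed ensemble size

# Train each tree on a bootstrap sample (rows drawn with replacement).
trees <- lapply(seq_len(n_trees), function(i) {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]
  rpart(Species ~ ., data = boot, method = "class")
})

# Collect every tree's predicted class: one row per observation, one column per tree.
votes <- sapply(trees, function(tr) {
  as.character(predict(tr, newdata = iris, type = "class"))
})

# The ensemble prediction is the mode (most frequent class) of each row of votes.
majority <- apply(votes, 1, function(v) names(which.max(table(v))))

ensemble_accuracy <- mean(majority == iris$Species)
print(ensemble_accuracy)
```

The accuracy printed here is measured on the same rows the trees were trained on, so it is optimistic; a held-out test set would give a fairer estimate.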

Bayesian methods


Suppose I claim that I have a pair of magic rainbow socks. I allege that whenever I wear these special socks, I gain the ability to predict the outcome of coin tosses, using fair coins, better than chance would dictate. Putting my claim to the test, you toss a coin 30 times, and I correctly predict the outcome 20 times. Using a directional hypothesis with the binomial test, the null hypothesis would be rejected at alpha-level 0.05. Would you invest in my special socks?

Why not? If it's because you require a larger burden of proof for absurd claims, I don't blame you. As a grandparent of Bayesian analysis, Pierre-Simon Laplace (who independently discovered the theorem that bears Thomas Bayes' name) once said: "The weight of evidence for an extraordinary claim must be proportioned to its strangeness." Our prior belief in my absurd hypothesis is so small that it would take much stronger evidence to convince the skeptical investor, let alone the scientific community.
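Both halves of the sock argument can be checked with a few lines of R. The first part reproduces the one-sided binomial test for 20 correct predictions out of 30. The second part is a rough sketch of the skeptic's reasoning, and its numbers are assumptions made only for illustration: a prior probability of one in a million that the socks work, and a success rate of 2/3 if they do.

```r
# One-sided binomial test: 20 correct calls out of 30 tosses of a fair coin.
test_result <- binom.test(20, 30, p = 0.5, alternative = "greater")
p_value <- test_result$p.value
print(p_value)  # just under 0.05, so the null hypothesis is rejected at alpha = 0.05

# Skeptical Bayesian update (all of the numbers below are assumptions).
prior_magic <- 1e-6        # assumed prior belief that the socks work
success_if_magic <- 2 / 3  # assumed hit rate if the socks really work

likelihood_magic <- dbinom(20, 30, prob = success_if_magic)
likelihood_chance <- dbinom(20, 30, prob = 0.5)

# Bayes' theorem: posterior probability that the socks work, given the data.
posterior_magic <- (prior_magic * likelihood_magic) /
  (prior_magic * likelihood_magic + (1 - prior_magic) * likelihood_chance)
print(posterior_magic)
```

Under these assumed numbers the data favor the sock hypothesis only a few times over chance, so the posterior probability stays tiny even though the significance test rejects the null.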

Unfortunately...

Graphs


A graph is a type of data structure capable of handling networks. Graphs are widely used across various domains such as the following:

  • Transportation: To find the shortest routes to travel between two places

  • Communication-signaling networks: To optimize the network of inter-connected computers and systems

  • Understanding relationships: To build relationship trees across families or organizations

  • Hydrology: To perform flow regime simulation analysis of various fluids

Terminology and representations

A graph (G) is a network of vertices (V) interconnected using a set of edges (E). Let |V| represent the count of vertices and |E| represent the count of edges. The value of |E| lies in the range of 0 to |V|^2 - |V|; the upper bound is reached in a directed graph without self-loops when every ordered pair of distinct vertices is connected. Based on the direction of their edges, graphs are classified as directed or undirected. In directed graphs, the edges are directed from one vertex towards the other, whereas in undirected graphs, the edges have no direction, so each edge connects its two vertices in both directions. An...
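One common way to represent such a graph is an adjacency matrix. The sketch below is a small hand-made example, with vertex names and edges invented for illustration; it assumes a directed graph without self-loops, counts |V| and |E|, and compares |E| with the |V|^2 - |V| upper bound.

```r
# Hypothetical directed graph with four vertices and no self-loops.
vertices <- c("A", "B", "C", "D")
adj <- matrix(0L, nrow = 4, ncol = 4, dimnames = list(vertices, vertices))

# adj[from, to] = 1 means there is an edge directed from 'from' to 'to'.
adj["A", "B"] <- 1L
adj["B", "C"] <- 1L
adj["C", "A"] <- 1L
adj["C", "D"] <- 1L

n_vertices <- length(vertices)            # |V|
n_edges <- sum(adj)                       # |E|
max_edges <- n_vertices^2 - n_vertices    # upper bound on |E|

# In an undirected graph the matrix would equal its own transpose.
is_undirected <- all(adj == t(adj))

print(c(V = n_vertices, E = n_edges, max_E = max_edges))
print(is_undirected)
```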

Summary


Although most introductory data analysis texts don't even broach the topic of Bayesian methods, you, dear reader, are versed enough in this matter to start applying these techniques to real problems.

We discovered that Bayesian methods could, at least for the models in this chapter, not only allow us to answer the same kinds of questions we might use the binomial test, the one-sample t-test, and the independent-samples t-test for, but also provide a much richer and more intuitive depiction of the uncertainty in our estimates. If these approaches interest you, I urge you to learn more about how to extend them to supersede other NHST tests. I also urge you to learn more about the mathematics behind MCMC. As with the last chapter, we covered much ground here. If you made it through, congratulations! This concludes the unit on confirmatory data analysis and inferential statistics. In the next unit, we will be less concerned with estimating parameters, and more interested in prediction. Last one there...
