In this chapter, we will look at ways to perform classification using R and visualize the results in Tableau. Classification is one of the most important tasks in analytics today. By the end of this chapter, you will have built a decision tree, while keeping a business-oriented focus on the question that the classification algorithm is meant to answer.
You're reading from Advanced Analytics with R and Tableau
When we are modeling data, it is crucial to keep the original business objectives in mind. These business objectives will direct the subsequent work in the data understanding, preparation and modeling steps, and the final evaluation and selection (after revisiting earlier steps if necessary) of a classification model or models.
At later stages, this will help to streamline the project because we will be able to keep the model's performance in line with the original requirement, while retaining a focus on ensuring a return on investment from the project.
The main business objective is to identify individuals who are higher earners, so that they can be targeted by a marketing campaign. For this purpose, we will mine demographic data in order to create a classification model in R. The model should accurately determine whether an individual earns a salary that is above or below $50K per annum. The datasets used in this chapter were taken from...
We will use Tableau to look at data preparation and data quality. Though we could also do these activities in R, we will use Tableau because it makes data quality issues, such as outliers and missing values, easy to see and capture.
When confronted with many variables, analysts often start by building a decision tree, and then feed the variables that the tree algorithm has selected into methods that suffer from the complexity of many variables, such as neural networks. Bear in mind, however, that decision tree splits are axis-parallel, considering one variable at a time, so a tree may need many splits to approximate a decision boundary that runs diagonally across several features.
In this section, we will use Tableau as a visual data preparation tool in order to prepare the data for further analysis. Here is a summary of some of the things we will explore:
Looking at columns that do not add any value to the model
Looking at columns that have so many missing categorical values that they cannot predict the outcome reliably
Review...
In this example, we will use the rpart package, which is used to build a decision tree. The tree with the minimum prediction error is selected. After that, the tree is applied to make predictions for unlabeled data with the predict function.
One way to call rpart is to give it a list of variables and see what happens. Although we have discussed missing values, rpart has built-in handling for missing values. So let's dive in and look at the code.
Firstly, we need to call the libraries that we need:
library(rpart)
library(rpart.plot)
library(caret)
library(e1071)
library(arules)
Next, let's load in the data, which will be in the AdultUCI variable:
data("AdultUCI")
head(AdultUCI)   # preview the first few rows rather than printing the whole dataset
## 80% of the sample size
sample_size <- floor(0.80 * nrow(AdultUCI))
## Set the seed to make your partition reproducible
set.seed(123)
## Set a variable to hold the training row indices
training.indicator <- sample(seq_len(nrow(AdultUCI)), size = sample_size)
# Set up the training and test sets...
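The listing above stops short of actually building the training and test sets. A minimal sketch of how the split and model fit might continue is shown below; the variable names training, test, and fit are our own, and the formula assumes the income column of AdultUCI is the target:

```r
library(rpart)
library(arules)   # provides the AdultUCI dataset

data("AdultUCI")
# Keep only the labeled rows: income is NA for part of this dataset
AdultUCI <- AdultUCI[!is.na(AdultUCI$income), ]

sample_size <- floor(0.80 * nrow(AdultUCI))
set.seed(123)
training.indicator <- sample(seq_len(nrow(AdultUCI)), size = sample_size)

training <- AdultUCI[training.indicator, ]   # 80% of the labeled rows
test     <- AdultUCI[-training.indicator, ]  # the remaining 20%

# Fit a classification tree predicting income from all other columns
fit <- rpart(income ~ ., data = training, method = "class")

# Apply the tree to the held-out test set
predictions <- predict(fit, newdata = test, type = "class")
```

rpart's surrogate splits let the tree cope with missing values in the predictors, which is why no further imputation is needed here.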
Now that we have created our model, we can reuse it in Tableau. This model will just work in Tableau, as long as you have Rserve running. You will also need to have the relevant packages installed, as per the script. In particular, the rpart package is the workhorse of this example, and it must be installed on the machine running Rserve, because the calculated field is self-contained: it loads the library, trains the model, and then uses the model to make predictions within the same calculation.
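As an illustration of what such a self-contained calculated field might look like, here is a sketch using Tableau's SCRIPT_STR function; the field names [Income], [Age], and [Hours Per Week] are placeholders of our own, not columns from the script above:

```
SCRIPT_STR('
    library(rpart)
    dat <- data.frame(income = as.factor(.arg1),
                      age    = .arg2,
                      hours  = .arg3)
    fit <- rpart(income ~ age + hours, data = dat, method = "class")
    as.character(predict(fit, dat, type = "class"))
',
ATTR([Income]), ATTR([Age]), ATTR([Hours Per Week]))
```

Tableau passes each aggregated field to R as .arg1, .arg2, and so on, and the calculation trains and predicts in a single round trip to Rserve.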
There are many ways to deploy your model for future use; this part of the process corresponds to the deployment phase of the CRISP-DM methodology. Here are a few ways:
You can go through the model fitting inside R using RStudio or another IDE and save the fitted model. Then, you can simply load the model into Tableau, or you can save it to a file directly from within Tableau. The advantage of doing it this way is that you can reuse your R model in other tools as well. The downside is that you will need to switch between R and Tableau, and then back again.
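A minimal sketch of this save-and-reload workflow using base R's saveRDS and readRDS; the file name is illustrative, and the iris data stands in for the chapter's dataset:

```r
library(rpart)   # rpart ships with R as a recommended package

# Fit a small tree as a stand-in for the chapter's income model
fit <- rpart(Species ~ ., data = iris, method = "class")

# Save the fitted model to disk from R...
saveRDS(fit, file = "income_tree.rds")   # illustrative file name

# ...and later, for example inside a Tableau/Rserve script,
# load it back and reuse it without retraining
fit2 <- readRDS("income_tree.rds")
predictions <- predict(fit2, newdata = iris, type = "class")
```

Because the model is serialized to a plain file, the same .rds can also be loaded by any other R session or tool that speaks to R.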
If you...
When the data has a lot of features that interact in complicated non-linear ways, it is hard to find a global regression model, that is, a single predictive formula that holds over the entire dataset. An alternative approach is to partition the space into smaller regions, then into sub-partitions (recursive partitioning) until each chunk can be explained with a simple model.
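To make recursive partitioning concrete, here is a small sketch of our own using rpart on R's built-in iris data: each node in the printed tree corresponds to one region of the partitioned feature space, and each leaf is explained by a single predicted class.

```r
library(rpart)   # rpart ships with R as a recommended package

# Recursively partition the feature space to predict species
fit <- rpart(Species ~ Petal.Length + Petal.Width, data = iris,
             method = "class")

# Each printed row is a node: the root region is split into
# sub-regions, and each leaf region gets one simple prediction
print(fit)
```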
There are two main types of decision trees: classification trees, where the predicted outcome is a class label, and regression trees, where the predicted outcome is a continuous value.
There are many ensemble machine learning methods that take advantage of decision trees. Perhaps the best known is the Random Forest classifier that constructs multiple decision trees and outputs the class that corresponds to the mode of the classes output by individual trees.
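As a brief illustration of this majority-vote idea, here is a hand-rolled bagging sketch of our own: it trains several trees on bootstrap samples and takes the modal vote, like a simplified random forest without the per-split feature subsampling that the full algorithm adds.

```r
library(rpart)

set.seed(42)
n_trees <- 25

# Train each tree on a bootstrap sample of the data
trees <- lapply(seq_len(n_trees), function(i) {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]
  rpart(Species ~ ., data = boot, method = "class")
})

# Each tree casts a vote for every row...
votes <- sapply(trees, function(t)
  as.character(predict(t, newdata = iris, type = "class")))

# ...and the ensemble outputs the mode of the votes
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
```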
Suppose I claim that I have a pair of magic rainbow socks. I allege that whenever I wear these special socks, I gain the ability to predict the outcome of coin tosses, using fair coins, better than chance would dictate. Putting my claim to the test, you toss a coin 30 times, and I correctly predict the outcome 20 times. Using a directional hypothesis with the binomial test, the null hypothesis would be rejected at alpha-level 0.05. Would you invest in my special socks?
Why not? If it's because you demand a higher burden of proof for absurd claims, I don't blame you. As a grandparent of Bayesian analysis, Pierre-Simon Laplace (who independently discovered the theorem that bears Thomas Bayes' name) once said: "The weight of evidence for an extraordinary claim must be proportioned to its strangeness." Our prior belief in my absurd hypothesis is so small that it would take much stronger evidence to convince the skeptical investor, let alone the scientific community.
Unfortunately...
A graph is a type of data structure capable of handling networks. Graphs are widely used across various domains such as the following:
Transportation: To find the shortest routes to travel between two places
Communication-signaling networks: To optimize networks of interconnected computers and systems
Understanding relationships: To build relationship trees across families or organizations
Hydrology: To perform flow regime simulation analysis of various fluids
A graph (G) is a network of vertices (V) interconnected by a set of edges (E). Let |V| represent the count of vertices and |E| represent the count of edges. For a graph without self-loops, |E| lies in the range 0 to |V|^2 - |V|. Based on the direction of the edges, graphs are classified as directed or undirected. In directed graphs, each edge points from one vertex towards another, whereas in undirected graphs, edges have no direction: an edge simply connects an unordered pair of vertices. An...
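A small base-R sketch of these definitions (the vertex names are our own): a directed graph represented as an adjacency matrix, confirming that |E| can never exceed |V|^2 - |V| once self-loops are excluded.

```r
vertices <- c("A", "B", "C")          # V, so |V| = 3

# Adjacency matrix: adj[i, j] == 1 means a directed edge from i to j
adj <- matrix(0, nrow = 3, ncol = 3,
              dimnames = list(vertices, vertices))
adj["A", "B"] <- 1
adj["B", "C"] <- 1
adj["C", "A"] <- 1

diag(adj) <- 0                        # exclude self-loops
num_edges <- sum(adj)                 # |E| = 3
max_edges <- nrow(adj)^2 - nrow(adj)  # |V|^2 - |V| = 6

num_edges <= max_edges                # TRUE
```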
Although most introductory data analysis texts don't even broach the topic of Bayesian methods, you, dear reader, are versed enough in this matter to start applying these techniques to real problems.
We discovered that Bayesian methods could—at least for the models in this chapter—not only allow us to answer the same kinds of questions we might use the binomial, one sample t-test, and the independent samples t-test for, but provide a much richer and more intuitive depiction of our uncertainty in our estimates. If these approaches interest you, I urge you to learn more about how to extend these to supersede other NHST tests. I also urge you to learn more about the mathematics behind MCMC. As with the last chapter, we covered much ground here. If you made it through, congratulations! This concludes the unit on confirmatory data analysis and inferential statistics. In the next unit, we will be less concerned with estimating parameters, and more interested in prediction. Last one there...