You're reading from Advanced Analytics with R and Tableau

Product type: Book

Published in: Aug 2017

Reading level: Intermediate

Publisher: Packt

ISBN-13: 9781786460110

Edition: 1st Edition

Tools: Tableau

Concepts: Business Intelligence

Authors (3):

Ruben Oliva Ramos

Jen Stirrup

Roberto Rösler

Chapter 5. Classifying Data with Tableau

In this chapter, we will look at ways to perform classification using R and to visualize the results in Tableau. Classification is one of the most important tasks in analytics today. By the end of this chapter, you will have built a decision tree while keeping a business-oriented focus on the question that the classification algorithm is meant to answer.

Business understanding

When we are modeling data, it is crucial to keep the original business objectives in mind. These business objectives will direct the subsequent work in the data understanding, preparation and modeling steps, and the final evaluation and selection (after revisiting earlier steps if necessary) of a classification model or models.

At later stages, this will help to streamline the project because we will be able to keep the model's performance in line with the original requirement, while retaining a focus on ensuring a return on investment from the project.

The main business objective is to identify individuals who are higher earners, so that they can be targeted by a marketing campaign. For this purpose, we will investigate the data mining of demographic data in order to create a classification model in R. The model will be able to accurately determine whether individuals earn a salary that is above or below $50K per annum. The datasets used in this chapter were taken from...

Understanding the data

We will use Tableau for data preparation and data quality checks. Although we could also do these activities in R, Tableau makes data quality issues, such as outliers or missing values, easy to see and capture.

Data preparation

When confronted with many variables, analysts often start by building a decision tree and then pass the variables that the tree algorithm selected to other methods that cope poorly with many variables, such as neural networks. However, because each split in a tree tests a single variable, decision trees can perform poorly when the boundary between the classes is not aligned with the variable axes.

In this section, we will use Tableau as a visual data preparation tool to get the data ready for further analysis. Here is a summary of some of the things we will explore:

Looking at columns that do not add any value to the model
Columns that have so many missing categorical values that they do not predict the outcome reliably
Review...

Modeling in R

In this example, we will use the rpart package, which is used to build a decision tree. The tree with the minimum prediction error is selected. After that, the tree is applied to make predictions for unlabeled data with the predict function.
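As a minimal sketch of what "selecting the tree with the minimum prediction error" can look like, the following uses the cross-validated error that rpart records in its complexity table. The kyphosis data frame ships with rpart and is only a stand-in here; it is not the dataset used in this chapter, and the pruning rule shown is one common choice rather than necessarily the book's.

```r
library(rpart)

# kyphosis ships with rpart; it stands in for the chapter's data here
set.seed(123)  # the cross-validation folds are chosen at random
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

# cptable holds one row per candidate subtree; xerror is the
# cross-validated error relative to the root node
cp_table <- fit$cptable
best_row <- which.min(cp_table[, "xerror"])
best_cp  <- cp_table[best_row, "CP"]

# Prune back to the subtree with the lowest cross-validated error
pruned <- prune(fit, cp = best_cp)

# Predict classes; the training rows stand in for unlabeled data
pred <- predict(pruned, newdata = kyphosis, type = "class")
print(table(pred))
```

Calling printcp(fit) prints the same table in a readable form, which helps when choosing the complexity parameter by eye.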

One way to call rpart is to give it a list of variables and see what happens. We discussed missing values earlier, but rpart also has built-in handling for them. So let's dive in and look at the code.

Firstly, we need to call the libraries that we need:

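library(rpart)       # recursive partitioning (decision trees)
library(rpart.plot)  # plotting of rpart trees
library(caret)       # model training and evaluation helpers
library(e1071)       # used by some caret functions
library(arules)      # provides the AdultUCI dataset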

Next, let's load in the data, which will be in the AdultUCI variable:

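# Load the AdultUCI data frame that ships with arules
data("AdultUCI")
# Preview the first rows instead of printing the whole table
head(AdultUCI)

## Use 80% of the rows as the training sample
sample_size <- floor(0.80 * nrow(AdultUCI))

## Set the seed to make your partition reproducible
set.seed(123)

## Randomly choose the row indices of the training set
training.indicator <- sample(seq_len(nrow(AdultUCI)), size = sample_size)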

# Set up the training and test sets...
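The listing is cut off at this point. One plausible way to finish the split, fit a tree, and score it on the held-out rows is sketched below. This is our illustration and not the book's elided code: the people data frame, its columns, and the rule that generates the income label are all simulated so that the example runs with only rpart installed. The same indexing pattern applies to AdultUCI.

```r
library(rpart)

# Simulated stand-in for a demographic table (hypothetical columns);
# swap in AdultUCI from the arules package to follow the chapter
set.seed(123)
n <- 500
people <- data.frame(
  age   = sample(18:70, n, replace = TRUE),
  hours = sample(10:60, n, replace = TRUE)
)
# Label rule used only to give the tree something to learn
people$income <- factor(ifelse(people$age > 35 & people$hours > 40,
                               "large", "small"))

# 80% of the rows go to training, the remainder to testing
sample_size <- floor(0.80 * nrow(people))
training.indicator <- sample(seq_len(nrow(people)), size = sample_size)
train <- people[training.indicator, ]
test  <- people[-training.indicator, ]

# Fit a classification tree and predict classes for the held-out rows
fit <- rpart(income ~ age + hours, data = train, method = "class")
pred <- predict(fit, newdata = test, type = "class")

# Confusion table and overall accuracy on the test set
confusion <- table(predicted = pred, actual = test$income)
accuracy <- sum(diag(confusion)) / sum(confusion)
print(confusion)
print(accuracy)
```

Negative indexing with -training.indicator selects every row that was not sampled, so the training and test sets never overlap.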

Model deployment

Now that we have created our model, we can reuse it in Tableau. The model will work in Tableau as long as you have Rserve running. You will also need to have the relevant packages installed, as per the script. In particular, the rpart package is the workhorse of this example, and it must be installed because the calculation is self-contained: it loads the library, trains the model, and then uses the model to make predictions, all within the same calculation.

There are many ways to deploy your model for future use, and this part of the process corresponds to the deployment phase of the CRISP-DM methodology. Here are a few ways:

You can go through the model fitting inside R using RStudio or another IDE and save it. Then, you could simply load the model into Tableau or you can save it to a file directly from within Tableau. The advantage of doing it in this way is that you can reuse your R model in other packages as well. The downside is that you will need to switch between R and Tableau, and then back again.
If you...
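The first option above, fitting the model in R and saving it for later use, can be sketched with base R's saveRDS and readRDS. The file name and location are hypothetical, and kyphosis (bundled with rpart) stands in for the chapter's data. How the saved file is then referenced from a Tableau calculation is not shown here.

```r
library(rpart)

# kyphosis ships with rpart and stands in for the chapter's data
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

# Hypothetical file location; tempdir() keeps the sketch self-contained
model_path <- file.path(tempdir(), "tree_model.rds")
saveRDS(fit, model_path)

# Later, or in another R session such as one started by Rserve
restored <- readRDS(model_path)
pred <- predict(restored, newdata = kyphosis, type = "class")
print(head(pred))
```

Because the restored object is an ordinary rpart model, any R session that has rpart installed can use it for prediction without retraining.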

Decision trees in Tableau using R

When the data has a lot of features that interact in complicated non-linear ways, it is hard to find a global regression model, that is, a single predictive formula that holds over the entire dataset. An alternative approach is to partition the space into smaller regions, then into sub-partitions (recursive partitioning) until each chunk can be explained with a simple model.

There are two main types of decision trees:

Classification trees: Predicted outcome is the class the data belongs to
Regression trees: Predicted outcome is a continuous variable, for example, a real number such as the price of a commodity
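To make the distinction concrete, here is a small sketch that fits one tree of each type with rpart. The kyphosis and mtcars data frames are bundled with R and rpart and are used purely for illustration; the method argument tells rpart which kind of tree to grow.

```r
library(rpart)

# Classification tree: the outcome Kyphosis is a factor (absent/present)
class_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                    method = "class")
class_pred <- predict(class_tree, newdata = kyphosis, type = "class")

# Regression tree: the outcome mpg is a continuous number
reg_tree <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova")
reg_pred <- predict(reg_tree, newdata = mtcars)

print(head(class_pred))  # class labels
print(head(reg_pred))    # numeric predictions
```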

There are many ensemble machine learning methods that take advantage of decision trees. Perhaps the best known is the Random Forest classifier that constructs multiple decision trees and outputs the class that corresponds to the mode of the classes output by individual trees.
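The voting idea can be sketched in a few lines of base R and rpart. Note that this is plain bagging (each tree sees a bootstrap resample of the rows) and not a full Random Forest, which also samples a random subset of variables at each split. The built-in iris data and the choice of 25 trees are assumptions made for illustration.

```r
library(rpart)

set.seed(42)
n_trees <- 25

# Fit each tree to a bootstrap resample of the rows
trees <- lapply(seq_len(n_trees), function(i) {
  rows <- sample(nrow(iris), replace = TRUE)
  rpart(Species ~ ., data = iris[rows, ], method = "class")
})

# One column of predicted labels per tree
votes <- sapply(trees, function(tree) {
  as.character(predict(tree, newdata = iris, type = "class"))
})

# The ensemble prediction is the most frequent label (the mode) per row
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
print(mean(majority == iris$Species))
```

For real work, a dedicated package such as randomForest implements the complete algorithm, including the per-split variable sampling that this sketch omits.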

Bayesian methods

Suppose I claim that I have a pair of magic rainbow socks. I allege that whenever I wear these special socks, I gain the ability to predict the outcome of coin tosses, using fair coins, better than chance would dictate. Putting my claim to the test, you toss a coin 30 times, and I correctly predict the outcome 20 times. Using a directional hypothesis with the binomial test, the null hypothesis would be rejected at alpha-level 0.05. Would you invest in my special socks?

Why not? If it's because you require a larger burden of proof on absurd claims, I don't blame you. As a grandparent of Bayesian analysis, Pierre-Simon Laplace (who independently discovered the theorem that bears Thomas Bayes' name) once said: "The weight of evidence for an extraordinary claim must be proportioned to its strangeness." Our prior belief in my absurd hypothesis is so small that it would take much stronger evidence to convince the skeptical investor, let alone the scientific community.
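The directional binomial test from the sock experiment can be reproduced with base R's binom.test. The numbers come straight from the story (20 correct calls in 30 tosses of a fair coin); the second calculation is only a cross-check of the same tail probability.

```r
# 20 correct predictions out of 30 tosses; the null hypothesis is p = 0.5
# and the directional alternative is that the success rate is greater
result <- binom.test(x = 20, n = 30, p = 0.5, alternative = "greater")
print(result$p.value)

# The same tail probability computed directly: P(X >= 20)
tail_prob <- sum(dbinom(20:30, size = 30, prob = 0.5))
print(tail_prob)
```

The p-value is roughly 0.049, which falls just under the 0.05 threshold. That is why the null hypothesis is rejected even though the claim itself remains hard to believe.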

Unfortunately...

Graphs

A graph is a type of data structure capable of handling networks. Graphs are widely used across various domains such as the following:

Transportation: To find the shortest routes to travel between two places
Communication-signaling networks: To optimize the network of inter-connected computers and systems
Understanding relationships: To build relationship trees across families or organizations
Hydrology: To perform flow regime simulation analysis of various fluids

Terminology and representations

A graph (G) is a network of vertices (V) interconnected using a set of edges (E). Let |V| represent the count of vertices and |E| represent the count of edges. The value of |E| lies in the range of 0 to |V|^2 - |V| (for a directed graph without self-loops). Based on the direction of their edges, graphs are classified as directed or undirected. In directed graphs, each edge points from one vertex towards another, whereas in undirected graphs, an edge has no direction and connects its two vertices both ways. An...
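One common way to represent such a graph is an adjacency matrix. The sketch below builds a hypothetical four-vertex directed graph in base R, counts its edges, and checks the count against the upper bound for a directed graph without self-loops. The vertex names and edges are made up for illustration.

```r
# Hypothetical 4-vertex directed graph stored as an adjacency matrix:
# adj[i, j] == 1 means there is an edge from vertex i to vertex j
vertices <- c("A", "B", "C", "D")
adj <- matrix(0, nrow = 4, ncol = 4, dimnames = list(vertices, vertices))
adj["A", "B"] <- 1
adj["B", "C"] <- 1
adj["C", "A"] <- 1
adj["C", "D"] <- 1

n_vertices <- length(vertices)
n_edges    <- sum(adj)                    # |E|
max_edges  <- n_vertices^2 - n_vertices   # upper bound without self-loops

# Dropping the direction of every edge gives a symmetric matrix,
# which is how an undirected graph looks in this representation
undirected <- pmax(adj, t(adj))

print(n_edges)
print(max_edges)
print(isSymmetric(undirected))
```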

Summary

Although most introductory data analysis texts don't even broach the topic of Bayesian methods, you, dear reader, are versed enough in this matter to start applying these techniques to real problems.

We discovered that Bayesian methods could—at least for the models in this chapter—not only allow us to answer the same kinds of questions we might use the binomial, one sample t-test, and the independent samples t-test for, but provide a much richer and more intuitive depiction of our uncertainty in our estimates. If these approaches interest you, I urge you to learn more about how to extend these to supersede other NHST tests. I also urge you to learn more about the mathematics behind MCMC. As with the last chapter, we covered much ground here. If you made it through, congratulations! This concludes the unit on confirmatory data analysis and inferential statistics. In the next unit, we will be less concerned with estimating parameters, and more interested in prediction. Last one there...

Authors (3)

Ruben Oliva Ramos

Ruben Oliva Ramos is a computer systems engineer from Tecnologico de Leon Institute, with a master's degree in computer and electronic systems engineering and a specialization in teleinformatics and networking from the University of Salle Bajio in Leon, Guanajuato, Mexico. He has more than 5 years of experience of developing web applications to control and monitor devices connected with Arduino and Raspberry Pi, using web frameworks and cloud services to build the Internet of Things applications. He is a mechatronics teacher at the University of Salle Bajio and teaches students of the master's degree in design and engineering of mechatronics systems. Ruben also works at Centro de Bachillerato Tecnologico Industrial 225 teaching subjects such as electronics, robotics and control, automation, and microcontrollers. He is a consultant and developer for projects in areas such as monitoring systems and datalogger data using technologies (such as Android, iOS, HTML5, and ASP.NET), databases (such as SQlite, MongoDB, and MySQL), web servers, hardware programming, and control and monitor systems for data acquisition and programming.

Jen Stirrup

Jen Stirrup is a data strategist and technologist, a Microsoft Most Valuable Professional (MVP), and a Microsoft Regional Director, a tech community advocate, a public speaker and blogger, a published author, and a keynote speaker. Jen is the founder of a boutique consultancy based in the UK, Data Relish, which focuses on delivering successful business intelligence and artificial intelligence solutions that add real value to customers worldwide. She has featured on the BBC as a guest expert on topics relating to data.

Roberto Rösler
