Advanced Analytics with R and Tableau

Product type: Book
Published in: Aug 2017
Publisher: Packt
ISBN-13: 9781786460110
Pages: 178
Edition: 1st Edition
Authors (3): Ruben Oliva Ramos, Jen Stirrup, Roberto Rösler

Table of Contents (16)

Advanced Analytics with R and Tableau
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Customer Feedback
Preface
1. Advanced Analytics with R and Tableau
2. The Power of R
3. A Methodology for Advanced Analytics Using Tableau and R
4. Prediction with R and Tableau Using Regression
5. Classifying Data with Tableau
6. Advanced Analytics Using Clustering
7. Advanced Analytics with Unsupervised Learning
8. Interpreting Your Results for Your Audience
Index

Chapter 3. A Methodology for Advanced Analytics Using Tableau and R

In the era of big data, when a lack of methodology is likely to produce random and false discoveries, a robust framework for delivery is extremely important. According to a 2015 Dataversity poll, only 17% of survey respondents said they had a well-developed predictive or prescriptive analytics program in place; on the other hand, 80% of respondents said they planned to implement such a program within five years. How can we ensure that our projects are successful?

There is an increasing amount of data in the world, and in our databases. The data deluge is not going to go away anytime soon! Businesses risk wasting the business value of the information contained in their databases unless they can extract useful knowledge from the data.

There is a saying in the world of data: garbage in, garbage out. Data needs to be cleaned before it is turned into information. There is a difference between original...

Industry standard methodologies for analytics


There are two main methodologies: the Microsoft Team Data Science Process (TDSP) and the CRISP-DM methodology.

Ultimately, both set out to achieve the same objectives as an analytics framework. There are differences, of course, and these are highlighted here. Both CRISP-DM and TDSP focus on the business value and the results derived from analytics projects.

Both of these methodologies are described in the following sections.

CRISP-DM


One common methodology is CRISP-DM. The Cross Industry Standard Process for Data Mining (CRISP-DM), as it is known, is a process model that provides a fluid framework for devising, creating, building, testing, and deploying machine learning solutions. The process is loosely divided into six main phases, which can be seen in the following diagram:

CRISP-DM Methodology

Initially, the process starts with a business idea and a general consideration of the data. Each stage is briefly discussed in the following sections.

Business understanding/data understanding

The first phase looks at the machine learning solution from a business standpoint, rather than a technical standpoint. The business idea is defined, and a draft project plan is generated. Once the business idea is defined, the data understanding phase focuses on data collection and familiarity. At this point, missing data may be identified, or initial insights may be revealed. This...

Team Data Science Process


The TDSP process model provides a dynamic framework for machine learning solutions that are taken through a robust process of planning, producing, constructing, testing, and deploying models. Here is an example of the TDSP process:

The process is loosely divided into four main phases:

  • Business Understanding

  • Data Acquisition and Understanding

  • Modeling

  • Deployment

The phases are described in the following paragraphs.

Business understanding

The Business understanding process starts with a business idea, which is solved with a machine learning solution. The business idea is defined from the business perspective, and possible scenarios are identified and evaluated. Ultimately, a project plan is generated for delivering the solution.

Data acquisition and understanding

Following on from the business understanding phase is the data acquisition and understanding phase, which concentrates on...

Working with dirty data


The process of cleaning data involves tidying it, which usually makes the dataset smaller because some of the dirty data has been removed. What makes data dirty?

Dirty data can be due to invalid data, which is data that is false, incomplete, or doesn't conform to the accepted standard. Examples of invalid data include formatting errors, or data that is out of an acceptable range. Invalid data could also have the wrong type. For example, an asterisk is invalid in a field formatted for letters only, so it can be removed.
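As a minimal sketch (the vector of names here is hypothetical), an invalid value such as a stray asterisk can be flagged with a pattern that accepts letters only, and then filtered out:

```r
# Hypothetical field that should contain letters only
names <- c("Smith", "Jones", "*", "Brown")

# TRUE where the entry consists solely of letters
valid <- grepl("^[A-Za-z]+$", names)

# Drop the invalid entries; the asterisk is removed
clean_names <- names[valid]
```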

Dirty data can be due to missing data, which is data where no value is stored. An example of missing data is data that has not been stored due to a faulty sensor. We can see that some data is missing, so it is removed from consideration.

Dirty data could also have null values. If data has null values, then programs may respond differently to the data on that basis. The nulls will need to be considered in order...
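A short illustration using a made-up vector of sensor readings: in R, missing values are represented as NA, and most functions let you either remove them or skip over them explicitly:

```r
# Hypothetical sensor readings; NA marks values the faulty sensor never stored
readings <- c(21.5, NA, 22.1, NA, 20.9)

# Count the missing values before deciding how to treat them
n_missing <- sum(is.na(readings))

# Option 1: remove the missing values from consideration
complete <- readings[!is.na(readings)]

# Option 2: ask the function to ignore NA values
avg <- mean(readings, na.rm = TRUE)
```

Programs really do respond differently on this basis: mean(readings) without na.rm = TRUE would return NA rather than a number.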

Introduction to dplyr


What is dplyr? Well, dplyr can be perceived as a grammar of data manipulation. It was created for the R community by Hadley Wickham, Romain François, and RStudio.

What does dplyr give the Tableau user? We will use dplyr in order to cleanse, summarize, group, chain, filter, and visualize our data in Tableau.
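As a quick sketch of those verbs on a made-up data frame (this assumes dplyr is installed; the column names and figures are purely illustrative):

```r
library(dplyr)

# Hypothetical sales data
sales <- data.frame(
  region = c("North", "South", "North", "South"),
  amount = c(100, 250, 300, 150)
)

# Filter, group, and summarise, chained with the pipe operator
summary_tbl <- sales %>%
  filter(amount > 100) %>%
  group_by(region) %>%
  summarise(total = sum(amount))
```

filter() keeps the rows above 100, group_by() splits them by region, and summarise() collapses each group to a single total; a result shaped like this is what we would then hand over to Tableau for visualization.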

Summarizing the data with dplyr

Firstly, let's import the packages that we need. These packages are listed in the following table, followed by the code itself.

Packages required for the hands-on exercise:

Package Name: WDI
Description: Search, extract, and format data from the World Bank's World Development Indicators
Reference: https://cran.r-project.org/web/packages/WDI/index.html

Package Name: dplyr
Description: A grammar of data manipulation
Reference:

As we walk through the script, the first thing we need to do is install the packages.

Once you have installed the packages, we need to call each library.

Once we have called the libraries, then we need to obtain the data from the...
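Those three steps might look like the following sketch. The indicator code and country list here are illustrative choices, and the WDI() call needs internet access to reach the World Bank API:

```r
# Install the packages (run once)
# install.packages(c("WDI", "dplyr"))

# Call each library
library(WDI)
library(dplyr)

# Obtain data from the World Development Indicators:
# GDP per capita (constant US$) for two example countries
gdp <- WDI(country = c("US", "GB"),
           indicator = "NY.GDP.PCAP.KD",
           start = 2010, end = 2015)
```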

Summary


Data science requires a process to ensure that the project is successful. As we have seen from the previous frameworks, it involves many moving parts: extracting timely data from diverse data sources, building and testing the models, and then deploying those models to aid in, or to automate, day-to-day decision-making processes. Otherwise, the project can easily fall through the gaps, and the organization ends up right where it started: data rich, information poor.

In this chapter, we have covered the CRISP-DM methodology and the TDSP methodology. Each of these methodologies has the data preparation stage clearly marked out. Following this sequence, we focused on the data preparation stage using the dplyr package in R, cleaned some data, and compared the results between the dirty and clean data.
