You're reading from Advanced Analytics with R and Tableau

Product type: Book
Published in: Aug 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781786460110
Edition: 1st Edition
Authors (3):
Ruben Oliva Ramos

Ruben Oliva Ramos is a computer systems engineer from Tecnologico de Leon Institute, with a master's degree in computer and electronic systems engineering and a specialization in teleinformatics and networking from the University of Salle Bajio in Leon, Guanajuato, Mexico. He has more than 5 years of experience of developing web applications to control and monitor devices connected with Arduino and Raspberry Pi, using web frameworks and cloud services to build the Internet of Things applications. He is a mechatronics teacher at the University of Salle Bajio and teaches students of the master's degree in design and engineering of mechatronics systems. Ruben also works at Centro de Bachillerato Tecnologico Industrial 225 teaching subjects such as electronics, robotics and control, automation, and microcontrollers. He is a consultant and developer for projects in areas such as monitoring systems and datalogger data using technologies (such as Android, iOS, HTML5, and ASP.NET), databases (such as SQlite, MongoDB, and MySQL), web servers, hardware programming, and control and monitor systems for data acquisition and programming.

Jen Stirrup

Jen Stirrup is a data strategist and technologist, a Microsoft Most Valuable Professional (MVP), and a Microsoft Regional Director, a tech community advocate, a public speaker and blogger, a published author, and a keynote speaker. Jen is the founder of a boutique consultancy based in the UK, Data Relish, which focuses on delivering successful business intelligence and artificial intelligence solutions that add real value to customers worldwide. She has featured on the BBC as a guest expert on topics relating to data.

Chapter 4. Prediction with R and Tableau Using Regression

In this chapter, we will consider regression from an analytics point of view. We will look at the predictive capabilities and performance of regression algorithms, which are a good starting point for an analytics program. By the end of this chapter, you'll have experience of simple linear regression, multiple linear regression, and k-Nearest Neighbors regression, together with a business-oriented understanding of where these techniques are actually used.

We will focus on preparing, exploring, and modeling the data in R, combined with the visualization power of Tableau in order to express the findings in the data.

Some interesting datasets come from the UCI Machine Learning Repository, which can be reached at the following link: https://archive.ics.uci.edu/ml/datasets.html.

During the course of this chapter, we will use datasets that are obtained from the UCI website, in addition to default R datasets.

Getting started with regression


Regression estimates the conditional expected value of a dependent variable, given one or more independent variables. A dependent variable is the variable that we want to predict. Examples of a dependent variable could be a number such as price, sales, or weight. An independent variable is a characteristic, or feature, that helps to determine the dependent variable. So, for example, the independent variable of height could help to determine the dependent variable of weight.

Regression analysis can be used in forecasting, time series modeling, and investigating cause-and-effect relationships.

Simple linear regression

R can help us to build prediction stories with Tableau. Linear regression is a great starting place when you want to predict a number, such as profit, cost, or sales. In simple linear regression, there is only one independent variable x, which predicts a dependent value, y.
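As a minimal sketch of simple linear regression, the following fits a model on R's built-in women dataset. The formula weight ~ height is an assumption made for illustration (one predictor, one outcome), and the height of 65 inches is a hypothetical input; the name linearregressionmodel matches the one used later in this chapter.

```r
# Fit a simple linear regression on R's built-in women dataset.
# Assumption: weight is the dependent variable, height the single predictor.
linearregressionmodel <- lm(weight ~ height, data = women)

# The fitted line has the form: weight = intercept + slope * height.
intercept <- coef(linearregressionmodel)[["(Intercept)"]]
slope <- coef(linearregressionmodel)[["height"]]

# Predict the weight for a hypothetical height of 65 inches.
newperson <- data.frame(height = 65)
predictedweight <- predict(linearregressionmodel, newdata = newperson)
print(predictedweight)
```

The prediction is just the fitted line evaluated at the new height, so intercept + slope * 65 should give the same number as predict().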

Simple linear regression is usually expressed with a line that identifies...

Comparing actual values with predicted results


Now, we will look first at the actual weights of the 15 women in the dataset, and then at the predicted values. The actual weights are returned by the following command:

women$weight

When we execute the women$weight command, this is the result that we obtain:

When we look at the predicted values, these are also read out in R:

How can we put these pieces of data together?

women$pred <- linearregressionmodel$fitted.values

This is a very simple merge. When we look inside the women variable again, this is the result:
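The steps above can be sketched end to end as follows. This assumes that linearregressionmodel was fit as lm(weight ~ height, data = women); the comparison data frame and its diff column are illustrative names that do not appear in the original text.

```r
# Assumption: the model was fit on the built-in women dataset like this.
linearregressionmodel <- lm(weight ~ height, data = women)

# Work on a copy so the built-in dataset itself stays untouched.
comparison <- women

# Place the fitted (predicted) values next to the actual weights.
comparison$pred <- linearregressionmodel$fitted.values

# Difference between actual and predicted weight for each woman.
comparison$diff <- comparison$weight - comparison$pred

# Show the first few rows: height, weight, pred, diff.
print(head(comparison))
```

Because the fitted values come back in the same row order as the data used to fit the model, a plain column assignment is enough and no key-based join is needed.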

Investigating relationships in the data

We can see the names of the components stored in the model by using the names function. In our example, the call is as follows:

names(linearregressionmodel)

When we use this command, we get the following components:

[1] "coefficients"  "residuals"     "effects"      
[4] "rank"          "fitted.values" "assign"       
[7] "qr"            "df.residual"   "xlevels"      
[10] "call"          "terms"   ...
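Each of these names can be used with the $ operator to pull a component out of the model object. The sketch below is illustrative only and assumes the model was fit as lm(weight ~ height, data = women); the variable names modelcoefficients and residualdf are hypothetical.

```r
# Assumption: the model was fit on the built-in women dataset like this.
linearregressionmodel <- lm(weight ~ height, data = women)

# The intercept and slope of the fitted line.
modelcoefficients <- linearregressionmodel$coefficients
print(modelcoefficients)

# Residual degrees of freedom: 15 observations minus 2 estimated parameters.
residualdf <- linearregressionmodel$df.residual
print(residualdf)

# The call that was used to create the model.
print(linearregressionmodel$call)
```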

Getting started with multiple regression?


Simple linear regression will summarize the relationship between an outcome and a single explanatory element. However, in real life, things are not always so simple! We are going to use the adult dataset from UCI, which focuses on census data with a view to identifying if adults earn above or below fifty thousand dollars a year. The idea is that we can build a model from observations of adult behavior, to see if the individuals earn above or below fifty thousand dollars a year.

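Multiple regression builds a model of the data from several explanatory variables, and the model is then used to make predictions. Used in this way, it is a scoring model: it summarizes each observation as a single score. For a yes/no outcome such as this one, the form of regression usually used is logistic regression, which predicts a value between 0 and 1; this makes it well suited to predicting probabilities.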
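A model that returns a score between 0 and 1 from several predictors is typically fit in R as a logistic regression with glm. The sketch below is a stand-in, not the adult dataset analysis: it uses the built-in mtcars data so that it runs without a download, with the 0/1 column am playing the role of the yes/no outcome. The choice of predictors and the 0.5 cut-off are assumptions for illustration.

```r
# Stand-in data: mtcars ships with R; am is coded 0/1.
# Assumption: wt and hp are used as the explanatory variables.
scoringmodel <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# type = "response" returns predicted probabilities rather than log-odds.
scores <- predict(scoringmodel, type = "response")
print(range(scores))

# Turn each score into a yes/no prediction with a hypothetical 0.5 cut-off.
predictedclass <- ifelse(scores > 0.5, 1, 0)
print(table(actual = mtcars$am, predicted = predictedclass))
```

The same pattern would apply to the census data once it is loaded into a data frame, with the income indicator as the outcome.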

It's possible to imagine multiple regression as modeling the behavior of a coin being tossed in the air. How will the coin land—heads or tails? It is not dependent on just one thing. The reality is that the result will depend on other variables...

Solving the business question


What are we trying to do with regression? If you are trying to answer a business question by predicting probabilities or scores, then regression is a great place to start. Business problems that require scoring are also known as regression problems. In this example, we have scored the likelihood of an individual earning above fifty thousand dollars per annum.

The main objective is to create a model that we can use on other data, too. The output is a set of results, but it is also an equation that describes the relationship between a number of predictor variables and the response variable.

What do the terms mean?

For example, you could try to estimate the probability that a given person earns above or below fifty thousand dollars:

  • Error: The difference between predicted value and true value

  • Residuals: The difference between the actual values of the variable you're predicting and the predicted values from your regression, that is, y - ŷ
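The definition of residuals can be checked directly in R. This is a minimal sketch on the built-in women dataset, assuming a model of weight on height; the names actual, predicted, manualresiduals, and rmse are illustrative.

```r
# Assumption: a simple model of weight on height, fit on the women dataset.
linearregressionmodel <- lm(weight ~ height, data = women)

# y is the actual value, y-hat the predicted value from the regression.
actual <- women$weight
predicted <- fitted(linearregressionmodel)

# Residuals computed by hand as y - y-hat.
manualresiduals <- actual - predicted

# They should match what R stores in the model object.
print(all.equal(unname(manualresiduals),
                unname(residuals(linearregressionmodel))))

# Root mean squared error: one common summary of the size of the residuals.
rmse <- sqrt(mean(manualresiduals^2))
print(rmse)
```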

For most...

Sharing our data analysis using Tableau


R gives you good diagnostic information to help you take the next step in your analysis, which is to visualize the results.
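One simple way to hand results over to Tableau is to write them to a CSV file, which Tableau can open as a text file data source. The sketch below assumes the women model from earlier in the chapter; the file name and the column names are hypothetical, and a temporary directory is used so that the example runs anywhere.

```r
# Assumption: a simple model of weight on height, fit on the women dataset.
linearregressionmodel <- lm(weight ~ height, data = women)

# Collect actual values, predictions, and residuals in one table.
results <- data.frame(
  height = women$height,
  actual_weight = women$weight,
  predicted_weight = fitted(linearregressionmodel),
  residual = residuals(linearregressionmodel)
)

# Hypothetical output location; replace with a folder Tableau can read.
outputfile <- file.path(tempdir(), "women_regression_results.csv")

# row.names = FALSE avoids an extra unnamed index column in the file.
write.csv(results, outputfile, row.names = FALSE)
print(outputfile)
```

Keeping actual and predicted values in separate columns makes it straightforward to plot them against each other on the Tableau side.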

Interpreting the results

Statistics provides us with a method of investigation where other methods haven't been able to help, and their success or failure isn't clear to many people. If we see a correlation and think that the relationship is obvious, then we need to think again. Correlation can tempt people to infer causation. It's often said that correlation is not causation, but what does this mean? Correlation is a measure of how closely related two things are; it does not tell us which one, if either, drives the other. Other statistical methods, such as structural equation modeling, can help us to identify the direction of the relationship, if it exists, using correlated data. That is a complex field in itself, and it isn't covered in this book; the main point here is that this is a complex question.
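A short sketch shows why a correlation coefficient cannot indicate direction: it is symmetric in its two arguments. The example uses the built-in women dataset as an illustration; the variable names are hypothetical.

```r
# Correlation between height and weight in the built-in women dataset.
heightweightcor <- cor(women$height, women$weight)

# Swapping the two variables gives the same value.
weightheightcor <- cor(women$weight, women$height)

print(heightweightcor)
print(isTRUE(all.equal(heightweightcor, weightheightcor)))
```

Since both orderings give the same coefficient, the number alone says nothing about whether height influences weight, weight influences height, or neither.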

How does correlation help us here? For our purposes, the most interesting...

Summary


In this chapter, we reviewed ways of creating regression models and displaying our regression results using Tableau. We have reiterated the importance of the business question in understanding the data, and we have covered interpretation of the statistics in terms of their numbers, whilst being mindful of the context.

While regression is important for scoring the data, there are business problems where we need to classify the data. Classification is one of the most important tasks in analytics today, and it's used in all sorts of settings to answer business questions.
