You're reading from Practical Predictive Analytics

Product type: Book
Published in: Jun 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781785886188
Edition: 1st Edition
Tools: Splunk
Concepts: Predictive Analytics
Author: Ralph Winters

Your second script

Our second R script is a simple two-variable regression model which predicts women's height based upon weight.

Begin by creating another R script by selecting File | New File | R Script from the top navigation bar. This route takes three clicks, so if you create new scripts often enough you might get Click Fatigue. You can save a click by selecting the icon with the + sign in the top left.

Whichever way you choose, a new blank script window will appear with the name Untitled2.

Now paste the following code into the new script window:

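require(graphics)                             # base plotting functions (normally attached by default)
data(women)                                   # load the built-in women dataset (height, weight)
lm_output <- lm(women$height ~ women$weight)  # regress height on weight
summary(lm_output)                            # display coefficients and fit statistics
prediction <- predict(lm_output)              # fitted height for each observation
error <- women$height - prediction            # residual: actual minus predicted
plot(women$height, error)                     # actual height (x) against error (y)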

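Press the Source icon to run the entire script. The regression summary appears in the console and the error plot appears in the Plots pane. If the summary is not printed, rerun the script with Source with Echo (in the drop-down next to the Source icon), since a plain source run does not auto-print the values of top-level expressions such as summary(lm_output).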

Code description

Here are some notes and explanations for the script code that you have just run:

lm() function: This function fits a simple linear regression. Here it predicts women's height based upon the value of their weight. In statistical parlance, you are regressing height on weight. The line of code which accomplishes this is:

        lm_output <- lm(women$height ~ women$weight)

There are two operators that you will become very familiar with when running predictive models in R:
- The ~ operator: Also called the tilde, this is shorthand for separating what you want to predict from what you are using to predict it. An expression written this way is in formula syntax. What you are predicting (the dependent or target variable) is on the left side of the formula, and the predictors (independent variables, or features) are on the right side. To improve readability, the independent variable (weight) and dependent variable (height) are specified using $ notation, which gives the object name, then $, and then the data frame column. So women's height is referenced as women$height and women's weight is referenced as women$weight. Alternatively, you can use the attach command and then refer to these columns by the names height and weight alone. For example, the following code would produce the same coefficient estimates (the slope is then labeled weight instead of women$weight):

                      attach(women)
                      lm_output <- lm(height ~ weight)

- The <- operator: Also called the assignment operator. It evaluates the expression on its right side and assigns the result to the object named on its left side. This either creates a new object or replaces an existing object of the same name, which you can then display or manipulate further. In this case, we create a new object called lm_output from the result of lm(), which fits a linear model based on the formula contained within the parentheses.
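As a further alternative to $ notation and attach, lm() accepts a data argument that names the data frame in which the formula's variables are looked up. The sketch below compares the two forms; the object names lm_dollar and lm_data are illustrative and do not appear in the original script.

```r
data(women)

# Original form: columns referenced with $ notation
lm_dollar <- lm(women$height ~ women$weight)

# Alternative form: bare column names, looked up in the data frame passed as data
lm_data <- lm(height ~ weight, data = women)

# The estimates match; only the label of the slope differs
coef(lm_dollar)
coef(lm_data)
```

The data form avoids leaving an attached data frame on the search path, and it keeps the predictor name (weight) simple, which helps if you later want to predict for new weights.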

Note

Note that the execution of this line does not produce any displayed output. You can see whether the line was executed by checking the console. If there is any problem with running the line (or any line for that matter), you will see an error message in the console.

summary(lm_output): The following statement displays some important summary information about the object lm_output and writes the output to the R Console:

        summary(lm_output)

The results will appear in the Console window. Do not be discouraged by the amount of output produced. To keep things simple for now, I will show only the coefficient lines of the output, which are what you should be looking at.

Look at the lines marked (Intercept) and women$weight, which appear under the Coefficients line in the console.

        Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
        (Intercept)  25.723456   1.043746   24.64 2.68e-12 ***
        women$weight  0.287249   0.007588   37.85 1.09e-14 ***

The Estimate column gives the coefficients of the linear regression formula needed to derive height from weight: predicted height = intercept + slope × weight. We can use these numbers along with a calculator to determine the prediction ourselves. The output tells us to perform the following steps for each observation in our data frame in order to obtain the prediction for height. We will not want to do this by hand for all of the observations (R will do that via the predict() function), but we will illustrate the calculation for one data point:

- Take the weight value for the observation. Let's take the weight of the first woman, which is 115 lbs.
- Then multiply the weight by 0.2872. That is the number listed under Estimate for women$weight, rounded to four decimal places. Multiplying 115 lbs. by 0.2872 yields 33.028.
- Then add 25.7235, which is the estimate in the (Intercept) row. That yields a prediction of 58.75 inches.

If you do not have a calculator handy, the calculation is easily done in calculator mode via the R Console, by typing the following:
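The expression to type is presumably the arithmetic from the steps above. The line below is a reconstruction from those steps using the rounded coefficients, not text taken from the original listing:

```r
# slope * weight of the first woman + intercept (coefficients rounded to 4 decimals)
0.2872 * 115 + 25.7235
# [1] 58.7515
```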

The predict function

To predict height for all of the observations we will use a function called predict(). This function takes the value of the input (independent) variable for each observation and predicts the target (dependent) variable from the linear regression equation. In the code we have assigned the output of this function to a new object named prediction.

Switch over to the console area, and type prediction, then Enter, to see the predicted values for the 15 women. The following should appear in the console.

> prediction
       1        2        3        4        5        6        7  
58.75712 59.33162 60.19336 61.05511 61.91686 62.77861 63.64035  
       8        9       10       11       12       13       14  
64.50210 65.65110 66.51285 67.66184 68.81084 69.95984 71.39608  
      15  
72.83233

Notice that the value of the first prediction (58.75712) is very close to what you just calculated by hand (58.75). The difference arises because the hand calculation used coefficients rounded to four decimal places.
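To see that predict() applies the same formula at full precision, the following sketch rebuilds the predictions from the stored coefficients and compares them with the output of predict(). The names b and manual are illustrative only.

```r
data(women)
lm_output <- lm(women$height ~ women$weight)
prediction <- predict(lm_output)

# coef() returns the unrounded intercept and slope
b <- coef(lm_output)

# intercept + slope * weight, computed for all 15 women at once
manual <- b[1] + b[2] * women$weight

# TRUE when the two sets of predictions agree (names are ignored)
all.equal(unname(manual), unname(prediction))
```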

Examining the prediction errors

Another R object produced by our script is the error object. The error object is a vector that was computed by subtracting the predicted height from the actual height. These values are also known as the residual errors, or just residuals.

error <- women$height-prediction
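R also stores these residuals in the fitted model, where they can be retrieved with residuals(). A minimal sketch confirming that the hand-computed error vector matches:

```r
data(women)
lm_output <- lm(women$height ~ women$weight)
prediction <- predict(lm_output)
error <- women$height - prediction

# residuals() extracts actual minus fitted values from the model object
all.equal(unname(error), unname(residuals(lm_output)))
```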

Since the error object is a vector rather than a data frame, nrow() returns NULL for it and cannot be used to get its size. But you can use the length() function:

> length(error)
[1] 15

The count is 15, which matches the number of predictions, so all is good. If we want to see the raw data, predictions, and the prediction errors for all of the data, we can use the cbind() function (column bind) to combine all three as columns and display them as a simple table.

At the console enter the following cbind command:

> cbind(height=women$height,PredictedHeight=prediction,ErrorInPrediction=error)
   height PredictedHeight ErrorInPrediction
1      58        58.75712       -0.75711680
2      59        59.33162       -0.33161526
3      60        60.19336       -0.19336294
4      61        61.05511       -0.05511062
5      62        61.91686        0.08314170
6      63        62.77861        0.22139402
7      64        63.64035        0.35964634
8      65        64.50210        0.49789866
9      66        65.65110        0.34890175
10     67        66.51285        0.48715407
11     68        67.66184        0.33815716
12     69        68.81084        0.18916026
13     70        69.95984        0.04016335
14     71        71.39608       -0.39608278
15     72        72.83233       -0.83232892

From the preceding output, we can see that there are a total of 15 predictions. If you compare the ErrorInPrediction column with the error plot, you can see that for this very simple model, the prediction errors are largest in magnitude for the extreme values of height (the first and last rows of the table).
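One way to pick out the largest errors without scanning the table by eye is to sort by absolute error. This is a sketch only; the names results and top3 are illustrative.

```r
data(women)
lm_output <- lm(women$height ~ women$weight)
error <- women$height - predict(lm_output)

results <- data.frame(height = women$height, error = unname(error))

# Sort rows by the absolute size of the error, largest first, and keep three
top3 <- results[order(abs(results$error), decreasing = TRUE), ][1:3, ]
top3
```

The first two rows returned are the tallest (72 inches) and shortest (58 inches) women, matching the table above.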

Just to verify that we have one prediction for each of our original observations, we will use the nrow() function to count the number of rows in the data frame.

At the command prompt in the console area, enter the command:

nrow(women)

The following should appear:

> nrow(women)
[1] 15
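The same check can be written into a script so that it fails loudly if the counts ever differ. A minimal sketch using stopifnot(), which raises an error when its condition is not TRUE:

```r
data(women)
lm_output <- lm(women$height ~ women$weight)
prediction <- predict(lm_output)

# One prediction per row of the original data; silent when the check passes
stopifnot(length(prediction) == nrow(women))
```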

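Refer back to the seventh line of code in the original script: plot(women$height,error) plots the actual height against the errors. It shows how far each prediction was from the original value. You can see that the errors show a non-random pattern: they are negative at both extremes of height and positive in the middle, which suggests that the relationship is not perfectly linear.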
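The pattern is easier to read with axis labels and a reference line at zero. The following is an optional variation on the seventh line, not part of the original script; the label text is an assumption based on the women dataset, which records height in inches.

```r
data(women)
lm_output <- lm(women$height ~ women$weight)
error <- women$height - predict(lm_output)

plot(women$height, error,
     xlab = "Actual height (in)",
     ylab = "Error (actual - predicted, in)",
     main = "Prediction error by height")

# Dashed horizontal line at zero error; points below it are over-predictions
abline(h = 0, lty = 2)
```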

After you are done, save the file using File | Save, navigate to the PracticalPredictiveAnalytics/R folder that was created, and name it Chapter1_LinearRegression.

Author (1)

Ralph Winters

Ralph Winters started his career as a database researcher for a music performing rights organization (he composed as well!), and then branched out into healthcare survey research, finally landing in the analytics and information technology world. He has provided his statistical and analytics expertise to many large Fortune 500 companies in the financial, direct marketing, insurance, healthcare, and pharmaceutical industries. He has worked on many diverse types of predictive analytics projects involving customer retention, anti-money laundering, voice of the customer text mining analytics, and health care risk and customer choice models. He is currently data architect for a healthcare services company working in the data and advanced analytics group. He enjoys working collaboratively with a smart team of business analysts, technologists, and actuaries, as well as with other data scientists. Ralph considers himself a practical person. In addition to authoring Practical Predictive Analytics for Packt Publishing, he has also contributed two tutorials illustrating the use of predictive analytics in medicine and healthcare in Practical Predictive Analytics and Decisioning Systems for Medicine (Miner et al., Elsevier, September 2014), and also presented Practical Text Mining with SQL using Relational Databases at the 2013 11th Annual Text and Social Analytics Summit in Cambridge, MA. Ralph resides in New Jersey with his loving wife Katherine, amazing daughters Claire and Anna, and his four-legged friends, Bubba and Phoebe, who can be unpredictable. Ralph's web site can be found at ralphwinters.com