You're reading from Practical Predictive Analytics
In the previous chapters, we introduced Spark and SparkR, with an emphasis on exploring data using SQL. In this chapter, we will begin to look at the machine learning capabilities of Spark using MLlib, the native machine learning library packaged with Spark.
In this chapter we will cover logistic regression and clustering algorithms. In the next chapter we will cover rule-based algorithms, which include decision trees. Some of this material has already been discussed in prior chapters using PC versions of R. In this chapter, as well as the next, we will focus predominantly on how to prepare your data and apply these techniques using the MLlib algorithms which exist in Spark.
We will now create our training and test datasets. The objective is to sample 80% of the data for the training set and 20% for the test set.
To speed up sampling somewhat, we can sequentially sample the tails of the sample_bin range for the test dataset and then use the middle for the training data. This is still a random sample, since sample_bin was originally generated randomly, and the sequence or range of the numbers has no bearing on the randomness.

Since we want 80% of our data to be training data, first take all of the sample_bin numbers which lie between the high and low cutoff values. We can define the cutoff range as 20% of the difference between the highest and lowest values of sample_bin, split evenly between the two tails.

Set the low cutoff as the lowest value plus half of the cutoff range defined previously, and the high cutoff as the highest value minus half of the cutoff range:
#compute the minimum and maximum values of sample bin...
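Since the snippet above is truncated, here is a rough sketch of the cutoff logic in plain Python (the sample_bin name comes from the text; the assumption that the 20% cutoff range is split evenly between the two tails is made so that the middle 80% becomes the training set):

```python
import random

# Simulate the randomly generated sample_bin column (hypothetical values 0-99)
random.seed(42)
sample_bin = [random.randint(0, 99) for _ in range(1000)]

low, high = min(sample_bin), max(sample_bin)
cutoff_range = 0.20 * (high - low)    # 20% of the span, shared by both tails
low_cutoff = low + cutoff_range / 2   # lowest value plus half the cutoff range
high_cutoff = high - cutoff_range / 2 # highest value minus half the cutoff range

# Middle of the range -> training set; the two tails -> test set
train = [b for b in sample_bin if low_cutoff <= b <= high_cutoff]
test = [b for b in sample_bin if b < low_cutoff or b > high_cutoff]
print(len(train) / len(sample_bin))  # roughly 0.8
```

Because sample_bin was generated randomly, filtering on its range preserves randomness while avoiding a second sampling pass.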
Now that we have constructed our test and training datasets, we will begin by building a logistic regression model which will predict the outcome 1 or 0. As you will recall, 1 designates diabetes detected, while 0 designates diabetes not detected.
The syntax of a Spark glm is very similar to that of a normal glm. Specify the model using formula notation, and be sure to specify family = "binomial" to indicate that the outcome variable has only two outcomes:
# run glm model on training dataset and assign it to object named "model"
model <- spark.glm(outcome ~ pregnant + glucose + pressure + triceps +
                     insulin + pedigree + age,
                   family = "binomial", maxIter = 100, data = df)
summary(model)
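For intuition, the binomial family means the fitted model maps a linear combination of the predictors through the logistic function to a probability. A minimal sketch in plain Python, using made-up coefficients rather than the fitted model's values:

```python
import math

def predict_prob(intercept, coefs, x):
    """Logistic (binomial) link: p = 1 / (1 + exp(-z))."""
    z = intercept + sum(c * v for c, v in zip(coefs, x))
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients for (glucose, age); illustration only
p = predict_prob(-4.0, [0.035, 0.02], [120, 45])
print(round(p, 3))  # 0.75
```

This is why the prediction column examined later in the chapter is a probability between 0 and 1 rather than a hard 0/1 label.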
Next, we have similar code for the test data group. Set the grp flag to 0 to designate that this is from the test group:
# run predictions on test dataset based on training model
preds_test <- predict(model, test)
preds_test$grp <- 0
preds_test$totrows <- nrow(preds_test)
Print a few rows from the results using the SparkR select function to extract several key columns:
head(SparkR::select(preds_test, preds_test$outcome, preds_test$prediction,
                    preds_test$grp, preds_test$totrows))
Next, we will combine the training (grp=1) and testing (grp=0) datasets into one dataframe and manually calculate some accuracy statistics:
- preds$error: the absolute difference between the outcome (0, 1) and the prediction. Recall that for a binary regression model, the prediction represents the probability that the event (diabetes) will occur.
- preds$errorsqr: the squared error, calculated in order to remove the sign.
- preds$correct: in order to classify the prediction as correct or not, we compare the error to a .5 cutoff. If the error is small (<= .5) we call the prediction correct; otherwise it is considered incorrect. This is a somewhat arbitrary cutoff, used to determine which category to place the prediction in.
As a final step, we will once again separate the data back into test and training sets based upon the grp flag:
#classify 'correct' prediction if error is less than or equal to .5 preds...
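The truncated snippet implements the three columns described in the list above. The same logic can be sketched in plain Python on a few hypothetical (outcome, prediction) pairs:

```python
# Hypothetical (outcome, predicted probability) pairs for illustration
rows = [(1, 0.83), (0, 0.21), (1, 0.42), (0, 0.67)]

preds = []
for outcome, prediction in rows:
    error = abs(outcome - prediction)       # absolute difference
    errorsqr = error ** 2                   # squared error removes the sign
    correct = "Y" if error <= 0.5 else "N"  # .5 cutoff classifies the prediction
    preds.append({"outcome": outcome, "prediction": prediction,
                  "error": error, "errorsqr": errorsqr, "correct": correct})

print([p["correct"] for p in preds])  # ['Y', 'Y', 'N', 'N']
```

Note that an error of exactly .5 lands on the "correct" side of the cutoff here; whichever convention you choose, apply it consistently to both groups.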
Logistic regression in SparkR lacks some of the cross-validation and other features that you may be used to in base R. However, it is a starting point for running large-scale models. If you need to employ some of the cross-validation techniques that have already been covered, you can certainly extract a sample of the data (via collect) and run the regression in base R.
However, there are some techniques that you can use to produce pseudo R-Squares and other diagnostics while continuing to work within Spark, which we will demonstrate.
We can compute the confusion, or error, matrix in order to determine how our manual calculation performed, when we classified the prediction outcomes as correct or not:
# confusion matrix
result <- sql("select outcome, correct, count(*) as k, avg(totrows) as totrows
               from preds_tbl
               where grp = 1
               group by 1, 2
               order by 1, 2")
result$classify_pct <- result$k / result$totrows
display(result)
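The same aggregation can be sketched outside Spark in plain Python; the (outcome, correct) rows below are hypothetical stand-ins for preds_tbl:

```python
from collections import Counter

# Hypothetical stand-ins for (outcome, correct) rows of preds_tbl where grp = 1
rows = [(1, "Y"), (0, "Y"), (0, "Y"), (1, "N"), (0, "N"),
        (0, "Y"), (1, "Y"), (0, "Y"), (0, "N"), (0, "Y")]

totrows = len(rows)
counts = Counter(rows)  # k = count(*) per (outcome, correct) cell

# classify_pct = k / totrows for each cell of the confusion matrix
confusion = {cell: k / totrows for cell, k in sorted(counts.items())}
for (outcome, correct), pct in confusion.items():
    print(outcome, correct, pct)
```

Each cell's percentage is its share of all rows, so the four cells always sum to 1.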
To determine the grand total of correct model predictions, sum the correct=Y rows shown previously:
Summary of correct predictions for training group:
Correctly predicted outcome=1 | 20% |
Correctly predicted outcome=0 | 59% |
Total Correct Percentage | 79% |
You can see that there is much more predictive power in predicting outcome=0 than there is for outcome=1.
The results for the test group are similar to those of the training group. Any discrepancies between test and training would warrant looking more closely at the model and observing how the data was sampled or split:
# confusion matrix for TEST group
result <- sql("select outcome, correct, count(*) as k, avg(totrows) as totrows
               from preds_tbl
               where grp = 0
               group by 1, 2
               order by 1, 2")
result$classify_pct <- result$k / result$totrows
display(result)
Add up the correct calculations in a similar way to the training group. The results are slightly lower, which is normal when comparing test results to training results:
Summary of Correct Predictions for Test Group:
Correctly predicted outcome=1 | 22% |
Correctly predicted outcome=0 | 52% |
Total Correct Percentage | 74% |
If you wish to use other tools to plot the data, you can first take a sample of the Spark data and plot it using another package such as ggplot. Note that some versions of Spark may now have ggplot integrated and available for use within Spark; however, this example demonstrates how to extract data for use by other packages.
We will take a 2% sample of all of the predictions and then print some of the results. Note that the Spark sample function has a different syntax from the base R sample function we used earlier. You could also specify it as SparkR::sample to make sure you are invoking the correct function:
local <- collect(sample(preds, FALSE, .02))
head(local)
Creating global views will also allow us to pass data between different Databricks notebooks. These views will be referenced in the next section. Use the %sql magic command as the first line in the Databricks notebook to signify that these are SQL statements:
%sql
CREATE GLOBAL TEMPORARY VIEW df_view AS SELECT * FROM df

%sql
CREATE GLOBAL TEMPORARY VIEW test_view AS SELECT * FROM test

%sql
CREATE GLOBAL TEMPORARY VIEW out_sd_view AS SELECT * FROM out_sd

%sql
CREATE GLOBAL TEMPORARY VIEW sumdf_view AS SELECT * FROM sumdf
After the views have been created, use SQL to read back the counts and verify the totals with the row counts produced for the original dataframes:
%sql
select count(*) from global_temp.df_view
union all
select count(*) from global_temp.test_view
union all
select count(*) from global_temp.sumdf_view
union all
select count(*) from global_temp.out_sd_view
We now have all the needed statistics to normalize the data. Recall that the formula for normalizing a variable x is as follows:

x_normalized = (x - mean(x)) / sd(x)
In order to implement this, we will wrap the needed computations into a function and invoke it for both the training and test datasets:
- Use the SparkR selectExpr expression to calculate the normalized version of each variable using the formula above.
- Also, create a new variable with old appended to the name, which preserves the original value of the variable. After testing, you should remove these extra variables to save space, but it is good to retain them while debugging:
normalize_it <- function(x) {
  selectExpr(x,
             "age as ageold", "(age-age_mean)/age_std as age",
             "mass as massold", "(mass-mass_mean)/mass_std as mass",
             "triceps as tricepsold", "(triceps-triceps_mean)/triceps_std as triceps",
             "pressure as pressureold", ...
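The z-score transform that normalize_it applies can be sketched in plain Python; the ages below are hypothetical, and age_mean / age_std stand in for the precomputed summary statistics:

```python
import statistics

# Hypothetical ages; mean/stdev play the role of age_mean/age_std in the text
age = [21, 35, 50, 29, 65]
age_mean = statistics.mean(age)  # precomputed summary statistic
age_std = statistics.stdev(age)  # sample standard deviation

# (age - age_mean) / age_std, as in the selectExpr expressions above
age_norm = [(a - age_mean) / age_std for a in age]

# Normalized values have (approximately) mean 0 and standard deviation 1
print(statistics.mean(age_norm), statistics.stdev(age_norm))
```

Putting every variable on this common scale matters for k-means, since the algorithm's distance computations would otherwise be dominated by variables with large raw units.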
Another way to look at the clusters is by looking directly at their mean values. We can do this directly by using SQL:
- First, look at any variables which have normalized values > 1 or < -1, or which have the highest absolute value for that variable. That will give you some clues about how to begin to classify the clusters.
- Also look at the magnitude and the signs of the coefficients. Coefficients with large absolute values can indicate an important influence of the variable on that particular cluster. Variables with opposite signs are important in terms of characterizing or naming the clusters.
tmp_agg <- SparkR::sql("SELECT prediction, mean(age), mean(triceps), mean(pregnant),
                        mean(pressure), mean(insulin), mean(glucose), mean(pedigree)
                        from fitted_tbl
                        group by 1")
head(tmp_agg)
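The per-cluster means that the SQL computes can be sketched in plain Python; the cluster labels and normalized values below are hypothetical:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (prediction, normalized age) pairs standing in for fitted_tbl
rows = [(0, -1.0), (0, -0.5), (1, 0.5), (1, 1.0), (2, 1.5)]

by_cluster = defaultdict(list)
for prediction, age in rows:
    by_cluster[prediction].append(age)

# mean(age) grouped by prediction, as in the SQL above
cluster_means = {k: mean(v) for k, v in sorted(by_cluster.items())}
print(cluster_means)  # {0: -0.75, 1: 0.75, 2: 1.5}
```

Because the inputs were normalized, a cluster mean near -1 or +1 marks a variable that sits roughly one standard deviation from the overall average, which is what makes these means useful for naming the clusters.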
Scanning through the five clusters produced, you might categorize Cluster 2 as a group consisting of younger people...
In this chapter we went beyond SQL and started to explore the machine learning capabilities of Spark. We covered both regression and k-means clustering using our diabetes dataset. We constructed our training and testing datasets, and learned how to introduce some variation into our data via simulation. A lot of Databricks visualization was covered, as well as some visualizations that used the collect() function to export the data to base R so that we could use ggplot. We also learned how to perform some regression diagnostics manually using code. We then learned how to standardize a dataset via code, and used the results to illustrate a k-means example using Spark. Finally, we looked at the resulting clusters and examined some simple interpretations.