You're reading from Practical Predictive Analytics
In the previous chapters, we introduced Spark and SparkR, with an emphasis on exploring data using SQL. In this chapter, we will begin to look at the machine learning capabilities of Spark using MLlib, the native machine learning library packaged with Spark.
In this chapter we will cover logistic regression and clustering algorithms. In the next chapter we will cover rule-based algorithms, which include decision trees. Some of this material has already been discussed in prior chapters using PC versions of R. In this chapter, as well as the next, we will focus predominantly on how to prepare your data and apply these techniques using the MLlib algorithms which exist in Spark.
We will now create our training and test datasets. The objective is to sample 80% of the data for the training set and 20% for the test set.
To speed up sampling somewhat, we can sequentially sample the tails of the sample_bin range for the test dataset and then use the middle for the training data. This is still a random sample, since sample_bin was originally generated randomly, and the sequence or range of the numbers has no bearing on the randomness.

Since we want 80% of our data to be training data, first take all of the sample_bin numbers which lie between the high and low cutoff values. We can define the cutoff range as 20% of the difference between the highest and lowest values of sample_bin, split evenly between the two tails.

Set the low cutoff as the lowest value plus half of the cutoff range defined previously, and the high cutoff as the highest value minus half of the cutoff range:
#compute the minimum and maximum values of sample bin...
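Since the snippet above is truncated, here is a rough sketch of the cutoff logic in plain Python (the sample_bin name comes from the text; the assumption that the 20% cutoff range is split evenly between the two tails is made so that the middle 80% becomes the training set):

```python
import random

# Simulate the randomly generated sample_bin column (hypothetical values 0-99)
random.seed(42)
sample_bin = [random.randint(0, 99) for _ in range(1000)]

low, high = min(sample_bin), max(sample_bin)
cutoff_range = 0.20 * (high - low)    # 20% of the span, shared by both tails
low_cutoff = low + cutoff_range / 2   # lowest value plus half the cutoff range
high_cutoff = high - cutoff_range / 2 # highest value minus half the cutoff range

# Middle of the range -> training set; the two tails -> test set
train = [b for b in sample_bin if low_cutoff <= b <= high_cutoff]
test = [b for b in sample_bin if b < low_cutoff or b > high_cutoff]
print(len(train) / len(sample_bin))  # roughly 0.8
```

Because sample_bin was generated randomly, filtering on its range preserves randomness while avoiding a second sampling pass.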
Now that we have constructed our test and training datasets, we will begin by building a logistic regression model which will predict the outcome 1 or 0. As you will recall, 1 designates diabetes detected, while 0 designates diabetes not detected.
The syntax of a Spark glm is very similar to that of a normal glm. Specify the model using formula notation, and be sure to specify family = "binomial" to indicate that the outcome variable has only two outcomes:
# run glm model on training dataset and assign it to object named "model"
model <- spark.glm(outcome ~ pregnant + glucose + pressure + triceps +
                     insulin + pedigree + age,
                   family = "binomial", maxIter = 100, data = df)
summary(model)
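For intuition, the binomial family means the fitted model maps a linear combination of the predictors through the logistic function to a probability. A minimal sketch in plain Python, using made-up coefficients rather than the fitted model's values:

```python
import math

def predict_prob(intercept, coefs, x):
    """Logistic (binomial) link: p = 1 / (1 + exp(-z))."""
    z = intercept + sum(c * v for c, v in zip(coefs, x))
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients for (glucose, age); illustration only
p = predict_prob(-4.0, [0.035, 0.02], [120, 45])
print(round(p, 3))  # 0.75
```

This is why the prediction column examined later in the chapter is a probability between 0 and 1 rather than a hard 0/1 label.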
Next, we have similar code for the test data group. Set the grp flag to 0 to designate that this is from the test group:
# run predictions on test dataset based on training model
preds_test <- predict(model, test)
preds_test$grp <- 0
preds_test$totrows <- nrow(preds_test)
Print a few rows from the results using the SparkR select function to extract several key columns:
head(SparkR::select(preds_test, preds_test$outcome, preds_test$prediction,
                    preds_test$grp, preds_test$totrows))
Next, we will combine the training (grp=1) and testing (grp=0) datasets into one dataframe and manually calculate some accuracy statistics:
- preds$error: the absolute difference between the outcome (0, 1) and the prediction. Recall that for a binary regression model, the prediction represents the probability that the event (diabetes) will occur.
- preds$errorsqr: the squared error, calculated in order to remove the sign.
- preds$correct: in order to classify the prediction as correct or not, we compare the error to a .5 cutoff. If the error is small (<= .5) we call the prediction correct; otherwise it is considered incorrect. This is a somewhat arbitrary cutoff, used to determine which category to place the prediction in.
As a final step, we will once again separate the data back into test and training sets based upon the grp flag:
#classify 'correct' prediction if error is less than or equal to .5 preds...
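The truncated snippet implements the three columns described in the list above. The same logic can be sketched in plain Python on a few hypothetical (outcome, prediction) pairs:

```python
# Hypothetical (outcome, predicted probability) pairs for illustration
rows = [(1, 0.83), (0, 0.21), (1, 0.42), (0, 0.67)]

preds = []
for outcome, prediction in rows:
    error = abs(outcome - prediction)       # absolute difference
    errorsqr = error ** 2                   # squared error removes the sign
    correct = "Y" if error <= 0.5 else "N"  # .5 cutoff classifies the prediction
    preds.append({"outcome": outcome, "prediction": prediction,
                  "error": error, "errorsqr": errorsqr, "correct": correct})

print([p["correct"] for p in preds])  # ['Y', 'Y', 'N', 'N']
```

Note that an error of exactly .5 lands on the "correct" side of the cutoff here; whichever convention you choose, apply it consistently to both groups.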
Logistic regression in SparkR lacks some of the cross-validation and other features that you may be used to in base R. However, it is a starting point for running large-scale models. If you need to employ some of the cross-validation techniques that have already been covered, you can certainly extract a sample of the data (via collect) and run the regression in base R.
However, there are some techniques that you can use to produce pseudo R-Squares and other diagnostics while continuing to work within Spark, which we will demonstrate.
We can compute the confusion, or error, matrix in order to determine how our manual calculation performed, when we classified the prediction outcomes as correct or not:
# confusion matrix
result <- sql("select outcome, correct, count(*) as k, avg(totrows) as totrows
               from preds_tbl
               where grp = 1
               group by 1, 2
               order by 1, 2")
result$classify_pct <- result$k / result$totrows
display(result)
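The same aggregation can be sketched outside Spark in plain Python; the (outcome, correct) rows below are hypothetical stand-ins for preds_tbl:

```python
from collections import Counter

# Hypothetical stand-ins for (outcome, correct) rows of preds_tbl where grp = 1
rows = [(1, "Y"), (0, "Y"), (0, "Y"), (1, "N"), (0, "N"),
        (0, "Y"), (1, "Y"), (0, "Y"), (0, "N"), (0, "Y")]

totrows = len(rows)
counts = Counter(rows)  # k = count(*) per (outcome, correct) cell

# classify_pct = k / totrows for each cell of the confusion matrix
confusion = {cell: k / totrows for cell, k in sorted(counts.items())}
for (outcome, correct), pct in confusion.items():
    print(outcome, correct, pct)
```

Each cell's percentage is its share of all rows, so the four cells always sum to 1.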
To determine the grand total of correct model predictions, sum the correct=Y rows shown previously:
Summary of correct predictions for training group:
Correctly predicted outcome=1 | 20% |
Correctly predicted outcome=0 | 59% |
Total Correct Percentage | 79% |
You can see that there is much more predictive power in predicting outcome=0 than there is for outcome=1.
The results for the test group are similar to those of the training group. Any discrepancies between test and training would warrant looking more closely at the model and observing how the data was sampled or split:
# confusion matrix for TEST group
result <- sql("select outcome, correct, count(*) as k, avg(totrows) as totrows
               from preds_tbl
               where grp = 0
               group by 1, 2
               order by 1, 2")
result$classify_pct <- result$k / result$totrows
display(result)
Add up the correct calculations in a similar way to the training group. The results are slightly lower, which is normal when comparing test results to training results:
Summary of Correct Predictions for Test Group:
Correctly predicted outcome=1 | 22% |
Correctly predicted outcome=0 | 52% |
Total Correct Percentage | 74% |
If you wish to use other tools to plot the data, you can first take a sample of the Spark data and plot it using another package such as ggplot. Note that some versions of Spark may now have ggplot integrated and available for use within Spark; however, this example demonstrates how to extract data for use by other packages.
We will take a 2% sample of all of the predictions and then print some of the results. Note that the Spark sample function has a different syntax from the base R sample function we used earlier. You could also specify it as SparkR::sample to make sure you are invoking the correct function:
local <- collect(sample(preds, FALSE, .02))
head(local)
Creating global views will also allow us to pass data between different Databricks notebooks. These views will be referenced in the next section. Use the %sql magic command as the first line in the Databricks notebook to signify that these are SQL statements:
%sql
CREATE GLOBAL TEMPORARY VIEW df_view AS SELECT * FROM df

%sql
CREATE GLOBAL TEMPORARY VIEW test_view AS SELECT * FROM test

%sql
CREATE GLOBAL TEMPORARY VIEW out_sd_view AS SELECT * FROM out_sd

%sql
CREATE GLOBAL TEMPORARY VIEW sumdf_view AS SELECT * FROM sumdf
After the views have been created, use SQL to read back the counts and verify the totals with the row counts produced for the original dataframes:
%sql
select count(*) from global_temp.df_view
union all
select count(*) from global_temp.test_view
union all
select count(*) from global_temp.sumdf_view
union all
select count(*) from global_temp.out_sd_view
We now have all the needed statistics to normalize the data. Recall that the formula for normalizing a variable x is as follows:

x_normalized = (x - mean(x)) / sd(x)
In order to implement this, we will wrap the needed computations into a function and invoke it for both the training and test datasets:
- Use the SparkR selectExpr expression to calculate the normalized version of each variable using the formula above.
- Also, create a new variable with old appended to the name, which preserves the original value of the variable. After testing, you should remove these extra variables to save space, but it is good to retain them while debugging:
normalize_it <- function(x) {
  selectExpr(x,
             "age as ageold", "(age-age_mean)/age_std as age",
             "mass as massold", "(mass-mass_mean)/mass_std as mass",
             "triceps as tricepsold", "(triceps-triceps_mean)/triceps_std as triceps",
             "pressure as pressureold", ...
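The z-score transform that normalize_it applies can be sketched in plain Python; the ages below are hypothetical, and age_mean / age_std stand in for the precomputed summary statistics:

```python
import statistics

# Hypothetical ages; mean/stdev play the role of age_mean/age_std in the text
age = [21, 35, 50, 29, 65]
age_mean = statistics.mean(age)  # precomputed summary statistic
age_std = statistics.stdev(age)  # sample standard deviation

# (age - age_mean) / age_std, as in the selectExpr expressions above
age_norm = [(a - age_mean) / age_std for a in age]

# Normalized values have (approximately) mean 0 and standard deviation 1
print(statistics.mean(age_norm), statistics.stdev(age_norm))
```

Putting every variable on this common scale matters for k-means, since the algorithm's distance computations would otherwise be dominated by variables with large raw units.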
Another way to look at the clusters is by looking directly at their mean values. We can do this directly by using SQL:
- First, look at any variables which have normalized values > 1 or < -1, or which have the highest absolute value for that variable. That will give you some clues about how to begin to classify the clusters.
- Also look at the magnitude and the signs of the coefficients. Coefficients with large absolute values can indicate an important influence of the variable on that particular cluster. Variables with opposite signs are important in terms of characterizing or naming the clusters.
tmp_agg <- SparkR::sql("SELECT prediction, mean(age), mean(triceps), mean(pregnant),
                        mean(pressure), mean(insulin), mean(glucose), mean(pedigree)
                        from fitted_tbl
                        group by 1")
head(tmp_agg)
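The per-cluster means that the SQL computes can be sketched in plain Python; the cluster labels and normalized values below are hypothetical:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (prediction, normalized age) pairs standing in for fitted_tbl
rows = [(0, -1.0), (0, -0.5), (1, 0.5), (1, 1.0), (2, 1.5)]

by_cluster = defaultdict(list)
for prediction, age in rows:
    by_cluster[prediction].append(age)

# mean(age) grouped by prediction, as in the SQL above
cluster_means = {k: mean(v) for k, v in sorted(by_cluster.items())}
print(cluster_means)  # {0: -0.75, 1: 0.75, 2: 1.5}
```

Because the inputs were normalized, a cluster mean near -1 or +1 marks a variable that sits roughly one standard deviation from the overall average, which is what makes these means useful for naming the clusters.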
Scanning through the five clusters produced, you might categorize Cluster 2 as a group consisting of younger people...
In this chapter we went beyond SQL and started to explore the machine learning capabilities of Spark. We covered both regression and k-means clustering using our diabetes dataset. We constructed our training and testing datasets, and learned how to introduce some variation into our data via simulation. A lot of Databricks visualization was covered, as well as some visualizations that used the collect() function to export the data to base R so that we could use ggplot. We also learned how to perform some regression diagnostics manually using code. We then learned how to standardize a dataset via code, and used the results to illustrate a k-means example using Spark. Finally, we looked at the resulting clusters and examined some simple interpretations.