Chapter 10. Exploring Large Datasets Using Spark

"I never guess. It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts."

- Sir Arthur Conan Doyle

In this chapter, we will begin to perform some exploratory data analysis on the Spark dataframe we created in the previous chapter. We will learn about some specific Spark commands that will assist you in your analysis, and will discuss several ways to perform graphing and plotting.

As you go through these examples, remember that the data residing in Spark may be much larger than you are used to, and that it may be impractical to apply some quick analytic techniques without first considering how the data is organized and how performance will be affected when using standard techniques.

If you are picking up where you left off in the previous chapter, you will have to load the saved Spark data frame before you begin. Recall that we saved the results of the diabetes...
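
If you need to reload the saved dataframe, a minimal sketch along the following lines should work. The storage path and the Parquet format are assumptions, so substitute whatever location and format you used when saving out_sd; the sqlContext-style call matches the Spark 1.x API used elsewhere in this chapter:

# Hedged sketch: reload the Spark dataframe saved in the previous chapter
# "/mnt/data/out_sd" is a placeholder path, not the book's actual location
out_sd <- SparkR::read.df(sqlContext, "/mnt/data/out_sd", source = "parquet")
nrow(out_sd)   # confirm the expected row count before continuing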

Performing some exploratory analysis on positives


Before we move on to exploring the entire Spark dataframe, we can look at some of the data already generated for positive cases. As you may recall from the prior chapter, this is stored in the Spark dataframe out_sd1.

We have generated some random sample bins specifically so that we can do some exploratory analysis.

We can use the filter command to extract random sample 1, and take the first 1,000 records:

  • filter is a SparkR command that allows you to subset a Spark dataframe
  • display is a Databricks command equivalent to the View command we used previously; you can also use the head function to limit the number of rows displayed

This code chunk extracts the first 1,000 records from random sample bin 1 of the positives, checks the row count, and displays them:

        small_pos <- head(SparkR::filter(out_sd1,out_sd1$sample_bin==1),1000) 
        nrow(small_pos)

        display(small_pos) 

The data appears in tabular form, and you can scroll up/down...

Cleaning up and caching the table in memory


Since Spark excels at processing in-memory data, we will first remove our intermediary data and then cache our out_sd dataframe, so that subsequent queries run much faster. Caching data in memory works best when similar types of queries are repeated; that way, Spark can manage memory so that most of what you need stays resident.

However, this is not foolproof. Good Spark query and table design will help with optimization, but out-of-the-box caching usually gives some benefit. Often, the first queries will not benefit from memory caching, but subsequent queries will run much faster.

Since we will no longer use the intermediary dataframes we created, we will remove them with the rm function, and then use the cache() function on the full dataframe:

#cleanup and cache df 
rm(out_sd1) 
rm(out_sd2) 
cache(out_sd)  

Some useful Spark functions to explore your data


count and groupBy

We can also use the count and groupBy functions to aggregate individual variables.

Here is an example of using these functions to tally the number of observations by outcome. Since the result is another dataframe, we can use the head function to write the results to the console.

Note

You might have to alter the number of rows returned by head if you change the query. It is always a good idea to filter results using a function such as head, to make sure that you are not printing hundreds of rows (or more).

However, you also need to ensure that you do not cut off all of your output. If you are unsure of the number of rows, assign the result to a dataframe and check its row count (with nrow) before printing:

This line of code counts the number of rows by outcome. I know that there should be only two outcomes, but I place the count function within a head statement just to program defensively:

head(SparkR::count(groupBy(out_sd, "outcome")))...
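
To make the defensive pattern above concrete, here is a minimal sketch; the variable name outcome_counts is purely illustrative and not from the book:

# Materialize the grouped counts as a (small) Spark dataframe first,
# check how many rows it has, and only then print it
outcome_counts <- SparkR::count(SparkR::groupBy(out_sd, "outcome"))
nrow(outcome_counts)        # expect 2 rows, one per outcome
head(outcome_counts, 10)    # head() caps the output even if there are more rows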

Creating new columns


It is often necessary to create new transformations of existing variables in order to improve a prediction. We have already seen that binning is frequently used to create a nominal variable from a quantitative one.

Let's create a new column, called agecat, which divides age into two segments. To keep things simple, we will start off by rounding the age to the nearest integer.

filtered <- SparkR::filter(out_sd, "age > 0 AND insulin > 0") 
filtered$age <- round(filtered$age,0) 
filtered$agecat <- ifelse(filtered$age <= 35,"<= 35","35 Or Older") 
SparkR::head(SparkR::select(filtered, "age","agecat")) 

In the code you just ran, you may notice that some commands are prefaced by SparkR::

This tells R which package's version of the function we wish to apply. It is good practice to preface commands in this way, in order to avoid syntax errors and the misapplication of identically named functions that exist both in SparkR...
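
For example, here is a sketch of the kind of name clash the prefix guards against. The dplyr package is used purely as an illustration of a package that also defines filter and select; it is not required anywhere in this chapter:

# If another package that defines filter() is attached after SparkR,
# the unqualified name can resolve to the wrong package
library(dplyr)                                             # dplyr::filter now masks SparkR::filter
filtered_spark <- SparkR::filter(out_sd, out_sd$age > 0)   # unambiguous: runs on Spark
# filter(out_sd, out_sd$age > 0)                           # ambiguous: may dispatch to dplyr and fail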

Constructing a cross-tab


Now that we have categorized age, we can run a cross-tab which counts outcomes by the age category.

Since there are only two outcomes and two age categories, this results in a four-cell crosstab:

  1. First, display the results using the special Databricks display command.
  2. After the results appear as shown in the table below, click the plot button (the second icon on the bottom left); the Customized Plot dialog will appear, which allows the results to be plotted as a bar chart. The plot shows that diabetes occurs more frequently in the higher age group than in the lower age group, while the reverse is true for the non-diabetes group:
        table <- crosstab(filtered, "outcome", "agecat") 
        display(as.data.frame(table)) 

Contrasting histograms


Histograms are also a quick way to visually inspect and compare outcome variables.

Here is another example of using the Spark histogram function, this time to contrast the distribution of body mass index for diabetic versus non-diabetic patients in the study. For the first bar chart, we can see a peak bar at about 38.9 BMI, versus a peak bar at 29.8 for non-diabetic patients. This suggests that BMI will be an important variable in any model we develop:

This code uses the SparkR histogram function to compute a histogram with 10 bins. The centroids give the center value of each of the 10 bins. The most frequently occurring bar has a center value of 38.9, with a count of about 50,000. This type of histogram is useful for quickly getting a sense of a variable's distribution, but it is somewhat lacking when it comes to labeling and controlling various elements such as scales and ranges. If you wish to fine-tune some of these elements, you may want to start by using a collect() function...
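
Since the histogram call itself is cut off above, here is a hedged reconstruction of what it might look like. The column name mass and the object name age_hist are assumptions taken from the ggplot example in the next section; SparkR::histogram returns a small local data frame containing the bins, counts, and centroids for the requested column:

# Hedged sketch: compute a 10-bin histogram of the BMI (mass) column on the Spark side
# and bring back only the summarized bins, counts, and centroids
age_hist <- SparkR::histogram(out_sd, "mass", nbins = 10)
age_hist    # inspect the centroids and counts before plotting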

Plotting using ggplot


If you prefer to use ggplot to plot your results, load the ggplot2 package and run your plots against the output of your Spark computations. You will have more opportunity for further customization.

Here is a basic ggplot that corresponds to the histogram() function illustrated above:

require(ggplot2) 
plot <- ggplot(age_hist, aes(x = centroids, y = counts)) + 
       geom_bar(stat = "identity") + 
      xlab("mass") + ylab("Frequency")   

plot 

Spark SQL


Another way to explore data in Spark is by using Spark SQL. This allows analysts who may not be well-versed in language-specific APIs, such as SparkR, PySpark (for Python), and Scala, to explore Spark data.

I will describe two different ways of accessing Spark data via SQL:

  • Issuing SQL commands through the R interface:

This has the advantage of returning the results as an R dataframe, where they can be further manipulated

  • Issuing SQL queries via the Databricks %sql magic directive

This method allows analysts to issue direct SQL commands, without regard to any specific language environment

Before processing an object as SQL, the object needs to be registered as a SQL table or view. Once it is registered, it can be accessed through the SQL interface of any language API.

Once registered, you can issue a SHOW TABLES command to get a list of the registered tables for your session.

Registering tables

#register out_sd as a table

SparkR:::registerTempTable(out_sd,"out_tbl")
SparkR:::cacheTable(sqlContext...
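
Once out_tbl is registered, here is a hedged sketch of the two access paths described above. The query itself is illustrative rather than taken from the book, and the sqlContext-style call matches the registerTempTable usage above (later Spark versions drop the context argument):

# 1. Through the R interface: the result comes back as a Spark dataframe,
#    which can then be collected into a local R data frame
sql_df <- SparkR::sql(sqlContext, "SELECT outcome, COUNT(*) AS n FROM out_tbl GROUP BY outcome")
local_df <- SparkR::collect(sql_df)

# 2. In a Databricks notebook, the same query can be issued directly in a
#    separate cell using the %sql magic, with no R code involved:
# %sql
# SELECT outcome, COUNT(*) AS n FROM out_tbl GROUP BY outcome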

Exporting data from Spark back into R


It will often be the case that some of the analysis you wish to perform is not available within SparkR, and you will need to extract some of the data from the Spark objects and return it to base R.

For example, we were able to run correlation and covariance functions earlier directly on a Spark dataframe, by specifying particular pairs of variables. However, we did not generate correlation matrices for the entire dataframe, for a couple of reasons:

  • The capability to do this may not be built into the version of Spark that you are currently running

  • Even if it were available, these kinds of calculations can be very computationally expensive to perform

One strategy is to use Spark functions to explore the basic characteristics of the data, and/or to utilize specialized packages written for Spark (such as MLlib) for these tasks.

For other cases, in which you want to perform more in-depth analysis, simply extract a sample from the Spark dataframe,...
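
A minimal sketch of that extraction step is shown below. The sample fraction, seed, and the name samp are assumptions; samp is the local data frame reused in the pairs() and correlation examples that follow:

# Draw a small fraction of rows on the Spark side, then collect() the result
# back into an ordinary R data frame for local analysis
samp <- SparkR::collect(SparkR::sample(out_sd, withReplacement = FALSE, fraction = 0.01, seed = 123))
nrow(samp)    # verify the sample is a manageable size before analyzing it locally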

Running local R packages


Once you have extracted your sample, you can run normal R functions such as pairs to generate a scatterplot matrix, or use the reshape2 package along with ggplot to generate a correlation plot.

Using the pairs function (available in the base package)

#this takes the data frame we collect()ed from Spark and produces a basic scatterplot matrix of columns 3 through 8, colored by outcome

pairs(samp[,3:8], col=samp$outcome) 

Generating a correlation plot

Here is a more sophisticated visualization that uses ggplot to generate a correlation plot, with shading indicating the degree of correlation for each pair of intersecting variables. Again, the point is to emphasize that you can perform analysis outside of Spark if your sample size is reasonable and the exact functionality you need is not available in the version of Spark you are running.

require(ggplot2)
library(reshape2)
cormatrix <- round(cor(samp),2)
cormatrix_melt <- melt(cormatrix)
head(cormatrix_melt)
ggplot(data = cormatrix_melt...
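
The plotting call above is cut off, so here is a hedged reconstruction of how the tile plot might be completed; the diverging colour scale is an assumption, not necessarily the book's exact code:

# Shade each pair of variables by its correlation value
ggplot(data = cormatrix_melt, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       limits = c(-1, 1), name = "Correlation")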

Some tips for using Spark


Take a look at the following tips:

  • Sample when possible. Use the sample_bin methodology and the filter command liberally. Sampling speeds things up both in the analysis phase and in the development/testing phase.

  • Once testing has been completed on a smaller segment, it can be scaled up to a much larger population with confidence.

  • Preprocess the data so that you can subselect potentially interesting subsegments.

  • Cache analysis when it makes sense.

  • If performance becomes a factor, try a larger number of partitions in your data (see the repartition sketch after this list).

  • For larger number crunching, bring back a representative sample to local R.
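
For the partitioning tip, a minimal sketch is shown below; the partition count of 200 is purely illustrative and should be tuned to your cluster and data size:

# Redistribute the dataframe across more partitions so work spreads over more tasks
out_sd <- SparkR::repartition(out_sd, numPartitions = 200)
SparkR::cache(out_sd)    # re-cache after repartitioning if you rely on caching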

Summary


In this chapter, we learned the basics of exploring Spark data, using some Spark-specific commands that allowed us to filter, group, and summarize our Spark data.

We also learned about the ability to visualize data directly in Spark, along with learning how to run R functions such as ggplot against data.

We learned about some strategies for working with Spark data, such as performing intelligent filtering and sampling.

Finally, we demonstrated that often we need to extract some Spark data back into local R if we want the flexibility to use some of our usual tools that may not be supplied natively in the Spark environment.

In the next chapter, we will delve into the various predictive models that you can use that are specific to large datasets.
