Chapter 3. Inputting and Exploring Data

"On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question."

-Charles Babbage

In this chapter, we will cover inputting and exploring data. In the first two chapters, we worked with datasets that already reside within R packages and purposefully avoided reading any external data sources. Now we will read them. The inputting data section covers various mechanisms for reading your own data into R.

The exploring data section covers some techniques you can use to complete the second and third steps of the CRISP-DM process we covered in the last chapter: data understanding and data preparation.

The topics we will cover include the following:

  • Getting data into R
  • Generating your own data
  • Munging and joining data
  • Data cleaning techniques
  • Data...

Data input


Data by itself is just a raw stream of numbers; it is the analytics process that turns this raw material into knowledge, but before we can start understanding it, we have to be able to obtain it. The number of ways in which we can now generate data has grown exponentially. Progressing from fixed-length formats, through HTML, to free-form unstructured input, and on to today's schema-on-read technologies, there are so many different data formats today that there is a very good chance you haven't worked with some of them, and never will. Reading data in and understanding the variables and what the data represents can also be incredibly frustrating. Integrating the data with other sources, both internal and external, can seem like a jigsaw puzzle at times; the data will not always fit together as nicely as you would hope.

However, with regard to raw input, most people will work with a couple of common formats in the course of their work, and it will be useful...

Joining data


If you need to bring together different data sources, SQL is one method for doing so. As mentioned, SQL syntax is common to many environments, so if you learn SQL syntax in R, you have also started to learn how data is processed in other environments. But do not restrict yourself to just SQL. Other options exist for joining data, such as the merge statement; merge is a native R function that accomplishes the same objective, and some other packages handle data integration fairly well. I will also be using the dplyr package to perform some of the same tasks that could be done in SQL.

The sqldf package is an R package that lets you use standard SQL syntax to merge, or join, two tables together. For relational data, this is accomplished by associating a variable on one table (the primary key) with a similar variable on another associated table. Note that I am using the term table in the context of a relational database environment. In the R environment, an SQL table is...
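As a minimal sketch of how these options line up (the members and purchases data frames and their member_id key below are made up purely for illustration, not the book's data), the same inner join can be written with sqldf, with the native merge function, or with dplyr:

# two small example tables sharing a member_id key (illustrative values)
members   <- data.frame(member_id = c(1, 2, 3), name = c("Ann", "Bob", "Cal"))
purchases <- data.frame(member_id = c(1, 1, 3), amount = c(20, 35, 50))

# 1. SQL syntax via the sqldf package
library(sqldf)
joined_sql <- sqldf("SELECT m.member_id, m.name, p.amount
                     FROM members m
                     INNER JOIN purchases p ON m.member_id = p.member_id")

# 2. the native merge() function -- same inner join
joined_merge <- merge(members, purchases, by = "member_id")

# 3. dplyr's inner_join() -- same result again
library(dplyr)
joined_dplyr <- inner_join(members, purchases, by = "member_id")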

Exploring the hospital dataset


Exploratory data analysis is a preliminary step prior to data modeling in which you look at all of the characteristics of the data in order to get a sense of its distribution, correlations, missing values, outliers, and any other factors that might impact future analyses. It is a very important step and, if performed diligently, will save you a lot of time later on.

For the following examples, we will read the NYC hospital discharges dataset (Hospital Inpatient Discharges (SPARCS De-Identified): 2012, n.d.). This example uses the read.csv function to input the delimited file, then uses the View function to graphically display the output. Next, the str function displays the structure of the df dataframe that was just created, and finally, the summary function displays the relevant statistics for all of the variables. These are all typical first steps to perform when looking at data for the first time:

df <-read.csv("C:/PracticalPredictiveAnalytics...
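A runnable sketch of that same first-look workflow, assuming the discharge file has been downloaded locally (the filename below is a placeholder, not the book's actual path):

# read the delimited file into a dataframe (placeholder path)
df <- read.csv("hospital_discharges_2012.csv", stringsAsFactors = FALSE)

# open a spreadsheet-style viewer on the data (RStudio or R GUI)
View(df)

# display the structure: variable names, types, and a preview of values
str(df)

# summary statistics for every variable
summary(df)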

Transposing a dataframe


You will sometimes be given data that is arranged vertically when you want it flipped so that the variables are arranged horizontally. You will also hear this referred to as long format versus wide format. Most predictive analytics packages are set up to use long format, but there are often cases in which you want to switch rows with columns. Perhaps data is being input as a set of key-value pairs and you want to be able to map them to features for an individual entity. This may also be necessary with some time series data, in which data that comes in as long format needs to be reformatted so that the time periods appear horizontally.

Here is a data frame that consists of sales for each member for each month in the first quarter. We will use the text= option of the read.table() function to read table data that we have pasted directly into the code, for example, data that has been pasted directly from an Excel spreadsheet:

sales_vertical...
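As a minimal sketch of the idea (the member names and sales figures below are invented for illustration, not the book's data), read the pasted text with the text= option and then use t() to switch rows and columns so that the months run horizontally:

# first-quarter sales arranged vertically, one row per month (invented values)
sales_vertical <- read.table(text = "
month Ann Bob Cal
Jan    10   8  20
Feb    12   9  18
Mar    15  11  22
", header = TRUE)

# drop the month column, transpose, and reattach the months as column names
sales_horizontal <- t(sales_vertical[, -1])
colnames(sales_horizontal) <- as.character(sales_vertical$month)
sales_horizontal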

Missing values


Missing values denote the absence of a value for a variable. Since data can never be collected in a perfect manner, many missing values can appear due to human oversight, or can be introduced via any systematic process that touches a data element. It can be due to a survey respondent not completing a question, or, as we have seen, it can be created from joining a membership file with a transaction file. In this case, if a member did not have a purchase in a particular year, it might end up as NA or missing.

The first course of action for handling missing values is to understand why they are occurring. In the course of plotting missing values, you not only want to produce counts of missing values, but also to determine which sub-segments may be responsible for them.

To research this, attempt to break out your initial analysis by time periods and other attributes using some of the bivariate analysis techniques that have been mentioned. This will help you to identify...
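As a small sketch of that kind of breakout (the df dataframe and its amount and region columns are hypothetical names, used only for illustration), you can count missing values per variable and then look at the proportion missing within each sub-segment:

# number of missing values in each column
colSums(is.na(df))

# proportion of missing purchase amounts within each region (hypothetical columns)
tapply(is.na(df$amount), df$region, mean)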

Imputing categorical variables


Imputing categorical variables can be trickier than imputing numeric variables. Numeric imputation is based upon random variates, but imputation of categorical variables is based upon statistical tests with less power, such as Chi-square, or upon rules, so if you end up imputing categorical variables, use the results with caution and run them past some domain experts to see if they make sense. You can use decision trees or random forests to come up with a prediction path for your missing values, and map them to a reasonable prediction value using the actual decision rules generated by the tree, as sketched below.
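A minimal sketch of the tree-based approach, assuming a dataframe df with a categorical gender column containing missing values and two fully populated predictors, age and region (all hypothetical names):

library(rpart)

# fit a classification tree on the rows where the category is known
fit <- rpart(gender ~ age + region,
             data = df[!is.na(df$gender), ],
             method = "class")

# inspect the decision rules the tree generated before trusting them
print(fit)

# use those rules to fill in the most likely class for the missing rows
missing_rows <- is.na(df$gender)
df$gender[missing_rows] <- predict(fit, newdata = df[missing_rows, ], type = "class")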

Outliers


Outliers are values in the data that are outside the range of what is to be expected. "What is to be expected?" is, of course, subjective. Some people will define an outlier as anything beyond three standard deviations of a normal distribution, or anything beyond 1.5 times the interquartile range. These may be good starting points, but there are many examples of real data that defy any statistical explanation. These rules of thumb are also highly dependent upon the form of the data: what might be considered an outlier for a normal distribution would not hold for a lognormal or Poisson distribution.
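As a quick illustration of those two rules of thumb on a single numeric vector (the values here are simulated, not from the book's dataset):

set.seed(1)
x <- c(rnorm(100, mean = 50, sd = 5), 120)   # 100 typical values plus one extreme value

# 1.5 times the interquartile range beyond the quartiles
q <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]
outliers_iqr <- x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]

# more than three standard deviations from the mean
outliers_sd <- x[abs(x - mean(x)) > 3 * sd(x)]

outliers_iqr
outliers_sd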

In addition to potential single variable outliers, outliers can also exist in multivariate form, and are more prevalent as data is examined more closely in a high-dimensional space.

Whenever they appear, outliers should be examined closely since they may be simple errors or provide valuable insight. Again, it is best to consult with other collaborators when you suspect deviation...

Data transformations


When you are dealing with continuous skewed data, consider applying a data transformation, which can conform the data to a specific statistical distribution with certain properties. Once you have forced the data to a certain shape, you will find it easier to work with certain models. A simple transformation usually involves applying a mathematical function to the data.

Some of the typical data transformations used are log, exp, and sqrt. Each works better for different kinds of skewed data, but none is guaranteed to work, so it is always best practice to try out several basic ones and determine whether the transformed data becomes workable within the modeling context. As always, the simplest transformation is the best transformation; do some research on how transformations work and which ones are best for certain kinds of data.

To illustrate the concept of a transformation, we will start by generating an exponential distribution, which is an example of a...
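A small sketch of that idea using simulated values (not the book's exact code): draw a right-skewed exponential sample and compare its histogram to the histogram of its log transform:

set.seed(10)
skewed <- rexp(1000, rate = 1)   # right-skewed exponential sample

# compare the raw and log-transformed shapes side by side
par(mfrow = c(1, 2))
hist(skewed, main = "Raw exponential data", xlab = "value")
hist(log(skewed), main = "After log transform", xlab = "log(value)")
par(mfrow = c(1, 1))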

Variable reduction/variable importance


Variable reduction techniques allow you to reduce the number of variables that you need to specify to a model. We will discuss three different methods to accomplish this.

  1. Principal Components Analysis (PCA)
  2. All subsets regression
  3. Variable importance

Principal Components Analysis (PCA)

Principal Components Analysis (PCA) is a variable reduction technique that can also be used to identify variable importance. An interesting benefit of PCA is that all of the resulting new component variables will be uncorrelated with each other. Uncorrelated variables are desirable in a predictive model, since too many correlated variables confound predictions and make it difficult to tell which of the independent variables have the most influence. So, if you first perform an exploratory analysis of your data and find that a high number of correlations exist, this would be a good opportunity to apply PCA.

Note

Models can tolerate some degree of correlated variables...
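As a small sketch of how PCA might look in base R, using the built-in mtcars data purely for illustration (the book works through its own dataset):

# scale the variables so that no single column dominates the components
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# proportion of variance explained by each component
summary(pca)

# loadings: how the original variables combine into each component
pca$rotation[, 1:3]

# the uncorrelated component scores that can replace the original variables
head(pca$x[, 1:3])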

References


Hospital Inpatient Discharges (SPARCS De-Identified): 2012. (n.d.). Retrieved from https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/u4ud-w55t.

Summary


In this chapter, we learned all about getting data prepared for analysis so that you can start to run models. We started with inputting external data in raw form and saw several methods available for accomplishing this. You also learned how to generate your own data, as well as two different ways to join, or munge, data together: one using SQL and the other using dplyr functions.

We later proceeded to cover some basic data cleaning and data exploration techniques that are sometimes needed after your data is input, such as standardizing and transposing the data, changing variable types, creating dummy variables, binning, and eliminating redundant data. You now know about the key R functions that are used to take a first glance at the contents of the data, as well as its structure.

We then covered the important concepts of analyzing missing values and outliers, and how to handle them.

We saw a few ways to decrease the number of variables to a manageable...
