You're reading from  Practical Predictive Analytics

Product type: Book
Published in: Jun 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781785886188
Edition: 1st Edition
Ralph Winters

Ralph Winters started his career as a database researcher for a music performing rights organization (he composed as well!), and then branched out into healthcare survey research, finally landing in the analytics and information technology world. He has provided his statistical and analytics expertise to many large Fortune 500 companies in the financial, direct marketing, insurance, healthcare, and pharmaceutical industries. He has worked on many diverse types of predictive analytics projects involving customer retention, anti-money laundering, voice of the customer text mining analytics, and health care risk and customer choice models. He is currently data architect for a healthcare services company working in the data and advanced analytics group. He enjoys working collaboratively with a smart team of business analysts, technologists, and actuaries, as well as with other data scientists. Ralph considers himself a practical person. In addition to authoring Practical Predictive Analytics for Packt Publishing, he has also contributed two tutorials illustrating the use of predictive analytics in medicine and healthcare to Practical Predictive Analytics and Decisioning Systems for Medicine (Miner et al., Elsevier, September 2014), and also presented Practical Text Mining with SQL using Relational Databases at the 2013 11th Annual Text and Social Analytics Summit in Cambridge, MA. Ralph resides in New Jersey with his loving wife Katherine, amazing daughters Claire and Anna, and his four-legged friends, Bubba and Phoebe, who can be unpredictable. Ralph's web site can be found at ralphwinters.com

Chapter 6. Using Survival Analysis to Predict and Analyze Customer Churn

"In an Infinite Universe anything can happen," said Ford, "Even survival. Strange but true."

- Douglas Adams, The Restaurant at the End of the Universe (1980)

What is survival analysis?


Survival analysis covers a broad range of topics. Here is the list of topics that we will cover in this chapter:

  • Survival analysis
  • Time-based variables and regression
  • R survival objects
  • Customer attrition or churn
  • Survival curves
  • Cox regression
  • Plotting methods
  • Variable selection
  • Model concordance

Often, predictive analytics problems involve tracking important events along a customer's journey and predicting when these events will occur. Survival analysis is a form of analysis that is based upon the concept of time to event. The time to event is simply the number of units of time that have elapsed until something happens. The event can be just about anything: a car crash, a stock market crash, or a devastating phenomenon.

Survival analysis originated in the study of patients who developed terminal diseases, such as cancer, hence the term survival. However, conceptually, it can even be applied to marketing applications in which you...
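As a minimal sketch of the time-to-event idea, the snippet below derives an elapsed time and an event flag from sign-up and cancellation dates. The dates, the column names, and the study end date are all hypothetical and are not part of the chapter's dataset; customers with no cancellation date are simply treated as still active at the end of the study.

```r
# Hypothetical study end date and customer records (illustrative values only)
study_end <- as.Date("2017-06-30")
customers <- data.frame(
  signup = as.Date(c("2017-01-01", "2017-02-15", "2017-03-10")),
  cancel = as.Date(c("2017-04-01", NA, "2017-05-10"))
)

# Event flag: 1 if the customer cancelled during the study, 0 otherwise
customers$event <- as.integer(!is.na(customers$cancel))

# Time to event: days until cancellation, or until the study ended
end_date <- customers$cancel
end_date[is.na(end_date)] <- study_end
customers$days <- as.numeric(end_date - customers$signup)

print(customers)
```

The second customer never cancelled, so the elapsed time only records how long that customer was observed, not how long the subscription ultimately lasted.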

Our customer satisfaction dataset


In this chapter, we will look at a dataset of hypothetical customers who subscribe to an online service and who responded to a customer satisfaction survey before the study began. The survey responses were matched to transactional and demographic data to produce a simple analysis dataset. It contains an event variable (churn), which indicates whether or not a customer unsubscribed from the service, some transaction data (number of purchases last month), some demographic data (gender, educational level), and the overall satisfaction score from the survey.
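To make the shape of such a dataset concrete, here is a small simulated stand-in. It is only a sketch: the column names follow the variables named in this chapter, but the dataframe name ChurnToy, the sample size, the distributions, the education categories, and the seed are assumptions chosen for illustration and do not reproduce the book's data.

```r
set.seed(42)   # arbitrary seed so the sketch is reproducible
n <- 200       # hypothetical number of customers

ChurnToy <- data.frame(
  Xtenure2         = sample(1:12, n, replace = TRUE),   # months active
  Churn            = rbinom(n, 1, 0.4),                 # 1 = unsubscribed
  Purch.last.Month = rpois(n, 3),                       # purchases before the study
  Monthly.Charges  = round(runif(n, 20, 120), 2),       # dollar amount
  Gender           = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  Education        = factor(sample(c("High school", "Bachelor", "Graduate"),
                                   n, replace = TRUE)), # assumed categories
  Satisfaction     = sample(1:5, n, replace = TRUE)     # Likert-style score
)

# Inspect the structure of the simulated data
str(ChurnToy)
```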

Partitioning into training and test data


Next, we will generate test and training datasets so that we can validate any models produced. There are many ways of generating test and training sets.

In earlier chapters, we used the createDataPartition function. For this example, we will generate the test and training data using native R functions. Please refer to the outline of the code here, and then run the code that follows:

  • Set a variable (TrainingRows) corresponding to the number of rows to designate as training data. In this example, we will use 75% of the rows.
  • Use the sample() function to randomize the rows and assign to a new dataframe named ChurnStudy.
  • Then select the first TrainingRows rows. Since the dataframe has already been randomly sampled, selecting rows sequentially from it is a convenient and valid way to select a training sample.
  • The remaining rows (TrainingRows+1 to the end) will be the testing dataset. Assign it to ChurnStudy.test.
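The steps above can be sketched in base R as follows. This is one possible reading of the outline, not the book's code: the stand-in dataframe ChurnAll, the name ChurnStudy.train, and the seed are assumptions, while TrainingRows, ChurnStudy, and ChurnStudy.test follow the names used in the list.

```r
set.seed(1)  # arbitrary seed so the split is reproducible

# Stand-in for the full dataset (hypothetical, 100 customers)
ChurnAll <- data.frame(id = 1:100, Churn = rbinom(100, 1, 0.4))

# Number of rows to use for training: 75% of the data
TrainingRows <- round(0.75 * nrow(ChurnAll))

# Randomize the row order
ChurnStudy <- ChurnAll[sample(nrow(ChurnAll)), ]

# First TrainingRows rows form the training set, the rest the test set
ChurnStudy.train <- ChurnStudy[1:TrainingRows, ]
ChurnStudy.test  <- ChurnStudy[(TrainingRows + 1):nrow(ChurnStudy), ]

nrow(ChurnStudy.train)
nrow(ChurnStudy.test)
```

Because the rows are shuffled once and then cut at a fixed position, no customer can appear in both sets.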

Once we have generated the...

Setting the stage by creating survival objects


Coding survival analysis in R usually starts with creating what is known as a survival object using the Surv() function. A survival object contains more information than a regular dataframe. It keeps track of the time and the event status (0 or 1) for each observation, and it designates the response (dependent) variable.

At a minimum, you need to supply a single time variable and an event indicator when defining a survival object. In our case, we will use the tenure time (Xtenure2) as the time variable and a logical expression that defines the event. This will be Churn == 1, which means that the customer churned in that month:

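install.packages("survival")  # only needed once per R installation
library(survival)
# Survival object: Xtenure2 is the time, Churn == 1 marks the event
ChurnStudy$SurvObj <- with(ChurnStudy, Surv(Xtenure2, Churn == 1))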
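To see what a survival object holds, the following sketch builds one on a three-row toy dataframe. The toy values are made up for illustration; only the Xtenure2, Churn, and SurvObj names come from the chapter.

```r
library(survival)

# Three hypothetical customers: two churned, one was still active
toy <- data.frame(Xtenure2 = c(3, 5, 12), Churn = c(1, 0, 1))
toy$SurvObj <- with(toy, Surv(Xtenure2, Churn == 1))

# Censored observations (no churn observed) print with a trailing "+"
print(toy$SurvObj)

# The new column stores both the time and the event status
str(toy)
```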

As I mentioned in earlier chapters, I always like to issue a str() command after I create a new dataframe, just to make sure the results are as expected...

Examining survival curves


Kaplan-Meier survival curves are usually a good place to start when examining the effect of different single factors upon the survival rate, since they are easy to construct and visualize. Later on, we will examine Cox regression, which can handle multiple factors.

Kaplan-Meier (KM) curves are step functions in which the survival rate is estimated at each discrete time point. At each time point, the number of customers who have survived (are still active) is divided by the number of customers at risk, and the KM estimate is the running product of these ratios. The number of customers at risk (the denominator) excludes all customers who have already churned or who haven't achieved the tenure specified at that time point.

To illustrate, if we table ChurnStudy by the number of months active (Xtenure2), we can see that for month 1, there were 44 members, and the survival rate is calculated as (1984 - 19) / 1984, that is, the number left after the end of month 1 divided by 1984:
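The calculation can be checked by hand on a tiny example. In the sketch below, the six tenure and churn values are invented for illustration. The step-by-step ratios are accumulated with cumprod() and compared with the estimate returned by survfit() from the survival package.

```r
library(survival)

# Hypothetical tenures (months) and churn flags for six customers
tenure <- c(1, 1, 2, 3, 3, 4)
churn  <- c(1, 0, 1, 1, 0, 0)

# Manual Kaplan-Meier: at each time, (at risk - events) / at risk
times   <- sort(unique(tenure))
n_risk  <- sapply(times, function(t) sum(tenure >= t))
n_event <- sapply(times, function(t) sum(tenure == t & churn == 1))
km_manual <- cumprod((n_risk - n_event) / n_risk)

# The same estimate from survfit()
fit <- survfit(Surv(tenure, churn) ~ 1)

data.frame(time = times, n_risk, n_event, km_manual, km_survfit = fit$surv)
```

Note that the customer censored in month 1 leaves the risk set without counting as an event, which is why only four customers are at risk in month 2.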

table(ChurnStudy$Xtenure2...

Cox regression modeling


KM tests can be satisfactory in many situations, especially during preliminary analysis; however, KM tests are non-parametric and are typically less powerful than parametric equivalents. Cox regression extends survival analysis to a semi-parametric regression framework, in which it gains power. If several independent variables need to be incorporated into a model, and some of them are continuous, it is advantageous to perform Cox proportional hazards modeling rather than KM.

Our first model

Cox modeling also starts with creating a survival object, as we did in previous examples. Other than that, a Cox model looks very similar to a standard regression model, with the response variable specified to the left of the ~ and the independent variables specified to the right.

In Cox regression modeling, we fit the model with the coxph() function and use the Surv() function to specify the dependent variable. This can be done directly in the formula, or by assigning it to...
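A minimal sketch of both styles follows, using simulated data rather than the chapter's dataset. The variable names, the seed, and the data-generating process (higher satisfaction lowers the churn hazard, with observation stopping at month 12) are assumptions made purely so the example runs on its own.

```r
library(survival)

set.seed(123)  # arbitrary seed
n <- 500
satisfaction <- sample(1:5, n, replace = TRUE)
purchases    <- rpois(n, 3)

# Assumed process: higher satisfaction means a lower monthly churn hazard
true_time <- rexp(n, rate = 0.15 * exp(-0.5 * (satisfaction - 3)))

toy <- data.frame(
  tenure       = pmin(ceiling(true_time), 12),  # observation ends at month 12
  churn        = as.integer(true_time <= 12),   # 1 = churn was observed
  satisfaction = satisfaction,
  purchases    = purchases
)

# Style 1: survival object written directly in the formula
fit1 <- coxph(Surv(tenure, churn) ~ satisfaction + purchases, data = toy)

# Style 2: survival object stored in the dataframe first
toy$SurvObj <- with(toy, Surv(tenure, churn))
fit2 <- coxph(SurvObj ~ satisfaction + purchases, data = toy)

summary(fit1)
```

Both calls fit the same model, so the choice is mostly a matter of readability.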

Time-based variables


Up until now, we have treated all of our variables as static, that is, they maintained their original values measured from the beginning of the measurement period.

In reality, values such as age and marital status change over time, and these changes can be accounted for by the model. In the marketing context, surveys might be administered after the study has begun. Based upon changes in some of these variables, coupons and other incentives might be offered (interventions) with the purpose of changing customer behavior. In the model, these interventions can also be accounted for.

In our example, we will introduce a hypothetical second survey, which was administered 6 months into the measurement period and measured the effect of treating some of the unsatisfied customers.

Changing the data to reflect the second survey

The following code uses the survSplit function to create a new record at time period 6 that will reflect the response to a second hypothetical customer survey administered...
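As a rough illustration of what survSplit() does, the sketch below splits three invented customer records at month 6. The dataframe, its values, and the names cust, period, and SatisfactionNow are hypothetical; Satisfaction and Satisfaction2 mirror the variable names used in this chapter.

```r
library(survival)

# Three hypothetical customers with a first and a follow-up satisfaction score
toy <- data.frame(
  cust          = 1:3,
  tenure        = c(4, 9, 12),
  churn         = c(1, 1, 0),
  Satisfaction  = c(2, 2, 4),
  Satisfaction2 = c(2, 4, 5)  # the value for cust 1 is never used (left before month 6)
)

# Split each record at month 6; 'period' numbers the resulting intervals
ChurnSplit <- survSplit(Surv(tenure, churn) ~ ., data = toy,
                        cut = 6, episode = "period")

# Use the first survey before month 6 and the second survey afterwards
ChurnSplit$SatisfactionNow <- ifelse(ChurnSplit$period == 1,
                                     ChurnSplit$Satisfaction,
                                     ChurnSplit$Satisfaction2)

print(ChurnSplit)
```

Customers who stayed past month 6 now have two rows, each covering the interval from tstart to tenure, and the churn event is attached only to the interval in which it happened. A model on this layout would use Surv(tstart, tenure, churn) as its response.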

Comparing the models


Even though the survival curves are similar, we can see that at the end of 12 months, 56% of the customers were retained, as opposed to the original 27%. We could attribute that to the intervention that took place at month 6.

Use the summary(survfit) function to compare the models:

Variable            Description
Monthly.Charges     Average dollar amount of previous purchases
Purch.last.Month    Number of purchases in the month before the study begins
Satisfaction        Overall satisfaction with the service supplied on a Likert scale
Satisfaction2       Follow-up satisfaction score
Gender              Male or female
Education Level...

Variable selection


The model we just worked with had a limited number of variables, so mechanical variable selection methods, which are aimed at models with many variables, were not really pertinent. We were able to pinpoint the important variables via the regression model. However, for a model with a large number of variables, we could use the glmulti package to perform variable selection.

The churn example that was generated has a small number of variables, so demonstrating variable selection on it is easy and not too time-consuming.

In the following code, we will set the maximum number of terms to include in the best regression to 10 in order to limit the computational time needed to perform an exhaustive search. We will also use the genetic algorithm option (method = "g"), which can be much faster with larger datasets, since it only considers the best subsets of all of the combinations.

If you wish to perform an exhaustive search, use method = "h". However, be forewarned...
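To show what an exhaustive search involves, the sketch below fits every subset of three candidate variables with coxph() and ranks the fits by AIC. It deliberately uses only base R and the survival package rather than glmulti, so it is a hand-rolled stand-in for the idea, not the package's algorithm. The simulated data, variable names, and seed are assumptions.

```r
library(survival)

set.seed(2017)  # arbitrary seed
n <- 500
toy <- data.frame(
  satisfaction = sample(1:5, n, replace = TRUE),
  purchases    = rpois(n, 3),
  charges      = runif(n, 20, 120)
)
# Assumed process: only satisfaction influences the churn hazard
true_time  <- rexp(n, rate = 0.15 * exp(-0.5 * (toy$satisfaction - 3)))
toy$tenure <- pmin(ceiling(true_time), 12)
toy$churn  <- as.integer(true_time <= 12)

# Enumerate every non-empty subset of the candidate variables
candidates <- c("satisfaction", "purchases", "charges")
subsets <- unlist(
  lapply(seq_along(candidates),
         function(k) combn(candidates, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one Cox model per subset and record its AIC
aic <- sapply(subsets, function(vars) {
  f <- as.formula(paste("Surv(tenure, churn) ~", paste(vars, collapse = " + ")))
  AIC(coxph(f, data = toy))
})
names(aic) <- sapply(subsets, paste, collapse = " + ")

sort(aic)                           # lower AIC is better
best <- names(aic)[which.min(aic)]
best
```

With three candidates there are only 7 models to fit, but the count doubles with every added variable, which is why an exhaustive search quickly becomes expensive.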

Summary


In this chapter, we learned what survival analysis is, and how two main techniques, Kaplan-Meier and Cox regression, can be used to explain and predict customer churn.

We also learned how we can generate our own data to test assumptions and test the robustness of the models.

Finally, we included some coding techniques to help us reproduce and save our generated code and images.

In the next chapter, we will not be concerned with a customer leaving, but will cover how to keep customers happy by predicting what they will purchase next using a technique known as Market Basket Analysis.


> summary(survfit(CoxModel.2))
Call: survfit(formula = CoxModel.2)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1   1488      15    0.994 0.00157        0.991        0.997
    2   1455      52    0.973 0.00359        0.966        0.980
    3   1393      34    0.958 0.00461        0.949        0.967
    4   1342      20    0.950 0.00518        0.940        0.960
    5   1315      39    0.932 0.00624        0.920        0.945
    6   1245      42    0.913 0.00736        0.898        0.927
    7   1156      24    0.898 0.00801        0.883        0.914
    8   1020      32    0.877 0.00902        0.859        0.895
    9    850      40    0.846 0.01052        0.825        0.866
   10    665      51    0.797 0.01293        0.772        0.822
   11    435      54    0.721 0.01688        0.688        0.755
   12    225      55    0.569 0.02518        0.522        0.621

> summary(survfit(CoxModel.1))
Call: survfit(formula = CoxModel.1)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1   1488      15    0.993 0.00185...