Chapter 10. Customer Churn Prediction

This is the last chapter of the book. So far we have looked at the technology topics around Spark, from its architecture to the details of its APIs, including RDDs, DataFrames, and the machine learning and GraphX frameworks. In the previous chapter we covered a recommendation engine use case, working primarily with the Scala API; throughout the book we have mostly used the Scala, Python, or R shells. In this chapter we will use a Jupyter notebook with the PySpark interpreter to look at a churn prediction use case.
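
Before diving in, here is a minimal sketch of the notebook setup assumed by the snippets later in this chapter: creating a SparkSession, the Spark 2.x entry point, in a Jupyter cell. The application name is illustrative, not taken from the chapter's notebook.

    # Minimal notebook setup, assuming Spark 2.x: create the SparkSession
    # that the later snippets use. The app name is illustrative.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("ChurnPrediction")
             .getOrCreate())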

The chapter covers:

  • Overview of customer churn
  • Importance of churn prediction
  • Understanding the dataset
  • Exploring data
  • Building a machine learning pipeline
  • Predicting churn

This chapter will hopefully give you a good introduction to churn prediction systems, which you can use as a baseline for other prediction activities.

Let's get started.

Overview of customer churn


I have spent almost 15 years in the telecom and financial industries, working with some of their major players, and if there is one business problem that keeps those businesses worried, it is customer churn. So what is customer churn?

According to Wikipedia, Customer attrition, also known as customer churn, customer turnover, or customer defection, is the loss of clients or customers (http://bit.ly/2kfTHXF).

Customer churn is such a nightmarish problem for major vendors because:

  • Churn shrinks your customer base and hence your profitability and bottom line
  • Churn impacts your overall business image
  • Churn hurts your company's goodwill and market sentiment
  • Churn hurts your total addressable market
  • Churn gives your competitors a psychological and economic advantage
  • Churn affects your overall employee morale

Figure 10.1: Customer churn via http://bit.ly/2iUScln

While you can never reduce the churn rate to zero, you essentially want to:

  • Identify who is most likely to churn
  • Identify...

Why is predicting customer churn important?


Based on this quick overview of what churn is and how it impacts an organization, why is it important to predict it? It is as important as predicting any other potentially damaging event for the organization. Identifying the customers who are prone to churn helps you devise a strategy to tackle the problem before it materializes. Remember that not all churn is bad for the bottom line; you need to understand the impact of each customer's churn on revenues and on the non-tangible factors mentioned in the previous section, because until you understand the problem you cannot devise a strategy to resolve it. Each customer, or segment of customers, needs to be treated differently, perhaps with a different strategy for each segment. There are various ways to reduce your churn rate, some of which are eloquently described in a blog post by Ross Beard, which can be accessed here: http://bit.ly...

How do we predict customer churn with Spark?


Predicting customer churn in Apache Spark is similar to predicting any other binary outcome, and Spark provides a number of algorithms for such predictions. While we'll focus on Random Forest, you can also look at the other algorithms in the MLlib library. We'll follow the typical steps for building a machine learning pipeline that we discussed in the earlier MLlib chapter; a condensed PySpark sketch follows the list of stages below.

The typical stages include:

  • Stage 1: Loading data/defining schema
  • Stage 2: Exploring/visualizing the data set
  • Stage 3: Performing necessary transformations
  • Stage 4: Feature engineering
  • Stage 5: Model training
  • Stage 6: Model evaluation
  • Stage 7: Model monitoring
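
To make these stages concrete, here is a condensed, hypothetical PySpark sketch of stages 1 and 3 through 6, not the chapter's exact notebook code: it loads the churn CSV, indexes the categorical columns, assembles a feature vector, trains a Random Forest, and evaluates it. The file name and all column names are assumptions for illustration; adjust them to match the actual data set.

    # A condensed sketch of stages 1 and 3-6, assuming a local churn.csv
    # whose columns include the ones named below.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # Stage 1: load the data, letting Spark infer the schema.
    churn_df = (spark.read
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("churn.csv"))

    # Stages 3/4: transformations and feature engineering -- index the
    # string label and categorical columns, then assemble the features.
    label_indexer = StringIndexer(inputCol="churned", outputCol="label")
    plan_indexer = StringIndexer(inputCol="intl_plan",
                                 outputCol="intl_plan_idx")
    assembler = VectorAssembler(
        inputCols=["account_length", "intl_plan_idx",
                   "total_day_minutes", "custserv_calls"],
        outputCol="features")

    # Stage 5: model training with a Random Forest classifier.
    rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                                numTrees=50)
    pipeline = Pipeline(stages=[label_indexer, plan_indexer, assembler, rf])

    train, test = churn_df.randomSplit([0.8, 0.2], seed=42)
    model = pipeline.fit(train)

    # Stage 6: model evaluation on the held-out split.
    predictions = model.transform(test)
    evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
    print("Area under ROC:", evaluator.evaluate(predictions))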

Data set description

Since we are targeting the telecom industry, we'll use one of the popular data sets commonly used for telecommunications demonstrations. It was originally published in Discovering Knowledge in Data (http://www.dataminingconsultant.com/DKD.htm) (Larose, 2004)....
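
As a first look at the data (stage 2), a couple of notebook cells like the following can summarize the class balance and some numeric columns; churn_df and the column names are the assumptions carried over from the pipeline sketch above.

    # Quick exploration of the loaded DataFrame: how many customers
    # churned, and summary statistics for two numeric features.
    churn_df.groupBy("churned").count().show()
    churn_df.describe("total_day_minutes", "custserv_calls").show()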

Exploring customer service calls


Based on past experience, in a typical churn scenario the customers who churn tend to make more calls to the customer service center than those who don't. A call to the customer service center implies a problem faced by the customer, and multiple calls indicate a persistent, unresolved problem.

We've used Plotly to build a scatter plot of the calls made to customer services. Plotly, also known by its URL, Plot.ly, is an online analytics and data visualization tool providing online graphing, analytics, and statistics tools for individuals and collaboration, as well as scientific graphing libraries for Python, R, MATLAB, Perl, Julia, Arduino...
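
As a rough illustration of how such a plot can be produced, here is a hypothetical sketch, not the chapter's exact code: it aggregates the service-call counts in Spark, collects the small result to the driver, and renders it with Plotly's offline mode in the notebook. It assumes churn_df from the earlier sketch and that the "churned" column holds the strings "True"/"False".

    # Aggregate in Spark, then plot the small result with offline Plotly.
    import plotly.graph_objs as go
    from plotly.offline import init_notebook_mode, iplot

    init_notebook_mode(connected=True)

    # Count customers per (number of service calls, churn flag) and
    # collect the aggregate to the driver for plotting.
    counts = (churn_df.groupBy("custserv_calls", "churned")
              .count()
              .orderBy("custserv_calls")
              .collect())

    def trace(flag, name):
        rows = [r for r in counts if r["churned"] == flag]
        return go.Scatter(x=[r["custserv_calls"] for r in rows],
                          y=[r["count"] for r in rows],
                          mode="markers", name=name)

    iplot(go.Figure(data=[trace("True", "churned"),
                          trace("False", "stayed")]))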

References


We've referenced the following blogs, articles, and videos during the creation of this chapter. You may want to refer to them for further reading:

Summary


This concludes the chapter. We have gone through a churn prediction example using PySpark and a Jupyter notebook. I hope this gives you a good starting point for building your own applications. The full code and the Jupyter notebook are available on this book's GitHub page.

This was the last major chapter of the book. Our intention was to take readers who are just beginning to learn Spark on a journey from the very basics to a level where they feel comfortable with Spark as a framework and with writing their own Spark applications. We've covered some interesting topics, including RDDs, DataFrames, MLlib, GraphX, and how to set up Spark in clustered mode. No single book can do full justice to Spark as a framework, as it is continuously evolving, with new and exciting features added in every release.

We hope you have enjoyed this journey, and we look forward to hearing about your experience and feedback. In the Appendix, There's More with...
