Chapter 10. Customer Churn Prediction

This is the last chapter of the book. So far we have looked at the technology topics around Spark, from its architecture to the details of its APIs, including RDDs, DataFrames, and the machine learning and GraphX frameworks. In the previous chapter we covered a recommendation engine use case, working primarily with the Scala API; throughout the book we have mostly used the Scala, Python, or R shells. In this chapter we will use a Jupyter notebook with the PySpark interpreter to look at a churn prediction use case.
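
Before diving in, here is a minimal sketch of the notebook setup assumed by the snippets later in this chapter: creating a SparkSession, the Spark 2.x entry point, in a Jupyter cell. The application name is illustrative, not taken from the chapter's notebook.

    # Minimal notebook setup, assuming Spark 2.x: create the SparkSession
    # that the later snippets use. The app name is illustrative.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("ChurnPrediction")
             .getOrCreate())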

The chapter covers:

  • Overview of customer churn
  • Importance of churn prediction
  • Understanding the dataset
  • Exploring data
  • Building a machine learning pipeline
  • Predicting churn

This chapter will hopefully give you a good introduction to churn prediction systems, which you can use as a baseline for other prediction activities.

Let's get started.

Overview of customer churn


I have spent almost 15 years in the telecom and financial industries, working with some of their major players, and if there is one business problem that keeps those businesses worried, it is customer churn. So what is customer churn?

According to Wikipedia, Customer attrition, also known as customer churn, customer turnover, or customer defection, is the loss of clients or customers (http://bit.ly/2kfTHXF).

Customer churn is such a nightmarish problem for major vendors because:

  • Churn shrinks your customer base and hence your profitability and bottom line
  • Churn impacts your overall business image
  • Churn hurts your company's goodwill and market sentiment
  • Churn hurts your total addressable market
  • Churn gives your competitors a psychological and economic advantage
  • Churn affects your overall employee morale

Figure 10.1: Customer churn via http://bit.ly/2iUScln

While you can never reduce the churn rate to zero, you essentially want to:

  • Identify who is most likely to churn
  • Identify...

Why is predicting customer churn important?


Based on this quick overview of what churn is and how it impacts an organization, why is it important to predict it? It is as important as predicting any other potentially damaging event for the organization. Identifying the customers who are prone to churn helps you devise a strategy to tackle the problem before it materializes. Remember that not all churn is bad for the bottom line; you need to understand the impact of each customer's churn on revenues and on the non-tangible factors mentioned in the previous section, because until you understand the problem you cannot devise a strategy to resolve it. Each customer, or segment of customers, needs to be treated differently, perhaps with a different strategy for each segment. There are various ways to reduce your churn rate, some of which are eloquently described in a blog post by Ross Beard, which can be accessed here: http://bit.ly...

How do we predict customer churn with Spark?


Predicting customer churn in Apache Spark is similar to predicting any other binary outcome, and Spark provides a number of algorithms for such predictions. While we'll focus on Random Forest, you can also look at the other algorithms in the MLlib library. We'll follow the typical steps for building a machine learning pipeline that we discussed in the earlier MLlib chapter; a condensed PySpark sketch follows the list of stages below.

The typical stages include:

  • Stage 1: Loading data/defining schema
  • Stage 2: Exploring/visualizing the data set
  • Stage 3: Performing necessary transformations
  • Stage 4: Feature engineering
  • Stage 5: Model training
  • Stage 6: Model evaluation
  • Stage 7: Model monitoring
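
To make these stages concrete, here is a condensed, hypothetical PySpark sketch of stages 1 and 3 through 6, not the chapter's exact notebook code: it loads the churn CSV, indexes the categorical columns, assembles a feature vector, trains a Random Forest, and evaluates it. The file name and all column names are assumptions for illustration; adjust them to match the actual data set.

    # A condensed sketch of stages 1 and 3-6, assuming a local churn.csv
    # whose columns include the ones named below.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # Stage 1: load the data, letting Spark infer the schema.
    churn_df = (spark.read
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("churn.csv"))

    # Stages 3/4: transformations and feature engineering -- index the
    # string label and categorical columns, then assemble the features.
    label_indexer = StringIndexer(inputCol="churned", outputCol="label")
    plan_indexer = StringIndexer(inputCol="intl_plan",
                                 outputCol="intl_plan_idx")
    assembler = VectorAssembler(
        inputCols=["account_length", "intl_plan_idx",
                   "total_day_minutes", "custserv_calls"],
        outputCol="features")

    # Stage 5: model training with a Random Forest classifier.
    rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                                numTrees=50)
    pipeline = Pipeline(stages=[label_indexer, plan_indexer, assembler, rf])

    train, test = churn_df.randomSplit([0.8, 0.2], seed=42)
    model = pipeline.fit(train)

    # Stage 6: model evaluation on the held-out split.
    predictions = model.transform(test)
    evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
    print("Area under ROC:", evaluator.evaluate(predictions))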

Data set description

Since we are targeting the telecom industry, we'll use one of the popular data sets commonly used for telecommunications demonstrations. It was originally published in Discovering Knowledge in Data (http://www.dataminingconsultant.com/DKD.htm) (Larose, 2004)....
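
As a first look at the data (stage 2), a couple of notebook cells like the following can summarize the class balance and some numeric columns; churn_df and the column names are the assumptions carried over from the pipeline sketch above.

    # Quick exploration of the loaded DataFrame: how many customers
    # churned, and summary statistics for two numeric features.
    churn_df.groupBy("churned").count().show()
    churn_df.describe("total_day_minutes", "custserv_calls").show()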

Exploring customer service calls


Based on past experience, in a typical churn scenario the customers who churn tend to make more calls to the customer service center than those who don't. A call to the customer service center implies a problem faced by the customer, and multiple calls indicate a persistent, unresolved problem.

We've used Plotly to build a scatter plot of the calls made to customer services. Plotly, also known by its URL, Plot.ly, is an online analytics and data visualization tool providing online graphing, analytics, and statistics tools for individuals and collaboration, as well as scientific graphing libraries for Python, R, MATLAB, Perl, Julia, Arduino...
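
As a rough illustration of how such a plot can be produced, here is a hypothetical sketch, not the chapter's exact code: it aggregates the service-call counts in Spark, collects the small result to the driver, and renders it with Plotly's offline mode in the notebook. It assumes churn_df from the earlier sketch and that the "churned" column holds the strings "True"/"False".

    # Aggregate in Spark, then plot the small result with offline Plotly.
    import plotly.graph_objs as go
    from plotly.offline import init_notebook_mode, iplot

    init_notebook_mode(connected=True)

    # Count customers per (number of service calls, churn flag) and
    # collect the aggregate to the driver for plotting.
    counts = (churn_df.groupBy("custserv_calls", "churned")
              .count()
              .orderBy("custserv_calls")
              .collect())

    def trace(flag, name):
        rows = [r for r in counts if r["churned"] == flag]
        return go.Scatter(x=[r["custserv_calls"] for r in rows],
                          y=[r["count"] for r in rows],
                          mode="markers", name=name)

    iplot(go.Figure(data=[trace("True", "churned"),
                          trace("False", "stayed")]))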

References


We've referenced the following blogs, articles, and videos during the creation of this chapter. You may want to refer to them for further reading:

Summary


This concludes the chapter. We have gone through a churn prediction example using PySpark and a Jupyter notebook. I hope this gives you a good starting point for building your own applications. The full code and the Jupyter notebook are available on this book's GitHub page.

This was the last major chapter of the book. Our intention was to take readers who are just beginning to learn Spark on a journey from the very basics to a level where they feel comfortable with Spark as a framework and with writing their own Spark applications. We've covered some interesting topics, including RDDs, DataFrames, MLlib, GraphX, and how to set up Spark in clustered mode. No single book can do full justice to Spark as a framework, as it is continuously evolving, with new and exciting features added in every release.

We hope you have enjoyed this journey, and we look forward to hearing about your experience and feedback. In the Appendix, There's More with...
