
You're reading from Data Science for Marketing Analytics - Second Edition

Product type: Book
Published in: Sep 2021
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781800560475
Edition: 2nd Edition
Authors (3):
Mirza Rahim Baig

Mirza Rahim Baig is a Data Science and Artificial Intelligence leader with over 13 years of experience across e-commerce, healthcare, and marketing. He currently holds the position of leading Product Analytics at Marketing Services for Zalando, Europe's largest online fashion platform. In addition, he serves as a Subject Matter Expert and faculty member for MS level programs at prominent Ed-Tech platforms and institutes in India. He is also the lead author of two books, 'Data Science for Marketing Analytics' and 'The Deep Learning Workshop,' both published by Packt. He is recognized as a thought leader in his field and frequently participates as a guest speaker at various forums.

Gururajan Govindan

Gururajan Govindan is a data scientist, intrapreneur, and trainer with more than seven years of experience working across domains such as finance and insurance. He is also an author of The Data Analysis Workshop, a book focusing on data analytics. He is well known for his expertise in data-driven decision-making and machine learning with Python.

Vishwesh Ravi Shrimali

Vishwesh Ravi Shrimali graduated from BITS Pilani, where he studied mechanical engineering, in 2018. He also completed his master's degree in Machine Learning and AI at LJMU in 2021. He has authored Machine Learning for OpenCV (2nd edition), Computer Vision Workshop, and Data Science for Marketing Analytics (2nd edition), all published by Packt. When he is not writing blogs or working on projects, he likes to go on long walks or play his acoustic guitar.

7. Supervised Learning: Predicting Customer Churn

Overview

In this chapter, you will perform classification tasks using logistic regression and implement the most widely used data science pipeline – Obtain, Scrub, Explore, Model, and iNterpret (OSEMN). You will interpret the relationship between the target and explanatory variables by performing data exploration. This will in turn help in selecting features for building predictive models. You will use these concepts to train your churn model. You will also perform logistic regression as a baseline model to predict customer churn.

Introduction

The success of a company depends heavily on its ability to attract new customers while holding on to existing ones. Churn refers to the situation where a customer stops using a company's product and leaves the company. The term applies more broadly as well: employee churn from a company, customer churn from a mobile subscription, and so on. Predicting customer churn is important for an organization because acquiring new customers is easy but retaining them is more difficult. Similarly, high employee churn can hurt a company, since companies spend large sums of money on grooming talent. Organizations with high retention rates also benefit from consistent growth, which can lead to more referrals from existing customers. Churn prediction is one of the most common use cases of machine learning.

You learned about supervised learning in the previous chapters, where you gained hands-on experience in solving regression problems. When it comes to predicting...

Classification Problems

Consider a situation where you have been tasked with building a model to predict whether a product bought by a customer will be returned. Since we have focused on regression models so far, let's consider whether they would be the right fit here. A regression model gives continuous values as output (for example, 0.1, 100, 100.25, and so on), but in our case study there are only two possible outputs: a product will be returned, or it won't be returned. Any value other than these two is invalid. While we could encode "product returned" as the value 0 and "product not returned" as the value 1, we still can't define what a value of 1.5 means.

In scenarios like these, classification models come into the picture. Classification problems are the most common type of machine learning problem. Classification tasks are different from regression tasks in the...

Understanding Logistic Regression

Logistic regression is one of the most widely used classification methods, and it works well when the data is linearly separable. The objective of logistic regression is to squash the output of linear regression into the range between 0 and 1, so that it can be mapped to the classes 0 and 1. Let's first understand the "regression" part of the name and why, despite its name, logistic regression is a classification model.

Revisiting Linear Regression

In the case of linear regression, our mapping function would be as follows:

Figure 7.2: Equation of linear regression

Here, x refers to the input data and θ0 and θ1 are parameters that are learned from the training data.

Also, the cost function in the case of linear regression, which is to be minimized, is the root mean squared error (RMSE), which we discussed in the previous chapter.
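
As a minimal sketch of the mapping function and cost described above, assuming the usual single-feature form h(x) = θ0 + θ1·x (the equation figure itself is not reproduced here), the following pure-Python snippet computes predictions and the RMSE. The function names `predict` and `rmse` and the toy data are illustrative assumptions, not taken from the book.

```python
import math

def predict(x, theta0, theta1):
    """Assumed linear mapping: h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def rmse(xs, ys, theta0, theta1):
    """Root mean squared error between predictions and targets."""
    squared_errors = [(predict(x, theta0, theta1) - y) ** 2
                      for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy data generated from y = 1 + 2x (hypothetical values)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

perfect = rmse(xs, ys, theta0=1.0, theta1=2.0)  # parameters match the data
worse = rmse(xs, ys, theta0=0.0, theta1=2.0)    # every prediction is off by 1

print(perfect, worse)
```

Training amounts to searching for the θ0 and θ1 that make this cost as small as possible.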

This works well for continuous data, but the problem arises when we have a categorical target variable, such as 0 or 1. When we try to use linear regression...

Logistic Regression

If a response variable has binary values, the assumptions of linear regression are not valid for the following reasons:

  • The relationship between the independent variable and the response variable is not linear.
  • The error terms are heteroscedastic. Recall that heteroscedastic means that the variance of the error terms is not the same throughout the range of x (input data).
  • The error terms are not normally distributed.

If we proceed despite these violations, the results would be as follows:

  • The predicted probabilities could be greater than 1 or less than 0.
  • The magnitude of the effects of independent variables may be underestimated.
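
The first consequence listed above can be seen in a small experiment. The sketch below fits an ordinary least-squares line to a made-up binary target using the closed-form solution for one feature, then collects the predictions that fall outside the [0, 1] range. The data and the helper name `fit_simple_ols` are assumptions chosen for illustration.

```python
def fit_simple_ols(xs, ys):
    """Closed-form least-squares fit of y = theta0 + theta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = s_xy / s_xx
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

# Hypothetical binary target: class 0 for small x, class 1 for large x
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

theta0, theta1 = fit_simple_ols(xs, ys)
predictions = [theta0 + theta1 * x for x in xs]

# Predictions that cannot be read as probabilities
out_of_range = [round(p, 3) for p in predictions if p < 0 or p > 1]
print(out_of_range)
```

On this data, the fitted line dips below 0 at the smallest x and rises above 1 at the largest x, so its output cannot be interpreted as a probability at either end.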

With logistic regression, we are interested in modeling the mean of the response variable, p, in terms of an explanatory variable, x, as a probabilistic model in terms of the odds ratio. The odds ratio is the ratio of two probabilities – the probability of the event occurring, and the probability...
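
To make the idea of odds concrete, here is a minimal sketch assuming the standard definition odds = p / (1 - p), that is, the probability of the event occurring divided by the probability of it not occurring. It also shows the log of the odds and the sigmoid function, which maps a log-odds value back to a probability. The function names are illustrative, not the book's.

```python
import math

def odds(p):
    """Odds of an event with probability p (assumed definition: p / (1 - p))."""
    return p / (1.0 - p)

def log_odds(p):
    """Natural log of the odds, also known as the logit."""
    return math.log(odds(p))

def sigmoid(z):
    """Inverse of the logit: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

p = 0.8
event_odds = odds(p)             # roughly 4: the event is 4 times as likely as not
recovered = sigmoid(log_odds(p)) # round trip returns the original probability

print(event_odds, recovered)
print(sigmoid(-10), sigmoid(0), sigmoid(10))  # always strictly between 0 and 1
```

Because the sigmoid output always lies strictly between 0 and 1, a linear combination of the inputs can be passed through it to obtain a valid probability.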

Creating a Data Science Pipeline

"Pipeline" is a commonly used term in data science, and it means that a pre-defined list of steps is performed in a proper sequence, one after another. The clearer the instructions, the better the standard of results obtained, in terms of quality and quantity. OSEMN is one of the most common data science pipelines used for approaching any kind of data science problem. The acronym is pronounced "awesome".

The following figure provides an overview of the typical sequence of actions a data analyst would follow to create a data science pipeline:

Figure 7.12: The OSEMN pipeline

Let's understand the steps in the OSEMN pipeline in a little more detail:

  1. Obtaining the data, which can be from any source: structured, unstructured, or semi-structured.
  2. Scrubbing the data, which means getting your hands dirty and cleaning the data, which can involve renaming columns and imputing missing values.
  3. Exploring the data to find out...
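
The scrubbing step described above can be sketched with pandas. The column names, values, and imputation choices below (median for a numeric column, most frequent value for a categorical one) are hypothetical and only illustrate renaming columns and imputing missing values; they are not the chapter's dataset or its exact method.

```python
import pandas as pd

# Hypothetical raw data with terse column names and missing values
raw = pd.DataFrame({
    "CustID": [1, 2, 3, 4],
    "Bal": [1200.0, None, 300.0, 500.0],
    "Gndr": ["F", "M", None, "M"],
})

# Rename columns to something more readable
clean = raw.rename(columns={
    "CustID": "customer_id",
    "Bal": "balance",
    "Gndr": "gender",
})

# Impute the numeric column with its median
clean["balance"] = clean["balance"].fillna(clean["balance"].median())

# Impute the categorical column with its most frequent value
clean["gender"] = clean["gender"].fillna(clean["gender"].mode()[0])

print(clean)
```

Which imputation strategy is appropriate depends on the column and on why the values are missing, so treat the choices here as placeholders.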

Churn Prediction Case Study

You work at a multinational bank that is aiming to increase its market share in Europe. Recently, the number of customers using its banking services has declined, and the bank is worried that existing customers have stopped using it as their main bank. As a data scientist, you are tasked with finding out the reasons behind customer churn and predicting future customer churn. The marketing team is interested in your findings and wants to better understand existing customer behavior and possibly predict future customer churn. Your results will help the marketing team use its budget wisely to target potential churners.

Before you start analyzing the problem, you'll first need to have the data at your disposal.

Obtaining the Data

This step refers to collecting data. Data can be obtained from a single source or multiple sources. In the real world, collecting data is not always easy since the data is often divided. It can be present in multiple...

Modeling the Data

Data modeling, as the name suggests, refers to the process of creating a model that can define the data and can be used to draw conclusions and predictions for new data points. Modeling the data not only includes building your machine learning model but also selecting important features/columns that will go into your model. This section will be divided into two parts: Feature Selection and Model Building. For example, when trying to solve the churn prediction problem, which has a large number of features, feature selection can help in selecting the most relevant features. Those relevant features can then be used to train a model (in the model-building stage) to perform churn prediction.

Feature Selection

Before building our first machine learning model, we have to do some feature selection. Consider a scenario of churn prediction where you have a large number of columns and you want to perform prediction. Not all the features will have an impact on your prediction...
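
One common way to rank features is to fit a tree-based classifier and inspect its feature importances. The sketch below uses scikit-learn's `RandomForestClassifier` on synthetic data in which only one column drives the target. The data, feature names, and hyperparameters are assumptions for illustration and may differ from what the chapter actually uses.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only the first column determines the target
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)

feature_names = ["signal", "noise_a", "noise_b"]  # hypothetical names

# Fit a tree-based classifier; random_state makes the run repeatable
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# Rank features from most to least important
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Features with very low importance are candidates for removal before the model-building stage, although the cutoff is a judgment call.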

Summary

Predicting customer churn is one of the most common use cases in marketing analytics. Churn prediction not only helps marketing teams to better strategize their marketing campaigns but also helps organizations to focus their resources wisely.

In this chapter, we explored how to use the data science pipeline for any machine learning problem. We also learned the intuition behind using logistic regression and saw how it is different from linear regression. We looked at the structure of the data by reading it using a pandas DataFrame. We then used data scrubbing techniques such as missing value imputation, renaming columns, and data type manipulation to prepare our data for data exploration. We implemented various data visualization techniques, such as univariate and bivariate analysis and a correlation plot, which enabled us to find useful insights from the data. Feature selection is another important part of data modeling. We used a tree-based classifier to select important...

