The Data Analysis Workshop

Product type Book
Published in Jul 2020
Publisher Packt
ISBN-13 9781839211386
Pages 626 pages
Edition 1st Edition
Authors (3): Gururajan Govindan, Shubhangi Hora, Konstantin Palagachev

3. Analyzing Bank Marketing Campaign Data

Overview

In this chapter, we will analyze marketing campaign data related to new financial products. We will pay particular attention to modeling the relationships between the different features in the data and their impact on the final outcome of the campaign. We will also introduce fundamental topics in data analysis, such as linear and logistic regression models, and see how they can be used when analyzing the outcome of a marketing campaign.

Introduction

In the previous chapter, we looked at various techniques from probability theory and hypothesis testing, which are often applied in data analysis. In this chapter, we will extend our knowledge by introducing mathematical models that are suitable for both data analysis and prediction. In this way, we will obtain the fundamental tools for deriving explanatory models and provide a generic framework for identifying causes and effects when performing data analysis.

Direct marketing campaigns are a classical approach to increasing business revenue, informing potential customers about new products, and merchandising them. Having targeted marketing campaigns can significantly increase success rates and revenue since the audience is based on precise criteria and the analysis of past marketing campaigns. Thus, extracting information about successful campaigns and customers can significantly reduce marketing costs and increase sales.

In this chapter, we will analyze...

Initial Data Analysis

We'll start our analysis by loading the data into Python and performing some simple analysis, which will give us a feel for the type of data and the different features of the dataset (this is presented in detail in Figure 3.2):

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# pull data from github
bank_data = pd.read_csv("https://raw.githubusercontent.com/"\
                        "PacktWorkshops/"\
                        "The-Data-Analysis-Workshop/"\
                        "master/Chapter03...

Impact of Numerical Features on the Outcome

In this section, we will analyze the relationship between the numerical features (already identified in Exercise 3.01, Analyzing Distributions of Numerical Features in the Banking Dataset) and the outcome of a marketing campaign, which is identified in the y column in the banking dataset.

We will start our analysis by addressing the following question: Is there a statistically significant difference in the numerical features between successful and unsuccessful marketing campaigns? To answer it, we will create violin plots (as shown in the previous chapters) that compare the distribution of each numerical feature for the two types of outcomes ("yes" for a successful marketing campaign, "no" for an unsuccessful one):

"""
create violin plots for successful and non-successful marketing campaigns
"""
plt.figure(figsize=(10,18))
for index, col in enumerate(numerical_features):
  ...
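Violin plots show whether the two distributions look different, but not whether the difference is statistically significant. One way to check this is a permutation test on the difference in group means, sketched below. This is a minimal illustration and not the book's own method: the function name permutation_test and the synthetic lists feature_yes and feature_no are hypothetical stand-ins for one numerical feature split by the value of the y column.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns an estimated p-value: the share of random relabelings whose
    absolute mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a])
                   - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # The add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_permutations + 1)


# Synthetic data (an assumption for illustration, not the banking dataset)
data_rng = random.Random(42)
feature_yes = [data_rng.gauss(5.0, 1.0) for _ in range(100)]
feature_no = [data_rng.gauss(4.0, 1.0) for _ in range(100)]

p_value = permutation_test(feature_yes, feature_no)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value suggests that the feature is distributed differently for the two outcomes. With the real data, the two lists would be replaced by the feature values for the rows where y is "yes" and "no", respectively.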

Modeling the Relationship via Logistic Regression

We will dedicate the rest of this chapter to two of the fundamental techniques in data analysis and machine learning: linear and logistic regression.

Suppose that we are estimating the relationship between m different features.

In both linear and logistic regression, we are given a set of features, X1, ..., Xm, and a target variable, Y, and the goal is to model Y as a function of the features, as in the following equation:

Figure 3.16: General form of linear/logistic regression


Linear Regression

In linear regression, the target variable, Y, is a continuous variable, meaning that it can take any value in a bounded or unbounded interval (a, b) ⊆ R, where R is the set of real numbers. The preceding equation then takes the following concrete form:

Figure 3.17: Linear regression equation


Let's denote the right-hand side of the preceding equation with Ŷ, as follows:

Figure 3.18: Linear regression equation

Then, if we have n samples in our data (where, for each i ∈ {1, ..., n}, we denote the entries for the m features by x_{i,1}, ..., x_{i,m} and the target variable by y_i), we can rewrite the previous equation in a more specific form, as follows:

Figure 3.19: Linear regression equation in a specific form
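For reference, the equations in Figures 3.16 to 3.19 can be sketched in standard notation as follows. The function f, the coefficients \alpha_0, ..., \alpha_m, and the error term \epsilon are assumed symbols; the figures themselves may use different ones. Here, Ŷ is taken to be the deterministic part of the right-hand side, without the error term, which is also an assumption.

```latex
\begin{align}
Y &= f(X_1, \dots, X_m) + \epsilon
  && \text{(general form)} \\
Y &= \alpha_0 + \alpha_1 X_1 + \dots + \alpha_m X_m + \epsilon
  && \text{(linear regression)} \\
\hat{Y} &= \alpha_0 + \alpha_1 X_1 + \dots + \alpha_m X_m
  && \text{(model output)} \\
y_i &= \alpha_0 + \alpha_1 x_{i,1} + \dots + \alpha_m x_{i,m} + \epsilon_i,
  \quad i \in \{1, \dots, n\}
  && \text{(per-sample form)}
\end{align}
```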


Note that in Figure 3.17 and Figure 3.19, we assume that the dependence of Y on the features X1,...,Xm is either linear or can be approximated...
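To make the per-sample form concrete, the following sketch estimates the coefficients of a linear model with ordinary least squares, using NumPy's np.linalg.lstsq. The data is synthetic, and the names true_coef, true_intercept, and X_design are hypothetical; none of them come from the banking dataset. It is one possible way to fit such a model, not necessarily the approach used later in the chapter.

```python
import numpy as np

# Synthetic data: n samples, m features (assumed values, for illustration)
rng = np.random.default_rng(0)
n, m = 200, 3
X = rng.normal(size=(n, m))
true_intercept = 4.0
true_coef = np.array([1.5, -2.0, 0.5])
noise = rng.normal(scale=0.1, size=n)
y = true_intercept + X @ true_coef + noise

# Prepend a column of ones so that the first coefficient is the intercept
X_design = np.column_stack([np.ones(n), X])

# Least-squares solution: minimizes the sum of squared residuals
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Fitted values, i.e. the model output for each sample
y_hat = X_design @ coef

print("estimated intercept:   ", coef[0])
print("estimated coefficients:", coef[1:])
```

Because the noise is small relative to the signal, the estimates should land close to the values used to generate the data.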

Logistic Regression

Logistic regression is very similar to the linear regression technique we introduced in the previous section, the only difference being that the target variable, Y, takes values only in a discrete set; say, for simplicity, {0, 1}. If we were to approach such a problem as a linear regression problem, the output of the right-hand side of the equation in Figure 3.17 could easily fall well outside the values 0 and 1. Furthermore, even if we limited the output, it could still assume any value in the interval [0, 1], not just 0 or 1. For this reason, the idea behind logistic regression is to model the probability that the target variable, Y, assumes one of the values (say, 1). In this case, all the values between 0 and 1 are reasonable.
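The problem with applying a plain linear fit to a 0/1 target can be seen in a small example. The sketch below fits a straight line to a toy binary target with the closed-form least-squares formulas and evaluates it at a few points. The helper fit_line and the toy data are hypothetical and only serve as an illustration.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data: the target is 0 for small x and 1 for large x
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

intercept, slope = fit_line(xs, ys)


def predict(x):
    """Output of the fitted line at x."""
    return intercept + slope * x


for x in (0, 4.5, 12):
    print(f"x = {x:>4}: linear output = {predict(x):.3f}")
```

The fitted line falls below 0 for small x and exceeds 1 for large x, so its output cannot be read as a class label or as a probability.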

Let p denote the probability that the target variable, Y, is equal to 1, given a specific feature value x:

Figure 3.32: Definition of p


Let's also define the logit function:

...
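In its standard form, the logit function maps a probability p in (0, 1) to the log-odds, log(p / (1 - p)), and its inverse (the sigmoid, or logistic, function) maps any real number back to a value between 0 and 1. The sketch below implements both with the standard library. The function names logit and sigmoid are chosen for illustration, and the definitions are assumed to match the ones used in the chapter.

```python
import math


def logit(p):
    """Log-odds of a probability p, defined for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Inverse of logit: maps any real z to a value between 0 and 1."""
    # Branching on the sign of z avoids overflow in math.exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


for p in (0.1, 0.5, 0.8):
    z = logit(p)
    print(f"p = {p}: logit = {z:.4f}, sigmoid(logit) = {sigmoid(z):.4f}")
```

Applying sigmoid to a linear combination of the features therefore always yields a value that can be interpreted as a probability.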

Summary

In this chapter, we analyzed marketing campaign data in which new financial services were offered to customers. We paid particular attention to linear models and investigated how the probability of a successful outcome can be modeled as a function of different macroeconomic factors. From a technical perspective, we introduced linear models (such as linear and logistic regression) and paid particular attention to their interpretation.

In the next chapter, we will be analyzing a Polish company's bankruptcy data to try and understand the main reasons behind bankruptcy and see whether it is possible to identify early warning signs.
