You're reading from The Data Analysis Workshop

Product type: Book
Published in: Jul 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781839211386
Edition: 1st Edition
Authors (3):
Gururajan Govindan

Gururajan Govindan is a data scientist, intrapreneur, and trainer with more than seven years of experience working across domains such as finance and insurance. He is also an author of The Data Analysis Workshop, a book focusing on data analytics. He is well known for his expertise in data-driven decision-making and machine learning with Python.

Shubhangi Hora

Shubhangi Hora is a data scientist, Python developer, and published writer. With a background in computer science and psychology, she is particularly passionate about healthcare-related AI, including mental health. Shubhangi is also a trained musician.

Konstantin Palagachev

Konstantin Palagachev holds a Ph.D. in applied mathematics and optimization, with an interest in operations research and data analysis. He is recognized for his passion for delivering data-driven solutions and expertise in the area of urban mobility, autonomous driving, insurance, and finance. He is also a devoted coach and mentor, dedicated to sharing his knowledge and passion for data science.

6. Analysis of Credit Card Defaulters

Overview

In this chapter, you will analyze the characteristics of the customers who are most likely to default on their credit card payments, using univariate and bivariate analysis techniques. Using the crosstab function, you will investigate the relationships between different features of the dataset. By the end of this chapter, you will be able to build a profile of the customer who is statistically most likely to default on their credit card payments.

Introduction

In the previous chapter, we analyzed online shoppers' purchasing intent and derived various useful insights from our findings. We explored and utilized the K-means clustering technique, along with univariate and bivariate analysis, and also studied the linear relationships between each feature of the dataset to build a proper evaluation of the dataset. The results derived from the analysis would help a business to identify the pain points and develop new business strategies to tackle them.

In this chapter, we will analyze credit card payments of customers and use their transactional data to study the characteristics of the customers who are most likely to default, eventually building a profile of these customers.

Credit card default has been a field of interest and extensive analysis for more than a decade. There are two types of loan – secured and unsecured. A secured loan is one where some collateral is mandatory, so whenever a default happens, the...

Importing the Data

Before we begin with the actual analysis, we will need to import the required packages as follows:

# Import basic libraries
import numpy as np
import pandas as pd
# Import visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
# Jupyter magic: render plots inline in the notebook
%matplotlib inline

Next, read/import the dataset into the work environment:

df = pd.read_excel('default_credit.xls')
df.head(5)

The output will be as follows:

Figure 6.2: Top five rows of the DataFrame

Check the metadata of the DataFrame:

# Get metadata (column names, dtypes, and non-null counts) for the dataset
df.info()

The output will be similar to the image shown below:

Figure 6.3: Information of the DataFrame

Check the descriptive statistics for the numerical columns in the DataFrame:

df.describe().T

The output will be as follows:

Figure 6.4: Descriptive statistics of the DataFrame

Next, check for null values:

...
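
The book's own code for this step is not shown in this excerpt. As a minimal sketch of one common way to check for null values in pandas, assuming a small made-up DataFrame (`toy`, `null_counts`, and `has_nulls` are illustrative names, not the chapter's):

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only; not the chapter's dataset
toy = pd.DataFrame({
    "SEX": [1, 2, 2, 1],
    "EDUCATION": [2, np.nan, 1, 3],
    "MARRIAGE": [1, np.nan, np.nan, 2],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# True if at least one value anywhere in the DataFrame is missing
has_nulls = bool(toy.isnull().values.any())
print(has_nulls)
```

A column whose count is 0 has no missing values, so it needs no imputation or row removal before analysis.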

Data Preprocessing

Before proceeding to univariate analysis, let's look at the unique values in the columns. Doing so identifies the subcategories in each column, which in turn lets us see which subcategories have higher or lower counts. Take the EDUCATION column, for example. We want to know which subcategories it contains and which of them has the highest count; that is, is the highest level of education for most of our customers College or University?

This step is a precursor to building a profile of our customers.
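
The idea of listing subcategories and comparing their counts can be sketched as follows. This is an illustration on made-up values, not the chapter's data; the integer codes and the name `toy` are assumptions.

```python
import pandas as pd

# Made-up values for illustration; the integer codes are placeholders
toy = pd.DataFrame({"EDUCATION": [1, 2, 2, 3, 2, 1, 2]})

# Unique subcategories as plain Python ints, sorted in ascending order
unique_values = sorted(toy["EDUCATION"].unique().tolist())

# Number of rows in each subcategory, most frequent first
counts = toy["EDUCATION"].value_counts()

print(unique_values)
print(counts)
```

Here `.tolist()` converts NumPy integers to plain Python integers, which keeps the printed list free of NumPy type wrappers on recent NumPy versions.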

Let's now find the unique values in the SEX column and print them sorted in ascending order:

print('SEX ' + str(sorted(df['SEX'].unique())))

The output will be as follows:

SEX [1, 2]

The following code prints...

Exploratory Data Analysis

A large share of the time in a data science project is typically spent on Exploratory Data Analysis (EDA). In EDA, we investigate the data to find hidden patterns and outliers, mostly with the help of visualization. Performing EDA lets us uncover the underlying structure of the data and check our hypotheses against summary statistics. We can split EDA into three parts:

  • Univariate analysis
  • Bivariate analysis
  • Correlation

Let's look at each of the parts one by one in the following sections.

Univariate Analysis

Univariate analysis is the simplest form of analysis: we analyze each feature (that is, each column of a DataFrame) on its own and try to uncover the pattern or distribution of its values.
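
As a rough sketch of what "the distribution of a single column" means for categorical features, the following computes the share of rows in each subcategory, one column at a time. The data and integer codes are made up for illustration, and `toy` and `shares` are hypothetical names.

```python
import pandas as pd

# Made-up data for illustration; the integer codes are placeholders,
# not the chapter's actual category encoding
toy = pd.DataFrame({
    "DEFAULT":   [0, 0, 1, 0, 1, 0, 0, 0],
    "SEX":       [1, 2, 2, 1, 2, 2, 1, 2],
    "EDUCATION": [1, 2, 2, 3, 2, 1, 2, 2],
    "MARRIAGE":  [1, 2, 2, 1, 3, 2, 1, 2],
})

categorical_columns = ["DEFAULT", "SEX", "EDUCATION", "MARRIAGE"]

# Share of rows in each subcategory, computed one column at a time
shares = {
    column: toy[column].value_counts(normalize=True).sort_index()
    for column in categorical_columns
}

for column, distribution in shares.items():
    print(column)
    print(distribution)
```

In a notebook, the same information could also be drawn as a bar chart, for example with seaborn's `sns.countplot(x="DEFAULT", data=toy)`.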

In univariate analysis, we will analyze the categorical columns (DEFAULT, SEX, EDUCATION, and MARRIAGE) to mine useful information about the data.

Let's take each of the variables one by one:

  1. The DEFAULT column:

    Let's look at the...

Correlation

In this section, we will cover correlation – what does correlation mean, and how do we check the correlation between the DEFAULT column and other columns in our dataset?

Correlation measures the strength and direction of the linear relationship between two variables. Say, for example, we have two variables, A and B. If the value of B tends to increase when the value of A increases, we say the variables are positively correlated. On the other hand, if the value of B tends to decrease when the value of A increases, we say the variables are negatively correlated. There could also be a situation where an increase in the value of A has no systematic effect on the value of B, in which case we say the variables are uncorrelated.

The value of a correlation coefficient ranges from -1 to 1, with 1 indicating a perfect positive correlation, -1 a perfect negative correlation, and 0 no linear correlation.
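
The three cases described above can be illustrated with a minimal sketch on made-up variables. The column names `B_pos`, `B_neg`, and `B_none` are hypothetical, and `DataFrame.corr()` is used with its default Pearson method.

```python
import pandas as pd

# Made-up variables for illustration
toy = pd.DataFrame({"A": [1, 2, 3, 4, 5]})
toy["B_pos"] = 2 * toy["A"] + 1    # rises as A rises
toy["B_neg"] = 10 - 3 * toy["A"]   # falls as A rises
toy["B_none"] = [2, 1, 3, 1, 2]    # no linear trend with A

# Pairwise Pearson correlation coefficients between all columns
corr = toy.corr()

# Correlation of every column with A
print(corr["A"])
```

The coefficient for `B_pos` is 1, for `B_neg` it is -1, and for `B_none` it is 0 up to floating-point rounding. A matrix like `corr` is what a heatmap call such as `sns.heatmap(corr, annot=True)` would visualize.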

By studying the correlation between the DEFAULT column and other columns with the help of a heatmap, we can figure out which column/variable...

Building a Profile of a High-Risk Customer

Based on the analysis performed in the previous sections, we can now build a profile of the customer who is most likely to default. With this predicted customer profile, credit card companies can take preventive steps (such as reducing credit limits or increasing the rate of interest) and can demand additional collateral from customers who are deemed to be high risk.

A customer who satisfies most of the following conditions can be classified as high risk, that is, as having a higher probability of default:

  • A male customer is more likely to default than a female customer.
  • People with a relationship status of "other" are more likely to default than married or single people.
  • A customer whose highest educational qualification is a high-school diploma is more likely to default than a customer who has gone to graduate school or university.
  • A customer who has delayed payment for 2 consecutive...
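
Statements like those in the list above compare default rates between subgroups. A minimal sketch of how such a comparison could be computed with the pandas crosstab function, using made-up data and assumed integer codes (this does not reproduce the chapter's figures):

```python
import pandas as pd

# Made-up data; the codes (SEX: 1/2, DEFAULT: 0/1) are assumed for illustration
toy = pd.DataFrame({
    "SEX":     [1, 1, 1, 1, 2, 2, 2, 2],
    "DEFAULT": [1, 1, 0, 0, 1, 0, 0, 0],
})

# Row-normalized cross-tabulation: each row sums to 1, so the DEFAULT == 1
# column holds the default rate within each SEX group
rates = pd.crosstab(toy["SEX"], toy["DEFAULT"], normalize="index")
print(rates)
```

Normalizing by row rather than using raw counts matters when the groups differ in size, because it compares rates instead of totals.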

Summary

In this chapter, we applied univariate EDA to a given dataset to plot the distribution of individual features and implemented bivariate analysis to understand the relationship between two features. We also used a correlation heatmap to determine the correlation between the features of the DataFrame. Drawing conclusions from the results of our analyses, we were able to build a statistically probable profile of a high-risk customer who is most likely to default on their credit card payments.

In the next chapter, we will analyze the medical data of 303 patients and link the data features with the diagnosis of heart disease.
