You're reading from  R Machine Learning By Example

Product type: Book
Published in: Mar 2016
Reading level: Intermediate
ISBN-13: 9781784390846
Edition: 1st Edition

Chapter 5. Credit Risk Detection and Prediction – Descriptive Analytics

In the last two chapters, you saw some interesting problems revolving around the retail and e-commerce domains. You now know how to detect and predict shopping trends from shopping patterns as well as how to build recommendation systems. As you may remember from Chapter 1, Getting started with R and Machine Learning, the applications of machine learning are diverse, and we can apply the same concepts and techniques to solve a wide variety of real-world problems. We will be tackling a completely new problem here, but hold on to what you have learned, because several of the earlier concepts will come in handy soon!

In the next couple of chapters, we will be tackling a new problem related to the financial domain. We will be looking at customers of a particular German bank who could be credit risks for the bank, based on data that has been previously collected. We will perform descriptive and exploratory...

Types of analytics


Before we start tackling our next challenge, it will be useful to get an idea of the different types of analytics that make up the data science domain. We use a variety of data mining and machine learning techniques to solve different data problems. However, depending on the mechanism of the technique and its end result, we can broadly classify analytics into four types, which are explained next:

  • Descriptive analytics: This is what we use when we have some data to analyze. We start by looking at the different attributes of the data, extract meaningful features, and use statistics and visualizations to understand what has already happened. The main aim of descriptive analytics is to get a broad idea of the kind of data we are dealing with and to summarize what has happened in the past. Almost 80% of all analytics in businesses today is descriptive.

  • Diagnostic analytics: This is sometimes clubbed together with descriptive analytics. Here the main...

Our next challenge


We have dealt with some interesting applications of machine learning in the e-commerce domain in the last couple of chapters. For the next two chapters, our big challenge will be in the financial domain. We will be using data analysis and machine learning techniques to analyze financial data from a German bank. This data contains a lot of information about the customers of that bank. We will be analyzing it in two stages: descriptive and predictive analytics.

  • Descriptive: Here we will look closely at the data and its various attributes. We will perform descriptive analysis and build visualizations to see the kind of features we are dealing with and how they might be related to credit risk. The data we will be dealing with is already labeled, so we will be able to see how many customers were credit risks and how many weren't. We will also look closely at each feature in the data and understand its significance which will be useful in...

What is credit risk?


We have been using the term credit risk since the start of this chapter, and many of you might be wondering what exactly it means, even though you might have guessed it after reading the previous section. Here, we will explain the term clearly so that you will have no problem understanding the data and its features when we analyze them in the subsequent sections.

The standard definition of credit risk is the risk of default on a debt, which arises when the borrower fails to make the required debt payments on time. This risk is borne by the lender, who loses both the principal amount and the interest on it.
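To make the definition concrete, here is a small worked sketch in R. The principal, interest rate, loan term, and the use of simple interest are all hypothetical choices made for illustration; none of them comes from the dataset used in this chapter.

```r
# Hypothetical loan terms, chosen only for illustration
principal   <- 10000  # amount lent
annual.rate <- 0.05   # 5% simple interest per year (assumption)
term.years  <- 2      # loan duration in years

# Interest the lender expected to earn over the full term
expected.interest <- principal * annual.rate * term.years

# If the borrower defaults before repaying anything, the lender loses
# both the principal and the interest it expected to earn
loss.on.default <- principal + expected.interest

expected.interest
loss.on.default
```

With these assumed numbers, the expected interest is 1000 and the lender's total loss on default is 11000.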

In our case, we will be dealing with a bank which acts as the financial organization giving out loans to customers who apply for them. Hence, customers who might default on the loan payment would be credit risks for the bank. By analyzing customer data and applying machine learning algorithms on it...

Getting the data


The first step in our data analysis pipeline is to get the dataset. We have already cleaned the data and given the data attributes meaningful names; you can check that out by opening the german_credit_dataset.csv file. You can also get the original dataset from its source, the Department of Statistics, University of Munich, at the following URL: http://www.statistik.lmu.de/service/datenarchiv/kredit/kredit_e.html.

Once you have downloaded the data, start R in the same directory as the data file and run the following commands to get a feel for the data we will be dealing with in the following sections:

> # load the data into a data frame
> credit.df <- read.csv("german_credit_dataset.csv", header = TRUE, sep = ",")
> # class should be data.frame
> class(credit.df)
[1] "data.frame"
> 
> # get a quick peek at the data
> head(credit.df)

The following figure shows the first six rows of the data. Each column indicates...
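Besides head, a few other base R functions give a quick overview of a freshly loaded data frame. The sketch below is self-contained: it builds a tiny stand-in data frame, writes it to a temporary CSV file, and reads it back with the same read.csv call used above. The column names and values are hypothetical and are not taken from german_credit_dataset.csv.

```r
# Hypothetical stand-in for the real file: column names and values
# are invented for illustration only
toy.df <- data.frame(
  credit.rating = c(1, 0, 1, 1),
  credit.amount = c(1049, 2799, 841, 2122),
  age           = c(21, 36, 23, 39)
)

# Write it to a temporary CSV file and read it back the same way
toy.file <- tempfile(fileext = ".csv")
write.csv(toy.df, toy.file, row.names = FALSE)
toy.credit.df <- read.csv(toy.file, header = TRUE, sep = ",")

# Number of rows and columns
dim(toy.credit.df)

# Column names, types, and the first few values of each column
str(toy.credit.df)

# Per-column summary statistics
summary(toy.credit.df)
```

The same three calls can be applied to credit.df once the real file has been loaded.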

Data preprocessing


In this section, we will be focusing on data preprocessing which includes data cleaning, transformation, and normalizations if required. Basically, we perform operations to get the data ready before we start performing any analysis on it.
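As a rough illustration of what normalization can look like, the following sketch rescales a numeric vector in two common ways using base R. The values are invented, and whether either form of scaling is needed for the credit dataset is not something this sketch decides.

```r
# Hypothetical numeric feature (values invented for illustration)
amount <- c(250, 1000, 1750, 4000)

# z-score scaling: subtract the mean, divide by the standard deviation
amount.z <- as.numeric(scale(amount, center = TRUE, scale = TRUE))

# min-max scaling: map the values onto the range [0, 1]
amount.minmax <- (amount - min(amount)) / (max(amount) - min(amount))

round(amount.z, 2)
amount.minmax
```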

Dealing with missing values

There will be situations when the data you are dealing with has missing values, which are often represented as NA in R. There are several ways to detect them; we will show you a couple of them next.

> # check if data frame contains NA values
> sum(is.na(credit.df))
[1] 0
> 
> # check if total records reduced after removing rows with NA 
> # values
> sum(complete.cases(credit.df))
[1] 1000

The is.na function is really useful as it helps in finding out if any element has an NA value in the dataset. There is another way of doing the same by using the complete.cases function, which essentially returns a logical vector saying whether...
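Since the credit dataset has no missing values, both checks above return uneventful results. The sketch below runs the same functions on a small, made-up data frame that does contain NA values, so that their behavior is easier to see. The column names and values are hypothetical.

```r
# Hypothetical data frame with a few missing values
# (invented for illustration only)
toy.df <- data.frame(
  credit.amount = c(1049, NA, 841, 2122),
  age           = c(21, 36, NA, 39),
  savings       = c(1, 2, NA, 3)
)

# Total number of NA values in the whole data frame
sum(is.na(toy.df))

# Number of NA values per column
colSums(is.na(toy.df))

# Logical vector: TRUE for rows that have no NA in any column
complete.cases(toy.df)

# Keep only the complete rows
clean.df <- toy.df[complete.cases(toy.df), ]
nrow(clean.df)
```

Here the data frame holds three NA values spread over two rows, so only two of the four rows survive the filter.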

Data analysis and transformation


Now that we have processed our data, it is ready for analysis. We will be carrying out descriptive and exploratory analysis in this section, as mentioned earlier. We will analyze the different dataset attributes and talk about their significance, semantics, and relationship with the credit risk attribute. We will be using statistical functions, contingency tables, and visualizations to depict all of this.
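A minimal sketch of the contingency-table idea is shown below on made-up data. The column names, the category codes, and the convention that a credit rating of 1 means good and 0 means bad are assumptions for illustration. Fisher's exact test is used here simply as one possible test of association for a small table.

```r
# Hypothetical labeled data (values invented for illustration);
# assumed coding: credit.rating 1 = good, 0 = bad
toy.df <- data.frame(
  credit.rating   = factor(c(1, 1, 0, 1, 0, 1, 0, 1)),
  account.balance = factor(c(1, 2, 1, 3, 1, 3, 2, 2))
)

# Contingency table of counts: rows = account balance, columns = rating
counts <- table(toy.df$account.balance, toy.df$credit.rating,
                dnn = c("account.balance", "credit.rating"))
counts

# Row-wise proportions: share of each rating within each balance class
round(prop.table(counts, margin = 1), 2)

# Fisher's exact test of association between the two variables
fisher.test(counts)$p.value
```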

Besides this, we will also transform some of the features in our dataset, namely the categorical variables. We will combine category classes that have similar semantics and remove classes that account for a very small proportion of the data by merging them with a similar class. Some reasons for doing this include preventing the overfitting of our predictive models, which we will be building in Chapter 6, Credit Risk Detection and Prediction – Predictive Analytics, linking semantically similar classes together and also because modeling...
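The class-merging idea can be sketched in base R as follows. The feature name, its class codes, and the choice of which classes count as semantically similar are all hypothetical; the sketch only shows one way to fold a rare factor level into another.

```r
# Hypothetical categorical feature with four classes
# (invented for illustration only)
purpose <- factor(c(1, 1, 2, 3, 1, 2, 4, 1, 2, 3))

# Proportion of each class before merging
prop.table(table(purpose))

# Merge the rare class "4" into class "3"; that these two classes are
# semantically similar is an assumption of this sketch
purpose.merged <- purpose
levels(purpose.merged)[levels(purpose.merged) == "4"] <- "3"

# Class counts after merging
table(purpose.merged)
```

Assigning an existing level name to another level makes R combine the two, so the merged factor has three levels and the same number of observations as before.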

Next steps


We have analyzed our dataset, performed the necessary feature engineering and statistical tests, built visualizations, and gained substantial domain knowledge about credit risk analysis and the kind of features banks consider when they analyze customers. We analyzed each feature in detail both to give you that domain knowledge and to help you get familiar with the techniques for performing exploratory and descriptive analysis of any dataset in the future. So, what next? Now comes the really interesting part of using this dataset: building feature sets from this data and feeding them into predictive models to predict which customers are potential credit risks and which are not. As mentioned previously, there are two steps to this: data and algorithms. In fact, we will go a step further and say...

Summary


Congratulations on staying until the end of this chapter! You have learned several important things along the way. You now have an idea about one of the most important areas in the financial domain, that is, credit risk analysis. Besides this, you have also gained significant domain knowledge about how banks analyze customers for their credit ratings and what kind of attributes and features they consider. Descriptive and exploratory analysis of the dataset also gave you an insight into how to start working from scratch when all you have is a problem to solve and a dataset! You now know how to perform feature engineering, build beautiful publication-quality visualizations using ggplot2, and perform statistical tests to check feature associations. Finally, we wrapped up our discussion by talking about feature sets and gave a brief introduction to several supervised machine learning algorithms which will help us in the next step of predicting...
