Predicting Failures of Banks - Multivariate Analysis

In this chapter, we are going to apply different algorithms with the aim of obtaining a good model from combinations of our predictors. The most common algorithm used in credit risk applications, such as credit scoring and rating, is logistic regression. In this chapter, we will also see how other algorithms can be applied to address some of the weaknesses of logistic regression.

In this chapter, we will be covering the following topics:

  • Logistic regression
  • Regularized methods
  • Testing a random forest model
  • Gradient boosting
  • Deep learning in neural networks
  • Support vector machines
  • Ensembles
  • Automatic machine learning

Logistic regression

Mathematically, a binary logistic model has a dependent variable with two categorical values. In our example, these values relate to whether or not a bank is solvent.

In a logistic model, the log odds, that is, the logarithm of the odds of belonging to a class, is modeled as a linear combination of one or more independent variables, as follows:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$$

Here, $p$ is the probability of default and $x_1, \dots, x_k$ are the independent variables.

The coefficients (beta values, β) of the logistic regression model are estimated using maximum likelihood estimation. This involves finding the coefficient values that maximize the likelihood of the observed outcomes or, equivalently, that minimize the discrepancy between the probabilities predicted by the model and the classes actually observed.

Logistic regression is very sensitive to the presence of outliers, and highly correlated variables (multicollinearity) should also be avoided, as they make the estimated coefficients unstable. Logistic regression in R can be applied as follows:

set.seed(1234)
# Fit a binary logistic regression of the Default flag on the predictors;
# the exact formula is truncated in this excerpt, so Default ~ . is a sketch
LogisticRegression <- glm(Default ~ ., data = train, family = binomial(link = "logit"))
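
Once fitted, the model returns probabilities through predict(). The following is a minimal sketch; the validation data frame valid is an assumed name, not shown in this excerpt:

# Predicted probability of failure for each bank in a hold-out sample
# (valid is a hypothetical data frame with the same columns as train)
pred_probs <- predict(LogisticRegression, newdata = valid, type = "response")
# Turn probabilities into a 0/1 classification with a 0.5 cut-off
pred_class <- ifelse(pred_probs > 0.5, 1, 0)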

Regularized methods

There are three common regularized methods:

  • Lasso
  • Ridge
  • Elastic net

In this section, we will see how these methods can be implemented in R. For these models, we will use the h2o package. This provides an open source, in-memory, distributed, fast, and scalable predictive analytics platform for machine learning. It helps in creating models that are built on big data, and it is well suited to enterprise applications because it supports production-quality deployments.

For more information on the h2o package, please visit its documentation at https://cran.r-project.org/web/packages/h2o/index.html.

This package is very useful because it brings together several common machine learning algorithms in one package. Moreover, these algorithms can be executed in parallel on our own computer, which makes them very fast. The package includes...
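
As an illustration of the three methods, here is a minimal, hedged sketch using h2o.glm, in which the alpha parameter selects the penalty. The column names Default and ID_RSSD come from the data frame shown later in this chapter; everything else is an assumption:

library(h2o)
h2o.init()  # start a local h2o cluster

# Move the training data into h2o and mark the target as categorical
train_h2o <- as.h2o(train)
train_h2o$Default <- as.factor(train_h2o$Default)
predictors <- setdiff(colnames(train_h2o), c("ID_RSSD", "Default"))

# alpha mixes the two penalties: 1 = lasso, 0 = ridge, anything in
# between = elastic net; lambda_search tries a grid of penalty strengths
lasso   <- h2o.glm(x = predictors, y = "Default", training_frame = train_h2o,
                   family = "binomial", alpha = 1, lambda_search = TRUE)
ridge   <- h2o.glm(x = predictors, y = "Default", training_frame = train_h2o,
                   family = "binomial", alpha = 0, lambda_search = TRUE)
elastic <- h2o.glm(x = predictors, y = "Default", training_frame = train_h2o,
                   family = "binomial", alpha = 0.5, lambda_search = TRUE)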

Testing a random forest model

A random forest is an ensemble of decision trees. In a decision tree, the training sample is recursively split, based on the independent variables, into two or more increasingly homogeneous sets. The algorithm handles both categorical and continuous variables: at each node, the best attribute is selected and the sample is split on it, and this continues until a stopping criterion is met. Each tree in the forest is a weak learner built on a random subset of the rows and a random subset of the columns, and the higher the number of trees, the lower the variance of the ensemble. For regression, the forest averages the predictions of all of the trees to make a final prediction; for classification, it aggregates them by majority vote (or by averaging the predicted class probabilities).

When a random forest is trained, some different parameters can be set...
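
The exact settings used in the book are truncated here; the following hedged sketch shows the main parameters of h2o.randomForest, reusing the train_h2o frame and predictors vector from the previous section (the parameter values are illustrative):

rf_model <- h2o.randomForest(x = predictors, y = "Default",
                             training_frame = train_h2o,
                             ntrees = 500,    # number of trees in the forest
                             max_depth = 20,  # maximum depth of each tree
                             mtries = -1,     # -1 = default number of sampled columns
                             seed = 1234)
h2o.performance(rf_model, train = TRUE)  # performance on the training sample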

Gradient boosting

Gradient boosting combines many weak predictors, typically decision trees, into one strong predictor, which makes the result robust. It is similar to a random forest in that it is based on decision trees, but with a key difference: the sample is not resampled from one tree to another; instead, the weight given to each observation (equivalently, the error left by the previous trees) changes.

Boosting trains trees sequentially, using information from the previously trained trees. We first fit a decision tree to the training dataset, and then fit another model whose only task is to correct the errors made by the current ensemble. This process is repeated until the specified number of trees, or some other stopping rule, is reached.

More specific details about the algorithm can be found in the documentation of the h2o package. While training the algorithm, we will need to define...
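
As a hedged sketch of those choices, the main h2o.gbm parameters look as follows; the values shown are illustrative assumptions, not the book's settings:

gbm_model <- h2o.gbm(x = predictors, y = "Default",
                     training_frame = train_h2o,
                     ntrees = 200,       # number of boosting iterations
                     learn_rate = 0.05,  # shrinkage applied to each new tree
                     max_depth = 5,      # depth of the individual trees
                     seed = 1234)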

Deep learning in neural networks

For this kind of machine learning problem, we need systems that can model nonlinear relationships in the data. This is very important when making predictions for bankruptcy problems, since the relationship between default and the explanatory variables is rarely linear. Neural networks are therefore a natural choice.

Artificial neural networks (ANNs) have long been used on bankruptcy problems. An ANN is a computing system made up of a number of interconnected processing units, which produce outputs by transforming the information they receive and responding dynamically to their inputs. A prominent and basic example of an ANN is the multilayer perceptron (MLP). An MLP can be represented as follows:

Except for the input nodes, each node is a neuron that applies a nonlinear activation function to the weighted sum of the inputs it receives.

As is evident from...
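
A minimal, hedged sketch of an MLP trained with h2o.deeplearning follows; the architecture and training length are illustrative assumptions:

dl_model <- h2o.deeplearning(x = predictors, y = "Default",
                             training_frame = train_h2o,
                             hidden = c(50, 50),        # two hidden layers of 50 neurons
                             activation = "Rectifier",  # ReLU activation
                             epochs = 50,               # passes over the training data
                             seed = 1234)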

Support vector machines

The support vector machine (SVM) algorithm is a supervised learning technique. To understand this algorithm, take a look at the following diagram for the optimal hyperplane and maximum margin:

In this classification problem, there are only two classes, but many possible separating hyperplanes exist. As shown in the preceding diagram, the SVM classifies the objects by finding the optimal hyperplane, the one that maximizes the margin between the two classes and therefore separates them to the greatest possible extent. The samples lying closest to this hyperplane, on the edges of the margin, are known as support vectors. Finding the hyperplane is then treated as an optimization problem and can be solved by optimization techniques, the most common one being the use of Lagrange multipliers.

Even in a linearly separable problem, such as the one shown in the preceding diagram, it is not always...
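
The SVM code itself is truncated in this excerpt. As a hedged sketch, one common choice in R is the e1071 package (the package choice and the settings here are assumptions):

library(e1071)

# Radial-basis-kernel SVM on the training data frame; probability = TRUE
# enables class-probability estimates such as the SVM column shown later
svm_model <- svm(as.factor(Default) ~ . - ID_RSSD, data = train,
                 kernel = "radial", probability = TRUE)

svm_pred  <- predict(svm_model, newdata = train, probability = TRUE)
svm_probs <- attr(svm_pred, "probabilities")[, "1"]  # P(Default = 1)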

Ensembles

At this point, we have trained five different models. The predictions are stored in two data frames, one for training and the other for the validation samples:

head(summary_models_train)
## ID_RSSD Default GLM RF GBM deep
## 4 37 0 0.0013554364 0 0.000005755001 0.000000018217172
## 21 242 0 0.0006967876 0 0.000005755001 0.000000002088871
## 38 279 0 0.0028306028 0 0.000005240935 0.000003555978680
## 52 354 0 0.0013898732 0 0.000005707480 0.000000782777042
## 78 457 0 0.0021731695 0 0.000005755001 0.000000012535539
## 81 505 0 0.0011344433 0 0.000005461855 0.000000012267744
## SVM
## 4 0.0006227083
## 21 0.0002813123
## 38 0.0010763298
## 52 0.0009740568
## 78 0.0021555739
## 81 0.0005557417

Let's summarize the accuracy of the previously trained models...
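
The chapter's own comparison is truncated here. As a hedged sketch, a simple ensemble can be built by averaging the class probabilities of the five models and comparing AUCs, here with the pROC package (the package choice is an assumption):

model_cols <- c("GLM", "RF", "GBM", "deep", "SVM")

# A naive ensemble: the average of the five predicted probabilities
summary_models_train$ensemble <- rowMeans(summary_models_train[, model_cols])

# AUC of each individual model and of the ensemble on the training sample
library(pROC)
sapply(c(model_cols, "ensemble"), function(m)
  as.numeric(auc(summary_models_train$Default, summary_models_train[[m]])))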

Automatic machine learning

Now that we have learned how to develop a powerful model to predict bank failures, we will test a final option for developing different models. Specifically, we will try out automatic machine learning (autoML), which is included in the h2o package. The process that we carried out by hand, building many models and finding the best one without any prior knowledge, is done automatically by the autoML function. This function trains different models by trying different grids of parameters. Moreover, it also trains stacked ensembles, that is, models built on top of the previously trained models, in search of a more accurate or more predictive combination.

In my opinion, using this function before launching any model is highly recommended, as it gives you a reference starting point. Using an automatic approach, we can assess the most reliable algorithms, the most important potential variables to be...
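
A minimal, hedged sketch of the call follows, reusing the earlier train_h2o frame and predictors vector; the budgets given to max_models and max_runtime_secs are illustrative:

aml <- h2o.automl(x = predictors, y = "Default",
                  training_frame = train_h2o,
                  max_models = 20,         # cap on the number of base models
                  max_runtime_secs = 600,  # overall time budget in seconds
                  seed = 1234)

print(aml@leaderboard)    # all models, including stacked ensembles, ranked by metric
best_model <- aml@leader  # the top model on the leaderboard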

Summary

In this chapter, we used different models and algorithms to try to improve our results. All of the algorithms obtained good results here, but this will not always be the case in other problems. You can apply different algorithms to your own problems and test which combinations of parameters solve your specific problem best. A combination of different algorithms, or an ensemble, might be a good option as well.

In the next chapter, we will continue by looking at other real problems—specifically, data visualization of economic imbalances in European countries.
