Predicting Failures of Banks - Univariate Analysis

In recent years, big data and machine learning have become increasingly popular in many areas. It is generally believed that the more variables a classifier has available, the more accurate it becomes. However, this is not always true.

In this chapter, we will reduce the number of variables in the dataset by analyzing the individual predictive power of each variable, using several alternative approaches.

In this chapter, we will cover the following topics:

  • Feature selection algorithm
  • Filter methods
  • Wrapper methods
  • Embedded methods
  • Dimensionality reduction

Feature selection algorithm

In this real-world case of predicting the failure of banks, we have a high number of variables, or financial ratios, with which to train a classifier, so we would expect to obtain a strong predictive model. With this in mind, why would we want to select only a subset of these variables and reduce their number?

Well, in some cases, increasing the dimensionality of the problem by adding new features can actually reduce the performance of our model. This is known as the curse of dimensionality.

Because of this problem, adding more features, that is, increasing the dimensionality of our feature space, requires collecting more data. The number of observations needed grows exponentially with the dimensionality if we want to sustain the learning process and avoid overfitting.

This problem is commonly observed in cases in which the ratio between the number of variables and the number of observations is high.

Filter methods

Let's start with a filter method to reduce the number of variables in a first step. To do this, we will measure the predictive power of each variable individually, that is, its ability to correctly classify the target variable on its own.

In this case, we try to find variables that correctly differentiate between solvent and non-solvent banks. To measure the predictive power of a variable, we use a metric called Information Value (IV).

Specifically, given a variable grouped into n groups, each with a certain distribution of good banks and bad banks (or, in our case, solvent and non-solvent banks), the information value for that predictor can be calculated using the standard weight-of-evidence formulation as follows:

$$\mathrm{IV} = \sum_{i=1}^{n} \left( \mathrm{DistGood}_i - \mathrm{DistBad}_i \right) \times \ln\!\left( \frac{\mathrm{DistGood}_i}{\mathrm{DistBad}_i} \right)$$

Here, $\mathrm{DistGood}_i$ is the proportion of all solvent banks that fall into group $i$, and $\mathrm{DistBad}_i$ is the corresponding proportion of non-solvent banks.

The IV statistic is generally interpreted depending on its value:

  • < 0.02: The variable does not usefully separate the classes of the target variable
  • 0.02 to 0.1: The variable has weak predictive power
  • 0.1 to 0.3: The variable has medium predictive power
  • > 0.3: The variable has strong predictive power
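To make this concrete, here is a minimal sketch of computing the IV of a single variable by hand in R. This is not the book's own code: the banks data frame, the failed target (1 = non-solvent, 0 = solvent), and the capital_ratio column are hypothetical stand-ins for the chapter's dataset.

```r
# Minimal sketch: Information Value of one numeric variable.
# `banks`, `failed`, and `capital_ratio` are hypothetical stand-ins.

information_value <- function(x, y, bins = 10) {
  # Group the numeric variable into quantile-based bins
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = bins + 1),
                            na.rm = TRUE))
  groups <- cut(x, breaks = breaks, include.lowest = TRUE)
  tab <- table(groups, y)
  dist_good <- tab[, "0"] / sum(tab[, "0"])  # share of all solvent banks per bin
  dist_bad  <- tab[, "1"] / sum(tab[, "1"])  # share of all non-solvent banks per bin
  woe <- log(dist_good / dist_bad)           # weight of evidence per bin
  sum((dist_good - dist_bad) * woe, na.rm = TRUE)
}

# Quick check on simulated data
set.seed(1)
banks <- data.frame(capital_ratio = rnorm(1000),
                    failed        = rbinom(1000, 1, 0.2))
information_value(banks$capital_ratio, banks$failed)
```

In practice, CRAN packages such as Information automate this binning and IV calculation across all variables at once.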

Wrapper methods

As stated at the beginning of this section, wrapper methods evaluate subsets of variables, which allows them to detect possible interactions between variables; this puts them a step ahead of filter methods.

In wrapper methods, several combinations of variables are used to train a predictive model, and each combination is given a score according to the accuracy of the resulting model.

In other words, a classifier is trained iteratively on multiple combinations of variables, acting as a black box whose only output is used to build a ranking of the most important features.
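As an illustration of this idea (a sketch under assumptions, not the book's own code), the following uses recursive feature elimination (RFE) from the caret package on simulated stand-in data; RFE refits a random forest on shrinking subsets of variables and scores each subset by cross-validated accuracy.

```r
library(caret)          # install.packages("caret") if needed
library(randomForest)   # used by caret's rfFuncs helper set

# Simulated stand-ins for the bank data; all names are hypothetical
set.seed(1)
x <- data.frame(matrix(rnorm(200 * 8), ncol = 8))
y <- factor(ifelse(x[, 1] - x[, 2] + rnorm(200, sd = 0.5) > 0,
                   "failed", "solvent"))

# Score shrinking subsets of variables by cross-validated accuracy
ctrl   <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
result <- rfe(x, y, sizes = c(2, 4, 6), rfeControl = ctrl)

result              # accuracy for each subset size
predictors(result)  # variables in the best-performing subset
```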

Boruta package

One of the best-known wrapper packages in R is Boruta. This package is based mainly on the random forest algorithm.

Although this algorithm will be explained in more detail...
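In the meantime, here is a minimal, hedged sketch of how Boruta is typically invoked; the banks data frame and its columns are simulated stand-ins, not the chapter's data.

```r
library(Boruta)  # install.packages("Boruta") if needed

# Simulated stand-in data; the column names are hypothetical
set.seed(1)
banks <- data.frame(ratio1 = rnorm(300),
                    ratio2 = rnorm(300),
                    noise  = rnorm(300))
banks$failed <- factor(ifelse(banks$ratio1 + rnorm(300, sd = 0.5) > 0, 1, 0))

# Boruta compares each variable's importance against shuffled "shadow" copies
result <- Boruta(failed ~ ., data = banks)
print(result)  # Confirmed / Tentative / Rejected decision per attribute
getSelectedAttributes(result, withTentative = FALSE)
```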

Embedded methods

The main difference between embedded methods and the filter and wrapper approaches is that, in embedded methods, you cannot separate the learning and feature selection parts.

Regularization methods are the most common type of embedded feature selection methods.

In classification problems such as this one, logistic regression cannot handle the problem of multicollinearity, which occurs when variables are highly correlated. Moreover, when the number of observations, n, is not much larger than the number of covariates, p, the parameter estimates exhibit high variability. The likelihood can then be increased simply by adding more parameters, which results in overfitting.

If variables are highly correlated, or if collinearity exists, we can expect the model parameters and their variances to be inflated. The high variance is because of the wrongly specified...
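As a hedged sketch of an embedded method (not the book's own code), the following fits an L1-regularized (LASSO) logistic regression with the glmnet package on simulated stand-in data. Because the L1 penalty shrinks some coefficients exactly to zero, variable selection happens as part of the model fitting itself.

```r
library(glmnet)  # install.packages("glmnet") if needed

# Simulated stand-in data; the ratio names are hypothetical
set.seed(1)
x <- matrix(rnorm(300 * 10), ncol = 10,
            dimnames = list(NULL, paste0("ratio", 1:10)))
y <- rbinom(300, 1, plogis(x[, 1] - x[, 2]))

# alpha = 1 gives the LASSO penalty; cross-validation chooses lambda
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients shrunk exactly to zero drop out of the model, so
# feature selection happens inside the fitting process itself
coef(cv_fit, s = "lambda.min")
```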

Dimensionality reduction

Dimensionality reduction, or feature projection, consists of converting data in a high-dimensional space into a space of fewer dimensions.

High dimensionality increases the computational complexity substantially, and could even increase the risk of overfitting.

Dimensionality reduction techniques are useful for feature selection as well. In this case, the original variables are converted into new variables formed from different combinations of them. These combinations extract and summarize the relevant information of a complex dataset using fewer variables.

Different algorithms exist, with the following being the most important:

  • Principal Component Analysis (PCA)
  • Sammon mapping
  • Singular Value Decomposition (SVD)
  • Isomap
  • Locally Linear Embedding (LLE)
  • Laplacian eigenmaps
  • t-distributed Stochastic Neighbor Embedding (t-SNE)
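As a brief illustration of the first of these, here is a minimal PCA sketch using base R's prcomp(); the matrix below is simulated stand-in data rather than the chapter's dataset.

```r
# PCA with base R's prcomp(); `banks` is simulated stand-in data
set.seed(1)
banks <- matrix(rnorm(200 * 6), ncol = 6,
                dimnames = list(NULL, paste0("ratio", 1:6)))

# Center and scale first, since financial ratios live on different scales
pca <- prcomp(banks, center = TRUE, scale. = TRUE)

summary(pca)        # proportion of variance explained by each component
head(pca$x[, 1:2])  # data projected onto the first two principal components
```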

Although dimensionality reduction is not very common...

Summary

In this chapter, we saw how univariate analysis can be used both to analyze the data and to reduce the feature space of our problem. In the next chapter, we will see how these variables can be combined to obtain an accurate model, and several algorithms will be tested along the way.
