Predicting Failures of Banks - Univariate Analysis

In recent years, big data and machine learning have become increasingly popular in many areas. It is generally believed that the more variables a classifier has available, the more accurate it becomes. However, this is not always true.

In this chapter, we will reduce the number of variables in the dataset by analyzing the individual predictive power of each variable, using several alternative approaches.

In this chapter, we will cover the following topics:

  • Feature selection algorithm
  • Filter methods
  • Wrapper methods
  • Embedded methods
  • Dimensionality reduction

Feature selection algorithm

In this real-world case of predicting the failure of banks, we have a high number of variables, or financial ratios, with which to train a classifier, so we would expect to obtain a strong predictive model. With this in mind, why would we want to select only a subset of these variables and reduce their number?

Well, in some cases, increasing the dimensionality of the problem by adding new features can actually reduce the performance of our model. This is known as the curse of dimensionality.

Because of this problem, adding more features, that is, increasing the dimensionality of our feature space, requires collecting more data. The number of observations needed grows exponentially with the dimensionality if we want to sustain the learning process and avoid overfitting.

This problem is commonly observed in cases in which the ratio between the number of variables and the number of observations is high.

Filter methods

Let's start with a filter method to reduce the number of variables in a first step. To do this, we will measure the predictive power of each variable individually, that is, its ability to correctly classify the target variable on its own.

In this case, we try to find variables that correctly differentiate between solvent and non-solvent banks. To measure the predictive power of a variable, we use a metric called Information Value (IV).

Specifically, given a variable grouped into n groups, each with a certain distribution of good banks and bad banks (or, in our case, solvent and non-solvent banks), the information value for that predictor can be calculated using the standard weight-of-evidence formulation as follows:

$$\mathrm{IV} = \sum_{i=1}^{n} \left( \mathrm{DistGood}_i - \mathrm{DistBad}_i \right) \times \ln\!\left( \frac{\mathrm{DistGood}_i}{\mathrm{DistBad}_i} \right)$$

Here, $\mathrm{DistGood}_i$ is the proportion of all solvent banks that fall into group $i$, and $\mathrm{DistBad}_i$ is the corresponding proportion of non-solvent banks.

The IV statistic is generally interpreted depending on its value:

  • < 0.02: The variable does not usefully separate the classes of the target variable
  • 0.02 to 0.1: The variable has weak predictive power
  • 0.1 to 0.3: The variable has medium predictive power
  • > 0.3: The variable has strong predictive power
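To make this concrete, here is a minimal sketch of computing the IV of a single variable by hand in R. This is not the book's own code: the banks data frame, the failed target (1 = non-solvent, 0 = solvent), and the capital_ratio column are hypothetical stand-ins for the chapter's dataset.

```r
# Minimal sketch: Information Value of one numeric variable.
# `banks`, `failed`, and `capital_ratio` are hypothetical stand-ins.

information_value <- function(x, y, bins = 10) {
  # Group the numeric variable into quantile-based bins
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = bins + 1),
                            na.rm = TRUE))
  groups <- cut(x, breaks = breaks, include.lowest = TRUE)
  tab <- table(groups, y)
  dist_good <- tab[, "0"] / sum(tab[, "0"])  # share of all solvent banks per bin
  dist_bad  <- tab[, "1"] / sum(tab[, "1"])  # share of all non-solvent banks per bin
  woe <- log(dist_good / dist_bad)           # weight of evidence per bin
  sum((dist_good - dist_bad) * woe, na.rm = TRUE)
}

# Quick check on simulated data
set.seed(1)
banks <- data.frame(capital_ratio = rnorm(1000),
                    failed        = rbinom(1000, 1, 0.2))
information_value(banks$capital_ratio, banks$failed)
```

In practice, CRAN packages such as Information automate this binning and IV calculation across all variables at once.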

Wrapper methods

As stated at the beginning of this section, wrapper methods evaluate subsets of variables, which allows them to detect possible interactions between variables; this puts them a step ahead of filter methods.

In wrapper methods, several combinations of variables are used to train a predictive model, and each combination is given a score according to the accuracy of the resulting model.

In other words, a classifier is trained iteratively on multiple combinations of variables, acting as a black box whose only output is used to build a ranking of the most important features.
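As an illustration of this idea (a sketch under assumptions, not the book's own code), the following uses recursive feature elimination (RFE) from the caret package on simulated stand-in data; RFE refits a random forest on shrinking subsets of variables and scores each subset by cross-validated accuracy.

```r
library(caret)          # install.packages("caret") if needed
library(randomForest)   # used by caret's rfFuncs helper set

# Simulated stand-ins for the bank data; all names are hypothetical
set.seed(1)
x <- data.frame(matrix(rnorm(200 * 8), ncol = 8))
y <- factor(ifelse(x[, 1] - x[, 2] + rnorm(200, sd = 0.5) > 0,
                   "failed", "solvent"))

# Score shrinking subsets of variables by cross-validated accuracy
ctrl   <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
result <- rfe(x, y, sizes = c(2, 4, 6), rfeControl = ctrl)

result              # accuracy for each subset size
predictors(result)  # variables in the best-performing subset
```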

Boruta package

One of the best-known wrapper packages in R is Boruta. This package is based mainly on the random forest algorithm.

Although this algorithm will be explained in more detail...
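In the meantime, here is a minimal, hedged sketch of how Boruta is typically invoked; the banks data frame and its columns are simulated stand-ins, not the chapter's data.

```r
library(Boruta)  # install.packages("Boruta") if needed

# Simulated stand-in data; the column names are hypothetical
set.seed(1)
banks <- data.frame(ratio1 = rnorm(300),
                    ratio2 = rnorm(300),
                    noise  = rnorm(300))
banks$failed <- factor(ifelse(banks$ratio1 + rnorm(300, sd = 0.5) > 0, 1, 0))

# Boruta compares each variable's importance against shuffled "shadow" copies
result <- Boruta(failed ~ ., data = banks)
print(result)  # Confirmed / Tentative / Rejected decision per attribute
getSelectedAttributes(result, withTentative = FALSE)
```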

Embedded methods

The main difference between embedded methods and the filter and wrapper approaches is that, in embedded methods, you cannot separate the learning and feature selection parts.

Regularization methods are the most common type of embedded feature selection methods.

In classification problems such as this one, logistic regression cannot handle the problem of multicollinearity, which occurs when variables are highly correlated. Moreover, when the number of observations, n, is not much larger than the number of covariates, p, the parameter estimates exhibit high variability. The likelihood can then be increased simply by adding more parameters, which results in overfitting.

If variables are highly correlated, or if collinearity exists, we can expect the model parameters and their variances to be inflated. The high variance is because of the wrongly specified...
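As a hedged sketch of an embedded method (not the book's own code), the following fits an L1-regularized (LASSO) logistic regression with the glmnet package on simulated stand-in data. Because the L1 penalty shrinks some coefficients exactly to zero, variable selection happens as part of the model fitting itself.

```r
library(glmnet)  # install.packages("glmnet") if needed

# Simulated stand-in data; the ratio names are hypothetical
set.seed(1)
x <- matrix(rnorm(300 * 10), ncol = 10,
            dimnames = list(NULL, paste0("ratio", 1:10)))
y <- rbinom(300, 1, plogis(x[, 1] - x[, 2]))

# alpha = 1 gives the LASSO penalty; cross-validation chooses lambda
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients shrunk exactly to zero drop out of the model, so
# feature selection happens inside the fitting process itself
coef(cv_fit, s = "lambda.min")
```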

Dimensionality reduction

Dimensionality reduction, or feature projection, consists of converting data in a high-dimensional space into a space of fewer dimensions.

High dimensionality increases the computational complexity substantially, and could even increase the risk of overfitting.

Dimensionality reduction techniques are useful for feature selection as well. In this case, the original variables are converted into new variables formed from different combinations of them. These combinations extract and summarize the relevant information of a complex dataset using fewer variables.

Different algorithms exist, with the following being the most important:

  • Principal Component Analysis (PCA)
  • Sammon mapping
  • Singular Value Decomposition (SVD)
  • Isomap
  • Locally Linear Embedding (LLE)
  • Laplacian eigenmaps
  • t-distributed Stochastic Neighbor Embedding (t-SNE)
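As a brief illustration of the first of these, here is a minimal PCA sketch using base R's prcomp(); the matrix below is simulated stand-in data rather than the chapter's dataset.

```r
# PCA with base R's prcomp(); `banks` is simulated stand-in data
set.seed(1)
banks <- matrix(rnorm(200 * 6), ncol = 6,
                dimnames = list(NULL, paste0("ratio", 1:6)))

# Center and scale first, since financial ratios live on different scales
pca <- prcomp(banks, center = TRUE, scale. = TRUE)

summary(pca)        # proportion of variance explained by each component
head(pca$x[, 1:2])  # data projected onto the first two principal components
```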

Although dimensionality reduction is not very common...

Summary

In this chapter, we saw how univariate analysis can be used both to analyze the data and to reduce the feature space of our problem. In the next chapter, we will see how these variables can be combined to obtain an accurate model, and several algorithms will be tested along the way.
