Python for Finance Cookbook

Product type: Book
Published in: Jan 2020
Publisher: Packt
ISBN-13: 9781789618518
Pages: 432
Edition: 1st
Author: Eryk Lewinson

Table of Contents (12 chapters)

  • Preface
  • Financial Data and Preprocessing
  • Technical Analysis in Python
  • Time Series Modeling
  • Multi-Factor Models
  • Modeling Volatility with GARCH Class Models
  • Monte Carlo Simulations in Finance
  • Asset Allocation in Python
  • Identifying Credit Default with Machine Learning
  • Advanced Machine Learning Models in Finance
  • Deep Learning in Finance
  • Other Books You May Enjoy

Identifying Credit Default with Machine Learning

In recent years, we have witnessed machine learning gain more and more popularity in solving traditional business problems. Every so often, a new algorithm is published that beats the current state of the art. It is only natural for businesses in all industries to try to leverage the power of machine learning in their core functions.

Before specifying a problem, we provide a brief introduction to the field of machine learning. Machine learning can be broken down into two main areas: supervised learning and unsupervised learning. In the former, we have a target variable (label), which we try to predict as accurately as possible. In the latter, there is no target, and we try to use different techniques to draw some insights from the data. An example of unsupervised learning might be clustering, which is often...
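To make the distinction concrete, here is a minimal sketch of both settings using scikit-learn on a small synthetic dataset (the data and models below are purely illustrative, not part of the recipe):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))              # two illustrative features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # target variable (label)

# supervised learning: a target y exists and we learn to predict it
clf = LogisticRegression().fit(X, y)

# unsupervised learning: no target; we look for structure, such as clusters
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)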

Loading data and managing data types

In this recipe, we show how to load a dataset into Python. In order to show the entire pipeline—including working with messy data—we apply some transformations to the original dataset. For more information on the applied changes, please refer to the accompanying GitHub repository.

How to do it...

Execute the following steps to load a dataset into Python.

  1. Import the libraries:

import pandas as pd

  2. Preview a CSV file:

!head -n 5 credit_card_default.csv

The output shows the first five lines of the raw CSV file.

  3. Load the data from the CSV file:

df = pd.read_csv('credit_card_default.csv', index_col=0,
                 na_values='')

The DataFrame has 30,000 rows and 24 columns.

  4. Separate...
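A typical continuation of this step is to separate the features from the target. A minimal sketch, assuming the target column is named default_payment_next_month (the column name is an assumption, not confirmed by this excerpt):

# hypothetical target column name; adjust to the actual dataset
X = df.drop(columns=['default_payment_next_month'])  # features
y = df['default_payment_next_month']                 # target variable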

Exploratory data analysis

The second step, after loading the data, is to carry out Exploratory Data Analysis (EDA). By doing this, we get to know the data we are supposed to work with. Some insights we try to gather are:

  • What kind of data do we actually have, and how should we treat different types?
  • What is the distribution of the variables?
    • Are there outliers in the data, and how can we treat them?
    • Are any transformations required? For example, some models work better with (or require) normally distributed variables, so we might want to use techniques such as log transformation.
    • Does the distribution vary per group (for example, gender or education level)?
  • Do we have cases of missing data? How frequent are these, and in which variables?
  • Is there a linear relationship between some variables (correlation)?
  • Can we create new features using the existing set of variables? An example...
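Many of these questions can be answered with a handful of pandas calls. A minimal sketch, run on the df loaded in the previous recipe (the grouping column names are hypothetical):

df.info()                          # data types and non-null counts per column
df.describe()                      # distribution summary of the numeric variables
df.isnull().sum()                  # frequency of missing values, per variable
df.select_dtypes('number').corr()  # linear relationships (correlation)

# distribution per group, for example by education level (hypothetical columns)
df.groupby('education')['limit_bal'].describe()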

Splitting data into training and test sets

Having completed the EDA, the next step is to split the dataset into training and test sets. The idea is to have two separate datasets:

  • Training set—the part of the data on which we train the machine learning model
  • Test set—the part of the data not seen by the model during training, used to evaluate its performance

What we want to achieve by splitting the data is to prevent overfitting. Overfitting is a phenomenon whereby a model finds too many patterns in the data used for training and performs well only on that particular data; in other words, it fails to generalize to unseen data.

This is a very important step in the analysis, as doing it incorrectly can introduce bias, for example, in the form of data leakage. Data leakage can occur when, during the training phase, a model observes information to which it should...
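A minimal sketch of such a split, using scikit-learn's train_test_split; stratify=y keeps the proportion of defaults the same in both sets:

from sklearn.model_selection import train_test_split

# hold out 20% of the observations as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)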

Dealing with missing values

In most real-life cases, we do not work with clean, complete data. One of the potential problems we are bound to encounter is that of missing values. We can categorize missing values by the reason they occur:

  • Missing completely at random (MCAR)—The reason for the missing data is unrelated to the rest of the data. An example could be a respondent accidentally missing a question in a survey.
  • Missing at random (MAR)—The missingness of the data can be inferred from other columns. For example, whether a response to a certain survey question is missing can, to some extent, be determined by other factors such as gender, age, or lifestyle.
  • Missing not at random (MNAR)—When there is some underlying reason for the missing values. For example, people with very high incomes tend to be hesitant about revealing...
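One common way of dealing with missing values (imputation with a summary statistic) can be sketched with scikit-learn's SimpleImputer; this assumes numeric features, and fitting on the training set only avoids the data leakage mentioned earlier:

from sklearn.impute import SimpleImputer

# median for numeric features; strategy='most_frequent' would suit categorical ones
imputer = SimpleImputer(strategy='median')
X_train_imp = imputer.fit_transform(X_train)  # learn the medians on the training data
X_test_imp = imputer.transform(X_test)        # apply the same medians to the test data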

Encoding categorical variables

In the previous recipes, we have seen that some features are categorical variables (originally represented as either object or category data types). However, most machine learning algorithms work exclusively with numeric data. That is why we need to encode categorical features into a representation compatible with the models.

In this recipe, we cover some popular encoding approaches:

  • Label encoding
  • One-hot encoding

In label encoding, we replace the categorical value with a numeric value between 0 and # of classes - 1—for example, with three distinct classes, we use {0, 1, 2}.

This is very similar to the outcome of converting a column to the category data type in pandas. We can access the codes of the categories by running df_cat.education.cat.codes. Additionally, we can recover the mapping by running dict(zip(df_cat.education.cat.codes, df_cat...
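A minimal sketch of both approaches, assuming a DataFrame df_cat with a categorical education column (as referenced above):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# label encoding: one integer per class, from 0 to # of classes - 1
le = LabelEncoder()
education_label = le.fit_transform(df_cat['education'])

# one-hot encoding: one binary indicator column per class
education_onehot = pd.get_dummies(df_cat['education'], prefix='education')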

Fitting a decision tree classifier

A decision tree is a relatively simple, yet very important, machine learning algorithm used for both regression and classification problems. The name comes from the fact that the model creates a set of rules (for example: if x_1 > 50 and x_2 < 10 then y = 'default'), which taken together can be visualized in the form of a tree. Decision trees segment the feature space into a number of smaller regions by repeatedly splitting the features at certain values. To do so, they use a greedy algorithm (together with some heuristics) to find the split that minimizes the combined impurity of the child nodes (measured using the Gini impurity or entropy).

In the case of a binary classification problem, the algorithm tries to obtain nodes that contain as many observations from one class as possible, thus minimizing the impurity...
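A minimal sketch of fitting and evaluating such a classifier with scikit-learn, assuming training and test sets prepared as in the previous recipes (missing values imputed, categorical features encoded):

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# criterion='gini' (the default) or 'entropy' selects the impurity measure
tree_clf = DecisionTreeClassifier(criterion='gini', max_depth=5, random_state=42)
tree_clf.fit(X_train, y_train)

y_pred = tree_clf.predict(X_test)
print(f'Test accuracy: {accuracy_score(y_test, y_pred):.4f}')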

Implementing scikit-learn's pipelines

In the previous recipes, we showed all the steps required to build a machine learning model: loading data, splitting it into training and test sets, imputing missing values, encoding categorical features, and, lastly, fitting a decision tree classifier.

The process requires multiple steps to be executed in a certain order, which can become tricky when the workflow is modified along the way. That is why scikit-learn introduced Pipelines. Using Pipelines, we can sequentially apply a list of transformations to the data and then train a given estimator (model).

One important point to be aware of is that the intermediate steps of the Pipeline must have the fit and transform methods (the final estimator only needs the fit method, though). Using Pipelines has several benefits:

  • The flow...
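A minimal sketch of such a Pipeline, chaining the imputer from the earlier recipe with the decision tree classifier (a simplified version that assumes all features are numeric; the step names are arbitrary labels):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),            # has fit and transform
    ('classifier', DecisionTreeClassifier(random_state=42)),  # final estimator: fit only
])

# fit applies the transformations in order, then trains the estimator;
# predict reuses the fitted transformations on new data
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)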

Tuning hyperparameters using grid searches and cross-validation

Cross-validation, together with grid search, is commonly used to tune the hyperparameters of the model in order to achieve better performance. Below, we outline the differences between hyperparameters and parameters.

Hyperparameters:

  • External characteristic of the model
  • Not estimated based on data
  • Can be considered the model's settings
  • Set before the training phase
  • Tuning them can result in better performance

Parameters:

  • Internal characteristic of the model
  • Estimated based on data, for example, the coefficients of linear regression
  • Learned during the training phase

One of the challenges of machine learning is training models that are able to generalize well to unseen data (overfitting versus underfitting; the bias-variance trade-off). While tuning the model's hyperparameters, we would like to evaluate...
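A minimal sketch of a grid search with 5-fold cross-validation, reusing the Pipeline defined in the previous recipe; the hyperparameter grid below is illustrative only:

from sklearn.model_selection import GridSearchCV

# hyperparameters are set before training, so we search over candidate values
param_grid = {
    'classifier__max_depth': [3, 5, 10],
    'classifier__min_samples_leaf': [1, 5, 10],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(f'Best CV score: {grid_search.best_score_:.4f}')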
