Python for Finance Cookbook - Second Edition

Product type: Book
Published: Dec 2022
Publisher: Packt
ISBN-13: 9781803243191
Pages: 740
Edition: 2nd Edition
Author: Eryk Lewinson

Table of Contents (18 chapters)

  • Preface
  • Acquiring Financial Data
  • Data Preprocessing
  • Visualizing Financial Time Series
  • Exploring Financial Time Series Data
  • Technical Analysis and Building Interactive Dashboards
  • Time Series Analysis and Forecasting
  • Machine Learning-Based Approaches to Time Series Forecasting
  • Multi-Factor Models
  • Modeling Volatility with GARCH Class Models
  • Monte Carlo Simulations in Finance
  • Asset Allocation
  • Backtesting Trading Strategies
  • Applied Machine Learning: Identifying Credit Default
  • Advanced Concepts for Machine Learning Projects
  • Deep Learning in Finance
  • Other Books You May Enjoy
  • Index

Applied Machine Learning: Identifying Credit Default

In recent years, we have witnessed machine learning gaining more and more popularity in solving traditional business problems. Every so often, a new algorithm is published, beating the current state of the art. It is only natural for businesses (in all industries) to try to leverage the incredible powers of machine learning in their core functionalities.

Before specifying the task we will be focusing on in this chapter, we provide a brief introduction to the field of machine learning. The machine learning domain can be broken down into two main areas: supervised learning and unsupervised learning. In the former, we have a target variable (label), which we try to predict as accurately as possible. In the latter, there is no target, and we try to use different techniques to draw some insights from the data.

We can further break down supervised problems into regression problems (where a target variable is a continuous number...

Loading data and managing data types

In this recipe, we show how to load a dataset from a CSV file into Python. The very same principles can be used for other file formats as well, as long as they are supported by pandas. Some popular formats include Parquet, JSON, XML, Excel, and Feather.

pandas has a very consistent API, which makes finding its functions much easier. For example, all functions used for loading data from various sources have the syntax pd.read_xxx, where xxx should be replaced by the file format.
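As a minimal sketch of this `pd.read_xxx` pattern (using an inline CSV string in place of a real file path, purely for illustration):

```python
import io

import pandas as pd

# A small inline CSV stands in for a real file; a path on disk works the same way
csv_data = io.StringIO(
    "customer_id,age,income\n"
    "1,25,50000\n"
    "2,41,72000\n"
)

# pd.read_csv follows the same pd.read_xxx pattern as
# pd.read_parquet, pd.read_json, pd.read_excel, and pd.read_feather
df = pd.read_csv(csv_data)
print(df.shape)  # (2, 3)
```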

We also show how certain data type conversions can significantly reduce the memory footprint of DataFrames. This can be especially important when working with large datasets (GBs or TBs), which simply cannot fit into memory unless we optimize their usage.
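To illustrate the idea on a toy DataFrame (the columns here are made up; the book's actual dataset is not shown in this excerpt), downcasting floats and converting a low-cardinality string column to the `category` dtype can shrink memory usage considerably:

```python
import numpy as np
import pandas as pd

# A toy DataFrame with a repetitive string column and 64-bit numbers
n = 100_000
df = pd.DataFrame({
    "status": np.random.choice(["paid", "default"], size=n),
    "amount": np.random.rand(n),
})

before = df.memory_usage(deep=True).sum()

# Downcast the float column and convert the string column to category
df["amount"] = df["amount"].astype("float32")
df["status"] = df["status"].astype("category")

after = df.memory_usage(deep=True).sum()
print(f"{before / after:.1f}x smaller")
```

The `category` dtype stores each distinct string once and keeps only small integer codes per row, which is why it pays off most for columns with few unique values.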

In order to present a more realistic scenario (including messy data, missing values, and so on), we applied some transformations to the original dataset. For more information...

Exploratory data analysis

The second step of a data science project is to carry out Exploratory Data Analysis (EDA). By doing so, we get to know the data we are supposed to work with. This is also the step during which we test the extent of our domain knowledge. For example, the company we are working for might assume that the majority of its customers are people between the ages of 18 and 25. But is this actually the case? While doing EDA we might also run into some patterns that we do not understand, which are then a starting point for a discussion with our stakeholders.

While doing EDA, we can try to answer the following questions:

  • What kind of data do we actually have, and how should we treat different data types?
  • What is the distribution of the variables?
  • Are there outliers in the data and how can we treat them?
  • Are any transformations required? For example, some models work better with (or require) normally distributed variables, so we might...
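The questions above can be probed with a few lines of pandas. The sketch below uses hypothetical columns (`age`, `income`) and shows how a log transformation can reduce the skew of a right-skewed variable, one of the transformations hinted at above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# A hypothetical customers table; column names are made up for illustration
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=1000),
    "income": rng.lognormal(mean=10, sigma=1, size=1000),
})

# Distribution summary for every numeric column
print(df.describe())

# A log transform often makes a right-skewed variable more symmetric
df["log_income"] = np.log(df["income"])
print(df[["income", "log_income"]].skew())
```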

Splitting data into training and test sets

Having completed the EDA, the next step is to split the dataset into training and test sets. The idea is to have two separate datasets:

  • Training set—on this part of the data, we train a machine learning model
  • Test set—this part of the data was not seen by the model during training and is used to evaluate its performance

By splitting the data this way, we want to prevent overfitting. Overfitting is a phenomenon that occurs when a model finds too many patterns in data used for training and performs well only on that particular data. In other words, it fails to generalize to unseen data.
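A minimal sketch of such a split using scikit-learn's `train_test_split` (the toy arrays below are illustrative, not the book's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and binary target (e.g., a default flag)
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# stratify=y keeps the class ratio identical in both sets,
# which matters for imbalanced problems such as credit default
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Fixing `random_state` makes the split reproducible, so the same rows land in the test set on every run.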

This is a very important step in the analysis, as doing it incorrectly can introduce bias, for example, in the form of data leakage. Data leakage can occur when, during the training phase, a model observes information to which it should not have access. We follow up with an example. A common scenario is that of imputing...
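The excerpt breaks off here, but a common leakage-safe pattern is to compute imputation statistics on the training set only and reuse them on the test set. A sketch with scikit-learn's `SimpleImputer` (an assumption about the approach, since the book's exact code is not shown):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [2.0], [np.nan], [3.0]])
X_test = np.array([[np.nan], [10.0]])

# Fit the imputer on the training data only; reusing its statistics on the
# test set avoids leaking test-set information into the training phase
imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)  # NaN filled with the TRAIN mean (2.0)
print(X_test_imp.ravel())  # [ 2. 10.]
```

Had we computed the mean over the combined data, the training phase would have indirectly observed the test set, which is exactly the leakage described above.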

Identifying and dealing with missing values

In most real-life cases, we do not work with clean, complete data. One of the potential problems we are bound to encounter is that of missing values. We can categorize missing values by the reason they occur:

  • Missing completely at random (MCAR)—The reason for the missing data is unrelated to the rest of the data. An example could be a respondent accidentally missing a question in a survey.
  • Missing at random (MAR)—The missingness of the data can be inferred from data in other columns. For example, a missing response to a certain survey question can to some extent be determined conditionally by other factors such as sex, age, lifestyle, and so on.
  • Missing not at random (MNAR)—When there is some underlying reason for the missing values. For example, people with very high incomes tend to be hesitant about revealing it.
  • Structurally missing data—Often a subset of MNAR, the data is...
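Whatever the mechanism, the first practical step is to count the gaps per column and pick a strategy. A minimal sketch with hypothetical columns (median imputation for one column, dropping incomplete rows otherwise):

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two columns
df = pd.DataFrame({
    "age": [25.0, np.nan, 41.0, 33.0],
    "income": [50_000.0, 72_000.0, np.nan, np.nan],
})

# Count missing values per column
print(df.isna().sum())

# Two simple strategies: fill a numeric column with its median,
# or drop rows that still contain any missing value
df["age"] = df["age"].fillna(df["age"].median())
complete_rows = df.dropna()
print(len(complete_rows))  # 2
```

Which strategy is appropriate depends on the missingness mechanism above: median imputation is a reasonable default for MCAR, while MAR and MNAR usually call for more careful, model-based treatment.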

Feedback

We are constantly looking to improve our content, so what could be better than listening to what you as a reader have to say? Your feedback is important to us and we will do our best to incorporate it. Could you take two minutes to fill out the feedback form for this book and let us know your thoughts about it? Here's the link: https://forms.office.com/r/sYbSyLm2cX.

Thank you in advance.
