
You're reading from The Data Analysis Workshop

Product type: Book
Published in: Jul 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781839211386
Edition: 1st
Authors (3):
Gururajan Govindan

Gururajan Govindan is a data scientist, intrapreneur, and trainer with more than seven years of experience working across domains such as finance and insurance. He is also an author of The Data Analysis Workshop, a book focusing on data analytics. He is well known for his expertise in data-driven decision-making and machine learning with Python.

Shubhangi Hora

Shubhangi Hora is a data scientist, Python developer, and published writer. With a background in computer science and psychology, she is particularly passionate about healthcare-related AI, including mental health. Shubhangi is also a trained musician.

Konstantin Palagachev

Konstantin Palagachev holds a Ph.D. in applied mathematics and optimization, with an interest in operations research and data analysis. He is recognized for his passion for delivering data-driven solutions and expertise in the area of urban mobility, autonomous driving, insurance, and finance. He is also a devoted coach and mentor, dedicated to sharing his knowledge and passion for data science.


4. Tackling Company Bankruptcy

Overview

In this chapter, we will look at bankruptcy data for Polish companies to understand the main reasons behind bankruptcy and whether it is possible to identify early warning signs. By the end of this chapter, you will be able to perform exploratory data analysis using pandas profiling. You will also be able to apply missing value treatments with two different types of imputers and handle imbalances in the data.

Introduction

In the previous chapter, we analyzed data from a Portuguese bank's direct marketing campaign, using techniques such as hypothesis testing and clustering to identify the campaign's impact on acquiring new customers for the bank.

In this chapter, we will use exploratory data analysis to identify early warning signs of financial distress in company data. The dataset concerns bankruptcy prediction for Polish companies and was collected from the Emerging Markets Information Service. The bankrupt companies were analyzed over the period 2000-2012, while the still-operating companies were evaluated from 2007 to 2013.

Note

The dataset that's being used for this chapter can be found in the UCI repository:

https://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data

The dataset is also available in our GitHub repository:

https://packt.live/2JMSiqj.

For further information on this topic, you can...

Missing Value Analysis

One of the major steps in data analysis is missing value analysis. We perform it to understand how much data is missing in each column and to decide how we are going to handle it.

In general, missing values can be handled in two ways. The first is to drop the rows that contain missing values. This is only advisable when the proportion of missing values is small; if a large share of a column is missing (for example, 40%), dropping those rows leads to a significant loss of information.

The second method is imputation, where we fill in the missing values according to the imputation strategy employed. For example, in mean imputation, we fill the missing values in a column with that column's mean.
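To make the distinction concrete, here is a minimal sketch of both strategies on a toy DataFrame (the column names and values are purely illustrative and not part of the bankruptcy dataset):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})

dropped = df.dropna()               # option 1: drop rows that contain missing values
mean_filled = df.fillna(df.mean())  # option 2: fill each column's gaps with its mean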

The next step is to perform missing value analysis on our data.

To find out how many missing values are present in the DataFrames, we will introduce a package called missingno, which helps you visualize the count of missing values in each DataFrame.
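As a quick preview, here is a minimal sketch of how missingno can be used (assuming a DataFrame named df that already holds the bankruptcy data):

import matplotlib.pyplot as plt
import missingno as msno

msno.bar(df)     # bar chart of the non-missing count for each column
msno.matrix(df)  # matrix view highlighting where the missing values occur
plt.show()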

Exercise...

Imputation of Missing Values

In this section, we will be looking at two different methods that we can use to handle the missing values:

  • Mean imputation
  • Iterative imputation

Let's look at each of these methods in detail.

Mean Imputation

In mean imputation, the missing values in each column are replaced with that column's mean. We will perform mean imputation on the DataFrames in the next exercise.

Exercise 4.03: Performing Mean Imputation on the DataFrames

In this exercise, you will perform mean imputation on the first DataFrame. This exercise is a continuation of Exercise 4.02, Performing Missing Value Analysis for the DataFrames. Follow these steps to complete this exercise:

  1. Import SimpleImputer from sklearn.impute to perform mean imputation and fill in the missing values:
    from sklearn.impute import SimpleImputer
    imputer = SimpleImputer(missing_values=np.nan, \
            ...
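For reference, here is a minimal, self-contained sketch of the two imputation methods covered in this section (the DataFrame name df1 is assumed for illustration, and the parameter choices are not necessarily those used in the exercises):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Mean imputation: replace the NaNs in each column with that column's mean
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
mean_imputed_df1 = pd.DataFrame(mean_imputer.fit_transform(df1), columns=df1.columns)

# Iterative imputation: model each feature with missing values as a function of the other features
iterative_imputer = IterativeImputer(random_state=0)
iterative_imputed_df1 = pd.DataFrame(iterative_imputer.fit_transform(df1), columns=df1.columns)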

Splitting the Features

In the previous section, we saw how missing values can be filled using different types of imputation.

In this section, we will split each DataFrame into the dependent variable, y, and the independent variables, X. The dependent variable is the outcome of a process; in our case, whether or not a company went bankrupt. The independent variables (also called features) are the inputs to that process, which in this case are the rest of the variables.

Splitting the features is a precursor to our next step, where we will select the most important X variables that determine the dependent variable.

We need to split the features for the mean-imputed DataFrames, as shown in the following code:

# First DataFrame
X0 = mean_imputed_df1.drop('Y', axis=1)
y0 = mean_imputed_df1.Y
# Second DataFrame
X1 = mean_imputed_df2.drop('Y', axis=1)
y1 = mean_imputed_df2.Y
# Third DataFrame
X2 = mean_imputed_df3.Y if False else mean_imputed_df3.drop('Y', axis=1)

Feature Selection with Lasso

Feature selection is one of the most important steps to perform before building any machine learning model. Not all of the columns in a dataset have an impact on the dependent variable, and including irrelevant features when building a model leads to poor performance. This is why we need to perform feature selection. In this section, we will perform feature selection using the lasso method.

Lasso regularization is a feature selection method in which the coefficients of irrelevant features are shrunk to zero. The features whose coefficients become zero are removed, and only the remaining significant features are kept for further analysis.
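A minimal sketch of lasso-based feature selection with scikit-learn is shown below (it uses the X0 and y0 split produced earlier; the alpha value is an illustrative assumption rather than the book's chosen setting):

from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# Fit a lasso model; features whose coefficients are shrunk to zero are discarded
selector = SelectFromModel(Lasso(alpha=0.01)).fit(X0, y0)
selected_features = X0.columns[selector.get_support()]
print(selected_features)

In practice, scaling the features before fitting the lasso model can make the coefficient shrinkage behave more consistently across ratios with very different magnitudes.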

Let's perform lasso regularization for our mean- and iterative-imputed DataFrames.

Lasso Regularization for Mean-Imputed DataFrames

Let's perform lasso regularization for the mean-imputed DataFrame 1.

As the first...

Summary

In this chapter, we learned how to import an ARFF file into a pandas DataFrame. We ran pandas profiling on the DataFrame to identify correlated features, detected missing values using the missingno package, and performed imputation using the mean and iterative imputation methods.
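For reference, the ARFF import step can be sketched as follows (the filename 1year.arff reflects the UCI naming convention and is an assumption here):

import pandas as pd
from scipy.io import arff

data, meta = arff.loadarff('1year.arff')  # read the ARFF file into a structured NumPy array
df1 = pd.DataFrame(data)                  # convert it into a pandas DataFrame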

To find the important features that contribute to bankruptcy, we performed lasso regularization. Although the selected features differ across the five DataFrames, one feature appears in all of them: the ratio of total liabilities to total assets. This ratio is a highly significant indicator of impending bankruptcy.

However, our analysis is not yet complete: we have only identified the factors that affect bankruptcy, not the direction of their effect (that is, whether bankruptcy becomes more likely when a particular ratio increases or decreases).

To get a complete...

