Data Science Projects with Python - Second Edition
Stephen Klosterman

1. Data Exploration and Cleaning

Activity 1.01: Exploring the Remaining Financial Features in the Dataset

Solution:

Before beginning, set up your environment and load the cleaned dataset as follows:

import pandas as pd
import matplotlib.pyplot as plt  # import plotting package
# render plots automatically in the notebook
%matplotlib inline
import matplotlib as mpl  # additional plotting functionality
mpl.rcParams['figure.dpi'] = 400  # high-resolution figures
mpl.rcParams['font.size'] = 4  # font size for figures
from scipy import stats
import numpy as np
df = pd.read_csv('../../Data/Chapter_1_cleaned_data.csv')
  1. Create lists of feature names for the remaining financial features.

    These fall into two groups, so we will make lists of feature names as before, to facilitate analyzing them together. You can do this with the following code:

    bill_feats = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', \
          ...
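
    As a compact alternative to typing the names out, the two lists could be built with comprehensions. The following is a sketch that assumes this dataset's standard column names, BILL_AMT1 through BILL_AMT6 and PAY_AMT1 through PAY_AMT6:

    # Sketch: build the feature-name lists programmatically
    # (assumes column names BILL_AMT1-6 and PAY_AMT1-6)
    bill_feats = [f'BILL_AMT{i}' for i in range(1, 7)]
    pay_amt_feats = [f'PAY_AMT{i}' for i in range(1, 7)]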

2. Introduction to Scikit-Learn and Model Evaluation

Activity 2.01: Performing Logistic Regression with a New Feature and Creating a Precision-Recall Curve

Solution:

  1. Use scikit-learn's train_test_split to make a new set of training and test data. This time, instead of EDUCATION, use LIMIT_BAL, the account's credit limit, as the feature.

    Execute the following code to do this:

    X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split\
                                              (df['LIMIT_BAL']\
                                               ...
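
    For reference, the completed call would look roughly like the sketch below, which assumes an 80/20 split and a random seed of 24, matching the pattern in the other activities. Note that a single feature must be reshaped into a two-dimensional array for scikit-learn:

    # Sketch of the full call (assumed arguments: 80/20 split, seed 24)
    from sklearn.model_selection import train_test_split
    X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(
        df['LIMIT_BAL'].values.reshape(-1, 1),  # single feature as 2D array
        df['default payment next month'].values,
        test_size=0.2, random_state=24)

    The precision-recall curve named in the activity title can then be computed with scikit-learn; lr_model_2 below is a hypothetical LogisticRegression fitted on this training data:

    from sklearn.metrics import precision_recall_curve
    # Positive-class probabilities from the (hypothetical) fitted model
    y_pred_proba_2 = lr_model_2.predict_proba(X_test_2)[:, 1]
    precision, recall, thresholds = precision_recall_curve(y_test_2,
                                                           y_pred_proba_2)
    plt.plot(recall, precision)
    plt.xlabel('Recall')
    plt.ylabel('Precision')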

3. Details of Logistic Regression and Feature Exploration

Activity 3.01: Fitting a Logistic Regression Model and Directly Using the Coefficients

Solution:

The first few steps are similar to things we've done in previous activities:

  1. Create a train/test split (80/20) with PAY_1 and LIMIT_BAL as features:
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(
        df[['PAY_1', 'LIMIT_BAL']].values,
        df['default payment next month'].values,
        test_size=0.2, random_state=24)
  2. Import LogisticRegression and instantiate a model with the default options, but set the solver to 'liblinear':
    from sklearn.linear_model import LogisticRegression
    lr_model = LogisticRegression(solver='liblinear')
  3. Train on the training data and obtain predicted classes, as well as class probabilities, using the test data:
    lr_model.fit(X_train, y_train...
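
    A minimal sketch completing this step, and illustrating the "directly using the coefficients" idea from the activity title, follows; it assumes the variables defined in the preceding steps and that numpy is imported as np:

    lr_model.fit(X_train, y_train)
    y_pred = lr_model.predict(X_test)
    y_pred_proba = lr_model.predict_proba(X_test)

    # Reproduce the positive-class probabilities directly from the
    # fitted coefficients and intercept: p = 1/(1 + exp(-(X.b + b0)))
    z = np.dot(X_test, lr_model.coef_.ravel()) + lr_model.intercept_
    manual_proba = 1 / (1 + np.exp(-z))
    np.allclose(manual_proba, y_pred_proba[:, 1])  # should be True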

4. The Bias-Variance Trade-Off

Activity 4.01: Cross-Validation and Feature Engineering with the Case Study Data

Solution:

  1. Select out the features from the DataFrame of the case study data.

    You can use the list of feature names that we've already created in this chapter, but be sure not to include the response variable, which would be a very good (but entirely inappropriate) feature:

    features = features_response[:-1]
    X = df[features].values
  2. Make a training/test split using a random seed of 24:
    X_train, X_test, y_train, y_test = \
    train_test_split(X, df['default payment next month'].values,
                     test_size=0.2, random_state=24)

    We'll use this split going forward, reserving the test data as the unseen test set. By specifying the random seed, we can easily create separate notebooks with other modeling approaches using the same training data.

  3. Instantiate MinMaxScaler...
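
    A sketch of how the scaler typically feeds into cross-validation follows; the pipeline composition and scoring choice here are illustrative assumptions rather than the book's exact setup:

    from sklearn.preprocessing import MinMaxScaler
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Scale inside each fold so no test-fold information leaks into
    # the fit, then cross-validate the pipeline as a whole
    scale_lr_pipeline = Pipeline(steps=[
        ('scaler', MinMaxScaler()),
        ('model', LogisticRegression(solver='liblinear'))])
    cv_scores = cross_val_score(scale_lr_pipeline, X_train, y_train,
                                cv=4, scoring='roc_auc')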

5. Decision Trees and Random Forests

Activity 5.01: Cross-Validation Grid Search with Random Forest

Solution:

  1. Create a dictionary representing the grid of max_depth and n_estimators hyperparameter values to search: depths of 3, 6, 9, and 12, and 10, 50, 100, and 200 trees. Leave the other hyperparameters at their defaults. Create the dictionary using this code:
    rf_params = {'max_depth':[3, 6, 9, 12],
                 'n_estimators':[10, 50, 100, 200]}

    Note

    There are many other possible hyperparameters to search over. In particular, the scikit-learn documentation for random forest indicates that "The main parameters to adjust when using these methods are n_estimators and max_features" and that "Empirical good default values are … max_features=sqrt(n_features) for classification tasks."

    Source: https://scikit-learn.org/stable/modules/ensemble.html...
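
    A sketch of how this grid might be searched with GridSearchCV follows; the scoring metric, fold count, and random seed are illustrative assumptions:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Exhaustively search the 4 x 4 grid with cross-validation
    rf = RandomForestClassifier(random_state=4)
    cv_rf = GridSearchCV(rf, param_grid=rf_params, scoring='roc_auc',
                         cv=4, n_jobs=-1, return_train_score=True)
    cv_rf.fit(X_train, y_train)
    cv_rf.best_params_  # best hyperparameter combination found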

6. Gradient Boosting, XGBoost, and SHAP Values

Activity 6.01: Modeling the Case Study Data with XGBoost and Explaining the Model with SHAP 

Solution:

In this activity, we'll take what we've learned in this chapter with a synthetic dataset and apply it to the case study data. We'll see how an XGBoost model performs on a validation set and explain the model predictions using SHAP values. We have prepared the dataset for this activity by adding back the samples with missing values for the PAY_1 feature, which we had previously set aside, while maintaining the same train/test split for the samples without missing values. You can see how the data was prepared in the Appendix to the notebook for this activity.

  1. Load the case study data that has been prepared for this exercise. The file path is ../../Data/Activity_6_01_data.pkl and the variables are: features_response, X_train_all, y_train_all, X_test_all, y_test_all...
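
    Once these variables are loaded, the modeling-and-explanation pattern named in the activity title looks roughly like the sketch below; the hyperparameter values are illustrative, not the book's:

    import xgboost as xgb
    import shap

    # Fit a gradient boosting classifier (illustrative hyperparameters)
    model = xgb.XGBClassifier(n_estimators=200, max_depth=3,
                              learning_rate=0.1,
                              objective='binary:logistic')
    model.fit(X_train_all, y_train_all)

    # Explain the model's predictions on the test set with SHAP values
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test_all)
    shap.summary_plot(shap_values, X_test_all,
                      feature_names=features_response[:-1])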

7. Test Set Analysis, Financial Insights, and Delivery to the Client

Activity 7.01: Deriving Financial Insights

Solution:

  1. Using the testing set, calculate the cost of all defaults if there were no counseling program.

    Use this code for the calculation:

    cost_of_defaults = np.sum(y_test_all * X_test_all[:, 5])
    cost_of_defaults

    The output should be this:

    60587763.0
  2. Calculate by what percent the cost of defaults can be decreased by the counseling program.

    The potential decrease in the cost of default is the greatest possible net savings of the counseling program, divided by the cost of all defaults in the absence of a program:

    net_savings[max_savings_ix]/cost_of_defaults

    The output should be this:

    0.2214260658542551

    Results indicate that we can decrease the cost of defaults by 22% using a counseling program, guided by predictive modeling.

  3. Calculate the net savings per account (considering all accounts it might be possible to counsel, in other words relative to the whole...
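
    The calculation this step describes could be sketched as follows, assuming net_savings (the array of net savings across decision thresholds) and max_savings_ix (the index of its maximum) were computed earlier in the chapter:

    # Net savings per account, relative to the whole test set
    max_savings_ix = np.argmax(net_savings)
    savings_per_account = net_savings[max_savings_ix] / len(y_test_all)
    savings_per_account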