
You're reading from Data Science Projects with Python - Second Edition

Product type: Book
Published in: Jul 2021
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781800564480
Edition: 2nd Edition
Author: Stephen Klosterman

Stephen Klosterman is a Machine Learning Data Scientist with a background in math, environmental science, and ecology. He holds a Ph.D. in Biology from Harvard University, where he was an assistant teacher of the Data Science course. His professional experience includes work in the environmental, health care, and financial sectors. At work, he likes to research and develop machine learning solutions that create value and that stakeholders understand. In his spare time, he enjoys running, biking, paddleboarding, and music.


3. Details of Logistic Regression and Feature Exploration

Overview

This chapter teaches you how to evaluate features quickly and efficiently, so you know which ones are likely to be most important for a machine learning model. Once we get a taste for this, we'll explore the inner workings of logistic regression so you can continue your journey to mastery of this fundamental technique. After reading this chapter, you will be able to create a correlation plot of many features and a response variable, and interpret logistic regression as a linear model.

Introduction

In the previous chapter, we developed a few example machine learning models using scikit-learn to get familiar with how it works. However, the features we used, EDUCATION and LIMIT_BAL, were not chosen in a systematic way.

In this chapter, we will start to develop techniques that can be used to assess features for their usefulness in modeling. This will enable you to make a quick pass over all candidate features, to have an idea of which will be the most important. For the most promising features, we will see how to create visual summaries that serve as useful communication tools.

Next, we will begin our detailed examination of logistic regression. We'll learn why logistic regression is considered a linear model, even though its formulation involves some non-linear functions. We'll learn what a decision boundary is and see that, as a key consequence of its linearity, the decision boundary of logistic regression could make it difficult to accurately classify...

Examining the Relationships Between Features and the Response Variable

In order to make accurate predictions of the response variable, good features are necessary: features that are clearly linked to the response variable in some way. So far, we've examined the relationship between a couple of features and the response variable, either by grouping on a feature and calculating the mean of the response variable within each group, or by using individual features in a model and examining its performance. However, we have not yet systematically explored how all the features relate to the response variable. We will do that now, and begin to capitalize on the hard work we put in when exploring the features and making sure the data quality was good.
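As a reminder of the groupby/mean idea, here is a minimal sketch with pandas. The column names match the credit default dataset, but the values are made up for illustration:

```python
import pandas as pd

# Toy data with the same column names as the credit default dataset
# (values here are invented for illustration)
df = pd.DataFrame({
    'EDUCATION': [1, 1, 2, 2, 3, 3],
    'default payment next month': [0, 1, 0, 0, 1, 1]
})

# Group on the feature and take the mean of the binary response:
# for a 0/1 response, the mean within each group is the default rate
default_rate = df.groupby('EDUCATION')['default payment next month'].mean()
print(default_rate)
```

Because the response is encoded as 0s and 1s, the mean within each group is directly interpretable as the fraction of defaulters at that education level.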

A popular way of getting a quick look at how all the features relate to the response variable, as well as how the features are related to each other, is by using a correlation plot. We will first create a correlation plot for the case...
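The basic recipe looks something like the following sketch, which uses synthetic stand-ins for two features and a binary response (the real analysis would use the full dataset's columns instead):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-ins for two features and a binary response
rng = np.random.default_rng(1)
n = 500
limit_bal = rng.normal(size=n)
pay_1 = 0.5 * limit_bal + rng.normal(size=n)           # correlated with LIMIT_BAL
response = (pay_1 + rng.normal(size=n) > 0).astype(int)

df = pd.DataFrame({'LIMIT_BAL': limit_bal, 'PAY_1': pay_1,
                   'default': response})

# Pearson correlations between every pair of columns
corr = df.corr()

# Render the matrix as a heatmap
fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap='coolwarm')
ax.set_xticks(range(len(corr)), corr.columns, rotation=45)
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im)
```

The last column (or row) of the heatmap shows each feature's correlation with the response, while the rest of the matrix reveals correlations among the features themselves.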

Univariate Feature Selection: What it Does and Doesn't Do

In this chapter, we have learned techniques for going through features one by one to see whether they have predictive power. This is a good first step, and if you already have features that are very predictive of the outcome variable, you may not need to spend much more time considering features before modeling. However, there are drawbacks to univariate feature selection. In particular, it does not consider the interactions between features. For example, what if the credit default rate is very high specifically for people with both a certain education level and a certain range of credit limit?

Also, with the methods we used here, only the linear effects of features are captured. If a feature is more predictive when it's undergone some type of transformation, such as a polynomial or logarithmic transformation, or binning (discretization), linear techniques of univariate feature selection may not be effective. Interactions...
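To see how a linear univariate test can miss a transformed signal, consider this sketch using scikit-learn's f_classif (an ANOVA F-test). The feature and response here are invented: the class label depends on the feature only through its absolute value, so the raw feature has essentially no linear relationship with the response:

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = (np.abs(x) > 1).astype(int)  # class depends on x only through |x|

# ANOVA F-test on the raw feature vs. a transformed version
F_raw, _ = f_classif(x.reshape(-1, 1), y)
F_abs, _ = f_classif(np.abs(x).reshape(-1, 1), y)

# The raw feature looks nearly useless to a linear test,
# while the transformed feature scores very highly
```

A univariate linear method would discard x here, even though x (suitably transformed) determines the class perfectly.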

Summary

In this chapter, we have learned how to explore features one at a time, using univariate feature selection methods including Pearson correlation and an ANOVA F-test. While looking at features in this way does not always tell the whole story, since you are potentially missing out on important interactions between features, it is often a helpful step. Understanding the relationships between the most predictive features and the response variable, and creating effective visualizations around them, is a great way to communicate your findings to your client. We used customized plots, such as overlapping histograms created with Matplotlib, to create visualizations of the most important features.
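The overlapping-histogram idea can be sketched as follows; the feature values for the two classes are made up here, but the plotting pattern (two histograms on one axis with transparency) is the same as in the chapter:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up feature values for the two response classes
rng = np.random.default_rng(2)
feature_default = rng.normal(loc=1.0, size=300)
feature_no_default = rng.normal(loc=0.0, size=700)

fig, ax = plt.subplots()
# alpha < 1 lets the two distributions show through each other
counts_no, _, _ = ax.hist(feature_no_default, bins=30, alpha=0.5,
                          label='No default')
counts_yes, _, _ = ax.hist(feature_default, bins=30, alpha=0.5,
                           label='Default')
ax.set_xlabel('Feature value')
ax.set_ylabel('Count')
ax.legend()
```

The more the two histograms separate, the more useful the feature is likely to be for distinguishing the classes.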

Then we began an in-depth description of how logistic regression works, exploring such topics as the sigmoid function, log odds, and the linear decision boundary. While logistic regression is one of the simplest classification models, and often is not as powerful as other methods, it is...
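The key relationships among the sigmoid function, log odds, and the decision boundary can be summarized in a few lines:

```python
import numpy as np

def sigmoid(z):
    """Map log odds z to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Log odds of 0 correspond to a probability of 0.5: this is where
# logistic regression places its decision boundary. Since the log
# odds are a linear function of the features, the boundary is linear.
p_at_boundary = sigmoid(0.0)

# Odds of 3:1 in favor of the positive class -> probability 0.75
p_odds_3 = sigmoid(np.log(3))
```

In other words, the non-linear sigmoid only converts the model's linear score into a probability; the classification boundary itself, where that score equals zero, remains a straight line (or hyperplane) in feature space.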

