You're reading from Building Statistical Models in Python

Product type: Book

Published in: Aug 2023

Reading level: Intermediate

Publisher: Packt

ISBN-13: 9781804614280

Edition: 1st Edition

Languages: Python

Concepts: Statistics

Authors (3):

Huy Hoang Nguyen

Paul N Adams

Stuart J Miller


Discrete Models

In the previous two chapters, we discussed models for predicting a continuous response variable. In this chapter, we will begin discussing models for predicting discrete response variables. We will start with the probit and logit models for predicting binary outcome variables (categorical variables with two levels). Then, we will extend this idea to predicting categorical variables with multiple levels. Finally, we will look at predicting count variables, which resemble categorical variables but take only non-negative integer values and have an infinite number of levels.

In this chapter, we’re going to cover the following main topics:

Probit and logit models
Multinomial logit model
Poisson model
The negative binomial regression model

Probit and logit models

Previously, we discussed different types of problems that can be solved with regression models. In each case, the dependent variable was continuous, such as house prices or salaries. A natural question follows: if the dependent variable is not continuous but categorical, how do we adapt the regression equation to predict it? For instance, a human resources department might conduct an attrition study to predict whether an employee will stay with the company, or a car dealership might want to know whether a car will sell based on its price, model, color, and so on.

First, we will study binary classification. Here, the outcome (dependent variable) is a binary response such as yes/no or to do/not to do. Let’s look back at the simple linear regression model:

$$y = \beta_0 + \beta_1 x + \epsilon$$

Here, the predicted outcome is a line crossing data...

Multinomial logit model

In practice, there are many situations where the outcomes (dependent variables) are not binary but have more than two possibilities. Multinomial logistic regression can be understood as a generalization of the logit model, which we studied in the previous section. In this section, we will work through a hands-on study of the Iris data using the MNLogit class from statsmodels: https://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.MNLogit.html.

Iris data (https://archive.ics.uci.edu/ml/datasets/iris) is one of the best-known statistical and machine learning datasets for education. The independent variables are sepal length (in cm), sepal width (in cm), petal length (in cm), and petal width (in cm). The dependent variable is a categorical variable with three levels: Iris Setosa (0), Iris Versicolor (1), and Iris Virginica (2). The following Python code illustrates how to conduct this using sklearn and statsmodels:

# import packages
import...

Poisson model

In the previous section, we discussed models where the response variable was categorical. In this section, we will look at a model for count data. Count data is like categorical data (the categories are integers), but there are an infinite number of levels (0, 1, 2, 3, and so on). We model count data with the Poisson distribution. In this section, we will start by examining the Poisson distribution and its properties. Then, we will model a count variable with explanatory variables using the Poisson model.

The Poisson distribution

The Poisson distribution is given by the following formula:

$$P(k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

Here, λ is the average number of events and k is the number of events for which we would like the probability. P(k) is the probability that exactly k events occur. This distribution is used to calculate the probability of k events occurring in a fixed time interval or a defined space.
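The formula can be checked numerically. The sketch below assumes an illustrative rate of λ = 3 (a value chosen here, not taken from the book), evaluates the formula directly, and compares it with scipy.stats.poisson. It also confirms that the mean and variance of the Poisson distribution both equal λ.

```python
import math
import numpy as np
from scipy import stats

lam = 3.0  # assumed average number of events per interval


def poisson_pmf(k, lam):
    """Probability of exactly k events: lam**k * exp(-lam) / k!"""
    return lam**k * math.exp(-lam) / math.factorial(k)


ks = np.arange(0, 11)
manual = np.array([poisson_pmf(int(k), lam) for k in ks])
library = stats.poisson.pmf(ks, mu=lam)

for k, p in zip(ks, manual):
    print(f"P({k}) = {p:.4f}")

# Mean and variance of the Poisson distribution are both equal to lam
mean, var = stats.poisson.stats(mu=lam, moments="mv")
print("mean:", float(mean), "variance:", float(var))
```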

The shape...

The negative binomial regression model

Another useful approach to discrete regression is the log-linear negative binomial regression model, which uses the negative binomial probability distribution. At a high level, negative binomial regression is useful for over-dispersed count data, where the conditional variance of the count is larger than its conditional mean. A model is over-dispersed when the variance of the target variable is greater than the mean assumed by the model; in a regression model, that mean is the regression line. We decide whether to use the negative binomial model by analyzing the target variable counts (mean versus variance), and we supply a measure of over-dispersion to the model to adjust for it, as discussed here.

It is important to note that the negative binomial model is not for modeling discrete data in general, but specifically count data associated with a fixed number of random trials...

Summary

In this chapter, we explained why a binary classification model built strictly on linear regression is problematic: it can generate negative raw values where probabilities in the range [0, 1] are expected. We provided an overview of the log-odds ratio and of probit and logit modeling, which use the cumulative distribution functions of the standard normal and logistic distributions, respectively. We also demonstrated how to apply logistic regression to binary and multinomial classification problems. Lastly, we covered count-based regression using the log-linear Poisson and negative binomial models, which can also be logically extended to rate data without modification, and provided examples of their implementations.

In the following chapter, we will introduce conditional probability using Bayes’ theorem in addition to dimension reduction and classification modeling using linear discriminant analysis and...


Authors (3)

Huy Hoang Nguyen

Huy Hoang Nguyen is a Mathematician and a Data Scientist with far-ranging experience, championing advanced mathematics and strategic leadership, and applied machine learning research. He holds a Master's in Data Science and a PhD in Mathematics. His previous work was related to Partial Differential Equations, Functional Analysis and their applications in Fluid Mechanics. He transitioned from academia to the healthcare industry and has performed different Data Science projects from traditional Machine Learning to Deep Learning.

Paul N Adams

Paul Adams is a Data Scientist with a background primarily in the healthcare industry. Paul applies statistics and machine learning in multiple areas of industry, focusing on projects in process engineering, process improvement, metrics and business rules development, anomaly detection, forecasting, clustering and classification. Paul holds a Master of Science in Data Science from Southern Methodist University.

Stuart J Miller

Stuart Miller is a Machine Learning Engineer with degrees in Data Science, Electrical Engineering, and Engineering Physics. Stuart has worked at several Fortune 500 companies, including Texas Instruments and StateFarm, where he built software that utilized statistical and machine learning techniques. Stuart is currently an engineer at Toyota Connected helping to build a more modern cockpit experience for drivers using machine learning.
