
You're reading from The Statistics and Machine Learning with R Workshop

Product type: Book
Published in: Oct 2023
Reading Level: Intermediate
Publisher: Packt
ISBN-13: 9781803240305
Edition: 1st Edition
Author (1)
Liu Peng

Peng Liu is an Assistant Professor of Quantitative Finance (Practice) at Singapore Management University and an adjunct researcher at the National University of Singapore. He holds a Ph.D. in statistics from the National University of Singapore and has ten years of working experience as a data scientist across the banking, technology, and hospitality industries.

Dealing with an imbalanced dataset

When we build a logistic regression model on a dataset with a binary target, the two target values are often not equally distributed. Typically, we observe far more non-events (y = 0) than events (y = 1), as is the case in applications such as detecting fraudulent bank transactions, filtering spam or phishing emails sent to corporate employees, diagnosing diseases such as cancer, and predicting natural disasters such as earthquakes. In these situations, the classification performance may be dominated by the majority class.

Such domination can produce misleadingly high accuracy scores that mask poor predictive performance on the minority class. To see this, suppose we are developing a default prediction model using a dataset of 1,000 observations, of which only 10 (or 1%) are default cases. A naive model that simply predicts every observation as non-default achieves 99% accuracy, yet it fails to identify a single default.
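The arithmetic behind this example can be checked with a minimal base R sketch. The vectors `actual` and `predicted` are hypothetical and constructed only to match the counts above (1,000 observations, 10 defaults); they do not come from a real dataset, and the variable names are our own.

```r
# Hypothetical data matching the example: 1,000 observations, 10 defaults (1%).
n <- 1000
actual <- c(rep(1, 10), rep(0, n - 10))  # 1 = default, 0 = non-default

# Naive model: predict non-default for every observation.
predicted <- rep(0, n)

# Confusion matrix; fixing the factor levels keeps the empty "1" column,
# which table() would otherwise drop because no default is ever predicted.
conf_mat <- table(
  actual    = factor(actual, levels = c(0, 1)),
  predicted = factor(predicted, levels = c(0, 1))
)
print(conf_mat)

# Accuracy: share of all observations that are classified correctly.
accuracy <- mean(predicted == actual)

# Recall (sensitivity): share of actual defaults that the model catches.
recall <- sum(predicted == 1 & actual == 1) / sum(actual == 1)

cat("Accuracy:", accuracy, "\n")
cat("Recall:  ", recall, "\n")
```

Here, accuracy is 0.99 while recall is 0: all 10 defaults fall into the false-negative cell of the confusion matrix. This is why a class-specific metric such as recall is usually more informative than overall accuracy when the target is heavily imbalanced.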

When we encounter an imbalanced...
