You're reading from Extending Excel with Python and R

Product type: Book
Published in: Apr 2024
Publisher: Packt
ISBN-13: 9781804610695
Edition: 1st Edition
Authors (2):
Steven Sanderson

Steven Sanderson, MPH, is an applications manager for the patient accounts department at Stony Brook Medicine. He received his bachelor's degree in economics and his master's in public health from Stony Brook University. He has worked in healthcare in some capacity for just shy of 20 years. He is the author and maintainer of the healthyverse set of R packages. He likes to read material related to social and labor economics and has recently turned his efforts back to his guitar with the hope that his kids will follow suit as a hobby they can enjoy together.

David Kun

David Kun is a mathematician and actuary who has always worked in the gray zone between quantitative teams and ICT, aiming to build a bridge. He is a co-founder and director of Functional Analytics and the creator of the ownR Infinity platform. As a data scientist, he also uses ownR for his daily work. His projects include time series analysis for demand forecasting, computer vision for design automation, and visualization.

Statistical Analysis: Linear and Logistic Regression

Welcome to our guide on linear and logistic regression in R and Python. We will explore these essential statistical techniques in base R, in the tidymodels framework, and in Python. Whether you’re a data science enthusiast or a professional looking to sharpen your skills, this tutorial will help you understand linear and logistic regression and how to implement them in both languages. It is possible to perform linear and logistic regression in Excel, but with limitations: linear regression can only be performed on a single series of ungrouped data, and logistic regression is cumbersome and may require external solver add-ins. In both cases, the process can only be run against ungrouped or non-nested data. R and Python have no such limitations.

In this chapter, we will cover the following topics in both base R and Python and using the tidymodels framework...

Technical requirements

All code for this chapter can be found on GitHub at this URL: https://github.com/PacktPublishing/Extending-Excel-with-Python-and-R/tree/main/Chapter9. You will need the following R packages installed to follow along:

  • readxl 1.4.3
  • performance 0.10.8
  • tidymodels 1.1.1
  • purrr 1.0.2

We will begin by learning what linear and logistic regression are and then work through the implementation details in each language.

Linear regression

Linear regression is a fundamental statistical method used for modeling the relationship between a dependent variable (usually denoted as “Y”) and one or more independent variables (often denoted as “X”). It aims to find the best-fitting linear equation that describes how changes in the independent variables affect the dependent variable. The coefficients of that equation are most commonly estimated with the ordinary least squares (OLS) method, a name many of you may already know.

In simpler terms, linear regression helps us predict a continuous numeric outcome based on one or more input features. For the results to be valid, several assumptions must hold, including a linear relationship between the predictors and the outcome, independent errors, constant error variance (homoscedasticity), and approximately normally distributed residuals. If you would like to understand these better, a simple search will bring you a lot of good information on them. In this tutorial, we will delve into both simple linear regression (one independent variable) and multiple linear regression (multiple independent variables).
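To make the idea of a best-fitting linear equation concrete, here is a minimal sketch of simple linear regression solved by least squares with NumPy. It is an illustration only: the synthetic data, the true intercept of 1.5, and the true slope of 2.0 are assumptions chosen for this example and do not come from the chapter's datasets.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 1.5 + 2.0 * x + noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 1.5 + 2.0 * x + rng.normal(0, 0.5, size=200)

# Design matrix: a column of ones for the intercept, then the feature
X = np.column_stack([np.ones_like(x), x])

# Least squares picks the coefficients that minimise the sum of squared residuals
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

# R-squared: share of the variance in y explained by the fitted line
residuals = y - X @ coef
r_squared = 1 - residuals.var() / y.var()

print(f"intercept={intercept:.3f}, slope={slope:.3f}, R^2={r_squared:.3f}")
```

Adding more columns to the design matrix turns the same call into multiple linear regression.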

Logistic regression

Logistic regression is another crucial statistical technique, which is primarily used for binary classification problems. Instead of predicting continuous outcomes, logistic regression predicts the probability of an event occurring, typically expressed as a “yes” or “no” outcome. This method is particularly useful for scenarios where we need to model the likelihood of an event, such as whether a customer will churn or not or whether an email is spam or not. Logistic regression models the relationship between the independent variables and the log odds of the binary outcome.
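The link between a linear predictor, the log odds, and a probability can be shown in a few lines of plain Python. This is a sketch under stated assumptions: the intercept and coefficient below are hypothetical values picked for illustration, not estimates from any dataset in this chapter.

```python
import math

def sigmoid(z: float) -> float:
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p: float) -> float:
    """Inverse of sigmoid: the log of p / (1 - p)."""
    return math.log(p / (1.0 - p))

# Hypothetical fitted model: log-odds = intercept + coef * x
intercept = -1.0
coef = 0.8

for x in (0.0, 1.25, 5.0):
    z = intercept + coef * x   # linear predictor on the log-odds scale
    p = sigmoid(z)             # predicted probability of the "yes" class
    label = "yes" if p >= 0.5 else "no"
    print(f"x={x:>4}: log-odds={z:+.2f}, probability={p:.3f}, class={label}")
```

The model is linear on the log-odds scale; the sigmoid is what converts that linear score into a probability, and a cutoff (0.5 here, itself a choice) converts the probability into a class.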

Frameworks

We will explore two approaches to implementing linear and logistic regression in R. First, we will use the base R framework, which is an excellent starting point to understand the underlying concepts and functions. Then, we will dive into tidymodels, a modern and tidy approach to modeling and machine learning in R. tidymodels provides a consistent and...

Performing linear regression in R

In this section, we are going to perform linear regression in R, both in base R and by way of the tidymodels framework. You will learn how to do this on a dataset that contains several groups. We take this approach because once you can fit models by group, fitting a single model on ungrouped data is simpler: there is no need to group the data and perform actions by group. Working with grouped data also teaches you an extra skill.
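The grouped approach is not specific to R. As a rough Python counterpart (the R code in this section remains the reference), the sketch below fits one straight line per group with pandas and NumPy. The group labels, true slopes, and noise level are assumptions made up for this illustration.

```python
import numpy as np
import pandas as pd

# Synthetic grouped data (assumed): each group has its own true slope
rng = np.random.default_rng(0)
true_slopes = {"a": 1.0, "b": -2.0, "c": 0.5}

frames = []
for group, slope in true_slopes.items():
    x = rng.uniform(0, 5, size=50)
    y = 3.0 + slope * x + rng.normal(0, 0.1, size=50)
    frames.append(pd.DataFrame({"group": group, "x": x, "y": y}))
df = pd.concat(frames, ignore_index=True)

# Fit a degree-1 polynomial (a straight line) separately within each group.
# np.polyfit returns coefficients from the highest degree down: [slope, intercept].
per_group = {
    name: np.polyfit(g["x"].to_numpy(), g["y"].to_numpy(), deg=1)
    for name, g in df.groupby("group")
}
fits = pd.DataFrame.from_dict(per_group, orient="index", columns=["slope", "intercept"])

print(fits.round(3))
```

Dropping the groupby and calling the fit once on the whole frame gives the ungrouped case, which is why learning the grouped version first covers both.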

Linear regression in base R

The first example we are going to show is using the lm() function to perform a linear regression in base R. Let’s dive right into it with the iris dataset.

We will break the code down into chunks and discuss what is happening at each step. The first step is to use the library command to bring the necessary packages into our development environment:

library(readxl)

In this section, we’re loading a library called...

Performing logistic regression in R

As in the section on linear regression, we will perform logistic regression in base R and with the tidymodels framework. We will work through a single, simple binary classification problem using the Titanic dataset, where we predict whether a passenger survived. Let’s dive right into it.

Logistic regression with base R

We start with a base R implementation of logistic regression on the Titanic dataset, where we model the Survived response.

The following is the code that will perform the data modeling along with explanations of what is happening:

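library(tidyverse)

# Titanic is a contingency table; convert it to a data frame, then
# expand it to one row per passenger by repeating each row Freq times
df <- Titanic |>
  as.data.frame() |>
  uncount(Freq)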

This block of code starts by loading a library called tidyverse, which contains...
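The uncount(Freq) step in the R code above expands a frequency table into one row per observation. For readers who also work in Python, a rough pandas counterpart is sketched here. The small table is hypothetical: its counts are made up for illustration and are not the Titanic figures.

```python
import pandas as pd

# Hypothetical frequency table (NOT the real Titanic counts)
freq = pd.DataFrame(
    {
        "Class": ["1st", "1st", "2nd"],
        "Survived": ["No", "Yes", "No"],
        "Freq": [2, 3, 1],
    }
)

# Repeat each row index Freq times, select those rows, then drop the counter
rows = (
    freq.loc[freq.index.repeat(freq["Freq"])]
    .drop(columns="Freq")
    .reset_index(drop=True)
)

print(rows)
```

After the expansion, each row represents one individual, which is the shape a row-wise logistic regression expects.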

Performing linear regression in Python using Excel data

Linear regression in Python can be carried out with the help of libraries such as pandas, scikit-learn, statsmodels, and matplotlib. The following is a step-by-step code example:

  1. First, import the necessary libraries:
    # Import necessary libraries
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    import statsmodels.api as sm
    from statsmodels.graphics.regressionplots import plot_regress_exog
    from statsmodels.graphics.gofplots import qqplot
  2. Then, we create an Excel file with test data. Of course, in a real-life scenario, you would not need the mock data – you would skip this step and load the data from Excel (see the next step) after loading the necessary libraries:
    # Step 0: Generate sample data and save as Excel file
    np.random.seed(0)
    n_samples = 100
    X = np.random.rand(n_samples, 2)  # Two features
    y = 2 * X[:, 0] + 3 * X[:, 1] ...

Logistic regression in Python using Excel data

In the following code, we generate random sample data with two features (Feature1 and Feature2) and a binary target variable (Target) based on a simple condition. We then perform logistic regression; evaluate the model using accuracy, the confusion matrix, and a classification report; visualize the results for binary classification; and interpret the coefficients.

The following is a step-by-step code example:

  1. Again, we start with importing the necessary libraries:
    # Import necessary libraries
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

  2. For this example, we will use a different sample dataset:

    # Step 0: Generate sample data
    np.random.seed(0)
    n_samples = 100
    X = np.random.rand(n_samples, 2)  # Two features
    y = (X...
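The fit-and-evaluate part of this workflow can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the book's code: the labeling rule (the target is 1 when the two features sum to more than 1), the sample size, and the split settings are all choices made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data (assumed rule): target is 1 when the two features sum to more than 1
rng = np.random.default_rng(0)
X = rng.random((500, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Hold out 20% of the rows to evaluate on data the model has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted class

print(f"Accuracy: {acc:.3f}")
print("Confusion matrix:\n", cm)
print("Coefficients (change in log odds per unit of each feature):", clf.coef_[0])
```

Both coefficients come out positive here because, under the assumed rule, a larger value of either feature raises the log odds of the positive class.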

Summary

In this chapter, we explored the powerful world of linear and logistic regression using Excel data. Linear regression, a fundamental statistical technique, allows us to model relationships between dependent and independent variables. We discussed its assumptions and applications, and walked through the entire process of loading data from Excel, preparing it for analysis, and fitting linear regression models using both R (using base R and tidymodels) and Python (with the scikit-learn and statsmodels libraries).

Through comprehensive code examples, you learned how to perform regression analysis, assess model accuracy, and generate valuable statistics and metrics to interpret model results. You also saw how to create diagnostic plots, such as residual plots and Q-Q plots, which aid in identifying issues such as heteroscedasticity and outliers.

Additionally, we delved into logistic regression, a powerful tool for class probability prediction and binary classification...
