Reader small image

You're reading from  Extending Excel with Python and R

Product typeBook
Published inApr 2024
PublisherPackt
ISBN-139781804610695
Edition1st Edition
Right arrow
Authors (2):
Steven Sanderson
Steven Sanderson
author image
Steven Sanderson

Steven Sanderson, MPH, is an applications manager for the patient accounts department at Stony Brook Medicine. He received his bachelor's degree in economics and his master's in public health from Stony Brook University. He has worked in healthcare in some capacity for just shy of 20 years. He is the author and maintainer of the healthyverse set of R packages. He likes to read material related to social and labor economics and has recently turned his efforts back to his guitar with the hope that his kids will follow suit as a hobby they can enjoy together.
Read more about Steven Sanderson

David Kun
David Kun
author image
David Kun

David Kun is a mathematician and actuary who has always worked in the gray zone between quantitative teams and ICT, aiming to build a bridge. He is a co-founder and director of Functional Analytics and the creator of the ownR Infinity platform. As a data scientist, he also uses ownR for his daily work. His projects include time series analysis for demand forecasting, computer vision for design automation, and visualization.
Read more about David Kun

View More author details
Right arrow

Data Analysis and Visualization with R and Python in Excel – A Case Study

In this final chapter, we are going to perform an analysis—visualization and a simple model—built with data from Excel and place all those outcomes back into it. This can be useful when there is a lot of data, or the calculations themselves are best suited to being done outside of Excel.

First, we will start with importing our data and then performing some data exploration via visualizations. For this chapter, we are going to use the diamonds dataset from the R package called ggplot2. We will view the data where the price is the outcome and look at it via different facets of the diamond’s characteristics. After the visualizations are done, we will perform some simple modeling to predict the price of a diamond based on its characteristics.

In this chapter, we’re going to cover the following main topics:

  • Getting a visualization
  • Performing a simple machine learning...

Technical requirements

For this chapter, we will be using the following packages/libraries:

  • ggplot2 3.4.4
  • dplyr 1.1.4
  • healthyR 0.2.1
  • readxl 1.4.3
  • tidyverse 2.0.0
  • janitor 2.2.0
  • writexl 1.5.0
  • healthyR.ai 0.0.13

Getting visualizations with R

In this section, we are going to go over getting some visualizations of the data. We will create several visualizations and give short interpretations of the outcomes in them. For this, we will create two histograms in base R and a few different visuals using the ggplot2 library.

Getting the data

The first thing we need to do is load the libraries and get the data. I am working in a directory specific to this book so I can source the function directly from the chapter I wrote the read_excel_sheets()? function in; your path might be different. Let’s look at the code up to this point:

# Library Load
library(ggplot2)
library(dplyr)
library(healthyR)
library(readxl)
# Source Functions
source(paste0(getwd(),"/Chapter1/excel_sheet_reader.R"))
# Read data
file_path <- paste0(getwd(), "/Chapter12/")
df <- read_excel_sheets(
  filename = paste0(file_path, "diamonds_split.xlsx"),"),
  single_tbl...

Performing a simple ML model with R

In this section, we are going to go over performing a simple ML model in R. There are so many different ways to do this in R that it would be impossible for me to list them all, however, CRAN has done this so you and I don’t have to. If you want to see a task view of ML on CRAN, you can follow this link: https://cran.r-project.org/view=MachineLearning.

For this section, we are going to use the XGBoost algorithm as implemented by the healthyR.ai package. The algorithm is not written differently, the only difference is how data is saved in the output. The healthyR.ai package also contains a preprocessor for the XGBoost algorithm to ensure that the input data matches what the algorithm is expecting before modeling. The two main functions that we will be using are hai_xgboost_data_prepper() and hai_auto_xgboost().

We will not cover loading the data in again as it was covered previously. Let’s get started!

Data preprocessing

...

Getting visualizations with Python

In this section, we are going to go over visualizations of the data in Python, analogous to the preceding R section. We will use plotnine to have visualizations similar to those created in R using ggplot2 and provide interpretations of the results.

Getting the data

Like in the earlier chapters, we will load the data using pandas. Just like before, the path to the XLSX file may be different for you from what I have, so adjust the filepath accordingly:

import pandas as pd
# Define the file path (may be different for you)
file_path = "./Chapter 12/diamonds.xlsx"
# Load the dataset into a pandas DataFrame
df = pd.read_excel(file_path)
# Display the first few rows of the DataFrame
print(df.head())

Note that we use the raw diamonds dataset without spitting it first and then recombining it, as it was done in the R part of the chapter.

Visualizing the data

Once we have our data loaded, we can use plotnine to create visualizations...

Performing a simple ML model with Python

In this section, we create a simple ML model in Python. Python has grown to be the primary go-to language for ML work (with R as the obvious alternative) and the number of packages implementing ML algorithms is difficult to overestimate. Having said that, sklearn remains the most widely used so we will also choose it for this section. Similarly to the R part of the chapter, we will use the xgboost model because it has a great balance between performance and explainability.

We will use the data loaded in the previous section.

Data preprocessing

The first thing to do for the modeling phase is to prepare the data. Fortunately, sklearn comes with a preprocessing functionality built-in!

Let’s review the steps involved in data preprocessing:

  • Handling missing values: Before training a model, it’s essential to address missing values in the dataset. sklearn provides methods for imputing missing values or removing rows...
lock icon
The rest of the chapter is locked
You have been reading a chapter from
Extending Excel with Python and R
Published in: Apr 2024Publisher: PacktISBN-13: 9781804610695
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Authors (2)

author image
Steven Sanderson

Steven Sanderson, MPH, is an applications manager for the patient accounts department at Stony Brook Medicine. He received his bachelor's degree in economics and his master's in public health from Stony Brook University. He has worked in healthcare in some capacity for just shy of 20 years. He is the author and maintainer of the healthyverse set of R packages. He likes to read material related to social and labor economics and has recently turned his efforts back to his guitar with the hope that his kids will follow suit as a hobby they can enjoy together.
Read more about Steven Sanderson

author image
David Kun

David Kun is a mathematician and actuary who has always worked in the gray zone between quantitative teams and ICT, aiming to build a bridge. He is a co-founder and director of Functional Analytics and the creator of the ownR Infinity platform. As a data scientist, he also uses ownR for his daily work. His projects include time series analysis for demand forecasting, computer vision for design automation, and visualization.
Read more about David Kun