You're reading from Extending Excel with Python and R

Product type: Book

Published in: Apr 2024

Publisher: Packt

ISBN-13: 9781804610695

Edition: 1st Edition

Concepts

Data Analysis

Authors (2):

Steven Sanderson

David Kun

Data Analysis and Visualization with R and Python in Excel – A Case Study

In this final chapter, we are going to perform an analysis—visualization and a simple model—built with data from Excel and place all those outcomes back into it. This can be useful when there is a lot of data, or the calculations themselves are best suited to being done outside of Excel.

First, we will start with importing our data and then performing some data exploration via visualizations. For this chapter, we are going to use the diamonds dataset from the R package called ggplot2. We will view the data where the price is the outcome and look at it via different facets of the diamond’s characteristics. After the visualizations are done, we will perform some simple modeling to predict the price of a diamond based on its characteristics.

In this chapter, we’re going to cover the following main topics:

Getting a visualization
Performing a simple machine learning...

Technical requirements

For this chapter, we will be using the following packages/libraries:

ggplot2 3.4.4
dplyr 1.1.4
healthyR 0.2.1
readxl 1.4.3
tidyverse 2.0.0
janitor 2.2.0
writexl 1.5.0
healthyR.ai 0.0.13

Getting visualizations with R

In this section, we are going to go over getting some visualizations of the data. We will create several visualizations and give short interpretations of the outcomes in them. For this, we will create two histograms in base R and a few different visuals using the ggplot2 library.

Getting the data

The first thing we need to do is load the libraries and get the data. I am working in a directory specific to this book, so I can source the read_excel_sheets() function directly from the chapter in which I wrote it; your path might be different. Let’s look at the code up to this point:

# Library Load
library(ggplot2)
library(dplyr)
library(healthyR)
library(readxl)
# Source Functions
source(paste0(getwd(),"/Chapter1/excel_sheet_reader.R"))
# Read data
file_path <- paste0(getwd(), "/Chapter12/")
df <- read_excel_sheets(
  filename = paste0(file_path, "diamonds_split.xlsx"),
  single_tbl...

Performing a simple ML model with R

In this section, we are going to build a simple ML model in R. There are so many different ways to do this in R that it would be impossible for me to list them all. Fortunately, CRAN has done this so you and I don’t have to. If you want to see a task view of ML on CRAN, you can follow this link: https://cran.r-project.org/view=MachineLearning.

For this section, we are going to use the XGBoost algorithm as implemented by the healthyR.ai package. The algorithm itself is not written differently; the only difference is how data is saved in the output. The healthyR.ai package also contains a preprocessor for the XGBoost algorithm to ensure that the input data matches what the algorithm is expecting before modeling. The two main functions that we will be using are hai_xgboost_data_prepper() and hai_auto_xgboost().

We will not cover loading the data in again as it was covered previously. Let’s get started!

Data preprocessing

...

Getting visualizations with Python

In this section, we are going to go over visualizations of the data in Python, analogous to the preceding R section. We will use plotnine to have visualizations similar to those created in R using ggplot2 and provide interpretations of the results.

Getting the data

As in the earlier chapters, we will load the data using pandas. As before, the path to the XLSX file on your machine may differ from mine, so adjust the file path accordingly:

import pandas as pd
# Define the file path (may be different for you)
file_path = "./Chapter 12/diamonds.xlsx"
# Load the dataset into a pandas DataFrame
df = pd.read_excel(file_path)
# Display the first few rows of the DataFrame
print(df.head())

Note that we use the raw diamonds dataset here, without splitting it first and then recombining it as was done in the R part of the chapter.
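For readers who do want to mirror the R workflow, the following is a minimal sketch (not the book's code) of how a workbook split across several sheets could be read and recombined with pandas. It assumes a file laid out like the diamonds_split.xlsx used in the R part, with one sheet per subset and identical columns on every sheet. The helper names combine_sheets and read_split_workbook are hypothetical, and the sample values are made up for illustration.

```python
import pandas as pd


def combine_sheets(sheets: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Stack per-sheet DataFrames into one table, keeping the sheet name."""
    frames = [
        frame.assign(sheet_name=name)  # remember where each row came from
        for name, frame in sheets.items()
    ]
    return pd.concat(frames, ignore_index=True)


def read_split_workbook(path: str) -> pd.DataFrame:
    """Read every sheet of an Excel workbook and recombine the rows."""
    # sheet_name=None makes pandas return a dict of {sheet name: DataFrame}
    sheets = pd.read_excel(path, sheet_name=None)
    return combine_sheets(sheets)


# Small in-memory stand-in for two sheets (made-up values, not real data)
example_sheets = {
    "Ideal": pd.DataFrame({"carat": [0.23, 0.31], "price": [326, 335]}),
    "Premium": pd.DataFrame({"carat": [0.21], "price": [326]}),
}
combined = combine_sheets(example_sheets)
print(combined)
```

With a real file, read_split_workbook("path/to/your/workbook.xlsx") would return a single DataFrame with an extra sheet_name column; the path shown is a placeholder.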

Visualizing the data

Once we have our data loaded, we can use plotnine to create visualizations...

Performing a simple ML model with Python

In this section, we create a simple ML model in Python. Python has grown to be the primary go-to language for ML work (with R as the obvious alternative), and the number of packages implementing ML algorithms is difficult to overestimate. Having said that, sklearn remains the most widely used, so we will choose it for this section. As in the R part of the chapter, we will use the xgboost model because it offers a good balance between performance and explainability.

We will use the data loaded in the previous section.
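Before turning to the real data, the overall shape of such a workflow can be sketched on synthetic data. This is an assumption-laden illustration, not the chapter's model: it uses sklearn's GradientBoostingRegressor as a stand-in for the xgboost package (same model family, different implementation), and the data, including the price formula, is invented.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: price rises with carat, plus noise (not real data)
rng = np.random.default_rng(42)
carat = rng.uniform(0.2, 2.0, size=500)
depth = rng.uniform(55.0, 70.0, size=500)
price = 4000 * carat + rng.normal(0, 200, size=500)
data = pd.DataFrame({"carat": carat, "depth": depth, "price": price})

# Hold out 20% of the rows to check the model on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    data[["carat", "depth"]], data["price"], test_size=0.2, random_state=42
)

# Gradient-boosted trees: same model family as XGBoost, different implementation
model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
print(f"Mean absolute error on the test set: {mae:.1f}")
```

Holding out a test set before fitting is the important design choice here: the error is measured on rows the model never saw, which gives a more honest picture of how it would behave on new diamonds.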

Data preprocessing

The first thing to do in the modeling phase is to prepare the data. Fortunately, sklearn comes with built-in preprocessing functionality.

Let’s review the steps involved in data preprocessing:

Handling missing values: Before training a model, it’s essential to address missing values in the dataset. sklearn provides methods for imputing missing values or removing rows...
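The two options mentioned for missing values, removing rows or imputing, can be sketched as follows. This is a minimal example on made-up rows with diamonds-style column names; the choice of the median as the fill value is an assumption for illustration, not necessarily what the chapter uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up rows with diamonds-style numeric columns and two gaps
raw = pd.DataFrame(
    {
        "carat": [0.23, np.nan, 0.31, 0.29],
        "depth": [61.5, 59.8, np.nan, 62.4],
        "price": [326, 326, 335, 334],
    }
)

# Option 1: drop every row that has at least one missing value
dropped = raw.dropna()

# Option 2: fill each gap with the median of its column
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(raw), columns=raw.columns)

print(dropped.shape, imputed.isna().sum().sum())
```

Dropping rows is simpler but discards data, while imputation keeps every row at the cost of introducing estimated values; which is preferable depends on how many gaps the dataset has.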

Authors (2)

Steven Sanderson

Steven Sanderson, MPH, is an applications manager for the patient accounts department at Stony Brook Medicine. He received his bachelor's degree in economics and his master's in public health from Stony Brook University. He has worked in healthcare in some capacity for just shy of 20 years. He is the author and maintainer of the healthyverse set of R packages. He likes to read material related to social and labor economics and has recently turned his efforts back to his guitar with the hope that his kids will follow suit as a hobby they can enjoy together.

David Kun

David Kun is a mathematician and actuary who has always worked in the gray zone between quantitative teams and ICT, aiming to build a bridge. He is a co-founder and director of Functional Analytics and the creator of the ownR Infinity platform. As a data scientist, he also uses ownR for his daily work. His projects include time series analysis for demand forecasting, computer vision for design automation, and visualization.