Reader small image

You're reading from  Extending Excel with Python and R

Product typeBook
Published inApr 2024
PublisherPackt
ISBN-139781804610695
Edition1st Edition
Right arrow
Authors (2):
Steven Sanderson
Steven Sanderson
author image
Steven Sanderson

Steven Sanderson, MPH, is an applications manager for the patient accounts department at Stony Brook Medicine. He received his bachelor's degree in economics and his master's in public health from Stony Brook University. He has worked in healthcare in some capacity for just shy of 20 years. He is the author and maintainer of the healthyverse set of R packages. He likes to read material related to social and labor economics and has recently turned his efforts back to his guitar with the hope that his kids will follow suit as a hobby they can enjoy together.
Read more about Steven Sanderson

David Kun
David Kun
author image
David Kun

David Kun is a mathematician and actuary who has always worked in the gray zone between quantitative teams and ICT, aiming to build a bridge. He is a co-founder and director of Functional Analytics and the creator of the ownR Infinity platform. As a data scientist, he also uses ownR for his daily work. His projects include time series analysis for demand forecasting, computer vision for design automation, and visualization.
Read more about David Kun

View More author details
Right arrow

Exploratory Data Analysis with R and Python

Exploratory data analysis (EDA) is a crucial initial step in the data analysis process for data scientists. It involves the systematic examination and visualization of a dataset to uncover its underlying patterns, trends, and insights. The primary objectives of EDA are to gain a deeper understanding of the data, identify potential problems or anomalies, and inform subsequent analysis and modeling decisions.

EDA typically starts with a series of data summarization techniques, such as calculating basic statistics (mean, median, and standard deviation), generating frequency distributions, and examining data types and missing values. These preliminary steps provide an overview of the dataset’s structure and quality.

Visualization plays a central role in EDA. Data scientists create various charts and graphs, including histograms, box plots, scatter plots, and heat maps, to visualize the distribution and associations within the data...

Technical requirements

For this chapter, all scripts and files can be found on GitHub at the following link: https://github.com/PacktPublishing/Extending-Excel-with-Python-and-R/tree/main/Chapter%208.

For the R section, we will cover the following libraries:

  • skimr 2.1.5
  • GGally 2.2.0
  • DataExplorer 0.8.3

Understanding data with skimr

As an R programmer, the skimr package is a useful tool for providing summary statistics about variables that can come in a variety of forms such as data frames and vectors. The package provides a larger set of statistics in order to give the end user a more robust set of information as compared to the base R summary() function.

To use the skimr package, it must first be installed from CRAN using the install.packages("skimr") command. Once installed, the package can be loaded using the library(skimr) command. The skim() function is then used to summarize a whole dataset. For example, skim(iris) would provide summary statistics for the iris dataset. The output of skim() is printed horizontally, with one section per variable type and one row per variable.

The package also provides the skim_to_wide() function, which converts the output of skim() to a wide format. This can be useful for exporting the summary statistics to a spreadsheet or other...

Using the GGally package in R

At its core, GGally is an extension of the immensely popular ggplot2 package in R. It takes the elegance and flexibility of ggplot2 and supercharges it with a dazzling array of functions, unleashing your creativity to visualize data in stunning ways.

With GGally, you can effortlessly create beautiful scatter plots, histograms, bar plots, and more. What makes it stand out? GGally simplifies the process of creating complex multivariate plots, saving you time and effort. Want to explore correlations, visualize regression models, or craft splendid survival curves? GGally has your back.

But GGally is not just about aesthetics; it’s about insights. It empowers you to uncover hidden relationships within your data through visual exploration. The intuitive syntax and user-friendly interface make it accessible to both novices and seasoned data scientists.

What’s even better is that GGally encourages collaboration. Its easy-to-share visualizations...

Using the DataExplorer package

The DataExplorer R package is created to streamline the majority of data management and visualization responsibilities during the EDA process. EDA is a critical and primary stage in data analysis, during which analysts take their initial glimpse at the data to formulate meaningful hypotheses and determine subsequent action.

DataExplorer provides a variety of functions to do the following:

  • Scan and analyze data variables: The package can automatically scan and analyze each variable in a dataset, identifying its type, data distribution, outliers, and missing values.
  • Visualize data: DataExplorer provides a variety of visualization functions to help analysts understand the relationships between variables and identify patterns in the data. These functions include histograms, scatter plots, box plots, heat maps, and correlation matrices.
  • Transform data: DataExplorer also provides functions to transform data, such as converting categorical...

Getting started with EDA for Python

As explained earlier, EDA is the process of visually and statistically exploring datasets to uncover patterns, relationships, and insights. It’s a critical step before diving into more complex data analysis tasks. In this section, we’ll introduce you to the fundamentals of EDA and show you how to prepare your Python environment for EDA.

EDA is the initial phase of data analysis where you examine and summarize your dataset. The primary objectives of EDA are as follows:

  • Understand the data: Gain insights into the structure, content, and quality of your data
  • Identify patterns: Discover patterns, trends, and relationships within the data
  • Detect anomalies: Find outliers and anomalies that may require special attention
  • Generate hypotheses: Formulate initial hypotheses about your data
  • Prepare for modeling: Preprocess data for advanced modeling and analysis

Before you can perform EDA, you’ll need to...

Data cleaning in Python for Excel data

Data cleaning is a critical process when working with Excel data in Python. It ensures that your data is in the right format and free of errors, enabling you to perform accurate EDA.

We will start with generating some dirty data as an example:

import pandas as pd
import numpy as np
# Create a DataFrame with missing data, duplicates, and mixed data types
data = {
    'ID': [1, 2, 3, 4, 5, 6],
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Eva', 'Eva'],
    'Age': [25, np.nan, 30, 28, 22, 23],
    'Salary': ['$50,000', '$60,000', 'Missing', '$65,000', '$55,000',
    '$75,000']
}
df = pd.DataFrame(data)
# Introduce some missing data
df.loc[1, 'Age'] = np.nan
df.loc[3, &apos...

Performing EDA in Python

With your data loaded and cleaned, you can embark on your initial data exploration journey. This phase is crucial for gaining a deep understanding of your dataset, revealing its underlying patterns, and identifying potential areas of interest or concern.

These preliminary steps not only provide a solid foundation for your EDA but also help you uncover hidden patterns and relationships within your data. Armed with this initial understanding, you can proceed to more advanced data exploration techniques and dive deeper into the Excel dataset.

In the subsequent sections, we’ll delve into specific data exploration and visualization techniques to further enhance your insights into the dataset. With this knowledge, let’s move on to the next section, where we’ll explore techniques for understanding data distributions and relationships in greater detail.

Summary statistics

Begin by generating summary statistics for your dataset. This...

Summary

In this chapter, we delved into two pivotal processes: data cleaning and EDA using R and Python, with a specific focus on Excel data.

Data cleaning is a fundamental step. We learned how to address missing data, be it through imputation, removal, or interpolation. Dealing with duplicates was another key focus, as Excel data, often sourced from multiple places, can be plagued with redundancies. Ensuring the correct assignment of data types was emphasized to prevent analysis errors stemming from data type issues.

In the realm of EDA, we started with summary statistics. These metrics, such as mean, median, standard deviation, and percentiles for numerical features, grant an initial grasp of data central tendencies and variability. We then explored data distribution, understanding which is critical for subsequent analysis and modeling decisions. Lastly, we delved into the relationships between variables, employing scatter plots and correlation matrices to unearth correlations...

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Extending Excel with Python and R
Published in: Apr 2024Publisher: PacktISBN-13: 9781804610695
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Authors (2)

author image
Steven Sanderson

Steven Sanderson, MPH, is an applications manager for the patient accounts department at Stony Brook Medicine. He received his bachelor's degree in economics and his master's in public health from Stony Brook University. He has worked in healthcare in some capacity for just shy of 20 years. He is the author and maintainer of the healthyverse set of R packages. He likes to read material related to social and labor economics and has recently turned his efforts back to his guitar with the hope that his kids will follow suit as a hobby they can enjoy together.
Read more about Steven Sanderson

author image
David Kun

David Kun is a mathematician and actuary who has always worked in the gray zone between quantitative teams and ICT, aiming to build a bridge. He is a co-founder and director of Functional Analytics and the creator of the ownR Infinity platform. As a data scientist, he also uses ownR for his daily work. His projects include time series analysis for demand forecasting, computer vision for design automation, and visualization.
Read more about David Kun