You're reading from Extending Excel with Python and R

Product typeBook

Published inApr 2024

PublisherPackt

ISBN-139781804610695

Edition1st Edition

Concepts

Data Analysis

Authors (2):

Steven Sanderson

David Kun

View More author details

Exploratory Data Analysis with R and Python

Exploratory data analysis (EDA) is a crucial initial step in the data analysis process for data scientists. It involves the systematic examination and visualization of a dataset to uncover its underlying patterns, trends, and insights. The primary objectives of EDA are to gain a deeper understanding of the data, identify potential problems or anomalies, and inform subsequent analysis and modeling decisions.

EDA typically starts with a series of data summarization techniques, such as calculating basic statistics (mean, median, and standard deviation), generating frequency distributions, and examining data types and missing values. These preliminary steps provide an overview of the dataset’s structure and quality.

Visualization plays a central role in EDA. Data scientists create various charts and graphs, including histograms, box plots, scatter plots, and heat maps, to visualize the distribution and associations within the data...

Technical requirements

For this chapter, all scripts and files can be found on GitHub at the following link: https://github.com/PacktPublishing/Extending-Excel-with-Python-and-R/tree/main/Chapter%208.

For the R section, we will cover the following libraries:

skimr 2.1.5
GGally 2.2.0
DataExplorer 0.8.3

Understanding data with skimr

As an R programmer, the skimr package is a useful tool for providing summary statistics about variables that can come in a variety of forms such as data frames and vectors. The package provides a larger set of statistics in order to give the end user a more robust set of information as compared to the base R summary() function.

To use the skimr package, it must first be installed from CRAN using the install.packages("skimr") command. Once installed, the package can be loaded using the library(skimr) command. The skim() function is then used to summarize a whole dataset. For example, skim(iris) would provide summary statistics for the iris dataset. The output of skim() is printed horizontally, with one section per variable type and one row per variable.

The package also provides the skim_to_wide() function, which converts the output of skim() to a wide format. This can be useful for exporting the summary statistics to a spreadsheet or other...

Using the GGally package in R

At its core, GGally is an extension of the immensely popular ggplot2 package in R. It takes the elegance and flexibility of ggplot2 and supercharges it with a dazzling array of functions, unleashing your creativity to visualize data in stunning ways.

With GGally, you can effortlessly create beautiful scatter plots, histograms, bar plots, and more. What makes it stand out? GGally simplifies the process of creating complex multivariate plots, saving you time and effort. Want to explore correlations, visualize regression models, or craft splendid survival curves? GGally has your back.

But GGally is not just about aesthetics; it’s about insights. It empowers you to uncover hidden relationships within your data through visual exploration. The intuitive syntax and user-friendly interface make it accessible to both novices and seasoned data scientists.

What’s even better is that GGally encourages collaboration. Its easy-to-share visualizations...

Using the DataExplorer package

The DataExplorer R package is created to streamline the majority of data management and visualization responsibilities during the EDA process. EDA is a critical and primary stage in data analysis, during which analysts take their initial glimpse at the data to formulate meaningful hypotheses and determine subsequent action.

DataExplorer provides a variety of functions to do the following:

Scan and analyze data variables: The package can automatically scan and analyze each variable in a dataset, identifying its type, data distribution, outliers, and missing values.
Visualize data: DataExplorer provides a variety of visualization functions to help analysts understand the relationships between variables and identify patterns in the data. These functions include histograms, scatter plots, box plots, heat maps, and correlation matrices.
Transform data: DataExplorer also provides functions to transform data, such as converting categorical...

Getting started with EDA for Python

As explained earlier, EDA is the process of visually and statistically exploring datasets to uncover patterns, relationships, and insights. It’s a critical step before diving into more complex data analysis tasks. In this section, we’ll introduce you to the fundamentals of EDA and show you how to prepare your Python environment for EDA.

EDA is the initial phase of data analysis where you examine and summarize your dataset. The primary objectives of EDA are as follows:

Understand the data: Gain insights into the structure, content, and quality of your data
Identify patterns: Discover patterns, trends, and relationships within the data
Detect anomalies: Find outliers and anomalies that may require special attention
Generate hypotheses: Formulate initial hypotheses about your data
Prepare for modeling: Preprocess data for advanced modeling and analysis

Before you can perform EDA, you’ll need to...

Data cleaning in Python for Excel data

Data cleaning is a critical process when working with Excel data in Python. It ensures that your data is in the right format and free of errors, enabling you to perform accurate EDA.

We will start with generating some dirty data as an example:

import pandas as pd
import numpy as np
# Create a DataFrame with missing data, duplicates, and mixed data types
data = {
    'ID': [1, 2, 3, 4, 5, 6],
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Eva', 'Eva'],
    'Age': [25, np.nan, 30, 28, 22, 23],
    'Salary': ['$50,000', '$60,000', 'Missing', '$65,000', '$55,000',
    '$75,000']
}
df = pd.DataFrame(data)
# Introduce some missing data
df.loc[1, 'Age'] = np.nan
df.loc[3, &apos...

Performing EDA in Python

With your data loaded and cleaned, you can embark on your initial data exploration journey. This phase is crucial for gaining a deep understanding of your dataset, revealing its underlying patterns, and identifying potential areas of interest or concern.

These preliminary steps not only provide a solid foundation for your EDA but also help you uncover hidden patterns and relationships within your data. Armed with this initial understanding, you can proceed to more advanced data exploration techniques and dive deeper into the Excel dataset.

In the subsequent sections, we’ll delve into specific data exploration and visualization techniques to further enhance your insights into the dataset. With this knowledge, let’s move on to the next section, where we’ll explore techniques for understanding data distributions and relationships in greater detail.

Summary statistics

Begin by generating summary statistics for your dataset. This...

Summary

In this chapter, we delved into two pivotal processes: data cleaning and EDA using R and Python, with a specific focus on Excel data.

Data cleaning is a fundamental step. We learned how to address missing data, be it through imputation, removal, or interpolation. Dealing with duplicates was another key focus, as Excel data, often sourced from multiple places, can be plagued with redundancies. Ensuring the correct assignment of data types was emphasized to prevent analysis errors stemming from data type issues.

In the realm of EDA, we started with summary statistics. These metrics, such as mean, median, standard deviation, and percentiles for numerical features, grant an initial grasp of data central tendencies and variability. We then explored data distribution, understanding which is critical for subsequent analysis and modeling decisions. Lastly, we delved into the relationships between variables, employing scatter plots and correlation matrices to unearth correlations...

The rest of the chapter is locked

You have been reading a chapter from

Extending Excel with Python and R

Published in: Apr 2024Publisher: PacktISBN-13: 9781804610695

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Authors (2)

Steven Sanderson

Steven Sanderson, MPH, is an applications manager for the patient accounts department at Stony Brook Medicine. He received his bachelor's degree in economics and his master's in public health from Stony Brook University. He has worked in healthcare in some capacity for just shy of 20 years. He is the author and maintainer of the healthyverse set of R packages. He likes to read material related to social and labor economics and has recently turned his efforts back to his guitar with the hope that his kids will follow suit as a hobby they can enjoy together.
Read more about Steven Sanderson

David Kun

David Kun is a mathematician and actuary who has always worked in the gray zone between quantitative teams and ICT, aiming to build a bridge. He is a co-founder and director of Functional Analytics and the creator of the ownR Infinity platform. As a data scientist, he also uses ownR for his daily work. His projects include time series analysis for demand forecasting, computer vision for design automation, and visualization.
Read more about David Kun

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages