You're reading from The Statistics and Machine Learning with R Workshop

Product type: Book

Published in: Oct 2023

Reading level: Intermediate

Publisher: Packt

ISBN-13: 9781803240305

Edition: 1st Edition

Languages

Concepts

Machine Learning

Author (1)

Liu Peng

Exploratory Data Analysis

The previous chapter covered the basic plotting principles of ggplot2, including the use of various geometry and theme layers. Cleaning and massaging the raw data (covered in Chapter 2 and Chapter 3) and visualizing the data (covered in Chapter 4) both belong to the first stage of a typical data science project workflow: exploratory data analysis (EDA). We will cover this stage using a few case studies in this chapter. We will apply the coding techniques covered earlier in this book and focus on analyzing the data through the lens of EDA.

By the end of this chapter, you will know how to uncover the structures of data using numerical and graphical techniques, discover interesting relationships among variables, and spot unusual observations.

In this chapter, we will cover the following topics:

EDA fundamentals
EDA in practice

Technical requirements

To complete the exercises in this chapter, you will need to have the following:

The latest version of the yfR package, which is 1.0.0 at the time of writing
The latest version of the corrplot package, which is 0.92 at the time of writing

The code and data for this chapter are available at https://github.com/PacktPublishing/The-Statistics-and-Machine-Learning-with-R-Workshop/blob/main/Chapter_5/chapter_5.R.

EDA fundamentals

When facing a new dataset in the form of a table (a DataFrame), whether it comes from Excel or another data source, EDA helps us gain insight into the underlying patterns and irregularities of the variables in the dataset. This is an important first step before building any predictive model. As the saying goes, garbage in, garbage out. When the input variables used for model development suffer from problems such as missing values or different scales, the resulting model may perform poorly, converge slowly, or even hit an error in the training stage. Therefore, understanding your data and ensuring the raw materials are in good shape are critical steps toward a well-performing model later on.
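As a minimal sketch of such a first check, the following code counts missing values and compares variable scales. It uses the built-in `airquality` dataset purely as a stand-in (an assumption for illustration; it is not the dataset used in this chapter):

```r
# Inspect a built-in dataset for two common input problems:
# missing values and variables measured on very different scales.
data(airquality)

# Number of missing values in each column
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Minimum and maximum of each column, ignoring missing values
ranges <- sapply(airquality, range, na.rm = TRUE)
print(ranges)

# Standardize four numeric columns so that each has mean 0 and
# standard deviation 1 (scale() skips missing values when computing both)
scaled <- scale(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
print(round(colMeans(scaled, na.rm = TRUE), 10))
```

Standardizing is only one possible way to put variables on a common scale; whether it is appropriate depends on the model that follows.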

This is where EDA comes in. Instead of being a rigid statistical procedure, EDA is a set of exploratory analyses that enables you to develop a better understanding of the features and potential relationships in the data. It serves as a transitional analysis to guide modeling later on, involving...

EDA in practice

In this section, we will analyze a dataset that consists of the stock prices of the top five companies in 2021. First, we will look at how to download and process these stock prices. Then, we will perform univariate analysis, followed by bivariate analysis in terms of correlation.
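To make the univariate and bivariate steps concrete before working with real data, here is a minimal sketch on simulated daily returns. The ticker names `AAA`, `BBB`, and `CCC` and all the numbers are hypothetical and generated only for illustration; they are not real market data. The plotting step runs only if the corrplot package is installed:

```r
set.seed(42)  # make the simulated data reproducible

# Simulated daily returns for three hypothetical tickers.
# AAA and BBB share a common component, so they should be correlated;
# CCC is generated independently.
n <- 250
market <- rnorm(n, mean = 0, sd = 0.01)
returns <- data.frame(
  AAA = market + rnorm(n, sd = 0.005),
  BBB = market + rnorm(n, sd = 0.005),
  CCC = rnorm(n, sd = 0.01)
)

# Univariate view: quartiles, extremes, and mean of each series
print(summary(returns))

# Bivariate view: pairwise Pearson correlation matrix
cor_mat <- cor(returns)
print(round(cor_mat, 2))

# Optional: visualize the correlation matrix if corrplot is available
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_mat, method = "circle")
}
```

The same two calls, `summary()` and `cor()`, would apply unchanged to a data frame of real returns with one column per ticker.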

Obtaining the stock price data

To obtain the daily stock prices of a particular ticker, we can use the yfR package to download the data from Yahoo! Finance, a vast repository of financial data that covers a large number of markets and assets and has been widely used in both academia and industry. The following exercise illustrates how to download the stock data using yfR.

Exercise 5.8 – downloading stock prices

In this exercise, we will look at how to specify the different parameters so that we can download stock prices from Yahoo! Finance, including the ticker name and date range:

Install and load the yfR package:
```
>>> install.packages("yfR")
>>...
```
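The download step described above might look like the following sketch. It assumes the `yf_get()` function from yfR with its `tickers`, `first_date`, and `last_date` arguments. The helper name `download_prices` and the tickers and dates below are hypothetical choices for illustration, not the ones used in the exercise:

```r
# Sketch of a download helper; assumes the yfR package is installed
# and that an internet connection is available when it is called.
download_prices <- function(tickers, first_date, last_date) {
  if (!requireNamespace("yfR", quietly = TRUE)) {
    stop("Please install the yfR package first.")
  }
  # yf_get() returns a long-format data frame with one row
  # per ticker and trading day
  yfR::yf_get(
    tickers = tickers,
    first_date = first_date,
    last_date = last_date
  )
}

# Hypothetical tickers and date range, chosen only for illustration
tickers <- c("AAPL", "MSFT")
first_date <- as.Date("2021-01-01")
last_date <- as.Date("2021-12-31")
```

Calling `df <- download_prices(tickers, first_date, last_date)` and then `str(df)` would show the structure of the downloaded data. The call itself is left out of the block because it depends on network access.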

Summary

In this chapter, we introduced basic techniques to conduct EDA. We started by going over the common approaches to analyzing and summarizing categorical data, including frequency count and bar charts. We then introduced marginal distribution and faceted bar charts when working with multiple categorical variables.

Next, we switched to analyzing numerical variables. We covered measures that are sensitive to extreme values, such as the mean (central tendency) and the variance (variation), as well as robust measures such as the median and the IQR. Several types of charts are available for visualizing a numerical variable, including histograms, density plots, and box plots, all of which can be combined with another categorical variable.
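As a quick reminder of the difference between sensitive and robust measures, the following sketch compares them on a small made-up sample before and after adding a single extreme value (the numbers are invented for illustration):

```r
# A small sample, and the same sample with one extreme value added
x <- c(2, 4, 4, 5, 7, 9)
x_outlier <- c(x, 100)

# Collect the four measures for a numeric vector
describe <- function(v) {
  c(mean = mean(v), variance = var(v), median = median(v), IQR = IQR(v))
}

# One row per sample, one column per measure
comparison <- rbind(original = describe(x), with_outlier = describe(x_outlier))
print(round(comparison, 2))

# How far each measure of central tendency moves after adding the outlier
mean_shift <- abs(mean(x_outlier) - mean(x))
median_shift <- abs(median(x_outlier) - median(x))
print(c(mean_shift = mean_shift, median_shift = median_shift))
```

The mean and variance move much more than the median and IQR, which is why the latter are preferred when unusual observations are present.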

Finally, we went through a case study using the stock price data. We started by downloading the real data from Yahoo! Finance and applying all the EDA techniques to analyze the data, followed by creating a correlation plot to indicate the strength of covariation between each pair of variables...

Author (1)

Liu Peng

Peng Liu is an Assistant Professor of Quantitative Finance (Practice) at Singapore Management University and an adjunct researcher at the National University of Singapore. He holds a Ph.D. in statistics from the National University of Singapore and has ten years of working experience as a data scientist across the banking, technology, and hospitality industries.