
You're reading from The Statistics and Machine Learning with R Workshop

Product type: Book
Published in: Oct 2023
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781803240305
Edition: 1st Edition
Author (1)
Liu Peng

Peng Liu is an Assistant Professor of Quantitative Finance (Practice) at Singapore Management University and an adjunct researcher at the National University of Singapore. He holds a Ph.D. in statistics from the National University of Singapore and has ten years of working experience as a data scientist across the banking, technology, and hospitality industries.

Exploratory Data Analysis

The previous chapter covered the basic plotting principles of ggplot2, including the use of various geometries and theme layers. Cleaning and massaging the raw data (covered in Chapter 2 and Chapter 3) and visualizing it (covered in Chapter 4) both belong to the first stage of a typical data science project workflow, known as exploratory data analysis (EDA). In this chapter, we will work through a few case studies that apply the coding techniques covered earlier in this book and analyze the data through the lens of EDA.

By the end of this chapter, you will know how to uncover the structures of data using numerical and graphical techniques, discover interesting relationships among variables, and spot unusual observations.

In this chapter, we will cover the following topics:

  • EDA fundamentals
  • EDA in practice

Technical requirements

To complete the exercises in this chapter, you will need to have the following:

  • The latest version of the yfR package, which is 1.0.0 at the time of writing
  • The latest version of the corrplot package, which is 0.92 at the time of writing

The code and data for this chapter are available at https://github.com/PacktPublishing/The-Statistics-and-Machine-Learning-with-R-Workshop/blob/main/Chapter_5/chapter_5.R.

EDA fundamentals

When facing a new dataset in the form of a table (a DataFrame), whether it comes from Excel or a database, EDA helps us gain insight into the underlying patterns and irregularities of the variables in the dataset. This is an important first step before building any predictive model. As the saying goes, garbage in, garbage out. When the input variables used for model development suffer from problems such as missing values or different scales, the resulting model may perform poorly, converge slowly, or even hit an error in the training stage. Therefore, understanding your data and making sure the raw materials are in good shape are critical steps toward a well-performing model later on.
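The two problems mentioned above, missing values and different scales, can be checked with a few lines of base R. The following is a minimal sketch on a toy data frame; the column names and values are hypothetical and only serve to illustrate the checks:

```r
# Toy data frame (hypothetical values) with one missing entry
# and two variables on very different scales
df <- data.frame(
  age    = c(25, 32, NA, 41, 38),
  income = c(52000, 61000, 58000, 75000, 69000)
)

# Count the missing values in each column
na_counts <- colSums(is.na(df))
print(na_counts)

# Compare the min and max of each column to spot scale differences
print(sapply(df, range, na.rm = TRUE))

# Standardize each column to mean 0 and standard deviation 1;
# scale() ignores NA values when computing the mean and sd
df_scaled <- as.data.frame(scale(df))
print(df_scaled)
```

Whether to impute or drop the missing entry, and whether standardization is the right rescaling, depends on the model that follows; the sketch only shows how to detect the issues.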

This is where EDA comes in. Instead of being a rigid statistical procedure, EDA is a set of exploratory analyses that enables you to develop a better understanding of the features and potential relationships in the data. It serves as a transitional analysis to guide modeling later on, involving...

EDA in practice

In this section, we will analyze a dataset that consists of the stock prices of the top five companies in 2021. First, we will look at how to download and process these stock prices, and then we will perform univariate analysis, followed by bivariate analysis in terms of correlation.
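To make the univariate and bivariate steps concrete before working with real data, here is a small sketch on simulated daily returns. The three tickers (AAA, BBB, CCC) and all the numbers are made up for illustration and are not the companies analyzed in this chapter:

```r
set.seed(42)
n <- 250  # roughly one year of trading days

# Simulated daily returns for three hypothetical tickers:
# BBB is constructed to move with AAA, while CCC is independent
aaa <- rnorm(n, mean = 0, sd = 0.01)
bbb <- 0.8 * aaa + rnorm(n, mean = 0, sd = 0.005)
ccc <- rnorm(n, mean = 0, sd = 0.01)
returns <- data.frame(AAA = aaa, BBB = bbb, CCC = ccc)

# Univariate view: summary statistics for each ticker
print(summary(returns))

# Bivariate view: pairwise Pearson correlation matrix
cor_mat <- cor(returns)
print(round(cor_mat, 2))
```

With the corrplot package installed, a matrix like `cor_mat` can be passed to `corrplot::corrplot(cor_mat)` to draw it as a correlation plot.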

Obtaining the stock price data

To obtain the daily stock prices of a particular ticker, we can use the yfR package to download the data from Yahoo! Finance, a vast repository of financial data that covers a large number of markets and assets and has been widely used in both academia and industry. The following exercise illustrates how to download the stock data using yfR.
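As a rough sketch of what such a download can look like, the wrapper below calls `yf_get()` from yfR with a ticker vector and a date range. The wrapper name `download_prices`, the ticker, and the dates are hypothetical choices for illustration and are not taken from the exercise; calling the function needs the yfR package and an internet connection:

```r
# Thin wrapper around yfR::yf_get(); defining it needs nothing,
# but calling it requires yfR and internet access
download_prices <- function(tickers, first_date, last_date) {
  yfR::yf_get(
    tickers    = tickers,     # character vector of Yahoo! Finance tickers
    first_date = first_date,  # start of the date range (YYYY-MM-DD)
    last_date  = last_date    # end of the date range (YYYY-MM-DD)
  )
}

# Example call (hypothetical ticker and date range, not run here):
# prices <- download_prices("AAPL", "2021-01-01", "2022-01-01")
# head(prices[, c("ticker", "ref_date", "price_adjusted")])
```

The result is a long-format table with one row per ticker and date, which suits grouped summaries and ggplot2 charts.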

Exercise 5.8 – downloading stock prices

In this exercise, we will look at how to specify the different parameters so that we can download stock prices from Yahoo! Finance, including the ticker name and date range:

  1. Install and load the yfR package:
    >>> install.packages("yfR")
    >>...

Summary

In this chapter, we introduced basic techniques for conducting EDA. We started by going over the common approaches to analyzing and summarizing categorical data, including frequency counts and bar charts. We then introduced marginal distributions and faceted bar charts for working with multiple categorical variables.

Next, we switched to analyzing numerical variables and covered measures of central tendency and variation that are sensitive to outliers, such as the mean and variance, as well as robust measures such as the median and IQR. Several types of charts are available for visualizing a numerical variable, including histograms, density plots, and box plots, all of which can be combined with another categorical variable.
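The difference between sensitive and robust measures is easy to see on a small example. The sketch below uses made-up numbers and compares the four measures before and after a single extreme value is added:

```r
# Hypothetical sample, and the same sample with one extreme value appended
x     <- c(10, 12, 11, 13, 12, 11, 10, 12)
x_out <- c(x, 100)

# Collect the sensitive (mean, variance) and robust (median, IQR) measures
summarize_measures <- function(v) {
  c(mean = mean(v), variance = var(v), median = median(v), IQR = IQR(v))
}

res <- rbind(
  original     = summarize_measures(x),
  with_outlier = summarize_measures(x_out)
)
print(round(res, 2))
```

The mean and variance shift substantially once the outlier is included, while the median and IQR barely move.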

Finally, we went through a case study using stock price data. We started by downloading real data from Yahoo! Finance, then applied the EDA techniques from this chapter to analyze it, and finished by creating a correlation plot to indicate the strength of covariation between each pair of variables...
