You're reading from Hands-On Exploratory Data Analysis with R

Product type: Book
Published in: May 2019
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789804379
Edition: 1st

Authors (2):

Radhika Datar

Radhika Datar has more than 5 years' experience in software development and content writing. She is well versed in frameworks such as Python, PHP, and Java, and regularly provides training on them. She has been working with Educba and Eduonix as a training consultant since June 2016, while also working as a freelance academic writer in data science and data analytics. She obtained her master's degree from the Symbiosis Institute of Computer Studies and Research and her bachelor's degree from K. J. Somaiya College of Science and Commerce.

Harish Garg

Harish Garg is a Principal Software Developer, author, and co-founder of a software development and training company, Bignumworks. Harish has more than 19 years of experience in a wide variety of technologies, including blockchain, data science, and enterprise software. During this time, he has worked for companies such as McAfee and Intel.

Time Series Datasets

This chapter introduces a time series dataset and shows how to analyze it using EDA techniques, including autocorrelation plots, spectrum plots, complex demodulation amplitude plots, and phase plots. We will first learn how to read and tidy the data, then how to map and understand the underlying structure of the dataset and identify its important variables. We will also learn how to detect outliers and other anomalies using Grubbs' test, and we will cover parsimonious models and Bartlett's test.

The following topics will be covered in this chapter:

  • Introducing and reading in the data
  • Cleaning and tidying up the data
  • Mapping and understanding the underlying structure of the dataset, and identifying the most important variables...

Technical requirements

You should have hands-on experience or knowledge of the following points before getting started with this chapter:

  • R programming language
  • RStudio
  • R packages (including readr, readxl, jsonlite, httr, rvest, DBI, dplyr, stringr, forcats, lubridate, hms, blob, ggplot2, and knitr)

Introducing and reading the dataset

In this chapter, we will focus on a dataset consisting of the responses of a gas multi-sensor device. Hourly response averages are recorded along with gas concentrations. This dataset is referred to as the Air Quality dataset.

You can download the file from the following link:

https://github.com/PacktPublishing/Hands-On-Exploratory-Data-Analysis-with-R/tree/master/ch07.

For more information, refer to the following link:
https://archive.ics.uci.edu/ml/machine-learning-databases/00360/.

In this section, we will focus on reading the attributes of the dataset and converting the .csv or .xls file into a data frame in the R workspace (by workspace, we mean the R environment where various data manipulations can be performed). As...
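As a sketch of the reading step, the snippet below loads a semicolon-delimited file with readr. The delimiter and comma decimal mark reflect the UCI copy of the Air Quality CSV; a tiny inline sample stands in for the real download, so adjust the file name and columns for your own copy.

```r
# Sketch: reading a semicolon-delimited file (as in the UCI Air Quality
# CSV, which uses ";" as the separator and "," as the decimal mark).
# A small inline sample is written first so this runs standalone.
library(readr)

sample_csv <- "Date;Time;CO(GT)\n10/03/2004;18.00.00;2,6\n10/03/2004;19.00.00;2,0"
writeLines(sample_csv, "AirQualitySample.csv")

AirQualityUCI <- read_delim("AirQualitySample.csv",
                            delim = ";",
                            locale = locale(decimal_mark = ","))
print(dim(AirQualityUCI))   # 2 rows, 3 columns

# For the .xlsx version, readxl can be used instead:
# library(readxl)
# AirQualityUCI <- read_excel("AirQualityUCI.xlsx")
```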

Cleaning the dataset

Data cleaning, or tidying up the data, is the process of transforming raw data into a consistent form that is straightforward to analyze. The R programming language includes a comprehensive set of tools specifically designed to clean data effectively. We will clean the dataset by following these steps:

  1. Include the libraries that are required to clean and tidy up the dataset:
> library(dplyr) 
> library(tidyr)

  2. Analyze the summary of our dataset, which will help us to focus on the attributes we need to work on:
> summary(AirQualityUCI) 
      Date                          Time                         CO(GT)         PT08.S1(CO)      NMHC(GT)      
 Min.   :2004-03-10 00:00:00   Min.   :1899-12-31 00:00:00   Min.   :-200.00   Min.   :-200   Min.   ...
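The `Min. : -200` values in the summary above suggest that missing readings in this dataset are encoded as the sentinel value -200. A common cleaning step, sketched below with a small synthetic frame standing in for `AirQualityUCI`, is to recode those sentinels as `NA` using dplyr:

```r
# Sketch: recode the -200 sentinel (used for missing readings in the
# Air Quality data) as NA across all numeric columns. A tiny synthetic
# tibble stands in for the real AirQualityUCI data frame.
library(dplyr)

AirQualityUCI <- tibble(`CO(GT)`      = c(2.6, -200, 2.2),
                        `PT08.S1(CO)` = c(1360, 1292, -200))

cleaned <- AirQualityUCI %>%
  mutate(across(where(is.numeric), ~ na_if(.x, -200)))

print(sum(is.na(cleaned)))   # 2 missing values after recoding
```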

Mapping and understanding structure

This section involves understanding each attribute in depth, which is important for making sense of the dataset.

We need to carry out the following steps to understand the data structure and map the attributes:

  1. Try to get a feel for the data as per the attribute structure:
> class(AirQualityUCI) 
[1] "tbl_df"     "tbl"        "data.frame" 

The output shows that the dataset is a tibble (tbl_df), a tabular variant of a data frame.

  2. Check the dimensions of the dataset:
> dim(AirQualityUCI) 
[1] 9357 15

This shows that the dataset comprises 9357 rows and 15 columns. The column structure has already been discussed in the first section.

  3. View the column names of the dataset. We need to check whether these correspond to the records included in the Excel file:
> colnames(AirQualityUCI) 
[1...

Hypothesis test

This section is all about hypothesis testing in R. A hypothesis test evaluates an assumption made by the researcher about the population from which the data was collected. The first step entails an introduction to statistical hypotheses in R; later, we will cover decision errors in R, along with one- and two-sample t-tests, U-tests, correlation, and covariance in R.

t-test in R

This is also known as Student's t-test, a method for comparing two samples. It is usually used to determine whether the means of two samples are significantly different. It is a parametric test, so the data should be normally distributed. R can handle the various versions of the t-test using the...
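A minimal sketch of the two-sample t-test with base R's `t.test`, using simulated data in place of the chapter's attributes:

```r
# Sketch: Welch two-sample t-test on simulated data. group_a and
# group_b are illustrative stand-ins for two attributes or subsets
# of the Air Quality data.
set.seed(42)
group_a <- rnorm(30, mean = 5,   sd = 1)
group_b <- rnorm(30, mean = 5.5, sd = 1)

result <- t.test(group_a, group_b)   # Welch two-sample t-test by default
print(result$p.value)

# One-sample and paired variants:
# t.test(group_a, mu = 5)
# t.test(group_a, group_b, paired = TRUE)
```

A small p-value would indicate that the two sample means differ significantly.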

Grubbs' test and checking outliers

In R programming, an outlier is an observation that differs markedly from most of the other observations. Outliers often arise from measurement errors in the data frame.

The following script detects the outliers for each attribute:

> outlierKD <- function(dt, var) { 
+     var_name <- eval(substitute(var),eval(dt)) 
+     na1 <- sum(is.na(var_name)) 
+     m1 <- mean(var_name, na.rm = T) 
+     par(mfrow=c(2, 2), oma=c(0,0,3,0)) 
+     boxplot(var_name, main="With outliers") 
+     hist(var_name, main="With outliers", xlab=NA, ylab=NA) 
+     outlier <- boxplot.stats(var_name)$out 
+     mo <- mean(outlier) 
+     var_name <- ifelse(var_name %in% outlier, NA, var_name) 
+     boxplot(var_name, main="Without outliers") 
+     hist...
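Grubbs' test itself (as opposed to the boxplot-based helper above) is available through the CRAN `outliers` package; its `grubbs.test` function checks whether the most extreme value in a sample is an outlier. A minimal sketch, using a small vector with a planted outlier:

```r
# Sketch: Grubbs' test via the CRAN `outliers` package. grubbs.test
# tests whether the single most extreme value is an outlier, assuming
# the rest of the data is approximately normal.
# install.packages("outliers")
library(outliers)

x <- c(2.1, 2.3, 2.2, 2.4, 2.0, 9.8)   # 9.8 is a planted outlier
grubbs.test(x)
# A small p-value suggests the extreme value is an outlier.
```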

Parsimonious models

Parsimonious models are simple models with great explanatory and predictive power. They explain the data with a minimum number of parameters, or predictor variables. MoEClust is an R package that fits finite Gaussian mixture of experts models, using a range of parsimonious covariance parameterizations fitted via the EM or CEM algorithm.

Follow these steps to create a range of parsimonious covariance models with our AirQualityUCI dataset:

  1. Install the package that is required to create a parsimonious model of our dataset:
> install.packages('devtools') 
Installing package into 'C:/Users/Radhika/Documents/R/win-library/3.5' 
(as 'lib' is unspecified) 
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/devtools_2.0.2.zip' 
Content type 'application/zip' length 383720 bytes (374 KB) 
downloaded 374 KB ...
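Once the packages are installed, fitting such a model might look like the hedged sketch below. The `MoE_clust` function name and its `G` argument (number of mixture components to compare) are taken from the MoEClust package documentation; the synthetic `RH`/`AH` columns stand in for the corresponding attributes of the Air Quality data.

```r
# Hedged sketch: fitting parsimonious Gaussian mixture of experts
# models with MoEClust. The data frame below is synthetic, mimicking
# the RH (relative humidity) and AH (absolute humidity) attributes.
# install.packages("MoEClust")
library(MoEClust)

set.seed(123)
data <- data.frame(RH = rnorm(100, mean = 50, sd = 10),
                   AH = rnorm(100, mean = 1,  sd = 0.2))

fit <- MoE_clust(data, G = 1:2)   # compare 1 and 2 components
summary(fit)
```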

Bartlett's test

Bartlett's test is useful when comparing two or more samples to determine whether they are drawn from populations with equal variance. Bartlett's test works well for normally distributed data. The null hypothesis is that the variances are equal; the alternative hypothesis is that they are not. The test is useful for checking the equal-variance assumption of analysis of variance (ANOVA).

The user can perform Bartlett's test with the bartlett.test function in R. The normal syntax is as follows:

> bartlett.test(values~groups, dataset)  

Here, the parameters refer to the following:

  • values: The name of the variable containing the data value
  • groups: The name of the variable that specifies which sample each value belongs to

If the data is in an unstacked form (with...
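A minimal sketch of the stacked (`values ~ groups`) form described above, using simulated data in place of the chapter's dataset:

```r
# Sketch: Bartlett's test on three simulated groups, where group C has
# a deliberately larger variance than groups A and B.
set.seed(1)
dataset <- data.frame(
  values = c(rnorm(20, sd = 1), rnorm(20, sd = 1), rnorm(20, sd = 3)),
  groups = rep(c("A", "B", "C"), each = 20)
)

result <- bartlett.test(values ~ groups, data = dataset)
print(result$p.value)   # small p-value: variances differ across groups
```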

Data visualization

In this section, we will focus on the creation of the following plots:

  • Autocorrelation
  • Spectrum
  • Phase

Autocorrelation plots

Autocorrelation plots are used to check the randomness of a dataset. Randomness is assessed by computing autocorrelations of the data values at varying time lags. If the data is random, the autocorrelations should be near zero for any and all time-lag separations.

The Acf function computes (and, by default, plots) an estimate of the autocorrelation function of a (possibly multivariate) time series. The syntax is as follows:

> Acf(x, lag.max = NULL, type = c("correlation", "covariance","partial"), plot = TRUE...
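Base R's `acf` (from the stats package; `forecast::Acf` is a wrapper with similar arguments) illustrates the point: white noise shows near-zero autocorrelations at every non-zero lag, while an autocorrelated series does not. A runnable sketch with simulated series standing in for the Air Quality attributes:

```r
# Sketch: comparing the autocorrelation function of white noise with
# that of a simulated AR(1) series. plot = FALSE returns the estimates
# instead of drawing the plot.
set.seed(7)
noise <- rnorm(200)
ar1   <- arima.sim(model = list(ar = 0.8), n = 200)

res_noise <- acf(noise, lag.max = 20, plot = FALSE)
res_ar1   <- acf(ar1,   lag.max = 20, plot = FALSE)

print(res_noise$acf[2])   # lag-1 autocorrelation, near 0 for noise
print(res_ar1$acf[2])     # clearly positive for the AR(1) series
```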

Summary

In this chapter, we focused on applying EDA techniques to a univariate dataset that lends itself well to time series analysis. The main illustration was the measurement of pollution, modeled with parsimonious models using the RH and AH attributes. We also listed some of the packages available for reading the various kinds of attributes within the dataset into R. There are many options, and even the options we have listed have a wide functionality that we will cover and use as we progress through the book.

In the next chapter, we will cover multivariate datasets. Multivariate datasets include a combination of fixed and continuous variables that help us with further exploratory analysis.
