Hands-On Exploratory Data Analysis with R

By Radhika Datar, Harish Garg

About this book

Hands-On Exploratory Data Analysis with R will help you build not just a foundation but also expertise in the elementary ways to analyze data. You will learn how to understand your data and summarize its main characteristics. You'll also uncover the structure of your data, and you'll learn graphical and numerical techniques using the R language.

This book covers the entire exploratory data analysis (EDA) process: data collection, generating statistics, examining distributions, and invalidating hypotheses. As you progress through the book, you will learn how to set up a data analysis environment with tools such as ggplot2, knitr, and R Markdown, and how to apply techniques such as the DOE scatter plot to datasets such as SML2010 for multifactor, optimization, and regression problems.

By the end of this book, you will be able to successfully carry out a preliminary investigation on any dataset, identify hidden insights, and present your results in a business context.

Publication date: May 2019
Publisher: Packt
Pages: 266
ISBN: 9781789804379

 

Chapter 1. Setting Up Our Data Analysis Environment

In this chapter, we will look at how Exploratory Data Analysis (EDA) benefits businesses and has a significant impact on almost all vertical markets.

EDA is an approach to analyzing datasets in order to summarize their main characteristics, often using visual methods. We will list the R packages and tools that are required for EDA. We will also focus on the installation procedure and on setting up these packages for the EDA environment from an R perspective.

The following topics will be covered in this chapter:

  • The benefits of EDA across vertical markets
  • The most popular R packages for EDA
  • Installing the required R packages and tools

 

Technical requirements


R is open source software and is platform independent. All you need to do is download the installers from the following sources:

  • R: the Comprehensive R Archive Network (CRAN), at https://cran.r-project.org
  • RStudio: the RStudio website, at https://www.rstudio.com

To install R on your system, download the installer for your operating system from CRAN, run it, and accept the default options. Then install RStudio in the same way.

The first time you open the RStudio user interface after installation, you will see the default layout, with the Console pane alongside the Source, Environment, and Files panes.
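To confirm that R and RStudio are working, you can run the following commands in the RStudio Console. This is only a minimal sanity check using built-in functions:

R.version.string   # prints the installed R version
sessionInfo()      # shows platform details and loaded packages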

 

The benefits of EDA across vertical markets


Every organization today produces and relies on a lot of data in their everyday processes. Before making assumptions and decisions based on this data, organizations need to be able to understand it. EDA enables data analysts and data scientists to bring this information to the right people. It is the most important step on which a data-driven organization should focus its energy and resources.

Having practical tools in hand for carrying out EDA helps data analysts and data scientists produce reproducible and insightful data analysis results. R is one of the most popular data analysis environments, so it makes sense to equip your data analysis teams with powerful R techniques to make the most of their EDA skills.

At the time of writing this book, there are more than 13,000 R packages available on CRAN. You can get R packages for all kinds of tasks and domains. For our purposes, we will concentrate on a particular set of R packages that are considered the best by the R community for EDA. Some of the packages that we are going to cover may not be directly related to EDA, but they are relevant for other stages of dealing with the data.

We will introduce these packages briefly in this chapter and go into more detail as the book progresses. The different stages of dealing with data are as follows:

  • Pre-modeling stage: This stage involves preparing and manipulating the data frame through data visualization, data transformation, missing value imputation, outlier detection, feature selection, and dimension reduction.
  • Modeling stage: This is the intermediate stage, which involves continuous regression, ordinal regression, classification, clustering, time series analysis, and survival analysis.
  • Post-modeling stage: This is the final stage, where interpreting the output is the highest priority. It covers interpreting the results of algorithms such as clustering, classification, and regression.
 

Manipulating data


Before you can start exploring your data, you first need to import it into your data analysis environment. Data comes in many forms, ranging from plain text in comma-separated value files to binary data in databases. Different R packages are equipped to handle these different kinds of data expertly and to import them almost ready for use in our environment. Since we are using R and RStudio, we will describe some of the most powerful R packages for importing data in the following sections, with a short example after the list:

  • readr: readr can be used to read flat, rectangular data into R. It works with both comma-separated and tab-separated values.
  • readxl: We can use the readxl package to read data from MS Excel files.
  • jsonlite: Web services have increasingly started to provide data in a JSON format. The jsonlite package is a good way to import this kind of data into R.
  • httr, rvest: httr and rvest are very good packages for getting data from the web, either from web APIs or by web scraping.
  • DBI: DBI is used to read data from relational databases into R.
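As a brief illustration of how these import packages are used, the following sketch assumes hypothetical files named sales.csv and sales.xlsx in your working directory; only the JSON example will run without them:

library(readr)
library(readxl)
library(jsonlite)

csv_data <- read_csv("sales.csv")        # flat, comma-separated data (hypothetical file)
excel_data <- read_excel("sales.xlsx")   # first sheet of an MS Excel workbook (hypothetical file)
json_data <- fromJSON('[{"product": "A", "units": 10}]')   # a JSON array of records becomes a data frame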

Examining, cleaning, and filtering data

The next steps after importing the data are to examine it and check for missing or erroneous values. We then need to clean the data and apply filters and selections. Different kinds of datasets need different approaches to carry out these steps. R has powerful packages to handle this; some of them are listed as follows, with a short sketch after the list:

  • dplyr: dplyr is a powerful R package that provides methods to make examining, cleaning, and filtering data fast and easy.
  • tidyr: The tidyr package helps to organize messy data for easier data analysis.
  • stringr: The stringr package provides methods and techniques of working with string data efficiently.
  • forcats: Factors are widely used while doing data analysis in R. The forcats package makes it easy to work with factors.
  • lubridate: lubridate makes wrangling date-time data quick and easy.
  • hms: hms is a great package for handling datasets that include time-of-day values.
  • blob: Not all data comes stored as plain text; you sometimes have to deal with binary data formats. The blob package makes this easy.
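To give a flavor of these packages, the following sketch cleans and filters a small, made-up data frame; the column names are purely illustrative:

library(dplyr)
library(tidyr)

raw <- data.frame(name = c("Ann", "Bob", NA), score = c(90, NA, 75))

clean <- raw %>%
  drop_na(name) %>%                  # tidyr: drop rows with a missing name
  mutate(passed = score >= 80) %>%   # dplyr: add a derived logical column
  filter(!is.na(score))              # dplyr: keep only rows with a recorded score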

Visualizing data

Visualizing data is one of the best ways to carry out graphical EDA. Plots and charts reveal facts about the data that may not be obvious from quantitative EDA techniques alone. The following packages help here, and a short ggplot2 example follows the list:

  • ggplot2: One of the best packages to visualize data in R is ggplot2. It is so popular with the R community that it has almost become an industry standard.
  • GGally: This is another package that helps visualize data stored in a data frame. It includes features such as creating scatterplot matrices.
  • scatterplot3d: This package helps create 3D scatter plots, adding another dimension to your visualizations.
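As a quick example of graphical EDA, the following sketch uses ggplot2 and the built-in mtcars dataset to plot weight against fuel efficiency:

library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                                 # one point per car
  labs(title = "Weight vs. miles per gallon",
       x = "Weight (1,000 lbs)", y = "Miles per gallon")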

Creating data reports

Once you have finished exploring the data and you are ready to present your results, you need a way to put your observations, code, and visualizations into a great-looking report. 

The following packages help us create compelling reports in R, and you should make sure they are installed in your environment; a minimal example follows the list:

  • knitr: The knitr package allows us to generate dynamic reports. It also has a lot of other functionalities that make reports easy to read for both technical and non-technical audiences in a wide variety of formats.
  • R Markdown: The R Markdown package (rmarkdown) allows us to keep text, code, and graphs in one place. The resulting document is the input that knitr processes.
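As a minimal sketch of how these two packages work together, the following R Markdown skeleton mixes narrative text with an R code chunk. Saved as a file such as report.Rmd (a file name chosen here for illustration), it can be rendered with rmarkdown::render("report.Rmd"):

---
title: "EDA Report"
output: html_document
---

This report summarizes the built-in mtcars dataset.

```{r mtcars-summary}
summary(mtcars)   # numeric summary of each column
```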
 

Installing the required R packages and tools


R packages can be installed in two ways: from the Terminal or from inside RStudio. Let's take a look at these.

Installing R packages from the Terminal

To install R packages from the Terminal, follow these steps:

  1. Open the Terminal and start an R session by typing R
  2. Type and run the following command. Make sure to replace packagename1 with an actual package name, such as dplyr:
install.packages("packagename1")

Installing R packages from inside RStudio

To install R packages from RStudio, follow these steps:

  1. Open RStudio
  2. Click on Tools from the menu bar and then click on Install Packages...
  3. In the Install Packages dialog box, type the package name in the Packages text box and click on the Install button

You can also install multiple packages at the same time, both from the R command line and from RStudio. Just pass a character vector of package names to install.packages():

install.packages(c("packagename1", "packagename2"))

Make sure to install all the packages that are covered in this chapter before proceeding to the next chapter.
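One way to install everything covered in this chapter in a single command is the following; the package names are exactly those listed in the earlier sections (rmarkdown is the package behind R Markdown):

install.packages(c("readr", "readxl", "jsonlite", "httr", "rvest", "DBI",
                   "dplyr", "tidyr", "stringr", "forcats", "lubridate", "hms", "blob",
                   "ggplot2", "GGally", "scatterplot3d", "knitr", "rmarkdown"))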

 

Summary


In this chapter, we have learned about the benefits that EDA can bring to businesses across various verticals. We introduced the R packages that will be used in this book to teach concepts related to EDA. We also learned how to set up and install these packages using both the Terminal and RStudio.

The next chapter will demonstrate practical, hands-on code examples that show how to handle reading all kinds of data into R for EDA. We will cover how to use advanced options while importing datasets, including delimited data, Excel data, JSON data, and data from web APIs. We will also look at how to scrape and read in data from the web and how to connect to relational databases from R. We will use R packages such as readr, readxl, jsonlite, httr, rvest, and DBI.

About the Authors

  • Radhika Datar

    Radhika Datar has more than 5 years' experience in software development and content writing. She is well versed in languages such as Python, PHP, and Java, and regularly provides training on them. She has been working with Educba and Eduonix as a training consultant since June 2016, while also working as a freelance academic writer in data science and data analytics. She obtained her master's degree from the Symbiosis Institute of Computer Studies and Research and her bachelor's degree from K. J. Somaiya College of Science and Commerce.

  • Harish Garg

    Harish Garg is a Principal Software Developer, author, and co-founder of Bignumworks, a software development and training company. Harish has more than 19 years of experience in a wide variety of technologies, including blockchain, data science, and enterprise software. During this time, he has worked for companies such as McAfee and Intel.

