Exploratory data analysis (EDA) is an approach to analyzing datasets in order to summarize their main features, mostly with visual methods. We will list the R packages and tools that are required to do EDA. We will also focus on the installation procedure and on setting up the packages for the EDA environment from an R perspective.
The following topics will be covered in this chapter:
- The benefits of EDA across vertical markets
- The most popular R packages for EDA
Installing the required R packages and tools
R is open source software that is platform independent. All you need to do is download the installers from the links given below.
The following steps are used to set up R on your system:
- You need to have the R language installed. Download the R installer from here: https://cran.r-project.org/.
- We recommend using RStudio. If you don't already have it installed, you can get it from the following link: https://www.rstudio.com/products/rstudio/download.
- Check that R and RStudio are working.
- Install the R packages required for the workshop.
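One quick way to check that R is working is to run a few base R commands in the R console or in the RStudio console pane. The following is only a minimal sketch; it relies on base R alone and makes no assumption about which R version you installed:

```r
# Print the version string of the running R installation
r_version <- R.version.string
print(r_version)

# Show the folders where installed packages are stored
lib_paths <- .libPaths()
print(lib_paths)

# Evaluate a trivial expression to confirm that the interpreter responds
stopifnot(sum(1:10) == 55)
```

If these commands print a version string and at least one library path without errors, the basic installation is in place.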
The first time you open the RStudio user interface after installation, it will look as shown in the following screenshot:
- You will also need to have prior knowledge of the R programming language. Packt has a wide range of books and video titles that are available for this purpose.
- The code for this chapter is available at the following link: https://github.com/PacktPublishing/Hands-On-Exploratory-Data-Analysis-with-R.
Every organization today produces and relies on a lot of data in its everyday processes. Before making assumptions and decisions based on this data, organizations need to be able to understand it. EDA enables data analysts and data scientists to bring this information to the right people. It is the most important step on which a data-driven organization should focus its energy and resources.
Having practical tools in hand for carrying out EDA helps data analysts and data scientists produce reproducible and well-informed data analysis results. R is one of the most popular data analysis environments, so it makes sense to equip your data analysis teams with powerful R techniques to make the most of their EDA skills.
At the time of writing this book, there are more than 13,000 R packages available according to CRAN. You can get R packages for all kinds of tasks and domains. For our purposes, we will concentrate on a particular set of R packages that the R community considers the best for EDA. Some of the packages that we are going to cover may not be directly related to EDA, but they are relevant for other stages of dealing with the data, as indicated by the following diagram:
We will introduce these packages briefly in this chapter and go into more detail as the book progresses. The different stages are as follows:
- Pre Modeling Stage: This stage involves the manipulation of the data frame based on Data Visualization, Data Transformation, Missing Value Imputations, Outlier Detection, Feature Selection, and Dimension Reduction.
- Modeling Stage: This is an intermediate stage that involves Continuous Regression, Ordinal Regression, Classification, Clustering, and Time Series with Survival.
- Post Modeling Stage: This is the final stage, in which interpreting the output takes priority. It includes the implementation of various algorithms such as clustering, classification, and regression.
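To make two of the pre-modeling tasks listed above more concrete, the following sketch shows one simple form of missing value imputation and outlier detection using base R only. The data is made up, and median imputation and the 1.5 x IQR rule are just common illustrative choices, not necessarily the methods used later in this book:

```r
# Made-up measurements with two missing values and one extreme value
scores <- data.frame(
  id    = 1:8,
  value = c(10, 12, NA, 11, 13, 95, 12, NA)
)

# Missing value imputation: replace NA with the median of the observed values
observed_median <- median(scores$value, na.rm = TRUE)
scores$value_imputed <- ifelse(is.na(scores$value), observed_median, scores$value)

# Outlier detection: flag values beyond 1.5 * IQR from the quartiles
quartiles <- quantile(scores$value_imputed, c(0.25, 0.75))
iqr <- quartiles[[2]] - quartiles[[1]]
lower_fence <- quartiles[[1]] - 1.5 * iqr
upper_fence <- quartiles[[2]] + 1.5 * iqr
scores$is_outlier <- scores$value_imputed < lower_fence |
  scores$value_imputed > upper_fence

print(scores)
```

The same two steps can be expressed with the data manipulation packages introduced in this chapter; base R is used here only so that the sketch runs without any extra installation.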
Before you can start exploring your data, you first need to import it into your data analysis environment. There are many types of data, ranging from plain data in comma-separated value files to binary data in databases. Different R packages are equipped to handle these different kinds of data expertly and to import them almost ready for use in our environment. Since we are using R and RStudio, we will describe some of the most powerful R packages to import data in the following sections:
- readr: This package can be used to read flat, rectangular data into R. It works with both comma-separated and tab-separated values.
- readxl: We can use the readxl package to read data from MS Excel files.
- jsonlite: Web services have increasingly started to provide data in JSON format. The jsonlite package is a good way to import this kind of data into R.
- rvest: This is a very good package for getting data from the web, either from web APIs or by web scraping.
- DBI: This package is used to read data from relational databases into R.
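As a rough illustration of two of the import packages listed above, the following sketch reads a small comma-separated file with readr and parses a JSON string with jsonlite. It assumes that both packages are already installed; the file contents, column names, and JSON text are invented for this example:

```r
library(readr)
library(jsonlite)

# Write a tiny comma-separated file to a temporary location (made-up data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("city,temp_c", "Oslo,4.5", "Cairo,29.1"), csv_path)

# read_csv() parses the file into a tibble; column types are stated explicitly
weather <- read_csv(
  csv_path,
  col_types = cols(city = col_character(), temp_c = col_double())
)

# fromJSON() turns a JSON array of objects into a data frame
json_text <- '[{"id": 1, "tag": "a"}, {"id": 2, "tag": "b"}]'
tags <- fromJSON(json_text)

print(weather)
print(tags)
```

In both cases, the result is a rectangular object that can be passed straight to the data wrangling packages.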
The next steps after importing the data are to examine it and check for missing or erroneous data. We then need to clean the data and apply filters and selections. Different kinds of datasets need different approaches to carry out these steps. R has powerful packages to handle this and some of them are as follows:
- dplyr: This is a powerful R package that provides methods to make examining, cleaning, and filtering data fast and easy.
- tidyr: The tidyr package helps to organize messy data for easier data analysis.
- stringr: The stringr package provides methods and techniques for working with string data efficiently.
- forcats: Factors are widely used while doing data analysis in R. The forcats package makes it easy to work with factors.
- lubridate: This package makes wrangling date-time data quick and easy.
- hms: This is a great package for handling datasets that include time-of-day values.
- blob: Not all data comes stored in plain ASCII text; you sometimes have to deal with binary data formats. The blob package makes this easy.
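The following sketch suggests how several of these packages might be combined to clean a small, deliberately messy table. It assumes that dplyr, tidyr, stringr, and lubridate are installed; the `sales` data and its column names are hypothetical:

```r
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)

# Made-up messy data: padded names, dates stored as text, one column per year
sales <- tibble(
  region = c(" north", "south ", "east"),
  opened = c("2019-03-01", "2018-11-15", "2020-01-20"),
  y2020  = c(120, NA, 80),
  y2021  = c(150, 95, 70)
)

tidy_sales <- sales %>%
  mutate(
    region = str_trim(region),   # stringr: strip stray whitespace
    opened = ymd(opened)         # lubridate: parse text into Date values
  ) %>%
  pivot_longer(                  # tidyr: one row per region and year
    cols = c(y2020, y2021),
    names_to = "year",
    values_to = "revenue"
  ) %>%
  filter(!is.na(revenue)) %>%    # dplyr: drop rows with missing revenue
  arrange(desc(revenue))         # dplyr: largest revenue first

print(tidy_sales)
```

Each step handles one kind of problem, which is why these packages are usually used together rather than in isolation.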
Visualizing data is one of the best ways to carry out graphical EDA. Visualizing data with plots and charts allows us to discover facts about the data that may not be very obvious when applying quantitative EDA techniques:
- ggplot2: One of the best packages to visualize data in R is ggplot2. It is so popular with the R community that it has almost become an industry standard.
- GGally: This is another package that helps visualize data held in a data frame. It includes various plot features, such as creating scatter plot matrices.
- scatterplot3d: This package helps create 3D scatter plots, which adds more visualization options.
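As a small illustration of graphical EDA, the following sketch builds a scatter plot with ggplot2. It assumes that ggplot2 is installed and uses the `mtcars` dataset that ships with R; the choice of variables and labels is ours and is only meant as an example:

```r
library(ggplot2)

# mtcars ships with R, so no external data is needed
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    colour = "Cylinders"
  )

# Printing the plot object renders it on the active graphics device
print(p)
```

A plot like this can reveal a relationship between two variables at a glance, which is harder to see in a table of summary statistics.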
The following packages help us create reports in R, and you must install them in your environment:
- knitr: The knitr package allows us to generate dynamic reports. It also has a lot of other functionalities that make reports easy to read for both technical and non-technical audiences in a wide variety of formats.
- R Markdown: The R Markdown package allows us to keep text, data, and graphs in one place. It also provides an input to the
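As a hedged illustration of a dynamic report, the following sketch writes a very small R Markdown file to a temporary location and processes it with `knitr::knit()`. It assumes that knitr is installed; the report text and file names are invented, and the code fence is built with `strrep()` only so that it can be written from within a script:

```r
library(knitr)

# Build a three-backtick fence in code so it can be written into the file
fence <- strrep("`", 3)

# A tiny R Markdown document: one heading, one sentence, one R chunk
rmd_lines <- c(
  "# Speed summary",
  "",
  "The built-in cars dataset is summarized below.",
  "",
  paste0(fence, "{r}"),
  "summary(cars$speed)",
  fence
)

rmd_path <- tempfile(fileext = ".Rmd")
md_path  <- sub("\\.Rmd$", ".md", rmd_path)
writeLines(rmd_lines, rmd_path)

# knit() runs the chunk and writes Markdown containing the code and its output
knit(rmd_path, output = md_path, quiet = TRUE)
report <- readLines(md_path)
cat(report, sep = "\n")
```

The resulting Markdown file contains the text, the code, and the computed summary in one place.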
- Open the Terminal
- Type and run the following command. Make sure to replace packagename1 with an actual package name, such as
- Click on Tools from the menu bar and then click on
- In the Install Packages dialog box, type in the package name in the Packages text box and click on the
Make sure to install all the packages that are covered in this chapter before proceeding to the next chapter.
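One possible way to check which of these packages are still missing is sketched below. The helper `install_missing()` is hypothetical (it is not part of R or of this book's code), the CRAN mirror URL is an assumption, and `rmarkdown` is the CRAN name of the R Markdown package:

```r
# Packages named in this chapter
chapter_packages <- c(
  "readr", "readxl", "jsonlite", "rvest", "DBI",
  "dplyr", "tidyr", "stringr", "forcats", "lubridate", "hms", "blob",
  "ggplot2", "GGally", "scatterplot3d",
  "knitr", "rmarkdown"
)

# Hypothetical helper: report, and optionally install, the packages that are
# not yet available in the current library paths
install_missing <- function(pkgs, dry_run = TRUE,
                            repos = "https://cloud.r-project.org") {
  installed <- rownames(installed.packages())
  to_install <- setdiff(pkgs, installed)
  if (!dry_run && length(to_install) > 0) {
    install.packages(to_install, repos = repos)
  }
  to_install
}

# Dry run: list what would be installed without changing anything
still_missing <- install_missing(chapter_packages, dry_run = TRUE)
print(still_missing)
```

Calling `install_missing(chapter_packages, dry_run = FALSE)` would then install whatever the dry run reported, which requires an internet connection.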
In this chapter, we have learned about the benefits that EDA can bring to businesses across various verticals. We introduced the R packages that will be used in this book to teach concepts related to EDA. We also learned how to set up and install these packages using both the Terminal and RStudio.
The next chapter will demonstrate practical, hands-on code examples that show how to handle reading all kinds of data into R for EDA. We will cover how to use advanced options while importing datasets, including delimited data, Excel data, JSON data, and data from web APIs. We will also look at how to scrape and read in data from the web and how to connect to relational databases from R. We will use R packages such as