Hands-On Data Science with R

By Vitor Bianchi Lanzetta, Doug Ortiz, Nataraj Dasgupta and 1 more
About this book
R is one of the most widely used programming languages for statistical computing, and when used for data science, it helps tackle the complexities of unstructured, real-world datasets. This book covers the entire data science ecosystem for aspiring data scientists, from zero to a level where you are confident enough to get hands-on with real-world data science problems. It starts with an introduction to data science and to popular R libraries for routine data science tasks. It covers all the important processes in data science, such as gathering data, cleaning it, and then uncovering patterns in it. You will explore machine learning algorithms, predictive analytical models, and finally deep learning algorithms. You will learn to use the most powerful visualization packages available in R so that you can easily derive insights from your data. Toward the end, you will also learn how to integrate R with Spark and Hadoop and perform large-scale data analytics without much complexity.
Publication date:
November 2018
Publisher
Packt
Pages
420
ISBN
9781789139402

 

Getting Started with Data Science and R

“It is a capital mistake to theorise before one has data.”
― Sir Arthur Conan Doyle, The Adventures of Sherlock Holmes

Data, like science, has been ubiquitous the world over since early history. The term data science is not generally taken to literally mean science with data, since without data there would be no science. Rather, it is a specialized field in which data scientists and other practitioners apply advanced computing techniques, usually along with algorithms or predictive analytics, to uncover insights that may be challenging to obtain with traditional methods.

Data science as a distinct subject has been proposed since the early 1960s by pioneers and thought leaders such as Peter Naur, Prof. Jeff Wu, and William Cleveland. Today, we have largely realized the vision that Prof. Wu and others had in mind when the concept first arose: data science as an amalgamation of computing, data mining, and predictive analytics, all leading up to deriving key insights that drive business and growth across the world today.

The driving force behind this has been the rapid and commensurate growth of computing capabilities and algorithms. Computing languages have also played a key role in supporting the emergence of data science, primary among them being the statistical language R.

In this introductory chapter, we will cover the following topics:

  • Introduction to data science and R
  • Active domains of data science
  • Solving problems with data science
  • Using R for data science
  • Setting up R and RStudio
  • Our first R program
 

Introduction to data science

The term data science, as mentioned earlier, was first proposed in the 1960s and 1970s by Peter Naur. In the late 1990s, Jeff Wu, while at the University of Michigan, Ann Arbor, proposed the term in a formal paper titled Statistics = Data Science? The paper, which Prof. Wu subsequently presented at the seventh series of P.C. Mahalanobis Lectures at the Indian Statistical Institute in 1998, raised some interesting questions about what an appropriate definition of statistics might be in light of the tasks that a statistician performed beyond numerical calculations.

In the paper, Prof. Wu highlighted the concept of a Statistical Trilogy, consisting of data collection, data modeling and analysis, and problem solving. In the sections that followed, he reflected on future directions, raising the prospects of neural network models for capturing complex, non-linear relationships, the use of cross validation to improve model performance, and data mining of large-scale data, among others. (Source: https://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf).

The paper, although written more than 20 years ago, reflects the foresight that a few academicians such as Dr. Wu had at the time. That vision has been realized in full, almost verbatim as it was proposed back then, both in thought and in practical concepts. A copy of Dr. Wu's paper is available at https://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf.

Key components of data science

The practice of data science requires the application of three distinct disciplines to uncover insights from data. These disciplines are as follows:

  • Computer science
  • Predictive analytics
  • Domain knowledge

The following diagram shows the core components of data science:

Computer science

During the course of performing data science, if large datasets are involved, the practitioner may spend a fair amount of time cleansing and curating the dataset. In fact, it is not uncommon for data scientists to spend the majority of their time preparing data for analysis. A commonly cited rule of thumb for a data science project is that 80% of the time is spent on data management and the remaining 20% on the actual analysis of the data.

While this may sound overly general, the growth of big data, that is, large-scale datasets, usually in the range of terabytes, has meant that it takes considerable time and effort to extract data before the actual analysis takes place. Real-world data is seldom perfect. Issues with real-world data range from missing variables to incorrect entries and other deficiencies. The size of datasets also poses a formidable challenge.

Technologies such as Hadoop, Spark, and NoSQL databases have addressed the needs of the data science community for managing and curating terabytes, if not petabytes, of information. These tools are usually the first step in the overall data science process that precedes the application of algorithms on the datasets using languages such as R, Python and others.

Hence, as a first step, the data scientist generally should be capable of working with datasets using contemporary tools for large-scale data mining. For instance, if the data resides in a Hadoop cluster, the practitioner must be able and willing to perform the work necessary to retrieve and curate the data from the source systems.

Second, once the data has been retrieved and curated, the data scientist should be aware of the requirements of the algorithm from a computational perspective and determine whether the system has the necessary resources to execute these algorithms efficiently. For instance, if the algorithms can take advantage of multi-core computing facilities, the practitioner should use the appropriate packages and functions to leverage them. This may mean the difference between getting results in an hour versus requiring an entire day.
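
As a rough illustration of the multi-core point, the following sketch uses the parallel package that ships with R. The task (boot_mean), the seeds, and the choice of two workers are hypothetical and chosen only for demonstration; a real workload would be far heavier, and the speed-up depends on the machine.

```r
library(parallel)  # part of the standard R distribution

# Hypothetical task: the mean of 1,000 random draws for a given seed.
# In practice this would be an expensive model fit or simulation.
boot_mean <- function(seed) {
  set.seed(seed)
  mean(rnorm(1000))
}

seeds <- 1:8

# Sequential baseline: one task after another on a single core
serial <- vapply(seeds, boot_mean, numeric(1))

# Parallel version: distribute the same tasks over a small socket cluster.
# detectCores() reports how many cores the machine has; two workers are
# used here as an assumption that holds on most systems.
n_workers <- 2
cl <- makeCluster(n_workers)
par_res <- unlist(parLapply(cl, seeds, boot_mean))
stopCluster(cl)

# Both approaches should produce the same numbers
all.equal(serial, par_res)
```

For a task this small, the overhead of starting workers outweighs any gain; the pattern pays off only when each task takes a noticeable amount of time.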

Last, but not least, the creation of machine learning models will require programming in one or more languages. This in itself demands a level of knowledge and skill in applying algorithms and using appropriate data structures and other computer science concepts.

Predictive analytics (machine learning)

In popular media and literature, predictive analytics is known by various names. The terms are used interchangeably and often depend on personal preferences and interpretations. In this book, the terms predictive analytics, machine learning, and statistical learning are treated as synonymous, and refer to the practice of applying learning algorithms to data.

The algorithm could be as simple as a line of best fit, also known as linear regression, which you may have already used in Excel. Or it could be a complex deep learning model that implements multiple hidden layers and inputs. In both cases, the mere fact that a statistical model, that is, an algorithm, was applied to generate a prediction qualifies the usage as a practice of machine learning.
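
To make the simple end of that spectrum concrete, here is a minimal sketch of a line of best fit in R. It uses the mtcars dataset bundled with R; the choice of variables (fuel efficiency, mpg, predicted from weight, wt) and the new car weight are assumptions made purely for illustration.

```r
# Fit a line of best fit: miles per gallon as a function of car weight
fit <- lm(mpg ~ wt, data = mtcars)

# Intercept and slope of the fitted line
coef(fit)

# Use the fitted line to predict mpg for a hypothetical car
# weighing 3,000 lbs (wt is measured in units of 1,000 lbs)
new_car <- data.frame(wt = 3)
predicted_mpg <- predict(fit, newdata = new_car)
predicted_mpg
```

Even this one-line model counts as machine learning in the sense used here: a statistical model is fitted to data and then used to generate a prediction.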

In general, creating a machine learning model involves a sequence of steps such as the following:

  1. Cleanse and curate the dataset to extract the cohort on which the model will be built.
  2. Analyze the data using descriptive statistics, for example, distributions and visualizations.
  3. Feature engineering, preprocessing, and other steps necessary to add or remove variables/predictors.
  4. Split the data into a train and test set (for example, set aside 80% of the data for training and the remaining 20% for testing your model).
  5. Select appropriate machine learning models and create the model using cross validation.
  6. Select the final model after assessing the performance across models on a given (one or more) cost metric. Note that the model could be an ensemble, that is, a combination of more than one model.
  7. Perform predictions on the test dataset.
  8. Deliver the final model.
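
The steps above can be sketched in a few lines of base R. This is only an illustrative outline under several assumptions: it uses the small, already clean mtcars dataset, two hypothetical candidate models (both linear regressions), root mean squared error as the cost metric, and hand-written helper functions (rmse, cv_rmse) that are not part of any package.

```r
set.seed(42)                      # make the random split reproducible

data <- mtcars                    # steps 1-3: a small, already curated dataset
summary(data$mpg)                 # step 2: quick descriptive statistics

# Step 4: set aside 80% of the rows for training and 20% for testing
n <- nrow(data)
train_idx <- sample(n, size = floor(0.8 * n))
train <- data[train_idx, ]
test  <- data[-train_idx, ]

# Cost metric (assumption): root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Step 5: k-fold cross validation on the training set for one formula
cv_rmse <- function(formula, df, k = 5) {
  folds <- sample(rep(seq_len(k), length.out = nrow(df)))
  errs <- vapply(seq_len(k), function(i) {
    fit <- lm(formula, data = df[folds != i, ])
    rmse(df$mpg[folds == i], predict(fit, newdata = df[folds == i, ]))
  }, numeric(1))
  mean(errs)
}

# Two hypothetical candidate models
candidates <- list(simple = mpg ~ wt, larger = mpg ~ wt + hp)
cv_scores <- vapply(candidates, cv_rmse, numeric(1), df = train)

# Step 6: pick the candidate with the lowest cross-validated error
best <- names(which.min(cv_scores))
final_model <- lm(candidates[[best]], data = train)

# Step 7: predict on the held-out test set and measure the error
test_rmse <- rmse(test$mpg, predict(final_model, newdata = test))
test_rmse
```

The test set is touched only once, at the end, so that the reported error is an honest estimate of how the chosen model behaves on unseen data.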

The most commonly used languages for machine learning today are R and Python. In Python, the most popular package for machine learning is scikit-learn (http://scikit-learn.org), while in R, there are multiple packages, such as randomForest (random forests), gbm (Gradient Boosting Machine, or GBM), kernlab (which includes Support Vector Machines, or SVMs), and others.

Although Python's scikit-learn is extremely versatile and elaborate, and Python is in fact often the preferred language in production settings, the ease of use and diversity of packages in R give it an advantage in terms of early adoption and use for machine learning exercises.

The Comprehensive R Archive Network (CRAN) has a task view page titled CRAN Task View: Machine Learning & Statistical Learning (https://cran.r-project.org/web/views/MachineLearning.html) that summarizes some of the key packages in use today for machine learning using R.

Popular machine learning tools such as TensorFlow from Google (https://www.tensorflow.org), XGBoost (http://xgboost.readthedocs.io/en/latest/), and H2O (https://www.h2o.ai) have also released R packages that act as wrappers around the underlying machine learning algorithms implemented in the respective tools.

It is a common misconception that machine learning is just about creating models. While that is indeed the end goal, there is a subtle yet fundamental difference between a model and a good model. With the functions available today, it is relatively easy for anyone to create a model by simply running a couple of lines of code. A good model has business value, while a model built without the rigor of formal machine learning principles is practically unusable for all intents and purposes. A key requirement of a good machine learning model is the judicious use of domain expertise to evaluate results, identify errors, analyze them, and further refine using the insights that subject matter experts can provide. This is where domain knowledge plays a crucial and indispensable role.

Domain knowledge

More often than data scientists would like to admit, machine learning models produce results that are obvious and intuitive. For instance, we once conducted an elaborate analysis of physicians' prescribing behavior to find out the strongest predictor of how many prescriptions a physician would write in the next quarter. We used a broad set of input variables such as the physicians' locations, their specialties, hospital affiliations, prescribing history, and other data. In the end, the best performing model produced a result that we all knew very well. The strongest predictor of how many prescriptions a physician would write in the next quarter was the number of prescriptions the physician had written in the previous quarter! To filter out the truly meaningful variables and build a more robust model, we eventually had to engage someone who had extensive experience of working in the pharma industry. Machine learning models work best when produced with a hybrid approach, one that combines domain expertise with the sophistication of the models developed.

 

Active domains of data science

Data science plays a role in virtually all aspects of our day-to-day lives and is used across nearly all industries. The adoption of data science was largely spurred by the successes of start-ups such as Uber, Airbnb, and Facebook that rose rapidly and earned valuations of billions of dollars in a very short span of time.

Data generated by social media networks such as Facebook and Twitter, search engines such as Google and Yahoo!, and various other networks such as Pinterest and Instagram, led to a deluge of information about the personal tastes, preferences, and habits of individuals. Companies leveraged the information using various machine learning techniques to gain insights.

For example, Natural Language Processing (NLP) is a machine learning technique used to analyze textual data, such as comments posted on public forums, to extract users' interests. The users are then shown ads relevant to their interests, generating sales from which companies earn ad revenue. Image recognition algorithms are utilized to automatically identify objects in an image and serve the relevant images when users search for those objects on search engines.

The use of data science as a means to increase not only user engagement but also revenue has become a widespread phenomenon. Data science is prevalent in many domains today. The following sections are not all-inclusive, but discuss a few of the key industries in which data science plays an important role.

Finance

Data science has been used in finance, especially in trading, for many decades. Investment banks, especially trading desks, have employed complex models to analyze markets and make trading decisions. Some examples of data science as used in finance include:

  • Credit risk management: Assessing the creditworthiness of a user by analyzing the historical financial records, assets, and transactions of the user
  • Loan fraud: Identifying applications for credit or loans that may be fraudulent by analyzing the characteristics of the loan and the applicant
  • Market Basket Analysis: Understanding the correlation among stocks and other securities and formulating trading and hedging strategies
  • High-frequency trading: Analyzing trades and quotes to discover pricing inefficiencies and arbitrage opportunities

Healthcare

Healthcare and related fields, such as pharmaceuticals and life sciences, have also seen a gradual rise in the adoption and use of machine learning. A leading example has been IBM Watson. Developed in the late 2000s, IBM Watson rose to popularity after it won Jeopardy!, a popular quiz show in the US, in 2011. Today, IBM Watson is being used for clinical research, and several institutions have published preliminary results of success. (Source: http://www.ascopost.com/issues/june-25-2017/how-watson-for-oncology-is-advancing-personalized-patient-care/). The primary impediment to wider adoption has been the extremely high cost of using the system, usually with an uncertain return on investment. Generally, only companies that are well capitalized can invest in the technology.

More common uses of data science in healthcare include:

  • Epidemiology: Preventing the spread of diseases and other epidemiology related use cases are being solved with various machine learning techniques. A recent example of the use of clustering to detect the Ebola outbreak received attention, being one of the first times that machine learning was used in a medical use case very effectively. (Source: https://spectrum.ieee.org/tech-talk/biomedical/diagnostics/healthmap-algorithm-ebola-outbreak).
  • Health insurance fraud detection: The health insurance industry loses billions each year in the US due to fraudulent claims for insurance. Machine learning, and more generally, data science is being used to detect cases of fraud and reduce the loss incurred by leading health insurance firms. (Source: https://www.sciencedirect.com/science/article/pii/S1877042812036099).
  • Recommender engines: Algorithms that match patients with physicians are used to provide recommendations based on the patients' symptoms and doctor specialties.
  • Image recognition: Arguably, the most common use of data science in healthcare, image recognition algorithms are used for a variety of cases ranging from segmentation of malignant and non-malignant tumours to cell segmentation. (Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3159221/).

Pharmaceuticals

Although closely linked to the data science use cases in healthcare, data science use cases in pharma are geared toward the development of drugs, physician marketing, and treatment-related analysis. Examples of data science in pharma include the following:

Government

Data science is used by state and national governments for a wide range of uses. These include topics in cyber security, voter benefits, climate change, social causes, and other similar use cases that are geared toward public policy and public benefits.

Some examples include the following:

  • Climate change: This is one of the most popular topics among climate change proponents, and extensive machine learning related work is being conducted around the globe to detect and understand the causes of climate change. (Source: https://toolkit.climate.gov).
  • Cyber security: The use of extremely advanced machine learning techniques for national cyber security has been well known all over the world ever since such practices were disclosed by consultants at security firms a few years back. Security-related organizations employ some of the most advanced hardware and software stacks for detecting cyber threats and preventing hacking attempts. (Source: https://www.csoonline.com/article/2942083/big-data-security/cybersecurity-is-the-killer-app-for-big-data-analytics.html).
  • Social causes: The use of data science for a wide range of use cases geared toward social good is well known, thanks to several conferences that have been organized and papers that have been released on the topic. Examples include topics in urban analytics, power grids utilizing smart meters, and criminal justice. (Source: https://dssg.uchicago.edu/data-science-for-social-good-conference-2017/agenda/).

Manufacturing and retail

The manufacturing and retail industry has used data science to design better products, optimize pricing, and devise strategic marketing techniques. Some examples include the following:

Web industry

One of the earliest beneficiaries of data science was the web industry. Empowered by the collection of user-specific data from social networks, firms around the world employ algorithms to understand user behavior and generate targeted ads. Google, one of the earliest proponents of targeted ad marketing, today earns most of its revenue from ads, more than $95 billion in 2017. (Source: https://www.statista.com/statistics/266249/advertising-revenue-of-google/). The use of data science for web-related businesses is ubiquitous today, and companies such as Uber, Airbnb, Netflix, and Amazon have successfully navigated and made full use of this complex ecosystem, not only generating huge profits but also adding millions of new jobs directly or indirectly as a result.

  • Targeted ads: Click-through ads have been one of the prime areas of machine learning. By reading cookies saved on users' computers from various sites, other sites can assess the users' interests and accordingly decide which ads to serve when they visit new sites. As per online sources, the value of internet advertising is over $1 trillion and has generated over 10 million jobs in 2017 alone. (Source: https://www.iab.com/insights/economic-value-advertising-supported-internet-ecosystem/).
  • Recommender engines: Netflix, Pandora, and other movie and audio streaming services utilize recommender engines to understand which movies or music the viewer or listener would be interested in and make recommendations. The recommendations are often based on what other users with similar tastes might have already seen, and leverage recommender algorithms such as collaborative, content-based, and hybrid filtering.
  • Web design: Using A/B testing, mouse tracking, and other sophisticated techniques, web developers leverage data science to design better web pages, such as landing pages, and websites in general. A/B testing, for instance, allows developers to decide between different versions of the same web page and deploy accordingly.
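
As an illustration of the A/B testing idea in the last bullet, the following sketch compares the conversion rates of two versions of a page using prop.test from base R. The visitor and conversion counts are made up for the example, and a two-sample test of proportions is only one of several reasonable ways to analyze such an experiment.

```r
# Hypothetical results: visitors shown each page version and how many converted
visitors    <- c(A = 5000, B = 5000)
conversions <- c(A = 400,  B = 460)

# Two-sample test for equality of proportions (conversion rates)
result <- prop.test(x = conversions, n = visitors)

# Observed conversion rate for each version
observed_rates <- conversions / visitors
observed_rates

# A small p-value suggests the difference is unlikely to be due to chance alone
result$p.value
```

In practice, the sample size and the significance threshold would be decided before the experiment starts, rather than after looking at the data.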

Other industries

There are various other industries today that benefit from data science and as such, it has become so common that it would be impractical to list all, but at a high level, some of the others include the following:

  • Oil and natural gas for oil production
  • Meteorology for understanding weather patterns
  • Space research for detecting and/or analyzing stars and galaxies
  • Utilities for energy production and energy savings
  • Biotechnology for research and finding new cures for diseases

In general, since data science and machine learning algorithms are not specific to any particular industry, it is entirely possible to apply them to creative use cases and derive business benefits.

 

Solving problems with data science

Data science is being used today to solve problems ranging from poverty alleviation to scientific research. It has emerged as the leading discipline that aims to disrupt the industry's status quo and provide a new alternative to pressing business issues.

However, while the promise of data science and machine learning is immense, it is important to bear in mind that it takes time and effort to realize the benefits. The return-on-investment on a machine learning project typically takes a fairly long time. It is thus essential to not overestimate the value it can bring in the short run.

A typical data science project in a corporate setting would require the collaborative efforts of various groups, on both the technical and the business side. Generally, this means that the project should have a business sponsor and a technical or analytics lead in addition to the data science team or data scientist. It is important to set expectations at the outset, both in terms of the time it would take to complete the project and the outcome, which may be uncertain until the task has been completed. Unlike other projects that may have a definite goal, it is not possible to predetermine the outcome of machine learning projects.

Some common questions to ask include the following:

  • What business value does the data science project bring to the organization?
  • Does it have a critical base of users, that is, would multiple users benefit from the expected outcome of the project?
  • How long would it take to complete the project and are all the business stakeholders aware of the timeline?
  • Have the project stakeholders taken all variables that may affect the timeline into account? Projects can often get delayed due to dependencies on external vendors.
  • Have we considered all other potential business use cases and made an assessment of what approach would have an optimal chance of success?

A few salient points for successful data science projects are given as follows:

  • Find projects or use cases related to business operations that are:
    • Challenging
    • Not necessarily complex, that is, they can be simple tasks but which add business value
    • Intuitive, easily understood (you can explain it to friends and family)
    • Time-consuming to accomplish today or requiring a lot of manual effort
    • Used frequently by a range of users and the benefits of the outcome would have executive visibility
  • Identify low difficulty–high value projects (shorter) versus high difficulty–high value projects (longer)
  • Educate business sponsors, share ideas, show enthusiasm (it's like a long job interview)
  • Score early wins on low difficulty–high value projects; create minimum viable solutions and get management buy-in before enhancing them (this takes time)
  • Early wins act as a catalyst to foster executive confidence and make it easier to justify budgets, which in turn makes it easier to move on to high difficulty–high value tasks
 

Using R for data science

Arguably one of the oldest and most mature languages for statistical operations, R has been used by statisticians all over the world for over 20 years. The precursor to R was the S programming language, written by John Chambers in 1976 at Bell Labs. R, named after the first initials of its developers, Ross Ihaka and Robert Gentleman, was implemented as an open source equivalent to S while they were at the University of Auckland.

The language has gained immensely in popularity since the early 2000s, averaging between 20% and 30% growth on a year-on-year basis:

The growth of R packages

In 2018, there were more than 12,000 R packages, up from about 7,500 just 3 years before, in 2015.

A few key features of R make it not only very easy to learn, but also very versatile due to the number of available packages.

Key features of R

The key features of R are as follows:

  • Data mining: The R package data.table, developed by Dowle and Srinivasan, is arguably one of the most sophisticated packages for data mining in any language and provides R users with the ability to query millions, if not billions, of rows of data. In addition, there is tibble, an alternative to data.frame developed by Hadley Wickham. Other packages from Wickham include plyr and dplyr for data manipulation and ggplot2 for visualization.
  • Visualizations: The ggplot2 package is the most commonly used visualization package in R. Packages such as rCharts and htmlwidgets have also become extremely popular in recent years. Most of these packages allow R users to leverage elegant graphics features commonly found in JavaScript packages such as D3. Many of them act as wrappers for popular JavaScript visualization libraries to facilitate the creation of graphics elements in R.
  • Data science: R has had various statistical libraries used for research for many years. With the growth of data science as a popular subject in the public domain, R users have released and further developed both new and existing packages that allow users to deploy complex machine learning algorithms. Examples include randomForest and gbm.
  • General availability of packages: The 12,000+ packages in R provide coverage for a wide range of projects. These include packages for machine learning, data science, and even general purpose needs such as web scraping, cartography, and even fisheries sciences. Due to this rich ecosystem that can cater to the needs of a wide variety of use cases, R has grown exponentially in popularity. Whether you are working with JSON files or trying to solve an obscure machine learning problem, it is very likely that someone in the R community has already developed a package that contains (or can indirectly fulfill) the functionality you need.

Setting up R and RStudio

This book will focus on using R for data science-related tasks. The R language, as mentioned, is available as an open source product from http://r-project.org. In addition, we will be installing RStudio, an IDE (integrated development environment) for writing and running our R code, as well as R Shiny, a platform that allows users to develop elegant dashboards.

To download and install R, follow these steps:

  1. Go to http://r-project.org and click on the CRAN link (http://cran.r-project.org/mirrors.html).
  2. Select any one of the links on the resulting page. These are links to CRAN mirrors, that is, sites that host R packages and R installation files.
  3. Once you click on a link, you'll be taken to a page with links to download R for different operating systems, such as Windows, macOS, and Linux. Select the distribution that you need to start the download process.
  4. On the R for Windows download page, click on install R for the first time if it is a new installation, then download and install the .exe file for R.
  5. On the R for macOS download page, download the installer package (a .pkg file). Select the default options for installation if you do not intend to make any changes, such as installing in a different directory.
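
Once the installer finishes, one quick way to confirm that R is working is to open the R console and print some details about the running session. The following is a minimal sketch that uses only base R; the exact version string and platform you see will depend on the release you installed.

```r
# Print the version of R that is currently running
print(R.version.string)

# Show the platform details and the packages attached in this session
print(sessionInfo())
```

If both commands print without errors, the base installation is in place and you can move on to RStudio.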

You will also need to download and install RStudio and R Shiny. RStudio is the frontend that you'll use to develop your R code. Strictly speaking, you do not need RStudio to write R code, since you can launch the R console directly from the desktop (on Windows), but RStudio has a nicer, more user-friendly interface that makes it easier to code in R.

  1. Download RStudio and R Shiny from https://www.rstudio.com.
  2. Click on Products in the top menu and select RStudio to download and install the software.
  3. Download the open source version of RStudio. Note that there are other versions, which are paid commercial versions of the software. For our exercise, we'll be using the open source version only. Download it from https://www.rstudio.com/products/rstudio/download/.
  4. Once you have installed RStudio, launch the application. This will bring up the main RStudio window. There are four panels in RStudio. The first three are shown when you first launch RStudio.
  5. Click on File | New File | R Script. This will open a new panel, the fourth. This is where you'll be writing your R code.

RStudio is a very mature interface for developing R code and has been in use for several years. You should familiarize yourself with the different features in RStudio as you'll be using the tool throughout the book.
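
To try out the script panel, you could type a few lines such as the following into the new R script and run them one at a time with the Run button. This is only an illustrative sketch in base R; the numbers are made up and are not taken from any dataset used in this book.

```r
# A few made-up life expectancy values (illustrative only)
life_exp <- c(71.2, 68.5, 80.1, 75.4)

# Basic summary statistics from base R
avg_life_exp <- mean(life_exp)
range_life_exp <- range(life_exp)

# The results appear in the R console panel
print(avg_life_exp)
print(range_life_exp)
```

Each line is sent from the script panel to the console, which is where the printed values show up.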

 

Our first R program

In this section, we will create our first R program for data analysis. We'll use the human development data available from the United Nations Development Programme (UNDP). The programme produces a Human Development Index (HDI) for each country, which signifies the level of economic development, including general public health, education, and various other societal factors.

Further information on HDI can be found at http://hdr.undp.org/en/content/human-development-index-hdi. The site also hosts an FAQ page that provides short summary explanations of the various characteristics of the program at http://hdr.undp.org/en/faq-page/human-development-index-hdi.

The following diagram from the UN Development Programme's website summarizes the concept at a high level:

UN development index

In this exercise, we will be looking at the life expectancy and expected years of schooling on a per country per year basis starting from 1990 onward. Not all data is available for all countries, due to various geopolitical and other reasons that have made it difficult to obtain data for respective years.

The datasets for the HDI program have been obtained from http://hdr.undp.org/en/data.

In the exercises, the data has been cleaned and formatted to make it easier for the reader to analyze, especially given that this is the first chapter of the book. Download the data from the Packt code repository for this book. The steps to complete the exercise are as follows:

  1. Launch RStudio and click on File | New File | R Script.
  2. Save the file as Chapter1.R.
  3. Copy the commands shown in the following script into the file and save it.
  4. Install the required packages for this exercise by running the following command. First, copy the command into the code window in RStudio:
install.packages(c("data.table","plotly","ggplot2","psych"))
  5. Then, place your cursor on the line and click on Run.
  6. This will install the respective packages on your system. If you encounter any errors, search online for the error message. There are various online forums, such as Stack Overflow, where you can search for common errors and learn how to fix them. Since errors can depend on the specific configuration of your machine, we cannot identify all of them, but it is very likely that someone else has experienced the same error conditions.
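
If some of these packages are already present on your machine, you may prefer to install only the ones that are missing. The following is a hedged sketch of one way to do that with base R. The helper name missing_packages is hypothetical and introduced here purely for illustration; it is not part of R or of this book's code.

```r
# Packages used in this exercise
required <- c("data.table", "plotly", "ggplot2", "psych")

# Hypothetical helper: return the names of packages that are not yet installed
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(required)
print(to_install)

# Install only what is missing (uncomment to run; needs an internet connection)
# if (length(to_install) > 0) install.packages(to_install)
```

The install call is left commented out so that the script can be run safely without network access; an empty result from print(to_install) means everything is already installed.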

We have already created the requisite CSV files, and the following code illustrates the entire process of reading in the CSV files and analyzing the data:


# We'll install the following packages:
## data.table: a package for managing & manipulating datasets in R
## plotly: a graphics library that has gained popularity in recent years
## ggplot2: another graphics library that is extremely popular in R
## psych: a package for psychometrics that also includes some very helpful statistical functions

install.packages(c("data.table","plotly","ggplot2","psych"))

# Load the libraries
# This is necessary if you will be using functionality that is not
# already available as part of standard R

library(data.table)
library(plotly)
library(ggplot2)
library(psych)
library(RColorBrewer)

# In R, packages contain multiple functions, and once a package has been loaded,
# its functions become available in your workspace
# To find more information about a function, type ?function_name at the R console
# Note that you should replace function_name with the name of the actual function
# This will bring up the relevant help notes for the function
# Note that the "R Console" is the interactive screen where R commands are typed and run

# Read in Human Development Index File
hdi <- fread("ch1_hdi.csv",header=T) # The command fread can be used to read in a CSV file

# View contents of hdi
head(hdi) # View the top few rows of the data table hdi

The output of the preceding code is as follows:

Read the life expectancy file by using the following code:

life <- fread("ch1_life_exp.csv", header=T)

# View contents of life
head(life)

The output of the preceding code is as follows:

Read the years of schooling file by using the following code:

# Read Years of Schooling File
school <- fread("ch1_schoolyrs.csv", header=T)

# View contents of school
head(school)

The output of the preceding code is as follows:

Now we will read the country information:

iso <- fread("ch1_iso.csv")

# View contents of iso
head(iso)

The following is the output of the previous code:

Next, we process the hdi, life, and school tables and merge them by using the following code:

# Use melt.data.table to change hdi into a long table format

hdi <- melt.data.table(hdi,1,2:ncol(hdi))

# Set the names of the columns of hdi
setnames(hdi,c("Country","Year","HDI"))

# Process the life table
# Use melt.data.table to change life into a long table format
life <- melt.data.table(life,1,2:ncol(life))
# Set the names of the columns of life
setnames(life,c("Country","Year","LifeExp"))

# Process the school table
# Use melt.data.table to change school into a long table format
school <- melt.data.table(school,1,2:ncol(school))
# Set the names of the columns of school
setnames(school,c("Country","Year","SchoolYrs"))

# Merge hdi, life, and school along the Country and Year columns
merged <- merge(merge(hdi, life, by=c("Country","Year")),
                school, by=c("Country","Year"))

# Add the Region attribute to the merged table using the iso file
# This can be done using the merge function
# Type in ?merge in your R console
merged <- merge(merged, iso, by="Country")
merged$Info <- with(merged, paste(Country,Year,"HDI:",HDI,"LifeExp:",LifeExp,"SchoolYrs:",
SchoolYrs,sep=" "))

# Use View to open the dataset in a different tab
# Close the tab to return to the code screen
View(head(merged))
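
To make the reshaping and merging steps above more concrete, here is a small sketch on toy data. The values and the country names A and B are made up and are not the real UN figures; only the column names mirror the ones used in this chapter. It assumes the data.table package is installed, and it calls the melt generic with named arguments, which for a data.table should be equivalent to calling melt.data.table with positional arguments.

```r
library(data.table)

# Two tiny wide tables with made-up values: one row per country, one column per year
hdi_wide <- data.table(Country = c("A", "B"),
                       `1990` = c(0.50, 0.70),
                       `1991` = c(0.51, 0.72))
life_wide <- data.table(Country = c("A", "B"),
                        `1990` = c(60.0, 75.0),
                        `1991` = c(60.5, 75.3))

# Wide to long: each (Country, Year) pair becomes its own row
hdi_long <- melt(hdi_wide, id.vars = "Country",
                 variable.name = "Year", value.name = "HDI")
life_long <- melt(life_wide, id.vars = "Country",
                  variable.name = "Year", value.name = "LifeExp")

# Join on the shared keys; only (Country, Year) pairs present in both tables are kept
toy_merged <- merge(hdi_long, life_long, by = c("Country", "Year"))
print(toy_merged)
```

The toy result has four rows, one per country and year, with HDI and LifeExp side by side. Note that melt stores the former column names as a factor in the Year column.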

The output of View(head(merged)) is as follows:

Here is the code for finding summary statistics for each country:


mergedDataSummary <- describeBy(merged[,c("HDI","LifeExp","SchoolYrs")],
                                group=merged$Country, na.rm=TRUE, IQR=TRUE)


# Which countries are available in the mergedDataSummary object?
names(mergedDataSummary)
# Enter any country name here to view the summary information
mergedDataSummary["Cuba"]

The output is as follows:

Use ggplot2 to view density charts and boxplots:

ggplot(merged, aes(x=LifeExp, fill=Region)) + geom_density(alpha=0.25)

The output is as follows:

Now we will see what the result is for geom_boxplot:


ggplot(merged, aes(x=Region, y=LifeExp, fill=Region)) + geom_boxplot()

The output is as follows:


Create an animated chart using plot_ly:

# Reference: https://plot.ly/r/animations/
p <- merged %>%
  plot_ly(
    x = ~SchoolYrs,
    y = ~LifeExp,
    color = ~Region,
    frame = ~Year,
    text = ~Info,
    size = ~LifeExp,
    hoverinfo = "text",
    type = 'scatter',
    mode = 'markers'
  ) %>%
  layout(
    xaxis = list(
      type = "log"
    )
  ) %>%
  animation_opts(
    150, easing = "elastic", redraw = FALSE
  )

# View plot
p

The output is as follows:

Create a summary table with the average of SchoolYrs and LifeExp by Region and Year by using the following code:


# na.rm=TRUE is applied to both means so that missing values do not turn a group average into NA
mergedSummary <- merged[,.(AvgSchoolYrs=round(mean(SchoolYrs, na.rm=TRUE),2),
                           AvgLifeExp=round(mean(LifeExp, na.rm=TRUE),2)),
                        by=c("Year","Region")]
mergedSummary$Info <- with(mergedSummary,
  paste(Region,Year,"AvgLifeExp:",AvgLifeExp,"AvgSchoolYrs:",AvgSchoolYrs,sep=" "))


# Create an animated plot similar to the prior diagram
# Reference: https://plot.ly/r/animations/
ps <- mergedSummary %>%
  plot_ly(
    x = ~AvgSchoolYrs,
    y = ~AvgLifeExp,
    color = ~Region,
    frame = ~Year,
    text = ~Info,
    size = ~AvgSchoolYrs,
    opacity = 0.75,
    hoverinfo = "text",
    type = 'scatter',
    mode = 'markers'
  ) %>%
  layout(title = 'Average Life Expectancy vs Average School Years (1990-2015)',
         xaxis = list(title="Average School Years"),
         yaxis = list(title="Average Life Expectancy"),
         showlegend = FALSE)
# View plot
ps
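
The summary table used for this chart averages values within each Year and Region group. Since, as noted earlier, not all data is available for all countries, it may help to see how missing values affect a grouped mean in data.table. The following sketch uses made-up values and hypothetical region names X and Y, not the real UN figures, and assumes the data.table package is installed.

```r
library(data.table)

# Made-up long-format data with one missing value
toy <- data.table(
  Region  = c("X", "X", "Y", "Y"),
  Year    = c("1990", "1990", "1990", "1990"),
  LifeExp = c(60, 70, 80, NA)
)

# Without na.rm = TRUE, a single NA makes the whole group mean NA
with_na <- toy[, .(AvgLifeExp = mean(LifeExp)), by = c("Year", "Region")]

# With na.rm = TRUE, missing values are dropped before averaging
without_na <- toy[, .(AvgLifeExp = mean(LifeExp, na.rm = TRUE)), by = c("Year", "Region")]

print(with_na)
print(without_na)
```

In the first result, region Y has an NA average because one of its values is missing; in the second, the average for Y is computed from the remaining value only. Which behavior is appropriate depends on how you want gaps in the data to be reflected in the chart.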

 

Summary

In this chapter, we were introduced to R and, in particular, to data science with R. We learned about the various applications of R across different industries and how R is being used across disciplines to solve a wide range of challenges. R has been growing at a tremendous rate, averaging 20% to 30% growth year on year, and today has over 12,000 packages.

We also downloaded and installed R and RStudio and wrote our very first program in R. The R program utilizes various libraries for both data analysis and charting. In the next chapter, we will work on descriptive and inferential statistics. We will learn about hypothesis testing, t-tests, and various other measures in probability. While this chapter provided a high-level overview, in the next chapter, we will delve into more fundamental data science topics and see how you can use R to develop code and analyze data.

 

Quiz

  1. What does the acronym CRAN stand for?
    1. Comprehensive R Archive Network
    2. Common R Archive Nomenclature
    3. Categorical Regression And NLP
  2. Which of the following options best describes TensorFlow?
    1. A machine learning package for Fluid Dynamics
    2. Tension Analysis for Fluid Dynamics
    3. A machine learning package from Google
  3. Which of the following algorithms is Netflix most likely to use to provide movie suggestions?
    1. Particle Swarm
    2. Genetic Algorithms for IMDB
    3. Recommender Engines

Answers:

Q1 - 1, Q2 - 3, Q3 - 3

About the Authors
  • Vitor Bianchi Lanzetta

    Vitor Bianchi Lanzetta (@vitorlanzetta) has a master's degree in Applied Economics (University of São Paulo, USP) and works as a data scientist in a tech start-up named RedFox Digital Solutions. He has also authored a book called R Data Visualization Recipes. The things he enjoys the most are statistics, economics, and sports of all kinds (electronics included). His blog, made in partnership with Ricardo Anjoleto Farias (@R_A_Farias), can be found at ArcadeData dot org; they kindly call it R-Cade Data.

  • Doug Ortiz

    Doug Ortiz is an experienced enterprise cloud, big data, data analytics, and solutions architect who has architected, designed, developed, engineered, re-engineered, and integrated enterprise solutions. The technologies he has experience with include: Amazon Web Services, Azure, Google Cloud, Business Intelligence, Data Science, Hadoop, Spark, NoSQL and Graph Databases, and Web Front-End Technologies.

  • Nataraj Dasgupta

    Nataraj Dasgupta is the vice president of advanced analytics at RxDataScience Inc. Nataraj has been in the IT industry for more than 19 years, and has worked in the technical and analytics divisions of Philip Morris, IBM, UBS Investment Bank, and Purdue Pharma. At Purdue Pharma, Nataraj led the data science division, where he developed the company's award-winning big data and machine learning platform. Prior to Purdue, at UBS, he held the role of Associate Director, working with high-frequency and algorithmic trading technologies in the foreign exchange trading division of the bank.

  • Ricardo Anjoleto Farias

    Ricardo Anjoleto Farias is an economist who graduated from the Universidade Estadual de Maringá in 2014. In addition to being a sports enthusiast (electronic or otherwise) and enjoying a good barbecue, he also likes math, statistics, and correlated studies. His first contact with R was when he embarked on his master's degree, and since then, he has tried to improve his skills with this powerful tool.
