You're reading from Jupyter for Data Science

Product type: Book
Published in: Oct 2017
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781785880070
Edition: 1st Edition
Author (1)
Dan Toomey

Dan Toomey has been developing application software for over 20 years. He has worked in a variety of industries and companies, in roles from sole contributor to VP/CTO level. For the last few years, he has been contracting for companies in the eastern Massachusetts area under Dan Toomey Software Corp. Dan has also written R for Data Science, Jupyter for Data Science, and the Jupyter Cookbook, all with Packt.

Chapter 5. R with Jupyter

In this chapter, we will use R within Jupyter. I think R is one of the primary languages people expect to use with Jupyter, and the full extent of the language is available to Jupyter users.

How to set up R for Jupyter


In the past, it was necessary to install the separate components of Jupyter, Python, and so on to have a working system. With the Anaconda distribution from Continuum Analytics, installing Jupyter and adding the R engine to it is easy and works on both Windows and Mac.

Assuming you have installed conda already, we have one command to add support for R programming to Jupyter:

conda install -c r r-essentials

Note

At this point, when you start Jupyter, one of the kernels listed will now be R.

R data analysis of the 2016 US election demographics


To get a flavor of the resources available to R developers, we can look at the 2016 election data. In this case, I am drawing from Wikipedia (https://en.wikipedia.org/wiki/United_States_presidential_election,_2016), specifically the table named 2016 presidential vote by demographic subgroup. The code is as follows.

Define a helper function so we can print out values easily. The new printf function takes any arguments passed (...) and passes them along to sprintf:

printf <- function(...) print(sprintf(...))
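As a quick illustration of how the helper behaves, the call below formats a string and a number in one line. The values are made up for the example and are not taken from the election tables.

```r
# Same helper as above: forward all arguments to sprintf, then print the result.
printf <- function(...) print(sprintf(...))

# Hypothetical values, used only to show the formatting.
group <- "18-29"
share <- 19.5

# %s inserts a string, %.1f a number with one decimal, %% a literal percent sign.
printf("Age group %s made up %.1f%% of the sample", group, share)
```

Because the helper wraps print, the result appears with R's usual [1] index and surrounding quotes. If plain output is preferred, cat(sprintf(...), "\n") is one alternative.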

I have stored the separate demographic statistics in different TSV (tab-separated value) files, which can be read in using the following code. For each table, we use the read.csv function and specify the field separator as a tab instead of the default comma. We then use the head function to display the first few rows of the data frame that was loaded:

age <- read.csv("Documents/B05238_05_age.tsv", sep="\t")
head(age)
education...
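The same loading pattern can be tried without the files on disk. This is a minimal sketch that parses a small tab-separated string; the column names and numbers are hypothetical stand-ins, not the contents of the TSV files used in this chapter.

```r
# Hypothetical tab-separated text standing in for one of the TSV files.
tsv_text <- "group\tpct_a\tpct_b\nA\t55\t40\nB\t45\t50\nC\t50\t46"

# read.csv defaults to a comma separator; sep = "\t" switches it to tabs.
# The text argument lets us parse a string instead of a file path.
sample_df <- read.csv(text = tsv_text, sep = "\t")

head(sample_df)  # first rows (up to six) of the data frame
str(sample_df)   # column names and types, to confirm the parse
```

If the separator were left at its default, each line would be read as a single column, so checking the column count with str or ncol is a cheap way to catch a wrong sep value.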

Analyzing 2016 voter registration and voting


Similarly, we can look at voter registration versus actual voting (using census data from https://www.census.gov/data/tables/time-series/demo/voting-and-registration/p20-580.html).

First, we load our dataset and display summary information to visually check for accurate loading:

df <- read.csv("Documents/B05238_05_registration.csv")
summary(df)

So, we have some registration and voting information by state. When given a data frame, R's plot command automatically draws an x-y scatterplot for every pair of columns:

plot(df)

We are specifically looking at the relationship between registering to vote and actually voting. The resulting graphic shows that most of the data is highly correlated (as evidenced by the 45-degree angles of most of the relationships).
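The visual impression of correlation can also be checked numerically with the cor function. The sketch below uses a small hypothetical data frame; the column names registered and voted and all of the numbers are assumptions for illustration, not the census figures.

```r
# Hypothetical per-state percentages; NOT the census data.
sample_df <- data.frame(
  registered = c(60, 65, 70, 72, 75, 80),
  voted      = c(50, 56, 59, 63, 64, 71)
)

# Pearson correlation between the two columns (values near 1 mean a
# strong positive linear relationship).
r <- cor(sample_df$registered, sample_df$voted)
print(round(r, 3))

# Full correlation matrix, the numeric counterpart of the plot(df) grid.
print(cor(sample_df))
```

With the real dataset, cor accepts only numeric columns, so any text column (such as a state name, if one is present) would need to be dropped first.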

We can produce somewhat similar results using Python, but the graphic display is not even close to what R gives us from a single plot call.

Import all of the packages we are using for the example:

from numpy import corrcoef, sum, log, arange
from numpy.random import...

Analyzing changes in college admissions


We can look at trends in college admissions acceptance rates over the last few years. For this analysis, I am using the data on https://www.ivywise.com/ivywise-knowledgebase/admission-statistics.

First, we read in our dataset, then show the summary and the first few rows (using head) to validate the load:

df <- read.csv("Documents/acceptance-rates.csv")
summary(df)
head(df)

We see the summary data for school acceptance rates as follows:

It's interesting to note that the acceptance rate varies so widely, from a low of 5 percent to a high of 41 percent in 2017.

Let us look at the data plots, again, to validate that the data points are correct:

plot(df)

From the correlation graphics shown, it does not look like we can use the data points from 2007. The graphs show a big divergence between 2007 and the other years, whereas the other three have good correlations.

So, we have 3 consecutive years of data from 25 major US universities. We can convert the data into a time series using a few steps...
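The conversion steps themselves are not shown in this excerpt. As a minimal sketch of the general idea, base R's ts function turns a numeric vector into a time series. The rates and the 2015 starting year below are hypothetical and are not taken from the dataset.

```r
# Hypothetical acceptance rates (percent) for one school over three years.
rates <- c(9.5, 9.0, 8.4)

# ts() attaches time information to the vector:
# yearly observations (frequency = 1) starting in an assumed year of 2015.
rates_ts <- ts(rates, start = 2015, frequency = 1)

print(rates_ts)
print(time(rates_ts))  # the year attached to each observation
print(diff(rates_ts))  # year-over-year change in the rate
```

Once the data is a ts object, functions such as time, diff, and plot use the attached years directly instead of treating the values as an unlabeled vector.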

Predicting airplane arrival time


R has built-in functionality for splitting up a data frame between training and testing sets, building a model based on the training set, predicting results using the model and the testing set, and then visualizing how well the model is working.

For this example, I am using airline arrival and departure times versus scheduled arrival and departure times from http://stat-computing.org/dataexpo/2009/the-data.html for 2008. The dataset is distributed as a .bz2 file that unpacks into a CSV file. I like this dataset, as the initial row count is over 7 million and it all works nicely in Jupyter.

We first read in the airplane data and display a summary. There are additional columns in the dataset that we are not using:

df <- read.csv("Documents/2008-airplane.csv")
summary(df)
...
CRSElapsedTime      AirTime          ArrDelay          DepDelay
Min.   :-141.0   Min.   :   0     Min.   :-519.00   Min.   :-534.00
1st Qu.:  80.0   1st Qu.:  55     1st Qu.: -10.00...
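The workflow described at the start of this section (split, fit, predict, visualize) can be sketched in base R as follows. This is not necessarily the approach used in the rest of the chapter. It uses a synthetic data frame in place of the 7-million-row file, borrows only the DepDelay and ArrDelay column names from the summary above, and the 75/25 split and the linear model are my own assumptions.

```r
set.seed(42)  # make the random split reproducible

# Synthetic stand-in for the flight data: arrival delay loosely tracks
# departure delay. These numbers are generated, not read from the dataset.
n <- 200
flights <- data.frame(DepDelay = rnorm(n, mean = 10, sd = 20))
flights$ArrDelay <- flights$DepDelay + rnorm(n, mean = 0, sd = 5)

# Hold out 25% of the rows for testing.
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train <- flights[train_idx, ]
test  <- flights[-train_idx, ]

# Fit a linear model on the training rows only.
model <- lm(ArrDelay ~ DepDelay, data = train)

# Predict on the held-out rows and measure the error.
predicted <- predict(model, newdata = test)
rmse <- sqrt(mean((test$ArrDelay - predicted)^2))
print(rmse)

# Visual check: points close to the dashed line are well predicted.
plot(test$ArrDelay, predicted,
     xlab = "Actual ArrDelay", ylab = "Predicted ArrDelay")
abline(0, 1, lty = 2)
```

Keeping the test rows out of the lm call is the important part: the error is then measured on data the model has not seen, which gives a fairer picture of how it would perform on new flights.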

Summary


In this chapter, we first set up R as one of the engines available for a notebook. Then we used some rudimentary R to analyze voter demographics for the presidential election. We looked at voter registration versus actual voting. Next, we analyzed the trend in college admissions. Finally, we looked at using a predictive model to determine whether flights would be delayed or not.

In the next chapter, we will look into wrangling data in different ways under Jupyter.
