You're reading from  Applied Data Visualization with R and ggplot2

Product type: Book
Published in: Sep 2018
Reading Level: Intermediate
ISBN-13: 9781789612158
Edition: 1st Edition
Author: Dr. Tania Moulik

Tania Moulik has a PhD in particle physics. She has worked at CERN, the European Organization for Nuclear Research, and on the Tevatron at Fermi National Accelerator Laboratory in IL, USA. She has years of programming experience in C++, Python, and R. She has also worked in the field of big data and has worked with technologies such as grid computing. She has a passion for data analysis and would like to share her passion with others who would like to delve into the world of data analytics. She especially likes R and ggplot2 as a powerful analytics package.

Appendix 1. Solutions

This section contains the worked-out answers for the activities present in each lesson. Note that for descriptive questions, your answers might not match the ones provided in this section completely. As long as the essence of the answers remains the same, you can consider them correct.

Chapter 1:  Basic Plotting in ggplot2


The following are the activity solutions for this chapter.

Activity: Creating a Histogram and Explaining its Features

Steps for Completion:

  1. Use the template code Lesson1_student.R.

Note

This is an empty script in which the libraries are already loaded. You will write your code here.

  2. Load the dataset temperature.csv from the directory data.
  3. Create the histogram for two cities (Vancouver and Miami) by using the command discussed previously.
  4. Once the histogram is ready, run the code.
  5. Analyze the two histograms by giving three points for each histogram, and two points of difference between the two.

Outcome:

Two histograms should be created and compared. The complete code is as follows:

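# Load the hourly temperature data (one column per city)
df_t <- read.csv("data/historical-hourly-weather-data/temperature.csv")
# One histogram per city, using the default binning
ggplot(df_t,aes(x=Vancouver))+geom_histogram()
ggplot(df_t,aes(x=Miami))+geom_histogram()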
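
The analysis step asks for points about each histogram and differences between the two. One way to back up visual impressions with numbers is sketched below. It uses made-up values in a hypothetical data frame, df_demo, rather than temperature.csv, so the figures it prints say nothing about the real cities.

```r
# Made-up temperatures standing in for two columns of temperature.csv
set.seed(1)
df_demo <- data.frame(
  Vancouver = rnorm(500, mean = 284, sd = 6),
  Miami     = rnorm(500, mean = 298, sd = 4)
)

# Centre, spread, and range for each city, one column per city
city_stats <- sapply(df_demo, function(x) {
  c(mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE),
    min  = min(x, na.rm = TRUE),
    max  = max(x, na.rm = TRUE))
})
print(round(city_stats, 1))
```

Replacing df_demo with df_t[, c("Vancouver", "Miami")] would apply the same summary to the loaded dataset.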

Activity: Creating One- and Two-Dimensional Visualizations with a Given Dataset

Steps for Completion:

  1. Load the given dataset, xAPI-Edu-Data.csv, and investigate it by using the appropriate commands.
  2. Decide which visualizations to use for the given variables: Topic, gender, and VisitedResources.
  3. Create one-dimensional visualizations and explain why you chose that type of visual (one per variable). Provide one point of observation for each visualization.
  4. Create two-dimensional boxplots or scatterplots for VisitedResources versus Topic, VisitedResources versus AnnouncementsView, and Discussion versus Gender. What are your observations? Write at least five points.

Outcome:

Three one-dimensional plots and three two-dimensional plots should be created, with the following axes (count versus topic) and observations. (Note that the students may provide different observations, so the instructor should verify the answers.)

The complete code is as follows:

df_edu <- read.csv("data/xAPI-Edu-Data.csv")
str(df_edu)

#Functions for Plotting a barchart/Histogram
plotbar <- function(df,mytxt) {
  ggplot(df,aes_string(x=mytxt)) + geom_bar()
}
plothist <- function(df,mytxt) {
  ggplot(df,aes_string(x=mytxt)) + geom_histogram()
}

#Alternatively one can use a function to plot but students can just
#do it directly at this point.
#1-D Plots
plotbar(df_edu,"Topic")
plotbar(df_edu,"gender")
plotbar(df_edu,"ParentschoolSatisfaction")
plothist(df_edu,"VisitedResources")

#2-D Plots
ggplot(df_edu,aes(x=Topic,y=VisitedResources)) + geom_boxplot()
ggplot(df_edu,aes(x=AnnouncementsView,y=VisitedResources)) + geom_point()
ggplot(df_edu,aes(x=gender,y=Discussion)) + geom_boxplot()
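
Recent ggplot2 releases deprecate aes_string(). A possible alternative for the helper functions above is to index the .data pronoun with the column name. The sketch below assumes a reasonably current ggplot2 and uses a small made-up data frame, df_demo, and a hypothetical helper name, plotbar2; neither comes from the book.

```r
library(ggplot2)

# Hypothetical stand-in for df_edu with two of its columns
df_demo <- data.frame(
  Topic = c("Math", "IT", "Math", "Arabic", "IT", "Math"),
  VisitedResources = c(15, 80, 42, 67, 90, 23)
)

# Same idea as plotbar(), but the column name is looked up through .data
plotbar2 <- function(df, mytxt) {
  ggplot(df, aes(x = .data[[mytxt]])) + geom_bar()
}

# Build the plot object; print(p) would draw it
p <- plotbar2(df_demo, "Topic")
```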

Activity: Improving the Default Visualization

Steps for Completion:

  1. Use the basic ggplot commands to create two of the plots from Activity B (Topic and VisitedResources).
  2. Use the Grammar of Graphics to improve your graphics by layering upon the base graphic. The graph should follow these guidelines:
    1. Histograms should be rebinned.
    2. Change the fill colors of one- and two-dimensional objects. The line colors should be black.
    3. Add a title to the graph.
    4. Apply the appropriate font sizes and colors to the x- and y-axes.

Outcome:

The complete code is as follows:

p1 <- ggplot(df_edu,aes(x=Topic))
p2 <- ggplot(df_edu,aes(x=VisitedResources))

p1 +
    geom_bar(color=1,fill=3) +
    ylab("Count")+
    theme(axis.text.y=element_text(size=10),
          axis.text.x=element_text(size = 10),
          axis.title.x=element_text(size=15,color=4),
          axis.title.y=element_text(size=15,color=4))+
    ggtitle("Topics in Education data")

p2 +
    geom_histogram(bins=20,fill="white",color=1)+
    ggtitle("Visited Resources for Education data")+
    xlab("Visited Resources")+
    theme(axis.text.x=element_text(size = 12),
          axis.text.y=element_text(size=12),
          axis.title.x=element_text(size=15,color=4),
          axis.title.y=element_text(size=15,color=4)) 

Chapter 2:  Grammar of Graphics and Visual Components


The following are the activity solutions for this chapter.

Activity: Applying Grammar of Graphics to Create a Complex Visualization

Steps for Completion:

  1. Use the commands that we just explored to create the scatterplot.
  2. For this activity, you will use the gapminder dataset.
  3. You can use the help command to explore the options.
  4. To change scales, you will have to use one of the preceding label formats.
  5. Use labels=scales::unit_format(unit="K", scale=1e-3) for labeling.

Outcome:

The output code is as follows:

# unit_format arguments are named because their order has changed across scales versions
ggplot(df, aes(x=gdp_per_capita,y=Electricity_consumption_per_capita))+
    geom_point()+
    scale_x_continuous(name="GDP",breaks = seq(0,50000,5000),
                       labels=scales::unit_format(unit="K", scale=1e-3)) +
    scale_y_continuous(name="Electricity Consumption",
                       breaks = seq(0,20000,2000),
                       labels=scales::unit_format(unit="K", scale=1e-3))
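
To see what the labeller does before attaching it to a scale, it can be called directly on a few break values. This is a small sketch that assumes the scales package is installed; the names k_label, demo_breaks, and demo_labels are made up for illustration.

```r
library(scales)

# Labeller that rescales by 1/1000 and appends the unit "K"
k_label <- unit_format(unit = "K", scale = 1e-3)

# A few break values like those used on the y-axis above
demo_breaks <- seq(0, 20000, 5000)
demo_labels <- k_label(demo_breaks)
print(data.frame(value = demo_breaks, label = demo_labels))
```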

Activity: Using Faceting to Understand Data

Steps for Completion:

  1. Use the loan data and plot a histogram (use fill color=cadetblue4 and bins=10).
  2. Use facet_wrap() to plot the loan data for the different credit grades.
  3. Now, you will need to change the default options for facet_wrap, in order to produce the following plots. Use ?facet_wrap on the command line to view the options that can be changed.

Outcome:

Refer to the complete code at the following path: https://goo.gl/RheL2G. The answers to the questions are given here:

  1. scales="free_y".
  2. A, B, and C have maximum loan amounts below 10,000. (A, B, C, and D is also an acceptable answer.)
  3. F and G show uniform distributions.
  4. No, none of the distributions are normally distributed.
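
The linked code is not reproduced here. As a rough sketch of the faceting described in the steps, the example below builds a histogram per credit grade with independent y-axes. It runs on made-up data in a hypothetical data frame, loans_demo; only the column names grade and loan_amnt are taken from this appendix.

```r
library(ggplot2)

# Made-up loan amounts; the activity itself uses the lesson's loan dataset
set.seed(42)
loans_demo <- data.frame(
  grade     = rep(LETTERS[1:7], times = c(300, 250, 200, 120, 60, 30, 15)),
  loan_amnt = round(runif(975, min = 1000, max = 35000))
)

# One panel per grade; scales = "free_y" gives each panel its own y-axis
p <- ggplot(loans_demo, aes(x = loan_amnt)) +
  geom_histogram(bins = 10, fill = "cadetblue4") +
  facet_wrap(~grade, scales = "free_y")
```

Freeing the y-axis matters when group sizes differ a lot, because a shared axis would flatten the histograms of the smaller grades.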

Activity: Using Color Differentiation in Plots

Steps for Completion:

  1. Use the LoanStats dataset and make a subset using the following variables:
dfn <- df3[,c("home_ownership","loan_amnt","grade")]
  2. Clean the dataset (removing the NONE and NA cases), using the following code:
dfn <- na.omit(dfn)
dfn <- subset(dfn, !dfn$home_ownership %in% c("NONE"))
  3. Create a boxplot showing the loan amount versus home ownership.
  4. Color differentiate by credit grade.

Outcome:

Refer to the following URL for the output: https://goo.gl/RheL2G.

The answers to question 5 are as follows:

  1. Credit grades F and G are the highest. Credit grades A and B are the lowest.
  2. They are higher for a person who has a mortgage.
  3. The median value for A is 2,000, and the median value for G is 20,000, so the difference is 18,000.
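
A minimal sketch of the boxplot from steps 3 and 4 follows, together with a per-grade median table that makes differences easier to read off than the plot alone. The data frame dfn_demo is made up, so its medians will not match the answers above. The fill aesthetic is one possible reading of "color differentiate"; the color aesthetic would tint the box outlines instead.

```r
library(ggplot2)

# Made-up stand-in for the cleaned dfn subset
set.seed(7)
dfn_demo <- data.frame(
  home_ownership = sample(c("MORTGAGE", "OWN", "RENT"), 600, replace = TRUE),
  grade          = sample(LETTERS[1:7], 600, replace = TRUE),
  loan_amnt      = round(runif(600, 1000, 35000))
)

# Loan amount versus home ownership, one box per credit grade
p <- ggplot(dfn_demo, aes(x = home_ownership, y = loan_amnt, fill = grade)) +
  geom_boxplot()

# Median loan amount per grade
med_by_grade <- aggregate(loan_amnt ~ grade, data = dfn_demo, FUN = median)
print(med_by_grade)
```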

Activity: Using Themes and Color Differentiation in a Plot

Steps for Completion:

  1. Make a scatterplot of female versus male BMIs.
  2. Build your plot in layers, to avoid creating three separate plots.
    1. Create the default plot. Store this plot as p1.
    2. Points should be differentiated by color. Differentiate the two BMIs by country using color. The size of the points should be 2.
    3. Change the color scheme by using scale_color_brewer. The palette used is Dark2. Store this plot as p2.
    4. Add a plot title: BMI female vs BMI Male.
    5. Change more of the theme's aspects to produce plot p3. The theme aspects to be changed, and their values, are as follows:
      • Panel Background: azure; Color: black
      • No grid lines
      • Axis Title Size: 15; Axis Title Color: cadetblue4
      • Change x and y titles: BMI female and BMI Male
      • Legend: Position bottom, Left justified, No Legend Title, legend key (fill: gray97, color of the line=3)
      • Plot Title Color: cadetblue4; Size: 18; Face: bold.italic

Outcome:

The output code is as follows:

pd1 <- ggplot(df,aes(x=BMI_male,y=BMI_female))
pd2 <- pd1+geom_point()
pd3 <- pd1+geom_point(aes(color=Country),size=2)+
    scale_colour_brewer(palette="Dark2")
pd4 <- pd3+theme(axis.title=element_text(size=15,color="cadetblue4",
                 face="bold"),
                 plot.title=element_text(color="cadetblue4", size=18,
                 face="bold.italic"),
                 panel.background = element_rect(fill="azure",color="black"),
                 panel.grid=element_blank(),
                 legend.position="bottom",
                 legend.justification="left",
                 legend.title = element_blank(),
                 legend.key = element_rect(color=3,fill="gray97")
)+
    xlab("BMI Male")+
    ylab("BMI female")+
    ggtitle("BMI female vs BMI Male")

Chapter 3:  Advanced Geoms and Statistics


The following are the activity solutions for this chapter.

Activity: Using Density Plots to Compare Distributions

Steps for Completion:

  1. Use the RestaurantTips dataset in Lock5Data.
  2. Compare the tip amounts for the various days. Use the color aesthetic in the geom_density command.
  3. Superimpose all of the plots.
  4. Use the scale_x_continuous command for the x-axis tick marks.
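
No solution listing is given for this activity. The steps could be sketched roughly as follows. The data frame tips_demo is made up, and the column names Tip and Day are an assumption about RestaurantTips that should be checked with str() after loading the real dataset.

```r
library(ggplot2)

# Made-up tips; column names Tip and Day are assumed, not confirmed
set.seed(3)
tips_demo <- data.frame(
  Day = rep(c("Mon", "Tue", "Wed", "Thu", "Fri"), each = 30),
  Tip = round(rgamma(150, shape = 4, rate = 1), 2)
)

# Mapping Day to color draws one density curve per day on the same axes
p <- ggplot(tips_demo, aes(x = Tip, color = Day)) +
  geom_density() +
  scale_x_continuous(breaks = seq(0, 16, 2))  # tick marks every 2 units
```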

Activity: Plot the Monthly Closing Stock Prices and the Mean Values

Steps for Completion:

  1. Use the strftime command to get the month from each date and make another variable (Month), as follows:
df_fb$Month <- strftime(df_fb$Date,"%m")
  2. Change the month to a numerical value by using as.numeric:
df_fb$Month <- as.numeric(df_fb$Month)
  3. Now, use ggplot to make a plot of closing prices versus months.
  4. Plot the data using geom_point (color=red).
  5. Change the x scale to show each month, and label the x-axis, such that each month is shown.
  6. Title your plot Monthly closing stock prices: Facebook.
  7. Use geom_line(stat='summary',fun.y=mean) to plot the mean.

Outcome:

The complete code is shown as follows:

# fun.y is deprecated from ggplot2 3.3.0 onward; use fun=mean there
ggplot(df_fb, aes(Month,Close)) +
    geom_point(color="red",alpha=1/2,
               position = position_jitter(h=0.0,w=0.0))+
    geom_line(stat='summary',fun.y=mean, color="blue",size=1)+
    scale_x_continuous(breaks=seq(0,13,1))+
    ggtitle("Monthly Closing Stock Prices: Facebook")+theme_classic()
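
The month extraction and the per-month mean that the summary line traces can be checked by hand on a few rows. The sketch below uses made-up dates and prices in a hypothetical data frame, df_demo, not the Facebook data.

```r
# Made-up dates and closing prices standing in for df_fb
df_demo <- data.frame(
  Date  = as.Date(c("2018-01-05", "2018-01-19", "2018-02-02", "2018-02-16")),
  Close = c(180, 184, 190, 178)
)

# "%m" yields the zero-padded month as text; as.numeric turns "01" into 1
df_demo$Month <- as.numeric(strftime(df_demo$Date, "%m"))

# Mean close per month: the values that the summary line connects
monthly_mean <- tapply(df_demo$Close, df_demo$Month, mean)
print(monthly_mean)
```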

Activity: Creating a Variable-Encoded Regional Map

Steps for Completion:

  1. Merge the USStates data with states_map.
  2. Before merging, change the states variable in USStates to the same format used in states_map.
  3. Use the ggplot options geom_polygon and coord_map to create the map.
  4. For aesthetics, specify x=long, y=lat, group=group, and fill=ObamaVote.

Outcome:

The complete code is shown as follows:

USStates$Statelower <- as.character(tolower(USStates$State))
glimpse(USStates)
us_data <- merge(USStates,states_map,by.x="Statelower",by.y="region")
head(us_data)
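
The listing above stops at the merge. A rough sketch of the remaining steps follows, on toy data so that it runs without the maps package. The data frame states_demo is assumed to have the same columns as states_map (long, lat, group, order, region), and votes_demo is a made-up stand-in for USStates. coord_quickmap() is used here only because it needs no extra package; the activity's coord_map() can be substituted when mapproj is installed.

```r
library(ggplot2)

# Toy polygons: two unit squares standing in for two states
states_demo <- data.frame(
  long   = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat    = c(0, 0, 1, 1,  0, 0, 1, 1),
  group  = rep(c(1, 2), each = 4),
  order  = 1:8,
  region = rep(c("alpha", "beta"), each = 4)
)

# Made-up vote shares with capitalised names, as tolower() above implies
votes_demo <- data.frame(State = c("Alpha", "Beta"), ObamaVote = c(55, 45))

# Match the key format, merge, then restore the polygon drawing order
votes_demo$Statelower <- tolower(votes_demo$State)
map_demo <- merge(votes_demo, states_demo, by.x = "Statelower", by.y = "region")
map_demo <- map_demo[order(map_demo$order), ]  # merge() can reorder rows

p <- ggplot(map_demo, aes(x = long, y = lat, group = group, fill = ObamaVote)) +
  geom_polygon(color = "black") +
  coord_quickmap()
```

Sorting by order after the merge matters because geom_polygon connects vertices in row order, and shuffled rows produce scrambled outlines.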

Activity: Studying Correlated Variables

Steps for Completion:

  1. Make a subset of the loan dataset by using some of the following variables:
df3_1 <- df3[,c("funded_amnt","annual_inc","dti","inq_last_6mths",
                "total_acc","total_pymnt_inv")]
  2. Use cor for the preceding loan data subset, and then choose two highly correlated variables in the loan dataset. Use the following pairs:
total_rec_prncp and total_pymnt_int
funded_amnt and total_pymnt_inv
  3. Make a scatterplot for the preceding pairs for grade A, then fit a linear regression model.
  4. Determine the correlations of the preceding pairs.

Outcome:

Answer to step 4: The correlations are as follows:

  1. 93%
  2. 85%
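
The correlation and the linear fit for one pair can be computed with base R, as sketched below. The numbers are generated at random in a hypothetical data frame, loans_demo, so the correlation it prints is not the 93% or 85% reported above; only the column names come from the activity.

```r
# Made-up loan figures for the pair funded_amnt and total_pymnt_inv
set.seed(11)
funded_amnt     <- round(runif(200, 1000, 35000))
total_pymnt_inv <- funded_amnt * 0.9 + rnorm(200, sd = 3000)
loans_demo <- data.frame(funded_amnt, total_pymnt_inv)

# Pearson correlation of the pair
r <- cor(loans_demo$funded_amnt, loans_demo$total_pymnt_inv)

# Straight-line fit of payments on the funded amount
fit <- lm(total_pymnt_inv ~ funded_amnt, data = loans_demo)
print(round(r, 2))
print(coef(fit))
```

On the real data, the same two calls would be applied to the grade A rows of df3 for each pair.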