This section contains the worked-out answers for the activities in each lesson. Note that for descriptive questions, your answers might not match the ones provided in this section exactly. As long as the essence of the answers remains the same, you can consider them correct.
You're reading from Applied Data Visualization with R and ggplot2
The following are the activity solutions for this chapter.
Steps for Completion:
- Use the template code Lesson1_student.R.
Note
This is an empty script in which the libraries are already loaded. You will write your code here.
- Load the dataset temperature.csv from the directory data.
- Create the histogram for two cities (Vancouver and Miami) by using the command discussed previously.
- Once the histogram is ready, run the code.
- Analyze the two histograms by giving three points for each histogram, and two points of difference between the two.
Outcome:
Two histograms should be created and compared. The complete code is as follows:
df_t <- read.csv("data/historical-hourly-weather-data/temperature.csv")
ggplot(df_t,aes(x=Vancouver))+geom_histogram()
ggplot(df_t,aes(x=Miami))+geom_histogram()
Steps for Completion:
- Load the given dataset, xAPI-Edu-Data.csv, and investigate it by using the appropriate commands.
- Decide which visualizations to use for the given variables: Topic, gender, and VisitedResources.
- Create one-dimensional visualizations and explain why you chose that type of visual (one per variable). Provide one point of observation for each visualization.
- Create two-dimensional boxplots or scatterplots for VisitedResources versus Topic, VisitedResources versus AnnouncementsView, and Discussion versus Gender. What are your observations? Write at least five points.
Outcome:
Three one-dimensional plots and three two-dimensional plots should be created, with the following axes (count versus topic) and observations. (Note that the students may provide different observations, so the instructor should verify the answers.)
The complete code is as follows:
df_edu <- read.csv("data/xAPI-Edu-Data.csv")
str(df_edu)
#Functions for Plotting a barchart/Histogram
plotbar <- function(df,mytxt) {
  ggplot(df,aes_string(x=mytxt)) + geom_bar()
}
plothist <- function(df,mytxt) {
  ggplot(df,aes_string(x=mytxt)) + geom_histogram()
}
#Alternatively one can use a function to plot but students can just
#do it directly at this point.
#1-D Plots
plotbar(df_edu,"Topic")
plotbar(df_edu,"gender")
plotbar(df_edu,"ParentschoolSatisfaction")
plothist(df_edu,"VisitedResources")
#2-D Plots
ggplot(df_edu,aes(x=Topic,y=VisitedResources)) + geom_boxplot()
ggplot(df_edu,aes(x=AnnouncementsView,y=VisitedResources)) + geom_point()
ggplot(df_edu,aes(x=gender,y=Discussion)) + geom_boxplot()
Steps for Completion:
- Use the basic ggplot commands to create two of the plots from Activity B (Topic and VisitedResources).
- Use the Grammar of Graphics to improve your graphics by layering upon the base graphic. The graph should follow these guidelines:
  - Histograms should be rebinned.
  - Change the fill colors of one- and two-dimensional objects. The line colors should be black.
  - Add a title to the graph.
  - Apply the appropriate font sizes and colors to the x- and y-axes.
Outcome:
The complete code is as follows:
p1 <- ggplot(df_edu,aes(x=Topic))
p2 <- ggplot(df_edu,aes(x=VisitedResources))
p1 + geom_bar(color=1,fill=3) + ylab("Count")+
  theme(axis.text.y=element_text(size=10),
        axis.text.x=element_text(size = 10),
        axis.title.x=element_text(size=15,color=4),
        axis.title.y=element_text(size=15,color=4))+
  ggtitle("Topics in Education data")
p2 + geom_histogram(bins=20,fill="white",color=1)+
  ggtitle("Visited Resources for Education data")+
  xlab("Visited Resources")+
  theme(axis.text.x=element_text(size = 12),
        axis.text.y=element_text(size=12),
        axis.title.x=element_text(size=15,color=4),
        axis.title.y=element_text(size=15,color=4))
The following are the activity solutions for this chapter.
Steps for Completion:
- Use the commands that we just explored to create the scatterplot.
- For this activity, you will use the gapminder dataset.
- You can use the help command to explore the options.
- To change scales, you will have to use one of the preceding label formats.
- Use labels=scales::unit_format(unit="K", scale=1e-3) for labeling.
Outcome:
The output code is as follows:
ggplot(df, aes(x=gdp_per_capita,y=Electricity_consumption_per_capita))+
  geom_point()+
  # Named arguments keep unit_format() valid across scales versions
  scale_x_continuous(name="GDP",breaks = seq(0,50000,5000),
                     labels=scales::unit_format(unit="K", scale=1e-3)) +
  scale_y_continuous(name="Electricity Consumption",
                     breaks = seq(0,20000,2000),
                     labels=scales::unit_format(unit="K", scale=1e-3))
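The label format can be tried on its own before it is attached to a scale. The following is a minimal sketch, assuming the scales package is installed; the break values are made up for illustration and are not taken from the gapminder data.

```r
library(scales)

# Build a labeller that multiplies by 1e-3 and appends the unit "K".
# Named arguments are used because the positional order of
# unit_format() differs between scales versions.
k_label <- unit_format(unit = "K", scale = 1e-3)

# Hypothetical axis breaks, similar to those produced by seq() above
breaks <- seq(0, 20000, 5000)
labels <- k_label(breaks)
print(labels)
```

Printing the labels first makes it easy to check that the scale factor and unit are the intended ones before the plot is drawn.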
Steps for Completion:
- Use the loan data and plot a histogram (use fill color=cadetblue4 and bins=10).
- Use facet_wrap() to plot the loan data for the different credit grades.
- Now, you will need to change the default options for facet_wrap, in order to produce the following plots. Use ?facet_wrap on the command line to view the options that can be changed.
Outcome:
Refer to the complete code at the following path: https://goo.gl/RheL2G. The answers to the questions are given here:
- scales="free_y".
- A, B, and C have maximum loan amounts below 10,000. (A, B, C, and D is also an acceptable answer.)
- F and G show uniform distributions.
- No, none of the distributions are normally distributed.
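The effect of freeing the y-axis can be seen on a small example. This is only a sketch: the data frame df_loans is a hypothetical stand-in for the loan data, with made-up values and only two grades, and the column names loan_amnt and grade are assumed to match the real dataset.

```r
library(ggplot2)

# Hypothetical stand-in for the loan data: two credit grades with
# very different numbers of loans, so the panel counts differ widely.
set.seed(1)
df_loans <- data.frame(
  loan_amnt = c(runif(500, 1000, 10000), runif(40, 1000, 35000)),
  grade     = c(rep("A", 500), rep("G", 40))
)

# One histogram panel per grade; scales = "free_y" lets each panel
# choose its own y-axis range while the x-axis stays shared.
p <- ggplot(df_loans, aes(x = loan_amnt)) +
  geom_histogram(bins = 10, fill = "cadetblue4") +
  facet_wrap(~grade, scales = "free_y")
print(p)
```

With the default fixed scales, the sparsely populated grade would be flattened against the x-axis, which makes the shape of its distribution hard to judge.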
Steps for Completion:
- Use the LoanStats dataset and make a subset using the following variables:
dfn <- df3[,c("home_ownership","loan_amnt","grade")]
- Clean the dataset (removing the NONE and NA cases), using the following code:
dfn <- na.omit(dfn)
dfn <- subset (dfn, !dfn$home_ownership %in% c("NONE"))
- Create a boxplot showing the loan amount versus home ownership.
- Color differentiate by credit grade.
Outcome:
Refer to the following URL for the output: https://goo.gl/RheL2G.
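The cleaning and plotting steps can be sketched end to end on a toy data frame. The rows below are invented for illustration, and mapping grade to the fill aesthetic is one assumed way to "color differentiate"; the code behind the URL above may differ.

```r
library(ggplot2)

# Hypothetical miniature of the subset dfn described in the steps
dfn <- data.frame(
  home_ownership = c("RENT", "MORTGAGE", "NONE", "OWN", "RENT", "MORTGAGE"),
  loan_amnt      = c(5000, 20000, 3000, NA, 8000, 15000),
  grade          = c("A", "B", "A", "C", "B", "A")
)

# Drop rows with missing values, then rows whose ownership is "NONE"
dfn <- na.omit(dfn)
dfn <- subset(dfn, !home_ownership %in% c("NONE"))

# Loan amount versus home ownership, with boxes filled by credit grade
p <- ggplot(dfn, aes(x = home_ownership, y = loan_amnt, fill = grade)) +
  geom_boxplot()
print(p)
```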
The answers to question 5 are as follows:
- Credit grades F and G are the highest. Credit grades A and B are the lowest.
- They are higher for a person who has a mortgage.
- The median value for A is 2,000, and the median value for G is 20,000, so the difference is 18,000.
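The difference between two group medians is a plain subtraction: 20,000 minus 2,000 gives 18,000. A minimal sketch of the same calculation in base R follows; the loan amounts are hypothetical and were chosen only so that the medians equal the values quoted in the answer.

```r
# Hypothetical loan amounts for two credit grades, chosen so that the
# medians match the values quoted in the answer (2,000 and 20,000)
loans <- data.frame(
  grade     = c("A", "A", "A", "G", "G", "G"),
  loan_amnt = c(1000, 2000, 6000, 15000, 20000, 30000)
)

# Median loan amount per grade
med <- tapply(loans$loan_amnt, loans$grade, median)

# Difference between the G and A medians
med_diff <- med[["G"]] - med[["A"]]
print(med_diff)  # 18000
```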
Steps for Completion:
- Make a scatterplot of female versus male BMIs.
- Build your plot in layers, to avoid creating three separate plots.
- Create the default plot. Store this plot as p1.
- Points should be differentiated by color. Differentiate the two BMIs by country using color. The size of the points should be 2.
- Change the color scheme by using scale_color_brewer. The palette used is Dark2. Store this plot as p2.
- Add a plot title: BMI female vs BMI Male.
- Change more of the theme's aspects to produce plot p3. The theme aspects to be changed, and their values, are as follows:
  - Panel Background: azure; Color: black
  - No grid lines
  - Axis Title Size: 15; Axis Title Color: cadetblue4
  - Change x and y titles: BMI Male (x) and BMI female (y)
  - Legend: Position bottom, Left justified, No Legend Title, legend key (fill=gray97, color of the line=3)
  - Plot Title Color: cadetblue4; Size: 18; Face: bold.italic
Outcome:
The output code is as follows:
pd1 <- ggplot(df,aes(x=BMI_male,y=BMI_female))
pd2 <- pd1+geom_point()
pd3 <- pd1+geom_point(aes(color=Country),size=2)+
  scale_colour_brewer(palette="Dark2")
pd4 <- pd3+theme(axis.title=element_text(size=15,color="cadetblue4",
                                         face="bold"),
                 plot.title=element_text(color="cadetblue4",
                                         size=18,
                                         face="bold.italic"),
                 panel.background = element_rect(fill="azure",color="black"),
                 panel.grid=element_blank(),
                 legend.position="bottom",
                 legend.justification="left",
                 legend.title = element_blank(),
                 legend.key = element_rect(color=3,fill="gray97")
                 )+
  xlab("BMI Male")+
  ylab("BMI female")+
  ggtitle("BMI female vs BMI Male")
The following are the activity solutions for this chapter.
Steps for Completion:
- Use the RestaurantTips dataset in Lock5Data.
- Compare the TIP amount for various days. Use the color aesthetic with the geom_density command.
- Superimpose all of the plots.
- Use the scale_x_continuous command for the x-axis tick marks.
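No solution code is listed for this activity, so the following is only a sketch of one way to follow the steps. The data frame tips is a hypothetical stand-in with made-up values, and the column names Day and Tip are assumptions about the RestaurantTips dataset.

```r
library(ggplot2)

# Hypothetical stand-in for RestaurantTips with a Day and a Tip column
set.seed(42)
tips <- data.frame(
  Day = rep(c("Mon", "Wed", "Fri"), each = 30),
  Tip = c(rnorm(30, 3, 1), rnorm(30, 4, 1.5), rnorm(30, 5, 2))
)

# Mapping color to Day draws one density curve per day on shared axes,
# which superimposes the distributions in a single panel.
p <- ggplot(tips, aes(x = Tip)) +
  geom_density(aes(color = Day)) +
  scale_x_continuous(breaks = seq(0, 12, 2))
print(p)
```

The breaks passed to scale_x_continuous are arbitrary here; with the real data they would be chosen to cover the observed range of tip amounts.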
Steps for Completion:
- Use the strftime command to get the month from each date and make another variable (Month), as follows:
df_fb$Month <- strftime(df_fb$Date,"%m")
- Change the month to a numerical value by using as.numeric:
df_fb$Month <- as.numeric(df_fb$Month)
- Now, use ggplot to make a plot of closing prices versus months.
- Plot the data using geom_point(color="red").
- Change the x scale to show each month, and label the x-axis, such that each month is shown.
- Title your plot Monthly closing stock prices: Facebook.
- Use geom_line(stat='summary',fun.y=mean) to plot the mean.
Outcome:
The complete code is shown as follows:
ggplot(df_fb, aes(Month,Close)) +
  geom_point(color="red",alpha=1/2,position = position_jitter(h=0.0,w=0.0))+
  # fun.y is named fun in ggplot2 3.3.0 and later
  geom_line(stat='summary',fun.y=mean, color="blue",size=1)+
  scale_x_continuous(breaks=seq(0,13,1))+
  ggtitle("Monthly Closing Stock Prices: Facebook")+theme_classic()
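The month extraction and the per-month mean that the summary line draws can be checked in base R without plotting. This is a minimal sketch; the dates and closing prices are hypothetical values, not actual Facebook stock data.

```r
# Hypothetical closing prices on made-up dates
df_fb <- data.frame(
  Date  = as.Date(c("2017-01-03", "2017-01-17", "2017-02-01", "2017-02-15")),
  Close = c(116, 120, 133, 135)
)

# "%m" returns the month as a zero-padded string such as "01"
df_fb$Month <- strftime(df_fb$Date, "%m")
df_fb$Month <- as.numeric(df_fb$Month)

# Mean closing price per month: the values that the summary line connects
monthly_mean <- tapply(df_fb$Close, df_fb$Month, mean)
print(monthly_mean)
```

Converting Month to a number matters because scale_x_continuous expects a continuous variable; the character strings returned by strftime would be treated as discrete.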
Steps for Completion:
- Merge the USStates data with states_map.
- Before merging, change the State variable in USStates to the same format used in states_map.
- Use the ggplot options geom_polygon and coord_map to create the map.
- For the aesthetics, specify x=long, y=lat, group=group, and fill=ObamaVote.
Outcome:
The complete code is shown as follows:
USStates$Statelower <- as.character(tolower(USStates$State))
glimpse(USStates)
us_data <- merge(USStates,states_map,by.x="Statelower",by.y="region")
head(us_data)
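The listing stops after the merge, so the plotting step is sketched here under stated assumptions. Both data frames below are hypothetical stand-ins: states_map holds two square "states" with the long, lat, group, order, and region columns, and USStates holds made-up ObamaVote values. The re-sorting step is an addition that is not part of the listing.

```r
library(ggplot2)

# Hypothetical stand-in for states_map: two square "states" described
# by the long, lat, group, order, and region columns
states_map <- data.frame(
  long   = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat    = c(0, 0, 1, 1,  0, 0, 1, 1),
  group  = rep(c(1, 2), each = 4),
  order  = rep(1:4, times = 2),
  region = rep(c("alpha", "beta"), each = 4)
)

# Hypothetical stand-in for USStates with capitalized state names
USStates <- data.frame(State = c("Alpha", "Beta"), ObamaVote = c(45, 60))

# Lowercase the key so that it matches the region column, then merge
USStates$Statelower <- tolower(USStates$State)
us_data <- merge(USStates, states_map, by.x = "Statelower", by.y = "region")

# merge() may reorder rows; restore the polygon drawing order
us_data <- us_data[order(us_data$group, us_data$order), ]

# Filled polygons; with real longitude/latitude data, coord_map()
# would be added to apply a map projection
p <- ggplot(us_data, aes(x = long, y = lat, group = group, fill = ObamaVote)) +
  geom_polygon()
print(p)
```

The group aesthetic tells geom_polygon which vertices belong to the same outline, so each state is drawn as its own closed shape.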
Steps for Completion:
- Make a subset of the loan dataset by using some of the following variables:
df3_1 <- df3[,c("funded_amnt","annual_inc","dti","inq_last_6mths", "total_acc","total_pymnt_inv")]
- Use cor for the preceding loan data subset, and then choose two highly correlated variables in the loan dataset. Use pairs, as follows:
total_rec_prncp and total_pymnt_int
funded_amnt and total_pymnt_inv
- Make a scatterplot for the preceding pairs for grade A, then fit a linear regression model.
- Determine the correlations of the preceding pairs.
Outcome:
Answer to step 4: The correlations are as follows:
- 93%
- 85%
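The correlation and regression steps can be sketched in base R. The data below is simulated for illustration, so the resulting correlation will not match the percentages above; only the column names funded_amnt and total_pymnt_inv come from the activity.

```r
# Hypothetical loan records; the column names follow the subset above
set.seed(7)
funded_amnt     <- seq(1000, 20000, by = 1000)
total_pymnt_inv <- 1.1 * funded_amnt + rnorm(20, sd = 1500)
loans_a <- data.frame(funded_amnt, total_pymnt_inv)

# Pearson correlation between the two variables
r <- cor(loans_a$funded_amnt, loans_a$total_pymnt_inv)

# Simple linear regression of payments on the funded amount
fit <- lm(total_pymnt_inv ~ funded_amnt, data = loans_a)

print(r)
print(coef(fit))
```

For a regression with a single predictor, the squared correlation equals the model's R-squared, which gives a quick consistency check between the two steps.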