
The R Project for Statistical Computing (or just R for short) is a powerful data analysis tool. It is both a programming language and a computational and graphical environment. R is free, open source software made available under the GNU General Public License. It runs on Mac, Windows, and Unix operating systems.


In this article by **John M. Quick**, author of the book Statistical Analysis with R, you will learn how to:

- Organize and clarify your raw R data analyses
- Communicate your raw R data analyses effectively
- Apply the steps common to all well-conducted R analyses

## Statistical Analysis with R


# Retracing and refining a complete analysis

For demonstration purposes, it will be assumed that a fire attack was chosen as the optimal battle strategy. Throughout this segment, we will retrace the steps that led us to this decision. Meanwhile, we will make sure to organize and clarify our analyses so that they can be easily communicated to others.

Suppose we determined our fire attack will take place 225 miles away in Anding, which houses 10,000 Wei soldiers. We will deploy 2,500 soldiers for a period of 7 days and assume that they are able to successfully execute the plans. Let us return to the beginning to develop this strategy with R in a clear and concise manner.

# Time for action – first steps

To begin our analysis, we must first launch R and set our working directory:

- Launch R. The R console will be displayed.
- Set your R working directory using the *setwd(dir)* function. The following code is a hypothetical example. Your working directory should be a relevant location on your own computer.

> #set the R working directory using setwd(dir)

> setwd("/Users/johnmquick/rBeginnersGuide/")

- Verify that your working directory has been set to the proper location using the *getwd()* command:

> #verify the location of your working directory

> getwd()

[1] "/Users/johnmquick/rBeginnersGuide/"

## What just happened?

We prepared R to begin our analysis by launching the software and setting our working directory. At this point, you should be very comfortable completing these steps.

# Time for action – data setup

Next, we need to import our battle data into R and isolate the portion pertaining to past fire attacks:

- Copy the *battleHistory.csv* file into your R working directory. This file contains data from 120 previous battles between the Shu and Wei forces.
- Read the contents of *battleHistory.csv* into an R variable named *battleHistory* using the *read.table(...)* command:

> #read the contents of battleHistory.csv into an R variable

> #battleHistory contains data from 120 previous battles between the Shu and Wei forces

> battleHistory <- read.table("battleHistory.csv", TRUE, ",")

- Create a subset using the *subset(data, ...)* function and save it to a new variable named *subsetFire*:

> #use the subset(data, ...) function to create a subset of the battleHistory dataset that contains data only from battles in which the fire attack strategy was employed

> subsetFire <- subset(battleHistory, battleHistory$Method == "fire")

- Verify the contents of the new subset. Note that the console should return 30 rows, all of which contain *fire* in the *Method* column:

> #display the fire attack data subset

> subsetFire

## What just happened?

We imported our dataset and then created a subset containing our fire attack data. This time, however, we used a slightly different function, *read.table(...)*, to import our external data into R.

## read.table(...)

Up to this point, we have always used the *read.csv()* function to import data into R. However, you should know that there are often many ways to accomplish the same objectives in R. For instance, *read.table(...)* is a generic data import function that can handle a variety of file types. While it accepts several arguments, the following three are required to properly import a CSV file, like the one containing our battle history data:

- *file*: the name of the file to be imported, along with its extension, in quotes
- *header*: whether or not the file contains column headings; *TRUE* for yes, *FALSE* (default) for no
- *sep*: the character used to separate values in the file, in quotes

Using these arguments, we were able to import the data in our *battleHistory.csv* into R. Since our file contained headings, we used a value of *TRUE* for the *header* argument and because it is a comma-separated values file, we used *","* for our *sep* argument:

> battleHistory <- read.table("battleHistory.csv", TRUE, ",")

This is just one example of how a different technique can be used to achieve a similar outcome in R. We will continue to explore new methods in our upcoming activities.
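As a quick illustration of this equivalence, the following sketch writes a small hypothetical CSV file to a temporary location (the file name and contents are invented for demonstration; this is not the book's *battleHistory.csv*) and imports it both ways:

```r
# write a small hypothetical CSV file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("x,y", "4,5", "5,9", "6,12"), tmp)

# read.table(...) with header = TRUE and sep = ","
viaTable <- read.table(tmp, TRUE, ",")

# read.csv(...) supplies header = TRUE and sep = "," by default
viaCsv <- read.csv(tmp)

# both calls produce the same data frame
all.equal(viaTable, viaCsv)
# [1] TRUE
```

Because *read.csv(file)* is simply *read.table(...)* with CSV-friendly defaults, either function is acceptable for files like ours.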

### Pop quiz

- Suppose you wanted to import the following dataset, named *newData*, into R. Which of the following *read.table(...)* functions would be best to use?

4,5
5,9
6,12

  - *read.table("newData", FALSE, ",")*
  - *read.table("newData", TRUE, ",")*
  - *read.table("newData.csv", FALSE, ",")*
  - *read.table("newData.csv", TRUE, ",")*

# Time for action – data exploration

To begin our analysis, we will examine the summary statistics and correlations of our data. These will give us an overview of the data and inform our subsequent analyses:

- Generate a summary of the fire attack subset using *summary(object)*:

> #generate a summary of the fire subset

> summaryFire <- summary(subsetFire)

> #display the summary

> summaryFire

- Before calculating correlations, we will have to convert our nonnumeric data from the *Method*, *SuccessfullyExecuted*, and *Result* columns into numeric form. Recode the *Method* column using *as.numeric(data)*:

> #represent categorical data numerically using as.numeric(data)

> #recode the Method column into Fire = 1

> numericMethodFire <- as.numeric(subsetFire$Method) - 1

- Recode the *SuccessfullyExecuted* column using *as.numeric(data)*:

> #recode the SuccessfullyExecuted column into N = 0 and Y = 1

> numericExecutionFire <- as.numeric(subsetFire$SuccessfullyExecuted) - 1

- Recode the *Result* column using *as.numeric(data)*:

> #recode the Result column into Defeat = 0 and Victory = 1

> numericResultFire <- as.numeric(subsetFire$Result) - 1

- With the *Method*, *SuccessfullyExecuted*, and *Result* columns coded into numeric form, let us now add them back into our fire dataset. Save the data in our recoded variables back into the original dataset:

> #save the data in the numeric Method, SuccessfullyExecuted, and Result columns back into the fire attack dataset

> subsetFire$Method <- numericMethodFire

> subsetFire$SuccessfullyExecuted <- numericExecutionFire

> subsetFire$Result <- numericResultFire

- Display the numeric version of the fire attack subset. Notice that all of the columns now contain numeric data.
- Having replaced our original text values in the *SuccessfullyExecuted* and *Result* columns with numeric data, we can now calculate all of the correlations in the dataset using the *cor(data)* function:

> #use cor(data) to calculate all of the correlations in the fire attack dataset

> cor(subsetFire)

*Note that the error message and NA values in our correlation output result from the fact that our* **Method** *column contains only a single value. This is irrelevant to our analysis and can be ignored.*
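The *as.numeric(data) - 1* recoding trick can be seen in isolation with a small hypothetical example (a made-up vector of Y/N execution outcomes, not the book's data):

```r
# hypothetical two-level factor of execution outcomes
executed <- factor(c("N", "Y", "Y", "N"))

# R stores factor levels as integers starting at 1 ("N" = 1, "Y" = 2),
# so subtracting 1 yields the desired 0/1 coding
numericExecuted <- as.numeric(executed) - 1
numericExecuted
# [1] 0 1 1 0
```

Note that this relies on the ordering of the factor levels (alphabetical by default); it is worth checking *levels(data)* before recoding your own data.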

## What just happened?

Initially, we calculated summary statistics for our fire attack dataset using the *summary(object)* function. From this information, we can derive the following useful insights about our past battles:

- The rating of the Shu army's performance in fire attacks has ranged from 10 to 100, with a mean of 45
- Fire attack plans have been successfully executed 10 out of 30 times (33%)
- Fire attacks have resulted in victory 8 out of 30 times (27%)
- Successfully executed fire attacks have resulted in victory 8 out of 10 times (80%), while unsuccessful attacks have never resulted in victory
- The number of Shu soldiers engaged in fire attacks has ranged from 100 to 10,000 with a mean of 2,052
- The number of Wei soldiers engaged in fire attacks has ranged from 1,500 to 50,000 with a mean of 12,333
- The duration of fire attacks has ranged from 1 to 14 days with a mean of 7

Next, we recoded the text values in our dataset's *Method*, *SuccessfullyExecuted*, and *Result* columns into numeric form. After adding the data from these variables back into our original dataset, we were able to calculate all of its correlations. This allowed us to learn even more about our past battle data:

- The performance rating of a fire attack has been highly correlated with successful execution of the battle plans (0.92) and the battle's result (0.90), but not strongly correlated with the other variables.
- The execution of a fire attack has been moderately negatively correlated with the duration of the attack, such that a longer attack leads to a lesser chance of success (-0.46).
- The numbers of Shu and Wei soldiers engaged are highly correlated with each other (0.74), but not strongly correlated with the other variables.

The insights gleaned from our summary statistics and correlations put us in a prime position to begin developing our regression model.

### Pop quiz

- Which of the following is a benefit of adding a text variable back into its original dataset after it has been recoded into numeric form?
  - Calculation functions can be executed on the recoded variable.
  - Calculation functions can be executed on the other variables in the dataset.
  - Calculation functions can be executed on the entire dataset.
  - There is no benefit.


# Time for action – model development

Let us continue to the most extensive phase of our data analysis, which consists of developing the optimal regression model for our situation. Ultimately, we want to predict the performance rating of the Shu army under potential fire attack strategies. From our previous exploration of the data, we have reason to believe that successful execution greatly influences the outcome of battle. We can also infer that the duration of a battle has some impact on its outcome. At the same time, it appears that the number of soldiers engaged in battle does not have a large impact on the result. However, since the numbers of Shu and Wei soldiers themselves are highly correlated, there is a potential interaction effect between the two that is worth investigating. We will start by using our insights to create a set of potentially useful models:

- Use the *glm(formula, data)* function to create a series of potential linear models that predict the *Rating* of battle (dependent variable) using one or more of the independent variables in our dataset. Then, use the *summary(object)* command to assess the statistical significance of each model:

> #create a linear regression model using the glm(formula, data) function

> #predict the rating of battle using execution

> lmFireRating_Execution <- glm(Rating ~ SuccessfullyExecuted, data = subsetFire)

> #generate a summary of the model

> lmFireRating_Execution_Summary <- summary(lmFireRating_Execution)

> #display the model summary

> lmFireRating_Execution_Summary

> #keep execution in the model as an independent variable

Our first model used only the successful (or unsuccessful) execution of battle plans to predict the performance of the Shu army in a fire attack. Our summary tells us that execution is an important factor to include in the model.

- Now, let us examine the impact that the duration of battle has on our model:

> #predict the rating of battle using execution and duration

> lmFireRating_ExecutionDuration <- glm(Rating ~ SuccessfullyExecuted + DurationInDays, data = subsetFire)

> #generate a summary of the model

> lmFireRating_ExecutionDuration_Summary <- summary(lmFireRating_ExecutionDuration)

> #display the model summary

> lmFireRating_ExecutionDuration_Summary

> #keep duration in the model as an independent variable

This model added the duration of battle to execution as a predictor of the Shu army's rating. Here, we found that duration is also an important predictor that should be included in the model.

- Next, we will inspect the prospects of including the number of Shu and Wei soldiers as predictors in our model:

> #predict the rating of battle using execution, duration, and the number of Shu and Wei soldiers engaged

> lmFireRating_ExecutionDurationSoldiers <- glm(Rating ~ SuccessfullyExecuted + DurationInDays + ShuSoldiersEngaged + WeiSoldiersEngaged, data = subsetFire)

> #generate a summary of the model

> lmFireRating_ExecutionDurationSoldiers_Summary <- summary(lmFireRating_ExecutionDurationSoldiers)

> #display the model summary

> lmFireRating_ExecutionDurationSoldiers_Summary

> #drop the number of Shu and Wei soldiers from the model as independent variables

This time, we added the number of Shu and Wei soldiers into our model, but determined that they were not significant enough predictors of the Shu army's performance. Therefore, we elected to exclude them from our model.

- Lastly, let us investigate the potential interaction effect between the number of Shu and Wei soldiers:

> #investigate a potential interaction effect between the number of Shu and Wei soldiers

> #center each variable by subtracting its mean from each of its values

> centeredShuSoldiersFire <- subsetFire$ShuSoldiersEngaged - mean(subsetFire$ShuSoldiersEngaged)

> centeredWeiSoldiersFire <- subsetFire$WeiSoldiersEngaged - mean(subsetFire$WeiSoldiersEngaged)

> #multiply the two centered variables to create the interaction variable

> interactionSoldiersFire <- centeredShuSoldiersFire * centeredWeiSoldiersFire

> #predict the rating of battle using execution, duration, and the interaction between the number of Shu and Wei soldiers engaged

> lmFireRating_ExecutionDurationShuWeiInteraction <- glm(Rating ~ SuccessfullyExecuted + DurationInDays + interactionSoldiersFire, data = subsetFire)

> #generate a summary of the model

> lmFireRating_ExecutionDurationShuWeiInteraction_Summary <- summary(lmFireRating_ExecutionDurationShuWeiInteraction)

> #display the model summary

> lmFireRating_ExecutionDurationShuWeiInteraction_Summary

> #keep the interaction between the number of Shu and Wei soldiers engaged in the model as an independent variable

We can see that the interaction effect between the number of Shu and Wei soldiers does have a meaningful impact on our model and should be included as an independent variable.

*Note that some statisticians may argue that it is inappropriate to include an interaction variable between the Shu and Wei soldiers in this model, without also including the number of Shu and Wei soldiers alone as variables in the model. In this fictitious example, there is no practically significant difference between these two options, and therefore, the interaction term has been included alone for the sake of simplicity and clarity. However, were you to incorporate interaction effects into your own regression models, you are advised to thoroughly investigate the implications of including or excluding certain variables.*
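The centering and multiplication steps above can be sketched on a few hypothetical troop counts (invented numbers, not the book's dataset):

```r
# hypothetical troop counts for three battles
shu <- c(1000, 2000, 3000)
wei <- c(5000, 10000, 15000)

# center each variable by subtracting its mean from each of its values
centeredShu <- shu - mean(shu)   # -1000, 0, 1000
centeredWei <- wei - mean(wei)   # -5000, 0, 5000

# multiply the centered variables to form the interaction variable
interactionSoldiers <- centeredShu * centeredWei
```

Centering the variables before forming the product term is a common way to reduce the correlation between the interaction variable and its component variables.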

We have identified four potential models. To determine which of these is most appropriate for predicting the outcome of our fire attack, we will use an approach known as **Akaike Information Criterion**, or **AIC**:

> #use the AIC(object, ...) function to compare the models and choose the most appropriate one

> #when comparing via AIC, the lowest value indicates the best statistical model

> AIC(lmFireRating_Execution, lmFireRating_ExecutionDuration, lmFireRating_ExecutionDurationSoldiers, lmFireRating_ExecutionDurationShuWeiInteraction)

> #according to AIC, our model that includes execution, duration, and the interaction effect is best

The AIC procedure revealed that our model containing execution, duration, and the interaction between the number of Shu and Wei soldiers is the best choice for predicting the performance of the Shu army.

## What just happened?

We just completed the process of developing potential regression models and comparing them in order to choose the best one for our analysis. Through this process, we determined that the successful execution, duration, and the interaction between the number of Shu and Wei soldiers engaged were statistically significant independent variables, whereas the number of Shu and Wei soldiers alone were not. By using an AIC test, we were able to determine that the model containing all three statistically significant variables was best for predicting the Shu army's performance in fire attacks. Therefore, our final regression equation is as follows:

Rating = 37 + 56 * execution - 1.24 * duration - 0.00000013 * soldiers interaction

## glm(...)

Each of our models in this article was created using the *glm(formula, data)* function. We used *glm(formula, data)* here to demonstrate an alternative R function for creating regression models. In your own work, the appropriate function will be determined by the requirements of your analysis.

You may also have noticed that our *glm(formula, data)* functions listed only the variable names in the *formula* argument. This is a shorthand method for referring to our dataset's column names, as demonstrated by the following code:

lmFireRating_ExecutionDuration <- glm(Rating ~ SuccessfullyExecuted + DurationInDays, data = subsetFire)

Notice that the *subsetFire$* prefix is absent from each variable name and that the *data* argument has been defined as *subsetFire*. When the *data* argument is used, and the independent variables in the *formula* argument are unique, the *dataset$* prefix may be omitted. This technique has the effect of keeping our code more readable, without changing the results of our calculations.
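To see that the two styles fit the same model, here is a sketch using R's built-in *mtcars* dataset (standing in for the book's *subsetFire* data):

```r
# shorthand: bare column names plus the data argument
modelShort <- glm(mpg ~ wt, data = mtcars)

# long form: each variable spelled out with the dataset$ prefix
modelLong <- glm(mtcars$mpg ~ mtcars$wt)

# the fitted coefficients are the same either way
coef(modelShort)
coef(modelLong)
```

Only the variable labels in the printed output differ; the estimates themselves are identical.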

## AIC(object, ...)

AIC can be used to compare regression models. It yields a series of AIC values, which indicate how well our models fit our data. AIC is used to compare multiple models relative to each other, whereby the model with the lowest AIC value best represents our data.

Similar in structure to the *anova(object, ...)* function, the *AIC(object, ...)* function accepts a series of objects (regression models in our case) as input. For example, in *AIC(A, B, C)* we are telling R to compare three objects *(A, B, and C)* using *AIC*. Thus, our AIC function compared the four regression models that we created:

> AIC(lmFireRating_Execution, lmFireRating_ExecutionDuration, lmFireRating_ExecutionDurationSoldiers, lmFireRating_ExecutionDurationShuWeiInteraction)

As output, *AIC(object, ...)* returned a series of AIC values used to compare our models.

The *glm(...)* function coordinates well with *AIC(object, ...)*, hence our decision to use them together in this example. Again, the appropriate techniques to use in your future analyses should be determined by the specific conditions surrounding your work.
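A minimal sketch of this comparison pattern, again borrowing R's built-in *mtcars* dataset rather than the book's battle data:

```r
# two candidate models of fuel efficiency
modelA <- glm(mpg ~ wt, data = mtcars)
modelB <- glm(mpg ~ wt + hp, data = mtcars)

# AIC(object, ...) returns one row per model; the lowest AIC value
# indicates the best-fitting model of those compared
AIC(modelA, modelB)
```

Remember that AIC values are only meaningful relative to one another, for models fit to the same data.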

### Pop quiz

- When can the *dataset$* prefix be omitted from the variables in the *formula* argument of *lm(formula, data)* and *glm(formula, data)*?
  - When the data argument is defined.
  - When the data argument is defined and all of the variables come from different datasets.
  - When the data argument is defined and all of the variables have unique names.
  - When the data argument is defined, all of the variables come from different datasets, and all of the variables have unique names.

- Which of the following is **not** true of the *anova(object, ...)* and *AIC(object, ...)* functions?
  - Both can be used to compare regression models.
  - Both receive the same arguments.
  - Both represent different statistical methods.
  - Both yield identical mathematical results.


# Time for action – model deployment

Having selected the optimal model for predicting the outcome of our fire attack strategy, it is time to put that model to practical use. We can use it to predict the outcomes of various fire attack strategies and to identify one or more strategies that are likely to lead to victory. Subsequently, we need to ensure that our winning strategies are logistically sound and viable. Once we strike a balance between our designed strategy and our practical constraints, we will arrive at the best course of action for the Shu forces.

We set a rating value of 80 as our minimum threshold. As such, we will only consider a strategy adequate if it yields a rating of 80 or higher when all variables have been entered into our model.

In the case of our fire attack regression model, we know that to achieve our desired rating value, we must assume successful execution. We also know the number of Wei soldiers housed at the target city. Consequently, our major constraints are the number of Shu soldiers that we choose to engage in battle and the duration of the attack. We will assume a moderate attack duration.

Subsequently, we can rearrange our regression equation to solve for the number of Shu soldiers engaged and then represent it as a custom function in R:

- Use the *coef(object)* function to isolate the independent variables in our regression model:

> #use the coef(object) function to extract the coefficients from a regression model

> #this will make it easier to rearrange our equation by allowing us to focus only on these values

> coef(lmFireRating_ExecutionDurationShuWeiInteraction)

- Rewrite the fire attack regression equation to solve for the number of Shu soldiers engaged in battle:

> #rewrite the regression equation to solve for the number of Shu soldiers engaged in battle

> #original equation: rating = 37 + 56 * execution - 1.24 * duration - 0.00000013 * soldiers interaction

> #rearranged equation: Shu soldiers = (rating - 37 - 56 * execution + 1.24 * duration) / (0.00000013 * -Wei soldiers engaged)

- Use the *function()* command to create a custom R function that solves for the number of Shu soldiers engaged in battle, given the desired *rating*, *execution*, *duration*, and number of *WeiSoldiers*:

> #use function() to create a custom function in R

> #the function() command follows this basic format:

> #function(argument1, argument2, ... argumenti) { equation }

> #custom function that solves for the maximum number of Shu soldiers that can be deployed, given the desired rating, execution, duration, and number of Wei soldiers

> functionFireShuSoldiers <- function(rating, execution, duration, WeiSoldiers) {

+ (rating - 37 - 56 * execution + 1.24 * duration) /

+ (0.00000013 * -WeiSoldiers)

+ }

- Use the custom function to solve for the number of Shu soldiers that can be deployed, given a *rating* of 80, *execution* of 1.0, a *duration* of 7, and 10,000 *WeiSoldiers*:

> #solve for the number of Shu soldiers that can be deployed given a rating of 80, execution of 1.0, duration of 7, and 10,000 Wei soldiers

> functionFireShuSoldiers(80, 1.0, 7, 10000)

[1] 3323.077

Our regression model suggests that to achieve a rating of 80, our minimum threshold, we should deploy 3,323 Shu soldiers. However, from looking at the data in our fire attack subset, a force between 2,500 and 5,000 soldiers has not been previously used to launch a fire attack. Further, four past successful fire attacks on 7,500 to 12,000 Wei soldiers have deployed only 1,000 to 2,500 Shu soldiers. What would happen to our predicted rating value if we were to deploy 2,500 Shu soldiers instead of 3,323?

- Create a custom function to solve for the rating of battle when *execution*, *duration*, and the number of *ShuSoldiers* and *WeiSoldiers* are known:

> #custom function that solves for the rating of battle, given the execution, duration, number of Shu soldiers, and number of Wei soldiers

> functionFireRating <- function(execution, duration, ShuSoldiers, WeiSoldiers) {

+ 37 + 56 * execution -

+ 1.24 * duration -

+ 0.00000013 * (ShuSoldiers * WeiSoldiers)

+ }

- Use the custom function to solve for the rating of battle, given successful execution, a 7-day duration, 2,500 Shu soldiers, and 10,000 Wei soldiers:

> #what would happen to our rating value if we were to deploy 2,500 Shu soldiers instead of 3,323?

> functionFireRating(1.0, 7, 2500, 10000)

[1] 81.07

> #is the 1.07 increase in our predicted chances for victory worth the practical benefits derived from deploying 2,500 soldiers?
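As a quick sanity check (our own addition, not a step from the book), the two custom functions should invert one another: feeding the soldier count returned by *functionFireShuSoldiers* back into *functionFireRating* should reproduce the target rating. A self-contained sketch, repeating the two definitions from above:

```r
# solve for the deployable Shu soldiers at a target rating
functionFireShuSoldiers <- function(rating, execution, duration, WeiSoldiers) {
  (rating - 37 - 56 * execution + 1.24 * duration) / (0.00000013 * -WeiSoldiers)
}

# predict the rating from the strategy's parameters
functionFireRating <- function(execution, duration, ShuSoldiers, WeiSoldiers) {
  37 + 56 * execution - 1.24 * duration - 0.00000013 * (ShuSoldiers * WeiSoldiers)
}

# the soldier count implied by a rating of 80 should predict a rating of 80
soldiers <- functionFireShuSoldiers(80, 1.0, 7, 10000)
functionFireRating(1.0, 7, soldiers, 10000)
# [1] 80
```

Round-trip checks like this are a cheap way to catch sign errors when rearranging a regression equation by hand.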

By using 2,500 soldiers, our rating value increased to 81, which is slightly above our threshold of confidence for victory. Here, we have encountered a classic dilemma for the data analyst. On one hand, our data model tells us that it is safe to use 3,323 soldiers. On the other, our knowledge of war strategy and past outcomes tells us that a number between 1,000 and 2,500 would be sufficient. Essentially, we have to identify the practical benefits or detriments of deploying a certain number of soldiers. In this case, we are inclined to think that it is beneficial to deploy fewer than 3,323, but more than 1,000. The exact number is a matter of debate and uncertainty that deserves serious consideration. It is always the strategist's challenge to weigh both the practical and statistical benefits of potential decisions.

On that note, let us consider the logistics of our proposed fire attack. Our plan is to deploy 2,500 Shu soldiers over a period of 7 days to attack 10,000 Wei soldiers who are stationed 225 miles away.

- Create a custom function that calculates the gold cost of our fire attack strategy:

> #custom function that calculates the gold cost of our strategy, given the number of Shu soldiers deployed, the distance of the target city, and the proposed duration of battle

> functionGoldCost <- function(ShuSoldiers, distance, duration) {

+ ShuSoldiers * (distance / 100 + 2 * (duration / 30))

+ }

- Use the custom function to calculate the gold cost of our fire attack strategy:

> #gold cost of fire attack that deploys 2,500 Shu soldiers a distance of 225 miles for a period of 7 days

> functionGoldCost(2500, 225, 7)

[1] 6791.667

- Calculate the number of provisions needed for our fire attack strategy:

> #provisions required by our fire attack strategy

> #consumption per 30 days is equal to the number of soldiers deployed

> 2500 * (7/30)

[1] 583.3333

- Determine whether the fire attack strategy is viable given our resource limitations:

> #our gold cost of 6,792 is well below our allotment of 1,000,000

> #our required provisions of 583 are well below our allotment of 1,000,000

> #our 2,500 soldiers account for only 1.25% of our total army personnel

> #yes, the fire attack strategy is viable given our resource constraints

## What just happened?

We successfully used our optimal regression model to refine our battle strategy and test its viability in light of our practical resource constraints. Custom functions were used to calculate the number of soldiers necessary to yield our desired outcome, the performance rating given the parameters of our plan, and the overall gold cost of our strategy. In determining the number of soldiers to engage in our fire attack, we encountered a common occurrence whereby our data models conflicted with our practical understanding of the world. Subsequently, we had to use our expertise as data analysts to balance the consequences between the two and arrive at a sound conclusion. We then assessed the overall viability of our strategy and determined it to be sufficient in consideration of our resource allotments.

## coef(object)

Prior to rewriting our regression equation and converting it into a custom function, we executed the *coef(object)* command on our model. The *coef(object)* function, when executed on a regression model, has the effect of extracting and displaying its independent variables (or **coefficients**). By isolating these components, we were able to easily visualize our model's equation:

> coef(lmFireRating_ExecutionDurationShuWeiInteraction)

In contrast, the *summary(object)* function contains much more information than we need for this purpose, thus making it potentially confusing and difficult to locate our variables. This can be seen in the following:

> lmFireRating_ExecutionDurationShuWeiInteraction_Summary

Hence, in circumstances where we only care to see the independent variables in our model, the *coef(object)* function can be more effective than *summary(object)*.

### Pop quiz

- Under which of the following circumstances might you use the *coef(object)* function instead of *summary(object)*?
  - You want to know the practical significance of the model's variables.
  - You want to know the statistical significance of the model's variables.
  - You want to know the model's regression equation.
  - You want to know the formula used to generate the model.

# Time for action – last steps

Lastly, we need to save the workspace and console text associated with our fire attack analysis:

- Use the *save.image(file)* function to save your R workspace to your working directory. The *file* argument should contain a meaningful filename and the *.RData* extension:

> #save the R workspace to your working directory

> save.image("rBeginnersGuide_Ch_07_fireAttackAnalysis.RData")

- R will save your workspace file. Browse to the working directory on your hard drive to verify that this file has been created.
- Manually save your R console log by copying and pasting it into a text file. You may then format the console text to improve its readability.

We have now completed an entire data analysis of the fire attack strategy from beginning to end using R.

# The common steps to all R analyses

While retracing the development process behind our fire attack strategy, we encountered a key series of steps that are common to every analysis that you will conduct in R. Regardless of the exact situation or the statistical techniques used, there are certain things that must be done to yield an organized and thorough R analysis. Each of these steps is detailed in the sections that follow.

Perhaps it goes without saying that the thing to do before beginning any R analysis is to launch R itself. Nevertheless, it is mentioned here for completeness and transparency.

## Step 1: Set your working directory

Once R is launched, the first common step is to set your working directory. This can be done using the *setwd(dir)* function and subsequently verified using the *getwd()* command:

> #Step 1: set your working directory

> #set your working directory using setwd(dir)

> #replace the sample location with one that is relevant to you

> setwd("/Users/johnmquick/rBeginnersGuide/")

> #once set, you can verify your new working directory using getwd()

> getwd()

[1] "/Users/johnmquick/rBeginnersGuide/"

### Comment your work

Note that commented lines, which are prefixed with the pound sign (#), appeared before each of our functions in step one. It is vital that you comment all of the actions that you take within the R console. This allows you to refer back to your work later and also makes your code accessible to others.

*This is an opportune time to point out that you can draft your code in other places besides the R console. For example, R has a built-in editor that can be opened by going to the File | New Document/Script menu or by simultaneously pressing the Command + N or Ctrl + N keys. Other free editors can also be found online. The advantages of using an editor are that you can easily modify your code and see different types of code in different colors, which helps you to verify that it is properly constructed. Note, however, that to execute your code, it must be placed in the R console.*

## Step 2: Import your data (or load an existing workspace)

Aft er you set the working directory, it is time to pull your data into R. This can be achieved by creating a new variable in tandem with the *read.csv(file)* command:

> #Step 2: Import data (or load an existing workspace)

> #read a dataset from a csv file into R using read.csv(file) and save it into a new variable

> dataset <- read.csv("datafile.csv")

Alternatively, if you were continuing a prior data analysis rather than starting a new one, you would instead load a previously saved workspace using *load(file)*. You can then verify the contents of your loaded workspace using the *ls()* command.

> #load an existing workspace using load(file)

> load("existingWorkspace.RData")

> #verify the contents of your workspace using ls()

> ls()

[1] "myVariable1" "myVariable2" "myVariable3"

## Step 3: Explore your data

Regardless of the type or amount of data that you have, summary statistics should be generated to explore your data. Summary statistics provide you with a general overview of your data and can reveal overarching patterns, trends, and tendencies across a dataset. Summary statistics include calculations such as means, standard deviations, and ranges, amongst others:

> #Step 3: Explore your data

> #calculate a mean using mean(data)

> mean(myData)

[1] 1000

> #calculate a standard deviation using sd(data)

> sd(myData)

[1] 100

> #calculate a range (minimum and maximum) using range(data)

> range(myData)

[1] 500 2000

Also recall R's *summary(object)* function, which provides summary statistics along with additional vital information. It can be used with almost any object in R and will offer information specifically catered to that object:

> #generate a detailed summary for a given object using summary(object)

> summary(object)
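As a concrete illustration (with an invented variable and values), calling *summary()* on a numeric vector reports its minimum, quartiles, mean, and maximum in a single line:

```r
# hypothetical numeric vector for illustration
myData <- c(500, 800, 1000, 1200, 2000)

# summary() on a numeric vector reports the minimum, quartiles, mean, and maximum
summary(myData)
```

Called on a data frame instead, *summary()* produces the same kind of overview for each column.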

*Note that there are often other ways to make an initial examination of your data in addition to using summary statistics. When appropriate, graphing your data is an excellent way to gain a visual perspective on what it has to say. Furthermore, before conducting an analysis, you will want to ensure that your data are consistent with the assumptions necessitated by your statistical methods. This will prevent you from expending energy on inappropriate techniques and from making invalid conclusions.*
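For instance, a histogram or boxplot provides a quick visual check of a variable's distribution. The data below are hypothetical, included only to show the plotting commands:

```r
# hypothetical data for illustration
myData <- c(500, 700, 900, 1000, 1000, 1100, 1300, 2000)

# a histogram shows the shape of the distribution using hist(data)
hist(myData, main = "Distribution of myData", xlab = "Value")

# a boxplot highlights the median, spread, and potential outliers
boxplot(myData)
```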

## Step 4: Conduct your analysis

Here is where your work will differ from project to project. Depending on the type of analysis you are conducting, you will use a variety of different techniques. The correct techniques will be determined by the circumstances surrounding your work.

> #Step 4: Conduct your analysis

> #The appropriate methods for this step will vary between analyses.
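As one hypothetical illustration (the groups and values are fabricated, and a t-test is only one of many techniques you might apply at this step), you could compare the means of two groups with R's built-in *t.test()* function:

```r
# hypothetical example: two invented groups of measurements
groupA <- c(10, 12, 9, 11, 13, 10)
groupB <- c(14, 15, 13, 16, 14, 15)

# Welch's two-sample t-test compares the group means
result <- t.test(groupA, groupB)

# inspect the p-value to judge whether the difference is statistically significant
result$p.value
```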

## Step 5: Save your workspace and console files

At the conclusion of your analysis, you will always want to save your work. To have the option to revisit and manipulate your R objects from session to session, you will need to save your R workspace using the *save.image(file)* command, as follows:

> #Step 5: Save your workspace and console files

> #save your R workspace using save.image(file)

> #remember to include the .RData file extension

> save.image("myWorkspace.RData")

To save your R console text, which contains the log of every action that you took during a given session, you will need to copy and paste it into a text file. Once copied, the console text can be formatted to improve its readability. For instance, a text file containing the five common steps of every R analysis could take the following form:

> #There are five steps that are common to every data analysis

conducted in R

> #Step 1: set your working directory

> #set your working directory using setwd(dir)

> #replace the sample location with one that is relevant to you

> setwd("/Users/johnmquick/rBeginnersGuide/")

> #once set, you can verify your new working directory using getwd()

> getwd()

[1] "/Users/johnmquick/rBeginnersGuide/"

> #Step 2: Import data (or load an existing workspace)

> #read a dataset from a csv file into R using read.csv(file) and save it into a new variable

> dataset <- read.csv("datafile.csv")

> #OR

> #load an existing workspace using load(file)

> load("existingWorkspace.RData")

> #verify the contents of your workspace using ls()

> ls()

[1] "myVariable1" "myVariable2" "myVariable3"

> #Step 3: Explore your data

> #calculate a mean using mean(data)

> mean(myData)

[1] 1000

> #calculate a standard deviation using sd(data)

> sd(myData)

[1] 100

> #calculate a range (minimum and maximum) using range(data)

> range(myData)

[1] 500 2000

> #generate a detailed summary for a given object using summary(object)

> summary(object)

> #Step 4: Conduct your analysis

> #The appropriate methods for this step will vary between analyses.

> #Step 5: Save your workspace and console files

> #save your R workspace using save.image(file)

> #remember to include the .RData file extension

> save.image("myWorkspace.RData")

> #save your R console text by copying it and pasting it into a text file.

### Pop quiz

Which of the following is *not* a benefit of commenting your code?

- It makes your code readable and organized.
- It makes your code accessible to others.
- It makes it easier for you to return to and recall your past work.
- It makes the analysis process faster.

# Summary

In this article, we conducted an entire data analysis in R from beginning to end. While doing so, we ensured that our work was as organized and transparent as possible, thereby making it more accessible to others. Afterwards, we identified the five steps that are common to all well-executed data analyses in R. You then used these steps to conduct, organize, and refine a battle strategy for the Shu army. Having completed this article, you should now be able to:

- Organize and clarify your raw R data analyses
- Communicate your raw R data analyses effectively
- Apply the steps common to all well-conducted R analyses


## About the Author

## John M. Quick

John M. Quick is an Educational Technology doctoral student at Arizona State University who is interested in the design, research, and use of educational innovations. Currently, his work focuses on mobile, game-based, and global learning, interactive mixed-reality systems, and innovation adoption. John's blog, which provides articles, tutorials, reviews, perspectives, and news relevant to technology and education, is available from http://www.johnmquick.com. In his spare time, John enjoys photography, nature, and travel.